Vishal Somaraju

How CascadeFlow Cut Our Review Cost Without Hurting Quality

Most AI dev tools have the same bad habit: send everything to the biggest model available and hope the bill feels reasonable later.
That works for demos. It doesn't work when you're actually building something.
A five-line hello world should not cost the same as reviewing auth middleware touching JWTs, SQL, async flows, and environment variables. But a lot of tools treat every snippet like a production incident. Same model, same cost, every time — regardless of what the code actually is.
That was the problem we kept running into while building Refyn. We were working under real time pressure, and the fastest way to burn both budget and reliability was dumb routing. We'd already watched Gemini fail completely — four supposedly valid API keys from three different Google accounts, all returning "API key not valid." To this day we don't know why. Then a Groq model got decommissioned mid-testing, mid-session, with a 400 error and no warning. After that, "just send everything to one model" stopped sounding simple. It started sounding fragile.
CascadeFlow became our answer: score the code first, then decide how much model you actually need.

What Refyn Is :
Refyn is a browser-based, VS Code-style code review workspace: React, Vite, Monaco Editor, Tailwind, and Framer Motion on the frontend, Node.js and Express on the backend.
You paste code, hit Analyze, get structured issues back with severity tags, a quality score, and a Smart Fix button. Standard stuff. The part that isn't standard is what happens before the model call.
Refyn has two layers most review tools skip entirely.
The first is Hindsight memory. It remembers your recurring patterns across sessions and injects that history into future prompts. By your third review, it's leading with your known weak spots instead of treating you like a stranger.
The second is CascadeFlow-style routing. Before any model runs, Refyn scores the code on a 0–100 complexity scale. Simple code takes the cheap fast path. Security-sensitive or structurally dense code escalates automatically.
Both of these came out of necessity, not ambition. Gemini was removed entirely after it became unusable. Groq became primary. OpenRouter became the backup layer with four free model fallbacks. Piston replaced Judge0 for execution because Docker overhead was too heavy locally and too expensive in production. The architecture we shipped was shaped by what kept breaking.

The Routing Problem :
The math is simple: code varies widely in how hard it is to review, but most LLM integrations price every snippet as if it were equally hard.
If you always call the expensive path, costs drift upward even when quality doesn't improve. A tiny utility function doesn't need the same review budget as an authentication service. But without routing logic, it gets the same one anyway.
Refyn's cost model is explicit in cascadeService.js. Groq runs at roughly $0.00002 per 1K tokens. The premium baseline for savings comparison is $0.000075 per 1K tokens. On a 100-token review:

Premium baseline: 100 / 1000 * 0.000075 = $0.0000075
Groq route: 100 / 1000 * 0.00002 = $0.000002
Savings: ~73%
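Here's a minimal sketch of that arithmetic. It is not the actual cascadeService.js code: the function and constant names are made up for illustration, and only the two per-1K-token rates come from the figures above.

```javascript
// Rates quoted above, in USD per 1K tokens.
const GROQ_RATE_PER_1K = 0.00002;
const PREMIUM_RATE_PER_1K = 0.000075;

// Cost of one call at a given per-1K-token rate.
function estimateCost(tokens, ratePer1K) {
  return (tokens / 1000) * ratePer1K;
}

// Percentage saved by the cheap route relative to the premium baseline.
function savingsPercent(tokens, cheapRate = GROQ_RATE_PER_1K, premiumRate = PREMIUM_RATE_PER_1K) {
  const cheap = estimateCost(tokens, cheapRate);
  const premium = estimateCost(tokens, premiumRate);
  if (premium === 0) return 0; // avoid dividing by zero on an empty review
  return (1 - cheap / premium) * 100;
}

console.log(savingsPercent(100).toFixed(1)); // "73.3"
```

With fixed rates, the savings percentage doesn't depend on the token count. The tokens cancel out, so a 100-token review and a 5,000-token review save the same share.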

That's why the stats bar matters. Cost control becomes visible instead of theoretical. You're not trusting a dashboard you check monthly — you see it on every single review.
The reliability half of the problem is messier. During our build, one Groq model name worked fine, then returned a 400 saying it had been decommissioned. OpenRouter's first free model list started returning 404s because the free tier had quietly rotated. If your routing assumes providers stay stable, your app breaks the moment they act like providers.

How cascadeService.js Works :
The scorer doesn't try to be academically perfect. It tries to be fast, explainable, and useful enough to make better runtime decisions.
```js
export const scoreComplexity = (code, language) => {
  let score = 0;
  const lines = code.split("\n").length;

  // Length tiers: longer files start from a higher base score.
  if (lines > 200) score += 30;
  else if (lines > 100) score += 20;
  else if (lines > 50) score += 10;
  else score += 5;

  // Each security-sensitive pattern that matches adds 8 points.
  const securityPatterns = [
    /eval\s*\(/,
    /exec\s*\(/,
    /subprocess/,
    /os\.system/,
    /child_process/,
    /dangerouslySetInnerHTML/,
    /innerHTML\s*=/,
    /SQL|mysql|postgres|sqlite/i,
    /password|secret|token|apikey|api_key/i,
    /\.env/,
    /crypto\./,
    /jwt/i,
    /auth/i,
  ];
  const securityHits = securityPatterns.filter((p) => p.test(code)).length;
  score += securityHits * 8;

  // ... the remaining signals are omitted from this excerpt
  return Math.min(score, 100); // final score is capped at 100
};
```
It starts with length — a 220-line file is a different review problem than a 12-line snippet. Then it adds heavier weight for risky patterns: eval, SQL, auth, JWT, crypto, secrets, environment handling. Short code with security signals still escalates. That's intentional.
After that it adds signals for async density, Promise usage, classes, try/catch depth, recursion hints, and import count. Final score capped at 100.
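As a rough illustration of what those extra signals could look like, here's a sketch. Every regex and weight in it is an assumption, since this post only names the signals and not their values, and `scoreStructuralSignals` and `capScore` are hypothetical names. Recursion hints are left out to keep it short.

```javascript
// Hypothetical patterns and weights: illustrative only, not Refyn's real values.
function scoreStructuralSignals(code) {
  let score = 0;
  // Count all matches of a global regex in the code.
  const count = (pattern) => (code.match(pattern) || []).length;

  // Async density: more async/await keywords means harder control flow.
  const asyncCount = count(/\basync\b|\bawait\b/g);
  if (asyncCount > 5) score += 10;
  else if (asyncCount > 0) score += 5;

  // Explicit Promise usage and class definitions each add a small bump.
  if (/\bPromise\b/.test(code)) score += 5;
  if (/\bclass\s+\w+/.test(code)) score += 5;

  // try blocks: 3 points each, limited to 9 so error handling alone can't dominate.
  score += Math.min(count(/\btry\s*\{/g) * 3, 9);

  // Import count: ES module imports plus CommonJS require calls.
  const importCount = count(/^\s*import\s/gm) + count(/\brequire\s*\(/g);
  if (importCount > 5) score += 5;

  return score;
}

// Combine with a base score (length + security) and cap the total at 100.
function capScore(baseScore, code) {
  return Math.min(baseScore + scoreStructuralSignals(code), 100);
}
```

The cap matters more than it looks: with 13 security patterns at 8 points each, the security signals alone can pass 100, so anything downstream should treat 100 as "definitely escalate" and not as a precise measurement.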
The routing tiers are intentionally blunt:
```js
if (complexityScore >= 60) {
  model = availableModels.includes("mixtral") ? "mixtral" : "groq";
} else if (complexityScore >= 30) {
  model = availableModels.includes("groq") ? "groq" : "openrouter";
} else {
  model = availableModels.includes("groq")
    ? "groq"
    : availableModels.includes("mixtral")
      ? "mixtral"
      : "openrouter";
}
```
Scores below 30 go to Groq: fast, cheap, good enough. Scores from 30 to 59 also prefer Groq for the speed/cost balance, with OpenRouter as the alternative. Scores of 60 and above, which is where long or security-heavy code lands, escalate to Mixtral when it's available.
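To see how the tiers behave when a provider is missing, here is the same branching wrapped in a function with a few sample inputs. `selectModel` is a hypothetical name used only for this sketch; the branches mirror the excerpt above.

```javascript
// Hypothetical wrapper around the tier logic shown above.
function selectModel(complexityScore, availableModels) {
  if (complexityScore >= 60) {
    return availableModels.includes("mixtral") ? "mixtral" : "groq";
  }
  if (complexityScore >= 30) {
    return availableModels.includes("groq") ? "groq" : "openrouter";
  }
  if (availableModels.includes("groq")) return "groq";
  if (availableModels.includes("mixtral")) return "mixtral";
  return "openrouter";
}

console.log(selectModel(75, ["groq", "mixtral"]));       // "mixtral"
console.log(selectModel(75, ["groq"]));                  // "groq"
console.log(selectModel(45, ["mixtral", "openrouter"])); // "openrouter"
console.log(selectModel(10, ["mixtral"]));               // "mixtral"
```

One thing this makes visible: a score of 60 or more with no Mixtral always answers "groq", even if Groq isn't in the available list either. The tier logic alone can't guarantee the chosen model works, which is why a runtime fallback is still needed.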
Then aiRouter.js handles the actual runtime fallback:
```js
const FALLBACK_ORDER = ["groq", "mixtral", "openrouter", "ollama"];

for (const model of FALLBACK_ORDER) {
  if (model === routing.model) continue; // the routed model is not retried here
  try {
    // The call arguments were lost from this excerpt; `code` is an assumed placeholder.
    const result = await MODEL_FN[model](code);
    if (result.success) {
      return enrichResult(result, model, startTime, code, memoryData, {
        ...routing,
        model,
        reason: `${routing.model} unavailable — fell back to ${model}`,
      });
    }
  } catch (e) {
    // Log and move on to the next provider in the chain.
    console.error(`[Cascade] ${model} threw:`, e.message);
  }
}
```
That fallback chain exists because the real world is messy. We needed it after Gemini became unusable, after Groq changed model availability mid-session, and after one genuinely embarrassing moment where the backend returned "All models failed to analyze" because our .env file was sitting in the wrong folder. We spent a solid chunk of debugging time blaming providers before realizing config was the actual bug. Classic.
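A startup check would have caught that config bug before any provider got blamed. Here's a minimal sketch, assuming the keys live in environment variables. The variable names `GROQ_API_KEY` and `OPENROUTER_API_KEY` are placeholders, not necessarily what Refyn reads.

```javascript
// Placeholder names: substitute whatever keys the backend actually reads.
const REQUIRED_ENV = ["GROQ_API_KEY", "OPENROUTER_API_KEY"];

// Return the names of required variables that are unset or blank.
function findMissingEnv(env, required = REQUIRED_ENV) {
  return required.filter((name) => !env[name] || env[name].trim() === "");
}

// Fail fast at startup with a message that points at config, not providers.
function assertEnv(env = process.env) {
  const missing = findMissingEnv(env);
  if (missing.length > 0) {
    throw new Error(
      `Missing environment variables: ${missing.join(", ")}. ` +
        "Check where the .env file lives before debugging the providers."
    );
  }
}
```

Calling `assertEnv()` once before the server starts listening turns a vague "All models failed" into an error that names the missing key.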
When OpenRouter takes over, it cycles through a list of free models:
```js
const FREE_MODELS = [
  "openrouter/free",
  "meta-llama/llama-3.3-70b-instruct:free",
  "qwen/qwen3-coder:free",
  "google/gemma-4-31b-it:free",
];
```
That list exists because our first version had models that had already rotated out of the free tier. Hardcoded model names age fast. Lesson learned the annoying way.
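The cycling itself can be sketched as a loop that treats every failure, whether a thrown error or an unsuccessful response, as a reason to try the next model. This is an illustration and not Refyn's actual code: `tryFreeModels` and the `callModel` callback are hypothetical, and `callModel` stands in for whatever function really sends the request to OpenRouter.

```javascript
// Try each model in order and return the first successful result.
// `callModel(model)` is a hypothetical async function that resolves to
// { success: boolean, ... } or throws (for example on a 404 for a rotated model).
async function tryFreeModels(models, callModel) {
  const errors = [];
  for (const model of models) {
    try {
      const result = await callModel(model);
      if (result && result.success) {
        return { ...result, model }; // record which model actually answered
      }
      errors.push({ model, error: (result && result.error) || "unsuccessful response" });
    } catch (e) {
      errors.push({ model, error: e.message });
    }
  }
  // Nothing worked: return every failure so the logs show the whole chain.
  return { success: false, errors };
}
```

Collecting the per-model errors is the useful part. When a free model rotates out, the log shows which name failed and why, so a stale list doesn't look like a total provider outage.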

The Demo Moment :
The best part of Refyn is that the routing is visible.
Paste a small utility function and the stats bar shows:
Model: Groq | Complexity: 12/100 | Cost: $0.000002 | Saved: 72% vs Mixtral | Latency: 340ms
That one line explains the product better than anything else we built.
Then paste something security-sensitive — auth middleware, token parsing, a SQL query builder, anything touching .env or jwt. The complexity score jumps. The selected model changes without you touching a dropdown. The UI makes the decision legible: this review was escalated because the code looked riskier.
That transparency was the thing we kept coming back to. We'd already lived through the opposite — a system that looked smart in screenshots but hid all the runtime decisions. Once model, cost, latency, and savings were visible on every review, the routing story stopped being abstract.

What Hindsight Adds :
CascadeFlow decides which model to use. Hindsight decides what that model should remember.
In aiRouter.js, Refyn loads memory before analysis, builds a context string from past patterns, passes that into the selected model, and saves new patterns after a successful review. Memory is part of the runtime path, not a side feature.
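That load, inject, analyze, save loop can be sketched roughly like this. It's an illustration under assumptions and not the real aiRouter.js: the function names, the injected `deps` object, and the `patterns` field on the result are all hypothetical.

```javascript
// Turn remembered patterns into a prompt-ready context string.
function buildMemoryContext(patterns) {
  if (!patterns.length) return "";
  return "Known recurring issues for this user:\n" + patterns.map((p) => `- ${p}`).join("\n");
}

// Hypothetical flow: deps supplies loadMemory, analyze, and saveMemory,
// so the memory store and the model call can be swapped or stubbed.
async function reviewWithMemory(code, deps) {
  const patterns = await deps.loadMemory();          // 1. load memory before analysis
  const context = buildMemoryContext(patterns);      // 2. build the context string
  const result = await deps.analyze(code, context);  // 3. pass it to the selected model
  if (result.success) {
    await deps.saveMemory(result.patterns || []);    // 4. save only after a successful review
  }
  return result;
}
```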
We earned that the hard way. Our first memoryService.js called a /memories endpoint that didn't exist, so every request failed silently with a 404. The fix was switching to the official @vectorize-io/hindsight-client SDK and using recall() and retain() properly. Then we hit the second bug: the SDK returns { results: [...] }, not a plain array, so our parsing was quietly wrong the whole time.
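A defensive way to handle that response shape is to normalize it before anything else touches it. This is a sketch: `extractMemories` is a hypothetical helper, and the only thing taken from the SDK is the `{ results: [...] }` wrapper described above.

```javascript
// Accept either a plain array or a { results: [...] } wrapper and
// always hand back an array, so callers never iterate over an object by mistake.
function extractMemories(response) {
  if (Array.isArray(response)) return response;
  if (response && Array.isArray(response.results)) return response.results;
  return []; // unknown shape: treat as "no memories" instead of crashing
}
```

Returning an empty array for unknown shapes keeps the review path alive, though logging that case would make the next silent parsing bug easier to spot.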
Once that was fixed, the memory panel started populating correctly. Pattern count climbed to 14 across sessions. Then we noticed it reset to zero on page refresh — React state only, not reloaded from Hindsight. Fixed with a useEffect that calls the backend on mount. Small bug, big trust issue. If memory disappears on refresh, it feels fake even when it isn't.

What We'd Build Next :
GitHub PR integration — Refyn works well as a paste-and-review workspace. The natural next step is routing decisions happening file by file inside pull requests instead of one snippet at a time.
Team memory banks — Individual memory is already useful. Shared memory across a team means recurring security issues, preferred patterns, and project-specific conventions survive across contributors, not just sessions.
Budget enforcement — Refyn already calculates cost and savings. Hard limits, escalation warnings, and cost caps per session are the next layer that makes routing operational rather than just visible.

Closing :
Building Refyn taught me that AI cost control is mostly a routing problem disguised as a model problem.
We didn't win by finding one perfect provider. Gemini failed. Groq changed underneath us. OpenRouter free models rotated. Our own .env setup broke everything once. What held up was a system that could score the work, choose the cheapest reasonable path, and recover when something went wrong.
That's the part I trust now. Not the model names. The routing.