CascadeFlow Helped After Gemini Failed Four Times

#ai #api #gemini #llm

How CascadeFlow Saved the Project After Everything Else Failed
There's a specific kind of exhaustion that hits when you're staring at a backend error at 11 p.m. and it says something like "All models failed to analyze." It's not dramatic. It's just flat. And somehow that makes it worse.
That was the moment Refyn either had to become a real product or get shelved as a promising prototype. Gemini had already burned through four API keys across three Google accounts, every single one bouncing back as invalid. Groq stepped in and worked fine — until the model we'd been testing with quietly got decommissioned. OpenRouter looked like a solid backup until half the free models we'd chosen started returning 404s because the provider had rotated them out. And just to round things off, there was a stretch where nothing worked at all because the .env file was sitting in the wrong folder.
Not one clean failure. Just a pile of small, boring ones stacking up.
Refyn only got interesting once we stopped trying to fight that reality and started building around it.
What We Were Building
Refyn is a browser-based code review workspace. Visually it borrows from VS Code — Monaco editor in the center, results panel on the right, output at the bottom, model controls and file navigation on the sides. The stack is React, Vite, Tailwind, and Framer Motion on the frontend, with Node and Express on the backend. You paste or write code, analyze it, run it, spot issues, and apply fixes.
But the actual goal was never "another code reviewer."
The idea was a reviewer that gets smarter over time. Not just sharper prompts. Not just a model picker. Something that actually remembers what you tend to miss — and doesn't spend the same amount of money reviewing a three-line utility function as it would on auth code touching your database.
That idea eventually became two concrete systems.
The first is Hindsight — it stores recurring review patterns from past sessions and pulls them into future ones, so the agent isn't starting cold every single time.
The second is CascadeFlow — it scores the complexity of your code before touching any model, routes simple code to the fast, cheap path, and automatically escalates anything riskier.
Neither of these came from a planning doc. Both came from things that kept breaking.
The Model Reliability Problem
The original codebase we built on top of had a straightforward waterfall approach: try one provider, hope it works, fall back if it doesn't. That sounds reasonable until providers start behaving the way they do in practice.
Gemini was supposed to be the primary model. It never really got there. After four API keys from three separate Google accounts, including freshly generated ones, every attempt came back invalid. We never figured out the actual cause. At some point debugging Gemini started feeling less like engineering and more like superstition, so we cut it entirely.
Groq took over as the primary path. That went better, until llama3-70b-8192 got decommissioned mid-session and we had to scramble to swap in llama-3.3-70b-versatile. The current service code is basically a record of that moment.
It was right around then that CascadeFlow stopped being a nice-to-have optimization and became something closer to infrastructure. Before making any model call, Refyn now scores the code — factoring in length, imports, async usage, and anything security-sensitive like eval, jwt, auth, crypto, SQL references, or .env access. That score determines the route. Simple code goes to Groq. More complex or risky code escalates to Mixtral. If the chosen model fails, the router walks a fallback chain through Groq, Mixtral, OpenRouter, and finally Ollama.
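A minimal sketch of that scoring step, assuming hypothetical weights, regexes, and an escalation threshold. The post names the signals (length, imports, async usage, security-sensitive references) but not the numbers, so every constant and function name below is illustrative rather than Refyn's actual code.

```javascript
// Security-sensitive signals named in the post. The regexes are rough heuristics.
const SENSITIVE_PATTERNS = [
  /\beval\s*\(/,
  /\bjwt\b/i,
  /\bauth/i,
  /\bcrypto\b/i,
  /\b(select|insert|update|delete)\b[\s\S]*\b(from|into|set)\b/i,
  /\.env\b/,
];

// Hypothetical threshold: scores at or above this escalate to the stronger model.
const ESCALATION_THRESHOLD = 3;

function scoreComplexity(code) {
  const lines = code.split("\n").length;
  // Count JS/Python import statements and CommonJS require() calls.
  const importPattern = /^\s*(?:import\b|from\s+\S+\s+import\b)|\brequire\s*\(/gm;
  const imports = (code.match(importPattern) || []).length;
  const usesAsync = /\b(async|await)\b|\.then\s*\(/.test(code);
  const sensitiveHits = SENSITIVE_PATTERNS.filter((re) => re.test(code)).length;

  // Hypothetical weights: length and imports are capped, sensitive hits dominate.
  const score =
    Math.min(lines / 50, 3) +
    Math.min(imports * 0.5, 2) +
    (usesAsync ? 1 : 0) +
    sensitiveHits * 2;

  return { score, lines, imports, usesAsync, sensitiveHits };
}

function chooseRoute(code) {
  const details = scoreComplexity(code);
  const escalate = details.score >= ESCALATION_THRESHOLD;
  return {
    provider: escalate ? "mixtral" : "groq",
    complexityScore: details.score,
    routingReason: escalate
      ? `score ${details.score.toFixed(2)} with ${details.sensitiveHits} sensitive signal(s)`
      : `score ${details.score.toFixed(2)} is below the escalation threshold`,
  };
}
```

With these weights, a one-line arrow function stays on the cheap path, while a short login handler that imports a JWT library, awaits a SQL query, and reads process.env escalates on the sensitive signals alone.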
Once you've watched a provider disappear mid-build or a free model list go stale underneath you, "intelligent routing" stops sounding like a product feature and starts sounding like basic self-defense.
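The fallback chain described in this section could look roughly like the sketch below. The provider order comes from the post. The function name, the shape of the providers map, and the choice to try the routed provider first and then the remaining providers in chain order are assumptions made for illustration.

```javascript
// Chain order as described in the post.
const FALLBACK_ORDER = ["groq", "mixtral", "openrouter", "ollama"];

// `providers` maps a provider name to an async function (prompt) => text.
// These functions are stand-ins; real ones would wrap each provider's API.
async function analyzeWithFallback(preferred, providers, prompt) {
  // Assumption: the routed provider goes first, then the rest of the chain in order.
  const order = [preferred, ...FALLBACK_ORDER.filter((name) => name !== preferred)];
  const errors = [];

  for (const name of order) {
    const call = providers[name];
    if (!call) continue; // provider not configured, skip it

    try {
      const text = await call(prompt);
      return { usedModel: name, text, attempts: errors.length + 1 };
    } catch (err) {
      // Record the failure and move on to the next provider.
      errors.push(`${name}: ${err.message}`);
    }
  }

  throw new Error(`All models failed to analyze (${errors.join("; ")})`);
}
```

Collecting the per-provider error messages means the final error says which provider failed and why, instead of only reporting that everything failed.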
The Stateless Review Problem
This one was quieter, but more important in the long run.
Most AI review tools have no memory. Every session starts from scratch, so nothing learned in one review carries into the next. If you consistently miss null checks in async code, or keep stumbling on auth handling, the tool doesn't notice. It just hands you the same isolated comment again, like it's meeting you for the first time.
That felt wrong for something meant to be a workspace. Good reviewers remember your habits.
So we built Hindsight. The first version failed in an avoidable way — memoryService.js was trying to call a /memories endpoint that didn't exist. It should have been using the official SDK from the start. We rewrote it around @vectorize-io/hindsight-client, with recall() before analysis and retain() after.
One line in particular matters: results?.results. We initially parsed the SDK response like a plain array. It wasn't one: the recalled items sit under a results property on the response object. After the review completes, Refyn extracts patterns from the analysis and saves them back. Before the next review, aiRouter.js loads that memory and injects it as prompt context, so the model can lead with what the developer tends to miss rather than pretending the last session never happened.
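A hedged sketch of that recall-then-retain loop. The client is passed in as a plain object with recall() and retain() methods. The argument shapes, the text field on recalled items, and the patterns field on the analysis are assumptions for illustration, not the documented signatures of @vectorize-io/hindsight-client.

```javascript
// Accept either a bare array or an object carrying the list under `results`.
function normalizeRecall(response) {
  if (Array.isArray(response)) return response;
  return response?.results ?? [];
}

// `client` is any object with async recall() and retain() (hypothetical shapes).
// `analyze` is an async function (code, memoryContext) => analysis.
async function reviewWithMemory(client, userId, code, analyze) {
  // 1. Recall before analysis.
  const recalled = normalizeRecall(
    await client.recall(userId, "recurring review patterns")
  );

  // 2. Inject recalled patterns as prompt context.
  const memoryContext = recalled
    .map((item) => `- ${typeof item === "string" ? item : item.text}`)
    .join("\n");
  const analysis = await analyze(code, memoryContext);

  // 3. Retain newly extracted patterns after analysis.
  for (const pattern of analysis.patterns ?? []) {
    await client.retain(userId, pattern);
  }

  return { analysis, recalledCount: recalled.length };
}
```

Routing every response through normalizeRecall keeps the shape assumption in one place, so a change in the response format needs a fix in only one function.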
The moment this felt real wasn't in backend logs. It was when the memory panel in the UI started actually showing patterns. We had one session where the navbar count climbed to 14 patterns — then dropped to zero on refresh, because the count was living only in React state. The fix was simple: load GET /api/memory/:userId on mount. But it mattered. If memory disappears when you refresh, nobody trusts that the memory feature is real.
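The refresh fix amounts to asking the backend for the count instead of trusting component state. A minimal sketch follows, assuming the endpoint returns JSON with a patterns array. That response shape and the helper name are guesses; only the GET /api/memory/:userId route comes from the post.

```javascript
// Fetch the persisted pattern count for a user.
// `fetchImpl` is injectable so the helper can run outside the browser.
async function fetchPatternCount(userId, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(`/api/memory/${encodeURIComponent(userId)}`);
  if (!res.ok) return 0; // treat a failed load as "no patterns yet"

  const body = await res.json();
  // Assumed response shape: { patterns: [...] }
  return Array.isArray(body.patterns) ? body.patterns.length : 0;
}

// In a React component this would typically run once on mount, e.g.:
//   useEffect(() => {
//     fetchPatternCount(userId).then(setPatternCount);
//   }, [userId]);
```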
The Execution Engine
Same theme here — the original solution worked until it had to work in real life.
The codebase we started from used Judge0 for code execution. Judge0 is fine, but running it locally meant Docker, and the hosted version pushed toward cost and setup overhead we didn't want. That made it a bad fit for a fast feedback loop.
So we flipped the order. Execution now goes Piston first, then Judge0, then a local runner as a last resort. Piston is free, requires no API key, supports a wide range of languages, and doesn't drag Docker into local setup. Judge0 stays as a fallback. Local execution is the emergency option for Python and JavaScript.
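The ordering above can be expressed as a small lookup that decides which runners to try for a given language. This is a sketch under assumptions: the runner names and the supports() checks are illustrative, and treating Piston and Judge0 as accepting every language simplifies "a wide range of languages."

```javascript
// Runners in priority order: cheapest and simplest first.
const RUNNERS = [
  { name: "piston", supports: () => true },
  { name: "judge0", supports: () => true },
  // The local runner is the emergency option, limited to Python and JavaScript.
  { name: "local", supports: (lang) => ["python", "javascript"].includes(lang) },
];

// Return the ordered list of runner names to try for a language.
function executionPlan(language) {
  const lang = language.toLowerCase();
  return RUNNERS.filter((runner) => runner.supports(lang)).map((runner) => runner.name);
}
```

For example, executionPlan("python") yields all three runners in order, while executionPlan("rust") stops at Judge0 because the local runner does not handle it.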
That one decision removed a lot of friction immediately. It also fits the broader logic of how Refyn works: cheapest, simplest path first — heavier fallback only when you actually need it.
What It Looks Like Now
The thing that's genuinely satisfying about Refyn at this point is that it doesn't hide what it's doing.
You paste code into the editor, hit Analyze, and the full chain runs. The backend routes through CascadeFlow. The response comes back with usedModel, latency, cost, savings, complexityScore, and routingReason. The frontend puts that in the stats bar. You can see exactly which path your review took.
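To make those fields concrete, here is one way the metadata could be assembled. The field names come from the post. The per-token prices, the millisecond latency unit, and the definition of savings as the difference from always using the escalated model are assumptions for illustration only.

```javascript
// Hypothetical prices per 1K tokens, not real provider pricing.
const PRICE_PER_1K_TOKENS = { groq: 0.0005, mixtral: 0.002 };

// Assumption: savings are measured against always using the escalated model.
const BASELINE_MODEL = "mixtral";

function buildReviewMeta({ usedModel, tokens, startedAt, finishedAt, complexityScore, routingReason }) {
  const cost = (tokens / 1000) * PRICE_PER_1K_TOKENS[usedModel];
  const baselineCost = (tokens / 1000) * PRICE_PER_1K_TOKENS[BASELINE_MODEL];

  return {
    usedModel,
    latency: finishedAt - startedAt, // milliseconds, if timestamps are Date.now() values
    cost,
    savings: Math.max(baselineCost - cost, 0),
    complexityScore,
    routingReason,
  };
}
```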
The memory loop works too. The analysis hook sends a userId from local storage, the backend loads memory before running the model, and the results sidebar shows recalled patterns under "Your Patterns." The navbar reloads the pattern count on mount, which fixed the refresh problem.
Execution runs through the fallback chain. The output panel shows runtime errors, highlights the failing line when it can parse one, and offers a "Fix with AI" button directly from the error state.
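Finding the failing line is mostly pattern matching on stderr. The sketch below assumes single-file programs and the standard Python traceback and Node.js stack formats. The function name is hypothetical, and a real implementation would likely need per-language handling beyond these two.

```javascript
// Return the 1-based line number of the failure, or null if none can be parsed.
function findFailingLine(stderr) {
  // Python traceback frames look like: File "main.py", line 3, in <module>
  const pyFrames = [...stderr.matchAll(/File "[^"]+", line (\d+)/g)];
  if (pyFrames.length > 0) {
    // The last frame is the innermost call, closest to the actual error.
    return Number(pyFrames[pyFrames.length - 1][1]);
  }

  // Node.js prints "/path/main.js:3" above the source excerpt, and stack
  // frames like "at fn (/path/main.js:3:7)". The first match is the top frame.
  const jsFrame = stderr.match(/\.(?:js|mjs|cjs):(\d+)(?::\d+)?/);
  if (jsFrame) return Number(jsFrame[1]);

  return null;
}
```

Returning null when nothing matches lets the output panel show the error without a highlight instead of pointing at the wrong line.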
Smart Fix needed its own cleanup after Gemini got removed — it was still conceptually tied to that dead path and had to be rerouted through OpenRouter first, then Ollama as a final fallback.
The full flow is coherent now: analyze, remember, route, execute, patch. And it doesn't fall apart the moment one provider has a bad day.
What's Next
The most obvious next step is GitHub PR integration. Refyn works best as an interactive workspace right now, but the routing and memory systems would be considerably more useful if they could follow code into pull requests automatically.
After that, team memory. Individual pattern memory is already useful. Shared memory across a team would be more interesting — recurring security issues, project-specific conventions, the kind of institutional knowledge every codebase builds up whether or not anyone writes it down.
Third is budget enforcement. Refyn already calculates cost and savings per review. The next step is letting teams set actual review budgets, surface warnings on expensive routes, and treat cost as a first-class concern rather than a number that shows up in a stats bar and gets ignored.
And the boring infrastructure stuff. Provider churn, env file mistakes, response shape mismatches, Windows shell differences. Those aren't side quests. They're the product.
Closing Thought
Building Refyn made something clear: the bugs aren't separate from the architecture. They are the architecture, if you're paying attention.
Gemini failing pushed us to stop depending on a single provider. Groq changing underneath us forced better routing. The Hindsight 404s forced us to use the real SDK. The memory count resetting on refresh forced us to make persistence something you can actually see. Judge0 friction forced a simpler execution path.
None of those failures were interesting. Most were mildly embarrassing.
But that's why the product is better. We didn't design it around a perfect demo. We designed it around the things that actually broke.
