DEV Community

Sai Srujan

How Hindsight Turned Refyn From a Tool Into a Workspace

The Problem That Kept Bugging Me

Most AI code review tools have a memory problem, and once you notice it, it's hard to ignore.

You open a pull request, paste in some code, and get a comment: "check for null before accessing this value." Fine. Then two days later, different file, same comment. Then again the following week. The tool isn't wrong, exactly — it's just got no sense of pattern. It has no idea this is the sixth time you've made basically the same mistake.

At some point that stops feeling like assistance and starts feeling like autocomplete dressed up in a code review costume.

That gap is what I kept circling back to while building Refyn. A good human reviewer remembers how you work. They notice you keep forgetting async error handling in Node, or that your auth edge cases are always a little under-validated, or that your Python scripts could really use better secret handling. Most AI review tools don't do any of that. Every session starts from zero, like the tool's never met you before.

So we built memory into the review loop itself — not chat history, not a bigger prompt stuffed with context, but actual memory that persists across sessions.

So What Is Refyn, Exactly

Refyn is a browser-based code review workspace, built on React, Vite, Monaco Editor, Tailwind, Framer Motion, Node.js, and Express. It's meant to feel less like a chatbot and more like a lightweight VS Code tab — you paste or write code, run a review, dig into the issues it finds, and check stats like which model handled it, latency, cost, and how much you saved.

The whole idea is that Refyn should act more like an environment you work in than a tool you ping once and forget.

Two things make that possible.

The first is CascadeFlow, our routing layer. It looks at how complex a piece of code is and decides which model should handle it — simple stuff goes down the cheap, fast path, while anything security-sensitive or structurally dense gets bumped up to something stronger.

The second is Hindsight memory. After a review finishes, Refyn pulls out recurring issue patterns and stores them. Before the next review starts, it goes and recalls those patterns, then feeds them into the prompt as developer history.

So the tool isn't just looking at the code sitting in front of it. It's looking at that code in the context of how you tend to write code. That's really the moment this stopped being "a smart feature" and started being a workspace.

The Amnesia Problem

Here's the thing about most AI review tools right now: they still behave like stateless request handlers.

You send code in, they send feedback back, the session ends, and whatever the tool picked up about you just evaporates. That feels normal because plenty of AI products have trained us to expect it. But it doesn't really match how feedback works between actual humans. A reviewer you've worked with for a while builds a mental model of you. They know you overuse shared mutable state, or that you sometimes skip edge-case validation, or that your async cleanup is hit-or-miss. Their reviews get sharper over time because they've stopped treating every diff like it dropped in from a stranger.

A stateless tool can't do that, no matter how good its model is.

If you keep missing async null checks, it'll keep flagging the same symptom forever, never once connecting the dots into "hey, you do this a lot." If your code keeps leaking secrets through sloppy config handling, it'll point at today's instance without ever saying "pay extra attention here, this is a pattern for you."

This became really obvious to us while we were actually building Refyn, under deadline pressure, fixing real breakages as they came up. The same handful of categories kept showing up over and over: environment misconfiguration, dead model paths, bad assumptions about how some external API behaved, state that looked fine until you refreshed the page. These weren't one-off bugs. They were patterns repeating themselves. Once I started seeing it that way, memory stopped being a nice-to-have and started feeling necessary.

How Hindsight Actually Works Inside Refyn

The memory logic lives in backend/services/memoryService.js. And honestly, we got this wrong the first time around.

Our initial attempt just called a /memories endpoint directly — which, it turns out, doesn't exist. That gave us a string of 404s and wasted a chunk of time before we finally slowed down and actually read the Hindsight docs properly. The fix was dropping the raw HTTP calls entirely and switching over to the official @vectorize-io/hindsight-client SDK.

The service kicks off by spinning up a singleton client:

```javascript
import { HindsightClient } from "@vectorize-io/hindsight-client";

const MEMORY_BANK = "default";
const MAX_MEMORIES_TO_INJECT = 5;

// Lazily created singleton, so every request reuses one client.
let _client = null;

const getClient = () => {
  if (_client) return _client;
  const apiKey = process.env.HINDSIGHT_API_KEY;
  if (!apiKey) {
    // No key: return null so callers can skip memory instead of crashing.
    console.warn("[Memory] HINDSIGHT_API_KEY not set; memory disabled");
    return null;
  }
  _client = new HindsightClient({
    baseUrl: "https://api.hindsight.vectorize.io",
    apiKey,
  });
  return _client;
};
```

Simple enough on its face, but it matters in practice — if the API key's missing for whatever reason, memory just quietly turns itself off instead of taking the whole review pipeline down with it.

The actual flow runs in two phases.

Before analysis even starts, Refyn pulls up whatever patterns exist for that user:

```javascript
export const loadMemory = async (userId) => {
  if (!userId) return { memories: [], sessionCount: 0 };
  const client = getClient();
  if (!client) return { memories: [], sessionCount: 0 };

  try {
    const results = await client.recall(
      MEMORY_BANK,
      `code review patterns and issues for developer ${userId}`,
    );

    // The response wraps matches in an object: { results: [...] }.
    const rawResults = results?.results || [];

    // Entries may be plain strings or objects; normalize them to text.
    const memories = rawResults
      .map((r) => (typeof r === "string" ? r : r.content || r.text || ""))
      .filter(Boolean);

    return { memories, sessionCount: memories.length };
  } catch (err) {
    // Recall failures are non-fatal: fall back to a review without memory.
    return { memories: [], sessionCount: 0 };
  }
};
```

The annoying bug here was the shape of the response. We'd assumed recall would hand back a plain array, but the matches were actually wrapped in an object under a results field, so our parsing silently produced an empty list. Reading results?.results (with an empty-array fallback) fixed it. It was one of those bugs that never throws an error; it just quietly gives you nothing.

After a review finishes, Refyn extracts the patterns and writes them back:

```javascript
export const saveMemory = async (userId, analysisData, language) => {
  if (!userId || !analysisData) return;
  const client = getClient();
  if (!client) return;

  const patterns = extractPatterns(analysisData, language, userId);
  if (patterns.length === 0) return;

  try {
    const memoryText = patterns.join(" | ");
    await client.retain(MEMORY_BANK, memoryText);
  } catch (err) {
    console.warn("[Memory] Retain failed (non-fatal):", err.message);
  }
};
```

extractPatterns() groups issues by category and turns them into short memory entries — things like recurring security issues, or the same kind of language-specific mistake showing up again. Then buildMemoryContext() takes the most relevant handful of those and injects them into the next prompt as developer history.
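To make those two helpers a bit more concrete, here's a rough sketch of what they might look like. Their real bodies aren't shown in this post, so the issue shape (an issues array with a category field), the wording of each entry, and the idea that recalled memories arrive already ordered by relevance are all assumptions for illustration.

```javascript
// Hypothetical sketch: not Refyn's actual implementation. The analysisData
// shape and the entry wording are assumptions made for this example.
const MAX_MEMORIES_TO_INJECT = 5;

// Group issues by category and emit one short memory entry per category.
const extractPatterns = (analysisData, language, userId) => {
  const issues = Array.isArray(analysisData?.issues) ? analysisData.issues : [];
  const counts = new Map();
  for (const issue of issues) {
    const category = issue.category || "general";
    counts.set(category, (counts.get(category) || 0) + 1);
  }
  return [...counts.entries()].map(
    ([category, count]) =>
      `recurring ${category} issues in ${language} flagged ${count}x (developer ${userId})`
  );
};

// Turn recalled memories into a "developer history" block for the prompt.
// Assumes the memories are already sorted with the most relevant first.
const buildMemoryContext = (memories) => {
  if (!memories || memories.length === 0) return "";
  const lines = memories.slice(0, MAX_MEMORIES_TO_INJECT).map((m) => `- ${m}`);
  return ["Developer history (from past reviews):", ...lines].join("\n");
};

const patterns = extractPatterns(
  { issues: [{ category: "security" }, { category: "security" }, { category: "async" }] },
  "Python",
  "user-42"
);
console.log(buildMemoryContext(patterns));
```

Returning an empty string when there's no history keeps the session-one prompt identical to a plain, memory-free review.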

That prompt enrichment step is really the whole point of this. Memory's only worth anything if the next review can actually use it.

The handoff between memory and analysis happens over in backend/services/aiRouter.js. The router loads memory before it ever calls a model, builds a context string out of it, hands that off to whichever analyzer it picked, and then saves fresh memory once the review succeeds. So memory isn't bolted onto the side of the pipeline — it's sitting right in the middle of it.
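Here's roughly what that ordering looks like in code. This is a sketch, not the actual aiRouter.js: reviewWithMemory and pickAnalyzer are made-up names, and the dependencies are passed in as arguments purely so the example can run on its own.

```javascript
// Hypothetical sketch of the router flow described above. Only loadMemory,
// buildMemoryContext, and saveMemory are names from the post; everything
// else is invented for illustration.
const reviewWithMemory = async (deps, { userId, code, language }) => {
  const { loadMemory, buildMemoryContext, pickAnalyzer, saveMemory } = deps;

  // 1. Recall stored patterns before any model is called.
  const { memories } = await loadMemory(userId);

  // 2. Build the developer-history string for the prompt.
  const memoryContext = buildMemoryContext(memories);

  // 3. Hand the code plus context to whichever analyzer was picked.
  const analyze = pickAnalyzer(code, language);
  const analysisData = await analyze({ code, language, memoryContext });

  // 4. Save fresh memory only after the review succeeded. A failure here
  //    is logged and swallowed so it cannot fail the review itself.
  try {
    await saveMemory(userId, analysisData, language);
  } catch (err) {
    console.warn("[Memory] save failed (non-fatal):", err.message);
  }

  return analysisData;
};
```

If the analyzer throws, step 4 never runs, so a failed review can't write half-baked patterns back into memory.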

Watching It Play Out Over a Few Sessions

The easiest way to actually get a feel for this is to walk through three sessions back to back.

Session one looks like any other decent AI reviewer. You submit your code, get a solid review back, but there's no history to draw on yet. The memory panel's empty. Pattern count's at zero. The feedback is generic, mostly because there's nothing yet that would make it otherwise.

Session two is where it starts getting interesting. Say the first review flagged weak auth validation and sloppy secret handling in some Python code. Once that review wraps up, those patterns get saved. Come back for another round, and the memory panel might now say something like "recurring security issues in Python flagged 2x." The review is still anchored to whatever code you're looking at right now, but it's no longer working blind.

By session three, the tone shifts again. Instead of just reacting to whatever's new, Refyn can open with your known weak spots before getting into the fresh issues. That's a lot closer to how a real reviewer actually works — it's not reciting the same static checklist, it's saying "you tend to trip up here, so I checked there first."

We also had to chase down one annoying frontend bug to make this whole thing convincing. The pattern count in the navbar was living only in React state, so it looked great mid-session — climbed up to 14 patterns at one point — but reset right back to zero on refresh. The memory itself was fine, sitting safely in Hindsight the whole time; the UI just wasn't bothering to reload it. Fixed it with a useEffect that hits the backend on mount and pulls the real number back in. Small bug, but it's the kind of thing that quietly wrecks trust — if your memory looks like it vanished, people assume it was never real to begin with.
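For anyone hitting the same refresh bug, the fix boils down to something like the sketch below. The endpoint path and the patternCount field are placeholders (the real route isn't shown in this post), and the fetch function is passed in so the helper can run outside a browser.

```javascript
// Hypothetical sketch: the URL and response field are assumptions.
const fetchPatternCount = async (userId, fetchImpl = globalThis.fetch) => {
  try {
    const res = await fetchImpl(`/api/memory/${encodeURIComponent(userId)}/count`);
    if (!res.ok) return 0;
    const data = await res.json();
    return Number.isInteger(data.patternCount) ? data.patternCount : 0;
  } catch {
    // Network failure: show zero rather than crash the navbar.
    return 0;
  }
};

// Inside the navbar component, the effect would look something like:
//
//   useEffect(() => {
//     let cancelled = false;
//     fetchPatternCount(userId).then((count) => {
//       if (!cancelled) setPatternCount(count);
//     });
//     return () => { cancelled = true; };
//   }, [userId]);
```

The cancelled flag is just a guard against setting state after the component has unmounted.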

Where CascadeFlow Fits In

Hindsight handles the memory side, but it's a lot more useful because CascadeFlow is sitting underneath it.

Refyn doesn't just remember your habits and then run every review through the same model regardless. It scores the code's complexity first and routes accordingly — simple stuff takes the fast, cheap path, while anything security-sensitive or structurally heavy gets escalated.

That matters for memory because your remembered weak spots are worth the most when they're paired with the right depth of review. If Refyn already knows you've got a history of auth mistakes, and the snippet in front of it touches JWT handling, SQL access, or .env files, the routing layer can push that review onto a stronger model instead of treating it like just another trivial snippet.

End result: memory isn't only personalization here. It feeds straight into how the review gets prioritized. Refyn remembers what you tend to miss, then uses runtime routing to actually spend its budget where that history says it matters.
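A toy version of that routing decision might look like the sketch below. To be clear, this is not CascadeFlow's real scoring: the regexes, the threshold, and the "fast"/"strong" tier names are all invented to show how remembered history could tip a small snippet onto the stronger path.

```javascript
// Hypothetical sketch: signals, threshold, and tier names are assumptions.
const ALWAYS_SENSITIVE = [/password\s*=\s*["']/i, /\beval\s*\(/];
const TOUCHES_SENSITIVE_AREA = [/jwt/i, /\bselect\b.+\bfrom\b/i, /\.env\b/];
const COMPLEXITY_THRESHOLD = 80;

// Crude complexity score: line count plus a weight per branching keyword.
const scoreComplexity = (code) => {
  const lines = code.split("\n").length;
  const branches = (code.match(/\b(if|for|while|switch|catch)\b/g) || []).length;
  return lines + branches * 5;
};

// Does the recalled history mention security-flavored patterns?
const hasSecurityHistory = (memories) =>
  memories.some((m) => /auth|security|secret/i.test(m));

const chooseTier = (code, memories = []) => {
  if (ALWAYS_SENSITIVE.some((re) => re.test(code))) return "strong";
  if (scoreComplexity(code) > COMPLEXITY_THRESHOLD) return "strong";
  // Memory-aware escalation: a small snippet that touches JWT, SQL, or
  // .env only gets the stronger model when history says it is a weak spot.
  const touches = TOUCHES_SENSITIVE_AREA.some((re) => re.test(code));
  if (touches && hasSecurityHistory(memories)) return "strong";
  return "fast";
};

const snippet = "const token = jwt.sign(payload, key);";
console.log(chooseTier(snippet)); // "fast"
console.log(chooseTier(snippet, ["recurring security issues in Python flagged 2x"])); // "strong"
```

The same one-line snippet lands on different tiers depending on what memory recalls, which is the whole point of letting the two layers talk to each other.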

What's Next

The next obvious move is team memory banks.

Right now Refyn's memory works well for one person, but engineering is rarely a solo act. A shared memory layer could pick up on patterns across a whole team — common auth mistakes, performance pitfalls, sketchy database habits, or project-specific quirks that keep showing up no matter who's writing the code.

After that, GitHub PR integration feels like the natural next step. Refyn already works well as a standalone browser workspace, but memory gets a lot more powerful once it follows you into actual pull requests instead of waiting for you to paste code in. Meet developers where they already are, basically.

The third idea is using memory for targeted practice. If Refyn already knows your weak spots over time, it could turn that into interview drills, review checklists, or small exercises built around your actual recurring issues — instead of yet another random array problem that has nothing to do with what you really struggle with.

Closing Thought

Building Refyn convinced me that memory is the piece a lot of developer tooling is still missing.

Without it, AI review stays technically useful but kind of shallow. It can spot problems in the code right in front of it, but it never builds any real sense of the person writing that code. Which means it never actually gets better at reviewing you — only at reviewing whatever's currently on screen.

Once Hindsight went in, the whole product changed shape. It stopped feeling like a stateless API wrapper and started feeling like something that actually learns as you use it.

Getting there meant fighting through a bunch of 404s, wrong assumptions about the SDK, and even a UI bug that made real memory look fake after a refresh. But honestly, that's kind of the point — those failures made it obvious that if we wanted this to feel real, memory couldn't just be decoration. It had to actually work.
