I want to tell you about the moment I realised most developer tools are lying to you.
Not maliciously. Just structurally. They give you a score, you close the tab, and the next time you open it the platform has completely forgotten you ever existed. Your streak, your weak spots, the three times you struggled with dynamic programming last month — gone. You start from zero. Again.
That realisation is what became Meridian.
Where It Started
We were sitting in a hackathon briefing, staring at a problem statement about AI coding mentors, and the conversation kept circling the same frustration. Every tool we had used as students — LeetCode, GitHub stats dashboards, coding trackers — answered exactly one question: where are you right now?
Nobody was answering the questions that actually matter. Where have you been? Where are you heading? What pattern keeps showing up in your work that you have not noticed yet?
The gap was not intelligence. The tools were smart enough. The gap was memory. They were stateless by design, and that statelessness was the entire problem.
So we decided to build something that remembers.
The Idea Behind the Name
A meridian is a reference line. It is something you keep coming back to, something that tells you exactly where you are on the map and orients you toward where you are going. It does not move. You do.
That is the product in one word. Every session you run, every week you return, Meridian is the fixed reference point that shows you your own movement. Without a fixed reference, you cannot measure distance. Without memory, you cannot measure growth.
We kept the name because it felt earned.
The Architecture Decision That Changed Everything
Early in the build we had a straightforward plan: connect GitHub, run some scoring, show a dashboard. Pretty standard hackathon scope.
Then we discovered Hindsight.
Hindsight is a persistent memory SDK — it exposes two core operations, retain() and recall(), and what they do is deceptively simple. You call retain() at the end of a session and pass it everything: scores, weak areas, coaching notes, timestamps. You call recall() at the start of the next session and get back everything you ever stored, structured and queryable.
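The session loop this describes — recall before generating, retain after — can be sketched with a minimal in-memory stand-in. The real Hindsight SDK's signatures and record shapes may differ; the record fields here are assumptions based on what the post says gets stored.

```typescript
// Minimal in-memory stand-in for the retain()/recall() pattern.
// Not the real Hindsight SDK; the field names are illustrative.

interface SessionRecord {
  timestamp: number;                 // when the session ran
  scores: Record<string, number>;    // e.g. { msts: 66, codeSurvival: 85 }
  weakAreas: string[];               // e.g. ["dynamic-programming"]
  coachingNotes: string[];           // AI-generated notes
}

class MemoryStore {
  private sessions: SessionRecord[] = [];

  // retain(): called at the end of a session with everything produced
  retain(record: SessionRecord): void {
    this.sessions.push(record);
  }

  // recall(): called at the start of the next session; returns the
  // full history in chronological order so the agent can reason over it
  recall(): SessionRecord[] {
    return [...this.sessions].sort((a, b) => a.timestamp - b.timestamp);
  }
}

// Every session feeds the next one: nothing is thrown away.
const memory = new MemoryStore();
memory.retain({ timestamp: 1, scores: { msts: 62 }, weakAreas: ["graphs"], coachingNotes: ["practice BFS/DFS"] });
memory.retain({ timestamp: 2, scores: { msts: 66 }, weakAreas: ["graphs"], coachingNotes: ["graph weakness improving"] });
const history = memory.recall();
```

The point of the sketch is the shape of the loop, not the storage: the agent always sees the full history before it generates anything new.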
The moment we understood what that meant architecturally, the entire product changed shape.
Suddenly the first session was not the product. It was just the starting point. The tenth session was where it got interesting. The thirtieth session was where it got genuinely useful in a way nothing else could replicate — because nobody else had the history.
We restructured the entire backend around a single principle: every session feeds the next one. Nothing is thrown away. Every sub-score, every weakness pattern, every coaching note the AI generated — all of it goes into Hindsight. And every time you come back, the agent recalls it all before generating anything new.
This is the architectural decision that made Meridian different. Not the algorithm. Not the UI. The fact that we committed fully to memory as the primary feature, not a secondary one.
Building the MSTS Algorithm
The scoring algorithm was the hardest purely technical problem we faced.
We called it the Multi-Signal Triangulation Score — MSTS — because the whole point was that a single metric is gameable and meaningless. Stars on GitHub are gameable. A LeetCode streak is gameable. Commit count is meaningless. We needed signals that, taken together, were resistant to gaming and actually correlated with engineering quality.
We landed on six:
Task Complexity measures how hard the problems you take on actually are — looking at issue labels, lines changed, files touched, and how many other developers were attempting the same problem.
Code Survival Rate is the one that surprised people most when we explained it. It measures what percentage of your original code is still present in the codebase after time has passed. If you write 1,000 lines and 850 are still there untouched a month later, that is an 85% survival rate. High survival means your code was good enough that nobody needed to rewrite it. Low survival means your code got replaced. It is a brutal but honest signal.
Issue Resolution Velocity measures not just whether you close issues but how fast you close them relative to their complexity. A P0 bug closed in two hours scores very differently from a minor feature request that took three weeks.
PR Review Depth is a social signal. How many review comments do you leave on other people's PRs? Quality engineers review other people's code and leave substantive comments. This signal rewards that behaviour.
Cross-Repo Bonus rewards engineers who contribute across multiple repositories rather than staying siloed in one codebase.
Temporal Decay applies an exponential decay function to all activity so that recent contributions matter more than contributions from six months ago. Active maintainers score higher than people coasting on old commits.
The formula is a weighted composite, clamped between 0 and 100. But the important thing is that each sub-score is stored individually in Hindsight — so you can track your trajectory on each dimension separately. That is what enables the coaching to get specific: "Your Code Survival is declining three weeks in a row" is a very different and more useful statement than "your overall score dropped."
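The composite described above can be sketched as follows. The weights and the decay half-life here are illustrative assumptions — the post does not publish the real values — but the structure (six sub-scores, exponential temporal decay, clamped to 0–100) matches the description.

```typescript
// Sketch of the MSTS composite. WEIGHTS and HALF_LIFE_DAYS are
// assumed values for illustration, not Meridian's real parameters.

interface SubScores {
  taskComplexity: number;   // 0-100
  codeSurvival: number;     // 0-100
  issueVelocity: number;    // 0-100
  prReviewDepth: number;    // 0-100
  crossRepoBonus: number;   // 0-100
}

// Code Survival Rate: the post's own example, 850 of 1,000 lines
// still present a month later, gives 85.
function survivalRate(originalLines: number, survivingLines: number): number {
  return (survivingLines / originalLines) * 100;
}

// Temporal decay: exponential, so recent activity weighs more.
const HALF_LIFE_DAYS = 90; // assumed half-life
function decayWeight(ageDays: number): number {
  return Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
}

const WEIGHTS: Record<keyof SubScores, number> = {
  taskComplexity: 0.25,
  codeSurvival: 0.25,
  issueVelocity: 0.2,
  prReviewDepth: 0.2,
  crossRepoBonus: 0.1,
};

function msts(scores: SubScores, ageDays: number): number {
  const raw = (Object.keys(WEIGHTS) as (keyof SubScores)[])
    .reduce((sum, k) => sum + WEIGHTS[k] * scores[k], 0);
  // Clamp the decayed composite to [0, 100].
  return Math.min(100, Math.max(0, raw * decayWeight(ageDays)));
}
```

In the real system each sub-score would be retained individually, so the coaching can reference one dimension's trajectory rather than only the composite.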
The Challenge of Three Data Sources
We pull from GitHub, LeetCode, and Codeforces simultaneously. Each source has completely different authentication models, rate limits, data structures, and reliability characteristics.
GitHub is OAuth-based and well-documented. LeetCode has no official public API, so we use the GraphQL endpoint that the LeetCode frontend uses — unofficial, undocumented, and occasionally unreliable. Codeforces has a clean public REST API that is fast but has aggressive rate limits.
The hardest part was not the individual integrations. It was making the three data sources tell a coherent story together. A developer's LeetCode category performance and their GitHub Code Survival rate should be correlated — a developer who struggles with graph problems should show that weakness in both their competitive programming history and the kinds of PRs that get the most review comments on their GitHub. Surfacing that triangulated insight is what we called Confirmed Skill Gap detection — the same weakness appearing across two independent platforms gets flagged as a Priority Gap, not just a data point.
That detection only works because Hindsight is retaining the historical data. A one-time signal could be noise. The same signal across six sessions is a pattern. The same signal across two platforms is a confirmed gap.
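The cross-platform rule can be sketched directly: a weakness is flagged as a Priority Gap only when the same category shows up on at least two independent platforms. The category names and signal shape here are illustrative, not Meridian's actual data model.

```typescript
// Sketch of Confirmed Skill Gap detection. A weakness seen on one
// platform is a data point; the same category on >= 2 platforms is
// flagged as a Priority Gap. Field names are assumptions.

type Platform = "leetcode" | "codeforces" | "github";

interface WeaknessSignal {
  platform: Platform;
  category: string; // e.g. "graphs", "dynamic-programming"
}

function priorityGaps(signals: WeaknessSignal[]): string[] {
  const byCategory = new Map<string, Set<Platform>>();
  for (const s of signals) {
    const platforms = byCategory.get(s.category) ?? new Set<Platform>();
    platforms.add(s.platform);
    byCategory.set(s.category, platforms);
  }
  // Confirmed gap: the same weakness on two independent platforms.
  return [...byCategory.entries()]
    .filter(([, platforms]) => platforms.size >= 2)
    .map(([category]) => category);
}

const gaps = priorityGaps([
  { platform: "leetcode", category: "graphs" },
  { platform: "github", category: "graphs" },
  { platform: "leetcode", category: "dynamic-programming" },
]);
```

In practice the signals would come out of recalled Hindsight history, which is what lets a single-session fluke be distinguished from a repeated pattern.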
The AI Layer and the Rate Limit Problem
We use Groq as our primary LLM provider because the hackathon recommended it and frankly it is fast — inference is noticeably quicker than other providers at the same model size.
But we hit a wall during testing. Groq's free tier has rate limits, and we were hammering the API with test sessions, running into 429 errors constantly. Individual users would not hit those limits in normal usage, but they were breaking our development workflow.
We solved it by building a dual-key router. We registered two Groq API keys and wrote an aiRouter.ts module that tries the primary key first and automatically falls back to the secondary key on a 429 response. It was a one-afternoon engineering problem that turned into a resilience feature we are genuinely proud of: the AI layer no longer falls over the moment a single key hits its limit.
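The fallback logic can be sketched in a few lines. The aiRouter.ts internals are not published, so this is only a plausible shape: a generic call function, a 429 check, and a rethrow for everything that is not a rate limit.

```typescript
// Sketch of a dual-key fallback router. The real aiRouter.ts may
// differ; LlmCall stands in for the actual Groq API call.

class RateLimitError extends Error {
  readonly status = 429;
}

type LlmCall = (apiKey: string, prompt: string) => Promise<string>;

async function routedCall(
  call: LlmCall,
  primaryKey: string,
  fallbackKey: string,
  prompt: string,
): Promise<string> {
  try {
    return await call(primaryKey, prompt);
  } catch (err) {
    // Only fall back on a rate limit; rethrow everything else so
    // real failures stay visible instead of being silently retried.
    if (err instanceof RateLimitError) {
      return call(fallbackKey, prompt);
    }
    throw err;
  }
}
```

Restricting the fallback to 429s matters: retrying on every error would mask genuine bugs behind the second key.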
We also discovered partway through that Gemini had better performance on certain types of structured output generation — specifically the learning path and challenge generation endpoints where we needed JSON output in a very specific schema. We added Gemini as an optional routing target for those specific endpoints while keeping Groq as the primary.
What Building the Organisation Dashboard Taught Us
The organisation feature was the last major piece we built and it revealed something about the product we had not fully understood until then.
The value for organisations is not individual scores. Individual scores are easy. Any tool can produce those. The value is cohort memory — the ability to watch a group of developers over time, track trajectories across the entire cohort, and detect who is declining before it becomes a visible problem.
An at-risk developer is not someone with a low score. A low score is a fact. An at-risk developer is someone whose trajectory is negative — declining for two or more consecutive weeks across one or more signals. The difference matters because a developer with a low but stable score knows where they are and is working from a baseline. A developer with a declining trajectory on Code Survival and Issue Velocity simultaneously is heading somewhere bad and does not necessarily know it yet.
Building that alert system required cohort memory in Hindsight: storing the organisation as its own memory namespace and retaining cohort-level snapshots alongside individual sessions. Because we had made that architectural choice from the beginning, the feature was almost trivial to build. We just recalled the cohort history, ran the trajectory calculation, and flagged the declining signals.
Without persistent memory, this feature is impossible. You cannot detect a two-week decline if you do not remember what two weeks ago looked like.
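The at-risk rule described above — declining for two or more consecutive weeks, regardless of the absolute score — is simple enough to sketch over a list of weekly snapshots. The function name and snapshot shape are assumptions; the two-week threshold is the one stated in the post.

```typescript
// Sketch of the trajectory rule: flag a signal that has declined
// for two or more consecutive weekly snapshots. Absolute level is
// irrelevant; only the direction of movement matters.

function isDeclining(weeklyScores: number[], minWeeks = 2): boolean {
  let streak = 0;
  for (let i = 1; i < weeklyScores.length; i++) {
    streak = weeklyScores[i] < weeklyScores[i - 1] ? streak + 1 : 0;
    if (streak >= minWeeks) return true;
  }
  return false;
}
```

Run per signal per developer over recalled cohort history, this is exactly why a low-but-stable developer is not flagged while a higher-scoring developer on a two-week slide is.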
The Hardest Day of the Build
There was a specific afternoon where nothing was working and everything felt wrong.
The LeetCode GraphQL endpoint was returning inconsistent data for some usernames. The Hindsight recall was returning sessions in an unexpected order that broke our trajectory calculation. The Groq router was routing to the fallback key too aggressively because we had misconfigured the 429 detection logic. And the frontend was showing a loading state that never resolved because an API endpoint was returning a 200 with an empty body instead of the expected JSON.
Four separate problems, none obviously related, all manifesting at the same time.
The way out was simple in retrospect but hard in the moment: stop trying to fix all four simultaneously. Pick the one closest to the data source, fix it completely, verify it independently, then move up the stack. LeetCode normalisation first. Hindsight sorting second. Router logic third. Frontend last.
It sounds obvious. When you are in the middle of a hackathon with a deadline and everything is broken at once, obvious is the hardest thing to remember.
What Surprised Us
Three things surprised us during the build.
First, the Code Survival Rate signal got more attention from people we demoed to than any other feature. We expected the AI coaching to be the headline. Instead, people kept asking "wait, it actually checks whether my code got rewritten?" and wanting to see their own survival rates immediately. It turns out developers have strong intuitions about the quality of their own code and very few tools give them external validation of those intuitions. The survival rate does that.
Second, the persistent memory changed how we thought about our own product. About three weeks into the build, one of us ran a real session on their own GitHub account, then ran it again two weeks later. The trajectory delta felt qualitatively different from any other developer tool either of us had used: seeing your own Code Survival rate decline between the two sessions, and watching the AI acknowledge that decline and connect it to a pattern in your PR history. It felt like the tool actually knew us. That is the user experience we were building toward, but experiencing it firsthand made the stakes feel real.
Third, the organisation feature was much harder to think about correctly than to build. The engineering was straightforward once the architecture was in place. The hard part was figuring out what information is useful to an org admin versus what is noise. A table of MSTS scores is noise. Declining trajectories with specific signal breakdowns are signal. That distinction — between data and insight — is the product design problem at the heart of Meridian, and it does not have a technical answer. It requires judgment about what developers and organisations actually need to know.
What We Would Do Differently
If we built Meridian again from scratch, we would invest earlier in the data normalisation layer. Each of the three data sources returns data in a different shape, with different null states, different error modes, and different freshness characteristics. We built normalisation logic incrementally as we discovered edge cases and the result is functional but fragmented. A unified data ingestion contract defined upfront would have saved us significant debugging time in the second half of the build.
We would also think harder about the first session experience. Right now the first session is genuinely useful but it is also a lot of information at once — six sub-scores, LeetCode breakdown, Codeforces data, coaching notes, challenges, learning path. For a new user who has not yet built up any historical data, the dashboard can feel overwhelming. The product is designed for returning users and we should do more to guide new users toward their first return session, because that is where the memory starts to matter.
The One Thing We Want People to Take Away
Persistent memory is not a feature you bolt onto an AI product. It is the architectural foundation that determines which features are even possible.
Every genuinely differentiated capability in Meridian — trajectory analysis, confirmed skill gap detection, silent pattern recognition, cohort-level alerts, four-week score prediction — requires memory. Not just storage. Structured, queryable, session-aware memory that the agent can recall and reason about.
Without Hindsight, Meridian is just another scoring dashboard. With it, every session makes the next one sharper. The gap between those two products is not engineering complexity. It is a single architectural commitment made early in the build.
Make that commitment. Build something that remembers.
Meridian is built with Hindsight (Vectorize), Groq, Google Gemini, GitHub OAuth, LeetCode GraphQL, Codeforces REST API, Next.js, Express, TypeScript, and Supabase. Built for the Hindsight Hackathon — Theme: AI Agents That Learn.
Try Meridian, read the Hindsight docs at vectorize.io, and check out the Hindsight GitHub to understand what persistent agent memory actually looks like in production.