Pavel Gajvoronski
I Built a Voice AI GMAT Tutor with Long-Term Memory in 6 Weeks — Here's the Full Stack

samiwise.app — live now


GMAT prep tutors charge $150–200 per hour. For a 3-month prep period, that's $5,000–10,000. Most people preparing for an MBA simply can't afford that — or can't find a good tutor available at 11pm when they finally have time to study.

So I built SamiWISE — a voice AI GMAT tutor that remembers every session, adapts to your weak spots, and explains material in real time using RAG over official GMAT materials. This is the story of how it was built, what I learned, and the technical decisions that made it work.

The three things no competitor combines: voice + memory + real GMAT content


The Core Problem I Was Solving

Every GMAT prep tool I looked at had the same fundamental issue: they start from scratch every single session. You explain your weak spots again. You get generic explanations that don't account for what confused you last Tuesday. There's no continuity.

A good human tutor doesn't do this. They remember that you always mess up Data Sufficiency with inequalities. They know that analogies work better for you than abstract explanations. They track your trajectory over weeks.

I wanted to build that — but accessible to everyone, available 24/7, at $49/month.


The Architecture

The system has four main layers:

User (voice)
  → Deepgram STT (~1s)
  → Orchestrator Agent — Groq llama-3.3-70b (~200ms routing)
  → Specialist Agent — Claude Sonnet + RAG from Pinecone (~3-5s)
  → ElevenLabs TTS (~1s)
  → User hears response

Total latency: 5–8 seconds. Not perfect, but feels natural — like a real tutor pausing to think.
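The pipeline above is strictly sequential, so the total latency is just the sum of the stages. A minimal sketch of that orchestration — stage names mirror the diagram, but the function signatures are my assumption, not the app's actual API:

```typescript
// Illustrative: run pipeline stages in order and record per-stage latency,
// so the 5-8s budget can be attributed to STT, routing, the specialist, or TTS.
type Stage = (input: string) => Promise<string>;

async function runPipeline(input: string, stages: [string, Stage][]) {
  const timings: Record<string, number> = {};
  let current = input;
  for (const [name, stage] of stages) {
    const t0 = Date.now();
    current = await stage(current); // each stage waits for the previous one
    timings[name] = Date.now() - t0;
  }
  return { output: current, timings };
}
```

In a real deployment you would start streaming TTS before the specialist finishes its full answer, which is the usual way to cut perceived latency below the raw sum.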


The Agent System

The most interesting architectural decision was the multi-agent routing system.

Instead of one monolithic AI tutor, there are four specialist agents plus an invisible orchestrator:

Agent            Specialization
quantitative     Problem Solving + Data Sufficiency
verbal           Critical Reasoning + Reading Comprehension
data_insights    Table Analysis, Multi-Source Reasoning (MSR), Graphics Interpretation, Two-Part Analysis (TPA)
strategy         Timing, exam psychology, study planning
orchestrator     Routes messages (the user never sees this agent)

The orchestrator runs on Groq (llama-3.3-70b) because it needs to be fast — 200ms routing decisions. Specialist agents run on Claude Sonnet because they need to be smart.

The routing prompt returns structured JSON:

{
  "route": "quantitative",
  "confidence": 0.94,
  "detected_topic": "data sufficiency with inequalities",
  "difficulty": "hard",
  "notes": "user has struggled with DS inequalities in past 3 sessions"
}

The user always hears the same voice — Sam. Transitions between agents are completely invisible.
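For the routing verdict above, the orchestrator's raw LLM output still has to be validated before dispatch — LLMs occasionally emit malformed JSON or low-confidence guesses. A minimal sketch of that validation step; the agent names match the table, but the function and thresholds are illustrative assumptions:

```typescript
// Hypothetical sketch: validate the orchestrator's JSON verdict and pick a
// specialist, falling back to a safe default when the output is unusable.
type Route = "quantitative" | "verbal" | "data_insights" | "strategy";

interface RoutingVerdict {
  route: Route;
  confidence: number;
  detected_topic: string;
  difficulty: "easy" | "medium" | "hard";
  notes: string;
}

const ROUTES: Route[] = ["quantitative", "verbal", "data_insights", "strategy"];

function pickSpecialist(raw: string, fallback: Route = "strategy"): Route {
  try {
    const verdict = JSON.parse(raw) as RoutingVerdict;
    if (ROUTES.includes(verdict.route) && verdict.confidence >= 0.5) {
      return verdict.route;
    }
  } catch {
    // Non-JSON output from the router: ignore and use the fallback.
  }
  return fallback;
}
```

The fallback route matters: a wrong-but-reasonable specialist still sounds like Sam, so a bad routing decision degrades gracefully instead of breaking the illusion of one tutor.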


Every other GMAT tool treats you like a stranger on every visit. Sam carries your entire learning history.

The Memory System — The Hard Part

This is where most AI tutors fail. Building long-term memory that actually improves tutoring quality took the most iteration.

After every session, a Memory Agent runs in the background. It reads the full session transcript and extracts a structured learner profile:

interface GmatLearnerProfile {
  weak_topics: string[]
  strong_topics: string[]
  effective_techniques: string[]      // what explanation styles worked
  ineffective_approaches: string[]    // what didn't land
  insight_moments: string[]           // "aha" phrases that clicked
  common_error_patterns: string[]     // e.g. "misreads DS question stem"
  learning_style: string
  next_session_plan: string
  score_trajectory: string
  time_pressure_notes: string
}

This profile gets stored in Supabase as a JSON field on the User model. At the start of every session, the full profile is injected into the specialist agent's system prompt.

The result: Sam says things like "Last week you struggled with probability in DS — let's approach this one differently than before" without you having to explain anything.
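Concretely, "injecting the profile" just means rendering it into a prompt preamble. A sketch of that step — the field names follow the interface above, but the wording and helper are my own illustration, not the app's code:

```typescript
// Hypothetical sketch: render the stored learner profile into a system-prompt
// preamble for the specialist agent. Only a subset of fields shown.
interface GmatLearnerProfile {
  weak_topics: string[];
  strong_topics: string[];
  effective_techniques: string[];
  common_error_patterns: string[];
  next_session_plan: string;
}

function buildMemoryPreamble(p: GmatLearnerProfile): string {
  return [
    `Known weak topics: ${p.weak_topics.join(", ") || "none recorded"}.`,
    `Known strong topics: ${p.strong_topics.join(", ") || "none recorded"}.`,
    `Explanation styles that worked: ${p.effective_techniques.join(", ") || "unknown"}.`,
    `Recurring error patterns: ${p.common_error_patterns.join(", ") || "none"}.`,
    `Plan for this session: ${p.next_session_plan}`,
  ].join("\n");
}
```

Because the preamble is rebuilt from structured fields rather than raw transcripts, it stays a few hundred tokens no matter how many sessions the user has logged.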


RAG — What I Indexed and Why

The knowledge base lives in Pinecone, split into topic namespaces:

gmat-quant       → Quantitative problems and methods
gmat-verbal      → Verbal problems and methods
gmat-di          → Data Insights problems
gmat-strategy    → Strategies, timing, test psychology
gmat-focus       → GMAT Focus Edition specific content
gmat-errors      → Common error patterns

Free sources I used:

  • deepmind/aqua_rat — 97,467 GMAT/GRE algebra problems with rationales (Apache 2.0)
  • allenai/math_qa — Math word problems with annotated formulas (Apache 2.0)
  • mister-teddy/gmat-database — DS, PS, CR, SC questions in JSON (MIT)
  • ReClor paper — 17 CR question types with examples (research)
  • Manhattan Review free PDFs — Strategy guides openly distributed

The RAG pipeline uses @xenova/transformers for embeddings (runs locally, no API cost) and retrieves top-5 chunks with reranking before passing to the specialist agent.
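The retrieve-then-rerank shape is easier to see on plain vectors. In this sketch the embedding model and Pinecone are replaced by in-memory arrays so the ranking logic itself is visible; everything here is illustrative, not the app's pipeline code:

```typescript
// Sketch: cosine-similarity retrieval over in-memory chunks. In the real
// pipeline, vectors come from @xenova/transformers and candidates from
// Pinecone; a reranker would then re-score the top candidates.
interface Chunk { id: string; vector: number[]; text: string; }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k chunks most similar to the query vector.
function topK(query: number[], chunks: Chunk[], k = 5): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k);
}
```

Running embeddings locally with @xenova/transformers trades a little indexing speed for zero per-query embedding cost, which matters when every tutoring turn triggers a retrieval.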


The Tech Stack

Frontend:     Next.js 14 + TypeScript + Tailwind CSS
Auth:         Supabase Auth
Database:     Supabase PostgreSQL + Prisma 6
Vector DB:    Pinecone (topic-scoped namespaces)
LLM Router:   Groq (llama-3.3-70b) — fast, cheap
LLM Agents:   Anthropic Claude Sonnet — smart, consistent
STT:          Deepgram (Whisper)
TTS:          ElevenLabs
Memory:       Custom Memory Agent → Supabase JSON
Payments:     Paddle (Merchant of Record, handles US tax)
Deploy:       Vercel (frontend) + Railway (agents backend)

Why split Vercel + Railway?

Vercel has an 800-second serverless function limit. A 30-minute voice tutoring session would time out. Railway runs persistent containers — no limits, no cold starts for agents.


Practice Mode with FSRS

Beyond voice sessions, I built a visual practice mode where users can work through GMAT questions in exam format.

The interesting part: I implemented FSRS (Free Spaced Repetition Scheduler) — the same algorithm modern versions of Anki ship as an optional scheduler. After each answer, the system records:

  • Was it correct?
  • How long did it take?
  • What was the difficulty?

Then it schedules the next review using an exponential forgetting curve. Questions you answered wrong come back sooner. Questions you mastered disappear for weeks.

This means the practice queue automatically prioritizes your weak spots without you having to manage anything.
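The full FSRS model tracks per-card stability and difficulty parameters; a deliberately simplified rule in the same spirit shows the core behavior the section describes — wrong answers shrink the review interval, correct answers grow it:

```typescript
// Simplified spaced-repetition rule in the spirit of FSRS (NOT the full
// algorithm): misses come back tomorrow, hits push the card out exponentially.
interface ReviewState { intervalDays: number; }

function nextInterval(state: ReviewState, correct: boolean): ReviewState {
  if (!correct) {
    // Missed question: bring it back tomorrow.
    return { intervalDays: 1 };
  }
  // Correct answer: grow the gap, capped so mastered cards still resurface.
  return { intervalDays: Math.min(state.intervalDays * 2.5, 60) };
}
```

The growth factor and cap here are placeholder values; real FSRS fits its parameters to review history, which is exactly what makes it outperform fixed-multiplier schedules.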


The Study Journal

Every session automatically updates a daily journal entry:

interface StudyJournalEntry {
  date: Date
  totalMinutes: number
  questionsTotal: number
  accuracy: number
  topicsCovered: string[]
  errorTypes: Record<string, number>
  samInsight: string          // AI-generated daily summary
  milestones: string[]        // "100 questions solved", "5 hour week"
  streakDay: number
}

The streak counter turned out to be unexpectedly powerful for retention — users don't want to break their streak. Same psychology as Duolingo, but for GMAT prep.
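The streak logic itself is small: extend on consecutive calendar days, hold on same-day repeats, reset after a gap. A hypothetical sketch (UTC calendar days for simplicity; the app's actual timezone handling may differ):

```typescript
// Illustrative streak update, comparing UTC calendar days.
function updateStreak(streak: number, lastStudy: Date, now: Date): number {
  const DAY = 24 * 60 * 60 * 1000;
  const last = Date.UTC(lastStudy.getUTCFullYear(), lastStudy.getUTCMonth(), lastStudy.getUTCDate());
  const today = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  const gapDays = Math.round((today - last) / DAY);
  if (gapDays === 0) return streak;     // already studied today
  if (gapDays === 1) return streak + 1; // consecutive day: extend
  return 1;                             // streak broken: start over
}
```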


What Doesn't Work Yet

Being honest about where things stand:

  1. Voice pipeline not live yet — Deepgram and ElevenLabs keys are configured, but the pipeline still needs production testing with real users
  2. RAG not indexed — the scripts are ready and the Pinecone account is set up, but the data hasn't been pushed yet
  3. No real users — launching next week, zero feedback so far

The architecture is built. The UI works. The agents respond correctly. Next step is connecting all the APIs and getting real people to use it.


Lessons Learned

1. Multi-agent routing is worth the complexity.
A single "GMAT tutor" prompt produces mediocre results across all topics. Specialist agents with deep domain prompts are significantly better. The routing overhead is minimal.

2. Memory quality matters more than memory quantity.
I originally tried to store everything — full transcripts, every message. The prompts became too long and performance degraded. The Memory Agent that extracts structured insights (not raw content) works much better.

3. Split your infrastructure early.
I almost deployed everything to Vercel. The 800-second limit would have killed voice sessions. Railway for long-running processes saved the architecture.

4. Free datasets are better than I expected.
The deepmind/aqua_rat dataset has 97,000 high-quality GMAT-style problems with step-by-step rationales. Apache 2.0 license. This single dataset provides more practice material than most paid prep courses.

5. Paddle for payments if you're targeting the US market.
They handle sales tax across all 50 states automatically. As a Merchant of Record, they handle chargebacks and disputes. The 5% + $0.50 fee is worth it for the peace of mind.


What's Next

  • Get first 10 beta users from r/GMAT and GMAT Club
  • Connect production APIs (Deepgram, ElevenLabs, Pinecone)
  • Run the RAG indexing scripts
  • Collect feedback on voice experience quality

If you're interested in trying it or have feedback on the architecture, I'd love to hear from you. The product is live at samiwise.app — 7-day free trial, no credit card required.


Built with Next.js, Claude Sonnet, Groq, Pinecone, Deepgram, ElevenLabs, Supabase, Paddle, and Railway. Full stack TypeScript.
