Pavel Gajvoronski
I Built a Voice AI GMAT Tutor with Long-Term Memory in 6 Weeks — Here's the Full Stack

samiwise.app — live now


GMAT prep tutors charge $150–200 per hour. For a 3-month prep period, that's $5,000–10,000. Most people preparing for an MBA simply can't afford that — or can't find a good tutor available at 11pm when they finally have time to study.

So I built SamiWISE — a voice AI GMAT tutor that remembers every session, adapts to your weak spots, and explains material in real time using RAG over official GMAT materials. This is the story of how it was built, what I learned, and the technical decisions that made it work.

The three things no competitor combines: voice + memory + real GMAT content


The Core Problem I Was Solving

Every GMAT prep tool I looked at had the same fundamental issue: they start from scratch every single session. You explain your weak spots again. You get generic explanations that don't account for what confused you last Tuesday. There's no continuity.

A good human tutor doesn't do this. They remember that you always mess up Data Sufficiency with inequalities. They know that analogies work better for you than abstract explanations. They track your trajectory over weeks.

I wanted to build that — but accessible to everyone, available 24/7, at $49/month.


The Architecture

The system has four main layers:

User (voice)
  → Deepgram STT (~1s)
  → Orchestrator Agent — Groq llama-3.3-70b (~200ms routing)
  → Specialist Agent — Claude Sonnet + RAG from Pinecone (~3-5s)
  → ElevenLabs TTS (~1s)
  → User hears response

Total latency: 5–8 seconds. Not perfect, but feels natural — like a real tutor pausing to think.
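The pipeline above is strictly sequential, so the total latency is just the sum of the stages. A minimal sketch of that orchestration — stage names mirror the diagram, but the function signatures are my assumption, not the app's actual API:

```typescript
// Illustrative: run pipeline stages in order and record per-stage latency,
// so the 5-8s budget can be attributed to STT, routing, the specialist, or TTS.
type Stage = (input: string) => Promise<string>;

async function runPipeline(input: string, stages: [string, Stage][]) {
  const timings: Record<string, number> = {};
  let current = input;
  for (const [name, stage] of stages) {
    const t0 = Date.now();
    current = await stage(current); // each stage waits for the previous one
    timings[name] = Date.now() - t0;
  }
  return { output: current, timings };
}
```

In a real deployment you would start streaming TTS before the specialist finishes its full answer, which is the usual way to cut perceived latency below the raw sum.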


The Agent System

The most interesting architectural decision was the multi-agent routing system.

Instead of one monolithic AI tutor, there are four specialist agents plus an invisible orchestrator:

Agent            Specialization
quantitative     Problem Solving + Data Sufficiency
verbal           Critical Reasoning + Reading Comprehension
data_insights    Table Analysis, Multi-Source Reasoning (MSR), Graphics Interpretation, Two-Part Analysis (TPA)
strategy         Timing, exam psychology, study planning
orchestrator     Routes messages (the user never sees this agent)

The orchestrator runs on Groq (llama-3.3-70b) because it needs to be fast — 200ms routing decisions. Specialist agents run on Claude Sonnet because they need to be smart.

The routing prompt returns structured JSON:

{
  "route": "quantitative",
  "confidence": 0.94,
  "detected_topic": "data sufficiency with inequalities",
  "difficulty": "hard",
  "notes": "user has struggled with DS inequalities in past 3 sessions"
}

The user always hears the same voice — Sam. Transitions between agents are completely invisible.
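For the routing verdict above, the orchestrator's raw LLM output still has to be validated before dispatch — LLMs occasionally emit malformed JSON or low-confidence guesses. A minimal sketch of that validation step; the agent names match the table, but the function and thresholds are illustrative assumptions:

```typescript
// Hypothetical sketch: validate the orchestrator's JSON verdict and pick a
// specialist, falling back to a safe default when the output is unusable.
type Route = "quantitative" | "verbal" | "data_insights" | "strategy";

interface RoutingVerdict {
  route: Route;
  confidence: number;
  detected_topic: string;
  difficulty: "easy" | "medium" | "hard";
  notes: string;
}

const ROUTES: Route[] = ["quantitative", "verbal", "data_insights", "strategy"];

function pickSpecialist(raw: string, fallback: Route = "strategy"): Route {
  try {
    const verdict = JSON.parse(raw) as RoutingVerdict;
    if (ROUTES.includes(verdict.route) && verdict.confidence >= 0.5) {
      return verdict.route;
    }
  } catch {
    // Non-JSON output from the router: ignore and use the fallback.
  }
  return fallback;
}
```

The fallback route matters: a wrong-but-reasonable specialist still sounds like Sam, so a bad routing decision degrades gracefully instead of breaking the illusion of one tutor.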


Every other GMAT tool treats you like a stranger on every visit. Sam carries your entire learning history.

The Memory System — The Hard Part

This is where most AI tutors fail. Building long-term memory that actually improves tutoring quality took the most iteration.

After every session, a Memory Agent runs in the background. It reads the full session transcript and extracts a structured learner profile:

interface GmatLearnerProfile {
  weak_topics: string[]
  strong_topics: string[]
  effective_techniques: string[]      // what explanation styles worked
  ineffective_approaches: string[]    // what didn't land
  insight_moments: string[]           // "aha" phrases that clicked
  common_error_patterns: string[]     // e.g. "misreads DS question stem"
  learning_style: string
  next_session_plan: string
  score_trajectory: string
  time_pressure_notes: string
}

This profile gets stored in Supabase as a JSON field on the User model. At the start of every session, the full profile is injected into the specialist agent's system prompt.

The result: Sam says things like "Last week you struggled with probability in DS — let's approach this one differently than before" without you having to explain anything.
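Concretely, "injecting the profile" just means rendering it into a prompt preamble. A sketch of that step — the field names follow the interface above, but the wording and helper are my own illustration, not the app's code:

```typescript
// Hypothetical sketch: render the stored learner profile into a system-prompt
// preamble for the specialist agent. Only a subset of fields shown.
interface GmatLearnerProfile {
  weak_topics: string[];
  strong_topics: string[];
  effective_techniques: string[];
  common_error_patterns: string[];
  next_session_plan: string;
}

function buildMemoryPreamble(p: GmatLearnerProfile): string {
  return [
    `Known weak topics: ${p.weak_topics.join(", ") || "none recorded"}.`,
    `Known strong topics: ${p.strong_topics.join(", ") || "none recorded"}.`,
    `Explanation styles that worked: ${p.effective_techniques.join(", ") || "unknown"}.`,
    `Recurring error patterns: ${p.common_error_patterns.join(", ") || "none"}.`,
    `Plan for this session: ${p.next_session_plan}`,
  ].join("\n");
}
```

Because the preamble is rebuilt from structured fields rather than raw transcripts, it stays a few hundred tokens no matter how many sessions the user has logged.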


RAG — What I Indexed and Why

The knowledge base lives in Pinecone, split into topic namespaces:

gmat-quant       → Quantitative problems and methods
gmat-verbal      → Verbal problems and methods
gmat-di          → Data Insights problems
gmat-strategy    → Strategies, timing, test psychology
gmat-focus       → GMAT Focus Edition specific content
gmat-errors      → Common error patterns

Free sources I used:

  • deepmind/aqua_rat — 97,467 GMAT/GRE algebra problems with rationales (Apache 2.0)
  • allenai/math_qa — Math word problems with annotated formulas (Apache 2.0)
  • mister-teddy/gmat-database — DS, PS, CR, SC questions in JSON (MIT)
  • ReClor paper — 17 CR question types with examples (research)
  • Manhattan Review free PDFs — Strategy guides openly distributed

The RAG pipeline uses @xenova/transformers for embeddings (runs locally, no API cost) and retrieves top-5 chunks with reranking before passing to the specialist agent.
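The retrieve-then-rerank shape is easier to see on plain vectors. In this sketch the embedding model and Pinecone are replaced by in-memory arrays so the ranking logic itself is visible; everything here is illustrative, not the app's pipeline code:

```typescript
// Sketch: cosine-similarity retrieval over in-memory chunks. In the real
// pipeline, vectors come from @xenova/transformers and candidates from
// Pinecone; a reranker would then re-score the top candidates.
interface Chunk { id: string; vector: number[]; text: string; }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k chunks most similar to the query vector.
function topK(query: number[], chunks: Chunk[], k = 5): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k);
}
```

Running embeddings locally with @xenova/transformers trades a little indexing speed for zero per-query embedding cost, which matters when every tutoring turn triggers a retrieval.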


The Tech Stack

Frontend:     Next.js 14 + TypeScript + Tailwind CSS
Auth:         Supabase Auth
Database:     Supabase PostgreSQL + Prisma 6
Vector DB:    Pinecone (topic-scoped namespaces)
LLM Router:   Groq (llama-3.3-70b) — fast, cheap
LLM Agents:   Anthropic Claude Sonnet — smart, consistent
STT:          Deepgram (Whisper)
TTS:          ElevenLabs
Memory:       Custom Memory Agent → Supabase JSON
Payments:     Paddle (Merchant of Record, handles US tax)
Deploy:       Vercel (frontend) + Railway (agents backend)

Why split Vercel + Railway?

Vercel has an 800-second serverless function limit. A 30-minute voice tutoring session would time out. Railway runs persistent containers — no limits, no cold starts for agents.


Practice Mode with FSRS

Beyond voice sessions, I built a visual practice mode where users can work through GMAT questions in exam format.

The interesting part: I implemented FSRS (Free Spaced Repetition Scheduler) — the same algorithm modern versions of Anki ship as an optional scheduler. After each answer, the system records:

  • Was it correct?
  • How long did it take?
  • What was the difficulty?

Then it schedules the next review using an exponential forgetting curve. Questions you answered wrong come back sooner. Questions you mastered disappear for weeks.

This means the practice queue automatically prioritizes your weak spots without you having to manage anything.
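The full FSRS model tracks per-card stability and difficulty parameters; a deliberately simplified rule in the same spirit shows the core behavior the section describes — wrong answers shrink the review interval, correct answers grow it:

```typescript
// Simplified spaced-repetition rule in the spirit of FSRS (NOT the full
// algorithm): misses come back tomorrow, hits push the card out exponentially.
interface ReviewState { intervalDays: number; }

function nextInterval(state: ReviewState, correct: boolean): ReviewState {
  if (!correct) {
    // Missed question: bring it back tomorrow.
    return { intervalDays: 1 };
  }
  // Correct answer: grow the gap, capped so mastered cards still resurface.
  return { intervalDays: Math.min(state.intervalDays * 2.5, 60) };
}
```

The growth factor and cap here are placeholder values; real FSRS fits its parameters to review history, which is exactly what makes it outperform fixed-multiplier schedules.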


The Study Journal

Every session automatically updates a daily journal entry:

interface StudyJournalEntry {
  date: Date
  totalMinutes: number
  questionsTotal: number
  accuracy: number
  topicsCovered: string[]
  errorTypes: Record<string, number>
  samInsight: string          // AI-generated daily summary
  milestones: string[]        // "100 questions solved", "5 hour week"
  streakDay: number
}

The streak counter turned out to be unexpectedly powerful for retention — users don't want to break their streak. Same psychology as Duolingo, but for GMAT prep.
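The streak logic itself is small: extend on consecutive calendar days, hold on same-day repeats, reset after a gap. A hypothetical sketch (UTC calendar days for simplicity; the app's actual timezone handling may differ):

```typescript
// Illustrative streak update, comparing UTC calendar days.
function updateStreak(streak: number, lastStudy: Date, now: Date): number {
  const DAY = 24 * 60 * 60 * 1000;
  const last = Date.UTC(lastStudy.getUTCFullYear(), lastStudy.getUTCMonth(), lastStudy.getUTCDate());
  const today = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  const gapDays = Math.round((today - last) / DAY);
  if (gapDays === 0) return streak;     // already studied today
  if (gapDays === 1) return streak + 1; // consecutive day: extend
  return 1;                             // streak broken: start over
}
```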


What Doesn't Work Yet

Being honest about where things stand:

  1. Voice pipeline not live yet — Deepgram and ElevenLabs keys are configured, but the pipeline still needs production testing with real users
  2. RAG not indexed — the scripts are ready and the Pinecone account is set up, but the data hasn't been pushed yet
  3. No real users — launching next week, zero feedback so far

The architecture is built. The UI works. The agents respond correctly. Next step is connecting all the APIs and getting real people to use it.


Lessons Learned

1. Multi-agent routing is worth the complexity.
A single "GMAT tutor" prompt produces mediocre results across all topics. Specialist agents with deep domain prompts are significantly better. The routing overhead is minimal.

2. Memory quality matters more than memory quantity.
I originally tried to store everything — full transcripts, every message. The prompts became too long and performance degraded. The Memory Agent that extracts structured insights (not raw content) works much better.

3. Split your infrastructure early.
I almost deployed everything to Vercel. The 800-second limit would have killed voice sessions. Railway for long-running processes saved the architecture.

4. Free datasets are better than I expected.
The deepmind/aqua_rat dataset has 97,000 high-quality GMAT-style problems with step-by-step rationales. Apache 2.0 license. This single dataset provides more practice material than most paid prep courses.

5. Paddle for payments if you're targeting the US market.
They handle sales tax across all 50 states automatically. As a Merchant of Record, they handle chargebacks and disputes. The 5% + $0.50 fee is worth it for the peace of mind.


What's Next

  • Get first 10 beta users from r/GMAT and GMAT Club
  • Connect production APIs (Deepgram, ElevenLabs, Pinecone)
  • Run the RAG indexing scripts
  • Collect feedback on voice experience quality

If you're interested in trying it or have feedback on the architecture, I'd love to hear from you. The product is live at samiwise.app — 7-day free trial, no credit card required.


Built with Next.js, Claude Sonnet, Groq, Pinecone, Deepgram, ElevenLabs, Supabase, Paddle, and Railway. Full stack TypeScript.
