Beyond Chatbots: Building a Stateful Interview Coach That Remembers Your Failures

#ai #architecture #career #interview

Most AI interview bots are glorified search engines with a chat interface. They ask a question, give a generic textbook answer, and—worst of all—they suffer from total amnesia. If you struggled with hash maps last week, the agent has no idea. It resets every session.

That isn't coaching; that’s just asking questions.

I set out to build something different: an interview coach that treats each session as an iteration, tracks your progress, and manages API costs intelligently.

The Architecture

To solve the memory and cost problem, I didn't want to build a complex, brittle backend. I kept the stack focused:
Memory Layer: Hindsight (for long-term state retention).
Intelligence Layer: cascadeflow (for runtime model routing).
LLM Provider: Groq (for low-latency inference).
The Flow:

User Input → Memory Check → Router (Route to cheap/expensive model) → LLM Response → Persistence

How It Works

1. Hindsight for Memory

Standard RAG (Retrieval-Augmented Generation) is often too shallow for personal growth. I needed something that could track patterns in failure. I used Hindsight to maintain a persistent bank of user struggles (e.g., "recursion," "concurrency").
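
To make "tracking patterns in failure" concrete, here is a minimal sketch of a struggle bank. It is a plain in-memory stand-in written for illustration, not Hindsight's API: `MEMORY_BANKS`, `record_struggle`, and `recurring_struggles` are hypothetical names, and a real memory layer would persist the data between sessions.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory stand-in for a persistent memory bank.
# This is NOT Hindsight's API; a real store would survive restarts.
MEMORY_BANKS: dict[str, list[str]] = defaultdict(list)


def record_struggle(bank_id: str, topic: str) -> None:
    """Append one weak topic observed during a session (normalized to lowercase)."""
    MEMORY_BANKS[bank_id].append(topic.strip().lower())


def recurring_struggles(bank_id: str, min_count: int = 2) -> list[str]:
    """Return topics failed at least `min_count` times, most frequent first."""
    # .get() avoids creating an empty bank for users we have never seen.
    counts = Counter(MEMORY_BANKS.get(bank_id, []))
    return [topic for topic, n in counts.most_common() if n >= min_count]


record_struggle("user-42", "recursion")
record_struggle("user-42", "concurrency")
record_struggle("user-42", "Recursion")
print(recurring_struggles("user-42"))
```

Counting repeats rather than only reading the latest entry is what separates a recurring weakness from a one-off slip.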

When a session starts, the agent pulls that history and injects it into the system prompt. It moves the conversation from "How was your day?" to "Last session, you stumbled on the hash map collision logic. Let’s revisit that."

2. Cascadeflow for Intelligence

I couldn't justify routing every "hello" or "I'm ready" through a flagship model like gpt-4o. It is expensive and unnecessary.
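
One way to keep trivial turns off an expensive model is a small routing function. The sketch below is a keyword heuristic written for illustration, not cascadeflow's actual API. The `route` function, the keyword set, and the length threshold are all assumptions, and the model IDs follow Groq's naming and should be swapped for whatever you deploy.

```python
import re

# Model IDs are assumptions based on Groq's naming; replace with your own.
CHEAP_MODEL = "llama-3.1-8b-instant"
STRONG_MODEL = "llama-3.3-70b-versatile"

# Hypothetical keyword list; a production router would use a richer signal.
TECHNICAL_TERMS = {
    "explain", "implement", "complexity", "recursion",
    "polymorphism", "concurrency", "hash", "algorithm",
}


def route(message: str) -> tuple[str, str]:
    """Return (model, reason) for a single user message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    # Any technical keyword, or an unusually long message, goes to the strong model.
    if words & TECHNICAL_TERMS or len(words) > 25:
        return STRONG_MODEL, "Technical"
    return CHEAP_MODEL, "Conversational"


for text in ("I'm ready", "Explain polymorphism."):
    model, reason = route(text)
    print(f"Routing to: {model} | Reason: {reason}")
```

Returning the reason alongside the model keeps every routing decision explainable when you read the logs later.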

I implemented a runtime router. Simple conversational inputs route to llama-3.1-8b (low cost), while complex technical queries route to llama-3.3-70b or gpt-4o. Because every routing decision is logged, I get an audit trail and can track my spending in real time.

The "Aha!" Moment: Fixing the Memory Leak

My first iteration was broken. I initially tried to force memory into the user message:
Python

# The "Bad" Approach

if "what was" in question.lower():
    memory = MEMORY_BANKS[bank_id][-1]
    # The memory is prepended to the user's own message.
    prompt = f"User previously struggled with {memory}. {question}"
The AI just got confused: it treated "User previously struggled with recursion" as something the user had written and repeated it back to me. It never used the memory to change how it coached.

The Fix: I moved the memory context into the System Prompt, where it belongs. I also added safety checks to ensure we don't crash on new users.
Python
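# Default to no extra context so new users (no history yet) don't crash the lookup.
memory_context = ""
if MEMORY_BANKS.get(bank_id) and "what was" in question.lower():
    memory = MEMORY_BANKS[bank_id][-1]  # most recent recorded struggle
    # Leading newlines keep the note separate from the base system prompt.
    memory_context = f"\n\nContext: User failed at {memory} before. Mention this."

response = client.chat.completions.create(
    model=model,
    messages=[
        # Memory goes in the system message; the user message stays untouched.
        {"role": "system", "content": SYSTEM_PROMPT + memory_context},
        {"role": "user", "content": question},
    ],
)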
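
For readers who want to check the separation without calling an API, here is a self-contained sketch of the same idea. `build_messages` and the placeholder `SYSTEM_PROMPT` are hypothetical, and the sketch drops the "what was" trigger for brevity.

```python
# Placeholder prompt for illustration; the real prompt is not shown in this post.
SYSTEM_PROMPT = "You are a technical interview coach."


def build_messages(question: str, history: list[str]) -> list[dict[str, str]]:
    """Put recalled struggles in the system message; leave the user message as typed."""
    system = SYSTEM_PROMPT
    if history:  # new users have no history, so nothing is appended
        system += f"\n\nContext: User failed at {history[-1]} before. Mention this."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


messages = build_messages("Explain recursion.", ["hash maps", "recursion"])
for message in messages:
    print(message["role"], "->", message["content"])
```

The user message comes back exactly as typed, so the model has no reason to attribute the memory note to the user.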
By separating the "memory" from the "user intent," the agent actually interprets the feedback and changes its behavior.

Results: Before vs. After

The difference is night and day.

Before (Stateless Bot):

You: Explain polymorphism.

AI: Polymorphism is a core concept in OOP where...

(Cost: $0.02)

After (Intelligent Agent):

[HINDSIGHT] Recall: Found feedback from session #1

[CASCADEFLOW] Routing to: llama-3.3-70b-versatile | Reason: Technical

AI: In session #1, you were shaky on v-table lookups. Let's dig deeper: Explain the runtime cost of polymorphism.

(Cost: $0.008)

Lessons Learned
Memory requires code, not just prompts. You can't just tell an AI to "remember." You have to build a persistence layer. Hindsight forced me to structure my data properly.

Log your costs. If you don't explicitly log routing decisions (as I did with Cascadeflow), you have no idea how much money you’re leaking on simple queries.

System prompts are for context; user prompts are for intent. Once I separated these, the agent stopped parroting its injected context and started coaching.
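
The cost-logging lesson can be sketched in a few lines. Everything here is an assumption made for illustration: the model labels, the per-token prices (check your provider's current rates), the single blended price for input and output tokens, and the `log_decision` helper. None of it is cascadeflow's API.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("router")

# Hypothetical prices in USD per million tokens, not real provider rates.
PRICE_PER_MTOK = {"small-model": 0.10, "large-model": 1.00}


def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens * PRICE_PER_MTOK[model] / 1_000_000


def log_decision(model: str, reason: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Log one routing decision with its estimated cost and return that cost."""
    cost = request_cost(model, prompt_tokens, completion_tokens)
    log.info("model=%s reason=%s tokens=%d cost=$%.6f",
             model, reason, prompt_tokens + completion_tokens, cost)
    return cost


session_total = (
    log_decision("small-model", "Conversational", 40, 10)
    + log_decision("large-model", "Technical", 600, 400)
)
log.info("session_total=$%.6f", session_total)
```

With one log line per request, a quick scan shows whether simple queries are reaching the expensive model.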