Eight months ago, a CS exam forced me to write pseudocode when I already knew how to code. Instead of studying, I rage-built an app.
Today examintelligence.app is live. Here’s exactly how I got here—from vibe-coded POCs to a production hybrid AI pipeline—without the curated startup gloss.
The Philosophy Behind the Build
I’ve always believed studying for marks ≠ actually learning.
When I was first introduced to organic chemistry, I hated it. Then I ran into GNNs (graph neural networks) in the book Machine Learning with PyTorch and Scikit-Learn, paired with the MoleculeNet dataset. Suddenly, everything clicked. I wanted to learn everything about it.
That’s the core problem: exams optimize for pattern recognition, not curiosity. You’re forced down one prescribed path, and it rm -rfs the fun of learning in most cases.
So one week before my first prelims, I decided to build exam intelligence.
The plan was simple: introduce brutal efficiency by using AI for what it’s actually built for, pattern recognition.
Parse every past paper, mark scheme, and examiner report. Distill it down to precisely what matters. Free up time for coding and creative work.
Vibe-Coding the POC (and Why It Collapsed)
I’m generally against vibe-coding. It’s unreliable, hard to maintain, and a security nightmare. But with prelims staring me in the face, I had no choice.
I opened Claude and vibe-coded it module by module. The only code review I had time for was checking for suspicious os.system or subprocess calls. That was it. I shipped anyway.
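A minimal sketch of what that kind of quick check could look like if automated, assuming a small deny-list of shell-spawning calls. The function name and the deny-list are hypothetical, not the script actually used for the review.

```python
import ast

# Hypothetical deny-list; extend it to match your own threat model.
SUSPICIOUS_MODULES = {"subprocess"}
SUSPICIOUS_CALLS = {("os", "system"), ("os", "popen")}


def find_suspicious_calls(source: str) -> list[tuple[int, str]]:
    """Return (line_number, description) pairs for risky imports and calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in SUSPICIOUS_MODULES:
                    findings.append((node.lineno, f"import {alias.name}"))
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in SUSPICIOUS_MODULES:
                findings.append((node.lineno, f"from {node.module} import ..."))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            owner = node.func.value
            if isinstance(owner, ast.Name):
                flagged = (owner.id, node.func.attr) in SUSPICIOUS_CALLS
                if flagged or owner.id in SUSPICIOUS_MODULES:
                    findings.append((node.lineno, f"{owner.id}.{node.func.attr}(...)"))
    return sorted(findings)
```

This only catches the obvious spellings. Something like `from os import system` followed by a bare `system(...)` call slips through, so it is a first pass and not a security review.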
Initial stack:
- Gemini API (no agent frameworks, no LangGraph)
- Streamlit frontend
- PostgreSQL
It validated my idea, but functionally it barely held together.
After prelims, I finally looked at what the AI had actually built:
- A dashboard showing random stats
- Code that asked Gemini for a JSON response with 5 keys, then saved only 2
- Code that randomly created DB tables while trying to read subjects
The kind of code you end up with when you let an AI cook unsupervised for a week.
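The "asked for 5 keys, saved only 2" bug is the kind of thing a strict schema check catches early. A minimal sketch, assuming the response is a flat JSON object; the five key names are placeholders, not the app's real fields.

```python
import json

# Hypothetical schema: these five keys are placeholders for illustration.
EXPECTED_KEYS = {"topic", "summary", "key_terms", "common_mistakes", "mark_scheme_points"}


def parse_model_json(raw: str) -> dict:
    """Parse a model response and fail loudly if its keys drift from the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = EXPECTED_KEYS - set(data)
    extra = set(data) - EXPECTED_KEYS
    if missing or extra:
        raise ValueError(f"schema drift: missing={sorted(missing)} extra={sorted(extra)}")
    return data
```

Failing on extra keys as well as missing ones is a design choice: it forces the prompt and the storage layer to agree on one schema, so nothing is silently requested and then thrown away.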
So I did the only reasonable thing: opened Neovim and rebuilt from scratch.
Stack shift:
- Streamlit → Django
- Raw API calls → proper LangGraph workflow with auto-retries
- Spaghetti → something I could actually reason about
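To make "auto-retries" concrete, here is a framework-free sketch of retrying a flaky step with exponential backoff. It is a plain-Python stand-in, not LangGraph's own retry mechanism, and the function name and defaults are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    step: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `step`, retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` in as a parameter keeps the helper testable without real waiting. In a real pipeline you would likely narrow `except Exception` to the transient errors your API client actually raises.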
Hot take: letting a coding agent write your core architecture is a recipe for unmaintainable code. To build something real, the foundation has to be real.
My Agent Strategy (and How I Spent $0)
I now code the base architecture myself. Not by dumping decisions into a .md file and walking away, but by writing actual code in Neovim with minimal AI reliance, leaning on docs and classic debugging (and of course Stack Overflow).
Then I hand it to the agent with a real starting point.
Example workflow:
- I build the login page
- Agent takes it and turns it into signup
- It modifies precisely instead of generating slop from scratch
Consistent patterns. Reviewable diffs. Code I can actually reason about.
Agents I use:
- Precise edits → Claude web
- Cloud models → OpenCode
- Local models → Pi coding agents
And yes, I spent $0 on agents. Not by hopping free tiers, but by making every edit targeted and intentional. Tokens spent = value shipped.
This didn’t slow me down. The 8 months weren’t lost to agent strategy. They went into experiments: fine-tuning, hybrid ML/LLM systems, and pushing quality to a level I actually wanted to ship. The agent strategy just kept the codebase clean enough to run those experiments.
The Experimental Mess (Everything I Tried To Ensure Quality & Precise Notes)
Once the base app was stable, I went heads-down on the AI core. This is where things got messy.
I tried moving the processing pipeline from Gemini 2.5 Flash to local Llama weights. Spent a week benchmarking Python PDF-to-text parsers (PyMuPDF, pdfplumber). Results were great for raw text, but all it took was one complex structural diagram to break them.
Switched strategies to small multimodal models (Llama-3.2-3B-Instruct). Ran a sliding context window, passing page-by-page image inputs alongside a cumulative state JSON from previous pages. Turns out, thinking like the GIL (one page at a time, strictly in sequence) was a mistake.
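A rough sketch of that page-by-page loop, assuming the model is wrapped in a callable that takes one page image plus the serialized state and returns a JSON string. The `PageModel` signature and the state fields are hypothetical.

```python
import json
from typing import Callable

# Hypothetical signature: (page_image_bytes, state_json) -> JSON string
# containing whatever this page adds to the running state.
PageModel = Callable[[bytes, str], str]


def process_document(pages: list[bytes], model: PageModel) -> dict:
    """Feed pages one at a time, carrying cumulative state forward as JSON."""
    state: dict = {"topics": [], "pages_seen": 0}
    for page_image in pages:
        update = json.loads(model(page_image, json.dumps(state)))
        state["topics"].extend(update.get("topics", []))
        state["pages_seen"] += 1
    return state
```

Each call depends on the state produced by the previous one, so pages cannot be processed in parallel, and one bad page update contaminates everything after it.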
I burned my Lightning AI GPU budget on inference and Unsloth fine-tuning loops. It took way longer than cloud endpoints. Output was pure garbage in, garbage out: the data pipeline had accidentally fed unfiltered, broken v1 outputs into training.
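One guard against that failure mode is a filter that drops broken records before they reach fine-tuning. A minimal sketch, assuming each record stores the model output as a JSON string under an `output` key; the record layout and required fields are hypothetical.

```python
import json

# Hypothetical required fields for a usable training target.
REQUIRED_KEYS = {"topic", "summary"}


def is_clean(record: dict) -> bool:
    """Keep a record only if its output parses as a JSON object with the required keys."""
    try:
        output = json.loads(record["output"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return False
    return isinstance(output, dict) and REQUIRED_KEYS <= set(output)


def filter_training_records(records: list[dict]) -> list[dict]:
    """Drop records whose outputs are missing, unparseable, or incomplete."""
    return [record for record in records if is_clean(record)]
```

A structural check like this says nothing about whether the content is correct, so it is a floor for data quality, not a replacement for spot-checking samples by hand.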
Pivoted back to Gemini. But over-engineering the UI engine (JS dynamically rendering complex JSON schemas) broke the model's JSON outputs. With prelims 2 approaching, I had to git restore and reset.
Production Push: The Hybrid Architecture That Actually Worked
After prelims 2, I did a complete fresh rebuild from the ground up. Same stack, but this time with production in mind. Fixed branding, color theme, typography, UX flows, and simplified the complex schema requirements for the notes UI.
Once the frontend was out of the way, I returned to the AI core:
- Speculative Decoding: Paired Qwen 3.5 0.8B & 2B models. Both gave quality responses independently, but pairing them completely broke the outputs (probably a sampling logic bug). Dropped it due to time constraints & prod risk.
- Auto-Research Setup: Tasked OpenCode to run a prompt sweep using local Gemma 4 and Qwen 3.5 models to break down PDFs into our required structured layout. Slight prompt-tuning pushed outputs close to Gemini’s quality & required format, but processing time (text prompt + images of all pages → JSON) was unreasonably slow for production.
- Final Decision: Hybrid architecture.
  - Gemini 3.1 Flash-Lite for heavy multimodal/PDF parsing steps
  - Qwen 3.6 35B for routing and text-only steps
  - Delivers production-grade quality at speeds I can actually stand behind.
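The hybrid split can be pictured as a tiny dispatch rule. This is a sketch under assumptions: the model labels come from the list above, but the `Step` type and the "has images" rule are hypothetical, not the production router.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One unit of pipeline work: a prompt plus optional page images."""
    name: str
    prompt: str
    images: list[bytes] = field(default_factory=list)


# Labels taken from the post; how they map to API clients is left out here.
MULTIMODAL_MODEL = "Gemini 3.1 Flash-Lite"
TEXT_MODEL = "Qwen 3.6 35B"


def pick_model(step: Step) -> str:
    """Send steps carrying page images to the multimodal model, the rest to the text model."""
    return MULTIMODAL_MODEL if step.images else TEXT_MODEL
```

For example, `pick_model(Step("parse_pdf", "Extract the questions", images=[page_bytes]))` selects the multimodal model, while a text-only routing step falls through to the text model.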
The app was officially pushed to prod on June 5th.
If you're an IGCSE student looking for a tool that actually respects your time as much as your marks, the waitlist is open (early access for the first 500 signups):
🔗 https://examintelligence.app/
Follow along if you want to see how this iterates next.
Already drafting the next deep dive: how I actually prevented hallucinations and forced local LLMs to follow strict constraints inside my Pi coding harness. But I’ll let you drive: should I break down that, or the developer bias disadvantage in prod?