Mohammed Ibrahim Khan

Posted on Jun 28

How I Built an AI Exam App in 8 Months to Outsource Studying

#python #ai #webdev #showdev

Eight months ago, a CS exam forced me to write pseudocode when I already knew how to code. Instead of studying, I rage-built an app.

Today examintelligence.app is live. Here’s exactly how I got here—from vibe-coded POCs to a production hybrid AI pipeline—without the curated startup gloss.

The Philosophy Behind the Build

I’ve always believed studying for marks ≠ actually learning.

When I was first introduced to organic chemistry, I hated it. Then I ran into GNNs in the book Machine Learning with PyTorch and Scikit-Learn, paired with the MoleculeNet dataset. Suddenly, everything clicked. I wanted to learn everything about it.

That’s the core problem: exams optimize for pattern recognition, not curiosity. You’re forced down one prescribed path, and it rm -rfs the fun of learning in most cases.

So one week before my first prelims, I decided to build Exam Intelligence.
The plan was simple: bring brutal efficiency to studying by using AI for what it’s actually built for, pattern recognition.

Parse every past paper, mark scheme, and examiner report. Distill it down to precisely what matters. Free up time for coding and creative work.

Vibe-Coding the POC (and Why It Collapsed)

I’m generally against vibe-coding. It’s unreliable, hard to maintain, and a security nightmare. But with prelims staring me in the face, I had no choice.

I opened Claude and vibe-coded it module by module. The only code review I had time for was checking for suspicious os.system or subprocess calls. That was it. I shipped anyway.

Initial stack:

Gemini API (no agent frameworks, no LangGraph)
Streamlit frontend
PostgreSQL

It validated my idea, but functionally it barely held together.

After prelims, I finally looked at what the AI had actually built:

A dashboard showing random stats
Code that asked Gemini for a JSON response with 5 keys, then saved only 2
DB tables created at random while trying to read subjects

The kind of code you end up with when you let an AI cook unsupervised for a week.
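The "5 keys requested, 2 saved" bug is the kind of thing a small check can surface before it reaches the database. A minimal sketch, assuming the model reply has already been parsed into a dict; `SAVED_FIELDS`, `split_for_storage`, and the field names are hypothetical and not taken from the app:

```python
# Hypothetical: the fields the storage layer actually persists.
SAVED_FIELDS = frozenset({"topic", "summary"})


def split_for_storage(reply: dict, saved_fields=SAVED_FIELDS):
    """Return (fields to persist, sorted names of fields that would be dropped)."""
    to_save = {key: value for key, value in reply.items() if key in saved_fields}
    dropped = sorted(set(reply) - set(saved_fields))
    return to_save, dropped


# Example reply with five keys, of which only two are persisted.
reply = {"topic": "Recursion", "summary": "...", "marks": 6, "year": 2023, "tips": []}
to_save, dropped = split_for_storage(reply)
print(dropped)  # ['marks', 'tips', 'year']
```

Logging or asserting on `dropped` makes silent data loss visible during review instead of after an exam.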

So I did the only reasonable thing: opened Neovim and rebuilt from scratch.

Stack shift:
Streamlit → Django.
Raw API calls → proper LangGraph workflow with auto-retries.
Spaghetti → something I could actually reason about.
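For readers who have not wired up retries before, the idea behind "auto-retries" can be sketched in plain Python. This is not LangGraph's API and not the app's code; `call_with_retries` and its parameters are hypothetical names for a generic retry-with-backoff wrapper around any flaky call:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # a real pipeline would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting.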

Hot take: letting a coding agent write your core architecture is a recipe for unmaintainable code. To build something real, the foundation has to be real.

My Agent Strategy (and How I Spent $0)

I now code the base architecture myself. Not by dumping decisions into a .md file and walking away, but by writing actual code in Neovim with minimal AI reliance, leaning on docs and classic debugging (and of course Stack Overflow).

Then I hand it to the agent with a real starting point.

Example workflow:

I build the login page
Agent takes it, turns it into signup
It modifies precisely instead of generating slop from scratch

Consistent patterns. Reviewable diffs. Code I can actually reason about.

Agents I use:

Precise edits → Claude web
Cloud models → OpenCode
Local models → Pi coding agents

And yes, I spent $0 on agents. Not by hopping free tiers, but by making every edit targeted and intentional. Tokens spent = value shipped.

This didn’t slow me down. Those 8 months weren’t lost to agent strategy. They went to experiments: fine-tuning, hybrid ML/LLM systems, and guaranteeing quality at a level I actually wanted to ship. The agent strategy just kept the codebase clean enough to run those experiments.

The Experimental Mess (Everything I Tried To Ensure Quality & Precise Notes)

Once the base app was stable, I went heads-down on the AI core. This is where things got messy.

I tried moving the processing pipeline from Gemini 2.5 Flash to local Llama weights. Spent a week benchmarking Python PDF-to-text parsers (PyMuPDF, pdfplumber). Results were great for raw text, but all it took was one complex structural diagram to break them.

Switched strategies to small multimodal models (Llama-3.2-3B-Instruct). Ran a sliding context window: passing page-by-page image inputs alongside cumulative state JSON from previous pages. Turns out, thinking like the GIL (one page at a time, strictly in sequence) was a mistake.
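The sliding-window loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the pipeline's code: `process_pages` and `run_model` are hypothetical names, and `run_model` stands in for whatever call sends one page image plus the state JSON to the model and returns a dict of updates.

```python
import json
from typing import Callable


def process_pages(
    pages: list[bytes],
    run_model: Callable[[bytes, str], dict],
) -> dict:
    """Feed pages one at a time, carrying cumulative state as JSON."""
    state: dict = {}
    for page_image in pages:
        # The model sees the current page and everything learned so far.
        update = run_model(page_image, json.dumps(state))
        state.update(update)
    return state
```

Each call depends on the previous page's result, so the loop cannot be parallelized, and an error on an early page carries forward into every later one.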

I burned my Lightning AI GPU budget on inference and Unsloth fine-tuning loops, which ran far slower than cloud endpoints. The output was pure garbage in, garbage out: the data pipeline had accidentally trained on unfiltered, broken v1 outputs.
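A filter in front of the training set is one way to avoid that failure. A minimal sketch, assuming each sample is a dict whose `"output"` field holds the model's JSON as a string; the sample layout, `is_clean`, `filter_samples`, and the required keys are all hypothetical:

```python
import json


def is_clean(sample: dict, required_keys: set[str]) -> bool:
    """Keep a sample only if its output is a JSON object with every required key."""
    try:
        parsed = json.loads(sample["output"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return False
    return isinstance(parsed, dict) and required_keys <= set(parsed)


def filter_samples(samples: list[dict], required_keys: set[str]) -> list[dict]:
    """Drop malformed or incomplete samples before they reach fine-tuning."""
    return [sample for sample in samples if is_clean(sample, required_keys)]
```

This only checks structure, not whether the notes are correct, so it catches broken outputs but not wrong ones.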

Pivoted back to Gemini. But over-engineering the UI engine (JS dynamically rendering complex JSON schemas) broke the model's JSON outputs. With prelims 2 approaching, I had to git restore and reset.

Production Push: The Hybrid Architecture That Actually Worked

After prelims 2, I did a complete fresh rebuild from the ground up. Same stack, but this time with production in mind. Fixed branding, color theme, typography, UX flows, and simplified the complex schema requirements for the notes UI.

Once the frontend was out of the way, I returned to the AI core:

Speculative Decoding: Paired Qwen 3.5 0.8B & 2B models. Both gave quality responses independently, but pairing them completely broke the outputs (probably a sampling logic bug). Dropped it due to time constraints & prod risk.
Auto-Research Setup: Tasked OpenCode to run a prompt sweep using local Gemma 4 and Qwen 3.5 models to break down PDFs into our required structured layout. Slight prompt-tuning pushed outputs close to Gemini’s quality & required format, but processing time (text prompt + images of all pages → JSON) was unreasonably slow for production.
Final Decision: Hybrid architecture.
- Gemini 3.1 Flash-Lite for heavy multimodal/PDF parsing steps
- Qwen 3.6 35B for routing and text-only steps

Delivers production-grade quality at speeds I can actually stand behind.
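The split can be pictured as a dispatch on what each step needs. The sketch below is a simplification under assumptions: in the setup above the Qwen model itself handles routing, whereas this uses a fixed rule, and `Step`, `pick_model`, and the model labels are hypothetical placeholders rather than verified API identifiers.

```python
from dataclasses import dataclass

# Labels only; the exact API model ids are not given in the post.
MULTIMODAL_MODEL = "gemini-3.1-flash-lite"
TEXT_MODEL = "qwen3.6:35b"


@dataclass
class Step:
    name: str
    has_images: bool = False
    has_pdf: bool = False


def pick_model(step: Step) -> str:
    """Send anything with images or PDFs to the multimodal model, the rest to the text model."""
    if step.has_images or step.has_pdf:
        return MULTIMODAL_MODEL
    return TEXT_MODEL


print(pick_model(Step("parse_past_paper", has_pdf=True)))  # gemini-3.1-flash-lite
print(pick_model(Step("rank_topics")))                     # qwen3.6:35b
```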

The app was officially pushed to prod on June 5th.

If you're an IGCSE student looking for a tool that actually respects your time as much as your marks, the waitlist is open (early access for the first 500 signups):

🔗 https://examintelligence.app/

Follow along if you want to see how this iterates next.

Top comments (3)

Nazar Boyko • Jun 29

I keep coming back to the bit where the agent turns your login page into the signup page, because it's a sharper rule than most posts about agents reach. Handing the agent an existing pattern to modify gives you a diff to review, while generating from scratch hands you a blank check you have to audit line by line. The quiet cost nobody names is that it only works once you've built the first of each pattern yourself, one real form and one list view, so the agent always has something to copy. That first instance is the tax, and it's also the part most worth doing by hand. Between your two ideas for the next post I'd read the one on forcing local models to follow constraints first, that's the harder problem.

Mohammed Ibrahim Khan • Jun 29

Exactly. That “pattern tax” is worth it for verifiable diffs over the technical debt of agents blindly laying foundations.
Keeping things as composable modules cuts rewrite cycles—just pull what works from previous projects and iterate.
Next I’ll break down how I swapped AGENTS.md for AGENTS.db to better steer qwen3.6:35b as the orchestrator in my Pi coding harness.
Appreciate the read.

Mohammed Ibrahim Khan • Jun 28

Already drafting the next deep dive: how I actually prevented hallucinations and forced local LLMs to follow strict constraints inside my Pi coding harness. But I’ll let you drive: should I break down that, or the developer-bias disadvantage in prod?