Soham

Posted on May 24

What If Your Portfolio Verifier Could Actually See Your UI?

#ai #devchallenge #gemma #gemmachallenge

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Cairn — Turn "I Want to Be an AI Engineer" Into a Verified Portfolio

What I Built

Every year, millions of people type something like "I want to become an AI engineer" into Google and then spend the next two weeks drowning in tabs. Roadmaps that don't know where you're starting from. YouTube playlists with no structure. Free courses with 13% completion rates. ChatGPT roadmaps that vanish the moment you close the tab.

I built Cairn because that gap is real, and a 100x-cheaper alternative to a coding bootcamp genuinely doesn't exist yet.

Cairn is a personalized AI learning and career engine. You tell it your goal in plain English — "I want to land an AI engineering role in 6 months, I know Python basics and built one Flask app" — and it builds you a structured 12-week path: phases, weekly milestones with actual deliverables, curated free resources scored by real learner outcomes, and projects to build. Not a static document. A living system that adapts as you go.

But the part I'm most proud of is what happens when you ship something.

You submit your GitHub repo. Cairn pulls the code, runs it through a three-stage evaluation pipeline — structural checks, LLM-powered code review, and a multimodal visual review where Gemma 4 12B literally looks at screenshots of your running app — and if you pass, it issues a cryptographically signed credential that goes on your public portfolio at cairn.dev/u/your-handle. That's the URL you put on your resume. Not "completed Course X." Verified work.
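The credential format isn't shown in the post. Here is a minimal sketch of HMAC signing and verification using Node's built-in `node:crypto` module, assuming a JSON-like payload and a server-held secret. The `Credential` fields and the function names are hypothetical, not Cairn's actual schema.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical credential payload; the real fields are not documented in the post.
interface Credential {
  handle: string;
  projectRepo: string;
  score: number;
  issuedAt: string; // ISO 8601 timestamp
}

// Serialize in a fixed order so the same credential always yields the same bytes.
function canonicalize(c: Credential): string {
  return JSON.stringify([c.handle, c.projectRepo, c.score, c.issuedAt]);
}

// Produce a hex-encoded HMAC-SHA256 tag over the canonical payload.
function signCredential(c: Credential, secret: string): string {
  return createHmac("sha256", secret).update(canonicalize(c)).digest("hex");
}

// Recompute the tag and compare in constant time to avoid leaking timing information.
function verifyCredential(c: Credential, signature: string, secret: string): boolean {
  const expected = Buffer.from(signCredential(c, secret), "hex");
  const given = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so check length first.
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```

Because HMAC uses a shared secret, only the server holding it can check a credential. A third party who wants to verify a portfolio offline would need an asymmetric signature (for example Ed25519) instead.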

The three things every learning product gets wrong:

| Problem | How Cairn addresses it |
|---|---|
| Path paralysis: "what do I even learn?" | Personalized 12-week path generated from your actual starting point, not a generic template |
| No accountability: you ghost your own goals | Daily nudges, streak tracking, adaptive replanning when life happens |
| No proof-of-work: tutorial clones don't get interviews | Multi-stage project verification with HMAC-signed credentials on a public portfolio |
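Streak tracking is listed but not specified. One simple way to compute a current streak is sketched below. It assumes activity is stored as UTC calendar dates in `YYYY-MM-DD` form, which the post doesn't confirm, and `currentStreak` is a hypothetical helper.

```typescript
// Count consecutive active days ending today, or ending yesterday, so a
// streak isn't shown as broken before the user has logged today's work.
function currentStreak(activeDays: string[], today: string): number {
  const days = new Set(activeDays);
  const DAY_MS = 24 * 60 * 60 * 1000;
  const toKey = (ms: number) => new Date(ms).toISOString().slice(0, 10);

  let cursor = Date.parse(`${today}T00:00:00Z`);
  // If there is no activity today yet, start counting from yesterday.
  if (!days.has(toKey(cursor))) cursor -= DAY_MS;

  let streak = 0;
  while (days.has(toKey(cursor))) {
    streak += 1;
    cursor -= DAY_MS;
  }
  return streak;
}
```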

The product was designed for Indian engineering students and career-switchers first — a market of 10M+ active learners who are priced out of bootcamps (₹3 lakh+) but get zero structure from free resources. But the same problem exists globally, and the architecture reflects that.

Demo

🎬 5-minute walkthrough

🌐 Live app (https://cairnapp.netlify.app/)

👤 Example public portfolio — no signup needed https://cairnapp.netlify.app/example

The example portfolio is the fastest way to see what a verified Cairn profile looks like — the thing a user puts on their resume after 12 weeks.

Code

🔗 GitHub repository (https://github.com/Soham-0047/cairn)

The repo is a full-stack TypeScript monorepo:

frontend/    → Next.js 15 (App Router, SSR, Tailwind)
backend/     → Express + TypeScript + MongoDB + Mongoose
              └── llm/   → provider-agnostic router with fallback chains

Everything brand-visible — the name, logo, colors, hero copy, CTA text, LLM provider chains — is editable at runtime via /admin without touching code. That's not an accident; it's what makes the codebase reusable for the next hackathon without a rewrite.
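The post doesn't show how `/admin` edits reach the running app. Here is a minimal sketch of one common pattern, assuming brand settings are stored as a database document that overrides defaults shipped in code. The `BrandConfig` shape, the placeholder values, and `resolveBrand` are all hypothetical.

```typescript
// Hypothetical brand settings; the real field list isn't shown in the post.
interface BrandConfig {
  name: string;
  primaryColor: string;
  heroCopy: string;
  ctaText: string;
}

// Code-level defaults (placeholder values), used when no admin override exists.
const DEFAULT_BRAND: BrandConfig = {
  name: "Cairn",
  primaryColor: "#1f2937",
  heroCopy: "Turn a goal into a verified portfolio.",
  ctaText: "Get started",
};

// Merge admin overrides (e.g. loaded from MongoDB) over the defaults,
// ignoring blank strings so an empty admin field doesn't erase the brand.
function resolveBrand(overrides: Partial<BrandConfig>): BrandConfig {
  const merged: BrandConfig = { ...DEFAULT_BRAND };
  for (const key of Object.keys(overrides) as (keyof BrandConfig)[]) {
    const value = overrides[key];
    if (typeof value === "string" && value.trim() !== "") merged[key] = value;
  }
  return merged;
}
```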

How I Used Gemma 4

This is the part I want to explain carefully, because I think the interesting choice isn't which Gemma 4 model I used — it's why I used three different ones for three different jobs.

The problem with "just use the biggest model"

Path generation requires loading 50+ curated resources, similar past learners' completed paths, and a user's full profile into a single prompt — then reasoning over all of it to produce a coherent 12-week plan. That needs a large model with a long context window.

Parsing a goal statement from a text box (extracting structured fields like current_skills, target_role, timeline_weeks) is a small, latency-sensitive extraction task. Using the 27B model for that would be wasteful and slow.

Reviewing a screenshot of someone's running app to check whether the UI matches what the code claims? That's not a text problem at all.

Three different shapes of job. Three different models.

Model 1 — Gemma 4 4B: Goal parsing at onboarding

When a user types their goal in plain English, that text goes to Gemma 4 4B for structured extraction.

Why 4B? It runs in ~600ms on Google AI Studio's free tier. This call happens multiple times per onboarding session as users refine their input. A 27B model here would feel sluggish and blow through rate limits before the user even starts their path. The task is bounded: extract a handful of fields from a short paragraph. 4B handles it cleanly, fast, and free.

// backend/src/llm/router.ts
{
  task: "parse_goal",
  primary: "gemma-4-4b",  // fast + cheap; perfect for extraction
  fallback: ["gemini-flash"]
}
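Small models sometimes return malformed or partial JSON, so it helps to validate the extraction before trusting it. The following is a minimal sketch, assuming the 4B model is asked to return a JSON object with the three fields named above. The `ParsedGoal` type and the helper functions are hypothetical.

```typescript
// Hypothetical shape of the 4B model's extraction output.
interface ParsedGoal {
  current_skills: string[];
  target_role: string;
  timeline_weeks: number;
}

// Strip an optional Markdown code fence that models often wrap around JSON.
function stripFence(raw: string): string {
  return raw.replace(/^`{3}(?:json)?\s*|\s*`{3}$/g, "").trim();
}

// Validate the raw model output; return null so the caller can retry or
// move on to the next model in the route's fallback list.
function parseGoalResponse(raw: string): ParsedGoal | null {
  let data: unknown = null;
  try {
    data = JSON.parse(stripFence(raw));
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const d = data as Record<string, unknown>;
  const skills = d.current_skills;
  const role = d.target_role;
  const weeks = d.timeline_weeks;
  if (
    !Array.isArray(skills) ||
    !skills.every((s) => typeof s === "string") ||
    typeof role !== "string" ||
    typeof weeks !== "number" ||
    !Number.isInteger(weeks) ||
    weeks <= 0
  ) {
    return null;
  }
  return { current_skills: skills, target_role: role, timeline_weeks: weeks };
}
```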

Model 2 — Gemma 4 27B: Path generation and code review

The heavy lifting goes to Gemma 4 27B via Google AI Studio and OpenRouter's free Gemma 4 endpoint.

Path generation loads the user's structured profile, a subset of the curated resource corpus (~50 items matched to the user's target role), and summarized outcome data from similar past learners — all in a single prompt, well inside the 128K context window. The model has to reason about dependencies between topics, realistic hours-per-week, and what kinds of projects actually demonstrate the skills required for the target role. This is not a task for a small model.
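Keeping the prompt to ~50 role-matched resources is essentially a filter-and-rank step. Here is a minimal sketch, assuming each corpus item carries role tags and a learner-outcome score between 0 and 1. The `Resource` type and `selectResources` are hypothetical, since the post doesn't describe the scoring model.

```typescript
interface Resource {
  id: string;
  title: string;
  roles: string[];      // roles this resource is tagged for, e.g. "ai-engineer"
  outcomeScore: number; // hypothetical 0..1 score derived from learner outcomes
}

// Keep resources tagged for the target role, best outcomes first, capped at `limit`.
// filter() returns a new array, so sorting it does not mutate the corpus.
function selectResources(corpus: Resource[], targetRole: string, limit = 50): Resource[] {
  return corpus
    .filter((r) => r.roles.includes(targetRole))
    .sort((a, b) => b.outcomeScore - a.outcomeScore)
    .slice(0, limit);
}
```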

The same model handles code review at project evaluation Stage 2: reading a multi-file repo snapshot and answering whether the code does what the README claims, whether it demonstrates the skills the user said they practiced, and where specifically it falls short. Multi-file code reasoning with long context is exactly where 27B earns its place.

SYSTEM: You are an expert career coach generating a personalized learning path.
You have access to {N} similar-profile learners' actual completed paths and outcomes.
Prefer concrete projects over passive content...

[resource corpus subset]
[similar past learner paths with outcomes]
[user's profile + target role + weekly hours]

OUTPUT: JSON path schema only, no preamble.
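Because the prompt asks for the JSON path schema only, the backend still has to check that the model returned a usable 12-week plan. Here is a minimal sketch, assuming a schema of phases that contain weekly milestones with deliverables, matching the structure described earlier in the post. The type names and `isValidPath` are hypothetical, and the sketch assumes the JSON has already been parsed into this shape.

```typescript
interface Milestone {
  week: number;          // 1-based week index
  deliverable: string;   // concrete output expected that week
  resourceIds: string[]; // references into the curated corpus
}

interface Phase {
  title: string;
  milestones: Milestone[];
}

interface LearningPath {
  phases: Phase[];
}

// A path is usable if every week 1..totalWeeks appears exactly once,
// each with a non-empty deliverable.
function isValidPath(path: LearningPath, totalWeeks = 12): boolean {
  const seen = new Set<number>();
  for (const phase of path.phases) {
    for (const m of phase.milestones) {
      if (!Number.isInteger(m.week) || m.week < 1 || m.week > totalWeeks) return false;
      if (seen.has(m.week)) return false; // duplicate week
      if (m.deliverable.trim() === "") return false;
      seen.add(m.week);
    }
  }
  return seen.size === totalWeeks;
}
```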

Model 3 — Gemma 4 12B Vision: The hero feature

This is the one I'm most excited about.

When a user submits their project, they can upload up to 4 screenshots of their running application alongside the GitHub URL. Gemma 4 12B — the vision-capable variant — looks at the actual UI and cross-references it against what the code and README claim.

This catches two failure modes that pure code review misses:

"Looks polished but the code is sloppy" — a nice-looking template wrapped around someone else's logic
"Code is technically fine but the UI is a placeholder" — the README claims a working app; the screenshot shows a 404

No other evaluation approach I know of does this. It's not just "did the tests pass." It's "show me what you actually shipped."
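Before any model runs, the submission itself (a GitHub URL plus up to 4 screenshots) needs basic validation. Here is a minimal sketch, assuming screenshots arrive as an array of URLs. The `validateSubmission` name and the accepted repo-URL pattern are hypothetical.

```typescript
interface Submission {
  repoUrl: string;
  screenshots: string[];
}

const MAX_SCREENSHOTS = 4; // limit stated in the post

// WHATWG URL parsing; returns null instead of throwing on invalid input.
function tryParseUrl(raw: string): URL | null {
  try {
    return new URL(raw);
  } catch {
    return null;
  }
}

// Return a list of problems; an empty list means the submission can be evaluated.
function validateSubmission(s: Submission): string[] {
  const errors: string[] = [];
  const url = tryParseUrl(s.repoUrl);
  if (url === null) {
    errors.push("repoUrl is not a valid URL");
  } else if (
    url.protocol !== "https:" ||
    url.hostname !== "github.com" ||
    !/^\/[\w.-]+\/[\w.-]+\/?$/.test(url.pathname) // expect /<owner>/<repo>
  ) {
    errors.push("repoUrl must look like https://github.com/<owner>/<repo>");
  }
  if (s.screenshots.length > MAX_SCREENSHOTS) {
    errors.push(`at most ${MAX_SCREENSHOTS} screenshots are allowed`);
  }
  return errors;
}
```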

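// eval.service.ts: Stage 3
// llmRouter, visualEvalPrompt, repo, readme and screenshots are defined elsewhere in the service.
const visualReview = await llmRouter.call({
  task: "visual_eval",
  model: "gemma-4-12b-vision",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: visualEvalPrompt(repo, readme) },
        // In the OpenAI-compatible chat format, image_url is an object wrapping
        // the image URL (or a base64 data URL), not a bare string.
        ...screenshots.map((s) => ({ type: "image_url", image_url: { url: s } })),
      ],
    },
  ],
});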

The evaluation page shows the user exactly which provider and model ran each stage. That transparency is intentional — the whole model selection story is visible to anyone who submits a project, not just buried in the docs.

The full pipeline

User submits GitHub repo + screenshots
         │
         ▼
Stage 1: Structural (deterministic)
  • README present?
  • Commit count + history
  • Tests exist?
  • File tree size reasonable?
         │
         ▼
Stage 2: Code review  →  Gemma 4 27B
  • Does code match README claims?
  • Originality vs known tutorial repos
  • Domain-specific checks (ML training loops? Backend auth?)
         │
         ▼
Stage 3: Visual review  →  Gemma 4 12B (vision)
  • Does the UI match what the code claims?
  • Polish level: shipped / demo / prototype?
  • Per-screenshot findings
         │
         ▼
Stage 4: Synthesis
  • Pass requires weighted score ≥ 0.65 and originality ≥ 0.55
  • If passing: HMAC-signed credential → public portfolio
  • If failing: specific, actionable feedback
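Stage 4's pass rule can be written down directly. The two thresholds below come from the diagram. The stage weights are not published in the post, so the values here are placeholders, and renormalizing when Stage 3 is skipped is an assumption of this sketch.

```typescript
interface StageScores {
  structural: number;    // 0..1
  codeReview: number;    // 0..1
  visual: number | null; // null when Stage 3 was skipped
  originality: number;   // 0..1
}

// Placeholder weights; the real weighting is not documented.
const WEIGHTS = { structural: 0.2, codeReview: 0.5, visual: 0.3 };

const PASS_SCORE = 0.65;      // from the pipeline diagram
const MIN_ORIGINALITY = 0.55; // from the pipeline diagram

function synthesize(s: StageScores): { score: number; passed: boolean } {
  // Collect [weight, value] pairs; drop the visual stage if it was skipped
  // and renormalize so the score stays on a 0..1 scale.
  const parts: [number, number][] = [
    [WEIGHTS.structural, s.structural],
    [WEIGHTS.codeReview, s.codeReview],
  ];
  if (s.visual !== null) parts.push([WEIGHTS.visual, s.visual]);
  const totalWeight = parts.reduce((sum, [w]) => sum + w, 0);
  const score = parts.reduce((sum, [w, v]) => sum + w * v, 0) / totalWeight;
  return { score, passed: score >= PASS_SCORE && s.originality >= MIN_ORIGINALITY };
}
```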

Provider routing + fallback

The LLM router is a ~300-line module that tracks rate-limit headroom per provider in Redis and automatically falls back when free tiers are exhausted.

Google AI Studio  →  OpenRouter (Gemma 4 free)  →  Gemini 2.5 Pro  →  DeepSeek V3

No hard failure reaches the user, and there are no surprise bills: each route has a monthly cost ceiling, and if every free tier is exhausted, the router refuses the call and shows a graceful degradation message rather than running up charges on a paid endpoint without limit.
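The routing behaviour described here fits in a few lines: walk the chain, skip providers without headroom, and refuse rather than overspend. Here is a minimal in-memory sketch. The post says headroom is tracked in Redis, which this sketch replaces with a plain `Map` to stay self-contained. The provider names, per-call costs, and `pickProvider` are hypothetical.

```typescript
interface Provider {
  name: string;
  costPerCallUsd: number; // 0 for free tiers
}

interface RouteState {
  remaining: Map<string, number>; // requests left in each provider's current window
  spentUsd: number;               // spend so far this month on this route
  ceilingUsd: number;             // monthly cost ceiling for this route
}

// Return the first provider in the chain that has headroom and fits under
// the cost ceiling, or null so the caller can show a degradation message.
function pickProvider(chain: Provider[], state: RouteState): Provider | null {
  for (const p of chain) {
    const left = state.remaining.get(p.name) ?? 0;
    if (left <= 0) continue; // rate limit exhausted: try the next provider
    if (state.spentUsd + p.costPerCallUsd > state.ceilingUsd) continue; // would exceed the ceiling
    state.remaining.set(p.name, left - 1); // reserve one request
    state.spentUsd += p.costPerCallUsd;
    return p;
  }
  return null;
}
```

A production version would also refresh `remaining` from the providers' rate-limit responses instead of only counting down locally.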

The entire routing table is editable at runtime via /admin/providers. Swapping Gemma 4 out for any other model family is a UI change, not a code change. I built it that way deliberately.

Why this is a Gemma 4 submission and not a "Gemini with Gemma branding" submission

All three Gemma variants are running on Google AI Studio's free tier as the primary provider. The fallback for 27B tasks is OpenRouter's free google/gemma-4-27b endpoint — not Gemini. The vision eval has no non-Gemma fallback; if Gemma 4 12B is unavailable, Stage 3 is skipped and the evaluation is flagged as "code-only review." The multimodal story only works with Gemma 4.
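Because Stage 3 has no non-Gemma fallback, its failure path has to be explicit. Here is a minimal sketch, assuming the visual reviewer is passed in as an async function that stands in for the `llmRouter.call` shown earlier. The `runVisualStage` name and the result shape are hypothetical, and so is skipping when no screenshots were uploaded (screenshots are optional per the post). The `"code-only review"` flag mirrors the post's wording.

```typescript
interface VisualResult {
  mode: "full review" | "code-only review";
  findings: string[];   // per-screenshot findings; empty when skipped
  model: string | null; // recorded so the evaluation page can show who ran the stage
}

type VisualReviewer = (screenshots: string[]) => Promise<string[]>;

// Run Stage 3 with Gemma 4 12B only. If there are no screenshots or the call
// fails, skip the stage and flag the evaluation instead of switching models.
async function runVisualStage(
  screenshots: string[],
  review: VisualReviewer,
  model = "gemma-4-12b-vision",
): Promise<VisualResult> {
  if (screenshots.length === 0) {
    return { mode: "code-only review", findings: [], model: null };
  }
  try {
    const findings = await review(screenshots);
    return { mode: "full review", findings, model };
  } catch {
    return { mode: "code-only review", findings: [], model: null };
  }
}
```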

Tech Stack

Frontend: Next.js 15 (App Router, SSR), React 18, TypeScript, Tailwind CSS, NextAuth
Backend: Node.js 20, Express, TypeScript, Mongoose, MongoDB Atlas
LLM routing: Custom provider-agnostic router — Google AI Studio, OpenRouter, Groq, Cerebras, Together AI
Storage: MongoDB Atlas (paths/progress), Cloudflare R2 (screenshots)
Auth: GitHub OAuth via NextAuth
Payments: Razorpay (India UPI + cards), Stripe (global)

Built solo, in public, for the people who are trying to get their first AI engineering role and can't afford to spend ₹3 lakh on a bootcamp. If that's you — or if you know someone it describes — Cairn is built for exactly that.

Top comments (1)

Harjot Singh • May 31

The "verified" in verified-portfolio is doing all the heavy lifting, and it's the right thing to obsess over: roadmaps and course-completion certs are worthless to employers because they verify attendance, not capability. Anyone can finish a tutorial; the question is whether you actually built the thing and whether it works. Giving the verifier eyes to see the UI is a genuinely better signal than "tests pass" or "repo exists," because a screenshot of a working interface is much harder to fake than a green checkmark and is closer to evidence of the real outcome. The trap to guard against: a vision model can be fooled by a UI that looks done but is hollow (static mockup, hardcoded data, dead buttons). So the strongest verification pairs what-it-looks-like with what-it-does: see the UI, then probe that it actually responds. Verify the outcome, not the appearance of the outcome. That distinction is central to how I think about verification in Moonshift. Does Cairn just analyze a screenshot, or does it interact with the running app to confirm the UI isn't a facade?