
soohan abbasi


I Built an Offline AI Career Advisor Using Gemma 4 — Here's Exactly How It Works

A technical walkthrough of GuidanceOS: from model loading to multi-agent orchestration, running entirely on a Kaggle T4 GPU with no internet at inference time.


I teach Computer Science. Over the years, one thing I kept seeing was students who had decent skills but no idea what to do with them. They didn't know what jobs matched their profile, what courses to take next, or how to position themselves for a career. Career guidance platforms exist, sure — but they're mostly behind paywalls, require accounts, and need a stable internet connection.

So I built GuidanceOS for the Gemma 4 Good Hackathon. The goal was simple: a fully offline AI system that takes your resume, figures out your skills, and gives you a complete career analysis — job matches, course recommendations, a 3-month learning plan, and an ATS score — all running locally on a GPU, no API calls at inference time.

Here's exactly how I built it.


The Model Choice: Why Gemma 4 e4b-it

The hackathon required using Gemma 4. Google released four variants: 2B, 4B (edge), 26B MoE, and 31B Dense. I went with gemma-4-e4b-it for a specific reason.

The "e" stands for edge-optimized. The "it" stands for instruction-tuned. On Kaggle's free T4 GPU (15GB VRAM), a naive load of even a 4B model can fail if quantization isn't handled right. With 4-bit NF4 quantization via BitsAndBytes, gemma-4-e4b-it loads in about 8.7GB — leaving headroom for inference.

One problem I ran into immediately: the stable release of Hugging Face Transformers (5.0.0 at the time) didn't recognize the gemma4 architecture. Loading the model threw:

ValueError: The checkpoint you are trying to load has model type `gemma4`
but Transformers does not recognize this architecture.

The fix was straightforward — install Transformers from the GitHub dev branch:

`pip install git+https://github.com/huggingface/transformers.git`

This bumped the version to 5.8.0.dev0, which includes the Gemma 4 model class.
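
A quick sanity check that the dev build took (MODEL_PATH is the local checkpoint path used in the loading code below):

import transformers
from transformers import AutoConfig

print(transformers.__version__)                  # 5.8.0.dev0
config = AutoConfig.from_pretrained(MODEL_PATH)
print(config.model_type)                         # 'gemma4', no more ValueError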

The second issue was GPU memory management. With device_map="auto", the loader tried to offload part of the model to CPU, which BitsAndBytes doesn't allow in 4-bit mode:

ValueError: Some modules are dispatched on the CPU or the disk.
Make sure you have enough GPU RAM to fit the quantized model.

Solution: pin everything to a single GPU.

import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,  # the NF4 config from above
    device_map="cuda:0",             # pin to a single GPU, no CPU offload
    dtype=torch.bfloat16,
)

After that, the model loaded cleanly in about 3 minutes and sat at 8.7GB on GPU 0.


The Knowledge Base: TF-IDF Over 130K Records

I used two datasets:

  • LinkedIn Job Postings — 123,849 jobs with title, description, skills, location, experience level, and salary
  • Coursera Courses 2024 — 6,645 courses with title, skills, description, level, rating, and URL

For job and course matching, I built a TF-IDF index over combined text fields. For jobs, I concatenated the job title, skills description, and the first 300 characters of the full description. For courses, I combined the title, skills tags, and description.

jobs_clean['combined_text'] = (
    jobs_clean['title'] + ' ' +
    jobs_clean['skills_desc'] + ' ' +
    jobs_clean['description'].str[:300]
)

Then I fit a TfidfVectorizer with bigrams and 10,000 features:

jobs_vectorizer = TfidfVectorizer(
    max_features=10000,
    stop_words='english',
    ngram_range=(1, 2)
)
jobs_tfidf_matrix = jobs_vectorizer.fit_transform(jobs_clean['combined_text'])

At query time, the user's skill string gets transformed by the same vectorizer and compared against the full matrix using cosine similarity. The top-k results come back in milliseconds — no GPU needed, no network call.
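
Concretely, the query path is just a transform plus a cosine similarity. A minimal sketch (search_jobs is my name for it here, not necessarily what's in the notebook):

from sklearn.metrics.pairwise import cosine_similarity

def search_jobs(skills_text, top_k=5):
    # project the query into the same TF-IDF space as the index
    query_vec = jobs_vectorizer.transform([skills_text])
    # cosine similarity against all 123K job vectors at once
    scores = cosine_similarity(query_vec, jobs_tfidf_matrix).flatten()
    # take the indices of the top-k highest scores
    top_idx = scores.argsort()[::-1][:top_k]
    return jobs_clean.iloc[top_idx], scores[top_idx]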

I chose TF-IDF over dense vector search (FAISS + sentence embeddings) deliberately. Dense search needs an embedding model at query time, which adds latency and memory. TF-IDF is deterministic, fast, and reproducible — important when the whole point is offline-first operation.


The Inference Helper

Before building agents, I needed a clean wrapper around Gemma 4's generation. The model uses a specific chat format:

def ask_gemma(prompt, max_tokens=300, temperature=0.7):
    formatted = f"<bos><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"

    inputs = tokenizer(
        formatted,
        return_tensors="pt",
        add_special_tokens=False
    ).to("cuda:0")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    input_len = inputs["input_ids"].shape[-1]
    response = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True)

    if "<end_of_turn>" in response:
        response = response.split("<end_of_turn>")[0]

    return response.strip()

A few things worth noting here:

add_special_tokens=False — because I'm manually prepending <bos> in the prompt string. If you let the tokenizer add it automatically as well, you get a duplicate BOS token which confuses the model.

repetition_penalty=1.3 — without this, the model loops. I found this out the hard way when my first test response was 200 repetitions of "matched matched matched".

Decoding only new tokens: outputs[0][input_len:] strips the input tokens from the output before decoding. Otherwise you get the full prompt echoed back before the response.


The Four Agents

Each agent is a focused prompt sent to ask_gemma. The agents run sequentially, not in parallel — this keeps memory usage flat and avoids context window issues.

Agent 1 — Skills Analyzer

Takes the raw resume text and returns a structured output in a fixed format:

TECHNICAL SKILLS: Python, NLP, LangChain, ...
SOFT SKILLS: Communication, Teaching, ...
EXPERIENCE: 5 years
LEVEL: mid
DOMAINS: Artificial Intelligence, NLP, Education

I enforce the format in the prompt rather than post-processing with regex. Gemma 4 follows structured output instructions reliably when you give it an exact template to fill.
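
The prompt is just the template plus a hard instruction to fill it. Roughly the shape I used (paraphrased, not the verbatim prompt):

SKILLS_PROMPT = """Analyze this resume and reply in EXACTLY this format,
with nothing before or after it:

TECHNICAL SKILLS: <comma-separated list>
SOFT SKILLS: <comma-separated list>
EXPERIENCE: <number> years
LEVEL: <junior | mid | senior>
DOMAINS: <comma-separated list>

Resume:
{resume_text}"""

skills_profile = ask_gemma(SKILLS_PROMPT.format(resume_text=resume_text))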

Agent 2 — Career Path Advisor

Takes the extracted skills string and returns three career paths with job titles, required additional skills, USD salary ranges, and a growth potential score out of 10.

Agent 3 — Learning Plan Designer

Takes the skills and target role and returns a 3-month plan broken down by month — foundation topics in month 1, intermediate topics in month 2, advanced topics and portfolio projects in month 3.

Agent 4 — Resume and ATS Analyst

Takes the resume text and target role and returns an ATS score out of 100, three strengths, three improvement areas, missing keywords, and a suggested rewrite for the professional summary.

The skills string extracted by Agent 1 is passed directly into Agents 2 and 3, creating a lightweight chain without needing LangChain or CrewAI overhead.
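
In code, the chain is nothing more than feeding one agent's string output into the next prompt. A sketch of the flow (the agent_* and search_* names are stand-ins for the actual notebook functions):

def run_pipeline(resume_text, target_role):
    # Agent 1: raw resume -> structured skills profile
    skills = agent_skills(resume_text)

    # retrieval runs on the extracted skills string, not the raw resume
    jobs = search_jobs(skills)
    courses = search_courses(skills)

    # Agents 2-4 reuse the skills string and/or the original inputs
    careers = agent_careers(skills)
    plan = agent_learning_plan(skills, target_role)
    ats = agent_ats(resume_text, target_role)

    return skills, jobs, courses, careers, plan, ats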


The Gradio Interface

I used Gradio instead of Streamlit for one reason: on Kaggle, app.launch(share=True) generates a public share URL in a single line. No tunnel setup, no separate process.

The interface has two inputs — resume text and target role — and six output tabs, one per agent plus job matches and course recommendations.

with gr.Blocks(title="GuidanceOS") as app:
    with gr.Row():
        with gr.Column(scale=1):
            resume_input = gr.Textbox(label="Resume Text", lines=14)
            role_input   = gr.Textbox(label="Target Role")
            submit_btn   = gr.Button("Analyze My Profile", variant="primary")
        with gr.Column(scale=2):
            with gr.Tab("Skills Analysis"):
                skills_out = gr.Textbox(lines=10)
            # ... five more tabs

app.launch(share=True)

I added gr.Progress() to the main function so the UI shows which agent is running instead of just freezing. Each agent call takes 30-90 seconds on T4 — the progress bar makes it feel responsive.
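
Wiring that up is a one-argument change, since Gradio injects the tracker when you declare it as a default parameter. A sketch:

def analyze(resume_text, target_role, progress=gr.Progress()):
    progress(0.1, desc="Agent 1: extracting skills...")
    skills = agent_skills(resume_text)

    progress(0.4, desc="Matching jobs and courses...")
    jobs, courses = search_jobs(skills), search_courses(skills)

    progress(0.6, desc="Agent 2: career paths...")
    careers = agent_careers(skills)
    # ... Agents 3 and 4, then return all six outputs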


End-to-End Flow

When a user clicks Analyze:

  1. Resume text → Agent 1 → structured skills profile
  2. Skills string → TF-IDF search → top 5 jobs from 123K LinkedIn postings
  3. Skills string → TF-IDF search → top 5 courses from 6.6K Coursera courses
  4. Skills string → Agent 2 → three career paths with salaries
  5. Skills string + target role → Agent 3 → 3-month learning roadmap
  6. Resume text + target role → Agent 4 → ATS score and improvements
  7. All outputs → six Gradio tabs

Total time: 3-5 minutes on a T4 GPU. All computation on-device. Zero external API calls.


What I Would Do Differently

A few things I'd change with more time:

Structured JSON output from agents. Right now the agents return free-form text. Enforcing JSON output would make the results easier to display in a proper UI — cards instead of plain text boxes.
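
The change would be mostly in the prompt plus a defensive parse. Something like this sketch (the schema and fence-stripping regex are illustrative, not from the current code):

import json
import re

JSON_PROMPT = """Return ONLY valid JSON matching this schema, no prose:
{{"technical_skills": [], "soft_skills": [], "experience_years": 0, "level": "", "domains": []}}

Resume:
{resume_text}"""

raw = ask_gemma(JSON_PROMPT.format(resume_text=resume_text), temperature=0.2)
# small models sometimes wrap JSON in markdown fences; strip them before parsing
profile = json.loads(re.sub(r"```(?:json)?", "", raw).strip())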

FAISS for course search. TF-IDF misses semantic similarity — "data analysis" and "analytics" are treated as different terms. Sentence embeddings with FAISS would improve course matching quality significantly.
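
Roughly what that would look like, assuming a small sentence-embedding model like all-MiniLM-L6-v2 and the combined course text from earlier (courses_clean is hypothetical naming):

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# embed once at build time; normalized vectors make inner product equal cosine similarity
course_vecs = embedder.encode(
    courses_clean['combined_text'].tolist(), normalize_embeddings=True
).astype(np.float32)
index = faiss.IndexFlatIP(course_vecs.shape[1])
index.add(course_vecs)

# at query time, "data analysis" now lands near "analytics" courses
q = embedder.encode(["data analysis"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(q, 5)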

Session persistence with SQLite. The current setup doesn't remember previous conversations. Adding a lightweight SQLite store would let users build on previous sessions.

SHAP explainability. I had planned to add a SHAP chart showing which skills drove each job recommendation using a Random Forest trained on the jobs dataset. It didn't make the deadline but the data pipeline supports it cleanly.


Running It Yourself

The full notebook is on Kaggle:
kaggle.com/code/abbasi110/guidanceos-gemma4-offline-career-advisor

Source code on GitHub:
github.com/soohanAbbasi/GuidanceOS

You need a Kaggle account to run it. Add the gemma-4-e4b-it model and both datasets, set the accelerator to GPU T4 x2, and run all cells in order. The Gradio URL prints in the last cell.


That's the full build. If you have questions about any part of it — the quantization setup, the prompt templates, or the TF-IDF indexing — leave a comment and I'll answer.
