<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: soohan abbasi</title>
    <description>The latest articles on DEV Community by soohan abbasi (@soohan_abbasi).</description>
    <link>https://dev.to/soohan_abbasi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3928507%2F0c2e7853-8258-4f70-ae6c-5915eeba921a.png</url>
      <title>DEV Community: soohan abbasi</title>
      <link>https://dev.to/soohan_abbasi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/soohan_abbasi"/>
    <language>en</language>
    <item>
      <title>Chain-of-Thought and Beyond: How LLMs Actually Learn to Reason</title>
      <dc:creator>soohan abbasi</dc:creator>
      <pubDate>Sat, 16 May 2026 12:30:00 +0000</pubDate>
      <link>https://dev.to/soohan_abbasi/chain-of-thought-and-beyond-how-llms-actually-learn-to-reason-3kk4</link>
      <guid>https://dev.to/soohan_abbasi/chain-of-thought-and-beyond-how-llms-actually-learn-to-reason-3kk4</guid>
      <description>&lt;p&gt;&lt;em&gt;"The ability to reason step-by-step is not just a feature. It might be the difference between a language model that sounds intelligent and one that actually is."&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction: When AI Started Thinking
&lt;/h2&gt;

&lt;p&gt;In 2022, researchers at Google Brain published a paper titled &lt;strong&gt;"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"&lt;/strong&gt;. At the time, nobody quite anticipated it would mark the beginning of a shift that would reshape the entire AI field.&lt;/p&gt;

&lt;p&gt;The idea was simple: instead of asking a model to answer directly, give it time to think. Ask it to write out intermediate steps. Accuracy improves dramatically.&lt;/p&gt;

&lt;p&gt;That paper now sits at over 10,000 citations. But the question it raised has never been fully answered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do LLMs actually think? Or do they create a very convincing illusion of thinking?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is what this blog is about. And as someone preparing for a PhD in AI, it is a question I keep coming back to.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: What Is Chain-of-Thought?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Standard Prompting vs. CoT Prompting
&lt;/h3&gt;

&lt;p&gt;Imagine asking a model this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many does he have now?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With standard prompting, the model jumps straight to: "11"&lt;/p&gt;

&lt;p&gt;With chain-of-thought prompting, the model works through it first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Roger starts with 5 balls.
2 cans × 3 balls = 6 balls.
5 + 6 = 11 balls.
Answer: 11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both get the same answer. So what is the point?&lt;/p&gt;

&lt;p&gt;The gap shows up on harder problems. Models that reason through steps outperform those that answer directly on multi-step math, symbolic reasoning, and commonsense problems. The more complex the task, the bigger the difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zero-Shot CoT: One Phrase Changes Everything
&lt;/h3&gt;

&lt;p&gt;In the same year, researchers discovered something even more surprising. Simply adding the phrase &lt;strong&gt;"Let's think step by step"&lt;/strong&gt; to a question, without any examples, significantly improved reasoning accuracy.&lt;/p&gt;

&lt;p&gt;No demonstrations. No fine-tuning. Just those five words.&lt;/p&gt;

&lt;p&gt;This became known as zero-shot CoT. And the obvious follow-up question is: why does this even work?&lt;/p&gt;
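&lt;p&gt;Mechanically, zero-shot CoT is just a prompt transformation plus an answer-extraction step. A minimal sketch (the helper names and the regex heuristic are mine, not from the paper):&lt;/p&gt;

```python
import re

def make_zero_shot_cot(question):
    # Append the trigger phrase from Kojima et al. (2022).
    return question + "\nLet's think step by step."

def extract_final_number(completion):
    # Take the last number in the completion as the answer,
    # a common (if crude) extraction heuristic for GSM8K-style tasks.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

prompt = make_zero_shot_cot("Roger has 5 tennis balls. He buys 2 cans of 3. How many now?")
completion = "Roger starts with 5. 2 cans of 3 is 6 more. 5 + 6 = 11. Answer: 11"
print(extract_final_number(completion))
```

&lt;p&gt;Extraction like this is crude but standard for GSM8K-style benchmarks: only the final number is graded, not the intermediate steps.&lt;/p&gt;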




&lt;h2&gt;
  
  
  My Own Experiment: Testing CoT on GSM8K
&lt;/h2&gt;

&lt;p&gt;Before going deeper into the theory, I wanted to test this myself. So I ran a small experiment using an open-source model on a standard benchmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; Qwen 2.5 1.5B Instruct (free, runs on Kaggle GPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; GSM8K (grade school math problems)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test:&lt;/strong&gt; Standard prompting vs. "Let's think step by step"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample:&lt;/strong&gt; 10 problems&lt;/li&gt;
&lt;/ul&gt;
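&lt;p&gt;The comparison itself boils down to one loop. Here is a sketch of such a harness, with a stand-in &lt;code&gt;generate&lt;/code&gt; callable since the real model call depends on your setup (all names here are illustrative, not my exact experiment code):&lt;/p&gt;

```python
import re

def extract_answer(text):
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def run_eval(problems, generate, use_cot):
    # problems: list of (question, gold_answer) pairs.
    # generate: any callable mapping a prompt string to a completion string.
    correct = 0
    for question, gold in problems:
        prompt = question + ("\nLet's think step by step." if use_cot else "\nAnswer:")
        if extract_answer(generate(prompt)) == str(gold):
            correct += 1
    return correct / len(problems)

# Stand-in "model" for illustration only: it always answers 11.
fake_generate = lambda prompt: "The answer is 11."
problems = [("Roger has 5 balls and buys 2 cans of 3. Total?", 11)]
print(run_eval(problems, fake_generate, use_cot=True))
```

&lt;p&gt;Swapping &lt;code&gt;fake_generate&lt;/code&gt; for a real model call is the only change needed to run the with-CoT and without-CoT conditions on actual GSM8K items.&lt;/p&gt;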

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Correct&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Without CoT&lt;/td&gt;
&lt;td&gt;2/10&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;With CoT&lt;/td&gt;
&lt;td&gt;3/10&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz4somay658v8xjhnbld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz4somay658v8xjhnbld.png" alt="CoT vs No-CoT Results on GSM8K" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even on a model roughly 360 times smaller than the one used in the original paper, the improvement showed up. A single phrase shifted accuracy by 10 percentage points.&lt;/p&gt;

&lt;p&gt;A few things stood out from the per-problem breakdown:&lt;/p&gt;

&lt;p&gt;Problem 1 was solved correctly with CoT, but not without it. Problem 7 showed the same pattern. Problem 4 was solved correctly either way. But Problem 6 was actually solved correctly without CoT and incorrectly with it. The model overthought a straightforward calculation and got it wrong.&lt;/p&gt;

&lt;p&gt;That last observation matters and connects to something I discuss in Part 4.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Quick note: the overall accuracy numbers look low because this model is tiny compared to what the original paper used. The point here is the relative difference, not the absolute numbers.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 2: What Is Actually Happening Inside the Model?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  More Than Pattern Matching
&lt;/h3&gt;

&lt;p&gt;The common criticism of LLMs is that they are sophisticated autocomplete. They match patterns from training data rather than genuinely reasoning. This criticism is not entirely wrong, but it is incomplete.&lt;/p&gt;

&lt;p&gt;Between 2023 and 2024, researchers doing mechanistic interpretability work found some interesting things inside these models.&lt;/p&gt;

&lt;p&gt;LLMs contain specific reasoning circuits: groups of neurons and attention heads that work together to perform logical operations. They use something called induction heads, which are attention patterns that identify sequences in context and predict what follows. Some models have developed implicit world models, meaning they internally represent concepts like spatial relationships, time, and causality.&lt;/p&gt;

&lt;p&gt;None of this was explicitly programmed. It emerged from training on text.&lt;/p&gt;

&lt;p&gt;The picture that comes out of this research is more interesting than "just pattern matching." These models have developed internal structures that support reasoning-like behavior. Whether that constitutes real reasoning is a separate philosophical question, but it is clearly more than autocomplete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Process Reward Models: Grading the Work, Not Just the Answer
&lt;/h3&gt;

&lt;p&gt;Here is an idea that changed how reasoning models are trained. Instead of grading only the final answer, what if you graded every individual reasoning step?&lt;/p&gt;

&lt;p&gt;That is the core of a Process Reward Model (PRM).&lt;/p&gt;

&lt;p&gt;In standard training, the model produces an answer and gets told whether it was right or wrong. In PRM-based training, each step in the reasoning chain gets its own score. A wrong step gets flagged early, before it derails the rest of the solution.&lt;/p&gt;

&lt;p&gt;OpenAI's 2023 paper "Let's Verify Step by Step" showed that PRMs significantly outperform outcome-based reward models on mathematical reasoning tasks.&lt;/p&gt;
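&lt;p&gt;Concretely, a PRM assigns each step a correctness probability, and the chain gets an aggregate score; "Let's Verify Step by Step" scores a solution as the product of its step probabilities. A sketch with made-up step scores:&lt;/p&gt;

```python
import math

def chain_score(step_scores):
    # Probability the whole chain is correct, treating steps as independent:
    # the product of per-step correctness probabilities.
    return math.prod(step_scores)

# Hypothetical per-step scores from a process reward model.
good_chain = [0.98, 0.95, 0.97]
flawed_chain = [0.98, 0.40, 0.97]  # one dubious step drags the whole chain down

print(round(chain_score(good_chain), 3))
print(round(chain_score(flawed_chain), 3))
```

&lt;p&gt;The multiplicative form is what makes early errors so costly: a single low-scoring step caps the score of everything built on top of it.&lt;/p&gt;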

&lt;p&gt;This idea became the foundation for something much bigger, which I will cover in Week 12 when we get to test-time compute scaling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: OpenAI o1 and DeepSeek-R1
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OpenAI o1: Giving Models Time to Think
&lt;/h3&gt;

&lt;p&gt;In September 2024, OpenAI released o1, and the response from the research community was immediate.&lt;/p&gt;

&lt;p&gt;The idea behind o1 is straightforward: give the model more time to think at inference. Before producing an answer, o1 generates a hidden chain of thought that the user never sees but the model uses internally. This chain is trained with reinforcement learning: the model is rewarded for reaching correct answers, which teaches it to develop better internal reasoning strategies.&lt;/p&gt;

&lt;p&gt;The results on AIME 2024, a notoriously difficult high school math competition, were striking. GPT-4o scored 12%. o1 scored 74%.&lt;/p&gt;

&lt;p&gt;That is not a small improvement. That is a different class of performance, driven almost entirely by letting the model think longer.&lt;/p&gt;

&lt;h3&gt;
  
  
  DeepSeek-R1: The Open Source Answer
&lt;/h3&gt;

&lt;p&gt;In January 2025, a Chinese startup called DeepSeek released R1, and it caused genuine disruption in the Western AI community.&lt;/p&gt;

&lt;p&gt;DeepSeek-R1 matched o1-level performance at a fraction of the training cost. And it was fully open source.&lt;/p&gt;

&lt;p&gt;Three technical contributions made this possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Group Relative Policy Optimization (GRPO):&lt;/strong&gt; Standard RLHF needs a separate critic model to score responses, which adds significant overhead. GRPO removes that requirement. Instead, the model generates multiple responses to the same question, scores each one relative to the group average, and reinforces the responses that do better than their peers. No critic needed.&lt;/p&gt;
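&lt;p&gt;The group-relative baseline is simple to state: score a group of sampled responses, then use each reward's deviation from the group mean, normalized by the group's standard deviation, as the advantage. A sketch of that normalization (not DeepSeek's actual implementation):&lt;/p&gt;

```python
import statistics

def group_relative_advantages(rewards):
    # GRPO-style baseline: each response is judged against its own group,
    # so no learned critic/value model is required.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for 4 sampled answers to the same question.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

&lt;p&gt;Responses above the group mean get positive advantages and are reinforced; those below get negative advantages and are suppressed. The group itself plays the role the critic model plays in PPO-style RLHF.&lt;/p&gt;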

&lt;p&gt;&lt;strong&gt;Warm Start Before RL:&lt;/strong&gt; Training a model from scratch with pure reinforcement learning is unstable because the model starts random. DeepSeek's approach was to first run supervised fine-tuning to give the model a reasonable starting point, then apply RL on top of that. A sensible idea that turned out to matter a lot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emergent Reasoning Behaviors:&lt;/strong&gt; During training, R1 developed behaviors that were never explicitly programmed. The model began catching its own mistakes mid-reasoning and reconsidering. It started verifying its own answers before finalizing them. It explored alternative solution paths. These behaviors just appeared from the training process. For researchers trying to understand what is happening inside these models, this is genuinely interesting territory.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: Where CoT Fails
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Unfaithful Reasoning
&lt;/h3&gt;

&lt;p&gt;One of the more unsettling findings in recent research is that CoT explanations do not always reflect what the model actually computed.&lt;/p&gt;

&lt;p&gt;Anthropic's 2023 research showed that models sometimes produce post-hoc rationalizations. They settle on an answer through some internal process, then construct a reasoning chain that appears to justify it. The explanation and the computation are decoupled.&lt;/p&gt;

&lt;p&gt;What the model writes as its reasoning may not be what actually happened.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reasoning or Memorization?
&lt;/h3&gt;

&lt;p&gt;There is a deeper question underneath CoT performance: is the model actually reasoning, or is it recalling reasoning-shaped patterns from its training data?&lt;/p&gt;

&lt;p&gt;Researchers created a symbolic variant of GSM8K where the logic of each problem stayed the same, but surface features like numbers and names were changed. Performance dropped significantly. If the model were truly reasoning about the structure of the problem, this change should not matter. The fact that it does suggests some of the apparent reasoning is memorization in disguise.&lt;/p&gt;
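&lt;p&gt;This test is easy to replicate in miniature: hold the problem structure fixed, vary only names and numbers, and compute the ground truth from the structure itself. A sketch (the template is mine, in the spirit of that work):&lt;/p&gt;

```python
import random

TEMPLATE = "{name} has {a} tennis balls and buys {b} cans of {c} balls each. How many in total?"

def make_variant(rng):
    # Same logical structure, different surface features:
    # a model that truly reasons should be unaffected by these changes.
    a, b, c = rng.randint(2, 9), rng.randint(2, 5), rng.randint(2, 6)
    name = rng.choice(["Roger", "Amina", "Wei", "Sara"])
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    gold = a + b * c  # ground truth computed from the structure, not memorized
    return question, gold

rng = random.Random(0)
question, gold = make_variant(rng)
print(question)
print(gold)
```

&lt;p&gt;Because the gold answer comes from the template's arithmetic rather than any dataset, any accuracy drop on these variants isolates surface-feature sensitivity from the underlying logic.&lt;/p&gt;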

&lt;h3&gt;
  
  
  The Overthinking Problem
&lt;/h3&gt;

&lt;p&gt;My experiment showed a small version of this. On Problem 6, the model solved it correctly without CoT. With CoT, it added extra steps, got confused, and got it wrong.&lt;/p&gt;

&lt;p&gt;Researchers have documented this pattern at scale. Longer reasoning chains are not always better. Past a certain point, additional steps introduce errors rather than correct them. This has been called the "overthinking" problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compositional Generalization
&lt;/h3&gt;

&lt;p&gt;LLMs also struggle when they need to combine reasoning skills in novel ways. They can handle familiar patterns well. But put two familiar patterns together in a configuration the model has not seen, and performance degrades. This suggests the reasoning ability is less flexible and generalizable than it might appear from benchmark numbers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5: What We Still Do Not Know
&lt;/h2&gt;

&lt;p&gt;CoT has genuinely advanced what language models can do. But there are open questions that the field has not resolved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are the Explanations Honest?
&lt;/h3&gt;

&lt;p&gt;When a model shows its reasoning, is that actually what happened computationally? The unfaithful reasoning research says it often is not. We do not have reliable tools to check whether a model's stated reasoning matches its internal computation. This matters a lot if you want to trust the reasoning, not just the answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where Does Reasoning End and Memorization Begin?
&lt;/h3&gt;

&lt;p&gt;The symbolic variant experiments raise a question that nobody has cleanly answered yet. For any given correct reasoning chain, how much of it reflects genuine logical inference versus pattern recall? The boundary is not well defined.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Does CoT Work in English and Struggle Elsewhere?
&lt;/h3&gt;

&lt;p&gt;Almost all CoT research was conducted in English. When you apply the same techniques to Arabic, Urdu, or other lower-resource languages, performance drops noticeably. Whether this is primarily a data coverage problem or something more structural about how reasoning transfers across language families is still an open question.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can We Formally Verify a Reasoning Step?
&lt;/h3&gt;

&lt;p&gt;A calculator gives you a provably correct answer. An LLM gives you a confident one. There is currently no reliable way to formally verify whether an individual step in an LLM's reasoning chain is logically valid. Researchers are exploring integrations with formal theorem provers such as Lean4, but this remains largely unsolved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Interpretability Scale?
&lt;/h3&gt;

&lt;p&gt;Mechanistic interpretability research has produced real insights at small model scales: specific circuits identified, specific behaviors localized. But as models grow to hundreds of billions of parameters, these techniques become computationally impractical. How interpretability research keeps pace with model scale is an open problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Papers Worth Reading
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Paper&lt;/th&gt;
&lt;th&gt;What It Contributes&lt;/th&gt;
&lt;th&gt;Venue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wei et al. (2022)&lt;/td&gt;
&lt;td&gt;Original CoT paper&lt;/td&gt;
&lt;td&gt;NeurIPS 2022&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kojima et al. (2022)&lt;/td&gt;
&lt;td&gt;Zero-shot CoT discovery&lt;/td&gt;
&lt;td&gt;NeurIPS 2022&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lightman et al. (2023)&lt;/td&gt;
&lt;td&gt;Process Reward Models&lt;/td&gt;
&lt;td&gt;OpenAI Tech Report&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-AI (2025)&lt;/td&gt;
&lt;td&gt;GRPO and DeepSeek-R1&lt;/td&gt;
&lt;td&gt;arXiv 2501&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turpin et al. (2023)&lt;/td&gt;
&lt;td&gt;Unfaithful reasoning&lt;/td&gt;
&lt;td&gt;NeurIPS 2023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wang et al. (2022)&lt;/td&gt;
&lt;td&gt;Self-consistency via majority voting&lt;/td&gt;
&lt;td&gt;ICLR 2023&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Research Groups Doing Interesting Work Here
&lt;/h2&gt;

&lt;p&gt;Anthropic's interpretability team is doing some of the most rigorous work on understanding what is happening inside these models. DeepMind's Gemini team is pushing multimodal reasoning. MIT's BCS and CSAIL groups are connecting cognitive science with language model research. Peking University's NLP group has produced strong work on multilingual reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks You Should Know
&lt;/h2&gt;

&lt;p&gt;GSM8K covers grade school math with 8,500 problems. MATH is competition-level with 12,500 problems. MMLU covers broad knowledge across many domains. ARC-Challenge focuses on scientific reasoning. BIG-Bench Hard collects 23 tasks specifically designed to be difficult for current models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Chain-of-thought prompting is one of the more surprising ideas in recent AI research. A single phrase, added to a prompt, unlocks reasoning capabilities that were already there but not being used.&lt;/p&gt;

&lt;p&gt;And yet the central question it raised remains unanswered. Do these models actually reason, or do they produce sophisticated simulations of reasoning? The honest answer is that we do not fully know.&lt;/p&gt;

&lt;p&gt;The gap between sounding intelligent and being intelligent is where the most interesting work in this field is happening right now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next week:&lt;/strong&gt; Small Language Models. How models like Phi-3 and Gemma became serious competitors to GPT-4, and what the research landscape looks like when you do not need a data center to run your model.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. &lt;em&gt;NeurIPS 2022&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. &lt;em&gt;NeurIPS 2022&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Lightman, H., et al. (2023). Let's Verify Step by Step. &lt;em&gt;arXiv:2305.20050&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. &lt;em&gt;arXiv:2501.12948&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Turpin, M., et al. (2023). Language Models Don't Always Say What They Think. &lt;em&gt;NeurIPS 2023&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Wang, X., et al. (2022). Self-Consistency Improves Chain-of-Thought Reasoning in Language Models. &lt;em&gt;ICLR 2023&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Elhage, N., et al. (2021). A Mathematical Framework for Transformer Circuits. &lt;em&gt;Anthropic&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Code for this experiment is available on GitHub: &lt;a href="https://github.com/soohanAbbasi/weekly-AI-ML-research/tree/main/week01-chain-of-thought" rel="noopener noreferrer"&gt;Week 01 Code&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is part of a weekly series on AI/ML research. Each post covers theory, recent work, and experiments I run myself.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Connect on LinkedIn &lt;a href="https://www.linkedin.com/in/soohan-abbasi-36267b183/" rel="noopener noreferrer"&gt;Soohan Abbasi&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>I Built an Offline AI Career Advisor Using Gemma 4 — Here's Exactly How It Works</title>
      <dc:creator>soohan abbasi</dc:creator>
      <pubDate>Wed, 13 May 2026 06:04:46 +0000</pubDate>
      <link>https://dev.to/soohan_abbasi/i-built-an-offline-ai-career-advisor-using-gemma-4-heres-exactly-how-it-works-3hgc</link>
      <guid>https://dev.to/soohan_abbasi/i-built-an-offline-ai-career-advisor-using-gemma-4-heres-exactly-how-it-works-3hgc</guid>
      <description>&lt;h1&gt;
  
  
  I Built an Offline AI Career Advisor Using Gemma 4 — Here's Exactly How It Works
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;A technical walkthrough of GuidanceOS: from model loading to multi-agent orchestration, running entirely on a Kaggle T4 GPU with no internet at inference time.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I teach Computer Science. Over the years, one thing I kept seeing was students who had decent skills but no idea what to do with them. They didn't know what jobs matched their profile, what courses to take next, or how to position themselves for a career. Career guidance platforms exist, sure — but they're mostly behind paywalls, require accounts, and need a stable internet connection.&lt;/p&gt;

&lt;p&gt;So I built GuidanceOS for the Gemma 4 Good Hackathon. The goal was simple: a fully offline AI system that takes your resume, figures out your skills, and gives you a complete career analysis — job matches, course recommendations, a 3-month learning plan, and an ATS score — all running locally on a GPU, no API calls at inference time.&lt;/p&gt;

&lt;p&gt;Here's exactly how I built it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Model Choice: Why Gemma 4 e4b-it
&lt;/h2&gt;

&lt;p&gt;The hackathon required using Gemma 4. Google released four variants: 2B, 4B (edge), 26B MoE, and 31B Dense. I went with &lt;strong&gt;gemma-4-e4b-it&lt;/strong&gt; for a specific reason.&lt;/p&gt;

&lt;p&gt;The "e" stands for edge-optimized. The "it" stands for instruction-tuned. On Kaggle's free T4 GPU (15GB VRAM), a naive load of even a 4B model can fail if quantization isn't handled right. With 4-bit NF4 quantization via BitsAndBytes, gemma-4-e4b-it loads in about 8.7GB — leaving headroom for inference.&lt;/p&gt;

&lt;p&gt;One problem I ran into immediately: the stable release of Hugging Face Transformers (5.0.0 at the time) didn't recognize the &lt;code&gt;gemma4&lt;/code&gt; architecture. Loading the model threw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: The checkpoint you are trying to load has model type `gemma4`
but Transformers does not recognize this architecture.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix was straightforward — install Transformers from the GitHub dev branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="sb"&gt;`&lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;git+https://github.com/huggingface/transformers.git&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This bumped the version to &lt;code&gt;5.8.0.dev0&lt;/code&gt;, which includes the Gemma 4 model class.&lt;/p&gt;

&lt;p&gt;The second issue was GPU memory management. Using &lt;code&gt;device_map="auto"&lt;/code&gt; caused BitsAndBytes to split the model across CPU and GPU, which it doesn't allow in 4-bit mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: Some modules are dispatched on the CPU or the disk.
Make sure you have enough GPU RAM to fit the quantized model.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution: pin everything to a single GPU.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForImageTextToText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;MODEL_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantization_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bnb_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, the model loaded cleanly in about 3 minutes and sat at 8.7GB on GPU 0.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Knowledge Base: TF-IDF Over 130K Records
&lt;/h2&gt;

&lt;p&gt;I used two datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn Job Postings&lt;/strong&gt; — 123,849 jobs with title, description, skills, location, experience level, and salary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coursera Courses 2024&lt;/strong&gt; — 6,645 courses with title, skills, description, level, rating, and URL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For job and course matching, I built a TF-IDF index over combined text fields. For jobs, I concatenated the job title, skills description, and the first 300 characters of the full description. For courses, I combined the title, skills tags, and description.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;jobs_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;combined_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;jobs_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="n"&gt;jobs_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;skills_desc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="n"&gt;jobs_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I fit a TfidfVectorizer with bigrams and 10,000 features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;jobs_vectorizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TfidfVectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stop_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ngram_range&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;jobs_tfidf_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jobs_vectorizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jobs_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;combined_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At query time, the user's skill string gets transformed by the same vectorizer and compared against the full matrix using cosine similarity. The top-k results come back in milliseconds — no GPU needed, no network call.&lt;/p&gt;
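&lt;p&gt;The query path is the same fitted vectorizer plus cosine similarity. A self-contained sketch with a toy corpus standing in for the 123,849 job rows (&lt;code&gt;top_k_matches&lt;/code&gt; is an illustrative name, not the project's API):&lt;/p&gt;

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for jobs_clean['combined_text'].
corpus = [
    "machine learning engineer python pytorch deep learning",
    "frontend developer react typescript css",
    "data analyst sql excel dashboards",
]

vectorizer = TfidfVectorizer(max_features=10000, stop_words="english", ngram_range=(1, 2))
matrix = vectorizer.fit_transform(corpus)

def top_k_matches(user_skills, k=2):
    # Transform the query with the SAME fitted vectorizer, then rank by cosine similarity.
    query_vec = vectorizer.transform([user_skills])
    scores = cosine_similarity(query_vec, matrix)[0]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

print(top_k_matches("python machine learning"))
```

&lt;p&gt;Note that the query goes through &lt;code&gt;transform&lt;/code&gt;, not &lt;code&gt;fit_transform&lt;/code&gt;: refitting on the query would change the vocabulary and make the vectors incomparable.&lt;/p&gt;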

&lt;p&gt;I chose TF-IDF over dense vector search (FAISS + sentence embeddings) deliberately. Dense search needs an embedding model at query time, which adds latency and memory. TF-IDF is deterministic, fast, and reproducible — important when the whole point is offline-first operation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Inference Helper
&lt;/h2&gt;

&lt;p&gt;Before building agents, I needed a clean wrapper around Gemma 4's generation. The model uses a specific chat format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_gemma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;formatted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;bos&amp;gt;&amp;lt;start_of_turn&amp;gt;user&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;end_of_turn&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;start_of_turn&amp;gt;model&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;formatted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;add_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;top_p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;repetition_penalty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;pad_token_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eos_token_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;eos_token_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eos_token_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;input_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;input_len&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;end_of_turn&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;end_of_turn&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;add_special_tokens=False&lt;/code&gt;&lt;/strong&gt; — because I'm manually prepending &lt;code&gt;&amp;lt;bos&amp;gt;&lt;/code&gt; in the prompt string. If you let the tokenizer add it automatically as well, you get a duplicate BOS token, which confuses the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;repetition_penalty=1.3&lt;/code&gt;&lt;/strong&gt; — without this, the model loops. I found this out the hard way when my first test response was 200 repetitions of "matched matched matched".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decoding only new tokens&lt;/strong&gt; — &lt;code&gt;outputs[0][input_len:]&lt;/code&gt; strips the input tokens from the output before decoding. Otherwise you get the full prompt echoed back before the response.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Agents
&lt;/h2&gt;

&lt;p&gt;Each agent is a focused prompt sent to &lt;code&gt;ask_gemma&lt;/code&gt;. The agents run sequentially, not in parallel — this keeps memory usage flat and avoids context window issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent 1 — Skills Analyzer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Takes the raw resume text and returns a structured output in a fixed format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TECHNICAL SKILLS: Python, NLP, LangChain, ...
SOFT SKILLS: Communication, Teaching, ...
EXPERIENCE: 5 years
LEVEL: mid
DOMAINS: Artificial Intelligence, NLP, Education
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I enforce the format in the prompt rather than post-processing with regex. Gemma 4 follows structured output instructions reliably when you give it an exact template to fill.&lt;/p&gt;
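
&lt;p&gt;A hypothetical example of what a template-pinned prompt can look like (the actual notebook prompt may differ); the principle is to hand the model the exact skeleton to fill:&lt;/p&gt;

```python
# Hypothetical prompt template in the spirit described above: the model is
# given the exact output skeleton, so no regex cleanup is needed afterwards.
SKILLS_PROMPT = """Analyze this resume and respond in EXACTLY this format,
with no extra commentary:

TECHNICAL SKILLS: [comma-separated list]
SOFT SKILLS: [comma-separated list]
EXPERIENCE: [number] years
LEVEL: [junior/mid/senior]
DOMAINS: [comma-separated list]

Resume:
{resume_text}"""

prompt = SKILLS_PROMPT.format(resume_text="5 years building NLP systems in Python...")
```

Because every field label is fixed, downstream agents can consume the response as-is, or split on the label if they ever need a single field.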

&lt;p&gt;&lt;strong&gt;Agent 2 — Career Path Advisor&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Takes the extracted skills string and returns three career paths with job titles, required additional skills, USD salary ranges, and a growth potential score out of 10.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent 3 — Learning Plan Designer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Takes the skills and target role and returns a 3-month plan broken down by month — foundation topics in month 1, intermediate topics in month 2, advanced topics and portfolio projects in month 3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent 4 — Resume and ATS Analyst&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Takes the resume text and target role and returns an ATS score out of 100, three strengths, three improvement areas, missing keywords, and a suggested rewrite for the professional summary.&lt;/p&gt;

&lt;p&gt;The skills string extracted by Agent 1 is passed directly into Agents 2 and 3, creating a lightweight chain without needing LangChain or CrewAI overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gradio Interface
&lt;/h2&gt;

&lt;p&gt;I used Gradio instead of Streamlit for one reason: on Kaggle, &lt;code&gt;app.launch(share=True)&lt;/code&gt; generates a public share URL in a single line. No tunnel setup, no separate process.&lt;/p&gt;

&lt;p&gt;The interface has two inputs — resume text and target role — and six output tabs, one per agent plus job matches and course recommendations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GuidanceOS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;resume_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Textbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Resume Text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;role_input&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Textbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Target Role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;submit_btn&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze My Profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;variant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;primary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Skills Analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;skills_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Textbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# ... five more tabs
&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;share&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I added &lt;code&gt;gr.Progress()&lt;/code&gt; to the main function so the UI shows which agent is running instead of just freezing. Each agent call takes 30-90 seconds on a T4 — the progress bar makes it feel responsive.&lt;/p&gt;




&lt;h2&gt;
  
  
  End-to-End Flow
&lt;/h2&gt;

&lt;p&gt;When a user clicks Analyze:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Resume text → Agent 1 → structured skills profile&lt;/li&gt;
&lt;li&gt;Skills string → TF-IDF search → top 5 jobs from 123K LinkedIn postings&lt;/li&gt;
&lt;li&gt;Skills string → TF-IDF search → top 5 courses from 6.6K Coursera courses&lt;/li&gt;
&lt;li&gt;Skills string → Agent 2 → three career paths with salaries&lt;/li&gt;
&lt;li&gt;Skills string + target role → Agent 3 → 3-month learning roadmap&lt;/li&gt;
&lt;li&gt;Resume text + target role → Agent 4 → ATS score and improvements&lt;/li&gt;
&lt;li&gt;All outputs → six Gradio tabs&lt;/li&gt;
&lt;/ol&gt;
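
&lt;p&gt;The flow above can be sketched as a single orchestration function. Every helper below is a stand-in, stubbed so the chaining itself is visible; the real versions are described earlier in the post:&lt;/p&gt;

```python
# Sketch of the orchestration. All helpers are stubs standing in for the
# real wrapper and TF-IDF searches, so only the data flow is shown here.
def ask_gemma(prompt, max_tokens=300):  # stand-in for the real model wrapper
    return f"[model output for: {prompt[:30]}...]"

def search_jobs(skills, k=5):           # stand-in for the TF-IDF job search
    return [f"job_{i}" for i in range(k)]

def search_courses(skills, k=5):        # stand-in for the TF-IDF course search
    return [f"course_{i}" for i in range(k)]

def run_pipeline(resume_text, target_role):
    skills = ask_gemma(f"Extract skills from:\n{resume_text}")  # Agent 1
    return {
        "skills": skills,
        "jobs": search_jobs(skills),
        "courses": search_courses(skills),
        "paths": ask_gemma(f"Suggest career paths for: {skills}"),          # Agent 2
        "plan": ask_gemma(f"3-month plan for {target_role}: {skills}"),     # Agent 3
        "ats": ask_gemma(f"ATS review for {target_role}:\n{resume_text}"),  # Agent 4
    }

results = run_pipeline("5 years of Python and NLP...", "ML Engineer")
```

Note that only Agent 1's output fans out to the searches and the other agents — there is no shared state beyond that one string, which is what makes the chain so cheap.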

&lt;p&gt;Total time: 3-5 minutes on a T4 GPU. All computation on-device. Zero external API calls.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;A few things I'd change with more time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured JSON output from agents.&lt;/strong&gt; Right now the agents return free-form text. Enforcing JSON output would make the results easier to display in a proper UI — cards instead of plain text boxes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FAISS for course search.&lt;/strong&gt; TF-IDF misses semantic similarity — "data analysis" and "analytics" are treated as different terms. Sentence embeddings with FAISS would improve course matching quality significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session persistence with SQLite.&lt;/strong&gt; The current setup doesn't remember previous conversations. Adding a lightweight SQLite store would let users build on previous sessions.&lt;/p&gt;
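
&lt;p&gt;A sketch of what that store could look like, using only the standard library (schema and function names are illustrative, not from the notebook):&lt;/p&gt;

```python
# Hypothetical SQLite session store for the persistence idea above.
# Schema and function names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would persist across runs
conn.execute("""CREATE TABLE IF NOT EXISTS sessions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    target_role TEXT,
    skills TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)""")

def save_session(role, skills):
    conn.execute(
        "INSERT INTO sessions (target_role, skills) VALUES (?, ?)",
        (role, skills),
    )
    conn.commit()

def last_session(role):
    row = conn.execute(
        "SELECT skills FROM sessions WHERE target_role = ? ORDER BY id DESC LIMIT 1",
        (role,),
    ).fetchone()
    return row[0] if row else None

save_session("ML Engineer", "Python, NLP, LangChain")
previous_skills = last_session("ML Engineer")
```

Feeding &lt;code&gt;previous_skills&lt;/code&gt; back into the Agent 1 prompt would be enough to let a returning user pick up where they left off.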

&lt;p&gt;&lt;strong&gt;SHAP explainability.&lt;/strong&gt; I had planned to add a SHAP chart showing which skills drove each job recommendation using a Random Forest trained on the jobs dataset. It didn't make the deadline but the data pipeline supports it cleanly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;

&lt;p&gt;The full notebook is on Kaggle:&lt;br&gt;
&lt;a href="https://www.kaggle.com/code/abbasi110/guidanceos-gemma4-offline-career-advisor" rel="noopener noreferrer"&gt;kaggle.com/code/abbasi110/guidanceos-gemma4-offline-career-advisor&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source code on GitHub:&lt;br&gt;
&lt;a href="https://github.com/soohanAbbasi/GuidanceOS" rel="noopener noreferrer"&gt;github.com/soohanAbbasi/GuidanceOS&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You need a Kaggle account to run it. Add the gemma-4-e4b-it model and both datasets, set the accelerator to GPU T4 x2, and run all cells in order. The Gradio URL prints in the last cell.&lt;/p&gt;




&lt;p&gt;That's the full build. If you have questions about any part of it — the quantization setup, the prompt templates, or the TF-IDF indexing — leave a comment and I'll answer.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>career</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
