DEV Community: soohan abbasi

# Agentic AI: Architecture of Autonomous Systems

soohan abbasi — Sun, 31 May 2026 12:30:00 +0000

"A language model that answers questions is a tool. A language model that decides which questions to ask and then acts on the answers is something else entirely."

Introduction: When Models Started Deciding

For the first several years of modern NLP, the task was always the same: given input, produce output. One forward pass. One completion. Done.

In 2022, a paper from researchers at Princeton and Google Brain asked a different question. What if, instead of producing an answer directly, a model could reason about what information it needs, act to retrieve it, and revise its thinking based on what it found?

The paper was ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022). Applying it to an LLM created something qualitatively different: a model that could take real-world actions and adapt its reasoning based on what came back.

A completion model is a calculator. An agent is a process: it has a goal, takes steps toward it, and updates when things go wrong. This week I went deep on the architecture behind these systems, the frameworks that define them, and what the open problems look like from a research perspective.

Part 1: What Makes a System "Agentic"?

The word "agent" gets used loosely in current literature. A clean definition comes from Russell and Norvig's Artificial Intelligence: A Modern Approach:

An agent is anything that perceives its environment through sensors and acts upon that environment through actuators.

For an LLM-based system, this is a loop: perceive an observation, reason about what to do, act via a tool call or output, observe the result, and loop again. But not every loop qualifies as agentic. Three properties distinguish genuinely agentic systems from tool-augmented chatbots:

Property	What It Means
Goal persistence	Maintains the original goal across multiple steps without re-prompting
Adaptive planning	Revises its approach based on intermediate results
Tool autonomy	Decides when and which tools to use, not just how to use one it was told to call

Most production systems in 2026 satisfy the first two reliably. The third, genuine tool autonomy where an agent discovers appropriate tools from scratch, is still largely unsolved.

Part 2: The Core Frameworks

ReAct: Reasoning and Acting Together

The core contribution of ReAct (Yao et al., ICLR 2023) is structuring the model's output as alternating Thought and Action blocks.

Thought: I need current statistics on LLM deployment.
Action: search("LLM production deployment 2025")
Observation: 68% of enterprises report using LLMs in production workflows...
Thought: Enough context. I can now answer the original question.

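Two things happen here that do not happen in single-pass completion. The model commits to a reasoning step before acting, and every action has a traceable reason. The original paper evaluated ReAct on two knowledge tasks (HotpotQA and FEVER) and two interactive ones (ALFWorld and WebShop). On the interactive tasks it clearly outperformed acting-only baselines. On the knowledge tasks the picture was more mixed: ReAct beat chain-of-thought alone on FEVER, trailed it slightly on HotpotQA, and did best when the two were combined.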

The intuition: chain-of-thought helps a model reason over information it already has. ReAct helps it acquire what it needs, then reason over it.
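
As a rough sketch of that loop (not the paper's implementation), the control flow fits in a few lines. Everything named here is a stand-in: `scripted_policy` replaces the LLM call with two hard-coded decisions, and the `search` tool returns a canned string.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry: one stubbed search tool with a canned result.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda query: f"stub result for '{query}'",
}

def scripted_policy(history: List[str]) -> Tuple[str, str, str]:
    """Stand-in for the LLM call. Returns (thought, action, argument)."""
    if not any(line.startswith("Observation:") for line in history):
        return ("I need to look this up.", "search", "LLM production deployment")
    return ("Enough context to answer.", "finish", "answer based on the observation")

def react_loop(question: str, policy=scripted_policy, max_steps: int = 5):
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        history.append(f"Thought: {thought}")      # reasoning is recorded before acting
        if action == "finish":
            return arg, history
        history.append(f"Action: {action}({arg!r})")
        observation = TOOLS[action](arg)           # the real tool result, not a model guess
        history.append(f"Observation: {observation}")
    return "no answer within step limit", history

answer, trace = react_loop("How widely are LLMs deployed?")
print("\n".join(trace))
```

The point of the sketch is the ordering: each Action line is preceded by the Thought that justified it, and the Observation always comes from the tool rather than from the model.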

What ReAct does not solve. It is reactive. If the first three steps go down the wrong path, there is no mechanism to step back and reconsider.

Figure 1: The ReAct agent loop — Perceive, Reason, Act, Observe. Every action traces back to a reasoning step.

Reflexion: Learning From Failure Without Gradient Updates

Reflexion (Shinn et al., NeurIPS 2023) addresses exactly that. After each failed attempt, the agent generates a verbal self-reflection analyzing what went wrong. This is stored in a memory buffer and prepended to context at the start of the next episode.

[Episode 1 fails]
Reflection: "I searched by title, which broke on the colon. Next time: search by author and year."

[Episode 2]
Agent searches "Shinn 2023 language agent" and succeeds.

The model improves across attempts through language, not weight updates. On HumanEval, Reflexion improved pass@1 by roughly 11 percentage points over the GPT-4 baseline reported in the paper (from about 80% to 91%). On ALFWorld, ReAct combined with Reflexion solved 130 of 134 tasks, about 97%, over repeated trials.

Figure 2: Reflexion vs ReAct on HumanEval and ALFWorld. Numbers from Shinn et al., NeurIPS 2023.

The fundamental limitation. The memory buffer lives in the context window. As episodes accumulate, early reflections get pushed out. True long-term learning from experience requires something outside the context window entirely.
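
A minimal sketch of that limitation, assuming the buffer is capped at a fixed number of reflections. The `ReflectionBuffer` class and its prompt format are hypothetical, not taken from the Reflexion codebase.

```python
from collections import deque

class ReflectionBuffer:
    """Keeps only the most recent reflections, mimicking a limited context budget."""

    def __init__(self, max_items: int = 3):
        self.items = deque(maxlen=max_items)

    def add(self, reflection: str) -> None:
        self.items.append(reflection)  # once full, the oldest entry is dropped

    def build_prompt(self, task: str) -> str:
        lessons = "\n".join(f"- {r}" for r in self.items)
        return f"Lessons from earlier attempts:\n{lessons}\n\nTask: {task}"

buffer = ReflectionBuffer(max_items=2)
buffer.add("Title search broke on the colon; search by author and year.")
buffer.add("The first result was a blog post; prefer the arXiv link.")
buffer.add("Stop after two failed searches and report the gap.")
prompt = buffer.build_prompt("Find the Reflexion paper")
print(prompt)
```

After the third reflection is added, the first one (the colon lesson) is gone from the prompt, which is exactly the forgetting the paragraph above describes.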

Multi-Agent Frameworks

Single-agent systems face one bottleneck: one model handling planning, retrieval, tool use, and synthesis simultaneously. Multi-agent frameworks decompose this. The standard pattern has an orchestrator that breaks the goal into subtasks, specialist agents that handle each, and an aggregator that synthesizes the result.

The three dominant frameworks take meaningfully different approaches:

Framework	Communication Model	Key Distinction
AutoGen (Microsoft, 2023)	Agents converse with each other	Human-in-the-loop as a first-class citizen
CrewAI (2024)	Role-based delegation	Each agent has an explicit role and goal
LangGraph (LangChain, 2024)	Directed graph with shared state	Explicit control flow, most debuggable in production

LangGraph models the entire workflow as a directed graph: nodes are agents, edges are transitions. This makes execution paths readable and failures traceable, which matters significantly in production.
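
To make the graph idea concrete without depending on any framework, here is a plain-Python sketch. It deliberately does not use the LangGraph API; `NODES`, `EDGES`, and `run_graph` are hypothetical names, and the three node functions are stubs for the orchestrator, specialist, and aggregator roles.

```python
from typing import Callable, Dict, Optional

State = Dict[str, object]

def plan(state: State) -> State:
    # Orchestrator stub: break the goal into subtasks.
    return {**state, "subtasks": ["retrieve", "summarize"]}

def work(state: State) -> State:
    # Specialist stub: pretend each subtask was completed.
    return {**state, "results": [f"{task}: done" for task in state["subtasks"]]}

def aggregate(state: State) -> State:
    # Aggregator stub: merge the specialist outputs.
    return {**state, "answer": "; ".join(state["results"])}

NODES: Dict[str, Callable[[State], State]] = {"plan": plan, "work": work, "aggregate": aggregate}
EDGES: Dict[str, Optional[str]] = {"plan": "work", "work": "aggregate", "aggregate": None}

def run_graph(start: str, state: State) -> State:
    path = []
    node: Optional[str] = start
    while node is not None:
        path.append(node)            # the executed path is recorded for debugging
        state = NODES[node](state)   # each node reads and returns the shared state
        node = EDGES[node]
    return {**state, "path": path}

final_state = run_graph("plan", {"goal": "write a literature summary"})
print(final_state["path"], final_state["answer"])
```

Because the transitions live in one explicit table, a failed run can be traced by reading `path` instead of reconstructing the control flow from logs.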

Part 3: Memory Architecture

Memory is where most agentic systems underperform. The naive approach of keeping everything in context breaks at scale. A well-designed agent needs four distinct memory types:

In-context memory is the active context window. Fast and immediate, but size-limited and cleared between sessions. Use for current task state and the ongoing reasoning chain.

External memory is a persistent vector database. Facts are stored as embeddings and retrieved by cosine similarity. This is essentially RAG applied to the agent's own accumulated knowledge rather than a document corpus.

Episodic memory is a log of past trajectories: what the agent did, what succeeded, what failed. Reflexion's verbal buffer is a simple version. More sophisticated implementations store full (observation, action, outcome) tuples and retrieve by similarity, enabling few-shot learning from experience without retraining.

Procedural memory is the agent's fixed capabilities: tool schemas and system prompts. What it contains, particularly how tools are described, has outsized influence on behavior.

The memory architecture determines the learning capacity of the system. Getting the interaction between these layers right is still an open engineering and research problem.
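
The external memory layer described above can be sketched as a list of (embedding, fact) pairs ranked by cosine similarity. This is an illustration only: the `VectorMemory` class is hypothetical, and the three-dimensional vectors are hand-written placeholders for what a real embedding model would produce.

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class VectorMemory:
    def __init__(self) -> None:
        self.entries: List[Tuple[List[float], str]] = []

    def store(self, embedding: List[float], fact: str) -> None:
        self.entries.append((embedding, fact))

    def retrieve(self, query_embedding: List[float], k: int = 1) -> List[str]:
        # Rank every stored fact by similarity to the query; fine for small stores.
        ranked = sorted(self.entries, key=lambda e: cosine(e[0], query_embedding), reverse=True)
        return [fact for _, fact in ranked[:k]]

memory = VectorMemory()
memory.store([1.0, 0.0, 0.2], "ReAct interleaves thoughts and actions.")
memory.store([0.0, 1.0, 0.1], "Reflexion stores verbal self-reflections.")
top = memory.retrieve([0.9, 0.1, 0.0], k=1)
print(top)
```

A production store would replace the linear scan with an approximate nearest-neighbour index, but the retrieval contract stays the same.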

My Experiment: Building the Architecture From Scratch

Most tutorials on agentic AI use LangChain or AutoGen. For this week, I deliberately avoided both and built a minimal ReAct agent using only the Anthropic API. The goal was not to produce novel empirical results — it was to understand what these frameworks are actually abstracting away.

The pipeline has four tools: web_search, memory_store, memory_retrieve, and final_answer. The orchestrator is Claude running in tool-use mode. Memory is a simple cosine similarity store over embeddings. Two queries run sequentially, sharing the same memory instance, so facts retrieved in Query 1 are available to Query 2.

To be clear about what this is and is not: this is an architectural walkthrough, not an empirical study. The outcome — that a warm memory store reduces tool calls — is exactly what theory predicts. I was not testing whether it works. I was making visible how it works, because the mechanism only becomes concrete when you can see every tool call in sequence rather than having a framework handle it silently.

Two things became clear that I had not fully appreciated from reading papers alone. First, tool description quality matters more than I expected. A vague tool description produces inconsistent selection — the model sometimes calls web_search when memory_retrieve was the right first step, purely because the description did not make the priority explicit. This is a grounding problem that frameworks handle through opinionated defaults, which means when their defaults are wrong, you often cannot see why. Second, the memory store without real semantic embeddings is brittle. I used mock embeddings seeded by text hash, which are consistent but not meaningful. On queries where surface-level keyword overlap is low, retrieval fails entirely. The framework abstracts this away. Building without it made the failure visible immediately.
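
To show why hash-seeded embeddings are consistent but not meaningful, here is one way such a mock could be built. This is an assumption about the general approach, not the code from the repo: `mock_embedding` seeds a random generator from a SHA-256 hash of the text.

```python
import hashlib
import math
import random
from typing import List

def mock_embedding(text: str, dim: int = 64) -> List[float]:
    # Same text -> same seed -> same vector. Different text -> unrelated vector.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

same = cosine(mock_embedding("ReAct paper"), mock_embedding("ReAct paper"))
paraphrase = cosine(mock_embedding("ReAct paper"), mock_embedding("the ReAct publication"))
print(f"identical text: {same:.3f}, paraphrase: {paraphrase:.3f}")
```

Identical strings score 1.0, while a paraphrase gets a score that has nothing to do with its meaning. A store built this way only retrieves reliably when the query text matches what was stored.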

Figure 3: Agent trace showing cold start (Query 1, 4 steps) vs warm start (Query 2, 2 steps). Facts stored in Query 1 were retrieved directly in Query 2, eliminating web search entirely.

The full code is in the GitHub repo linked below. The more interesting exercise, which I plan to run properly in a later week, is a controlled comparison of prompted versus trained agents on a fixed benchmark — ideally reproducing part of the Reflexion evaluation on AlfWorld to see whether my numbers match the paper.

Part 4: Failure Modes in Production

Agentic systems fail in ways that single-pass models do not, and the failures follow predictable patterns.

Tool call loops. The agent calls the same tool repeatedly with slightly different inputs without making progress. This happens when the tool returns unhelpful results and the agent has no mechanism for declaring failure. Step limits and explicit "I cannot find this" states help.

Hallucinated observations. The model predicts what a tool would return rather than waiting for the actual result. It is a context management error and subtle to catch without logging every call.

Memory poisoning. An incorrect fact stored early gets retrieved for related queries and contaminates future reasoning. Errors compound. Confidence-weighted storage and verification before storing are partial mitigations.

Goal drift. Past roughly 15 steps, agents frequently lose track of the original objective and optimize for the most recent subtask. Re-injecting the original goal into the prompt on every turn reduces this.

Prompt injection. A web search result or document contains text designed to override the agent's instructions. This is a real attack vector in production, not a theoretical one.

Each has partial mitigations. None has a clean solution.
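
As an example of what a partial mitigation looks like, the step limit and explicit failure state suggested for tool call loops can be sketched as a small guard object. `LoopGuard` and its thresholds are hypothetical. The normalisation only catches inputs that differ in case or spacing, so genuinely reworded queries would still slip through.

```python
from collections import Counter

class LoopGuard:
    def __init__(self, max_steps: int = 15, max_repeats: int = 2):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def check(self, tool: str, argument: str) -> str:
        """Return 'ok', or the reason the agent should stop and declare failure."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "step limit reached"
        key = (tool, " ".join(argument.lower().split()))  # normalise case and spacing
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return "repeated tool call"
        return "ok"

guard = LoopGuard(max_steps=5, max_repeats=2)
verdicts = [guard.check("web_search", q) for q in ["LLM stats", "llm  stats", "LLM Stats"]]
print(verdicts)
```

The agent loop would call `check` before every tool invocation and switch to an "I cannot find this" response on anything other than `ok`.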

What We Still Do Not Know

Is prompting-based agency enough? Current agents improve through prompting: ReAct, Reflexion, tool descriptions, with no weight updates. As tasks grow longer and environments more complex, will this hit a ceiling? If training-based agents eventually replace prompted ones, what does that change about interpretability and control?

How do you evaluate trustworthiness? An agent scoring 80% on SWE-bench may still fail unpredictably on cases outside the benchmark distribution. We do not have frameworks for measuring agent reliability statistically, just average performance on fixed task sets.

Memory or fine-tuning for domain adaptation? When specializing an agent for a domain, is it better to give it a rich external memory store or to fine-tune on domain trajectories? The tradeoffs in cost, latency, generalization, and catastrophic forgetting are not well characterized.

Can we formally bound agent behavior? A calculator is provably correct within its domain. An LLM agent is evaluated empirically on benchmarks. There is no formal framework for specifying what an agent will and will not do, analogous to how formal verification works for software. Whether this is achievable for learned systems is an open question.

Papers Worth Reading

Paper	Contribution	Venue
Yao et al. (2022)	ReAct: Reasoning + Acting loop	ICLR 2023
Shinn et al. (2023)	Reflexion: Verbal self-reflection for improvement	NeurIPS 2023
Wu et al. (2023)	AutoGen: Multi-agent conversation framework	arXiv 2023
Schick et al. (2023)	Toolformer: Self-supervised tool learning	NeurIPS 2023
Liu et al. (2023)	AgentBench: Evaluating LLMs as agents	ICLR 2024
Park et al. (2023)	Generative Agents: Simulating believable behavior	UIST 2023

Research Groups Doing Interesting Work

Princeton NLP (Yao et al.) has extended agent reasoning with Tree of Thoughts and related work. The Reflexion authors (Shinn et al., a collaboration spanning Northeastern, MIT, and Princeton) continue on self-improvement mechanisms. Microsoft Research is focused on multi-agent reliability in production. DeepMind is working on agent training at scale through SIMA. LangChain's infrastructure team publishes pragmatic findings on what actually breaks in production.

Benchmarks Worth Knowing

AgentBench evaluates agents across 8 environments including code, database, and web tasks. WebArena tests realistic web navigation. SWE-bench is the most demanding: real GitHub issues requiring working code fixes. ALFWorld is a text-based interactive environment used in the original ReAct paper. ToolBench evaluates tool selection across a large library.

Conclusion

Agentic AI is not primarily a model story. The models have not changed fundamentally. What changed is the architecture around them. ReAct gave agents a structured reasoning-action cycle. Reflexion gave them a mechanism to improve from failure within a session. Multi-agent frameworks gave them specialization. Memory systems gave them persistence across queries.

The hard problems are real. Long-horizon planning still drifts. Memory poisoning is still a live issue. Prompt injection has no clean solution. There is no formal way to guarantee agent behavior.

But the direction is clear. The next frontier for LLMs is not better completions. It is better decisions.

References

Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
Wu, Q., et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.
Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023.
Liu, X., et al. (2023). AgentBench: Evaluating LLMs as Agents. ICLR 2024.
Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023.
Russell, S., and Norvig, P. (2020). Artificial Intelligence: A Modern Approach, 4th ed. Pearson.

This is part of a weekly series on AI/ML research. Each post covers theory, recent papers, and experiments I run myself.

Connect on LinkedIn | GitHub: weekly-AI-ML-research

# Small Language Models: Rethinking What Intelligence Actually Requires

soohan abbasi — Sun, 24 May 2026 12:30:00 +0000

"Scale solves everything — until it doesn't."

Introduction: A Result Nobody Predicted

In April 2024, Microsoft published a technical report with a claim that most researchers found difficult to take seriously at first. Their new model, Phi-3 Mini, had 3.8 billion parameters. GPT-3 had 175 billion. GPT-4's size has never been officially disclosed, but it is widely rumored to be above a trillion. And yet Phi-3 Mini outperformed GPT-3 on standard benchmarks, approached GPT-3.5 on several tasks, and ran entirely on a laptop with no internet connection.

The response from the research community was not celebration. It was confusion. The scaling laws, the empirical relationships between model size, data, compute, and performance, had held for years. They were the closest thing the field had to a reliable theory of how intelligence emerges in these systems. Phi-3 did not break the scaling laws, but it suggested something the field had underweighted: the laws describe what scale can do, not what scale is required to do it.

The question Phi-3 raised is not "how small can we go?" It is something more fundamental: what does a language model actually need in order to reason well?

That is what this post is about. I spent this week reading the papers, running three experiments on Kaggle, and trying to build an honest picture of where SLMs stand today — what they can genuinely do, what they cannot, and why the answer matters more than most benchmark tables suggest.

Part 1: Why Small Language Models Exist at All

By 2023, the dominant AI paradigm was clear: train larger models on more data with more compute. GPT-4, PaLM 2, Gemini Ultra — each required infrastructure that only a handful of organizations on earth could afford. Training costs ran into tens or hundreds of millions of dollars.

This created a real problem. Most AI applications do not need a trillion-parameter model. They need something reliable, fast, cheap, and ideally not dependent on a cloud API that sends data to an external server. Finance, legal, government: the domains with the strongest AI use cases are also the ones with the strictest data privacy requirements. There is no local deployment option for GPT-4.

SLMs emerged as a direct response. Not as a compromise, but as a deliberate design decision: build the smallest model that can reliably do a specific set of tasks.

Around the same time, a quieter debate was happening in research circles. A group of papers, starting with the original Phi-1 work in 2023, made a provocative argument: the reason large models outperform small ones is not primarily because they are larger. It is because they are trained on more data, and most of that data is low quality. Filter the data aggressively, keep only dense reasoning-heavy content, and a much smaller model performs surprisingly well.

This is sometimes called the textbook hypothesis: a model trained on textbook-quality material learns to reason better than a model trained on ten times as much internet text. The Phi series became the primary empirical test of this hypothesis, and the results were striking enough that the idea is now taken seriously across the field.

There is no official definition of small, but the community generally treats anything under 7 billion parameters as an SLM:

Category	Parameters	Examples
Large	70B and above	GPT-4, Claude 3 Opus, Llama 3 70B
Medium	7B to under 70B	Mistral 7B, Llama 3 8B
Small	<7B	Phi-3 Mini (3.8B), Gemma 2B, TinyLlama 1.1B

Part 2: How SLMs Are Actually Built

Knowledge Distillation

The most important technique behind high-performing SLMs is knowledge distillation, and it is worth understanding properly rather than just naming it.

Standard training optimizes against ground truth labels: a math problem has a correct answer and the model learns to produce it. But this only tells the model what the right answer is. It says nothing about the shape of the problem space — which wrong answers are close, which are far, what the structure of uncertainty looks like.

A large teacher model, when it answers a question, produces a full probability distribution over all possible next tokens. If the teacher gives "Paris" 80% probability and "Lyon" 15% for a question about the capital of France, that 15% carries real information. These two answers are related in a way that "banana" and "Paris" are not. The distribution encodes structured knowledge about relationships between concepts.

Distillation trains the student to match the teacher's full distribution, not just the top answer. The student learns from the teacher's uncertainty, not just its correctness. This is why a 3.8B model trained with distillation can outperform a 7B model trained without it.
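
A small sketch of what "matching the full distribution" means numerically. It computes the KL divergence between temperature-softened teacher and student distributions, which is the standard soft-target formulation. The three-token vocabulary and all logit values are invented for illustration. Real pipelines usually add a hard-label term and a temperature-squared scaling factor, both omitted here.

```python
import math
from typing import List

def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits: List[float], student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy vocabulary: ["Paris", "Lyon", "banana"]
teacher = [4.0, 2.3, -3.0]
close_student = [3.8, 2.0, -2.5]          # similar shape to the teacher
hard_label_student = [9.0, -9.0, -9.0]    # confident in "Paris", treats "Lyon" like "banana"

loss_close = distillation_loss(teacher, close_student)
loss_hard = distillation_loss(teacher, hard_label_student)
print(f"close student: {loss_close:.4f}, hard-label student: {loss_hard:.4f}")
```

Both students pick "Paris" as the top answer, so a hard-label loss would barely distinguish them. The distillation loss penalises the second student heavily because it discards the teacher's signal that "Lyon" is a near miss.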

The Orca and Alpaca Results

The most compelling demonstration of distillation's power came from Microsoft's Orca papers in 2023. Orca was a 13B model fine-tuned not just on GPT-4 answers but on GPT-4's full reasoning traces — step-by-step explanations of how it arrived at each answer. Orca outperformed models five times its size on several reasoning benchmarks.

Orca 2 pushed further and showed that smaller models could be explicitly taught when to use different reasoning strategies — step-by-step for complex problems, direct answers for simple ones. This was not emerging naturally from scale. It was being deliberately taught through the quality of the training signal.

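Stanford's Alpaca showed a related result: a 7B LLaMA model fine-tuned on 52,000 instruction examples generated by OpenAI's text-davinci-003 behaved similarly to that GPT-3.5-series model in the authors' preliminary instruction-following evaluation. The fine-tune took about three hours on eight A100 GPUs, and the whole project reportedly cost under $600. The gap between open and closed models narrowed overnight.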

The bottleneck was never parameter count. It was training signal quality.

Quantization

Running a model locally requires fitting it in memory. A 7B model in 32-bit floating point takes roughly 28GB of RAM. This is where quantization comes in.

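Quantization reduces numerical precision. Instead of storing each parameter as a 32-bit float, you store it as an 8-bit or 4-bit integer. The memory savings are proportional to the bit width: against a 32-bit baseline, 8-bit cuts the footprint to a quarter and 4-bit to an eighth. Against the 16-bit weights most models actually ship in, 8-bit halves it and 4-bit quarters it.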
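
The arithmetic is simple enough to check directly. This sketch counts weight storage only (no activations, KV cache, or quantization metadata) and uses decimal gigabytes; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

footprints = {bits: weight_memory_gb(7e9, bits) for bits in (32, 16, 8, 4)}
for bits, gb in footprints.items():
    print(f"7B model at {bits:>2}-bit: {gb:.1f} GB")
```

For a 7B model this gives 28 GB at 32-bit, 14 GB at 16-bit, 7 GB at 8-bit, and 3.5 GB at 4-bit. Real deployments need some headroom on top of these figures.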

For most language tasks, 8-bit quantization produces outputs essentially indistinguishable from full precision. 4-bit is where degradation becomes detectable, particularly on tasks requiring precise numerical reasoning. Techniques like GPTQ and AWQ apply quantization non-uniformly, preserving precision in the weights that matter most. My Experiment 3 results below show exactly this tradeoff in practice.
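
To see why 4-bit loses more than 8-bit, here is the naive baseline that GPTQ and AWQ improve on: uniform symmetric (absmax) quantization applied to every weight the same way. The weights are random values drawn for illustration, not taken from any model.

```python
import random
from typing import List

def quantize_roundtrip(weights: List[float], bits: int) -> List[float]:
    """Symmetric absmax quantization to signed integers, then back to floats."""
    levels = 2 ** (bits - 1) - 1                     # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels    # one scale for the whole tensor
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(1000)]  # illustrative weights only

err8 = mean_abs_error(weights, quantize_roundtrip(weights, 8))
err4 = mean_abs_error(weights, quantize_roundtrip(weights, 4))
print(f"8-bit error: {err8:.6f}, 4-bit error: {err4:.6f}")
```

With only 7 positive levels instead of 127, the 4-bit grid is much coarser, so the rounding error per weight is correspondingly larger. Non-uniform schemes spend that limited precision where it matters most.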

Efficient Architectures

Beyond training and quantization, the architecture choices in SLMs reflect deliberate engineering for inference efficiency.

Grouped query attention shares key and value projections across multiple query heads. Not every attention head needs its own unique representation — sharing costs little in model quality but significantly reduces memory during generation.
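
The memory effect shows up in the key-value cache, which scales with the number of key/value heads rather than query heads. The configuration below (32 layers, head dimension 128, 4,096 tokens, 16-bit values) is a hypothetical example, not the published settings of any particular model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)    # one KV head per query head
grouped = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)  # 4 query heads share each KV head
ratio = full / grouped
print(f"full: {full / 2**20:.0f} MiB, grouped: {grouped / 2**20:.0f} MiB, ratio: {ratio:.0f}x")
```

Under these assumed numbers the cache shrinks from 2,048 MiB to 512 MiB per sequence, a 4x reduction that matters most on memory-constrained devices.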

Sliding window attention, used in Mistral, limits each token's attention to a local window rather than the full context. For a fixed window size, this makes inference cost linear in sequence length rather than quadratic.
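
Counting the query-key pairs that a causal attention layer has to score makes the scaling difference visible. This is a counting sketch only, with a hypothetical window of 100 tokens; it does not implement attention itself.

```python
from typing import Optional

def attended_pairs(seq_len: int, window: Optional[int] = None) -> int:
    """Number of (query, key) pairs scored under causal attention."""
    total = 0
    for position in range(seq_len):
        visible = position + 1                      # a token sees itself and everything before it
        if window is not None:
            visible = min(visible, window)          # ...but at most `window` tokens
        total += visible
    return total

full_1k, full_2k = attended_pairs(1000), attended_pairs(2000)
win_1k, win_2k = attended_pairs(1000, window=100), attended_pairs(2000, window=100)
print(full_1k, full_2k, win_1k, win_2k)
```

Doubling the sequence from 1,000 to 2,000 tokens roughly quadruples the full causal count (500,500 to 2,001,000) but only about doubles the windowed count (95,050 to 195,050).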

Speculative decoding is one of the more elegant recent ideas. A small draft model generates several tokens quickly. A larger target model then evaluates all of them in a single parallel forward pass, accepting those it would have generated and rejecting the rest. Net result: significantly faster generation with no change in output quality. SLMs become accelerators for larger models rather than replacements.
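
Here is a toy sketch of one round of the greedy variant, where a token is accepted only if it matches the target model's own top choice. The published method uses a rejection-sampling rule so that sampled outputs also match the target distribution; that part is not shown. Both "models" are stub functions over a fixed sentence, and the verification loop stands in for what a real system does in a single batched forward pass.

```python
from typing import Callable, List

def speculative_step(prefix: List[str], draft_next: Callable, target_next: Callable,
                     k: int = 4) -> List[str]:
    """One round of greedy speculative decoding over token lists."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)
    # 2. The target model checks each proposed position in order.
    accepted, ctx = [], list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)    # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))    # everything matched: the target adds one extra token
    return accepted

TARGET_TEXT = "the cat sat on the mat".split()

def target_next(ctx: List[str]) -> str:
    return TARGET_TEXT[len(ctx)] if len(ctx) < len(TARGET_TEXT) else "<eos>"

def draft_next(ctx: List[str]) -> str:
    guess = target_next(ctx)
    return "rug" if guess == "mat" else guess   # the stub draft gets exactly one token wrong

all_accepted = speculative_step(["the"], draft_next, target_next, k=4)
one_rejected = speculative_step(["the", "cat", "sat"], draft_next, target_next, k=4)
print(all_accepted, one_rejected)
```

In both calls the returned tokens are exactly what the target would have produced on its own, which is why output quality is unchanged. The speedup comes from how many draft tokens survive each round.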

Part 3: What the Benchmarks Actually Show

On general benchmarks, the gap between SLMs and large models is real but not as dramatic as headlines suggest:

Benchmark	GPT-4	Phi-3 Mini (3.8B)	Gemma 2B	TinyLlama 1.1B
MMLU (General)	~86%	~69%	~51%	~26%
GSM8K (Math)	~92%	~78%	~52%	~8%
HumanEval (Code)	~87%	~59%	~34%	~12%
ARC-Challenge	~96%	~85%	~71%	~45%

Before treating these numbers as deployment guidance, there is a problem worth understanding. Academic benchmarks are published on the internet and may appear in training data. A model that has seen the test set during training is being evaluated on recall, not reasoning. This affects all language model benchmarks. Treat these numbers as upper bounds, not precise measurements.

Where SLMs genuinely dominate is cost, privacy, and hardware requirements:

Metric	GPT-4 API	Phi-3 Mini (Local)
Cost per 1M tokens	~$10 to $30	$0 marginal (after hardware and power)
First token latency	500ms to 2s	less than 100ms
Throughput on GPU	Cloud only	200+ tok/s
Data privacy	Sent to external server	Fully on-device

And the hardware floor matters enormously. GPT-4 has no local deployment option. Llama 3 70B requires roughly 40GB of VRAM even when quantized to 4-bit. Phi-3 Mini, quantized, runs in about 4GB of RAM, which puts it within reach of a MacBook or a Raspberry Pi 5. TinyLlama at 1.1B fits in roughly 700MB when quantized, enough for embedded devices.

My Own Experiments: Three Tasks, Two Models, Real Numbers

For this week's experiments I ran Phi-3 Mini locally on Kaggle using a T4 GPU and compared it against Llama 3.3 70B via the Groq API. I designed three prompts myself to cover different reasoning types rather than using a standard benchmark dataset. The choice was intentional: I wanted to observe how the models behave on naturally phrased tasks, not just benchmark-formatted questions. For rigorous evaluation at scale, datasets like GSM8K and HumanEval would be the right choice, and that is something I plan to revisit in a later week.

Experiment setup:


Setting	Value
Model A	Microsoft Phi-3 Mini 4k Instruct (3.8B, FP16)
Model B	Llama 3.3 70B via Groq API
Hardware	Kaggle T4 GPU
Framework	HuggingFace Transformers 4.44.0
Code	GitHub → weekly-AI-ML-research/week02-slms

The three prompts I used:

ID	Type	Prompt summary
T-001	Math reasoning	Multi-step word problem involving apples, oranges, and change calculation
T-002	Code understanding	Identify what a Python function does and spot a hidden bug
T-003	Language reasoning	Identify the core argument in a technical paragraph

Experiment 1: Phi-3 Mini Inference

ID	Task	Latency	Speed	Correct?
T-001	Math reasoning	32.4s	5.7 tok/s	Yes, $7.60
T-002	Code understanding	13.2s	19.3 tok/s	Yes, ZeroDivisionError found
T-003	Language reasoning	1.9s	19.2 tok/s	Yes, clean one-sentence summary

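T-001 was the slowest, but not simply because it generated the most text. Multiplying latency by speed gives roughly 185 tokens for T-001 and roughly 255 for T-002, so the gap comes from T-001's much lower throughput (5.7 tok/s against about 19 tok/s). Since T-001 ran first, one-time warm-up overhead on the first generation call is a likely cause, though this run does not isolate it. T-003 required only one sentence so it finished in under two seconds. All three answers were correct.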

Experiment 2: Llama 3.3 70B vs Phi-3 Mini

ID	Type	Llama 3.3 70B	Phi-3 Mini	Quality gap
T-001	Math reasoning	0.5s (Cloud)	13.3s (Local)	Minimal
T-002	Code understanding	1.0s (Cloud)	13.7s (Local)	Noticeable
T-003	Language reasoning	0.3s (Cloud)	2.0s (Local)	Minimal

Llama 3.3 70B was faster on every task, which is expected since it runs on Groq's optimized cloud infrastructure. But the quality gap was smaller than I expected. Both models got the correct answers on T-001 and T-003. On T-002, Llama gave a richer explanation of why the ZeroDivisionError occurs. Phi-3 identified the bug correctly but explained it more shallowly. For use cases where correctness is what matters rather than explanation depth, Phi-3 holds up well. For use cases where explanation quality matters, the gap is real.

The more important comparison is not latency but deployment context. Llama 3.3 70B through Groq costs money and sends your data to an external server. Phi-3 Mini costs nothing after hardware and never leaves your machine.

Experiment 3: Quantization in Practice

Config	VRAM	Latency	Speed
FP16 (baseline)	3.89 GB	16.9s	11.8 tok/s
8-bit	1.86 GB	23.1s	8.6 tok/s
4-bit NF4	1.08 GB	13.9s	13.1 tok/s

The 8-bit result is the practical takeaway for memory: VRAM drops by more than half with no meaningful quality loss on these tasks, although generation was slower than FP16 in this run (8.6 vs 11.8 tok/s). 4-bit was the most surprising result. It used the least memory (1.08 GB) and was actually faster than FP16 on this task (13.1 vs 11.8 tok/s). The response quality on a conceptual explanation task was comparable across all three configurations. The degradation from quantization shows up more clearly on tasks requiring precise multi-step numerical reasoning, which is exactly where SLMs are already weakest.

Part 4: The Emergent Capabilities Problem

One of the more surprising findings in scaling research was the concept of emergent capabilities: abilities that appear suddenly in large models and are essentially absent in smaller ones. Few-shot learning, multi-step arithmetic, and chain-of-thought reasoning were all identified as capabilities that emerge with scale.

SLMs challenge this picture but do not fully overturn it. What the Phi and Orca results show is that some apparently emergent capabilities can be induced in smaller models through better training. The capability was not truly emergent — it was underspecified by the training data. Give the model a better signal and the capability appears at smaller scale.

But some capabilities appear to be genuinely scale-dependent. Complex multi-step mathematical reasoning, reliable code generation for non-trivial programs, coherent reasoning across very long contexts — these degrade noticeably as model size decreases, even with high-quality training and distillation.

The uncomfortable implication is that we do not have a reliable theory for which capabilities are genuinely scale-dependent and which are just undertrained in smaller models. The only way to find out is to try empirically for your specific task.

Part 5: The On-Device AI Movement

The most significant infrastructure shift happening around SLMs is not in data centers. It is on consumer devices.

Apple's Neural Engine, present in every iPhone since the A11 chip, is now powerful enough to run models in the 1 to 3B parameter range at reasonable speeds. Apple Intelligence uses a 3B on-device model for most tasks, calling a larger cloud model only when necessary. The privacy argument is central: your data never leaves the device.

Qualcomm's Snapdragon X Elite, targeting Windows laptops, includes dedicated NPU hardware rated for 45 TOPS. Microsoft's Copilot+ PC initiative is built around this, with on-device models handling real-time summarization and other features locally.

Google's Gemini Nano runs on Pixel phones and recent Android devices, enabling on-device summarization and voice transcription without cloud calls.

This hardware push reflects a bet that the next generation of AI features will be defined not by which cloud model is most capable but by which on-device model is fast, private, and reliable enough to be always available. SLMs are the only class of model that can compete in this environment.

Part 6: Fine-Tuning — Power and Trap

A base SLM is a generalist with limited specialized knowledge. Fine-tuning on domain-specific data produces dramatic improvements on narrow tasks. Parameter-efficient methods like LoRA make this practical: instead of updating all model weights, LoRA introduces small trainable matrices that approximate the updates. A LoRA fine-tune of a 7B model can be done in a few hours on a single consumer GPU with a few thousand examples.
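
A minimal numeric sketch of the low-rank idea, in pure Python rather than a training framework. The effective weight is the frozen matrix plus a scaled product of two small matrices. Dimensions, rank, and `alpha` are toy values chosen for illustration, and no training step is shown.

```python
import random
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(W: Matrix, A: Matrix, B: Matrix, alpha: float, r: int) -> Matrix:
    """Effective weight W + (alpha / r) * B @ A. W itself is never modified."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

d_out, d_in, r = 6, 8, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen base weights
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                # trainable, d_out x r, starts at zero

W_eff = lora_weight(W, A, B, alpha=4, r=r)
full_params = d_out * d_in          # what full fine-tuning would update
lora_params = r * (d_in + d_out)    # what LoRA updates
print(full_params, lora_params, W_eff == W)
```

Because `B` starts at zero, the adapted model is identical to the base model before training begins. The savings look modest at this toy size (48 vs 28 parameters), but for a 4096 by 4096 matrix at rank 8 the comparison is 16,777,216 against 65,536.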

The trap is catastrophic forgetting. When you fine-tune on domain-specific data, the model improves on that domain at the cost of general capability. It overwrites some prior knowledge with new patterns. A model fine-tuned aggressively on legal documents may produce excellent legal summaries and poor responses to everything else.

LoRA mitigates this significantly because you are not modifying base weights directly. But it does not eliminate the problem entirely. Fine-tuning requires evaluating not just the target task but also the general capabilities you want to preserve.

Part 7: Where SLMs Genuinely Cannot Compete

Being honest about hard limits is more useful than optimism.

For problems requiring many intermediate results held simultaneously — advanced mathematics, multi-constraint planning — large models are meaningfully better and fine-tuning does not close the gap. Maintaining complex internal state during long reasoning chains appears to benefit from scale in ways data quality alone does not address.

When a task requires combining skills in configurations the model has not seen before — not applying a familiar pattern but genuinely constructing a new approach — smaller models are more brittle than benchmark numbers suggest.

Maintaining a coherent thread across 100,000+ tokens is qualitatively harder for smaller models even when they technically support the context window. The model loses track of earlier constraints in ways that compound over long sequences.

Large models follow a wider range of novel instructions reliably. Smaller models are more sensitive to exact prompt phrasing — small wording changes produce larger output quality changes than you would expect.

Part 8: Open Questions

Is the textbook hypothesis general? Phi-3's data approach worked for reasoning tasks. Does it transfer to less structured domains such as creative writing, open-ended dialogue, or cultural reasoning? The hypothesis has not been tested rigorously outside its original domain.

Where is the true capability floor? We know some capabilities emerge with scale. We do not know the minimum scale at which each reliably appears as a deployment characteristic rather than a benchmark number.

Can quantization go further? 2-bit and 1-bit quantization have been explored experimentally. The results are not yet good enough for general deployment. Whether this is a fundamental limit or an engineering problem is not resolved.

What happens when SLMs are wrong? Error analysis for SLMs in production is underdeveloped. Large model failures tend to be graceful — wrong but coherent. SLM failures can be less graceful. A systematic understanding of failure modes across task types would be practically valuable and is mostly absent from the literature.

How does on-device AI change development practices? If inference moves to the edge, evaluation, updating, and monitoring all change significantly. The MLOps infrastructure built around centralized cloud inference does not translate directly to a world where models run on millions of individual devices.
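
On the quantization question above, a toy experiment gives a feel for why very low bit widths are hard. The sketch below applies plain symmetric round-to-nearest quantization to a handful of made-up weights and measures the reconstruction error. It is a deliberately naive baseline under stated assumptions, not GPTQ or AWQ, which exist precisely because they do better than this.

```python
def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    if bits < 2:
        raise ValueError("this toy scheme needs at least 2 bits")
    levels = 2 ** (bits - 1) - 1          # largest integer code, e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    codes = [max(-levels, min(levels, round(w / scale))) for w in weights]
    return [c * scale for c in codes]


def mean_squared_error(original, restored):
    return sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)


# Made-up weights for illustration only.
weights = [0.9, -0.31, 0.12, 0.05, -0.77, 0.44]

for bits in (8, 4, 2):
    restored = quantize_dequantize(weights, bits)
    print(bits, "bits -> MSE", mean_squared_error(weights, restored))
```

With 2 bits, each weight in this scheme can only land on one of three values (-scale, 0, +scale), so most of the small weights collapse to zero and the error grows sharply.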

Papers Worth Reading

Paper	What It Contributes	Venue
Gunasekar et al. (2023)	Phi-1: the textbook hypothesis	NeurIPS 2023
Abdin et al. (2024)	Phi-3 technical report	arXiv 2404
Mukherjee et al. (2023)	Orca: learning from GPT-4 explanations	arXiv 2306
Mitra et al. (2023)	Orca 2: teaching reasoning strategies	arXiv 2311
Taori et al. (2023)	Alpaca: instruction following at 7B	Stanford HAI
Zhang et al. (2024)	TinyLlama: pretraining at 1.1B	arXiv 2401
Frantar et al. (2022)	GPTQ: post-training quantization	arXiv 2210
Lin et al. (2023)	AWQ: activation-aware weight quantization	arXiv 2306
Leviathan et al. (2023)	Speculative decoding	ICML 2023
Hu et al. (2021)	LoRA: low-rank adaptation	ICLR 2022

Research Groups Doing Relevant Work

Microsoft Research's Phi team has produced the most sustained empirical investigation of the data quality hypothesis. Meta's LLaMA team made open weights standard practice and enabled the fine-tuning ecosystem most SLM work depends on. Hugging Face's evaluation team, particularly the Open LLM Leaderboard and their contamination research, is essential for understanding what benchmark numbers actually mean. On hardware, Qualcomm Research and Apple's ML team are defining what on-device inference looks like in practice. MIT's Han Lab has done foundational work on quantization and efficient inference.

Benchmarks You Should Know

MMLU covers 57 subjects across multiple domains and is useful for measuring breadth but is known to have contamination issues. ARC-Challenge focuses on scientific reasoning that requires inference rather than recall. GSM8K has 8,500 grade school math problems requiring multi-step reasoning and is the most widely used reasoning benchmark. HumanEval tests code generation with 164 programming problems across different difficulty levels. BIG-Bench Hard collects 23 tasks specifically designed to resist current models. HELM provides a more structured evaluation framework with explicit contamination controls. For any production decision, treat published benchmark numbers as approximate and build an evaluation set from your actual task distribution.

Conclusion

The SLM story is not about building smaller versions of large models. It is about a set of discoveries that have changed what we understand about the relationship between model size and capability.

Data quality can substitute for scale to a remarkable degree. Distillation can transfer knowledge across size boundaries in ways flat training data cannot. Efficient architectures reduce the hardware floor without meaningful capability loss. And the on-device movement is creating a deployment environment where the question is not "what is the best model?" but "what is the best model that fits this constraint set?"

Running these experiments myself made the tradeoffs concrete in a way that reading papers alone does not. Phi-3 Mini got every answer right. It was slower than the cloud alternative, but it ran locally, cost nothing per query, and required no data to leave the machine. For many real applications, that is not a compromise. It is exactly what you need.

The scaling laws are not wrong. But they were describing one path to capability. The research of the last two years has found that there are others, and some of them are more practical for the problems most people actually need to solve.

Next week: Retrieval-Augmented Generation. How do you give a language model access to knowledge it was never trained on? What actually happens when retrieval goes wrong? And does RAG actually solve the hallucination problem, or just change where the failures occur?

References

Gunasekar, S., et al. (2023). Textbooks Are All You Need. NeurIPS 2023.
Abdin, M., et al. (2024). Phi-3 Technical Report. arXiv:2404.14219.
Mukherjee, S., et al. (2023). Orca: Progressive Learning from Complex Explanation Traces. arXiv:2306.02707.
Mitra, A., et al. (2023). Orca 2: Teaching Small Language Models How to Reason. arXiv:2311.11045.
Taori, R., et al. (2023). Stanford Alpaca: An Instruction-following LLaMA Model. Stanford HAI.
Zhang, P., et al. (2024). TinyLlama: An Open-Source Small Language Model. arXiv:2401.02385.
Frantar, E., et al. (2022). GPTQ: Accurate Post-Training Quantization for GPT. arXiv:2210.17323.
Lin, J., et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression. arXiv:2306.00978.
Leviathan, Y., et al. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.
Hu, E., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.

Experiment code: GitHub → weekly-AI-ML-research/week02-slms

This is part of a weekly series on AI/ML research. Each post covers theory, recent papers, and experiments I run myself.

Connect on LinkedIn: Soohan Abbasi

Chain-of-Thought and Beyond: How LLMs Actually Learn to Reason

soohan abbasi — Sat, 16 May 2026 12:30:00 +0000

"The ability to reason step-by-step is not just a feature. It might be the difference between a language model that sounds intelligent and one that actually is."

Introduction: When AI Started Thinking

In 2022, researchers at Google Brain published a paper titled "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". At the time, nobody quite anticipated it would mark the beginning of a shift that would reshape the entire AI field.

The idea was simple: instead of asking a model to answer directly, give it time to think. Ask it to write out intermediate steps. Accuracy improves dramatically.

That paper now sits at over 10,000 citations. But the question it raised has never been fully answered:

Do LLMs actually think? Or do they create a very convincing illusion of thinking?

That is what this blog is about. And as someone preparing for a PhD in AI, it is a question I keep coming back to.

Part 1: What Is Chain-of-Thought?

Standard Prompting vs. CoT Prompting

Imagine asking a model this:

"Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many does he have now?"

With standard prompting, the model jumps straight to: "11"

With chain-of-thought prompting, the model works through it first:

Roger starts with 5 balls.
2 cans × 3 balls = 6 balls.
5 + 6 = 11 balls.
Answer: 11

Both get the same answer. So what is the point?

The gap shows up on harder problems. Models that reason through steps outperform those that answer directly on multi-step math, symbolic reasoning, and commonsense problems. The more complex the task, the bigger the difference.

Zero-Shot CoT: One Phrase Changes Everything

In the same year, researchers discovered something even more surprising. Simply adding the phrase "Let's think step by step" to a question, without any examples, significantly improved reasoning accuracy.

No demonstrations. No fine-tuning. Just those five words.

This became known as zero-shot CoT. And the obvious follow-up question is: why does this even work?

My Own Experiment: Testing CoT on GSM8K

Before going deeper into the theory, I wanted to test this myself. So I ran a small experiment using an open-source model on a standard benchmark.

Setup:

Model: Qwen 2.5 1.5B Instruct (free, runs on Kaggle GPU)
Dataset: GSM8K (grade school math problems)
Test: Standard prompting vs. "Let's think step by step"
Sample: 10 problems
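
A comparison like this can be sketched roughly as follows. The prompt wording, the `extract_final_number` heuristic, and the `fake_model` stub are assumptions for illustration; in a real run the stub would be replaced by a call to the model, and the actual experiment code may differ.

```python
import re

COT_SUFFIX = "Let's think step by step."


def build_prompt(question, use_cot):
    """Standard prompt asks for the answer directly; zero-shot CoT appends the trigger phrase."""
    if use_cot:
        return f"Q: {question}\nA: {COT_SUFFIX}"
    return f"Q: {question}\nA: The answer is"


def extract_final_number(text):
    """Heuristic: treat the last number in the output as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(numbers[-1]) if numbers else None


def accuracy(problems, generate, use_cot):
    """Fraction of problems whose extracted answer matches the gold answer."""
    correct = 0
    for question, gold in problems:
        output = generate(build_prompt(question, use_cot))
        if extract_final_number(output) == float(gold):
            correct += 1
    return correct / len(problems)


def fake_model(prompt):
    """Stand-in for a real model call so the sketch runs without a GPU."""
    if COT_SUFFIX in prompt:
        return "Roger starts with 5 balls. 2 cans x 3 balls = 6 balls. 5 + 6 = 11."
    return "9"


problems = [("Roger has 5 tennis balls. He buys 2 more cans of 3. How many now?", 11)]
print(accuracy(problems, fake_model, use_cot=False))
print(accuracy(problems, fake_model, use_cot=True))
```

Taking the last number in the output is a common but fragile way to score free-form answers, so a few outputs are worth checking by hand.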

Results:

Approach	Correct	Accuracy
Without CoT	2/10	20%
With CoT	3/10	30%

[CoT vs No-CoT Results on GSM8K]

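Even on a model roughly 360 times smaller than the one used in the original paper, the improvement showed up. A single phrase shifted accuracy by 10 percentage points, from 20% to 30%. With only 10 problems that is a difference of one question, so read it as an illustration of the effect rather than a measurement of its size.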

A few things stood out from the per-problem breakdown:

Problem 1 was solved correctly with CoT, but not without it. Problem 7 showed the same pattern. Problem 4 was solved correctly either way. But Problem 6 was actually solved correctly without CoT and incorrectly with it. The model overthought a straightforward calculation and got it wrong.

That last observation matters and connects to something I discuss in Part 4.

Quick note: the overall accuracy numbers look low because this model is tiny compared to what the original paper used. The point here is the relative difference, not the absolute numbers.

Part 2: What Is Actually Happening Inside the Model?

More Than Pattern Matching

The common criticism of LLMs is that they are sophisticated autocomplete. They match patterns from training data rather than genuinely reasoning. This criticism is not entirely wrong, but it is incomplete.

Between 2023 and 2024, researchers doing mechanistic interpretability work found some interesting things inside these models.

LLMs contain specific reasoning circuits: groups of neurons and attention heads that work together to perform logical operations. They use something called induction heads, which are attention patterns that identify sequences in context and predict what follows. Some models have developed implicit world models, meaning they internally represent concepts like spatial relationships, time, and causality.

None of this was explicitly programmed. It emerged from training on text.

The picture that comes out of this research is more interesting than "just pattern matching." These models have developed internal structures that support reasoning-like behavior. Whether that constitutes real reasoning is a separate philosophical question, but it is clearly more than autocomplete.

Process Reward Models: Grading the Work, Not Just the Answer

Here is an idea that changed how reasoning models are trained. Instead of grading only the final answer, what if you graded every individual reasoning step?

That is the core of a Process Reward Model (PRM).

In standard training, the model produces an answer and gets told whether it was right or wrong. In PRM-based training, each step in the reasoning chain gets its own score. A wrong step gets flagged early, before it derails the rest of the solution.
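
The difference between the two reward signals can be sketched in a few lines. The step scores below are hand-written placeholders standing in for a trained PRM's outputs, and taking the minimum step score as the solution score is just one simple aggregation choice, assumed here for illustration.

```python
def outcome_reward(final_answer, gold_answer):
    """Outcome-based reward: one signal for the whole solution."""
    return 1.0 if final_answer == gold_answer else 0.0


def process_reward(step_scores, threshold=0.5):
    """Process-based reward: score every step and report the first weak one.

    Returns (solution_score, index_of_first_flagged_step_or_None).
    """
    first_bad = next((i for i, s in enumerate(step_scores) if s < threshold), None)
    return min(step_scores), first_bad


steps = [
    "Roger starts with 5 balls.",
    "2 cans x 3 balls = 5 balls.",   # arithmetic slip
    "5 + 5 = 10 balls.",
]
# Placeholder per-step scores, as a trained PRM might assign them.
step_scores = [0.97, 0.12, 0.85]

print(outcome_reward(final_answer=10, gold_answer=11))
score, first_bad = process_reward(step_scores)
print(score, first_bad, steps[first_bad])
```

The outcome reward only says the solution failed. The process reward also points at the step where it went wrong.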

OpenAI's 2023 paper "Let's Verify Step by Step" showed that PRMs significantly outperform outcome-based reward models on mathematical reasoning tasks.

This idea became the foundation for something much bigger, which I will cover in Week 12 when we get to test-time compute scaling.

Part 3: OpenAI o1 and DeepSeek-R1

OpenAI o1: Giving Models Time to Think

In September 2024, OpenAI released o1, and the response from the research community was immediate.

The idea behind o1 is straightforward. Give the model more time to think at inference time. Before producing an answer, o1 generates a hidden chain of thought that the user never sees, but the model uses internally. This chain is trained with reinforcement learning: the model gets rewarded for reaching correct answers, which teaches it to develop better internal reasoning strategies.

The results on AIME 2024, a notoriously difficult high school math competition, were striking. GPT-4o scored 12%. o1 scored 74%.

That is not a small improvement. That is a different class of performance, driven almost entirely by letting the model think longer.

DeepSeek-R1: The Open Source Answer

In January 2025, a Chinese startup called DeepSeek released R1, and it caused genuine disruption in the Western AI community.

DeepSeek-R1 matched o1-level performance at a fraction of the training cost. And it was fully open source.

Three technical contributions made this possible.

Group Relative Policy Optimization (GRPO): Standard PPO-based RLHF needs a separate critic (value) model to estimate how good each response is, which adds significant overhead. GRPO removes that requirement. Instead, the model generates a group of responses to the same question, scores each one, and uses the group's own average as the baseline: responses that score above the group mean are reinforced, and those below it are discouraged. No critic needed.
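
The group-relative part can be sketched as below: each sampled response's reward is normalised against the mean and standard deviation of its own group, which gives the advantage signal. The rewards are made-up 0/1 correctness scores, the use of population standard deviation is an assumption (implementations differ), and the sketch leaves out the clipped policy update and KL penalty that the full algorithm also uses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question; 1.0 = correct, 0.0 = incorrect (placeholder values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct responses get a positive advantage and incorrect ones a negative advantage, with no learned value model involved. If every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.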

Warm Start Before RL: DeepSeek first tried applying reinforcement learning directly to the pretrained base model (R1-Zero). It learned to reason, but its outputs were hard to read and often mixed languages. For R1, the approach was to first run supervised fine-tuning on a small set of curated reasoning examples to give the model a reasonable starting point, then apply RL on top of that. A sensible idea that turned out to matter a lot.

Emergent Reasoning Behaviors: During training, R1 developed behaviors that were never explicitly programmed. The model began catching its own mistakes mid-reasoning and reconsidering. It started verifying its own answers before finalizing them. It explored alternative solution paths. These behaviors just appeared from the training process. For researchers trying to understand what is happening inside these models, this is genuinely interesting territory.

Part 4: Where CoT Fails

Unfaithful Reasoning

One of the more unsettling findings in recent research is that CoT explanations do not always reflect what the model actually computed.

Anthropic's 2023 research showed that models sometimes produce post-hoc rationalizations. They settle on an answer through some internal process, then construct a reasoning chain that appears to justify it. The explanation and the computation are decoupled.

What the model writes as its reasoning may not be what actually happened.

Reasoning or Memorization?

There is a deeper question underneath CoT performance: is the model actually reasoning, or is it recalling reasoning-shaped patterns from its training data?

Researchers created a symbolic variant of GSM8K where the logic of each problem stayed the same, but surface features like numbers and names were changed. Performance dropped significantly. If the model were truly reasoning about the structure of the problem, this change should not matter. The fact that it does suggests some of the apparent reasoning is memorization in disguise.
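
The idea behind such a variant is easy to sketch: keep the problem's logic as a template and resample the surface details. The template, names, and number ranges below are illustrative choices, not taken from the published benchmark.

```python
import random

TEMPLATE = (
    "{name} has {start} tennis balls. {name} buys {cans} more cans of tennis balls. "
    "Each can has {per_can} tennis balls. How many tennis balls does {name} have now?"
)
NAMES = ["Roger", "Amina", "Wei", "Sofia"]


def make_variant(rng):
    """Sample new surface features; the underlying logic (start + cans * per_can) is fixed."""
    values = {
        "name": rng.choice(NAMES),
        "start": rng.randint(2, 20),
        "cans": rng.randint(2, 9),
        "per_can": rng.randint(2, 6),
    }
    question = TEMPLATE.format(**values)
    answer = values["start"] + values["cans"] * values["per_can"]
    return question, answer


rng = random.Random(0)  # fixed seed so the variants are reproducible
for _ in range(3):
    question, answer = make_variant(rng)
    print(answer, "|", question)
```

A model that has learned the structure should score about the same on every variant. A model leaning on memorized surface patterns will not.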

The Overthinking Problem

My experiment showed a small version of this. On Problem 6, the model solved it correctly without CoT. With CoT, it added extra steps, got confused, and got it wrong.

Researchers have documented this pattern at scale. Longer reasoning chains are not always better. Past a certain point, additional steps introduce errors rather than correct them. This is usually called "overthinking."

Compositional Generalization

LLMs also struggle when they need to combine reasoning skills in novel ways. They can handle familiar patterns well. But put two familiar patterns together in a configuration the model has not seen, and performance degrades. This suggests the reasoning ability is less flexible and generalizable than it might appear from benchmark numbers.

Part 5: What We Still Do Not Know

CoT has genuinely advanced what language models can do. But there are open questions that the field has not resolved.

Are the Explanations Honest?

When a model shows its reasoning, is that actually what happened computationally? The unfaithful reasoning research says it often is not. We do not have reliable tools to check whether a model's stated reasoning matches its internal computation. This matters a lot if you want to trust the reasoning, not just the answer.

Where Does Reasoning End and Memorization Begin?

The symbolic variant experiments raise a question that nobody has cleanly answered yet. For any given correct reasoning chain, how much of it reflects genuine logical inference versus pattern recall? The boundary is not well defined.

Why Does CoT Work in English and Struggle Elsewhere?

Almost all CoT research was conducted in English. When you apply the same techniques to Arabic, Urdu, or other lower-resource languages, performance drops noticeably. Whether this is primarily a data coverage problem or something more structural about how reasoning transfers across language families is still an open question.

Can We Formally Verify a Reasoning Step?

A calculator gives you a provably correct answer. An LLM gives you a confident one. There is currently no reliable way to formally verify whether an individual step in an LLM's natural-language reasoning chain is logically valid. Researchers are exploring integrations with formal theorem provers such as Lean 4, but this remains largely unsolved.

Does Interpretability Scale?

Mechanistic interpretability research has produced real insights at small model scales: specific circuits identified, specific behaviors localized. But as models grow to hundreds of billions of parameters, these techniques become computationally impractical. How interpretability research keeps pace with model scale is an open problem.

Papers Worth Reading

Paper	What It Contributes	Venue
Wei et al. (2022)	Original CoT paper	NeurIPS 2022
Kojima et al. (2022)	Zero-shot CoT discovery	NeurIPS 2022
Lightman et al. (2023)	Process Reward Models	OpenAI Tech Report
DeepSeek-AI (2025)	GRPO and DeepSeek-R1	arXiv 2501
Turpin et al. (2023)	Unfaithful reasoning	NeurIPS 2023
Wang et al. (2022)	Self-consistency via majority voting	ICLR 2023

Research Groups Doing Interesting Work Here

Anthropic's interpretability team is doing some of the most rigorous work on understanding what is happening inside these models. DeepMind's Gemini team is pushing multimodal reasoning. MIT's BCS and CSAIL groups are connecting cognitive science with language model research. Peking University's NLP group has produced strong work on multilingual reasoning.

Benchmarks You Should Know

GSM8K covers grade school math with 8,500 problems. MATH is competition-level with 12,500 problems. MMLU covers broad knowledge across many domains. ARC-Challenge focuses on scientific reasoning. BIG-Bench Hard collects 23 tasks specifically designed to be difficult for current models.

Conclusion

Chain-of-thought prompting is one of the more surprising ideas in recent AI research. A single phrase, added to a prompt, unlocks reasoning capabilities that were already there but not being used.

And yet the central question it raised remains unanswered. Do these models actually reason, or do they produce sophisticated simulations of reasoning? The honest answer is that we do not fully know.

The gap between sounding intelligent and being intelligent is where the most interesting work in this field is happening right now.

Next week: Small Language Models. How models like Phi-3 and Gemma became serious competitors to GPT-4, and what the research landscape looks like when you do not need a data center to run your model.

References

Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. NeurIPS 2022.
Lightman, H., et al. (2023). Let's Verify Step by Step. arXiv:2305.20050.
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
Turpin, M., et al. (2023). Language Models Don't Always Say What They Think. NeurIPS 2023.
Wang, X., et al. (2022). Self-Consistency Improves Chain-of-Thought Reasoning in Language Models. ICLR 2023.
Elhage, N., et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.

Code for this experiment is available on GitHub: Week 01 Code

This is part of a weekly series on AI/ML research. Each post covers theory, recent work, and experiments I run myself.

Connect on LinkedIn: Soohan Abbasi

I Built an Offline AI Career Advisor Using Gemma 4 — Here's Exactly How It Works

soohan abbasi — Wed, 13 May 2026 06:04:46 +0000

A technical walkthrough of GuidanceOS: from model loading to multi-agent orchestration, running entirely on a Kaggle T4 GPU with no internet at inference time.

I teach Computer Science. Over the years, one thing I kept seeing was students who had decent skills but no idea what to do with them. They didn't know what jobs matched their profile, what courses to take next, or how to position themselves for a career. Career guidance platforms exist, sure — but they're mostly behind paywalls, require accounts, and need a stable internet connection.

So I built GuidanceOS for the Gemma 4 Good Hackathon. The goal was simple: a fully offline AI system that takes your resume, figures out your skills, and gives you a complete career analysis — job matches, course recommendations, a 3-month learning plan, and an ATS score — all running locally on a GPU, no API calls at inference time.

Here's exactly how I built it.

The Model Choice: Why Gemma 4 e4b-it

The hackathon required using Gemma 4. Google released four variants: 2B, 4B (edge), 26B MoE, and 31B Dense. I went with gemma-4-e4b-it for a specific reason.

The "e" stands for edge-optimized. The "it" stands for instruction-tuned. On Kaggle's free T4 GPU (15GB VRAM), a naive load of even a 4B model can fail if quantization isn't handled right. With 4-bit NF4 quantization via BitsAndBytes, gemma-4-e4b-it loads in about 8.7GB — leaving headroom for inference.

One problem I ran into immediately: the stable release of Hugging Face Transformers (5.0.0 at the time) didn't recognize the gemma4 architecture. Loading the model threw:

ValueError: The checkpoint you are trying to load has model type `gemma4`
but Transformers does not recognize this architecture.

The fix was straightforward — install Transformers from the GitHub dev branch:

`pip install git+https://github.com/huggingface/transformers.git`

This bumped the version to 5.8.0.dev0, which includes the Gemma 4 model class.

The second issue was GPU memory management. Using device_map="auto" caused BitsAndBytes to split the model across CPU and GPU, which it doesn't allow in 4-bit mode:

ValueError: Some modules are dispatched on the CPU or the disk.
Make sure you have enough GPU RAM to fit the quantized model.

Solution: pin everything to a single GPU.

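import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
# 4-bit NF4 config as described above; the compute dtype is an assumption.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_PATH,  # path to the gemma-4-e4b-it weights added to the notebook
    quantization_config=bnb_config,
    device_map="cuda:0",  # keep every module on GPU 0 (no CPU/disk offload)
    dtype=torch.bfloat16,
)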

After that, the model loaded cleanly in about 3 minutes and sat at 8.7GB on GPU 0.

The Knowledge Base: TF-IDF Over 130K Records

I used two datasets:

LinkedIn Job Postings — 123,849 jobs with title, description, skills, location, experience level, and salary
Coursera Courses 2024 — 6,645 courses with title, skills, description, level, rating, and URL

For job and course matching, I built a TF-IDF index over combined text fields. For jobs, I concatenated the job title, skills description, and the first 300 characters of the full description. For courses, I combined the title, skills tags, and description.

jobs_clean['combined_text'] = (
    jobs_clean['title'] + ' ' +
    jobs_clean['skills_desc'] + ' ' +
    jobs_clean['description'].str[:300]
)

Then I fit a TfidfVectorizer with bigrams and 10,000 features:

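from sklearn.feature_extraction.text import TfidfVectorizer

jobs_vectorizer = TfidfVectorizer(
    max_features=10000,      # keep the 10,000 most frequent terms
    stop_words='english',
    ngram_range=(1, 2)       # unigrams and bigrams
)
jobs_tfidf_matrix = jobs_vectorizer.fit_transform(jobs_clean['combined_text'])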

At query time, the user's skill string gets transformed by the same vectorizer and compared against the full matrix using cosine similarity. The top-k results come back in milliseconds — no GPU needed, no network call.
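
The query-time step can be illustrated without any dependencies. The sketch below is a from-scratch, unigram-only TF-IDF with cosine similarity over three made-up job snippets. It is not scikit-learn's exact weighting (which adds smoothing and normalisation, and here also bigrams), and `search` is a hypothetical helper name.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def vectorize(tokens, idf):
    """TF-IDF weights for known terms; terms outside the vocabulary are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def build_index(docs):
    """Compute IDF over the corpus and one sparse TF-IDF vector (dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log(len(docs) / df) + 1.0 for t, df in doc_freq.items()}
    vectors = [vectorize(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def search(query, docs, idf, vectors, top_k=2):
    """Rank documents by cosine similarity to the query vector."""
    q = vectorize(tokenize(query), idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(docs[i], round(score, 3)) for score, i in scored[:top_k]]


# Made-up snippets standing in for job postings.
jobs = [
    "machine learning engineer python nlp",
    "frontend developer javascript react",
    "data analyst sql python dashboards",
]
idf, vectors = build_index(jobs)
print(search("python nlp", jobs, idf, vectors))
```

The important detail is that the query is vectorised with the vocabulary and IDF weights fitted on the corpus, never refitted on the query itself.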

I chose TF-IDF over dense vector search (FAISS + sentence embeddings) deliberately. Dense search needs an embedding model at query time, which adds latency and memory. TF-IDF is deterministic, fast, and reproducible — important when the whole point is offline-first operation.

The Inference Helper

Before building agents, I needed a clean wrapper around Gemma 4's generation. The model uses a specific chat format:

def ask_gemma(prompt, max_tokens=300, temperature=0.7):
    formatted = f"<bos><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"

    inputs = tokenizer(
        formatted,
        return_tensors="pt",
        add_special_tokens=False
    ).to("cuda:0")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    input_len = inputs["input_ids"].shape[-1]
    response = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True)

    if "<end_of_turn>" in response:
        response = response.split("<end_of_turn>")[0]

    return response.strip()

A few things worth noting here:

add_special_tokens=False — because I'm manually prepending <bos> in the prompt string. If you let the tokenizer add it automatically as well, you get a duplicate BOS token which confuses the model.

repetition_penalty=1.3 — without this, the model loops. I found this out the hard way when my first test response was 200 repetitions of "matched matched matched".

Decoding only new tokens — outputs[0][input_len:] strips the input tokens from the output before decoding. Otherwise you get the full prompt echoed back before the response.

The Four Agents

Each agent is a focused prompt sent to ask_gemma. The agents run sequentially, not in parallel — this keeps memory usage flat and avoids context window issues.

Agent 1 — Skills Analyzer

Takes the raw resume text and returns a structured output in a fixed format:

TECHNICAL SKILLS: Python, NLP, LangChain, ...
SOFT SKILLS: Communication, Teaching, ...
EXPERIENCE: 5 years
LEVEL: mid
DOMAINS: Artificial Intelligence, NLP, Education

I enforce the format in the prompt rather than post-processing with regex. Gemma 4 follows structured output instructions reliably when you give it an exact template to fill.
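
Because the template is fixed, turning Agent 1's output into something the later agents can consume takes only simple line splitting. The `parse_profile` helper below is a hypothetical sketch of that step, assuming the model returns exactly the five labelled lines shown above; real outputs may need more defensive handling.

```python
EXPECTED_FIELDS = ("TECHNICAL SKILLS", "SOFT SKILLS", "EXPERIENCE", "LEVEL", "DOMAINS")
LIST_FIELDS = ("TECHNICAL SKILLS", "SOFT SKILLS", "DOMAINS")


def parse_profile(text):
    """Parse 'LABEL: value' lines into a dict; comma-separated fields become lists."""
    profile = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        label = label.strip().upper()
        if not sep or label not in EXPECTED_FIELDS:
            continue  # ignore anything outside the template
        value = value.strip()
        if label in LIST_FIELDS:
            profile[label] = [item.strip() for item in value.split(",") if item.strip()]
        else:
            profile[label] = value
    missing = [f for f in EXPECTED_FIELDS if f not in profile]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return profile


sample = """TECHNICAL SKILLS: Python, NLP, LangChain
SOFT SKILLS: Communication, Teaching
EXPERIENCE: 5 years
LEVEL: mid
DOMAINS: Artificial Intelligence, NLP, Education"""

profile = parse_profile(sample)
skills_string = ", ".join(profile["TECHNICAL SKILLS"])  # what Agents 2 and 3 would receive
print(skills_string)
```

Raising on missing fields makes a malformed generation fail loudly instead of silently passing an empty skills string down the chain.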

Agent 2 — Career Path Advisor

Takes the extracted skills string and returns three career paths with job titles, required additional skills, USD salary ranges, and a growth potential score out of 10.

Agent 3 — Learning Plan Designer

Takes the skills and target role and returns a 3-month plan broken down by month — foundation topics in month 1, intermediate topics in month 2, advanced topics and portfolio projects in month 3.

Agent 4 — Resume and ATS Analyst

Takes the resume text and target role and returns an ATS score out of 100, three strengths, three improvement areas, missing keywords, and a suggested rewrite for the professional summary.

The skills string extracted by Agent 1 is passed directly into Agents 2 and 3, creating a lightweight chain without needing LangChain or CrewAI overhead.

The Gradio Interface

I used Gradio instead of Streamlit for one reason: on Kaggle, app.launch(share=True) generates a temporary public share URL (a gradio.live link) in a single line. No tunnel setup, no separate process.

The interface has two inputs — resume text and target role — and six output tabs, one per agent plus job matches and course recommendations.

with gr.Blocks(title="GuidanceOS") as app:
    with gr.Row():
        with gr.Column(scale=1):
            resume_input = gr.Textbox(label="Resume Text", lines=14)
            role_input   = gr.Textbox(label="Target Role")
            submit_btn   = gr.Button("Analyze My Profile", variant="primary")
        with gr.Column(scale=2):
            with gr.Tab("Skills Analysis"):
                skills_out = gr.Textbox(lines=10)
            # ... five more tabs

app.launch(share=True)

I added gr.Progress() to the main function so the UI shows which agent is running instead of just freezing. Each agent call takes 30-90 seconds on T4 — the progress bar makes it feel responsive.

End-to-End Flow

When a user clicks Analyze:

Resume text → Agent 1 → structured skills profile
Skills string → TF-IDF search → top 5 jobs from 123K LinkedIn postings
Skills string → TF-IDF search → top 5 courses from 6.6K Coursera courses
Skills string → Agent 2 → three career paths with salaries
Skills string + target role → Agent 3 → 3-month learning roadmap
Resume text + target role → Agent 4 → ATS score and improvements
All outputs → six Gradio tabs

Total time: 3-5 minutes on a T4 GPU. All computation on-device. Zero external API calls.
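
Put together, the flow is a plain sequential function. In the sketch below, `ask_llm` and `search` are stand-in callables (assumptions for illustration) so the wiring can be run without the model or the TF-IDF index, and the prompt strings are abbreviated placeholders, not the actual prompts.

```python
def analyze_profile(resume_text, target_role, ask_llm, search):
    """Run the four agents and two retrieval calls in order and collect six outputs."""
    skills = ask_llm(f"Extract skills from this resume:\n{resume_text}")           # Agent 1
    jobs = search("jobs", skills, top_k=5)                                          # TF-IDF
    courses = search("courses", skills, top_k=5)                                    # TF-IDF
    careers = ask_llm(f"Suggest three career paths for: {skills}")                  # Agent 2
    plan = ask_llm(f"Design a 3-month plan toward {target_role} given: {skills}")   # Agent 3
    ats = ask_llm(f"Score this resume for {target_role}:\n{resume_text}")           # Agent 4
    return {
        "skills": skills, "jobs": jobs, "courses": courses,
        "careers": careers, "plan": plan, "ats": ats,
    }


# Stubs that record calls so the order of operations is visible.
calls = []


def fake_llm(prompt):
    calls.append(prompt.split()[0])
    return f"<response to: {prompt[:20]}>"


def fake_search(kind, query, top_k):
    calls.append(f"search:{kind}")
    return [f"{kind}-{i}" for i in range(top_k)]


result = analyze_profile("Python teacher, 5 years", "ML Engineer", fake_llm, fake_search)
print(calls)
print(list(result))
```

Passing the model and search functions in as arguments keeps the orchestration testable on a machine with no GPU.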

What I Would Do Differently

A few things I'd change with more time:

Structured JSON output from agents. Right now the agents return plain text, at best in a fixed template. Enforcing JSON output would make the results easier to display in a proper UI, with cards instead of plain text boxes.

FAISS for course search. TF-IDF misses semantic similarity — "data analysis" and "analytics" are treated as different terms. Sentence embeddings with FAISS would improve course matching quality significantly.

Session persistence with SQLite. The current setup doesn't remember previous conversations. Adding a lightweight SQLite store would let users build on previous sessions.

SHAP explainability. I had planned to add a SHAP chart showing which skills drove each job recommendation using a Random Forest trained on the jobs dataset. It didn't make the deadline but the data pipeline supports it cleanly.
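
For the structured-JSON idea, a minimal sketch might look like the following. The field names in `REQUIRED_KEYS` and the `parse_agent_json` helper are hypothetical. The sketch assumes that a model asked for JSON will sometimes wrap it in extra text, so the parser looks for the outermost braces before validating.

```python
import json

REQUIRED_KEYS = {"ats_score", "strengths", "improvements", "missing_keywords"}


def parse_agent_json(raw):
    """Extract and validate a JSON object from model output; return None on failure."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


raw_output = (
    'Here is the analysis:\n'
    '{"ats_score": 72, "strengths": ["Python"], '
    '"improvements": ["Add metrics"], "missing_keywords": ["MLOps"]}'
)
print(parse_agent_json(raw_output))
print(parse_agent_json("Sorry, I cannot produce JSON."))
```

Returning None lets the interface fall back to showing the raw text whenever parsing or validation fails.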

Running It Yourself

The full notebook is on Kaggle:
kaggle.com/code/abbasi110/guidanceos-gemma4-offline-career-advisor

Source code on GitHub:
github.com/soohanAbbasi/GuidanceOS

You need a Kaggle account to run it. Add the gemma-4-e4b-it model and both datasets, set the accelerator to GPU T4 x2, and run all cells in order. The Gradio URL prints in the last cell.

That's the full build. If you have questions about any part of it — the quantization setup, the prompt templates, or the TF-IDF indexing — leave a comment and I'll answer.