<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Thousand Miles AI</title>
    <description>The latest articles on DEV Community by Thousand Miles AI (@thousand_miles_ai).</description>
    <link>https://dev.to/thousand_miles_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3797174%2Fd4269adf-379c-4f41-b2c2-23c03e233305.jpeg</url>
      <title>DEV Community: Thousand Miles AI</title>
      <link>https://dev.to/thousand_miles_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thousand_miles_ai"/>
    <language>en</language>
    <item>
      <title>How LLMs Actually Generate Text — Temperature, Top-K, Top-P, and the Dice Rolls You Never See</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:46:02 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/how-llms-actually-generate-text-temperature-top-k-top-p-and-the-dice-rolls-you-never-see-jop</link>
      <guid>https://dev.to/thousand_miles_ai/how-llms-actually-generate-text-temperature-top-k-top-p-and-the-dice-rolls-you-never-see-jop</guid>
      <description>&lt;p&gt;You set temperature to 0.7 because a tutorial told you to. But do you know what that actually does? Under the hood of every LLM response is a probability game — here's how the dice are loaded.&lt;/p&gt;




&lt;h1&gt;
  
  
  How LLMs Actually Generate Text — Temperature, Top-K, Top-P, and the Dice Rolls You Never See
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Every token an LLM outputs is a gamble. Understanding how that gamble works changes how you use these models forever.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Same Prompt, Three Different Answers
&lt;/h2&gt;

&lt;p&gt;Try this experiment. Open any LLM — ChatGPT, Claude, Gemini, whatever you have access to. Ask it: "Write a one-sentence product description for a coffee mug." Hit send. Copy the result. Now ask the exact same question again. And again.&lt;/p&gt;

&lt;p&gt;Three attempts. Three different sentences. Maybe slightly different, maybe wildly different. But almost certainly not identical.&lt;/p&gt;

&lt;p&gt;Why? You gave it the exact same input. The model's weights didn't change between requests. The system prompt is the same. So where does the randomness come from?&lt;/p&gt;

&lt;p&gt;It comes from the sampling step — the moment after the model calculates probabilities for every possible next word, and before it actually picks one. That choice — how the model selects from thousands of candidates — is controlled by parameters you've probably seen but maybe never understood: temperature, top-K, top-P.&lt;/p&gt;

&lt;p&gt;These aren't minor settings. They fundamentally change the model's behavior. Get them wrong, and your creative writing tool sounds robotic. Or your code assistant hallucinates syntax that doesn't exist. Or your customer support bot gives a different answer to the same question every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;If you're building anything with an LLM — even just making API calls — you're setting these parameters, whether you know it or not. Every API has defaults. Every playground has sliders. And most developers just leave them alone or copy values from tutorials without understanding what they do.&lt;/p&gt;

&lt;p&gt;Understanding sampling isn't academic — it's one of the highest-leverage ways to improve LLM output quality without changing a single word of your prompt. It also shows up in interviews constantly. "Explain how temperature works" is practically a warmup question at any AI-focused company.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Me Back Up — How an LLM Picks the Next Word
&lt;/h2&gt;

&lt;p&gt;Here's what happens every time an LLM generates a single token:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The model processes your input and produces a set of &lt;strong&gt;logits&lt;/strong&gt; — raw scores for every token in its vocabulary (typically 30,000–100,000+ tokens).&lt;/li&gt;
&lt;li&gt;Those logits go through a &lt;strong&gt;softmax function&lt;/strong&gt;, which converts them into probabilities that sum to 1.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;sampling strategy&lt;/strong&gt; picks one token from that probability distribution.&lt;/li&gt;
&lt;li&gt;That token gets appended to the output, and the whole process repeats for the next token.&lt;/li&gt;
&lt;/ol&gt;
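&lt;p&gt;The four steps fit in a few lines. Here's a sketch with a toy five-token vocabulary and made-up logits (a real model produces logits over its full vocabulary from your input):&lt;/p&gt;

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    z = np.exp(logits - logits.max())
    return z / z.sum()

rng = np.random.default_rng(0)
vocab = ["the", "a", "dog", "mug", "coffee"]       # toy vocabulary
logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])      # step 1: raw scores
probs = softmax(logits)                            # step 2: probabilities
token_id = rng.choice(len(vocab), p=probs)         # step 3: sample one token
# Step 4: append vocab[token_id] to the output and loop.
```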

&lt;p&gt;The model generates text one token at a time, left to right. It doesn't plan ahead. It doesn't have a draft that it edits. Every single token is a fresh probabilistic choice based on everything that came before it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-llms-actually-generate-text-temperature-top-k-top-p-and-the-dice-rolls-you-never-see%2Fmermaid-c90f1f4bd8e17e1707d0ebe51fa2eeeb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-llms-actually-generate-text-temperature-top-k-top-p-and-the-dice-rolls-you-never-see%2Fmermaid-c90f1f4bd8e17e1707d0ebe51fa2eeeb.png" alt="Mermaid Diagram" width="800" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The generation loop: predict probabilities for all tokens, sample one, append, repeat. The sampling strategy is where the magic (and danger) happens.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sampling Strategies — One by One
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Greedy Decoding: Always Pick the Winner
&lt;/h3&gt;

&lt;p&gt;The simplest strategy. At every step, pick the token with the highest probability. No randomness, no dice rolling. If "the" has probability 0.35 and "a" has 0.20, you always pick "the."&lt;/p&gt;

&lt;p&gt;Sounds sensible, right? But greedy decoding has a nasty problem: it's boring. It tends to produce repetitive, predictable text. It gets stuck in loops. It picks the "safe" word every time, and the result reads like it was written by someone who's afraid to take any creative risk.&lt;/p&gt;

&lt;p&gt;Greedy decoding is fine for tasks where you want the single most likely answer — like classification or extraction. For anything generative, it's almost never what you want.&lt;/p&gt;

&lt;h3&gt;
  
  
  Temperature: Turning Up the Creativity Dial
&lt;/h3&gt;

&lt;p&gt;Temperature is the parameter everyone knows and almost nobody understands precisely. Here's what it actually does.&lt;/p&gt;

&lt;p&gt;Before the softmax function converts logits to probabilities, temperature &lt;strong&gt;divides the logits by a number.&lt;/strong&gt; That's it. That's the whole mechanism.&lt;/p&gt;

&lt;p&gt;But the effect is dramatic:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temperature = 1.0&lt;/strong&gt; — No change. The probabilities are whatever the model naturally produces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temperature &amp;lt; 1.0&lt;/strong&gt; (say, 0.3) — The logits get divided by a small number, which &lt;em&gt;amplifies&lt;/em&gt; the differences between them. High-probability tokens become even more probable. Low-probability tokens become nearly impossible. The distribution gets "peaky" — the model becomes more confident, more predictable, more conservative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temperature &amp;gt; 1.0&lt;/strong&gt; (say, 1.5) — The logits get divided by a large number, which &lt;em&gt;flattens&lt;/em&gt; the differences. Every token becomes more equally likely. The distribution spreads out — the model becomes more random, more creative, more surprising. Also more likely to say something unhinged.&lt;/p&gt;

&lt;p&gt;Think of temperature like a volume knob for randomness. Turn it down for math homework. Turn it up for poetry. Turn it all the way down (temperature = 0) and you get greedy decoding — pure determinism.&lt;/p&gt;
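&lt;p&gt;In code, the mechanism really is one division. A sketch with made-up logits:&lt;/p&gt;

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize.
    z = np.exp(logits - logits.max())
    return z / z.sum()

def apply_temperature(logits, temperature):
    # The whole mechanism: divide the logits by T before softmax.
    # As T approaches 0 this collapses onto the argmax (greedy decoding).
    return softmax(logits / temperature)

logits = np.array([2.0, 1.0, 0.0])      # made-up raw scores
cold = apply_temperature(logits, 0.3)   # peaky: the top token dominates
warm = apply_temperature(logits, 1.5)   # flatter: choices spread out
```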

&lt;h3&gt;
  
  
  Top-K Sampling: Only Consider the Top Candidates
&lt;/h3&gt;

&lt;p&gt;Top-K is a filter. Before sampling, it looks at all 50,000+ tokens in the vocabulary, keeps only the K most probable ones, and throws the rest away. The probability mass gets redistributed among the survivors.&lt;/p&gt;

&lt;p&gt;Set K = 50, and the model can only choose from its top 50 candidates. Set K = 5, and it's stuck with the top 5. Set K = 1, and you're back to greedy decoding.&lt;/p&gt;

&lt;p&gt;The problem with top-K? The number K is fixed, regardless of context. Sometimes the model is very confident — 3 tokens account for 95% of the probability, and everything else is noise. A K of 50 would include 47 tokens that have almost zero chance of being right. Other times the model is uncertain — 200 tokens each have a small but meaningful probability. A K of 50 would cut off potentially good options.&lt;/p&gt;

&lt;p&gt;Top-K doesn't adapt to the shape of the distribution. It's blunt.&lt;/p&gt;
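&lt;p&gt;A sketch of the filter on a toy distribution:&lt;/p&gt;

```python
import numpy as np

def top_k_filter(probs, k):
    # Keep only the k most probable tokens; renormalize the survivors.
    keep = np.argsort(probs)[-k:]       # indices of the top k tokens
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
survivors = top_k_filter(probs, 2)   # renormalized: 0.5/0.8 and 0.3/0.8
```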

&lt;h3&gt;
  
  
  Top-P (Nucleus Sampling): The Smart Filter
&lt;/h3&gt;

&lt;p&gt;Top-P, also called nucleus sampling, is the clever answer to top-K's rigidity. Instead of keeping a fixed number of tokens, it keeps the smallest set of tokens whose combined probability exceeds a threshold P.&lt;/p&gt;

&lt;p&gt;Set P = 0.9, and the model keeps adding tokens (from most to least probable) until their probabilities sum to 0.9. If the model is confident, that might be only 3 tokens. If the model is uncertain, it might be 200.&lt;/p&gt;

&lt;p&gt;The beauty is that top-P adapts to context. When the next word is obvious ("The Eiffel Tower is in ___"), it narrows down to very few candidates. When the next word could genuinely go many ways ("She felt ___"), it keeps a wider pool.&lt;/p&gt;

&lt;p&gt;This is why top-P has become the default sampling strategy in most production systems. It's more robust across different situations than top-K.&lt;/p&gt;
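&lt;p&gt;A sketch of nucleus sampling on two toy distributions, one confident and one uncertain:&lt;/p&gt;

```python
import numpy as np

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability reaches p.
    order = np.argsort(probs)[::-1]            # most probable first
    cum = np.cumsum(probs[order])
    # Tiny tolerance so float rounding can't drop a token that just reaches p.
    cutoff = np.searchsorted(cum, p - 1e-9) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

confident = np.array([0.90, 0.05, 0.03, 0.02])
uncertain = np.array([0.35, 0.30, 0.25, 0.10])
# With p = 0.9: the confident case keeps 1 token, the uncertain case keeps 3.
```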

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-llms-actually-generate-text-temperature-top-k-top-p-and-the-dice-rolls-you-never-see%2Fmermaid-de0d7615c66d27cc6d6fa91a0225b740.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-llms-actually-generate-text-temperature-top-k-top-p-and-the-dice-rolls-you-never-see%2Fmermaid-de0d7615c66d27cc6d6fa91a0225b740.png" alt="Mermaid Diagram" width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Top-K keeps a fixed number regardless of confidence. Top-P adapts — tight when confident, wide when uncertain.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Min-P: The 2026 Newcomer
&lt;/h3&gt;

&lt;p&gt;There's a newer approach that's gaining traction, especially in open-source communities. Min-P sets a threshold relative to the most probable token. If the top token has probability 0.8 and min-P is 0.1, any token with probability below 0.08 (10% of 0.8) gets cut.&lt;/p&gt;

&lt;p&gt;The elegance is that it scales with the model's own confidence. When the model is very sure (top token at 0.95), the threshold is high and very few alternatives survive. When the model is less sure (top token at 0.2), the threshold drops and more tokens stay in the pool.&lt;/p&gt;

&lt;p&gt;As of early 2026, the combination of temperature + min-P is what many open-source LLM users have converged on as the most practical setup.&lt;/p&gt;
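&lt;p&gt;A sketch of the min-P cut, reusing the numbers from the example above:&lt;/p&gt;

```python
import numpy as np

def min_p_filter(probs, min_p):
    # The threshold scales with confidence: min_p times the top probability.
    threshold = min_p * probs.max()
    mask = np.greater_equal(probs, threshold)
    filtered = probs * mask
    return filtered / filtered.sum()

sure   = np.array([0.80, 0.10, 0.06, 0.04])                # threshold 0.08: 2 survive
unsure = np.array([0.22, 0.20, 0.18, 0.16, 0.14, 0.10])    # threshold 0.022: all survive
```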

&lt;h2&gt;
  
  
  Practical Guide: What Settings for What Task
&lt;/h2&gt;

&lt;p&gt;Here's a cheat sheet based on how these strategies interact:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code generation, factual Q&amp;amp;A, data extraction:&lt;/strong&gt; Temperature 0–0.3, top-P 0.9. You want determinism and accuracy. The model should pick the most likely token almost every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;General chatbot, customer support:&lt;/strong&gt; Temperature 0.5–0.7, top-P 0.9. A balance of reliability and natural-sounding language. Not robotic, not chaotic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creative writing, brainstorming, poetry:&lt;/strong&gt; Temperature 0.8–1.2, top-P 0.95. Give the model room to explore. Higher temperature means more surprising word choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never go above 1.5 for temperature&lt;/strong&gt; unless you're doing it for fun. At that point, the probability distribution is so flat that the model starts producing incoherent output — like a writer who's had too much coffee and is just free-associating.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistakes That Bite — Common Misunderstandings
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Temperature controls how smart the model is."&lt;/strong&gt; No. It controls the randomness of token selection. A low temperature doesn't make the model think harder — it makes it pick the highest-probability token more consistently. If the model's probabilities are wrong, low temperature just makes it confidently wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I should always use top-K AND top-P together."&lt;/strong&gt; You can, but be careful. If you set K=50 and P=0.9, the effective filter is whichever is more restrictive. Often one overrides the other, and the second parameter does nothing. Pick one or understand how they interact in your specific framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Temperature 0 means the same output every time."&lt;/strong&gt; Almost. It means greedy decoding — always picking the highest-probability token. But some implementations have floating-point tie-breaking that can occasionally vary. For true determinism, also set a fixed random seed if the API supports it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now Go Break Something — Where to Go from Here
&lt;/h2&gt;

&lt;p&gt;The best way to internalize these concepts is to play with them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use the OpenAI or Anthropic playground&lt;/strong&gt; — they have real-time sliders for temperature, top-P, and top-K. Ask the same question at different settings and watch how the output changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try the Hugging Face text generation playground&lt;/strong&gt; — it shows the token probabilities alongside the generated text, so you can literally see the dice being rolled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search for "LLM sampling parameters interactive demo"&lt;/strong&gt; — several blog posts have visual explainers that let you see how temperature reshapes the probability distribution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read the Hugging Face blog post "Decoding Strategies in Large Language Models"&lt;/strong&gt; — it covers everything from greedy search to min-P with code examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For open-source users:&lt;/strong&gt; Experiment with llama.cpp's sampler chain — it lets you compose multiple sampling strategies in sequence and see how each one transforms the distribution.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Next time you set temperature to 0.7 and top-P to 0.95, you'll know exactly what's happening: the model calculates probabilities for 50,000 tokens, temperature sharpens the distribution slightly, top-P keeps only the tokens that matter, and one gets picked. Every word you read from an LLM went through this gauntlet. The same prompt, the same model, but different dice rolls — and that's why you get a different coffee mug description every time.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Author: Shibin&lt;/p&gt;

</description>
      <category>learning</category>
      <category>ai</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The LLM Interview Cheat Sheet — 10 Questions That Actually Come Up</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:44:59 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/the-llm-interview-cheat-sheet-10-questions-that-actually-come-up-5faa</link>
      <guid>https://dev.to/thousand_miles_ai/the-llm-interview-cheat-sheet-10-questions-that-actually-come-up-5faa</guid>
      <description>&lt;p&gt;You've used ChatGPT, built a RAG pipeline, maybe even fine-tuned a model. But can you explain how attention actually works when the interviewer asks? Here are 10 LLM questions that keep showing up in interviews — with answers that actually make sense.&lt;/p&gt;




&lt;p&gt;It's 10 PM the night before your Google / Meta / OpenAI LLM engineer interview. You're scrolling through your notes on transformers, and your mind goes blank when you try to explain self-attention out loud. You panic. You Google "explain attention mechanisms" and spend the next hour reading academic papers that feel like they were written in a different language.&lt;/p&gt;

&lt;p&gt;By midnight, you're convinced you don't know anything.&lt;/p&gt;

&lt;p&gt;Here's the truth: you probably know more than you think. You've fine-tuned models, built RAG pipelines, maybe even experimented with prompt engineering. But when an interviewer asks "How does self-attention work?" or "When would you use fine-tuning vs RAG?", panic takes over and you blank out.&lt;/p&gt;

&lt;p&gt;This post is your cheat sheet. Not the academic definitions. The answers that actually work in an interview — clear, concise, and confident.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;LLM roles are exploding right now. Google, Meta, OpenAI, Anthropic, Microsoft — they're all hiring ML engineers who can talk intelligently about transformers, RAG, fine-tuning, and hallucination. These aren't niche roles anymore. They're the growth area in tech.&lt;/p&gt;

&lt;p&gt;These 10 questions (or variations of them) are the gatekeepers for those roles. They appear across companies because they separate people who understand LLMs from people who just know how to use them.&lt;/p&gt;

&lt;p&gt;The good news: these questions have predictable answers. You just need to know how to explain them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 10 Questions (+ Answers You Can Deliver)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Explain self-attention. Why can't you just use RNNs?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why they ask:&lt;/strong&gt; This is the foundation of everything. If you can't explain this clearly, everything else falls apart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;br&gt;
Self-attention lets a token look at every other token in the sequence at once and assign weights to determine which ones matter. It answers: "Given this token, which other tokens should I pay attention to?"&lt;/p&gt;

&lt;p&gt;Here's the concrete difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RNN (old way):&lt;/strong&gt; Processes tokens one at a time, left to right. Token at position 10 struggles to "remember" token at position 1 because information has to flow through 9 steps. Long dependencies get lost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-attention (new way):&lt;/strong&gt; Token at position 10 directly computes its similarity to all other tokens (positions 1–9) and decides their importance instantly. No information decay.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The formula you don't need to memorize, but should understand:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attention(Q, K, V) = softmax(Q * K / sqrt(d)) * V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Translation: Take your query (Q), multiply by the transposed keys (K^T), scale by sqrt(d_k), normalize with softmax so the weights sum to 1, then take the weighted sum of the values (V).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The gotcha:&lt;/strong&gt; Interviewers might ask "What's the computational cost?" Answer: O(n²) where n is sequence length. That's why long context windows are expensive. That's also why companies invest in optimized attention (multi-query attention, FlashAttention).&lt;/p&gt;
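&lt;p&gt;If you want to see the formula run, here's a single-head sketch in NumPy (random toy matrices, no masking):&lt;/p&gt;

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V  -- a single head, no masking.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    weights = softmax(scores)         # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                 # 4 tokens, toy dimensions
Q, K, V = (rng.standard_normal((n, d)) for d in (d_k, d_k, d_v))
out = attention(Q, K, V)              # shape (4, 8): one output vector per token
```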




&lt;h3&gt;
  
  
  2. What is positional encoding and why do we need it?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why they ask:&lt;/strong&gt; Transformers are permutation-invariant (word order doesn't matter by default). They want to know if you understand why that's broken and how we fix it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;br&gt;
Self-attention doesn't inherently know position. If you feed "dog bit man" or "man bit dog", the attention mechanism computes the same weights. The model needs to know which word is first, second, third.&lt;/p&gt;

&lt;p&gt;Positional encoding adds information about position to each token's embedding. The most common method (from the original paper) uses sin/cos waves at different frequencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low frequencies encode large-scale positions (is this early or late in the sequence?)&lt;/li&gt;
&lt;li&gt;High frequencies encode local positions (is this token next to another one?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, the model can learn relationships like "noun at position 2, verb at position 4" instead of just "noun, verb".&lt;/p&gt;
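&lt;p&gt;The sin/cos scheme from the original paper fits in a few lines (a sketch; as noted below, real models may use learned or relative encodings instead):&lt;/p&gt;

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    # Sinusoidal encoding: even dims use sin, odd dims use cos,
    # at geometrically spaced frequencies.
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(128, 64)   # added element-wise to the token embeddings
```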

&lt;p&gt;&lt;strong&gt;The gotcha:&lt;/strong&gt; There's no single best positional encoding. Some models use learned positional embeddings. Others use relative position bias. What matters is that you know the problem exists.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Self-attention, multi-head attention — what's the difference?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why they ask:&lt;/strong&gt; This trips up a lot of candidates, who use the term "attention head" without understanding what it does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Self-attention&lt;/strong&gt; is the basic mechanism (Q, K, V multiply and softmax).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-head attention&lt;/strong&gt; is running the same self-attention operation multiple times in parallel, each with different weight matrices, then combining the results.&lt;/p&gt;

&lt;p&gt;Why? Because different "heads" can learn different patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One head might learn to focus on nearby words (local grammar)&lt;/li&gt;
&lt;li&gt;Another head might learn to focus on distant words (long-range references)&lt;/li&gt;
&lt;li&gt;A third head might learn to focus on certain semantic relationships&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like having 8 different "experts" all looking at the same input but with different lenses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula-wise:&lt;/strong&gt; Instead of one attention output, you get multiple outputs and concatenate them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The gotcha:&lt;/strong&gt; Having 8 heads doesn't mean 8× the understanding. Empirically, 8–16 heads work well. More isn't always better (there are diminishing returns).&lt;/p&gt;
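&lt;p&gt;The bookkeeping is mostly reshaping. A sketch of the split-and-merge (the per-head projection matrices are omitted for brevity):&lt;/p&gt;

```python
import numpy as np

def split_heads(x, n_heads):
    # (seq, d_model) -> (n_heads, seq, d_model // n_heads)
    seq, d_model = x.shape
    return x.reshape(seq, n_heads, d_model // n_heads).transpose(1, 0, 2)

def merge_heads(x):
    # (n_heads, seq, d_head) -> (seq, n_heads * d_head)
    n_heads, seq, d_head = x.shape
    return x.transpose(1, 0, 2).reshape(seq, n_heads * d_head)

x = np.arange(6 * 16, dtype=float).reshape(6, 16)  # 6 tokens, d_model = 16
heads = split_heads(x, 4)           # 4 heads, each sees a 4-dim slice
recombined = merge_heads(heads)     # concatenation recovers the full width
```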




&lt;h3&gt;
  
  
  4. Explain the transformer architecture in 30 seconds.
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why they ask:&lt;/strong&gt; They want to know if you can break down complexity. If you ramble for 5 minutes, they think you don't understand the core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer (say this fast):&lt;/strong&gt;&lt;br&gt;
Transformer has two parts: encoder and decoder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encoder:&lt;/strong&gt; Takes input text, runs it through self-attention (to let tokens attend to each other), then through a feed-forward network. Do this 12–24 times (stacking layers). Output: rich representation of the input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decoder:&lt;/strong&gt; Takes target tokens, runs self-attention (but masked so it can't look ahead), then cross-attention (attends to encoder output), then feed-forward. Do this 12–24 times. Output: next token prediction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In one sentence:&lt;/strong&gt; "Stack self-attention and feed-forward layers, apply masking in the decoder, and train to predict the next token."&lt;/p&gt;




&lt;h3&gt;
  
  
  5. What is tokenization and why does it matter?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why they ask:&lt;/strong&gt; Tokenization is the first step. Get it wrong and everything downstream breaks. They want to know if you've thought about this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;br&gt;
Tokenization converts raw text into tokens (usually subwords) that the model can process.&lt;/p&gt;

&lt;p&gt;"Hello world" might become ["Hel", "lo", "world"] or ["Hello", "world"] depending on the tokenizer.&lt;/p&gt;

&lt;p&gt;Why subwords instead of just words?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rare words:&lt;/strong&gt; If the tokenizer has never seen "pneumonia", breaking it into ["pneu", "monia"] lets the model handle it anyway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency:&lt;/strong&gt; Fewer tokens = faster processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spelling variations:&lt;/strong&gt; "color" and "colour" map to similar tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Two main approaches:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BPE (Byte Pair Encoding):&lt;/strong&gt; Used by GPT. Learns common character pairs and merges them iteratively&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WordPiece:&lt;/strong&gt; Used by BERT. Similar idea, but merges are chosen by likelihood gain rather than raw pair frequency&lt;/li&gt;
&lt;/ul&gt;
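&lt;p&gt;One BPE merge step fits in a few lines. A toy sketch on a tiny corpus (the classic low/lower/newest/widest example), with words pre-split into characters:&lt;/p&gt;

```python
from collections import Counter

def most_frequent_pair(words):
    # One BPE step: count adjacent symbol pairs across the corpus,
    # weighted by word frequency, and pick the most common.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Fuse the chosen pair into a single symbol everywhere it occurs.
    merged = " ".join(pair)
    target = "".join(pair)
    return {word.replace(merged, target): freq for word, freq in words.items()}

corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
pair = most_frequent_pair(corpus)    # ("e", "s"), seen 9 times
corpus = merge_pair(corpus, pair)    # "n e w e s t" becomes "n e w es t"
```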

&lt;p&gt;&lt;strong&gt;The gotcha:&lt;/strong&gt; Different models use different tokenizers. GPT-4 uses a different tokenizer than GPT-3. This matters for token counting, context window size, and fine-tuning.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Explain the difference between fine-tuning and RAG. When would you use each?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why they ask:&lt;/strong&gt; This separates people building LLM products from people who understand the tradeoffs. It's a systems thinking question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Fine-tuning&lt;/th&gt;
&lt;th&gt;RAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Adjusts model weights on your task-specific data&lt;/td&gt;
&lt;td&gt;Retrieves relevant docs, adds them to prompt before generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expensive (GPU hours, time)&lt;/td&gt;
&lt;td&gt;Cheap (just needs retrieval + inference)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slow to deploy&lt;/td&gt;
&lt;td&gt;Fast to iterate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge cutoff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Can be months/years old (trained on historical data)&lt;/td&gt;
&lt;td&gt;Can include live, up-to-date information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When to use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific writing style, domain-specific reasoning, behavior you can't prompt into the model&lt;/td&gt;
&lt;td&gt;Factual Q&amp;amp;A, company docs, changing information&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The real answer:&lt;/strong&gt; Most of the time, start with RAG. It's faster to build and easier to maintain. Use fine-tuning only when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You have lots of labeled examples (1000+)&lt;/li&gt;
&lt;li&gt;You need consistent style/format&lt;/li&gt;
&lt;li&gt;RAG isn't getting you there&lt;/li&gt;
&lt;li&gt;You have the infrastructure to maintain a custom model&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Customer support chatbot? RAG + the company's knowledge base. Custom code generation for your codebase? Fine-tuning might be worth it.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. What causes hallucination in LLMs and how do you prevent it?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why they ask:&lt;/strong&gt; Hallucination is the biggest issue in production LLM systems. They want to know if you've dealt with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is hallucination?&lt;/strong&gt; The model generates confident, fluent text that's completely false. Not random gibberish — plausible-sounding facts that are wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model predicts the next most-likely token based on pattern matching, not factual knowledge&lt;/li&gt;
&lt;li&gt;It hasn't learned the boundary between "I know this" and "I'm guessing"&lt;/li&gt;
&lt;li&gt;It's trained to be coherent, not accurate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to prevent it (in order of effectiveness):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RAG (best solution):&lt;/strong&gt; Give the model a document to read from. Now it can only hallucinate based on what's in that document. Most controllable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt engineering:&lt;/strong&gt; Explicit instructions like "Only answer based on the provided context" or "If unsure, say 'I don't know'" help a bit. But models still hallucinate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fine-tuning on high-quality data:&lt;/strong&gt; Train the model on examples where it's penalized for hallucinating. Helps but doesn't fully solve it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fact-checking layer:&lt;/strong&gt; After generation, run the output through a separate fact-checker (another model or rule-based system).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temperature control:&lt;/strong&gt; Lower temperature makes the model more confident in likely tokens, reduces randomness. But doesn't fix hallucination.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
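&lt;p&gt;A sketch of preventions 1 and 2 combined; the prompt wording here is illustrative, not a canonical template:&lt;/p&gt;

```python
def build_grounded_prompt(question, retrieved_docs):
    # RAG plus an explicit "don't guess" instruction: the model is pointed
    # at retrieved context and told to refuse when the answer isn't there.
    context = "\n\n".join(retrieved_docs)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "What is our refund window?",
    ["Refunds are accepted within 30 days of purchase."],
)
```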

&lt;p&gt;&lt;strong&gt;The honest answer:&lt;/strong&gt; You can't eliminate hallucination. You can reduce it. RAG is your best bet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fthe-llm-interview-cheat-sheet-10-questions-that-actually-come-up%2Fmermaid-a0c73c5210c1acc69a00f3f92c064e34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fthe-llm-interview-cheat-sheet-10-questions-that-actually-come-up%2Fmermaid-a0c73c5210c1acc69a00f3f92c064e34.png" alt="Mermaid Diagram" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  8. How would you evaluate an LLM's quality?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why they ask:&lt;/strong&gt; Generating text is easy. Knowing if it's good is hard. They want to know if you've thought about measurement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;br&gt;
Depends on the task. There's no one metric.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic metrics (cheap, noisy):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BLEU, ROUGE:&lt;/strong&gt; Compare n-gram overlap between generated text and a reference text. Works for translation, summarization. Penalizes paraphrasing. Bad for open-ended tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BERTScore:&lt;/strong&gt; Uses embeddings instead of exact word match. More forgiving. Better than BLEU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exact Match (EM), F1:&lt;/strong&gt; For QA. Did the model extract the right answer?&lt;/li&gt;
&lt;/ul&gt;
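&lt;p&gt;For QA-style tasks, EM and token-level F1 are simple enough to sketch directly (toy strings, with SQuAD-style normalization reduced to lowercasing):&lt;/p&gt;

```python
def exact_match(prediction, reference):
    # 1 if the normalized strings are identical, else 0.
    return int(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    # Token-level F1: harmonic mean of precision and recall over
    # overlapping tokens, as in SQuAD-style QA scoring.
    pred = prediction.lower().split()
    ref = reference.lower().split()
    common = 0
    ref_left = list(ref)
    for tok in pred:
        if tok in ref_left:
            common += 1
            ref_left.remove(tok)
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                       # 1
print(round(token_f1("the capital is Paris", "Paris"), 2)) # 0.4
```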

&lt;p&gt;&lt;strong&gt;Manual evaluation (expensive, signal-rich):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human raters:&lt;/strong&gt; Have people score outputs (1–5) on relevance, accuracy, tone. Gold standard. Requires budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rubric-based:&lt;/strong&gt; Define criteria (factuality, clarity, completeness) and score against them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;LLM-as-a-Judge (emerging, controversial):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a strong LLM (GPT-4) to score outputs from a weaker LLM. Fast and surprisingly good, but the judge has its own biases and blind spots, so its errors compound with the generator's.&lt;/li&gt;
&lt;/ul&gt;
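&lt;p&gt;A rubric-based judge is mostly prompt construction. Here is a hedged sketch (the model call itself is omitted, and the rubric items are placeholders you'd replace with your own):&lt;/p&gt;

```python
# Hypothetical rubric; swap in criteria that fit your task.
RUBRIC = ["factual accuracy", "relevance to the question", "clarity"]

def build_judge_prompt(question, answer):
    # Assemble a grading prompt to send to a strong judge model.
    criteria = "\n".join("- " + c for c in RUBRIC)
    return (
        "You are grading an assistant's answer on a 1-5 scale "
        "for each criterion. Respond with one integer per line.\n"
        "Criteria:\n" + criteria + "\n\n"
        "Question: " + question + "\n"
        "Answer: " + answer + "\n"
    )

prompt = build_judge_prompt("What is RAG?", "RAG adds retrieved context.")
print(prompt)
```

&lt;p&gt;The structured "one integer per line" instruction is what makes the judge's output parseable at scale.&lt;/p&gt;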

&lt;p&gt;&lt;strong&gt;Business metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For a chatbot: user satisfaction, conversation length, return rate&lt;/li&gt;
&lt;li&gt;For a code generator: does generated code compile? Does it pass tests?&lt;/li&gt;
&lt;/ul&gt;
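&lt;p&gt;The "does generated code compile?" signal for a Python code generator can be as small as a call to the built-in &lt;code&gt;compile()&lt;/code&gt;:&lt;/p&gt;

```python
def compiles(source):
    # A crude "does it compile?" signal for generated Python code:
    # syntax-check the string without executing it.
    try:
        compile(source, "generated.py", "exec")
        return True
    except SyntaxError:
        return False

print(compiles("def add(a, b): return a + b"))  # True
print(compiles("def add(a, b: return a + b"))   # False
```

&lt;p&gt;It says nothing about correctness, which is why you pair it with test pass rates, but it's a cheap first filter.&lt;/p&gt;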

&lt;p&gt;&lt;strong&gt;The honest answer:&lt;/strong&gt; Use multiple signals. No single metric tells the full story.&lt;/p&gt;




&lt;h3&gt;
  
  
  9. Explain what's happening in a forward pass through a transformer.
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why they ask:&lt;/strong&gt; They want to verify you can trace through the actual computation. Not just regurgitate definitions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fthe-llm-interview-cheat-sheet-10-questions-that-actually-come-up%2Fmermaid-1c42301738ffca714b7948db1cb1b9d1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fthe-llm-interview-cheat-sheet-10-questions-that-actually-come-up%2Fmermaid-1c42301738ffca714b7948db1cb1b9d1.png" alt="Mermaid Diagram" width="800" height="4205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization:&lt;/strong&gt; "Hello world" → token IDs &lt;code&gt;[101, 7592, 2088]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding:&lt;/strong&gt; Each token ID maps to a d-dimensional vector (e.g., 768D for BERT)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional encoding:&lt;/strong&gt; Add sin/cos vectors so the model knows position&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformer block:&lt;/strong&gt; Run through self-attention, feed-forward, repeat 12+ times. Each layer transforms the embeddings, extracting deeper meaning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output layer:&lt;/strong&gt; Linear layer that converts final embeddings to logits (scores) for each possible next token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Softmax:&lt;/strong&gt; Convert logits to probabilities summing to 1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sampling:&lt;/strong&gt; Sample the next token from the distribution (or greedily pick the highest-probability one)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeat:&lt;/strong&gt; Feed the new token back in, keep going&lt;/li&gt;
&lt;/ol&gt;
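&lt;p&gt;Steps 5–7 can be sketched with a toy vocabulary and made-up logits:&lt;/p&gt;

```python
import math

def softmax(logits):
    # Step 6: convert logits to probabilities that sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "cat", "sat", "mat"]    # toy 4-token vocabulary
logits = [1.2, 3.1, 0.4, 2.2]           # step 5: output of the linear layer
probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # step 7: greedy pick
print(next_token)  # cat
```

&lt;p&gt;A real model does this over a vocabulary of tens of thousands of tokens, but the shape of the computation is identical.&lt;/p&gt;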

&lt;p&gt;&lt;strong&gt;The gotcha:&lt;/strong&gt; During inference, you don't recompute the keys and values for all previous tokens at every step (too expensive). You use KV caching: store the keys and values from previous tokens, reuse them, and only compute them for the new token.&lt;/p&gt;
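&lt;p&gt;The KV-caching idea fits in a few lines if you fake the key/value projections (the attention math itself is omitted):&lt;/p&gt;

```python
# Toy KV cache: keys/values for past tokens are stored, so each new
# token only computes its own k, v.
def fake_kv(token_id):
    # Stand-in for the real key/value projection matrices.
    return ("k%d" % token_id, "v%d" % token_id)

cache = {"keys": [], "values": []}

def step(token_id, cache):
    k, v = fake_kv(token_id)    # computed only for the NEW token
    cache["keys"].append(k)     # reused on every later step
    cache["values"].append(v)
    return len(cache["keys"])   # attention would read the full cache

for t in [101, 7592, 2088]:
    seen = step(t, cache)

print(cache["keys"])  # ['k101', 'k7592', 'k2088']
```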




&lt;h3&gt;
  
  
  10. What's the difference between base models and instruction-tuned models? Why do we need both?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why they ask:&lt;/strong&gt; This is about understanding the training pipeline and product strategy. It separates engineers from researchers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Base models (like GPT-3, LLaMA):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trained on next-token prediction on huge internet text&lt;/li&gt;
&lt;li&gt;Excellent at patterns and language&lt;/li&gt;
&lt;li&gt;Terrible at following instructions&lt;/li&gt;
&lt;li&gt;If you ask "Tell me a joke", it will continue text in a way that follows common patterns, not necessarily tell a joke&lt;/li&gt;
&lt;li&gt;Useful for: creative writing, text completion, in-context learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Instruction-tuned models (like ChatGPT, Llama-2-Chat):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take a base model&lt;/li&gt;
&lt;li&gt;Fine-tune it on (instruction, response) pairs where responses are aligned with what users want&lt;/li&gt;
&lt;li&gt;Also fine-tune with RLHF (Reinforcement Learning from Human Feedback) to penalize bad outputs&lt;/li&gt;
&lt;li&gt;Follows instructions reliably&lt;/li&gt;
&lt;li&gt;Useful for: chatbots, Q&amp;amp;A, customer support&lt;/li&gt;
&lt;/ul&gt;
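&lt;p&gt;One practical consequence: to get instruction-like behavior out of a base model, you frame the instruction as a text pattern to continue. A toy sketch (the few-shot example is made up):&lt;/p&gt;

```python
instruction = "Tell me a joke"

# Instruction-tuned models can take the request directly.
chat_prompt = "User: " + instruction + "\nAssistant:"

# Base models just continue text, so you show them the pattern first.
base_prompt = (
    "Q: Tell me a fun fact\n"
    "A: Honey never spoils.\n"
    "Q: " + instruction + "\n"
    "A:"
)
print(base_prompt)
```

&lt;p&gt;This is in-context learning in its simplest form: the base model completes the Q/A pattern rather than "following" an instruction.&lt;/p&gt;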

&lt;p&gt;&lt;strong&gt;Why both exist:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Base models are research tools. They're the raw material.&lt;/li&gt;
&lt;li&gt;Instruction-tuned models are products. They're what users interact with.&lt;/li&gt;
&lt;li&gt;Sometimes you want a base model (if you're doing research or building something unusual). Usually you want instruction-tuned (if you're shipping to users).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The gotcha:&lt;/strong&gt; Fine-tuning an instruction-tuned model on new data can degrade instruction-following. This is catastrophic forgetting. You need to be careful about the training setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Gotchas (Things Candidates Mess Up)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Confusing attention with RNNs:&lt;/strong&gt; Attention is not sequential. RNNs are sequential. Don't say "attention is better because it's faster at each step" — say "it's faster overall because steps are parallelizable".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Overstating transformer improvements:&lt;/strong&gt; Transformers are great at long context, but they have O(n²) memory. This is a real limitation. Don't pretend it doesn't exist.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assuming fine-tuning is always the answer:&lt;/strong&gt; Most people reach for fine-tuning too early. RAG, prompting, and in-context learning go further than most engineers think.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Saying "more parameters = better":&lt;/strong&gt; Scaling helps, but data quality and training setup matter just as much. A 7B model trained right beats a 70B model trained poorly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Forgetting the practical constraints:&lt;/strong&gt; Interviewers care about inference cost and latency. Academic perfection doesn't matter if you can't serve it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not understanding your own tools:&lt;/strong&gt; If you've used OpenAI API, know its pricing, latency, rate limits. If you've fine-tuned on Hugging Face, know how long it takes and what it costs. Specifics matter.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What to Do Next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Practice explaining these answers out loud.&lt;/strong&gt; Not reading — speaking. Your brain works differently. You'll stumble on things you thought you understood.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build something.&lt;/strong&gt; Try a RAG system, fine-tune a model on your own data, or build a chatbot. Theory is fine, but interviewers test your judgment. You get that from building.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read the original papers lightly.&lt;/strong&gt; Not cover-to-cover. Read "Attention is All You Need" (Vaswani et al., 2017) for context. Skim the abstract and architecture section. You don't need to memorize it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Know your specific tech stack.&lt;/strong&gt; If you're interviewing at a company, know what models they use. Google? PaLM and Gemini. Meta? LLaMA. OpenAI? GPT-4. Anthropic? Claude. Know the positioning. It shows you've done your homework.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Practice system design questions.&lt;/strong&gt; "Design a chatbot for a healthcare provider" or "Design a code generation service." These combine everything. Most interviews include one.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Night Before
&lt;/h2&gt;

&lt;p&gt;You're going to be nervous. That's normal. Everyone is.&lt;/p&gt;

&lt;p&gt;The difference between people who pass and people who don't isn't knowledge — it's clarity. You probably know 80% of what you need. You just need to deliver it with confidence.&lt;/p&gt;

&lt;p&gt;Before bed, do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read through your answers once (not for hours — 20 minutes max)&lt;/li&gt;
&lt;li&gt;Do a practice explanation out loud for each question&lt;/li&gt;
&lt;li&gt;Go to sleep knowing that you've prepped well&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During the interview:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If they ask something you don't know, say "I don't know that specific detail, but here's how I'd think about it." Then reason through it. Reasoning is more valuable than memorization.&lt;/li&gt;
&lt;li&gt;If you blank out on a question, pause for 5 seconds. Think. Then answer. Silence is okay. Rambling is bad.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You've got this. Good luck.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/blog/llm-interview-questions" rel="noopener noreferrer"&gt;Top 36 LLM Interview Questions and Answers for 2026 | DataCamp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.lockedinai.com/blog/llm-interview-questions-answers-complete-guide-engineers" rel="noopener noreferrer"&gt;LLM Interview Questions &amp;amp; Answers (2026): A Complete Guide for Engineers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/llmgenai/LLMInterviewQuestions" rel="noopener noreferrer"&gt;GitHub - llmgenai/LLMInterviewQuestions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/blog/rag-interview-questions" rel="noopener noreferrer"&gt;Top 30 RAG Interview Questions and Answers for 2026 | DataCamp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/fonzi-ai/would-you-pass-an-openai-ml-engineer-interview-in-2025-d4fb2d8c4708" rel="noopener noreferrer"&gt;Would You Pass an OpenAI ML Engineer Interview in 2025? | Medium&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Author: thousandmiles-ai-admin&lt;/p&gt;

</description>
      <category>learning</category>
      <category>ai</category>
    </item>
    <item>
      <title>How AI Agents Actually Execute Multi-Step Tasks — The Orchestration Nobody Talks About</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:43:56 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/how-ai-agents-actually-execute-multi-step-tasks-the-orchestration-nobody-talks-about-4ahp</link>
      <guid>https://dev.to/thousand_miles_ai/how-ai-agents-actually-execute-multi-step-tasks-the-orchestration-nobody-talks-about-4ahp</guid>
      <description>&lt;p&gt;You asked the AI to 'book a flight and update the spreadsheet.' It did both. But how? A deep dive into the reasoning loop, tool calling, and orchestration patterns that make AI agents actually work.&lt;/p&gt;




&lt;h1&gt;
  
  
  How AI Agents Actually Execute Multi-Step Tasks — The Orchestration Nobody Talks About
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;An LLM can write poetry and explain quantum physics. But ask it to "check the database, find stale records, and send a Slack alert" — and suddenly it needs an entire architecture to pull it off.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The "Just Do It" Illusion
&lt;/h2&gt;

&lt;p&gt;You're watching a demo. Someone types into a chat: "Find all overdue invoices in our system, calculate the total amount, and draft an email to the finance team with a summary." The AI assistant thinks for a moment, then — like magic — it queries the database, crunches the numbers, writes a professional email, and asks for confirmation before sending.&lt;/p&gt;

&lt;p&gt;It looks seamless. Like the AI just... understood and did everything. But behind that smooth demo is something much more interesting: a loop. The AI didn't do all of that in one shot. It thought about what to do first, executed one step, looked at the result, thought again, executed the next step, and kept going until the job was done.&lt;/p&gt;

&lt;p&gt;That loop — the reasoning-action-observation cycle — is the beating heart of every AI agent. And understanding it is the difference between building chatbots that answer questions and building agents that actually get things done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;If you're building anything with LLMs beyond a simple Q&amp;amp;A bot — a coding assistant, an automated workflow, an internal ops tool — you're building an agent, whether you call it that or not. And every AI-focused company, from startups to the big labs, is hiring people who understand how agents work under the hood.&lt;/p&gt;

&lt;p&gt;More practically: the agent architecture you choose determines whether your system is reliable or a house of cards. The "just let the LLM figure it out" approach works in demos. In production, it falls apart spectacularly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Me Back Up — What Even Is an AI Agent?
&lt;/h2&gt;

&lt;p&gt;Let's get precise. An AI agent is an LLM-powered system that can take actions in the real world — not just generate text, but call APIs, query databases, read files, send messages, execute code. It does this autonomously, deciding on its own what steps to take to achieve a goal.&lt;/p&gt;

&lt;p&gt;The key word is "autonomously." A regular LLM call is like asking someone a question — you get an answer back. An agent is like giving someone a task — they figure out the steps, do the work, and come back with results.&lt;/p&gt;

&lt;p&gt;But here's the thing: LLMs don't inherently know how to plan and execute multi-step tasks. They're trained to predict the next token. The agent behavior comes from the architecture wrapped around the LLM — the loop, the tools, the memory, the orchestration logic. The LLM is the brain. Everything else is the body.&lt;/p&gt;

&lt;h2&gt;
  
  
  Okay, But How Does It Actually Work? — The ReAct Loop
&lt;/h2&gt;

&lt;p&gt;The most foundational pattern in agent design is called &lt;strong&gt;ReAct&lt;/strong&gt; — short for "Reasoning and Acting." It was introduced in a 2022 research paper, and by 2026 it's become the default mental model for how agents operate.&lt;/p&gt;

&lt;p&gt;Here's the core idea: instead of asking the LLM to produce a final answer in one shot, you put it in a loop where it alternates between thinking and doing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-ai-agents-actually-execute-multi-step-tasks-the-orchestration-nobody-talks-about%2Fmermaid-9968f2fdfff3cf6a8252d44cd3eebbfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-ai-agents-actually-execute-multi-step-tasks-the-orchestration-nobody-talks-about%2Fmermaid-9968f2fdfff3cf6a8252d44cd3eebbfc.png" alt="Mermaid Diagram" width="800" height="2144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The ReAct loop: think, act, observe, repeat. Each cycle brings the agent closer to the goal.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step by Step
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Thought&lt;/strong&gt; — The LLM generates internal reasoning. It looks at the goal, considers what information it has, and decides what to do next. This is essentially chain-of-thought reasoning, but directed toward action. Something like: "The user wants overdue invoices. I need to query the database first. I'll use the query_invoices tool with a filter for overdue status."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt; — Based on the thought, the LLM outputs a structured tool call. It's not free-form text — it's a specific function name with specific parameters, like &lt;code&gt;query_invoices(status="overdue", limit=100)&lt;/code&gt;. The orchestration layer parses this and executes it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observation&lt;/strong&gt; — The tool runs, and its output gets fed back to the LLM as context. "Found 23 overdue invoices totaling $47,250." Now the LLM has new information it didn't have before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loop&lt;/strong&gt; — The LLM sees the observation, generates a new thought ("Now I need to calculate the total and draft the email"), and takes the next action. This continues until the goal is met or the agent decides it needs human input.&lt;/p&gt;

&lt;p&gt;The beauty of this pattern is that it's self-correcting. If a tool call fails, the LLM sees the error in the observation step and can try a different approach. If it gets unexpected data, it can reason about what went wrong. This feedback loop is what makes agents feel intelligent — they're not just following a script, they're adapting.&lt;/p&gt;
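&lt;p&gt;The whole loop fits in a short sketch. The "LLM" here is scripted and the tools are fakes, but the shape (decide, act, observe, repeat, with a hard iteration limit) is the real thing:&lt;/p&gt;

```python
# Fake tools standing in for real database and email integrations.
def query_invoices(status):
    return {"count": 23, "total": 47250}

def draft_email(summary):
    return "DRAFT: " + summary

TOOLS = {"query_invoices": query_invoices, "draft_email": draft_email}

def decide(history):
    # Stand-in for the LLM: pick the next action from what happened so far.
    if not history:
        return ("query_invoices", {"status": "overdue"})
    if len(history) == 1:
        data = history[0][1]
        summary = "%d overdue invoices totaling $%d" % (data["count"], data["total"])
        return ("draft_email", {"summary": summary})
    return None  # goal met

history = []
for step in range(10):                 # hard iteration limit
    action = decide(history)
    if action is None:
        break
    name, args = action
    observation = TOOLS[name](**args)  # act, then observe
    history.append((name, observation))

print(history[-1][1])  # DRAFT: 23 overdue invoices totaling $47250
```

&lt;p&gt;Swapping &lt;code&gt;decide()&lt;/code&gt; for a real model call is what turns this scaffold into an agent; the loop, the tool registry, and the iteration cap stay the same.&lt;/p&gt;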

&lt;h3&gt;
  
  
  Why Not Just Plan Everything Upfront?
&lt;/h3&gt;

&lt;p&gt;Fair question. Why not have the LLM create a full plan at the beginning and then execute it linearly? Some architectures do this — and it works for simple, predictable tasks. But for anything complex, upfront planning breaks down because the agent doesn't know what it'll discover along the way. Maybe the database query returns no results. Maybe the API is down. Maybe the data looks different than expected. The iterative loop handles uncertainty by making decisions one step at a time, with real information.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Orchestration Architectures
&lt;/h2&gt;

&lt;p&gt;Not all agents are built the same. As tasks get more complex, the simple single-loop pattern needs to evolve. Here are the three main architectures you'll see in production systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Single Agent Loop
&lt;/h3&gt;

&lt;p&gt;This is the ReAct pattern we just described — one LLM handling everything end to end. It reads the goal, picks a tool, observes the result, and repeats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; Simple-to-moderate tasks with a clear sequence of steps. Think "search for X, summarize it, save to a file."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaks down when:&lt;/strong&gt; The task requires expertise in multiple domains, or the number of available tools is so large that the LLM gets confused about which to use. When you give a single agent 50 tools, it starts picking the wrong ones — there's a real cognitive overload problem with tool selection.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Supervisor Pattern (Hierarchical)
&lt;/h3&gt;

&lt;p&gt;A supervisor agent breaks the goal into sub-tasks and delegates each to a specialist agent. The supervisor doesn't do the work itself — it coordinates.&lt;/p&gt;

&lt;p&gt;Think of it like a tech lead assigning tickets. The supervisor says: "Agent A, query the database for overdue invoices. Agent B, once A is done, calculate the total. Agent C, draft the email with the results."&lt;/p&gt;

&lt;p&gt;Each worker agent runs its own ReAct loop with a narrower focus and fewer tools. The supervisor collects results and produces the final output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-ai-agents-actually-execute-multi-step-tasks-the-orchestration-nobody-talks-about%2Fmermaid-2e365edb9e92326b125779f3d6a1f061.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-ai-agents-actually-execute-multi-step-tasks-the-orchestration-nobody-talks-about%2Fmermaid-2e365edb9e92326b125779f3d6a1f061.png" alt="Mermaid Diagram" width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Supervisor pattern: one coordinator, multiple specialist workers. Each worker has a focused role and limited tool set.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; Complex tasks that need different types of expertise. One agent might be great with databases, another with writing, another with code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; More overhead. You're running multiple LLM calls, and the supervisor needs to correctly decompose the task. Bad decomposition means bad results.&lt;/p&gt;
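&lt;p&gt;In miniature, with fake workers and a hard-coded plan, the supervisor pattern looks like this:&lt;/p&gt;

```python
# Toy supervisor: decompose a goal into sub-tasks, delegate each to a
# specialist, then assemble the results (all workers are fakes).
def db_worker(task):
    return "23 overdue invoices"

def writer_worker(task):
    return "Email drafted about: " + task

WORKERS = {"database": db_worker, "writing": writer_worker}

def supervisor(goal):
    # A real supervisor would ask an LLM to produce this plan.
    plan = [("database", "find overdue invoices"),
            ("writing", "summarize findings")]
    results = []
    for role, task in plan:
        results.append(WORKERS[role](task))  # each worker runs its own loop
    return " | ".join(results)

print(supervisor("report on overdue invoices"))
```

&lt;p&gt;The decomposition step is where this architecture lives or dies, which is exactly the trade-off described above.&lt;/p&gt;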

&lt;h3&gt;
  
  
  3. Plan-Execute-Synthesize
&lt;/h3&gt;

&lt;p&gt;This is the architecture that's gaining the most traction in 2026. It separates the agent into three distinct roles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planner&lt;/strong&gt; — Looks at the goal and produces a structured plan. Just the plan — no execution. This forces the planning step to be explicit and reviewable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Executor&lt;/strong&gt; — Takes the plan and runs it step by step, calling tools and collecting results. The executor can only do what the plan authorizes. This makes the system predictable and auditable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synthesizer&lt;/strong&gt; — Reads all the collected evidence (tool outputs, intermediate results) and composes the final answer. It never calls tools directly — it just works with the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; By separating planning from execution from synthesis, you can enforce policies (the executor can't go rogue), audit every step (the plan is inspectable), and debug failures precisely (was the plan wrong? did a tool fail? did the synthesis miss something?).&lt;/p&gt;
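&lt;p&gt;A toy sketch of the three roles, with stand-in tools; the point is that the plan is plain data the executor cannot step outside of:&lt;/p&gt;

```python
def plan(goal):
    # Planner: emit a structured, reviewable plan and nothing else.
    return [{"tool": "search", "args": {"q": goal}},
            {"tool": "summarize", "args": {}}]

def execute(steps, tools):
    # Executor: run exactly the tools the plan names, collect evidence.
    evidence = []
    for step in steps:
        fn = tools[step["tool"]]
        evidence.append(fn(**step["args"]))
    return evidence

def synthesize(evidence):
    # Synthesizer: compose the answer from evidence, never call tools.
    return "Answer based on: " + "; ".join(evidence)

tools = {"search": lambda q: "3 results for " + q,
         "summarize": lambda: "summary"}
steps = plan("stale records")
print(synthesize(execute(steps, tools)))
```

&lt;p&gt;Because the plan is inspectable data, you can log it, diff it, and reject it before a single tool runs.&lt;/p&gt;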

&lt;h2&gt;
  
  
  Mistakes That Bite — Where Agent Architectures Go Wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Give the agent all the tools and let it figure it out."&lt;/strong&gt; This is the most common mistake. More tools does not mean more capability — it means more confusion. LLMs have a harder time choosing the right tool when the selection is large. Be surgical: give each agent only the tools it needs for its specific role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"The LLM will handle error recovery."&lt;/strong&gt; Sometimes. But LLMs can also get stuck in loops — calling the same failing tool over and over with slightly different parameters, burning tokens without making progress. Production agents need hard limits: maximum loop iterations, timeout policies, and escalation to a human when the agent is clearly stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"We don't need a human in the loop."&lt;/strong&gt; For low-stakes tasks like summarizing data, sure. But for anything that sends emails, modifies databases, or takes irreversible actions? You need a confirmation step. The best agent architectures have explicit "checkpoints" where the agent pauses and asks for human approval before proceeding with high-impact actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now Go Break Something — Where to Go from Here
&lt;/h2&gt;

&lt;p&gt;If you want to build your own agent and feel these patterns firsthand, here's a path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with the ReAct pattern.&lt;/strong&gt; Build a simple agent that has three tools (a web search tool, a calculator, and a file writer). Give it a goal that requires using all three. Watch how it reasons through the steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try LangGraph&lt;/strong&gt; — it lets you define agent workflows as graphs, which makes the orchestration patterns visual and easy to experiment with. The official docs have great quickstart tutorials.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explore the OpenAI Agents SDK&lt;/strong&gt; — it's lightweight and has built-in support for tool calling and MCP integration. Good for understanding the basics without framework overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read the original ReAct paper&lt;/strong&gt; — search for "ReAct: Synergizing Reasoning and Acting in Language Models" by Yao et al. It's surprisingly readable for an academic paper, and understanding the origin helps you see why everything is built this way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For the ambitious:&lt;/strong&gt; Build a supervisor-worker system where a planner agent delegates to two specialist agents. Even a toy example with made-up tools will teach you more about orchestration challenges than any tutorial.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;That seamless demo — where the AI queried the database, crunched numbers, and drafted an email — wasn't magic. It was a loop: think, act, observe, repeat. The LLM provided the reasoning. The orchestration provided the structure. And the tools provided the hands. Once you see the loop, every AI agent stops being a black box and starts being an engineering problem you can actually debug.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Author: thousandmiles-ai-admin&lt;/p&gt;

</description>
      <category>learning</category>
      <category>ai</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>How to Evaluate LLM Outputs — Beyond 'Looks Good to Me'</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:42:53 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/how-to-evaluate-llm-outputs-beyond-looks-good-to-me-4488</link>
      <guid>https://dev.to/thousand_miles_ai/how-to-evaluate-llm-outputs-beyond-looks-good-to-me-4488</guid>
      <description>&lt;p&gt;Your RAG pipeline returns an answer. It sounds confident. But is it actually correct? Turns out 'vibes-based evaluation' doesn't scale. Learn the metrics and frameworks that actually tell you if your LLM is hallucinating, missing context, or nailing it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Classic Problem
&lt;/h2&gt;

&lt;p&gt;You've built a RAG pipeline. Your knowledge base is solid. Your retriever works fine. You run a test query, and the LLM spits out an answer that sounds &lt;em&gt;completely&lt;/em&gt; confident. Grammar? Perfect. Structure? Coherent. Tone? Professional.&lt;/p&gt;

&lt;p&gt;You copy it to your Slack channel: "It works!"&lt;/p&gt;

&lt;p&gt;But then someone asks a follow-up question, and the answer contradicts itself. Or they check a fact and it's subtly wrong. Or they ask "where did you get that?" and you realize the LLM just... made it up.&lt;/p&gt;

&lt;p&gt;That's not an error in your pipeline. That's an error in how you evaluated it.&lt;/p&gt;

&lt;p&gt;If you're relying on "it looks right to me," you're in dangerous territory. The problem scales immediately: when you have 100 queries, 1000 queries, or a production system running 24/7, you can't manually inspect every output.&lt;/p&gt;

&lt;p&gt;You need metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Vibes-based evaluation breaks at scale.&lt;/strong&gt; Human inspection is slow, inconsistent, and subjective. One person reads an answer and thinks "solid." Another reads the same answer and spots a hallucination. You're shipping an LLM system that nobody actually understands, and nobody can debug when it fails.&lt;/p&gt;

&lt;p&gt;But here's the thing: traditional ML evaluation metrics don't work for language models. In classification, you have clear right/wrong answers. In RAG, there's no single "ground truth." The same query might have 10 correct answers depending on how you interpret it. And hallucinations are &lt;em&gt;genuinely&lt;/em&gt; hard to spot automatically because the LLM is confident and grammatically flawless.&lt;/p&gt;

&lt;p&gt;So we need new frameworks. We need to measure different dimensions separately, and we need tools that don't require a human to read every single output.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Challenge: Why LLM Eval Is Different
&lt;/h2&gt;

&lt;p&gt;Traditional ML evaluation assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There's one right answer (binary classification, exact match, etc.)&lt;/li&gt;
&lt;li&gt;Metrics are purely numerical (accuracy, precision, recall)&lt;/li&gt;
&lt;li&gt;No middle ground between right and wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Language generation throws all of that out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple correct answers exist (paraphrasing, different phrasings, different correct facts)&lt;/li&gt;
&lt;li&gt;Quality is multidimensional (you need to measure faithfulness &lt;em&gt;and&lt;/em&gt; relevance, not just "accuracy")&lt;/li&gt;
&lt;li&gt;Hallucinations look like correct answers—that's the whole problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's why LLM evaluation has evolved into a multi-metric framework where you evaluate different dimensions separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  The RAGAS Framework: Your Evaluation Toolkit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAGAS&lt;/strong&gt; (Retrieval Augmented Generation Assessment) is the most popular open-source framework for evaluating RAG systems. It provides a suite of metrics that work &lt;em&gt;without&lt;/em&gt; requiring ground truth labels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-to-evaluate-llm-outputs-beyond-looks-good-to-me%2Fmermaid-cd38f5b486c86c719d4cda7bfab20854.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-to-evaluate-llm-outputs-beyond-looks-good-to-me%2Fmermaid-cd38f5b486c86c719d4cda7bfab20854.png" alt="Mermaid Diagram" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Metrics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt; — Does the answer contain hallucinations?&lt;/p&gt;

&lt;p&gt;The metric works like this: an LLM extracts all &lt;em&gt;claims&lt;/em&gt; made in the answer. Then it checks each claim against the retrieved context. If a claim isn't supported by the context, it's hallucinated.&lt;/p&gt;

&lt;p&gt;Score: 0–1, where 1 means "everything is supported by the retrieved context."&lt;/p&gt;

&lt;p&gt;Why it matters: This catches the sneaky case where your LLM generates grammatically perfect nonsense.&lt;/p&gt;
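&lt;p&gt;A toy version of the idea, using substring matching where real RAGAS uses an LLM to extract and verify claims:&lt;/p&gt;

```python
def faithfulness(claims, context):
    # Fraction of answer claims supported by the retrieved context.
    supported = sum(1 for c in claims if c.lower() in context.lower())
    return supported / len(claims)

context = "The Eiffel Tower is in Paris. It was completed in 1889."
claims = ["is in Paris", "completed in 1889", "is 500 meters tall"]
print(round(faithfulness(claims, context), 2))  # 0.67: two of three supported
```

&lt;p&gt;The unsupported height claim is exactly the kind of confident fabrication this metric is built to flag.&lt;/p&gt;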

&lt;p&gt;&lt;strong&gt;Answer Relevancy&lt;/strong&gt; — Is the answer actually relevant to the user's question?&lt;/p&gt;

&lt;p&gt;Instead of asking a human "is this relevant?", RAGAS generates multiple synthetic questions from the answer using the LLM, then measures how similar those questions are to the original query.&lt;/p&gt;

&lt;p&gt;Score: 0–1, where 1 means "the answer is directly answering what was asked."&lt;/p&gt;

&lt;p&gt;Why it matters: An answer can be faithful to the context &lt;em&gt;and&lt;/em&gt; completely miss what the user wanted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Precision&lt;/strong&gt; — Are the most useful chunks ranked first?&lt;/p&gt;

&lt;p&gt;When you retrieve 10 documents, is the most relevant one at position 1? Or buried at position 7? This metric measures whether the retriever ranked things in the right order.&lt;/p&gt;

&lt;p&gt;Score: 0–1, where 1 means "every retrieved chunk is relevant."&lt;/p&gt;

&lt;p&gt;Why it matters: If your LLM has to wade through junk to find useful context, it'll get confused or hallucinate.&lt;/p&gt;
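&lt;p&gt;A minimal sketch of the ranking math, assuming hand-assigned binary relevance labels (RAGAS infers relevance with an LLM): compute precision@k at each position that holds a relevant chunk, then average.&lt;/p&gt;

```python
# Toy context precision: for each retrieved chunk, compute precision@k
# (relevant chunks so far / k), and average it over the positions that
# actually hold a relevant chunk. Earlier-ranked hits score higher.

def context_precision(relevance):
    """relevance: list of 1/0 flags in retrieval order."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        score += (hits / k) * rel   # only count positions with a hit
    return score / total_relevant

print(context_precision([1, 1, 0, 1]))  # relevant chunks ranked early
print(context_precision([0, 0, 1, 1]))  # relevant chunks buried late
```

&lt;p&gt;The same two relevant chunks score very differently depending on where the retriever ranked them, which is exactly what this metric is designed to expose.&lt;/p&gt;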

&lt;p&gt;&lt;strong&gt;Context Recall&lt;/strong&gt; — Did you retrieve everything needed?&lt;/p&gt;

&lt;p&gt;This asks: given the correct answer, how much of the supporting context did the retriever actually find?&lt;/p&gt;

&lt;p&gt;Score: 0–1, where 1 means "you got all the context needed to answer correctly."&lt;/p&gt;

&lt;p&gt;Why it matters: If key context is missing, your LLM can't answer well, no matter how good it is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting It Together
&lt;/h3&gt;

&lt;p&gt;You're not checking one metric. You're checking four dimensions of a single evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faithfulness measures &lt;em&gt;hallucinations&lt;/em&gt; → low faithfulness = your LLM is making things up&lt;/li&gt;
&lt;li&gt;Answer relevancy measures &lt;em&gt;understanding the question&lt;/em&gt; → low relevancy = wrong answer, right format&lt;/li&gt;
&lt;li&gt;Context precision measures &lt;em&gt;retriever ranking&lt;/em&gt; → low precision = retriever is mixing junk with gold&lt;/li&gt;
&lt;li&gt;Context recall measures &lt;em&gt;completeness&lt;/em&gt; → low recall = retriever missed important context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A healthy RAG system has all four scores high. If faithfulness is low, you have a hallucination problem. If recall is low, your retriever is weak.&lt;/p&gt;
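&lt;p&gt;The diagnosis mapping above can be sketched as a small lookup. The 0.8 threshold and the score values are illustrative, not official guidance:&lt;/p&gt;

```python
# Turn the four scores into a diagnosis, mirroring the bullet mapping
# above. Thresholds and example scores are invented for illustration.

DIAGNOSES = {
    "faithfulness": "model is hallucinating unsupported claims",
    "answer_relevancy": "answer misses what the user asked",
    "context_precision": "retriever is mixing junk with gold",
    "context_recall": "retriever missed important context",
}

def diagnose(scores, threshold=0.8):
    problems = []
    for metric, hint in DIAGNOSES.items():
        gap = threshold - scores.get(metric, 0.0)
        if max(0.0, gap):          # positive gap: score fell below threshold
            problems.append(f"{metric} low: {hint}")
    return problems or ["all four dimensions look healthy"]

scores = {"faithfulness": 0.92, "answer_relevancy": 0.88,
          "context_precision": 0.85, "context_recall": 0.41}
print(diagnose(scores))  # flags the weak retriever recall
```

&lt;p&gt;Here only context recall falls below the threshold, so the output points you at the retriever rather than the LLM.&lt;/p&gt;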

&lt;h2&gt;
  
  
  Beyond RAGAS: LLM-as-Judge
&lt;/h2&gt;

&lt;p&gt;RAGAS is great for RAG systems specifically. But what if your system is more general? What if you're not using retrieval?&lt;/p&gt;

&lt;p&gt;That's where &lt;strong&gt;LLM-as-Judge&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;The idea is simple: use a powerful LLM (such as GPT-4) as a judge. You prompt it to score another model's outputs on dimensions like helpfulness, correctness, faithfulness, or safety.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Judge prompt (simplified):
"You are an expert evaluator. The user asked: [QUERY]
The system responded: [ANSWER]
Rate the response from 1-10 on correctness, helpfulness, and truthfulness.
Explain your reasoning."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No ground truth needed&lt;/li&gt;
&lt;li&gt;Works for any task (not just RAG)&lt;/li&gt;
&lt;li&gt;Can evaluate complex, nuanced quality&lt;/li&gt;
&lt;li&gt;Aligns with human judgment (studies report over 80% agreement between GPT-4 judges and human raters)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Costs money (you're calling an LLM to evaluate another LLM)&lt;/li&gt;
&lt;li&gt;Can inherit biases from the judge (GPT-4 has position bias, verbosity bias, self-enhancement bias)&lt;/li&gt;
&lt;li&gt;Prompt wording matters &lt;em&gt;a lot&lt;/em&gt;—small changes in phrasing can shift scores by 10-15%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Use Chain-of-Thought prompting with your judge. Ask it to explain its reasoning step-by-step before assigning a score. This improves reliability by 10-15% and gives you a debuggable reasoning trail.&lt;/p&gt;
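&lt;p&gt;A minimal sketch of the plumbing around a judge call. The "Reasoning:"/"Score:" reply format is an assumption for illustration, not a standard; in practice the prompt goes to your LLM provider's API and you parse whatever format you asked for:&lt;/p&gt;

```python
import re

# Build a Chain-of-Thought judge prompt and parse the reply. The
# reply format below ("Reasoning: ... / Score: N") is an invented
# convention for this sketch.

def judge_prompt(query, answer):
    return (
        "You are an expert evaluator. Think step by step.\n"
        f"The user asked: {query}\n"
        f"The system responded: {answer}\n"
        "First write 'Reasoning:' and explain your thinking.\n"
        "Then write 'Score:' followed by an integer from 1 to 10."
    )

def parse_judgment(reply):
    """Extract the score, clamped to 1-10; None if the judge broke format."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    return max(1, min(10, int(match.group(1))))

prompt = judge_prompt("What is RAG?", "Retrieval-augmented generation...")
# ...send prompt to your model of choice, then parse the reply:
reply = "Reasoning: The answer is accurate and cites the context.\nScore: 9"
print(parse_judgment(reply))
```

&lt;p&gt;Clamping the score and handling malformed replies matters in practice: judges occasionally ignore the requested scale or skip the score line entirely.&lt;/p&gt;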

&lt;h2&gt;
  
  
  Hallucination Detection: The Hard Problem
&lt;/h2&gt;

&lt;p&gt;Here's the truth: hallucinations are &lt;em&gt;hard&lt;/em&gt; to detect automatically.&lt;/p&gt;

&lt;p&gt;Your LLM generates a paragraph that sounds completely plausible. It cites no sources (because it made it up). It's grammatically perfect. How do you know it's wrong without checking every fact manually?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-to-evaluate-llm-outputs-beyond-looks-good-to-me%2Fmermaid-905f3af0b7c26ba371447ec17ca99415.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-to-evaluate-llm-outputs-beyond-looks-good-to-me%2Fmermaid-905f3af0b7c26ba371447ec17ca99415.png" alt="Mermaid Diagram" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recent approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-Consistency Methods&lt;/strong&gt; (SelfCheckGPT):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate the same answer multiple times with different random seeds&lt;/li&gt;
&lt;li&gt;If the answer is consistent across generations, it's probably faithful&lt;/li&gt;
&lt;li&gt;If it varies wildly each time, it's probably hallucinated&lt;/li&gt;
&lt;li&gt;This works because factual claims are stable; hallucinations drift&lt;/li&gt;
&lt;/ul&gt;
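&lt;p&gt;Here is a toy version of that idea, using plain word overlap as the consistency signal. Real SelfCheckGPT uses stronger comparisons (NLI, question answering); this just shows the shape of the check. The sample answers are invented:&lt;/p&gt;

```python
# Sample the same question several times, then measure how much the
# answers' content overlaps. Stable facts reappear; hallucinations drift.

def word_overlap(a, b):
    """Jaccard similarity between the word sets of two answers."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa.union(wb)
    if not union:
        return 1.0
    return len(wa.intersection(wb)) / len(union)

def consistency_score(samples):
    """Mean pairwise overlap across all sampled answers."""
    n = len(samples)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 1.0
    return sum(word_overlap(samples[i], samples[j]) for i, j in pairs) / len(pairs)

stable = ["Paris is the capital of France."] * 3
drifting = ["It was founded in 1912.", "It opened around 1987.",
            "The launch happened in 2003."]
print(consistency_score(stable))    # identical samples: fully consistent
print(consistency_score(drifting))  # facts drift between samples
```

&lt;p&gt;A low consistency score doesn't prove hallucination, but it's a cheap signal for which outputs deserve a closer look.&lt;/p&gt;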

&lt;p&gt;&lt;strong&gt;Token Probability Methods&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Look at the model's internal confidence scores during generation&lt;/li&gt;
&lt;li&gt;If the model assigns low probability to its own words, something's off&lt;/li&gt;
&lt;li&gt;This doesn't always work—some hallucinations are high-confidence&lt;/li&gt;
&lt;/ul&gt;
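&lt;p&gt;A sketch of the token-probability check. The (token, probability) pairs here are made up; in practice you would read logprobs from your inference API, if it exposes them:&lt;/p&gt;

```python
import math

# Flag tokens the model itself assigned low probability to, and
# compute the mean log-probability of the whole generation. "Zurbania"
# is an invented low-confidence token for illustration.

def low_confidence_tokens(token_probs, floor=0.1):
    """Return tokens whose generation probability fell under the floor."""
    return [tok for tok, p in token_probs if max(0.0, floor - p)]

def mean_logprob(token_probs):
    return sum(math.log(p) for _, p in token_probs) / len(token_probs)

token_probs = [("The", 0.95), ("capital", 0.90), ("is", 0.97),
               ("Zurbania", 0.03)]
print(low_confidence_tokens(token_probs))
print(round(mean_logprob(token_probs), 3))
```

&lt;p&gt;As the caveat above says, this check misses high-confidence hallucinations entirely, so treat it as one signal among several.&lt;/p&gt;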

&lt;p&gt;&lt;strong&gt;Supervised Detection&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train a detector on labeled hallucination data&lt;/li&gt;
&lt;li&gt;Feed it hidden state representations from the LLM&lt;/li&gt;
&lt;li&gt;Let it predict whether a claim is hallucinated&lt;/li&gt;
&lt;li&gt;Works well in-domain; requires new training for new domains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Honest Answer&lt;/strong&gt;: There's no silver bullet. You need multiple approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Faithfulness metric to catch unsupported claims&lt;/li&gt;
&lt;li&gt;Self-consistency checks for flagrant hallucinations&lt;/li&gt;
&lt;li&gt;Human spot-checking on high-stakes domains&lt;/li&gt;
&lt;li&gt;Reference-based metrics (comparing output to ground truth) when you have labels&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Common Evaluation Mistakes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mistake 1: Relying on a Single Metric
&lt;/h3&gt;

&lt;p&gt;"Our faithfulness score is 0.92—we're good!"&lt;/p&gt;

&lt;p&gt;No. Faithfulness only tells you about hallucinations. Your answer could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faithful but irrelevant (addresses the wrong question)&lt;/li&gt;
&lt;li&gt;Faithful and relevant but missing half the context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Evaluate all dimensions. If any dimension is weak, you have a problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 2: Gaming the Metrics
&lt;/h3&gt;

&lt;p&gt;You optimize for high RAGAS scores, so you make your retrieved context smaller (fewer chunks = easier for the LLM to be faithful). Now your scores are great, but your answers miss important details.&lt;/p&gt;

&lt;p&gt;Or you use a judge that's biased toward verbose, confident-sounding answers, so your system generates fluff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trap&lt;/strong&gt;: High metrics ≠ good product. You still need human evaluation on real user queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 3: Forgetting About Domain Shift
&lt;/h3&gt;

&lt;p&gt;You evaluate your system on one domain (e.g., Python tutorials) and get great scores. You ship it to production for a different domain (e.g., medical advice). Suddenly users report hallucinations.&lt;/p&gt;

&lt;p&gt;This happens because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your training data was skewed toward one domain&lt;/li&gt;
&lt;li&gt;Your evaluation framework was calibrated on one domain&lt;/li&gt;
&lt;li&gt;The LLM's behavior changes in new domains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Always evaluate on representative samples from &lt;em&gt;your actual use case&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 4: Ignoring the Prompt
&lt;/h3&gt;

&lt;p&gt;LLM judges are incredibly sensitive to how you phrase the evaluation prompt.&lt;/p&gt;

&lt;p&gt;"Is this answer correct?" gets different results than "Is this answer helpful and accurate?"&lt;br&gt;
"Rate 1-10" gets different results than "Rate excellent/good/fair/poor"&lt;/p&gt;

&lt;p&gt;Test different prompt wordings and see which ones align with your actual needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It Into Practice
&lt;/h2&gt;

&lt;p&gt;Here's a lightweight evaluation workflow for your RAG system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Collect 50-100 real test queries&lt;/strong&gt; from your users or domain experts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate answers&lt;/strong&gt; using your system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run RAGAS metrics&lt;/strong&gt; on all of them

&lt;ul&gt;
&lt;li&gt;Calculate mean faithfulness, relevancy, precision, recall&lt;/li&gt;
&lt;li&gt;Flag any queries with low scores&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spot-check the flagged queries&lt;/strong&gt; manually

&lt;ul&gt;
&lt;li&gt;Read the answer and context&lt;/li&gt;
&lt;li&gt;Verify if the metric agrees with your judgment&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate&lt;/strong&gt; — improve your retriever, prompt, or model based on what you find&lt;/li&gt;
&lt;/ol&gt;
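&lt;p&gt;Steps 3 and 4 can be sketched as a small aggregation pass. The per-query scores below are invented; in practice they come from running RAGAS (or a similar framework) over your test set:&lt;/p&gt;

```python
from statistics import mean

# Aggregate per-query metric scores and flag the queries worth reading
# by hand: mean per metric, plus any query with a score below threshold.

results = [
    {"query": "How do I reset my password?",
     "faithfulness": 0.95, "context_recall": 0.90},
    {"query": "What is the refund window?",
     "faithfulness": 0.55, "context_recall": 0.80},  # likely hallucination
]

def summarize(results, threshold=0.8):
    metrics = [k for k in results[0] if k != "query"]
    means = {m: round(mean(r[m] for r in results), 3) for m in metrics}
    flagged = [r["query"] for r in results
               if any(max(0.0, threshold - r[m]) for m in metrics)]
    return means, flagged

means, flagged = summarize(results)
print(means)
print(flagged)  # spot-check these manually
```

&lt;p&gt;The flagged list is your step-4 reading queue: the handful of queries where the metrics and your own judgment most need to be compared.&lt;/p&gt;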

&lt;p&gt;Tools to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAGAS&lt;/strong&gt; (open source, free, works with any LLM via API)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepEval&lt;/strong&gt; (Python library, supports RAGAS + custom metrics)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Langfuse&lt;/strong&gt; (LLM observability platform with built-in LLM-as-judge)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confident AI&lt;/strong&gt; (commercial, but focuses on evaluation workflows)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Real Win: Debugging
&lt;/h2&gt;

&lt;p&gt;Here's the secret nobody tells you: the real value of metrics isn't the score. It's the &lt;em&gt;debugging information&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;When your faithfulness score is 0.65, you know where to look: the answer contains unsupported claims. Start examining those claims.&lt;/p&gt;

&lt;p&gt;When your context recall is 0.4, you know your retriever is missing stuff. Debug the retriever, not the LLM.&lt;/p&gt;

&lt;p&gt;When answer relevancy is low but everything else is high, you know your prompt is asking the wrong question.&lt;/p&gt;

&lt;p&gt;Metrics are a map. They point you toward the problem. But you still have to solve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Pick one metric that matters to your system (probably faithfulness for RAG)&lt;/li&gt;
&lt;li&gt;Set a threshold (e.g., "we want 0.8+ on all metrics")&lt;/li&gt;
&lt;li&gt;Evaluate your current system&lt;/li&gt;
&lt;li&gt;When scores are low, debug instead of tweaking prompts blindly&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start small. Don't build a 500-metric evaluation dashboard on day one. Evaluate the dimensions that matter most to your users, and add more metrics as you grow.&lt;/p&gt;

&lt;p&gt;And yes, you still need humans. Metrics catch patterns and point you toward problems. But someone has to verify that the metrics are actually measuring what you care about.&lt;/p&gt;

&lt;p&gt;Because in the end, "looks good to me" scales to maybe 100 queries before it breaks. Metrics scale to 100,000. And human judgment backed by metrics? That actually works in production.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Your turn: what does your LLM system get wrong most often? Is it hallucinations, missing context, or something else? Metrics can help you find out.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Author: Shibin&lt;/p&gt;

</description>
      <category>learning</category>
      <category>ai</category>
    </item>
    <item>
      <title>How DeepSeek R1 Shocked the World (And Why It Matters to You)</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:41:43 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/how-deepseek-r1-shocked-the-world-and-why-it-matters-to-you-4db2</link>
      <guid>https://dev.to/thousand_miles_ai/how-deepseek-r1-shocked-the-world-and-why-it-matters-to-you-4db2</guid>
      <description>&lt;p&gt;The underdog story that disrupted AI. 671B parameters, $6M budget, MIT license. How a Chinese startup beat the giants.&lt;/p&gt;




&lt;p&gt;January 20, 2025. A Monday morning in Hangzhou, China.&lt;/p&gt;

&lt;p&gt;A small AI lab called DeepSeek dropped something that made the entire AI industry go quiet.&lt;/p&gt;

&lt;p&gt;They released DeepSeek-R1: a 671-billion-parameter reasoning model that matched or exceeded OpenAI's o1 on most tasks. On AIME (a difficult math competition for 10th-12th graders), R1 scored 79.8%. On MATH (a dataset of problems from mathematics competitions), it scored 97.4%. These are the kinds of numbers that previously belonged only to expensive, closed models.&lt;/p&gt;

&lt;p&gt;Here's the kicker: they built it in two months for less than $6 million.&lt;/p&gt;

&lt;p&gt;And they open-sourced it under the MIT license.&lt;/p&gt;

&lt;p&gt;For free.&lt;/p&gt;

&lt;p&gt;For commercial use.&lt;/p&gt;

&lt;p&gt;The entire AI industry had assumed that building frontier models required billions of dollars, massive research teams, and years of development. DeepSeek just proved them wrong.&lt;/p&gt;

&lt;p&gt;This is the story of how the underdog broke the game. And why, if you're a student or early-career developer in India in 2026, you should care deeply.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;Let's be real: if you're learning AI or building with language models right now, you probably assume that the best tools cost money and come from Silicon Valley.&lt;/p&gt;

&lt;p&gt;DeepSeek-R1 changes that entire equation.&lt;/p&gt;

&lt;p&gt;Suddenly, the best reasoning model in the world is free. The weights are public. The code is public. The architecture is documented. You can run it locally. You can fine-tune it. You can build on top of it.&lt;/p&gt;

&lt;p&gt;This isn't a small thing. This is a paradigm shift.&lt;/p&gt;

&lt;p&gt;In 2025, if you wanted to use frontier AI models, your options were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pay OpenAI $20/million tokens for GPT-4o&lt;/li&gt;
&lt;li&gt;Pay Anthropic for Claude&lt;/li&gt;
&lt;li&gt;Use open models that were 1-2 years behind the frontier&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By early 2026, you can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use DeepSeek-R1 via an API at roughly 95% lower cost than GPT-4o&lt;/li&gt;
&lt;li&gt;Download the weights and run it on your own hardware&lt;/li&gt;
&lt;li&gt;Fine-tune it on your own data&lt;/li&gt;
&lt;li&gt;Contribute improvements back to the community&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For students and early-career developers? This is liberation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Narrative: How Did a Chinese Startup Do This?
&lt;/h2&gt;

&lt;p&gt;To understand why DeepSeek's achievement is shocking, you need to understand the before-times.&lt;/p&gt;

&lt;p&gt;The AI scaling narrative went like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2018-2022:&lt;/strong&gt; "Models need more data and more compute. Those with the most resources win."&lt;/p&gt;

&lt;p&gt;This was true. Google, OpenAI, Meta poured billions into training. The scaling laws held. Bigger compute = better models. The end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2023-2024:&lt;/strong&gt; The assumption became dogma. "You need a billion-dollar budget to compete."&lt;/p&gt;

&lt;p&gt;OpenAI's Sora. Google's Gemini. Meta's Llama. These required massive computational resources. The era of small labs was over. Only Big Tech could innovate.&lt;/p&gt;

&lt;p&gt;Then DeepSeek whispered: "What if we were... actually optimizing things?"&lt;/p&gt;

&lt;p&gt;They looked at the scaling curves and noticed something. The industry was optimizing for training speed, not training cost, throwing compute at problems because it could afford to. What if you optimized for efficiency instead?&lt;/p&gt;

&lt;p&gt;Enter: &lt;strong&gt;Mixture of Experts (MoE)&lt;/strong&gt; architecture.&lt;/p&gt;

&lt;p&gt;Instead of activating all 671 billion parameters for every token, DeepSeek's model only activates 37 billion parameters. The rest sit dormant until needed. Different "experts" handle different types of problems.&lt;/p&gt;

&lt;p&gt;Think of it like having a massive library where you don't need to read every book for every question. You route questions to the expert who knows the most about that topic.&lt;/p&gt;

&lt;p&gt;The result? All the capability of a massive model with 5% of the compute cost.&lt;/p&gt;
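&lt;p&gt;The "library" analogy can be made concrete with a toy router. A gating network scores every expert, softmax turns the scores into weights, and only the top-k experts actually compute. The expert names and gate scores below are invented; real MoE routing happens per token inside the model:&lt;/p&gt;

```python
import math

# Toy mixture-of-experts routing: softmax the gate scores, then run
# only the k highest-weighted experts. The rest stay dormant.

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_scores, k=2):
    """Pick the k experts with the highest gate weight."""
    weights = softmax(gate_scores)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return ranked[:k], weights

experts = ["math", "code", "prose", "history"]
active, weights = route([2.5, 1.8, 0.2, 0.1], k=2)
print([experts[i] for i in active])  # only these experts compute
```

&lt;p&gt;Scale the same idea up and you get the DeepSeek pattern: a huge total parameter count, but only a small slice of it active per token.&lt;/p&gt;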

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-deepseek-r1-shocked-the-world-and-why-it-matters-to-you%2Fmermaid-aa70abead915a2c8e7962b15a9de6636.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-deepseek-r1-shocked-the-world-and-why-it-matters-to-you%2Fmermaid-aa70abead915a2c8e7962b15a9de6636.png" alt="Mermaid Diagram" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Training Secret: Reinforcement Learning First
&lt;/h2&gt;

&lt;p&gt;But there's another part to the DeepSeek story that's almost as important.&lt;/p&gt;

&lt;p&gt;Most models train in two stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Supervised Fine-Tuning (SFT):&lt;/strong&gt; Train on high-quality examples. "Here's how a smart human would answer this question."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reinforcement Learning (RL):&lt;/strong&gt; Improve through reward signals. "Did the model do better?"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The industry sequence: SFT first (to establish baselines), then RL (to polish).&lt;/p&gt;

&lt;p&gt;DeepSeek said: "What if we do massive RL without SFT?"&lt;/p&gt;

&lt;p&gt;They called it &lt;strong&gt;DeepSeek-R1-Zero&lt;/strong&gt;. Train the model primarily through reinforcement learning, without supervised examples telling it the "right" way to reason.&lt;/p&gt;

&lt;p&gt;The model had to figure out reasoning from first principles.&lt;/p&gt;

&lt;p&gt;And it &lt;em&gt;worked&lt;/em&gt;. Remarkably well.&lt;/p&gt;

&lt;p&gt;Here's why this matters: It suggests that reasoning isn't taught—it's incentivized. Give a model the right reward signal and a search space to explore, and it will discover how to reason. You don't need humans showing it examples of perfect reasoning.&lt;/p&gt;
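&lt;p&gt;A sketch of the kind of rule-based reward signal this implies: no human-written reasoning examples, just "was the final answer right, and was the format followed?" The [THINKING]/[RESPONSE] markers echo the transcript style shown later in this article; DeepSeek's actual reward design is documented in the R1 paper and differs in detail:&lt;/p&gt;

```python
# Toy verifiable reward: a small bonus for following the expected
# output format, a large bonus for a correct final answer. Marker
# names and weights are invented for illustration.

def reward(output, gold_answer):
    score = 0.0
    if "[THINKING]" in output and "[RESPONSE]" in output:
        score += 0.2                      # format reward
    final = output.split("[RESPONSE]")[-1].strip().lower()
    if gold_answer.lower() in final:
        score += 1.0                      # correctness reward
    return score

good = "[THINKING] integrate both sides... [RESPONSE] y = x^2 + x + C"
bad = "y = 2x"
print(reward(good, "x^2 + x + C"))
print(reward(bad, "x^2 + x + C"))
```

&lt;p&gt;Because the reward checks outcomes rather than reasoning steps, the model is free to discover whatever chain of thought reaches correct answers, which is the core of the incentivized-not-taught insight.&lt;/p&gt;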

&lt;p&gt;This is a fundamental insight. It changes how we think about training future models.&lt;/p&gt;

&lt;p&gt;And a small lab in Hangzhou figured it out before the multi-billion-dollar labs in San Francisco and Mountain View.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shock to the System
&lt;/h2&gt;

&lt;p&gt;When DeepSeek released R1, the AI industry had several very uncomfortable realizations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Realization 1: Scaling compute wasn't a fundamental requirement. It was a shortcut.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Google, OpenAI, Meta could afford to throw compute at problems. They did. It worked because throwing compute at problems &lt;em&gt;does&lt;/em&gt; work. But it's not the only way. DeepSeek showed that intelligence-per-dollar isn't a function of billions spent. It's a function of creativity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Realization 2: Open source was still winning, just slowly.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The largest open-source model (Meta's Llama 3) was pretty far behind GPT-4o. By releasing DeepSeek-R1 under MIT, the open-source community instantly jumped ahead of Llama. Now researchers everywhere could work with frontier-quality models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Realization 3: Geography doesn't matter anymore.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In 2020, the narrative was: "Silicon Valley and Big Tech have the network, the talent, the funding. You can't compete from outside." DeepSeek disproved this. A lab in China, building autonomously, shipped something that matched or exceeded everything from Silicon Valley.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Realization 4: Cheap inference changes everything.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When your model costs 95% less to run, you can afford to run it in places that previously couldn't use frontier AI. You can fine-tune it cheaply. You can experiment. The barrier to entry drops from millions to thousands.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DeepSeek R1 Actually Does
&lt;/h2&gt;

&lt;p&gt;Let's talk about the actual capability. What makes R1 special?&lt;/p&gt;

&lt;p&gt;Unlike most language models, R1 is a "reasoning model." When you ask it a hard question, it doesn't just stream an answer. It thinks out loud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: Solve this differential equation: dy/dx = 2x + 1

DeepSeek R1: [THINKING]
I need to solve dy/dx = 2x + 1
This is a simple differential equation.
I can integrate both sides...

[RESPONSE]
To solve dy/dx = 2x + 1:
Integrate both sides with respect to x:
y = ∫(2x + 1)dx = x² + x + C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It shows its work. It reasons through problems step-by-step. For mathematical problems, coding problems, and complex logic, this is dramatically more accurate than models that try to stream answers directly.&lt;/p&gt;

&lt;p&gt;On benchmarks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AIME (competition math):&lt;/strong&gt; 79.8% (slightly ahead of o1's reported 79.2%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MATH (math competition problems):&lt;/strong&gt; 97.4% (frontier performance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MMLU (broad knowledge):&lt;/strong&gt; 90.8% (GPT-4o performance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coding:&lt;/strong&gt; Strong performance on competitive programming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But here's the real revelation: R1 is open-source. Public weights. MIT license. Run it anywhere.&lt;/p&gt;

&lt;p&gt;Meanwhile, OpenAI's o1? Closed. $20/million tokens via API. You can't see how it works. You can't run it yourself.&lt;/p&gt;

&lt;p&gt;For students and researchers, this is revolutionary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ripple Effects in 2026
&lt;/h2&gt;

&lt;p&gt;In the year after release, the ecosystem transformed in waves:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wave 1: Distillation Models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Teams immediately started distilling R1 into smaller models. "If we can capture 80% of R1's reasoning ability in a 7B or 13B model, we can run it on laptops."&lt;/p&gt;

&lt;p&gt;By month two, we had open-source reasoning models at various sizes. Not as good as R1, but genuinely useful. All free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wave 2: Rapid Iteration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The entire open-source community moved faster. Researchers published papers on improving reasoning. Teams fine-tuned R1 for specific domains (medical reasoning, code generation, creative writing). Without waiting for OpenAI to release the next model, people were already building on top of R1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wave 3: Cost Collapse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DeepSeek's API (95% cheaper than OpenAI) forced the entire industry to reconsider pricing. By January 2026, everyone was dropping prices. What cost $20 per million tokens now costs $0.50.&lt;/p&gt;

&lt;p&gt;For startups? For students? For countries where AI was previously inaccessible? Suddenly affordable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wave 4: Chinese AI Ecosystem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The narrative shifted. China wasn't just copying Western AI. They were innovating faster, cheaper, better. Investment flowed into Chinese AI labs. The ecosystem that produced DeepSeek started producing other frontier models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters If You're Starting Your Career
&lt;/h2&gt;

&lt;p&gt;Here's the real-world impact for you:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. You Can Afford Frontier AI
&lt;/h3&gt;

&lt;p&gt;If you're building a startup or learning AI, you can now use the best reasoning models for a fraction of what it cost six months ago. Your laptop-based experiments aren't competing on a budget—they're competitive with well-funded labs.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. You Can Run Models Locally
&lt;/h3&gt;

&lt;p&gt;You don't need cloud access. You don't need to depend on API availability. Download DeepSeek-R1 and run it on your hardware. This is freedom. This is sovereignty over your tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Open Source Is Winning Again
&lt;/h3&gt;

&lt;p&gt;For the last few years, the frontier was in closed models. Suddenly, open-source is at the frontier. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can read the code&lt;/li&gt;
&lt;li&gt;You can contribute improvements&lt;/li&gt;
&lt;li&gt;You can build on top of it without worrying about API deprecations&lt;/li&gt;
&lt;li&gt;You can fine-tune it for your use case&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Geography Doesn't Matter
&lt;/h3&gt;

&lt;p&gt;DeepSeek proved you don't need to be in Silicon Valley to build world-class AI. You can be in Bangalore. You can be in Manila. You can be in Lagos. With internet and talent, you can compete at the frontier.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Thinking About Fundamentals Pays Off
&lt;/h3&gt;

&lt;p&gt;DeepSeek's success wasn't about throwing money at the problem. It was about creative thinking. Mixture of Experts. Different training approaches. Understanding the math deeply.&lt;/p&gt;

&lt;p&gt;For you starting your career: the advantage isn't money. It's understanding. Deep knowledge of fundamentals beats budget every time. DeepSeek proved it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-deepseek-r1-shocked-the-world-and-why-it-matters-to-you%2Fmermaid-559628432627824f9f4f9ded4dbf0ef8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fhow-deepseek-r1-shocked-the-world-and-why-it-matters-to-you%2Fmermaid-559628432627824f9f4f9ded4dbf0ef8.png" alt="Mermaid Diagram" width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Plot Twist: What Happens Next?
&lt;/h2&gt;

&lt;p&gt;This is early 2026, roughly a year after DeepSeek-R1's release.&lt;/p&gt;

&lt;p&gt;The question now is: does this become the standard? Does the AI industry permanently shift toward efficiency and open-source? Or was this a one-time breakthrough?&lt;/p&gt;

&lt;p&gt;My bet? DeepSeek made something irreversible happen.&lt;/p&gt;

&lt;p&gt;Once you've proven that frontier models can be built for a fraction of the assumed cost, you can't un-prove it. Every lab will now try to optimize for efficiency. Every competitor will need to match those costs.&lt;/p&gt;

&lt;p&gt;The golden age of billion-dollar training runs might be over.&lt;/p&gt;

&lt;p&gt;But here's the thing that worries the incumbents: DeepSeek also proved that raw spending doesn't guarantee dominance. Strategy, creativity, and understanding fundamentals matter more.&lt;/p&gt;

&lt;p&gt;For the open-source community? For students? For people outside Silicon Valley?&lt;/p&gt;

&lt;p&gt;This changes everything.&lt;/p&gt;

&lt;p&gt;You're not competing against a wall of money anymore. You're competing against intelligence, creativity, and determination.&lt;/p&gt;

&lt;p&gt;And those are things you can build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sign-Off: The Underdog Story Isn't Over
&lt;/h2&gt;

&lt;p&gt;DeepSeek's story is an underdog narrative in real-time. A small lab. A focused mission. A creative solution. And they won.&lt;/p&gt;

&lt;p&gt;Not by playing the game the incumbents set up. But by changing the rules entirely.&lt;/p&gt;

&lt;p&gt;If you're starting your career in 2026, take note. The lesson isn't "use DeepSeek." The lesson is: "the people who win aren't the ones with the most resources. They're the ones who think differently."&lt;/p&gt;

&lt;p&gt;DeepSeek thought about mixture of experts when everyone else thought about scale. They trained with RL-first when everyone else did SFT-first. They open-sourced when everyone else locked things down.&lt;/p&gt;

&lt;p&gt;Different thinking wins.&lt;/p&gt;

&lt;p&gt;Use DeepSeek-R1 if it serves your project. Build on top of open-source reasoning models. Fine-tune them for your domain. Contribute improvements back to the community.&lt;/p&gt;

&lt;p&gt;But more importantly: think about where the incumbents are wrong. Where are they assuming things that aren't true? Where could you approach the problem differently?&lt;/p&gt;

&lt;p&gt;That's where the next decade of innovation lives.&lt;/p&gt;

&lt;p&gt;And it might come from someone sitting in their apartment in Bangalore, not from some well-funded lab in San Francisco.&lt;/p&gt;

&lt;p&gt;The age of gatekeeping AI is over.&lt;/p&gt;

&lt;p&gt;Now it's just about being creative.&lt;/p&gt;




&lt;h2&gt;
  
  
  References &amp;amp; Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/deepseek-ai/DeepSeek-R1" rel="noopener noreferrer"&gt;DeepSeek-R1 GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1" rel="noopener noreferrer"&gt;DeepSeek-R1 on Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.clarifai.com/blog/top-10-open-source-reasoning-models-in-2026" rel="noopener noreferrer"&gt;Top 10 Open-source Reasoning Models in 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/news/news250120" rel="noopener noreferrer"&gt;DeepSeek-R1 Release - API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fireworks.ai/blog/deepseek-r1-deepdive" rel="noopener noreferrer"&gt;DeepSeek-R1 Deep Dive - Architecture &amp;amp; Capabilities&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.capmad.com/technology-en/deepseek-r1-one-year-later-china-dominates-open-source-ai-in-2026/" rel="noopener noreferrer"&gt;DeepSeek-R1 One Year Later: China Dominates Open Source AI in 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/blog/open-r1" rel="noopener noreferrer"&gt;Open-R1: A Fully Open Reproduction of DeepSeek-R1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Author: Shibin&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>technology</category>
    </item>
    <item>
      <title>Context Engineering Is the New Prompt Engineering</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:36:36 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/context-engineering-is-the-new-prompt-engineering-5akp</link>
      <guid>https://dev.to/thousand_miles_ai/context-engineering-is-the-new-prompt-engineering-5akp</guid>
      <description>&lt;p&gt;How CLAUDE.md files and structured context are transforming AI coding. One file to rule them all.&lt;/p&gt;




&lt;p&gt;Last year, prompt engineering was the hot skill. Craft the perfect prompt. Use the right magic words. Add "step by step." Get better results.&lt;/p&gt;

&lt;p&gt;That era is over.&lt;/p&gt;

&lt;p&gt;In 2026, the teams getting dramatically better results from AI aren't tinkering with prompts. They're architecting context. They're creating &lt;code&gt;.claude/CLAUDE.md&lt;/code&gt; files that sit at the root of their projects. They're encoding their entire system architecture, coding standards, and project rules into a single structured document that the AI reads before every response.&lt;/p&gt;

&lt;p&gt;It's not flashier than prompt engineering. It's just more effective.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;Picture this: You're working with Claude on a feature. You give it a task. It generates code that's syntactically perfect but doesn't match your project's patterns. It uses &lt;code&gt;styled-components&lt;/code&gt; when your codebase uses Tailwind. It puts business logic in components when you have a strict repository pattern. It ignores your testing standards.&lt;/p&gt;

&lt;p&gt;So you say: "We use Tailwind, not styled-components."&lt;/p&gt;

&lt;p&gt;Claude fixes it.&lt;/p&gt;

&lt;p&gt;You say: "Put business logic in repositories, not components."&lt;/p&gt;

&lt;p&gt;Claude adjusts.&lt;/p&gt;

&lt;p&gt;You say: "Add unit tests using Vitest."&lt;/p&gt;

&lt;p&gt;Claude adds them.&lt;/p&gt;

&lt;p&gt;You've just engaged in five rounds of back-and-forth context negotiation. You've burned tokens. You've wasted time. And you're frustrated.&lt;/p&gt;

&lt;p&gt;Now imagine a different scenario. The project has a &lt;code&gt;CLAUDE.md&lt;/code&gt; file. It contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your exact tech stack&lt;/li&gt;
&lt;li&gt;Your styling approach and typography system&lt;/li&gt;
&lt;li&gt;Your folder structure and where to put things&lt;/li&gt;
&lt;li&gt;Your testing framework and where tests live&lt;/li&gt;
&lt;li&gt;Your database access patterns&lt;/li&gt;
&lt;li&gt;Your component rules&lt;/li&gt;
&lt;li&gt;Your exact conventions for naming, formatting, and architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Claude reads this file once at the start. Now when you ask for a feature, it generates code that fits your project perfectly. First try.&lt;/p&gt;

&lt;p&gt;Teams that adopt this approach report dramatically fewer hallucinated APIs, far less wrong-convention code, and far more first-try-correct output than prompt engineering alone delivers. This isn't magic wording. This is information architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Engineering Manifesto
&lt;/h2&gt;

&lt;p&gt;Context engineering is broader than prompt engineering. It's not just about what you say—it's about what the AI &lt;em&gt;knows&lt;/em&gt; before you say anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Engineering:&lt;/strong&gt; "Write a function that validates email addresses"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Engineering:&lt;/strong&gt; The AI reads your CLAUDE.md, which says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You use TypeScript&lt;/li&gt;
&lt;li&gt;All validators go in &lt;code&gt;lib/validators/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use Zod for validation&lt;/li&gt;
&lt;li&gt;Tests go in &lt;code&gt;tests/unit/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Follow existing validator patterns in the codebase&lt;/li&gt;
&lt;li&gt;Export both &lt;code&gt;validate*&lt;/code&gt; function and type &lt;code&gt;*Error&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now when you ask: "Write an email validator," the AI generates code that already fits perfectly into your system.&lt;/p&gt;

&lt;p&gt;Think of it as the difference between giving someone directions ("turn left at the big oak tree") versus giving them a GPS with your entire city mapped out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uuock4s1209tcu1wm7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uuock4s1209tcu1wm7m.png" alt="Mermaid Diagram" width="800" height="1033"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Anatomy of a Good CLAUDE.md File
&lt;/h2&gt;

&lt;p&gt;Let me break down what goes into an effective context file. Here's the structure:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Project Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project Name&lt;/span&gt;

One sentence what this is. Tech stack. Key constraints.
&lt;span class="gt"&gt;
&amp;gt; Next.js + Tailwind CSS + Supabase + Drizzle ORM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? The AI needs to know the boundaries of the project before it starts working.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Mandatory Completion Checklist
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## MANDATORY: Completion Checklist&lt;/span&gt;

DO NOT declare any task complete until:
&lt;span class="p"&gt;1.&lt;/span&gt; Write tests — Unit/Integration/E2E as applicable
&lt;span class="p"&gt;2.&lt;/span&gt; Run &lt;span class="sb"&gt;`npm test`&lt;/span&gt; — All tests pass
&lt;span class="p"&gt;3.&lt;/span&gt; Run &lt;span class="sb"&gt;`npm run build`&lt;/span&gt; — Build succeeds
&lt;span class="p"&gt;4.&lt;/span&gt; Verify — If any fail, fix it before summarizing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? This forces the AI (and you) to have quality gates. Tasks don't end until they're actually done.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Rules at a Glance
&lt;/h3&gt;

&lt;p&gt;Quick reference table. Styling approach. Where components go. How to structure files. What frameworks to use for what.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| Situation | Action |
| Page title | &lt;span class="sb"&gt;`&amp;lt;h1 className="text-heading-32"&amp;gt;`&lt;/span&gt; |
| Card styling | Direct Tailwind, no component |
| Database queries | Drizzle repositories in &lt;span class="sb"&gt;`src/lib/repositories/drizzle/`&lt;/span&gt; |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? The AI can scan this table before generating code and match every pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Styling System
&lt;/h3&gt;

&lt;p&gt;Exactly how typography works in your project. What classes exist. When to use what.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Purpose-Based Typography&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="sb"&gt;`text-heading-32`&lt;/span&gt;: Page titles, 2rem, 600 weight
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`text-heading-24`&lt;/span&gt;: Section headers, 1.5rem, 600 weight
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`text-copy-16`&lt;/span&gt;: Body text, 1rem, 400 weight, relaxed leading
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`text-label-14`&lt;/span&gt;: UI labels, 0.875rem, 500 weight
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? Otherwise AI generates random class names or uses hardcoded pixel values instead of your system.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Component Rules
&lt;/h3&gt;

&lt;p&gt;When to create components. What goes in components. What doesn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## DO NOT Create Components For&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Styling only — use Tailwind directly
&lt;span class="p"&gt;-&lt;/span&gt; Single-use layouts — inline the classes
&lt;span class="p"&gt;-&lt;/span&gt; Wrapper components

&lt;span class="gu"&gt;## DO Create Components When&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Behavior/State exists (toggle, form, async actions)
&lt;span class="p"&gt;-&lt;/span&gt; Used 3+ times identically
&lt;span class="p"&gt;-&lt;/span&gt; Requires accessibility logic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? Without this, you get components for everything. Bloated folders. Prop chaos.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Database Access Patterns
&lt;/h3&gt;

&lt;p&gt;Exactly how to access data. Where repositories live. How to structure queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Database — Use Drizzle Repositories&lt;/span&gt;

ALL database queries go in &lt;span class="sb"&gt;`src/lib/repositories/drizzle/`&lt;/span&gt;.
Never use Supabase client directly for tables.
ONLY use Supabase client for &lt;span class="sb"&gt;`supabase.auth.*`&lt;/span&gt; and &lt;span class="sb"&gt;`supabase.storage.*`&lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? Database patterns are architectural decisions, not suggestions. The AI needs to know the non-negotiables.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Testing Structure
&lt;/h3&gt;

&lt;p&gt;Where tests go. What frameworks. What counts as tested.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Testing — Mandatory&lt;/span&gt;

| Change Type | Unit | Integration | E2E |
| --- | --- | --- | --- |
| New utility | REQUIRED | — | — |
| New API endpoint | REQUIRED | REQUIRED | If user-facing |
| New feature | REQUIRED | If API | REQUIRED |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? Different changes need different testing levels. Without this, the AI either over-tests or under-tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Project Structure
&lt;/h3&gt;

&lt;p&gt;Where things go. Folder conventions. What each folder contains.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Source Structure&lt;/span&gt;

src/
├── components/
│   ├── ui/          # shadcn/ui (behavior-focused)
│   └── constructs/  # Reusable patterns
├── app/
│   └── (protected)/
├── lib/
│   ├── repositories/  # ALL database queries
│   └── utils/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? The AI needs to know exactly where to put new code. Without this, files go everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Engineering vs. Prompt Engineering: A Real Example
&lt;/h2&gt;

&lt;p&gt;Let's see the difference in action.&lt;/p&gt;

&lt;h3&gt;
  
  
  Without Context Engineering
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;You:&lt;/strong&gt; "Create a course enrollment component"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude generates:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;styled&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;styled-components&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CourseCard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;styled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="s2"&gt;`
  padding: 24px;
  border-radius: 8px;
  background: #f5f5f5;
  &amp;amp;:hover { background: #e0e0e0; }
`&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;EnrollButton&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Enroll Now&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses styled-components (wrong—you use Tailwind)&lt;/li&gt;
&lt;li&gt;No TypeScript types&lt;/li&gt;
&lt;li&gt;Component in wrong location&lt;/li&gt;
&lt;li&gt;No test file&lt;/li&gt;
&lt;li&gt;Hardcoded colors instead of semantic values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You:&lt;/strong&gt; "We use Tailwind, not styled-components. Put it in &lt;code&gt;components/constructs/&lt;/code&gt;. Add types."&lt;/p&gt;

&lt;p&gt;Back-and-forth negotiation begins.&lt;/p&gt;

&lt;h3&gt;
  
  
  With Context Engineering
&lt;/h3&gt;

&lt;p&gt;Your CLAUDE.md says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; Styling: Use Tailwind directly. No CSS-in-JS.
&lt;span class="p"&gt;-&lt;/span&gt; Typography: Use purpose-based classes (text-heading-&lt;span class="ge"&gt;*, text-copy-*&lt;/span&gt;)
&lt;span class="p"&gt;-&lt;/span&gt; Components: Only for behavior. Styling = Tailwind directly.
&lt;span class="p"&gt;-&lt;/span&gt; Constructs: Reusable patterns go in &lt;span class="sb"&gt;`components/constructs/`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Testing: Unit tests in &lt;span class="sb"&gt;`tests/unit/`&lt;/span&gt;, E2E in &lt;span class="sb"&gt;`tests/e2e/`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Naming: Kebab-case filenames
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;You:&lt;/strong&gt; "Create a course enrollment component"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude generates:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// components/constructs/course-enrollment-button.tsx&lt;/span&gt;
&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;CourseEnrollmentButtonProps&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;courseId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;onEnroll&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;CourseEnrollmentButton&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;courseId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;onEnroll&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}:&lt;/span&gt; &lt;span class="nx"&gt;CourseEnrollmentButtonProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setPending&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// ... implementation using Tailwind&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus a test file. Plus proper types. First try.&lt;/p&gt;

&lt;p&gt;The only difference? The AI read your CLAUDE.md.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Own Context Engineering System
&lt;/h2&gt;

&lt;p&gt;If you're running an AI-heavy project, here's how to start:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Audit Your Patterns
&lt;/h3&gt;

&lt;p&gt;Spend a day documenting how your project actually works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you style things?&lt;/li&gt;
&lt;li&gt;Where do files go?&lt;/li&gt;
&lt;li&gt;What naming conventions do you use?&lt;/li&gt;
&lt;li&gt;What's your testing approach?&lt;/li&gt;
&lt;li&gt;What frameworks are non-negotiable?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Create &lt;code&gt;.claude/CLAUDE.md&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Write this down. Be specific. Use examples. Include the annoying edge cases that the AI will probably get wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/project-root/
├── .claude/
│   └── CLAUDE.md
├── src/
├── tests/
└── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Reference It in Your Prompts
&lt;/h3&gt;

&lt;p&gt;Just mention it: "Before you start, check the CLAUDE.md file in &lt;code&gt;.claude/&lt;/code&gt; for project rules and conventions."&lt;/p&gt;

&lt;p&gt;Or let the AI infer it. Many modern AI systems automatically scan for context files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Iterate
&lt;/h3&gt;

&lt;p&gt;The first version won't be perfect. When the AI does something wrong, add a rule to CLAUDE.md. When you notice a pattern, document it. Over time, your context file becomes more powerful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fcontext-engineering-is-the-new-prompt-engineering%2Fmermaid-8b9b612077df6a7720d48883b4a3a03e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fcontext-engineering-is-the-new-prompt-engineering%2Fmermaid-8b9b612077df6a7720d48883b4a3a03e.png" alt="Mermaid Diagram" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Teams Winning With Context Engineering
&lt;/h2&gt;

&lt;p&gt;Companies and open-source projects using context engineering in 2026 are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shipping faster&lt;/strong&gt; — First-try accuracy means fewer iterations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintaining consistency&lt;/strong&gt; — The CLAUDE.md becomes a source of truth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Onboarding new team members&lt;/strong&gt; — Read the CLAUDE.md, understand the project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training junior developers&lt;/strong&gt; — Your coding standards are documented, not tribal knowledge&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Some teams even use context files for other purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Onboarding documentation&lt;/strong&gt; — New developers read CLAUDE.md to understand the project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review guidelines&lt;/strong&gt; — The CLAUDE.md &lt;em&gt;is&lt;/em&gt; the standard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI training data&lt;/strong&gt; — Structure your project around the context file, get better AI results&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Sign-Off: Prompt Engineering Was Just the First Stage
&lt;/h2&gt;

&lt;p&gt;Here's what's happened: Prompt engineering was the first wave. "How do I get the AI to understand what I want?"&lt;/p&gt;

&lt;p&gt;Context engineering is the second wave. "How do I structure my entire project so the AI automatically understands?"&lt;/p&gt;

&lt;p&gt;This isn't about being clever with words. It's about being systematic with information. You're not asking the AI to read your mind. You're handing it a map to your entire codebase's rules.&lt;/p&gt;

&lt;p&gt;In 2026, the competitive advantage isn't who can write the best prompts. It's who can structure the clearest context.&lt;/p&gt;

&lt;p&gt;Start small. Pick one project. Audit your patterns. Write a CLAUDE.md file. Use it for a week. See how many rounds of back-and-forth disappear.&lt;/p&gt;

&lt;p&gt;That's context engineering. And it's the present, not the future.&lt;/p&gt;




&lt;h2&gt;
  
  
  References &amp;amp; Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.promptingguide.ai/guides/context-engineering-guide" rel="noopener noreferrer"&gt;Context Engineering Guide - Prompt Engineering Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sombrainc.com/blog/ai-context-engineering-guide" rel="noopener noreferrer"&gt;AI Context Engineering in 2026: Beyond Prompt Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/beyond-prompting-the-power-of-context-engineering/" rel="noopener noreferrer"&gt;The Power of Context Engineering - Towards Data Science&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2026/01/master-prompt-engineering/" rel="noopener noreferrer"&gt;Prompt Engineering Guide 2026 - Analytics Vidhya&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Author: Shibin&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Running LLMs on Your Laptop Without a $10K GPU</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:32:34 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/running-llms-on-your-laptop-without-a-10k-gpu-3a30</link>
      <guid>https://dev.to/thousand_miles_ai/running-llms-on-your-laptop-without-a-10k-gpu-3a30</guid>
      <description>&lt;p&gt;Practical guide to running production-ready LLMs locally using Ollama, llama.cpp, and quantization. No GPU cluster required.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup: Your Laptop as an AI Powerhouse
&lt;/h2&gt;

&lt;p&gt;You're sitting in your college hostel. Your friend won't stop talking about how they're building this incredible LLM app, but they're stuck because cloud API costs are bleeding them dry. &lt;code&gt;$0.002 per 1K tokens&lt;/code&gt; adds up fast when you're iterating, testing, and frankly, making mistakes.&lt;/p&gt;

&lt;p&gt;Then you mention: "I just spun up a 7B model on my MacBook."&lt;/p&gt;

&lt;p&gt;Their face. Worth it.&lt;/p&gt;

&lt;p&gt;Here's the reality of 2026: &lt;strong&gt;you don't need the internet, you don't need Anthropic's API, and you definitely don't need a $10K GPU cluster.&lt;/strong&gt; You can run production-quality LLMs on your laptop right now, offline, for free.&lt;/p&gt;

&lt;p&gt;This isn't a hobby anymore. It's practical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Seriously. If you're building a startup, running 100 inference requests against Claude costs you real money. Running the same requests locally costs you electricity and disk space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy.&lt;/strong&gt; Everything stays on your machine. Your prompts aren't going to some company's servers. Your data doesn't train their next model. That matters if you're working with sensitive information—medical data, financial models, client projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed.&lt;/strong&gt; Local inference is &lt;em&gt;fast&lt;/em&gt;. No network latency. No queuing. No rate limits. You can iterate at the speed of thought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offline capability.&lt;/strong&gt; Working on a plane? In a rural area with spotty internet? Your LLM doesn't care. It works anywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning.&lt;/strong&gt; Want to understand how LLMs actually work? Running them locally forces you to think about quantization, memory management, token limits—the details that cloud APIs hide from you.&lt;/p&gt;

&lt;p&gt;So here's what we're going to do: you're going to learn how to run real LLMs on your actual laptop, understand what's happening under the hood, and know exactly when to reach for cloud APIs and when to keep it local.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: The Technology Stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  llama.cpp: The Secret Sauce
&lt;/h3&gt;

&lt;p&gt;At the heart of everything is &lt;strong&gt;llama.cpp&lt;/strong&gt;, a pure C/C++ implementation of LLM inference created by Georgi Gerganov. No dependencies. Just raw efficiency.&lt;/p&gt;

&lt;p&gt;What makes it special: it's optimized for consumer hardware. ARM NEON on iOS. Metal on Apple Silicon. AVX/AVX2/AVX512 on x86. Your CPU isn't just supported—it's &lt;em&gt;screaming&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Think of llama.cpp as the production runtime for LLMs. It's fast, it's memory-efficient, and it's the foundation of everything in this post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ollama: The User-Friendly Frontend
&lt;/h3&gt;

&lt;p&gt;Ollama is built on llama.cpp but adds a layer of "just use it." You install Ollama, you run a command, and boom—you've got an LLM server running locally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run mistral
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Ollama pulls a pre-quantized build of the model and wires everything up. You get a chat interface, you get a local API endpoint at &lt;code&gt;localhost:11434&lt;/code&gt;, and you can build on top of it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95gzb17hd54i01yuy6fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95gzb17hd54i01yuy6fo.png" alt="Mermaid Diagram" width="800" height="1013"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  GGUF: The Model Format
&lt;/h3&gt;

&lt;p&gt;GGUF (usually expanded as "GPT-Generated Unified Format") is the successor to GGML, the original llama.cpp model format, and it works great on both CPU and GPU. It's the standard format for quantized LLMs in 2026.&lt;/p&gt;

&lt;p&gt;Why GGUF? It's optimized for loading and inference. It stores metadata efficiently. It supports advanced quantization techniques. And crucially: &lt;strong&gt;45,000+ quantized models exist on Hugging Face Hub right now&lt;/strong&gt;, ready to download and use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: Understanding Quantization
&lt;/h2&gt;

&lt;p&gt;Here's where most people's eyes glaze over. Let's fix that.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's Quantization Actually Doing?
&lt;/h3&gt;

&lt;p&gt;An LLM like Mistral 7B is stored in &lt;strong&gt;FP16&lt;/strong&gt; (16-bit floating point) by default. That means every number in the model takes 16 bits of memory. With 7 billion parameters, that's roughly &lt;strong&gt;14GB&lt;/strong&gt; of VRAM.&lt;/p&gt;

&lt;p&gt;Quantization is &lt;em&gt;approximation with a purpose&lt;/em&gt;. Instead of storing every number precisely, you store it with less precision. A &lt;strong&gt;4-bit&lt;/strong&gt; version of the same model takes ~3.5GB. A &lt;strong&gt;3-bit&lt;/strong&gt; version takes ~2.6GB.&lt;/p&gt;

&lt;p&gt;"But won't it be worse?" you ask.&lt;/p&gt;

&lt;p&gt;Barely. Here's why: neural networks are &lt;em&gt;robust&lt;/em&gt;. Small precision loss doesn't matter much. You lose maybe 5-10% of quality when going from FP16 to Q4_K_M (4-bit). The models are trained on data with noise, so they're used to imperfect inputs.&lt;/p&gt;
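&lt;p&gt;To make that concrete, here's a toy sketch of what 4-bit quantization does to a handful of weights. This is a deliberately simplified symmetric scheme with one scale for the whole tensor; real llama.cpp quant types work per-block with smarter rounding:&lt;/p&gt;

```python
# Toy 4-bit symmetric quantization: map float weights onto the 16 integer
# levels [-8, 7], then map back and measure the error introduced.
# Simplified for illustration; llama.cpp quant types use per-block scales.

def quantize_4bit(weights):
    """Pick one scale for the whole tensor, then round each weight to an int."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -0.07, 0.41, -0.88]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)

max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized ints:", q)
print(f"max round-trip error: {max_err:.3f}")  # bounded by scale/2, about 0.07 here
```

&lt;p&gt;Each weight lands within half a quantization step of its original value. Spread across billions of redundant parameters, that per-weight error mostly washes out.&lt;/p&gt;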

&lt;h3&gt;
  
  
  The Quantization Landscape
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Frunning-llms-on-your-laptop-without-a-10k-gpu%2Fmermaid-f28f267f3a240d204928c2c5ebf35ce7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Frunning-llms-on-your-laptop-without-a-10k-gpu%2Fmermaid-f28f267f3a240d204928c2c5ebf35ce7.png" alt="Mermaid Diagram" width="800" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4_K_M is the sweet spot&lt;/strong&gt; for most people. It's the Goldilocks quantization—balances quality and size perfectly. Your Mistral 7B becomes ~3.5GB, runs on basically any laptop, and you lose almost nothing in quality.&lt;/p&gt;

&lt;p&gt;Need faster inference? Drop to Q3_K_M. Need absolute best quality? Use Q6_K or Q8. But start with Q4_K_M. It's magic.&lt;/p&gt;
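&lt;p&gt;To put rough numbers on that tradeoff, here's an illustrative size calculator. The bits-per-weight values are my approximate averages for each llama.cpp quant type (the K-quants spend extra bits on scales and important layers), so treat the output as ballpark figures:&lt;/p&gt;

```python
# Ballpark on-disk size of a 7B-parameter model at common quantization
# levels. Bits-per-weight are rough averages, not exact per-type figures.

PARAMS = 7e9  # Mistral 7B

approx_bits_per_weight = {
    "FP16":   16.0,
    "Q8_0":    8.5,
    "Q6_K":    6.6,
    "Q4_K_M":  4.8,
    "Q3_K_M":  3.9,
}

sizes_gb = {
    name: PARAMS * bits / 8 / 1e9
    for name, bits in approx_bits_per_weight.items()
}

for name, gb in sizes_gb.items():
    print(f"{name:8s} about {gb:.1f} GB")
```

&lt;p&gt;That's why the Q4_K_M download lands around 4GB: you keep most of the model's quality at roughly a quarter of the FP16 footprint.&lt;/p&gt;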

&lt;h3&gt;
  
  
  How Models End Up Quantized
&lt;/h3&gt;

&lt;p&gt;When you download a model from Hugging Face marked as "GGUF Q4_K_M", someone has already quantized it. Usually it's the community. You download, you use immediately. No extra work.&lt;/p&gt;

&lt;p&gt;If you want to quantize your own model (because you've fine-tuned it, or you found a cool model in FP16), llama.cpp has a &lt;code&gt;quantize&lt;/code&gt; tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-quantize model.gguf model-q4.gguf Q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Takes a few minutes. You now have a quantized version. Use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 3: Running Your First Local LLM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;macOS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Linux:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.ai/install.sh | sh
ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows:&lt;/strong&gt;&lt;br&gt;
Download the installer from ollama.ai. It handles CUDA (NVIDIA) or ROCm (AMD) automatically if you have a supported GPU.&lt;/p&gt;
&lt;h3&gt;
  
  
  Running a Model
&lt;/h3&gt;

&lt;p&gt;Open a new terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run mistral
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ollama downloads the model (~4GB for Q4_K_M Mistral 7B). The first run takes a few minutes while the download completes. Then you get a prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; What's quantization?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Type your question. Get your answer. Offline. Instant. No API key.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using It Programmatically
&lt;/h3&gt;

&lt;p&gt;Ollama starts an API server at &lt;code&gt;http://localhost:11434&lt;/code&gt;. You can hit it like any LLM API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/api/generate &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "mistral",
  "prompt": "explain quantization in one sentence",
  "stream": false
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or from Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is quantization?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. You now have a local LLM API. Build on top of it. Use it in your Next.js app. Whatever.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 4: What Models Should You Actually Use?
&lt;/h2&gt;

&lt;p&gt;We've tested models on MacBook Pro M3, RTX 4090, and Raspberry Pi. Here's what works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Fast Responses (3-5 sec per 100 tokens):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TinyLlama 1.1B&lt;/strong&gt; — Surprisingly capable. Fast. Good for classification tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phi-4-mini 3.8B&lt;/strong&gt; — Microsoft's breakthrough. GPT-3.5-class reasoning from 3.8B parameters. Insane.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mistral 7B&lt;/strong&gt; — Still the king. Balanced. Good at everything. 7 seconds per 100 tokens on M3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Quality (10-15 sec per 100 tokens):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mixtral 8x7B&lt;/strong&gt; — Mistral's sparse MoE model. If your machine can handle it (32GB+ RAM). Genuinely good.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Llama 3.1 8B&lt;/strong&gt; — Rock solid. Open weights. Well-supported.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen 2.5 14B&lt;/strong&gt; — From Alibaba. Excellent multilingual support. Great reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Code Generation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Codestral&lt;/strong&gt; — Mistral's purpose-built coding model. Better than base Mistral for programming tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek Coder 6.7B&lt;/strong&gt; — Fast. Surprisingly good at complex code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can run any of these right now, today, on your laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 5: Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Not understanding memory consumption&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model file size isn't your memory usage. A 7B model in Q4_K_M is ~3.5GB on disk, but it uses ~8-10GB RAM when running. Your operating system needs RAM too. You need headroom. If your laptop has 16GB total RAM, you can comfortably run a 7B model. If you have 8GB, stick to 3-4B models.&lt;/p&gt;
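&lt;p&gt;A quick sanity check before you download anything — a rough RAM estimate, assuming ~0.6 bytes per parameter for Q4_K_M and a few GB of runtime/KV-cache overhead (both numbers are ballpark assumptions; actuals vary by runtime and context length):&lt;br&gt;
&lt;/p&gt;

```python
def estimate_ram_gb(params_billions, bytes_per_param=0.6, overhead_gb=3.0):
    """Rough RAM estimate for a quantized model.

    bytes_per_param: ~0.6 for Q4_K_M, ~1.0 for Q8_0, 2.0 for FP16 (assumption).
    overhead_gb: runtime + KV cache headroom (rough guess).
    """
    return params_billions * bytes_per_param + overhead_gb

# A 7B model at Q4_K_M lands around 7GB in use -- fine on 16GB, tight on 8GB.
print(f"{estimate_ram_gb(7):.1f} GB")
```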

&lt;p&gt;&lt;strong&gt;Mistake 2: Not using the right quantization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"I want the best quality" so you download Q8_K. Then it's slow and uses tons of RAM. Try Q4_K_M first. Measure the quality. Only upgrade if you need to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 3: Forgetting about context window&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs have a maximum context length. Mistral 7B has 32k tokens. You can't feed it a 100,000 token document and expect it to process all of it. Use summarization or retrieval-augmented generation (RAG) to feed the model only relevant context.&lt;/p&gt;
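&lt;p&gt;If you just need a guard rail, the classic heuristic is ~4 characters per token (an approximation — use a real tokenizer for exact counts). A minimal truncation sketch:&lt;br&gt;
&lt;/p&gt;

```python
def truncate_to_context(text, max_tokens=32_000, chars_per_token=4):
    """Crude truncation using the ~4 chars/token heuristic (assumption;
    use a real tokenizer for exact counts)."""
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    return text[:max_chars]

doc = "x" * 200_000
print(len(truncate_to_context(doc)))  # capped at 128,000 chars (~32k tokens)
```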

&lt;p&gt;&lt;strong&gt;Mistake 4: Not batch-processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Need to run inference on 1,000 prompts? Don't do it one at a time. Batch them. Local inference is fast—batch processing makes it even faster.&lt;/p&gt;
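&lt;p&gt;A minimal batching sketch using a thread pool. The &lt;code&gt;worker&lt;/code&gt; argument is a stand-in — plug in the &lt;code&gt;requests.post&lt;/code&gt; call from earlier:&lt;br&gt;
&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(prompts, worker, max_workers=4):
    """Run worker(prompt) over all prompts concurrently,
    preserving input order in the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, prompts))

# Example with a stand-in worker; replace with a real API call.
results = run_batch(["a", "b", "c"], worker=lambda p: p.upper())
print(results)  # ['A', 'B', 'C']
```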

&lt;p&gt;&lt;strong&gt;Mistake 5: Ignoring GPU options&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you have an NVIDIA GPU, tell Ollama. It'll use CUDA and be 5-10x faster. Ollama auto-detects, but verify it's using your GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama list  &lt;span class="c"&gt;# Shows GPU allocation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Part 6: When to Stay Local vs. When to Go Cloud
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stay Local:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Development &amp;amp; iteration (you're paying per token on cloud)&lt;/li&gt;
&lt;li&gt;Privacy-sensitive work (medical, financial, proprietary)&lt;/li&gt;
&lt;li&gt;Offline applications&lt;/li&gt;
&lt;li&gt;Running experiments (you control when you pay)&lt;/li&gt;
&lt;li&gt;Building local features that don't scale globally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Go Cloud:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production serving thousands of users (you need enterprise scaling)&lt;/li&gt;
&lt;li&gt;Advanced reasoning (Claude Opus / GPT-4 are still better)&lt;/li&gt;
&lt;li&gt;When latency doesn't matter (offline batch jobs via a provider's batch API)&lt;/li&gt;
&lt;li&gt;Prototyping new capabilities (test expensive models cheaply)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future is hybrid. Run Mistral locally for 95% of your tasks. Send hard problems to Claude. Save money. Get better results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Download Ollama&lt;/strong&gt; from ollama.ai. 5 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run &lt;code&gt;ollama run mistral&lt;/code&gt;&lt;/strong&gt; in your terminal. Try talking to it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read your laptop's specs.&lt;/strong&gt; How much RAM? Do you have a GPU? This determines what models you can run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build something.&lt;/strong&gt; Query the local API from a script. Create a simple chat interface. Add RAG with local embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark.&lt;/strong&gt; Time a cloud API call vs. local inference. See the cost difference over a month.&lt;/li&gt;
&lt;/ol&gt;
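&lt;p&gt;For the benchmarking step, a back-of-the-envelope cost comparison is enough to make the point (the per-token price below is an illustrative assumption, not any provider's current rate):&lt;br&gt;
&lt;/p&gt;

```python
def monthly_cloud_cost(tokens_per_day, price_per_million=2.0):
    """Illustrative monthly API spend; price_per_million is an
    assumption -- check your provider's current pricing."""
    return tokens_per_day * 30 * price_per_million / 1_000_000

# 500k tokens/day at a hypothetical $2 per million tokens:
print(f"${monthly_cloud_cost(500_000):.2f}/month vs $0 locally")
```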

&lt;p&gt;If you're building anything AI-powered—and in 2026, what isn't?—running models locally is a superpower you actually have access to right now.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sign-Off
&lt;/h2&gt;

&lt;p&gt;Remember that friend who was bleeding money on API tokens? You can be the one who whispers: "I just spun up a 7B model on my laptop. What were you saying about API costs?"&lt;/p&gt;

&lt;p&gt;It's not some future thing. It's today. Your actual laptop. Right now.&lt;/p&gt;

&lt;p&gt;Go run something.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;GitHub - ggml-org/llama.cpp: LLM inference in C/C++&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dasroot.net/posts/2026/01/local-llm-deployment-ollama-llama.cpp/" rel="noopener noreferrer"&gt;Local LLM Deployment with Ollama and llama.cpp: A Comprehensive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sitepoint.com/definitive-guide-local-llms-2026-privacy-tools-hardware/" rel="noopener noreferrer"&gt;Guide to Local LLMs in 2026: Privacy, Tools &amp;amp; Hardware&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://localllm.in/blog/quantization-explained" rel="noopener noreferrer"&gt;The Complete Guide to LLM Quantization | LocalLLM.in&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Author: thousandmiles-ai-admin&lt;/p&gt;

</description>
      <category>learning</category>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>llms</category>
    </item>
    <item>
      <title>What Are Reasoning Models and Why Do They Think Before Answering?</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:32:29 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/what-are-reasoning-models-and-why-do-they-think-before-answering-2fo</link>
      <guid>https://dev.to/thousand_miles_ai/what-are-reasoning-models-and-why-do-they-think-before-answering-2fo</guid>
      <description>&lt;p&gt;o1, o3, DeepSeek R1 — a new breed of LLMs that literally pause to think. But what does 'thinking' mean for a model? Inside thinking tokens, chain-of-thought training, and why this changes everything about how LLMs solve problems.&lt;/p&gt;




&lt;h1&gt;
  
  
  What Are Reasoning Models and Why Do They Think Before Answering?
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Regular LLMs blurt out answers. Reasoning models stop, think, check their work, and then answer. The difference is bigger than you'd expect.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Model That Argued With Itself
&lt;/h2&gt;

&lt;p&gt;Here's something wild. If you give DeepSeek R1 a tricky math problem and watch its thinking process (which it shows you, unlike most models), you'll see something that looks almost... human. It starts with an approach. Gets halfway through. Realizes something doesn't add up. Literally writes "Wait, that's not right" to itself. Backtracks. Tries a different approach. Checks the answer. Then gives you the final result.&lt;/p&gt;

&lt;p&gt;It's not performing for you. These are internal reasoning tokens — the model's scratch pad. Some models hide this thinking process. R1 shows it to you in full. And it's genuinely fascinating to watch a model second-guess itself, catch its own errors, and course-correct.&lt;/p&gt;

&lt;p&gt;This is what makes reasoning models different from everything that came before. Standard LLMs generate answers one token at a time, left to right, committing to each word as they go. They don't plan ahead. They don't check their work. Reasoning models add a phase before the answer where they think through the problem step by step — and that simple addition dramatically improves performance on math, coding, logic, and scientific reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;Two reasons. First, reasoning models are quickly becoming the go-to choice for any task that requires multi-step logic — coding, data analysis, math, complex question answering. If you're building AI-powered tools, knowing when to use a reasoning model versus a standard one is a practical skill.&lt;/p&gt;

&lt;p&gt;Second, the techniques behind reasoning models — chain-of-thought training, reinforcement learning without human supervision, knowledge distillation — represent a genuine shift in how AI research works. Understanding these concepts puts you ahead of the curve, whether for interviews, research, or building your own systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Me Back Up — What's Actually Different?
&lt;/h2&gt;

&lt;p&gt;Regular LLMs (GPT-4, Claude Sonnet, Gemini) take your input and generate output directly. Ask a math question, and the model starts writing the answer immediately. It's fast, but it's also impulsive — the model commits to its first approach without considering alternatives.&lt;/p&gt;

&lt;p&gt;Reasoning models add an intermediate step: a thinking phase where the model generates chain-of-thought tokens before producing the final answer. Think of it as the difference between a student who immediately scribbles an answer on an exam versus one who works through the problem on scratch paper first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fwhat-are-reasoning-models-and-why-do-they-think-before-answering%2Fmermaid-dc1b470c4aa4a65ae89f0f7493bd9eb0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fwhat-are-reasoning-models-and-why-do-they-think-before-answering%2Fmermaid-dc1b470c4aa4a65ae89f0f7493bd9eb0.png" alt="Mermaid Diagram" width="800" height="2333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Standard LLMs go straight to the answer. Reasoning models generate thinking tokens first, then verify, then answer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The key insight that made this work: &lt;strong&gt;more thinking tokens = better answers.&lt;/strong&gt; Researchers found that performance improves predictably with the number of tokens the model spends "thinking." This is called test-time compute scaling — spending more computation during inference (when the model generates a response) rather than only during training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Okay, But What Does "Thinking" Actually Look Like?
&lt;/h2&gt;

&lt;p&gt;Let's make this concrete. When DeepSeek R1 encounters a question, it wraps its reasoning in &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags. Here's a simplified version of what that looks like internally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: What's the sum of the first 100 prime numbers?

&amp;lt;think&amp;gt;
I need to find the first 100 prime numbers and add them up.
Let me start listing them: 2, 3, 5, 7, 11, 13...
Wait, instead of listing all 100, let me recall that the 100th prime is 541.
The sum of the first 100 primes is a known value: 24,133.
Let me verify - the first few: 2+3+5+7+11 = 28. That checks out with known partial sums.
I'm confident the answer is 24,133.
&amp;lt;/think&amp;gt;

The sum of the first 100 prime numbers is 24,133.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what's happening: the model explores an approach (listing primes), reconsiders (too slow), pivots to a more efficient strategy, verifies its answer against known data, and then produces the final output. This self-correction during generation is what makes reasoning models special.&lt;/p&gt;

&lt;p&gt;With OpenAI's o1 and o3, you see something similar — but they use a "private chain of thought." The model still thinks through the problem, but you only see a summarized version. DeepSeek R1's full transparency is one of its distinguishing features.&lt;/p&gt;
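&lt;p&gt;If you consume R1-style raw output programmatically, splitting the reasoning from the answer is a one-regex job (assuming the &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; convention shown above):&lt;br&gt;
&lt;/p&gt;

```python
import re

def split_reasoning(raw_output):
    """Split an R1-style response into (thinking, answer).
    Assumes the <think>...</think> convention shown above."""
    match = re.search(r"<think>(.*?)</think>", raw_output, re.DOTALL)
    if not match:
        return "", raw_output.strip()
    thinking = match.group(1).strip()
    answer = raw_output[match.end():].strip()
    return thinking, answer

raw = "<think>Let me check... 2+2=4.</think>\nThe answer is 4."
thinking, answer = split_reasoning(raw)
print(answer)  # The answer is 4.
```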

&lt;h2&gt;
  
  
  How They're Built — The Training Behind Reasoning
&lt;/h2&gt;

&lt;p&gt;Here's where it gets technically interesting. There are two main approaches to building reasoning models, and they reveal very different philosophies.&lt;/p&gt;

&lt;h3&gt;
  
  
  The OpenAI Approach: Reinforcement Learning on Curated Data
&lt;/h3&gt;

&lt;p&gt;OpenAI hasn't published full details on o1/o3's training, but the broad strokes are known. They use reinforcement learning (RL) to train the model to produce better chain-of-thought reasoning. The model generates reasoning traces, those traces are evaluated (did they lead to correct answers?), and the model is rewarded for reasoning patterns that work.&lt;/p&gt;

&lt;p&gt;The reasoning process is private — OpenAI chose not to expose the raw thinking tokens. You see a summary of the reasoning, not the full internal monologue. This is a deliberate design choice, likely for both user experience and competitive reasons.&lt;/p&gt;

&lt;h3&gt;
  
  
  The DeepSeek Approach: RL from Scratch
&lt;/h3&gt;

&lt;p&gt;DeepSeek took a bolder path. They published their full methodology, and it's remarkable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 — R1-Zero (pure RL, no human examples):&lt;/strong&gt; They took a base model and applied reinforcement learning directly, without any human-written chain-of-thought examples. They just rewarded the model for getting correct answers and penalized it for wrong ones. The model discovered chain-of-thought reasoning on its own.&lt;/p&gt;

&lt;p&gt;This is the mind-blowing part: nobody taught R1-Zero to "think step by step." It learned that writing out intermediate reasoning led to better rewards. It independently developed self-verification — checking its own work. It even had what the researchers called an "aha moment," where it suddenly started using the word "Wait" during its reasoning, marking a distinct shift to more self-reflective thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 — Polishing:&lt;/strong&gt; R1-Zero worked but had issues — repetitive reasoning, language mixing, poor readability. So they added supervised fine-tuning with curated examples, followed by another round of RL for human preference alignment. This produced the final DeepSeek R1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fwhat-are-reasoning-models-and-why-do-they-think-before-answering%2Fmermaid-3cc295e0dc3a599b60b6ef36c222d7f9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fwhat-are-reasoning-models-and-why-do-they-think-before-answering%2Fmermaid-3cc295e0dc3a599b60b6ef36c222d7f9.png" alt="Mermaid Diagram" width="800" height="3045"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;DeepSeek R1's training pipeline: pure RL discovers reasoning, supervised training polishes it, distillation spreads it to smaller models.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Distillation Trick
&lt;/h3&gt;

&lt;p&gt;One of DeepSeek's most impactful contributions: they showed you can take the reasoning patterns learned by a massive 671B parameter model and distill them into much smaller models (1.5B to 70B parameters). These distilled models perform remarkably well — a 14B distilled model can outperform many full-sized models on reasoning benchmarks.&lt;/p&gt;

&lt;p&gt;This means you don't need a massive model to get reasoning capabilities. The thinking patterns are transferable. That's huge for students and developers working with limited resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Difference: Dense vs. Sparse
&lt;/h2&gt;

&lt;p&gt;There's an interesting architectural split between the major reasoning models.&lt;/p&gt;

&lt;p&gt;OpenAI's o3 uses a &lt;strong&gt;dense transformer&lt;/strong&gt; — all parameters are active for every token. This is computationally expensive but straightforward.&lt;/p&gt;

&lt;p&gt;DeepSeek R1 uses a &lt;strong&gt;Mixture-of-Experts (MoE)&lt;/strong&gt; architecture. Of its 671 billion total parameters, only about 37 billion activate for any given token. The rest sit idle. It's like having a team of 20 specialists, but only sending 2–3 of them to handle each task. This makes R1 dramatically cheaper to run despite having more total parameters.&lt;/p&gt;
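&lt;p&gt;The routing idea fits in a few lines. A toy sketch with made-up numbers — real routers are learned layers scoring high-dimensional hidden states, and R1 routes among hundreds of experts per layer:&lt;br&gt;
&lt;/p&gt;

```python
import math

def top_k_experts(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their
    scores into routing weights (softmax over the selected k)."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exp_scores = [math.exp(router_logits[i]) for i in top]
    total = sum(exp_scores)
    return [(i, s / total) for i, s in zip(top, exp_scores)]

# 8 experts, only 2 activate for this token:
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
print(top_k_experts(logits))  # experts 1 and 4 carry this token
```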

&lt;h2&gt;
  
  
  Mistakes That Bite — Common Misunderstandings
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Reasoning models are always better."&lt;/strong&gt; Not true. For simple tasks — quick Q&amp;amp;A, summarization, casual conversation — standard models are faster, cheaper, and equally accurate. Reasoning models shine on complex, multi-step problems. Using o3 to answer "What's the capital of France?" is like hiring a math PhD to calculate a restaurant tip.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"More thinking tokens always helps."&lt;/strong&gt; There's a point of diminishing returns. Some problems don't benefit from more thinking — the model just generates redundant reasoning that wastes tokens and money. o3-mini offers three reasoning levels (low, medium, high) for exactly this reason: match the thinking effort to the problem difficulty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"The thinking tokens are just the model talking to itself."&lt;/strong&gt; It's more structured than that. The thinking phase includes specific learned behaviors: problem decomposition, hypothesis generation, self-verification, and error correction. These aren't random ruminations — they're patterns the model learned lead to correct answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now Go Break Something
&lt;/h2&gt;

&lt;p&gt;Want to experience the difference firsthand?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Try DeepSeek R1&lt;/strong&gt; through their API or web interface — it shows the full thinking process. Give it a tricky logic puzzle and watch it reason through it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare with a standard model&lt;/strong&gt; on the same problem. Ask GPT-4 or Claude a multi-step math problem, then ask R1. Compare the reasoning quality and accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explore distilled versions.&lt;/strong&gt; The DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Llama-8B models are on Hugging Face. You can run these locally and get reasoning capabilities on your own machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read the DeepSeek R1 paper&lt;/strong&gt; — search for "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." It's published on arXiv and is one of the most accessible AI research papers of 2025.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search for "The Illustrated DeepSeek-R1" by Jay Alammar&lt;/strong&gt; — he does visual breakdowns of AI architectures that are incredibly beginner-friendly.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Remember that model arguing with itself — writing "Wait, that's not right" and backtracking mid-thought? That's not a gimmick. It's the result of a model that learned, through pure reinforcement, that slowing down and checking its work leads to better answers. Reasoning models don't know more than standard LLMs. They just take a breath before answering. And that breath — those thinking tokens — turns out to be one of the most powerful improvements in language model history.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Author: Shibin&lt;/p&gt;

</description>
      <category>learning</category>
      <category>ai</category>
    </item>
    <item>
      <title>The 'Lost in the Middle' Problem — Why LLMs Ignore the Middle of Your Context Window</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:32:02 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/the-lost-in-the-middle-problem-why-llms-ignore-the-middle-of-your-context-window-3al2</link>
      <guid>https://dev.to/thousand_miles_ai/the-lost-in-the-middle-problem-why-llms-ignore-the-middle-of-your-context-window-3al2</guid>
      <description>&lt;p&gt;You stuffed all the right documents into the prompt. The LLM still got the answer wrong. Turns out, language models have a blind spot — and it's right in the middle. Here's the research behind it and what you can do.&lt;/p&gt;




&lt;h1&gt;
  
  
  The "Lost in the Middle" Problem — Why LLMs Ignore the Middle of Your Context Window
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Your LLM has a 128K context window. It can read a novel in one go. But it still misses the one paragraph that matters — because it was in the middle.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Perfect Retrieval That Still Failed
&lt;/h2&gt;

&lt;p&gt;Here's a scenario that frustrates every RAG developer at some point. You've built a solid pipeline. Your retriever returns five relevant chunks, ranked by relevance. The correct answer is sitting right there — chunk #3, smack in the middle of the context. You've done everything right.&lt;/p&gt;

&lt;p&gt;The LLM reads all five chunks, generates a confident response, and... gets it wrong. It pulled information from chunk #1 and chunk #5, blended them together, and produced something that sounds plausible but misses the actual answer. The evidence was right in front of it. It just didn't look at it carefully enough.&lt;/p&gt;

&lt;p&gt;You're not imagining this. It has a name: the "lost in the middle" problem. And it's backed by one of the most cited papers in LLM research from 2023, with follow-up work from MIT in 2025 that finally explained &lt;em&gt;why&lt;/em&gt; it happens at an architectural level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;If you're building anything that puts multiple pieces of information into an LLM's context — RAG systems, multi-document summarization, long-form analysis — this bias directly affects your output quality. And the bigger your context window, the worse it can get.&lt;/p&gt;

&lt;p&gt;This is also the kind of research-backed knowledge that separates strong candidates in AI interviews. Anyone can explain what attention is. Explaining &lt;em&gt;why attention is systematically biased by position&lt;/em&gt; and what to do about it — that's a different level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Me Back Up — What the Research Found
&lt;/h2&gt;

&lt;p&gt;In 2023, researchers from Stanford, UC Berkeley, and Samaya AI published a paper titled "Lost in the Middle" that tested how well LLMs use information at different positions in their context window. They ran a simple experiment: give the model a set of documents where only one contains the answer, and vary where that document appears — beginning, middle, or end.&lt;/p&gt;

&lt;p&gt;The results showed a clear U-shaped performance curve. When the relevant document was at the very beginning of the context, accuracy was high. When it was at the very end, accuracy was also high. But when it was in the middle? Accuracy dropped — sometimes dramatically.&lt;/p&gt;

&lt;p&gt;This wasn't a quirk of one model. They tested multiple LLMs across different architectures and sizes, and the pattern held consistently. Language models pay the most attention to the beginning and end of their context, and systematically under-attend to the middle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fthe-lost-in-the-middle-problem-why-llms-ignore-the-middle-of-your-context-window%2Fmermaid-080f356d072821389b39f752ae1d91ca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fthe-lost-in-the-middle-problem-why-llms-ignore-the-middle-of-your-context-window%2Fmermaid-080f356d072821389b39f752ae1d91ca.png" alt="Mermaid Diagram" width="800" height="2628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The U-shaped attention curve: LLMs attend strongly to the beginning and end of context, with a blind spot in the middle.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Okay, But Why? — The Architecture Behind the Bias
&lt;/h2&gt;

&lt;p&gt;For two years after the original paper, the "why" was unclear. People noticed the pattern but couldn't pinpoint the cause. Was it training data? Model size? Prompt format?&lt;/p&gt;

&lt;p&gt;In 2025, MIT researchers cracked it open. They identified two architectural causes:&lt;/p&gt;

&lt;h3&gt;
  
  
  Cause 1: Causal Attention Masking
&lt;/h3&gt;

&lt;p&gt;Transformer models use something called causal masking in their attention mechanism. This means each token can only attend to tokens that came before it — not after. It's how the model generates text left-to-right.&lt;/p&gt;

&lt;p&gt;Here's the subtle problem: tokens at the beginning of the context get attended to by every subsequent token. Token #1 is visible to token #2, #3, #4... all the way to the end. Token #500, sitting in the middle, is only visible to tokens #501 onward. This means earlier tokens accumulate more "attention weight" across the model, simply because they have more opportunities to be attended to.&lt;/p&gt;

&lt;p&gt;It's not that the model decides the beginning is more important. The architecture makes it structurally easier to attend to earlier tokens. The bias is baked into the attention mask itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cause 2: Positional Encoding Decay
&lt;/h3&gt;

&lt;p&gt;Modern LLMs use positional encodings — typically Rotary Position Embedding (RoPE) — to give the model a sense of token order. RoPE introduces a distance-based decay: tokens that are far apart have their attention scores naturally reduced.&lt;/p&gt;

&lt;p&gt;For tokens at the end of the context (where the model generates its response), nearby tokens (also at the end) have strong attention signals, and very early tokens also maintain attention through a mechanism called "attention sinks." But middle tokens? They're too far from the beginning to benefit from the primacy effect and too far from the end to benefit from recency. They're in a dead zone.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Human Parallel
&lt;/h3&gt;

&lt;p&gt;Here's what makes this even more interesting: this mirrors a well-known phenomenon in human psychology called the &lt;strong&gt;serial position effect&lt;/strong&gt;. When people are asked to remember a list of items, they recall the first items (primacy effect) and the last items (recency effect) much better than items in the middle.&lt;/p&gt;

&lt;p&gt;LLMs weren't designed to mimic human memory. But through the architecture of attention mechanisms and training on human-generated text, they've developed a strikingly similar bias. Whether this is a bug or a feature of learning from human data is still debated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1u9m63t8aoifcg53sux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1u9m63t8aoifcg53sux.png" alt="Mermaid Diagram" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Three contributing factors: structural attention bias, positional encoding decay, and training data patterns.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Can You Actually Do About It?
&lt;/h2&gt;

&lt;p&gt;Knowing the problem is half the battle. Here are practical mitigations that work in production systems:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Strategic Document Ordering
&lt;/h3&gt;

&lt;p&gt;The simplest fix: don't put your most important information in the middle. In RAG systems, place your highest-confidence retrieved documents at the beginning and end of the context. Put lower-ranked documents in the middle. You're not fighting the bias — you're working with it.&lt;/p&gt;

&lt;p&gt;Specifically: if you retrieve 5 chunks ranked by relevance, arrange them as [rank 1, rank 4, rank 5, rank 3, rank 2] — best at the start, second-best at the end, least important in the middle.&lt;/p&gt;
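&lt;p&gt;Here's a minimal Python sketch of that ordering, assuming your retriever hands you chunks sorted best-first:&lt;/p&gt;

```python
def u_shape(chunks):
    """Reorder relevance-ranked chunks (best first) into a U shape:
    the best chunk opens the context, the second-best closes it, the
    third-best sits second-to-last, and the weakest land in the middle."""
    if len(chunks) <= 2:
        return list(chunks)
    first, second, third, *rest = chunks  # rest = ranks 4..n
    return [first] + rest + [third, second]

# Five chunks ranked 1 (best) to 5 (worst):
print(u_shape([1, 2, 3, 4, 5]))  # [1, 4, 5, 3, 2]
```

&lt;p&gt;The same helper works for any number of chunks; with three chunks it yields [1, 3, 2].&lt;/p&gt;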

&lt;h3&gt;
  
  
  2. Reduce the Number of Retrieved Documents
&lt;/h3&gt;

&lt;p&gt;More context doesn't always mean better answers. If you're retrieving 20 chunks when 5 would suffice, you're creating more middle ground for information to get lost in. Be surgical: use a reranker to select the top 3–5 most relevant chunks and discard the rest. Less noise means less middle to ignore.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Prompt Compression
&lt;/h3&gt;

&lt;p&gt;Instead of dumping raw chunks into the context, compress them first. Extract only the sentences or facts that are relevant to the query and assemble a tighter, shorter context. When there's less total content, there's less of a middle for information to hide in.&lt;/p&gt;
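&lt;p&gt;A crude but illustrative version of this in plain Python. Production systems use a trained compressor or an extra LLM pass, but simple keyword overlap shows the shape of the idea:&lt;/p&gt;

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "for", "and", "or", "what"}

def content_words(text):
    """Lowercased alphanumeric tokens minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def compress_chunks(chunks, query, min_overlap=1):
    """Keep only sentences sharing at least `min_overlap` content
    words with the query, then reassemble a tighter context."""
    q_terms = content_words(query)
    kept = []
    for chunk in chunks:
        for sent in re.split(r"(?<=[.!?])\s+", chunk):
            if len(content_words(sent) & q_terms) >= min_overlap:
                kept.append(sent.strip())
    return " ".join(kept)
```

&lt;p&gt;Given the chunk "The refund policy allows returns within 30 days. Our office is in Berlin." and the query "What is the refund policy?", only the first sentence survives the compression.&lt;/p&gt;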

&lt;h3&gt;
  
  
  4. Explicit Instruction
&lt;/h3&gt;

&lt;p&gt;Sometimes the blunt approach works: tell the model to pay attention to all parts of the context. Prompts like "Carefully consider ALL of the provided documents, especially documents that appear in the middle" can measurably reduce the bias. It doesn't eliminate it, but it helps.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Multi-Pass Extraction
&lt;/h3&gt;

&lt;p&gt;For critical applications, run multiple passes. First pass: ask the model to extract relevant facts from each document independently. Second pass: ask it to synthesize those facts into an answer. By processing documents individually first, you avoid the position bias entirely — each document gets the model's full attention.&lt;/p&gt;
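&lt;p&gt;A sketch of the two-pass pattern, with &lt;code&gt;llm&lt;/code&gt; as a placeholder callable (a hypothetical stand-in; wire it to whatever model client you actually use):&lt;/p&gt;

```python
def two_pass_answer(documents, question, llm):
    """Two-pass extraction. `llm` is any callable prompt -> str.
    Pass 1 reads each document alone, so no document ever sits in
    the middle of a long context. Pass 2 synthesizes the answer."""
    facts = []
    for i, doc in enumerate(documents):
        extracted = llm(
            "Extract only the facts relevant to the question.\n"
            f"Question: {question}\nDocument:\n{doc}\n"
            "Reply with 'NONE' if nothing is relevant."
        )
        if extracted.strip().upper() != "NONE":
            facts.append(f"[doc {i + 1}] {extracted.strip()}")
    return llm(
        "Answer the question using only these extracted facts.\n"
        f"Question: {question}\nFacts:\n" + "\n".join(facts)
    )
```

&lt;p&gt;The trade-off is cost: N documents mean N+1 model calls instead of one, so reserve this for answers you really can't afford to get wrong.&lt;/p&gt;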

&lt;h2&gt;
  
  
  Mistakes That Bite
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Bigger context windows solve this."&lt;/strong&gt; They don't. The 2023 paper showed the U-curve exists even in models with context windows of 4K, 16K, and 32K tokens. Research from 2025 confirmed it persists in models with 128K+ windows. Bigger windows mean more middle, which means more room for information to get lost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"This only matters for RAG."&lt;/strong&gt; It affects any task that puts multiple pieces of information into the context — summarization, question answering over multiple documents, multi-turn conversations where important information was mentioned 20 messages ago. If you're using more than a few hundred tokens of context, this bias applies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Newer models have fixed this."&lt;/strong&gt; Some improvements have been made. Techniques like Multi-scale Positional Encoding (Ms-PoE) and attention calibration can reduce the bias without retraining. But as of 2026, no production model has fully eliminated position bias. It's structural to how transformers work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now Go Break Something
&lt;/h2&gt;

&lt;p&gt;Want to see this bias for yourself? Here's a simple experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a list of 10 facts. Embed the answer to a specific question as fact #5 (the middle).&lt;/li&gt;
&lt;li&gt;Ask the LLM the question with all 10 facts in context. Note the answer.&lt;/li&gt;
&lt;li&gt;Move the answer to fact #1. Ask again. Move it to fact #10. Ask again.&lt;/li&gt;
&lt;li&gt;Compare the accuracy across positions.&lt;/li&gt;
&lt;/ul&gt;
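&lt;p&gt;You can script the prompt construction so only the needle's position changes between runs (a minimal sketch; the filler facts and the needle are made up for the experiment):&lt;/p&gt;

```python
def build_probe(position, n_facts=10):
    """Build a needle-in-facts prompt with the answer fact placed
    at `position` (1-based) among generic filler facts."""
    needle = "The access code for the vault is 7491."
    facts = [f"Fact {i}: The city of Zone-{i} has {i * 11} parks."
             for i in range(1, n_facts + 1)]
    facts[position - 1] = f"Fact {position}: {needle}"
    return ("\n".join(facts)
            + "\n\nQuestion: What is the access code for the vault?")

# Same question, needle at the start, middle, and end; send each
# prompt to your model several times and compare accuracy.
prompts = {pos: build_probe(pos) for pos in (1, 5, 10)}
```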

&lt;p&gt;For deeper exploration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the original paper: search for "Lost in the Middle: How Language Models Use Long Contexts" by Liu et al.&lt;/li&gt;
&lt;li&gt;Check out the MIT follow-up from 2025 that explains the causal masking mechanism — search for "Unpacking the bias of large language models MIT"&lt;/li&gt;
&lt;li&gt;Search for "Found in the Middle calibration" — this paper proposes a calibration method that reduces position bias without retraining&lt;/li&gt;
&lt;li&gt;Explore Ms-PoE (Multi-scale Positional Encoding) — a plug-and-play approach that improves middle-context utilization&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Your RAG system retrieved five perfect chunks. The answer was in chunk #3. The LLM read chunk #1 carefully, skimmed chunks #2 through #4, and paid close attention to chunk #5. It's not carelessness — it's architecture. Causal masking and positional encodings create a structural blind spot in the middle. Once you know it's there, you can design around it: reorder your documents, slim down your context, and stop trusting that more tokens always means better answers.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Author: thousandmiles-ai-admin&lt;/p&gt;

</description>
      <category>learning</category>
      <category>ai</category>
    </item>
    <item>
      <title>LoRA and QLoRA Explained — Fine-Tune LLMs Without Selling Your Kidney for GPUs</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:31:58 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/lora-and-qlora-explained-fine-tune-llms-without-selling-your-kidney-for-gpus-4a09</link>
      <guid>https://dev.to/thousand_miles_ai/lora-and-qlora-explained-fine-tune-llms-without-selling-your-kidney-for-gpus-4a09</guid>
      <description>&lt;p&gt;Full fine-tuning a 7B model needs 4x A100 GPUs. You have a free Colab notebook with 15GB of RAM. Game over? Not even close. LoRA and QLoRA let you fine-tune billion-parameter models on hardware you already have. Here's how they actually work.&lt;/p&gt;





&lt;h2&gt;
  
  
  The Problem We All Face
&lt;/h2&gt;

&lt;p&gt;Imagine this: You just found the perfect dataset to fine-tune an LLM. Something domain-specific. Something that would make your startup, research project, or college assignment actually work. You Google "how to fine-tune Llama 2 7B" with excitement.&lt;/p&gt;

&lt;p&gt;Five minutes later, you're staring at this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You'll need approximately 100-120 GB of VRAM. An A100 GPU costs $2-3 per hour. Full fine-tuning takes 20 hours minimum."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You check your resources. Free Google Colab. 15GB RAM. T4 GPU.&lt;/p&gt;

&lt;p&gt;Your dreams are crushed. Or are they?&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;LoRA&lt;/strong&gt; and &lt;strong&gt;QLoRA&lt;/strong&gt; — the techniques that say "nope, not today" to expensive GPU clusters and let you fine-tune GPT-scale models on hardware you already have.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Fine-Tuning Actually Matters
&lt;/h2&gt;

&lt;p&gt;Before we dive into the magic, let's talk about why fine-tuning is worth the trouble.&lt;/p&gt;

&lt;p&gt;Pre-trained LLMs are generalists. They're good at everything because they learned from everything. But "good at everything" often means "perfect for nothing." If you want an LLM that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writes technical documentation in &lt;em&gt;your&lt;/em&gt; specific code style&lt;/li&gt;
&lt;li&gt;Understands domain-specific jargon in medical, legal, or financial contexts&lt;/li&gt;
&lt;li&gt;Responds in a particular tone or personality&lt;/li&gt;
&lt;li&gt;Handles edge cases unique to your problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...you need fine-tuning. It's the bridge between "generic chatbot" and "actually useful for my specific task."&lt;/p&gt;

&lt;p&gt;But full fine-tuning? That's expensive. Really expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Wrong with Full Fine-Tuning?
&lt;/h2&gt;

&lt;p&gt;During full fine-tuning, you update &lt;strong&gt;every single weight&lt;/strong&gt; in the model. For a 7-billion parameter model, that's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;7 billion trainable parameters&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Each parameter needs gradients stored during backpropagation&lt;/li&gt;
&lt;li&gt;GPU memory = (model weights) + (gradients) + (optimizer states) + (activations for the batch)&lt;/li&gt;
&lt;li&gt;Result: ~100-120 GB of VRAM needed&lt;/li&gt;
&lt;/ul&gt;
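&lt;p&gt;The arithmetic behind that estimate, assuming fp32 weights and Adam's two optimizer moments. The activation allowance is a rough guess that varies with batch size and sequence length:&lt;/p&gt;

```python
def full_finetune_vram_gb(n_params, bytes_weights=4, bytes_grads=4,
                          bytes_optim=8, activation_gb=10):
    """Back-of-the-envelope VRAM for full fine-tuning with Adam in
    fp32: weights + gradients + two optimizer moments per parameter,
    plus a flat allowance for activations (the 'batch' term)."""
    per_param_bytes = bytes_weights + bytes_grads + bytes_optim  # 16
    return n_params * per_param_bytes / 1e9 + activation_gb

print(round(full_finetune_vram_gb(7e9)))  # 122 (GB, for a 7B model)
```

&lt;p&gt;Mixed-precision training changes the per-parameter constants, but this fp32 worst case is where the ~100-120 GB figure comes from.&lt;/p&gt;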

&lt;p&gt;A single A100 GPU (the workhorse of AI) costs $2-3 per hour on cloud platforms. To fine-tune a 7B model, you'd need 4 of them, or find a different way.&lt;/p&gt;

&lt;p&gt;This is where the magic happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Insight: Models Are Secretly Low-Rank
&lt;/h2&gt;

&lt;p&gt;Here's the key insight that changed everything: When you fine-tune a model on a new task, the weight updates don't require the &lt;strong&gt;full dimensionality&lt;/strong&gt; of the original weights. Most of the "important" changes can be captured in a &lt;strong&gt;low-rank structure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What does that mean?&lt;/p&gt;

&lt;p&gt;Imagine a weight matrix W that's 4096 × 4096 (typical in transformer layers). That's ~16 million individual parameters. But the researchers behind LoRA discovered something: you don't need to update all 16 million parameters. You can approximate the weight updates using two much smaller matrices.&lt;/p&gt;

&lt;p&gt;Let's visualize this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxfjimq9kjjdvwu40ctw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxfjimq9kjjdvwu40ctw.png" alt="Mermaid Diagram" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of updating 16.7 million parameters, LoRA updates only ~65,000 parameters (the two small matrices). That's &lt;strong&gt;99.6% fewer parameters&lt;/strong&gt; to train.&lt;/p&gt;

&lt;p&gt;The magic formula is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;W_new = W_original + ΔW
ΔW ≈ A × B  (where A is 4096 × r, B is r × 4096)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;r&lt;/code&gt; is the &lt;strong&gt;rank&lt;/strong&gt; — a small hyperparameter you choose (typically 8-64). The lower the rank, the fewer parameters you train.&lt;/p&gt;
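&lt;p&gt;A quick way to feel out the trade-off is to count the trainable parameters for a single 4096 × 4096 matrix at a few ranks:&lt;/p&gt;

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters for one LoRA-adapted matrix:
    A is (d_in x r), B is (r x d_out)."""
    return d_in * r + r * d_out

full = 4096 * 4096  # 16,777,216 parameters in the frozen W
for r in (8, 16, 32, 64):
    added = lora_params(4096, 4096, r)
    print(f"r={r}: {added} trainable ({100 * added / full:.2f}% of W)")
```

&lt;p&gt;With r=8 that's 65,536 trainable parameters, about 0.4% of the matrix, which is where the 99.6% reduction above comes from.&lt;/p&gt;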

&lt;h2&gt;
  
  
  How LoRA Actually Works
&lt;/h2&gt;

&lt;p&gt;Let's break down the LoRA training process step by step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Freeze the Base Model
&lt;/h3&gt;

&lt;p&gt;Your pre-trained model stays completely frozen. Its 7 billion parameters don't change. This is huge for memory savings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Add Tiny Adapter Matrices
&lt;/h3&gt;

&lt;p&gt;For every weight matrix you want to adapt (typically in the attention layers), you add two small matrices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Matrix A&lt;/strong&gt;: initialized randomly and small (e.g., 4096 × 8)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matrix B&lt;/strong&gt;: initialized to zero (this is important!)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Train Only the Adapters
&lt;/h3&gt;

&lt;p&gt;During fine-tuning, you only update A and B. These are trained using standard backpropagation on your downstream task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Merge During Inference (Optional)
&lt;/h3&gt;

&lt;p&gt;After training, you can merge A and B into the original weights: &lt;code&gt;W_new = W_original + A × B&lt;/code&gt;. This takes a few seconds and gives you a single model file with zero inference overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-trained weights = vast knowledge from billions of examples&lt;/li&gt;
&lt;li&gt;LoRA adapters = task-specific knowledge from your dataset&lt;/li&gt;
&lt;li&gt;Combined = best of both worlds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the training flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Flora-and-qlora-explained-fine-tune-llms-without-selling-your-kidney-for-gpus%2Fmermaid-5dc4a03bb2411c03abc73bf14f579326.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Flora-and-qlora-explained-fine-tune-llms-without-selling-your-kidney-for-gpus%2Fmermaid-5dc4a03bb2411c03abc73bf14f579326.png" alt="Mermaid Diagram" width="800" height="3755"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter QLoRA: The Final Boss of Efficiency
&lt;/h2&gt;

&lt;p&gt;LoRA is already mind-blowingly efficient. But what if you could go further?&lt;/p&gt;

&lt;p&gt;QLoRA combines LoRA with &lt;strong&gt;4-bit quantization&lt;/strong&gt; — a technique that compresses the model weights to use only 4 bits per parameter instead of 32 bits (8x compression).&lt;/p&gt;

&lt;h3&gt;
  
  
  What is 4-Bit Quantization?
&lt;/h3&gt;

&lt;p&gt;Instead of storing weights as full 32-bit floating-point numbers, you store them as 4-bit integers. Quantization is lossy (you lose some precision), but the loss is tiny.&lt;/p&gt;

&lt;p&gt;Quantization formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4-bit weight ≈ original 32-bit weight
Compression: 32 bits → 4 bits = 8x smaller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
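&lt;p&gt;To make the round trip concrete, here's a toy symmetric 4-bit quantizer in plain Python. It uses evenly spaced integer levels with one shared scale, whereas real QLoRA uses NF4 levels shaped for normally distributed weights and per-block scales:&lt;/p&gt;

```python
def quantize_4bit(weights):
    """Toy symmetric ('absmax') 4-bit quantization: each weight maps
    to an integer level in [-7, 7] plus one shared float scale."""
    scale = max(abs(w) for w in weights) / 7
    return [round(w / scale) for w in weights], scale

def dequantize(levels, scale):
    return [q * scale for q in levels]

w = [0.31, -0.12, 0.05, -0.44, 0.27]
q, s = quantize_4bit(w)
restored = dequantize(q, s)
# Every restored weight lands within half a quantization step
# (s / 2) of the original value — lossy, but close.
```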



&lt;h3&gt;
  
  
  QLoRA's Secret Sauce
&lt;/h3&gt;

&lt;p&gt;QLoRA doesn't just quantize. It uses three clever tricks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. NF4 (NormalFloat) Data Type&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A new data type mathematically optimal for normally distributed weights (which trained transformer weights approximately are)&lt;/li&gt;
&lt;li&gt;Better precision than standard 4-bit quantization&lt;/li&gt;
&lt;li&gt;Information-theoretically superior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Double Quantization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quantizes the quantization constants themselves&lt;/li&gt;
&lt;li&gt;Reduces memory overhead further&lt;/li&gt;
&lt;li&gt;Example: Instead of storing a 32-bit scaling factor per block, store a quantized version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Paged Optimizers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages memory spikes during backpropagation&lt;/li&gt;
&lt;li&gt;Moves data to CPU RAM when GPU RAM is full&lt;/li&gt;
&lt;li&gt;No crashes, just slower (but still fast)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Numbers
&lt;/h3&gt;

&lt;p&gt;Here's where it gets ridiculous:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Full Fine-Tuning&lt;/th&gt;
&lt;th&gt;LoRA&lt;/th&gt;
&lt;th&gt;QLoRA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;100-120 GB&lt;/td&gt;
&lt;td&gt;20-30 GB&lt;/td&gt;
&lt;td&gt;8-12 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13B&lt;/td&gt;
&lt;td&gt;200+ GB&lt;/td&gt;
&lt;td&gt;40-50 GB&lt;/td&gt;
&lt;td&gt;12-16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;70B&lt;/td&gt;
&lt;td&gt;2000+ GB&lt;/td&gt;
&lt;td&gt;400 GB&lt;/td&gt;
&lt;td&gt;60-80 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;QLoRA lets you fine-tune a 70B model on a &lt;strong&gt;single A100 80GB GPU&lt;/strong&gt;. Full fine-tuning the same model would need well over twenty of them.&lt;/p&gt;

&lt;p&gt;Or, more relevant to you: fine-tune 7-13B models on a &lt;strong&gt;free Google Colab T4 GPU&lt;/strong&gt; with 15GB RAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actually Doing It: The Colab Path
&lt;/h2&gt;

&lt;p&gt;This is where theory meets reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Need
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Google Colab (free tier)&lt;/li&gt;
&lt;li&gt;A Hugging Face account (free)&lt;/li&gt;
&lt;li&gt;A small dataset (1,000-10,000 examples minimum)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;transformers&lt;/code&gt;, &lt;code&gt;peft&lt;/code&gt;, and &lt;code&gt;bitsandbytes&lt;/code&gt; libraries&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  High-Level Steps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Load a Quantized Model&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/Mistral-7B-v0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# QLoRA magic
&lt;/span&gt;    &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Add LoRA Adapters&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;

&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# rank
&lt;/span&gt;    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# which weights to adapt
&lt;/span&gt;    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CAUSAL_LM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Train&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TrainingArguments&lt;/span&gt;

&lt;span class="n"&gt;training_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# tiny because of RAM
&lt;/span&gt;    &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;save_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;logging_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Seriously. The libraries handle the memory management, quantization, and LoRA logic for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Practical Gotchas (Learn From My Pain)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gotcha 1: Rank Selection is Non-Obvious
&lt;/h3&gt;

&lt;p&gt;Lower rank = fewer parameters = faster, less memory.&lt;br&gt;
Higher rank = more capacity = potentially better quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrong approach:&lt;/strong&gt; Pick r=8 because it's small.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right approach:&lt;/strong&gt; Try r=8, 16, 32 and compare. For most 7B models, r=16-32 works well. For very small datasets, r=8 might even be better (less overfitting).&lt;/p&gt;

&lt;h3&gt;
  
  
  Gotcha 2: Small Datasets Overfit Easily
&lt;/h3&gt;

&lt;p&gt;LoRA adapters are &lt;em&gt;tiny&lt;/em&gt; — often well under 1% of the model's parameters. That's great for efficiency, but even a tiny adapter will quickly memorize a dataset of only 100 examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrong approach:&lt;/strong&gt; "More epochs = better results"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right approach:&lt;/strong&gt; Start with 1 epoch, monitor validation loss. If validation loss increases while training loss decreases, you're overfitting. Add dropout, reduce rank, or get more data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gotcha 3: Forgetting to Merge
&lt;/h3&gt;

&lt;p&gt;If you train LoRA adapters and then share the fine-tuned model, you have two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Keep A and B separate&lt;/strong&gt; — the adapter file is only a few MB (versus gigabytes for the full model), but loading it requires the PEFT library&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge into base weights&lt;/strong&gt; — one large file, but works with standard transformers library
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Merge adapters into base model
&lt;/span&gt;&lt;span class="n"&gt;merged_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge_and_unload&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;merged_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./final_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users downstream will appreciate the merged version. Don't forget this step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gotcha 4: Quantization Loss is Real
&lt;/h3&gt;

&lt;p&gt;QLoRA is amazing, but 4-bit quantization does lose some information. For most tasks, it's negligible (80-90% of full fine-tuning quality). But for tasks requiring high precision (e.g., mathematical reasoning), consider LoRA without quantization if you have the RAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps: Actually Try This
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Get a dataset&lt;/strong&gt; — find one on Hugging Face Hub, or use a simple one like &lt;a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k" rel="noopener noreferrer"&gt;databricks-dolly-15k&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clone a Colab notebook&lt;/strong&gt; — start with &lt;a href="https://colab.research.google.com/drive/1PsW-ld7cAL1LblEV_vNb96oKwcS50C5c?usp=sharing" rel="noopener noreferrer"&gt;Hugging Face's QLoRA example&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modify for your task&lt;/strong&gt; — change the model, dataset, and LoRA rank&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Train&lt;/strong&gt; — hit play and wait&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test&lt;/strong&gt; — load your fine-tuned model and see if it actually works&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;LoRA and QLoRA democratized fine-tuning. Five years ago, only teams with GPU budgets could customize LLMs. Now, any college student with a laptop and Colab access can do it.&lt;/p&gt;

&lt;p&gt;These techniques are part of a broader movement toward &lt;strong&gt;Parameter-Efficient Fine-Tuning (PEFT)&lt;/strong&gt; — methods that adapt massive models using tiny tweaks. LoRA and QLoRA aren't the only ones (prefix tuning, adapter layers, and BitFit are others), but they're the most widely used in practice.&lt;/p&gt;

&lt;p&gt;And they work. Teams fine-tune production models with LoRA every single day. It's not a research curiosity anymore — it's a standard tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full fine-tuning is expensive.&lt;/strong&gt; Updating 7 billion parameters requires 100+ GB GPU RAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LoRA trains 0.4% of parameters.&lt;/strong&gt; It approximates weight updates using two small matrices instead of updating the whole model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QLoRA adds 4-bit quantization.&lt;/strong&gt; This lets you fine-tune 70B models on a single A100, or 7B models on free Colab.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The libraries do the heavy lifting.&lt;/strong&gt; You don't need to implement LoRA — Hugging Face PEFT handles it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start small, experiment, iterate.&lt;/strong&gt; Find the right rank, dataset size, and hyperparameters for your problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The GPU poverty problem isn't solved completely — but it's been heavily negotiated down. And that changes everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;If you want to dig deeper, these resources are essential:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://introl.com/blog/fine-tuning-infrastructure-lora-qlora-peft-scale-guide-2025" rel="noopener noreferrer"&gt;Fine-Tuning Infrastructure: LoRA, QLoRA, and PEFT at Scale&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.index.dev/blog/top-ai-fine-tuning-tools-lora-vs-qlora-vs-full" rel="noopener noreferrer"&gt;LoRA vs QLoRA: Best AI Model Fine-Tuning Platforms &amp;amp; Tools 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modal.com/blog/lora-qlora" rel="noopener noreferrer"&gt;LoRA vs. QLoRA: Efficient fine-tuning techniques for LLMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ibm.com/think/topics/lora" rel="noopener noreferrer"&gt;What is LoRA (Low-Rank Adaption)? - IBM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/learn/llm-course/en/chapter11/4" rel="noopener noreferrer"&gt;LoRA (Low-Rank Adaptation) - Hugging Face LLM Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;LoRA: Low-Rank Adaptation of Large Language Models&lt;/a&gt; (original paper)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2305.14314" rel="noopener noreferrer"&gt;QLoRA: Efficient Finetuning of Quantized LLMs&lt;/a&gt; (original paper)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/artidoro/qlora" rel="noopener noreferrer"&gt;QLoRA GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/blog/4bit-transformers-bitsandbytes" rel="noopener noreferrer"&gt;Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kdnuggets.com/fine-tuning-llamav2-with-qlora-on-google-colab-for-free" rel="noopener noreferrer"&gt;Fine Tuning LLAMAv2 with QLoRA on Google Colab for Free&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/blog/peft" rel="noopener noreferrer"&gt;Parameter-Efficient Fine-Tuning using PEFT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/huggingface/peft" rel="noopener noreferrer"&gt;Hugging Face PEFT GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Happy fine-tuning.&lt;/strong&gt; You've got this.&lt;/p&gt;




&lt;p&gt;Author: thousandmiles-ai-admin&lt;/p&gt;

</description>
      <category>learning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Chunking Strategies That Actually Work — Why Your RAG App Retrieves Garbage</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:31:28 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/chunking-strategies-that-actually-work-why-your-rag-app-retrieves-garbage-33md</link>
      <guid>https://dev.to/thousand_miles_ai/chunking-strategies-that-actually-work-why-your-rag-app-retrieves-garbage-33md</guid>
      <description>&lt;p&gt;Fixed-size, recursive, semantic — everyone has an opinion on the 'best' chunking strategy. The 2026 benchmarks are in, and the results will surprise you. Here's what actually works and why.&lt;/p&gt;




&lt;h1&gt;
  
  
  Chunking Strategies That Actually Work — Why Your RAG App Retrieves Garbage
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;The most boring part of your RAG pipeline is also the most consequential. Get chunking wrong, and nothing downstream can save you.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Contract That Lied
&lt;/h2&gt;

&lt;p&gt;Picture this. You're building a legal document assistant. A lawyer asks: "Is the company liable for damages in cases of force majeure?" Your RAG system retrieves a chunk that confidently states: "The company is liable for all damages arising from service interruptions." Clear answer, right?&lt;/p&gt;

&lt;p&gt;Except... the original document said: "The company is liable for all damages arising from service interruptions, &lt;strong&gt;except in cases of force majeure as defined in Section 12.&lt;/strong&gt;" Your chunker, set to split every 500 tokens, sliced the sentence right between "interruptions" and "except." The exception — the most important part — ended up in the next chunk. That chunk didn't get retrieved because the query was about liability, not about Section 12 definitions.&lt;/p&gt;

&lt;p&gt;One bad split. A completely wrong answer. And the user has no idea because the retrieved chunk looked perfectly valid.&lt;/p&gt;

&lt;p&gt;This isn't a contrived example. This pattern plays out constantly in production RAG systems. Tables split in half. Lists separated from their headers. Paragraphs that say "as mentioned above" — but "above" is in a different chunk that wasn't retrieved. Chunking errors are silent, invisible, and devastating.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;Chunking is the first real decision in any RAG pipeline, and it's the one most developers spend the least time on. Everyone obsesses over embedding models, vector databases, and LLM selection — but the quality of your chunks puts a hard ceiling on everything else. You can't retrieve what you've destroyed.&lt;/p&gt;

&lt;p&gt;And here's what makes it interesting: the 2026 benchmarks flipped a lot of assumptions on their head. The fanciest chunking methods? They're not winning. Understanding why requires understanding what each strategy actually does — and that's knowledge that shows up in both system design interviews and production debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Me Back Up — What Is Chunking and Why Do We Need It?
&lt;/h2&gt;

&lt;p&gt;Before your documents go into a vector database, they need to be broken into smaller pieces. There are two reasons for this.&lt;/p&gt;

&lt;p&gt;First, embedding models have token limits. Most models cap out at 512 or 8,192 tokens. You can't embed a 50-page PDF as a single unit.&lt;/p&gt;

&lt;p&gt;Second — and this is the less obvious reason — you want precision in retrieval. If your entire document is one big chunk, any query about any topic in that document will retrieve the whole thing. The LLM then has to find the needle in a haystack. Small, focused chunks mean the retriever can surface exactly the paragraph that answers the question.&lt;/p&gt;

&lt;p&gt;But "small and focused" creates its own problem: the smaller the chunk, the more context it loses. A chunk that says "this approach" without telling you what "this" refers to is useless. The art of chunking is finding the sweet spot between precision and context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqumyzbiouxtci86kjii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqumyzbiouxtci86kjii.png" alt="Mermaid Diagram" width="800" height="881"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The chunking step sits between your raw documents and the vector database. It determines the quality of everything downstream.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Strategies — Explained Like You're Pair-Programming
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Strategy 1: Fixed-Size Chunking
&lt;/h3&gt;

&lt;p&gt;This is the "I don't want to think about it" approach. You pick a number — say 500 tokens — and split the text every 500 tokens. Done.&lt;/p&gt;

&lt;p&gt;It's dead simple to implement, fast to run, and produces predictable chunk sizes. Most tutorials use this as the default, which is why most beginners start here.&lt;/p&gt;

&lt;p&gt;The problem? It has zero awareness of your text's structure. It will split mid-sentence, mid-paragraph, mid-table. That liability clause we talked about? Fixed-size chunking is exactly how it gets destroyed.&lt;/p&gt;

&lt;p&gt;The one saving grace is &lt;strong&gt;overlap&lt;/strong&gt;. By repeating the last 50–100 tokens of each chunk at the beginning of the next one, you create a buffer zone where boundary information isn't completely lost. It's a patch, not a fix — but it helps more than you'd expect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it works:&lt;/strong&gt; Flat, unstructured text like logs, transcripts, or scraped web content where there's no meaningful structure to preserve.&lt;/p&gt;
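&lt;p&gt;A minimal sketch of fixed-size chunking with overlap (word-based as a rough stand-in for tokens; a real pipeline would count with the embedding model's tokenizer):&lt;/p&gt;

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size word chunks, repeating the last
    `overlap` words of each chunk at the start of the next."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1200))
chunks = fixed_size_chunks(doc)
print(len(chunks))                                        # 3 chunks for 1200 words
print(chunks[0].split()[-50:] == chunks[1].split()[:50])  # True: the overlap buffer
```

&lt;p&gt;The overlap is what rescues boundary sentences: any sentence shorter than the overlap window that straddles a cut point appears whole in the next chunk.&lt;/p&gt;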

&lt;h3&gt;
  
  
  Strategy 2: Recursive Character Splitting
&lt;/h3&gt;

&lt;p&gt;This is the strategy that actually won the 2026 benchmarks — and it's not even that complicated.&lt;/p&gt;

&lt;p&gt;Instead of blindly splitting every N tokens, recursive splitting tries a hierarchy of separators. First, it tries to split on double newlines (paragraph breaks). If the resulting chunks are still too large, it splits on single newlines. Still too large? Sentences. Still too large? Words.&lt;/p&gt;

&lt;p&gt;The key insight: it respects natural boundaries first and only gets more aggressive when it has to. A 500-token paragraph stays intact. A 2,000-token section gets split at paragraph boundaries, not mid-sentence.&lt;/p&gt;

&lt;p&gt;Think of it like cutting a pizza. Fixed-size is a grid pattern — equal pieces but you cut through toppings. Recursive is cutting along the natural slice lines first, and only cutting slices in half if they're too big.&lt;/p&gt;

&lt;p&gt;Most frameworks (LangChain, LlamaIndex) use this as their &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;. The default separator hierarchy is: &lt;code&gt;["\n\n", "\n", " ", ""]&lt;/code&gt; — paragraphs, lines, words, characters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it works:&lt;/strong&gt; Almost everything. The 2026 FloTorch benchmark tested seven strategies across thousands of documents, and recursive splitting at 512 tokens achieved the highest answer accuracy and retrieval F1 scores.&lt;/p&gt;
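&lt;p&gt;The hierarchy is easy to sketch from scratch (character-based here for brevity; LangChain's &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; adds merging of small pieces and token-aware length functions on top of the same idea):&lt;/p&gt;

```python
def recursive_split(text, max_len=512, separators=("\n\n", "\n", " ", "")):
    """Try the coarsest separator first; only pieces that are still
    too large fall through to the next, finer separator."""
    if max_len >= len(text):
        return [text]
    sep = separators[0]
    if sep == "":
        # Last resort: hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    chunks = []
    for piece in text.split(sep):
        if max_len >= len(piece):
            chunks.append(piece)                # natural unit fits: keep intact
        else:
            chunks.extend(recursive_split(piece, max_len, separators[1:]))
    return [c for c in chunks if c.strip()]

doc = "A short intro paragraph.\n\n" + "This section runs long. " * 30
chunks = recursive_split(doc, max_len=200)
print(chunks[0])   # the intro paragraph survives intact
```

&lt;p&gt;Note how the short paragraph is never touched; only the oversized section gets split further.&lt;/p&gt;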

&lt;h3&gt;
  
  
  Strategy 3: Semantic Chunking
&lt;/h3&gt;

&lt;p&gt;Okay, here's where it gets interesting — and controversial.&lt;/p&gt;

&lt;p&gt;Semantic chunking uses embeddings to determine where to split. It embeds each sentence individually, then measures the similarity between consecutive sentences. When the similarity drops below a threshold — meaning the topic changed — it places a split there.&lt;/p&gt;

&lt;p&gt;The idea is elegant: instead of splitting based on character count, you split based on meaning. Each chunk should be a coherent unit about one topic.&lt;/p&gt;

&lt;p&gt;The problem? It's expensive and surprisingly inconsistent. You need to generate embeddings for every sentence just to decide where to split. For a large corpus, that means thousands of API calls or significant local compute before you've even started indexing.&lt;/p&gt;

&lt;p&gt;And the 2026 benchmarks showed something counterintuitive: semantic chunking often produced worse retrieval than recursive splitting. Why? Because semantic chunks vary wildly in size. Some end up with 50 tokens (too small for meaningful embedding), others with 2,000+ tokens (too large for precise retrieval). The inconsistent size makes it harder for the retriever to compare chunks fairly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fchunking-strategies-that-actually-work-why-your-rag-app-retrieves-garbage%2Fmermaid-9b2e150504e2a706d30736fb24d83922.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fchunking-strategies-that-actually-work-why-your-rag-app-retrieves-garbage%2Fmermaid-9b2e150504e2a706d30736fb24d83922.png" alt="Mermaid Diagram" width="800" height="2181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The same legal clause, chunked three ways. Fixed-size breaks it. Recursive preserves it. Semantic groups it by topic but may bundle too much.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it works:&lt;/strong&gt; Multi-topic narrative documents where topics shift unpredictably — research papers, long blog posts, interview transcripts. But only if you can afford the compute and are willing to tune the similarity threshold.&lt;/p&gt;
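&lt;p&gt;The mechanics are worth seeing in miniature. This sketch uses a toy bag-of-words similarity in place of a real embedding model (the &lt;code&gt;embed&lt;/code&gt; function here is a stand-in you would replace with an embedding API call), but the split logic is the same: start a new chunk whenever consecutive-sentence similarity drops below a threshold:&lt;/p&gt;

```python
import math
from collections import Counter

def embed(sentence):
    # Toy bag-of-words vector; a real system would call an embedding model.
    return Counter(sentence.lower().replace(".", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever similarity between consecutive
    sentences drops below the threshold (a likely topic shift)."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if threshold > cosine(embed(prev), embed(cur)):
            chunks.append([cur])          # topic changed: start new chunk
        else:
            chunks[-1].append(cur)        # same topic: extend current chunk
    return [" ".join(c) for c in chunks]

sents = [
    "The company is liable for service interruptions.",
    "Liability for interruptions excludes force majeure events.",
    "Employees accrue vacation days each month.",
]
print(semantic_chunks(sents))  # liability sentences together, HR sentence alone
```

&lt;p&gt;Even in this tiny example the chunk sizes already differ (two sentences vs. one), which at corpus scale becomes exactly the inconsistency problem described above.&lt;/p&gt;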

&lt;h2&gt;
  
  
  The 2026 Surprise — Why Simpler Is Winning
&lt;/h2&gt;

&lt;p&gt;Here's what caught everyone off guard. When comprehensive benchmarks tested all these strategies head-to-head, the ranking was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Recursive splitting (512 tokens)&lt;/strong&gt; — highest accuracy, highest retrieval F1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fixed-size (512 tokens with overlap)&lt;/strong&gt; — close second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic chunking&lt;/strong&gt; — middle of the pack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proposition-based chunking&lt;/strong&gt; (using LLMs to decompose) — expensive, marginally better on some tasks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The reason simpler methods won isn't that they're inherently superior — it's that they produce consistent, predictable chunk sizes. Embedding models and retrievers are optimized for chunks in the 256–512 token range. When your chunks are consistently in that range, the entire pipeline works more predictably.&lt;/p&gt;

&lt;p&gt;Semantic and proposition-based methods also create 3–5x more chunks for the same corpus. More chunks means more embeddings, more storage, more compute, and — counterintuitively — more noise in retrieval. The cost multiplier compounds at every layer.&lt;/p&gt;

&lt;p&gt;Does the 3% accuracy improvement semantic chunking sometimes delivers justify 10x the processing cost? For most applications, no.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistakes That Bite — The Chunking Errors Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"I'll just use the default 1,000-token chunks."&lt;/strong&gt; This is the most common error. Most tutorials and framework defaults use 1,000 tokens. But most embedding models are optimized for 256–512 tokens. Larger chunks dilute the embedding — instead of representing one specific idea, they represent a fuzzy average of several ideas. Drop to 512 with 50-token overlap and measure the difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Chunking is a one-time decision."&lt;/strong&gt; Different document types need different strategies. Your API docs might work perfectly with recursive splitting, while your meeting transcripts might need semantic chunking. Don't apply one strategy to your entire corpus blindly. Profile your document types and test each one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Tables and code can be chunked like prose."&lt;/strong&gt; They absolutely cannot. A table split in half is worse than useless — it's misleading. Code split mid-function is syntactically invalid. Extract tables and code blocks as separate units, preserve their structure, and add surrounding context (the header before the table, the function name, the paragraph that references it).&lt;/p&gt;

&lt;h2&gt;
  
  
  Now Go Break Something — Where to Go from Here
&lt;/h2&gt;

&lt;p&gt;Here's a weekend experiment that'll teach you more about chunking than any article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Take a document you know well&lt;/strong&gt; — your project docs, college notes, anything where you can verify the answers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk it three ways&lt;/strong&gt; using LangChain's splitters: &lt;code&gt;CharacterTextSplitter&lt;/code&gt; (fixed), &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; (recursive), and &lt;code&gt;SemanticChunker&lt;/code&gt; from langchain-experimental.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ask the same 10 questions&lt;/strong&gt; to a RAG pipeline using each chunking strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare the retrieved chunks&lt;/strong&gt; side by side. You'll immediately see where fixed-size destroys context, where recursive preserves it, and where semantic produces inconsistent sizes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For deeper exploration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Search for "FloTorch RAG chunking benchmark 2026" — the full benchmark results with methodology&lt;/li&gt;
&lt;li&gt;The LangChain docs have a great comparison page for all their text splitters&lt;/li&gt;
&lt;li&gt;Check out the Weaviate blog's chunking guide — it has practical examples for different document types&lt;/li&gt;
&lt;li&gt;For advanced work, look into "late chunking" — a newer approach that embeds the full document with a long-context model first, then pools the token embeddings into per-chunk vectors, preserving long-range context&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;That legal document assistant that said "the company is liable for all damages"? It wasn't lying — it was reading the only chunk it had, and that chunk had been amputated mid-sentence. Swap to recursive splitting at 512 tokens with overlap, and the full clause — exception included — stays intact. The fix wasn't a better model or a smarter prompt. It was a better cut.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Author: Shibin&lt;/p&gt;

</description>
      <category>learning</category>
      <category>ai</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Build Your First MCP Server in Python — A Weekend Project That Actually Impresses</title>
      <dc:creator>Thousand Miles AI</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:31:24 +0000</pubDate>
      <link>https://dev.to/thousand_miles_ai/build-your-first-mcp-server-in-python-a-weekend-project-that-actually-impresses-318f</link>
      <guid>https://dev.to/thousand_miles_ai/build-your-first-mcp-server-in-python-a-weekend-project-that-actually-impresses-318f</guid>
      <description>&lt;p&gt;You've heard MCP is the 'USB-C for AI.' But what does it take to actually build one? A hands-on walkthrough of creating an MCP server from scratch using Python and FastMCP — with tools your LLM can call.&lt;/p&gt;




&lt;h1&gt;
  
  
  Build Your First MCP Server in Python — A Weekend Project That Actually Impresses
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Everyone talks about MCP. Very few people have actually built a server. Here's how to be one of them — in about an hour.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Moment Your LLM Gets Hands
&lt;/h2&gt;

&lt;p&gt;You've been playing with ChatGPT or Claude for months. Asking questions, generating code, summarizing documents. But there's always a wall: the model can only work with what you give it. Want it to check today's weather? You copy-paste from a browser. Want it to query your database? You run the query yourself and paste the results.&lt;/p&gt;

&lt;p&gt;Now imagine you told your AI: "What's the weather in Chennai right now?" — and it actually went and fetched the answer. Not from training data. Not from a cached response. It called a real API, got real-time data, and gave you the result.&lt;/p&gt;

&lt;p&gt;That's what an MCP server lets you do. You build a small Python service that exposes "tools" — functions your LLM can discover and call. The LLM sees what tools are available, decides which one to use, passes the right parameters, and gets the response back. No copy-pasting. No manual plumbing.&lt;/p&gt;

&lt;p&gt;And the best part? The server you build works with any MCP-compatible client — Claude Desktop, Cursor, VS Code, or any custom app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;Two reasons. First, MCP is becoming the standard way AI agents interact with external tools, adopted by both Anthropic and OpenAI, and governed by the Linux Foundation. Knowing how to build MCP servers is a genuinely useful skill for any AI-focused role.&lt;/p&gt;

&lt;p&gt;Second — and more practically — this is one of the best portfolio projects you can build right now. Most people's AI projects are "I wrapped an API call in a chatbot." Building an MCP server shows you understand protocols, tool design, and how AI agents actually connect to the real world. That's a very different conversation in an interview.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Me Back Up — What Are We Building?
&lt;/h2&gt;

&lt;p&gt;We're going to build a Python MCP server that exposes three tools to any LLM client:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;get_weather&lt;/strong&gt; — Fetches current weather for any city using a free API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;calculate&lt;/strong&gt; — Evaluates a math expression safely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;random_fact&lt;/strong&gt; — Returns a random fun fact (because why not)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The server will run locally on your machine. We'll connect it to Claude Desktop so you can actually chat with an AI that uses your tools. The whole thing takes about 50 lines of Python.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fbuild-your-first-mcp-server-in-python-a-weekend-project-that-actually-impresses%2Fmermaid-056538aabc236d2a39a78aa9be01e7bb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fbuild-your-first-mcp-server-in-python-a-weekend-project-that-actually-impresses%2Fmermaid-056538aabc236d2a39a78aa9be01e7bb.png" alt="Mermaid Diagram" width="800" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What we're building: a Python server that exposes tools to Claude Desktop via MCP.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Okay, Let's Build It — Step by Step
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Set Up Your Environment
&lt;/h3&gt;

&lt;p&gt;You need Python 3.10 or higher. If you're on a Mac or Linux machine, you probably already have it. Check with &lt;code&gt;python3 --version&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Create a project folder and set up a virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;my-mcp-server &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;my-mcp-server
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate  &lt;span class="c"&gt;# On Windows: venv\Scripts\activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the MCP SDK. The recommended way is using &lt;code&gt;pip&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"mcp[cli]"&lt;/span&gt; httpx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;mcp&lt;/code&gt; is the official MCP Python SDK. &lt;code&gt;httpx&lt;/code&gt; is for making HTTP requests to external APIs. We're also pulling in FastMCP, which is included in the SDK and gives us a clean decorator-based API for defining tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Write the Server
&lt;/h3&gt;

&lt;p&gt;Create a file called &lt;code&gt;server.py&lt;/code&gt;. Here's the entire thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="c1"&gt;# Create the MCP server
&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-first-server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Tool 1: Get weather for a city
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get current weather for a city. Returns temperature and conditions.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://wttr.in/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;?format=j1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_condition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp_C&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weatherDesc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;°C, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Tool 2: Safe calculator
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Evaluate a math expression safely. Example: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2 + 3 * 4&lt;/span&gt;&lt;span class="sh"&gt;'"""&lt;/span&gt;
    &lt;span class="n"&gt;allowed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0123456789+-*/.() &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;allowed&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Only numbers and basic operators allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Safe because we filtered input
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Tool 3: Random fun fact
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;random_fact&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return a random fun fact about technology or science.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;facts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The first computer bug was an actual bug — a moth found in a Harvard Mark II computer in 1947.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The first 1GB hard drive, introduced in 1980, weighed about 250 kg and cost $40,000.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;About 90% of the world&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s data was created in the last two years.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The average smartphone today has more computing power than NASA had for the Apollo 11 moon landing.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The first website ever created is still online at info.cern.ch.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;facts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stdio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Seriously. Let's break down what's happening.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Understand What You Just Wrote
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;FastMCP&lt;/strong&gt; is a wrapper from the official SDK that makes defining tools dead simple. You create an instance, decorate your functions with &lt;code&gt;@mcp.tool()&lt;/code&gt;, and FastMCP handles all the MCP protocol stuff — JSON-RPC messages, tool discovery, parameter validation.&lt;/p&gt;

&lt;p&gt;Each tool function has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Type hints&lt;/strong&gt; — &lt;code&gt;city: str&lt;/code&gt; tells the LLM what parameters the tool expects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A docstring&lt;/strong&gt; — This is critical. The LLM reads this to decide when to use the tool. Write it like you're explaining the tool to a person.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A return value&lt;/strong&gt; — A string that gets sent back to the LLM as the observation&lt;/li&gt;
&lt;/ul&gt;
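&lt;p&gt;You can see the raw material FastMCP works from using only the standard library — no &lt;code&gt;mcp&lt;/code&gt; package needed. The &lt;code&gt;get_weather&lt;/code&gt; body below is a stand-in, not the real implementation:&lt;/p&gt;

```python
import inspect
from typing import get_type_hints

# Stand-in tool function with the same shape as the one in the server above.
def get_weather(city: str) -> str:
    """Get current weather for a city. Returns temperature and conditions."""
    return f"(stub) weather for {city}"

# FastMCP derives the tool's advertised schema from exactly these two pieces:
print(get_type_hints(get_weather))  # parameter and return types
print(inspect.getdoc(get_weather))  # the description the LLM reads
```

&lt;p&gt;That's the whole contract: annotations become the parameter schema, and the docstring becomes the description the model uses to pick a tool.&lt;/p&gt;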

&lt;p&gt;The &lt;code&gt;mcp.run(transport="stdio")&lt;/code&gt; at the bottom starts the server using standard input/output — this is how Claude Desktop communicates with local MCP servers. No HTTP, no ports, just stdin/stdout.&lt;/p&gt;
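&lt;p&gt;For the curious, here is roughly what one of those stdin/stdout messages looks like — a JSON-RPC 2.0 request asking your server to invoke a tool. Values are illustrative; the MCP spec defines the exact shapes:&lt;/p&gt;

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "get_weather",
    "arguments": { "city": "Bangalore" }
  }
}
```

&lt;p&gt;FastMCP parses this, calls your decorated function with &lt;code&gt;city="Bangalore"&lt;/code&gt;, and writes a matching JSON-RPC response back to stdout.&lt;/p&gt;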

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fbuild-your-first-mcp-server-in-python-a-weekend-project-that-actually-impresses%2Fmermaid-9716503505c55b9235c7aae41b08f67b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fthousandmiles-ai%2Fblogs%2Fmain%2Fblog-images%2Fbuild-your-first-mcp-server-in-python-a-weekend-project-that-actually-impresses%2Fmermaid-9716503505c55b9235c7aae41b08f67b.png" alt="Mermaid Diagram" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The full flow: you ask a question, Claude decides to use your tool, the server calls the API, and the result flows back.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Connect to Claude Desktop
&lt;/h3&gt;

&lt;p&gt;Now the fun part — making Claude Desktop actually use your server.&lt;/p&gt;

&lt;p&gt;Open Claude Desktop and go to &lt;strong&gt;Settings &amp;gt; Developer &amp;gt; Edit Config&lt;/strong&gt;. This opens a JSON file. Add your server to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"my-first-server"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/full/path/to/your/server.py"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;/full/path/to/your/server.py&lt;/code&gt; with the actual path to your file. Save, and restart Claude Desktop.&lt;/p&gt;
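&lt;p&gt;Not sure what the absolute path is? On Linux and macOS you can resolve it from the terminal (the &lt;code&gt;server.py&lt;/code&gt; created here is a throwaway stand-in for the demo):&lt;/p&gt;

```shell
# Resolve the absolute path to paste into "args" in the config.
touch server.py     # stand-in file so the command has something to resolve
realpath server.py  # prints an absolute path like /home/you/projects/server.py
```

&lt;p&gt;If &lt;code&gt;python3&lt;/code&gt; isn't on Claude Desktop's PATH (common with virtual environments), &lt;code&gt;which python3&lt;/code&gt; gives you an absolute interpreter path to use in the &lt;code&gt;"command"&lt;/code&gt; field as well.&lt;/p&gt;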

&lt;p&gt;You should now see a small hammer icon in the chat input area — that means Claude has discovered your tools. Click it to see the three tools listed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Test It
&lt;/h3&gt;

&lt;p&gt;Type into Claude: "What's the weather in Bangalore right now?"&lt;/p&gt;

&lt;p&gt;Claude should recognize that it needs to use the &lt;code&gt;get_weather&lt;/code&gt; tool, call your server, and return the live weather data. Try the calculator: "What's 15 * 37 + 42?" Try the fun fact: "Tell me a random tech fact."&lt;/p&gt;

&lt;p&gt;Each time, you'll see Claude decide which tool to use, call it through your MCP server, and incorporate the result into its response. You've just given an LLM the ability to do things it couldn't do before.&lt;/p&gt;
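&lt;p&gt;A habit worth building: test your tool logic as plain Python before involving Claude at all. Here's a hypothetical stand-in for the calculator tool — the article's actual implementation may differ — that evaluates arithmetic by walking the AST instead of calling &lt;code&gt;eval()&lt;/code&gt;:&lt;/p&gt;

```python
import ast
import operator

# Safe arithmetic: walk the parsed AST and allow only these four operators.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression (+, -, *, /) and return the result."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return str(walk(ast.parse(expression, mode="eval").body))

print(calculate("15 * 37 + 42"))  # 597 -- the same sum you just asked Claude
```

&lt;p&gt;If the function works here, any weirdness in Claude Desktop is a wiring problem, not a logic problem — which narrows your debugging fast.&lt;/p&gt;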

&lt;h2&gt;
  
  
  Making It Better — Ideas for Your Next Steps
&lt;/h2&gt;

&lt;p&gt;The basic server works, but here are a few directions to take it further:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add more tools.&lt;/strong&gt; Wrap any API you use regularly — a to-do list API, a movie database, your college's timetable system, a GitHub API for checking your repos. Each tool is just another decorated function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add resources.&lt;/strong&gt; Tools let the LLM do things. Resources let it read things. You can expose file contents, database records, or API responses as read-only resources that the LLM can pull into its context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try the inspector.&lt;/strong&gt; The MCP SDK ships with a built-in inspector tool. Run &lt;code&gt;mcp dev server.py&lt;/code&gt; to get a web UI where you can test your tools interactively without needing Claude Desktop. Super useful for debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploy it remotely.&lt;/strong&gt; Local servers are great for development, but for production you'd want an HTTP-based transport such as SSE or Streamable HTTP. The FastMCP docs cover this — it's a one-line change from &lt;code&gt;stdio&lt;/code&gt; to &lt;code&gt;sse&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistakes That Bite — Things That Trip Up Beginners
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"My tools don't show up in Claude Desktop."&lt;/strong&gt; The most common issue. Double-check: is the path in the config JSON absolute? Is the virtual environment activated? Did you restart Claude Desktop after editing the config? The server needs to start successfully for tools to appear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"The docstring doesn't matter, right?"&lt;/strong&gt; Wrong. The LLM uses the docstring to decide whether and when to use your tool. A vague docstring like "does something with weather" will confuse the model. Be specific: "Get current weather for a city. Returns temperature in Celsius and conditions."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I'll just expose my entire database as a tool."&lt;/strong&gt; Resist this urge. Each tool should do one specific thing. A tool called &lt;code&gt;query_anything&lt;/code&gt; that accepts raw SQL is both a security nightmare and confusing for the LLM. Instead, create focused tools like &lt;code&gt;get_user_by_email&lt;/code&gt; or &lt;code&gt;list_recent_orders&lt;/code&gt;. Smaller, focused tools get used correctly more often.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now Go Break Something
&lt;/h2&gt;

&lt;p&gt;You've just built something that most developers only read about. An MCP server — the standard protocol that major AI labs are converging on — running on your machine, giving an LLM real-world capabilities.&lt;/p&gt;

&lt;p&gt;Here's what to explore next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The official MCP docs&lt;/strong&gt; at modelcontextprotocol.io have a quickstart guide and full API reference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The free Hugging Face MCP course&lt;/strong&gt; walks through building servers and connecting them to agents, with hands-on exercises&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastMCP's GitHub repo&lt;/strong&gt; has examples for advanced patterns like authentication, streaming, and resource management&lt;/li&gt;
&lt;li&gt;Search for "MCP server examples" on GitHub — the community has built servers for Notion, Kubernetes, Spotify, and hundreds of other services. Reading other people's servers is one of the fastest ways to learn&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Remember staring at your LLM, wishing it could just check the weather or run a calculation instead of making you do it? Fifty lines of Python later, it can. That's what MCP is about — not replacing what LLMs do well, but giving them the tools to do what they couldn't. Your server is small. The pattern scales to anything.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Author: Shibin&lt;/p&gt;

</description>
      <category>learning</category>
      <category>ai</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
