<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vahid Aghajani</title>
    <description>The latest articles on DEV Community by Vahid Aghajani (@vahid_aghajani_60ce9dbec9).</description>
    <link>https://dev.to/vahid_aghajani_60ce9dbec9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4015358%2F35ccb2f9-355f-4af6-a004-19ae755a9d8c.png</url>
      <title>DEV Community: Vahid Aghajani</title>
      <link>https://dev.to/vahid_aghajani_60ce9dbec9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vahid_aghajani_60ce9dbec9"/>
    <language>en</language>
    <item>
      <title>Why Your LLM Keeps Returning Garbage JSON (And How to Stop It)</title>
      <dc:creator>Vahid Aghajani</dc:creator>
      <pubDate>Sat, 04 Jul 2026 19:22:01 +0000</pubDate>
      <link>https://dev.to/vahid_aghajani_60ce9dbec9/why-your-llm-keeps-returning-garbage-json-and-how-to-stop-it-2k74</link>
      <guid>https://dev.to/vahid_aghajani_60ce9dbec9/why-your-llm-keeps-returning-garbage-json-and-how-to-stop-it-2k74</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on &lt;a href="https://software-engineer-blog.com/content/why-your-llm-keeps-returning-garbage-json-and-how-to-stop-it?id=60" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;. Cross-posted here with a canonical link.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/AlcJ7cdkKZI"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Prefer audio? &lt;a href="https://open.spotify.com/episode/1aTxrbNa3ZwNJNO5Ww6DpN" rel="noopener noreferrer"&gt;Spotify episode&lt;/a&gt; · &lt;a href="https://t.me/SoftwareEngineerBlog" rel="noopener noreferrer"&gt;Telegram&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You wire up an LLM call. The demo is magical. You ship it.&lt;/p&gt;

&lt;p&gt;The next morning Sentry is on fire. &lt;code&gt;json.JSONDecodeError: Expecting value: line 1 column 1 (char 0)&lt;/code&gt;. You open the failed payload and the model has politely returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Sure!&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Here's&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;you&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;asked&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for:&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;```json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Acme Corp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"founded"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1998&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;```&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;me&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;know&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;you&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;need&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;anything&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;else!&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things broke at once: a chatty preamble, a markdown code fence, and a trailing comma. Every team building on LLMs hits this within the first week of going to production. The fix isn't one trick — it's three layers, each catching what the previous one misses.&lt;/p&gt;

&lt;p&gt;This post is the layered playbook: native structured-output APIs first, typed validation second, repair-and-retry third. It's the same pattern shipping in our own production code today.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why LLMs Fail at JSON in the First Place
&lt;/h2&gt;

&lt;p&gt;A language model doesn't &lt;em&gt;output&lt;/em&gt; JSON. It outputs the most likely next token, repeatedly. JSON is just a particular sequence of tokens it has seen a lot during training. So whenever the prompt is even slightly ambiguous about format — or the model is fine-tuned to be helpful and conversational — those token probabilities drift toward natural language.&lt;/p&gt;

&lt;p&gt;Common failure modes, ranked by frequency:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Wrapped in markdown&lt;/strong&gt; — &lt;code&gt;&lt;/code&gt;&lt;code&gt;json ...&lt;/code&gt;&lt;code&gt;&lt;/code&gt; because the training data is full of code blocks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chatty preamble or trailing chatter&lt;/strong&gt; — &lt;em&gt;"Here's your JSON:"&lt;/em&gt;, &lt;em&gt;"Hope this helps!"&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trailing commas&lt;/strong&gt; — JSON forbids them, JavaScript and Python don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single quotes&lt;/strong&gt; — looks like JSON, isn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unescaped quotes inside strings&lt;/strong&gt; — &lt;em&gt;"He said "hi""&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Truncation mid-object&lt;/strong&gt; — token limit hit, last brace missing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated fields&lt;/strong&gt; — extra keys you didn't ask for; missing required ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong types&lt;/strong&gt; — &lt;code&gt;"founded": "1998"&lt;/code&gt; (string) when you asked for an integer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can't prompt your way out of all of these. You need the model to be &lt;em&gt;constrained&lt;/em&gt;, the output to be &lt;em&gt;typed&lt;/em&gt;, and a &lt;em&gt;fallback&lt;/em&gt; for when both still fail.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: Use the Native Structured-Output API
&lt;/h2&gt;

&lt;p&gt;Every major provider now ships a way to constrain decoding to a schema. Use it. This single change kills 80–90% of the failure modes above.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI — &lt;code&gt;response_format={"type": "json_schema", ...}&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The model is forced to produce tokens that satisfy the schema. No prose, no fences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;founded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;industry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-2024-08-06&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract from: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Acme Corp, founded 1998, makes industrial widgets.&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Company&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;parsed&lt;/code&gt; is already a typed Pydantic object. No &lt;code&gt;json.loads&lt;/code&gt;. No regex. No prayer.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Set &lt;code&gt;temperature=0&lt;/code&gt; whenever structured outputs are on.&lt;/strong&gt; Creativity &lt;em&gt;inside&lt;/em&gt; a schema doesn't make the output better — it makes the model more likely to invent fields, drift toward the edges of an enum, or produce values that pass the schema but fail your validators. Save the temperature for prose generation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Anthropic — Tool use as a schema enforcer
&lt;/h3&gt;

&lt;p&gt;Claude doesn't have a &lt;code&gt;response_format&lt;/code&gt; field, but tool definitions act the same way: declare a tool with an &lt;code&gt;input_schema&lt;/code&gt;, force the model to call it, and the &lt;code&gt;input&lt;/code&gt; you get back is schema-conformant JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract_company&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract structured company data.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_json_schema&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract_company&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Acme Corp, founded 1998, makes industrial widgets.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;company&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;tool_choice&lt;/code&gt; forces the model to call the tool, which forces schema compliance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini — &lt;code&gt;response_mime_type&lt;/code&gt; + &lt;code&gt;response_schema&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Gemini is &lt;em&gt;especially&lt;/em&gt; strong at enforcing enums during decoding. If a field has a fixed set of valid values, declare it as a &lt;code&gt;Literal&lt;/code&gt; — Gemini will physically refuse to emit a token outside that set, killing a whole class of Layer-2 validation failures before they happen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CompanyStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;founded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;industry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;software&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manufacturing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;other&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acquired&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defunct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Acme Corp, founded 1998, makes industrial widgets. Still trading.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;response_mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CompanyStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;company&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CompanyStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; if the API has a structured-output mode, use it. Don't hand-roll prompts that say &lt;em&gt;"return only valid JSON, no other text"&lt;/em&gt; — that worked in 2023 and it still kind of works in 2026, but it's strictly worse than constrained decoding.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 2: Validate With Pydantic (Even If Layer 1 Worked)
&lt;/h2&gt;

&lt;p&gt;Constrained decoding gives you syntactic JSON. It does &lt;strong&gt;not&lt;/strong&gt; give you semantic correctness. The model can still:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Return &lt;code&gt;founded: 1&lt;/code&gt; when you wanted a 4-digit year.&lt;/li&gt;
&lt;li&gt;Return an empty &lt;code&gt;name&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Return a plausible-but-wrong industry.&lt;/li&gt;
&lt;li&gt;Return all-nulls because the source text didn't actually contain the fields.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pydantic catches these at the boundary, before bad data enters your domain logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field_validator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;founded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;le&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2030&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;industry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;software&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manufacturing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;other&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nd"&gt;@field_validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;name_not_placeholder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n/a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name looks like a placeholder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Validation failures here are &lt;em&gt;useful signal&lt;/em&gt;, not just errors. If &lt;code&gt;founded=1&lt;/code&gt; keeps tripping the validator, your prompt is ambiguous — fix the prompt. If &lt;code&gt;industry="other"&lt;/code&gt; appears too often, your enum is too narrow — fix the schema.&lt;/p&gt;

&lt;p&gt;The pattern: every LLM call returns into a Pydantic model, and &lt;strong&gt;the rest of your codebase only ever sees validated objects&lt;/strong&gt;. Treat the LLM like a third-party API you don't trust.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 3: Repair and Retry When Both Layers Above Fail
&lt;/h2&gt;

&lt;p&gt;For ~95% of calls, layers 1 and 2 are enough. The remaining 5% — long inputs, weird edge cases, model degradation, rate-limit retries that hit a different model version — still fails. You need a fallback.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;json_repair&lt;/code&gt; for almost-JSON
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;json-repair&lt;/code&gt; is a small library that fixes the common malformations: trailing commas, single quotes, missing closing braces, markdown fences, prose-around-JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;json_repair&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;repair_json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;  &lt;span class="c1"&gt;# the model's raw string
&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;repair_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's not magic — it's a forgiving parser. It will succeed on inputs that strict JSON refuses, and it has saved more production calls than any single prompt change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retry with the validation error fed back to the model
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;repair_json&lt;/code&gt; fails &lt;em&gt;and&lt;/em&gt; Pydantic validation fails, retry with the error message in the next prompt. The model is genuinely good at fixing its own mistakes when you tell it what broke:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_with_repair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# native structured-output call
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;schema_cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your previous response failed validation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Fix it and return only valid JSON matching the schema.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two retries is the sweet spot. One isn't enough for sticky failures; three burns tokens for almost no extra success rate.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Retries are quadratic in token cost — use prompt caching to flatten the curve.&lt;/strong&gt; A 50K-token prompt that retries twice is 150K billed tokens at full price unless you cache. OpenAI, Anthropic, and Gemini all ship prompt caching in 2026; the second and third attempts should hit the cached prefix at a fraction of the cost (typically 10–25%). Cache the system prompt + the source document, vary only the validation-error feedback message.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Make failure structured, too
&lt;/h3&gt;

&lt;p&gt;When all retries are exhausted, don't &lt;code&gt;raise Exception("LLM failed")&lt;/code&gt;. Raise a typed exception your caller can branch on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;InsufficientDataError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The source material genuinely didn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t contain the requested fields.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SchemaViolation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The model couldn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t conform to the schema after retries.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Different failures deserve different handling. A &lt;code&gt;SchemaViolation&lt;/code&gt; is a model/prompt problem — log it, alert. An &lt;code&gt;InsufficientDataError&lt;/code&gt; is a &lt;em&gt;data&lt;/em&gt; problem — surface it to the user as &lt;em&gt;"we couldn't extract X from this document"&lt;/em&gt;, not as a 500.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here's the full pattern, condensed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract structured company data from the text.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_with_repair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema_cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three layers, one entry point. The caller never sees a &lt;code&gt;JSONDecodeError&lt;/code&gt;. They get a &lt;code&gt;Company&lt;/code&gt; or a typed exception they can handle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;You're calling…&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o or newer&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;response_format=PydanticModel&lt;/code&gt; (&lt;code&gt;.parse()&lt;/code&gt; API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Claude&lt;/td&gt;
&lt;td&gt;Tool with forced &lt;code&gt;tool_choice&lt;/code&gt; + &lt;code&gt;model_json_schema()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5+&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;response_mime_type="application/json"&lt;/code&gt; + &lt;code&gt;response_schema=PydanticModel&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenRouter wrapper&lt;/td&gt;
&lt;td&gt;Whatever the underlying model supports — check, don't assume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Llama/Mistral via vLLM&lt;/td&gt;
&lt;td&gt;Outlines or LM Format Enforcer for grammar-constrained decoding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anything older or weirder&lt;/td&gt;
&lt;td&gt;Plain prompt + &lt;code&gt;json_repair&lt;/code&gt; + Pydantic + retry loop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  And one more axis: model size
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model class&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Frontier&lt;/strong&gt; (GPT-4o, Claude Opus 4.x, Gemini 2.5 Pro)&lt;/td&gt;
&lt;td&gt;Layers 1 + 2 are usually enough. Layer 3 catches the long tail.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Small / edge&lt;/strong&gt; (Gemini Flash, Llama 3.x 8B, Phi-4, Mistral 7B)&lt;/td&gt;
&lt;td&gt;Layer 3 is &lt;strong&gt;mandatory&lt;/strong&gt;. Small models trip on nested schemas, Optional fields, and long enums far more often. Budget for 2–5% retry rate even with structured outputs on.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Gotchas Nobody Tells You
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schemas with &lt;code&gt;Optional[T]&lt;/code&gt; get filled with &lt;code&gt;None&lt;/code&gt; aggressively.&lt;/strong&gt; The model treats &lt;em&gt;"I don't know"&lt;/em&gt; as a valid answer when nullability is allowed. If you need extraction to be honest about missing data, use a typed exception path instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enums with too many values regress to "other".&lt;/strong&gt; Keep &lt;code&gt;Literal[...]&lt;/code&gt; lists short. If you need 50 categories, use a two-step pipeline: free-text → embedding → nearest enum.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;additionalProperties: false&lt;/code&gt; matters.&lt;/strong&gt; Without it, the model invents fields. Pydantic v2 emits this by default; double-check if you write the JSON Schema by hand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming + structured outputs is half-broken everywhere.&lt;/strong&gt; You can stream the JSON, but you can't typed-parse it until the stream finishes. Don't promise users a typewriter effect on extracted data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON mode is not free.&lt;/strong&gt; Constrained decoding adds 10–30% latency on most providers. Worth it for correctness, but budget for it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retries cost money quadratically when prompts are long.&lt;/strong&gt; A 50K-token prompt that retries twice is 150K tokens. Cache the system prompt aggressively.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The "LLM-as-software-component" problem isn't solved by a smarter model. It's solved by treating the model like every other unreliable upstream service: constrain what it can return, validate what it does return, repair what's almost right, retry with feedback, fail loudly with typed errors when all else breaks.&lt;/p&gt;

&lt;p&gt;Three layers. Each one catches what the previous one missed. Ship that and your Sentry inbox stops lighting up at 6 a.m.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Vision Language Models — When AI Learns to See and Talk (Part 3 of 3)</title>
      <dc:creator>Vahid Aghajani</dc:creator>
      <pubDate>Sat, 04 Jul 2026 18:17:21 +0000</pubDate>
      <link>https://dev.to/vahid_aghajani_60ce9dbec9/vision-language-models-when-ai-learns-to-see-and-talk-part-3-of-3-51ig</link>
      <guid>https://dev.to/vahid_aghajani_60ce9dbec9/vision-language-models-when-ai-learns-to-see-and-talk-part-3-of-3-51ig</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on &lt;a href="https://software-engineer-blog.com/content/vision-language-models-when-ai-learns-to-see-and-talk-part-3-of-3?id=51" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;. Cross-posted here with a canonical link.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/8oWyMHvepMQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This is Part 3 of a 3-part series on the transformer revolution in vision and language:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1:&lt;/strong&gt; &lt;a href="https://software-engineer-blog.com/content/transformers-the-architecture-that-changed-ai-part-1-of-3?id=49" rel="noopener noreferrer"&gt;Transformers — The Architecture That Changed AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2:&lt;/strong&gt; &lt;a href="https://software-engineer-blog.com/content/vision-transformers-how-transformers-learned-to-see-part-2-of-3?id=50" rel="noopener noreferrer"&gt;Vision Transformers — How Transformers Learned to See&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3:&lt;/strong&gt; Vision Language Models — When AI Learns to See and Talk (this post)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Part 1, we covered how the transformer architecture replaced RNNs and CNNs as the backbone of modern AI. In Part 2, we saw how Vision Transformers (ViTs) brought that same architecture to image understanding — splitting images into patches and treating them like tokens.&lt;/p&gt;

&lt;p&gt;Now comes the question that drives this entire field forward: &lt;strong&gt;what happens when you combine both?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Combine Vision and Language?
&lt;/h2&gt;

&lt;p&gt;Humans don't process the world in isolated channels. When you look at a photo of a dog catching a frisbee in a park, you don't separately "see" the image and then "think" in language. Your understanding is multimodal from the start — you perceive the scene, recognize objects, understand spatial relationships, and can describe it all in natural language without effort.&lt;/p&gt;

&lt;p&gt;Traditional AI couldn't do this. Computer vision models could classify images or detect objects, but they couldn't &lt;em&gt;explain&lt;/em&gt; what they saw. Language models could write eloquently, but they were blind. These were separate systems with separate training pipelines, separate datasets, and no shared understanding.&lt;/p&gt;

&lt;p&gt;Vision Language Models (VLMs) change this. They bridge the gap between pixels and words, creating systems that can look at an image and answer questions about it, generate descriptions, follow visual instructions, or reason about what they see.&lt;/p&gt;

&lt;p&gt;The applications are enormous: a doctor uploads a medical scan and asks "What do you see?"; a warehouse robot reads labels and navigates shelves; a visually impaired user points their phone at a restaurant menu and gets it read aloud. All of these require a model that &lt;em&gt;sees&lt;/em&gt; and &lt;em&gt;speaks&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Evolution: From Captioning to True Multimodal Understanding
&lt;/h2&gt;

&lt;p&gt;The journey from "AI that describes pictures" to "AI that understands images and reasons about them" happened in distinct phases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Image Captioning (2015-2019)
&lt;/h3&gt;

&lt;p&gt;Early systems used a CNN encoder (like ResNet) to extract image features, then fed those features into an RNN or LSTM decoder to generate a caption. The architecture was straightforward: &lt;code&gt;image -&amp;gt; CNN -&amp;gt; feature vector -&amp;gt; RNN -&amp;gt; "A dog catches a frisbee"&lt;/code&gt;. These systems worked but were brittle — they could generate grammatically correct captions, but didn't truly understand the scene. Ask a follow-up question and they'd fall apart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Contrastive Pre-training (2021)
&lt;/h3&gt;

&lt;p&gt;CLIP changed the game by learning to &lt;em&gt;align&lt;/em&gt; images and text in a shared embedding space, without generating anything. This allowed zero-shot classification, image search, and open-vocabulary recognition. It was the first time a single model could handle visual concepts it had never been explicitly trained on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Generative Multimodal Models (2022-2023)
&lt;/h3&gt;

&lt;p&gt;Models like Flamingo, BLIP-2, and LLaVA took things further — they could not just align images and text, but &lt;em&gt;generate&lt;/em&gt; free-form text responses about images. You could have a conversation about a photo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4: Natively Multimodal Systems (2023-present)
&lt;/h3&gt;

&lt;p&gt;GPT-4V, Gemini, and Claude represent the current frontier: models trained from the ground up to handle text, images, video, and audio as first-class inputs. These aren't vision modules bolted onto a language model — they are unified systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Four Architectural Approaches to VLMs
&lt;/h2&gt;

&lt;p&gt;Not all VLMs are built the same way. There are four fundamental design patterns, each with distinct trade-offs.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Contrastive Learning (The CLIP Approach)
&lt;/h3&gt;

&lt;p&gt;CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) uses a &lt;strong&gt;dual-encoder&lt;/strong&gt; architecture. One encoder processes images (a ViT or ResNet), and a separate encoder processes text (a transformer). Both encoders map their inputs into the same embedding space.&lt;/p&gt;

&lt;p&gt;During training, CLIP sees 400 million image-text pairs scraped from the internet. For each batch, it computes the cosine similarity between every image embedding and every text embedding. The training objective is simple: maximize the similarity between matching image-text pairs and minimize it for non-matching ones. This is contrastive learning — the model learns by contrasting positives against negatives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Image Encoder (ViT)  ──→  Image Embedding  ──┐
                                               ├──→  Cosine Similarity
Text Encoder (Transformer) → Text Embedding ──┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why was CLIP revolutionary? Three reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-shot transfer.&lt;/strong&gt; To classify an image, you don't fine-tune. You just compute similarity between the image embedding and text embeddings like "a photo of a dog" or "a photo of a cat". The highest similarity wins. This means CLIP can classify images into &lt;em&gt;any&lt;/em&gt; categories you define at inference time — no retraining needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open vocabulary.&lt;/strong&gt; Traditional classifiers are limited to their fixed label set. CLIP understands free-form language, so you can classify images using any text description you can think of.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web-scale training.&lt;/strong&gt; By using image-text pairs from the internet instead of hand-labeled datasets, CLIP trained on far more diverse data than any supervised model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The limitation: CLIP is an &lt;em&gt;alignment&lt;/em&gt; model, not a &lt;em&gt;generative&lt;/em&gt; model. It can tell you which text best matches an image, but it can't generate a detailed description or answer complex questions.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cross-Attention Fusion (The Flamingo Approach)
&lt;/h3&gt;

&lt;p&gt;Flamingo (DeepMind, 2022) takes a different strategy. Instead of aligning two separate encoders, it injects visual information directly into a frozen large language model using &lt;strong&gt;cross-attention layers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The architecture works like this: a vision encoder (a frozen NFNet or ViT) extracts visual features from the input image. These features are then compressed by a &lt;strong&gt;Perceiver Resampler&lt;/strong&gt; — a small transformer module that reduces the variable number of visual tokens into a fixed set (typically 64). These compressed visual tokens are fed into newly added cross-attention layers that are interleaved between the existing self-attention layers of the frozen LLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Image ──→ Vision Encoder ──→ Perceiver Resampler ──→ Visual Tokens
                                                         │
Text ──→ [Frozen LLM with interleaved cross-attention] ←─┘
                         │
                    Generated Text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: the LLM itself stays frozen. Only the Perceiver Resampler and the cross-attention layers are trained. This preserves the language model's capabilities while teaching it to attend to visual information.&lt;/p&gt;

&lt;p&gt;Flamingo excelled at &lt;strong&gt;few-shot learning&lt;/strong&gt;. You could show it a few image-text examples as context, and it would generalize to new tasks — much like how GPT-3 demonstrated few-shot language capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Visual Tokens into LLM (The LLaVA Approach)
&lt;/h3&gt;

&lt;p&gt;LLaVA (Large Language and Vision Assistant, 2023) takes the simplest possible approach: use a &lt;strong&gt;linear projection&lt;/strong&gt; (or MLP) to map visual features into the token embedding space of an LLM, then just prepend them to the text tokens.&lt;/p&gt;

&lt;p&gt;The architecture is refreshingly minimal:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pass the image through a pre-trained CLIP ViT to get visual features.&lt;/li&gt;
&lt;li&gt;Project those features through a trained MLP to match the LLM's embedding dimension.&lt;/li&gt;
&lt;li&gt;Concatenate the projected visual tokens with the text tokens.&lt;/li&gt;
&lt;li&gt;Feed everything into the LLM as a single sequence.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Image ──→ CLIP ViT ──→ MLP Projection ──→ Visual Tokens
                                              │
                              [v1, v2, ..., vN, text tokens] ──→ LLM ──→ Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the LLM's perspective, visual tokens look just like any other tokens. The model processes them using its standard self-attention mechanism. No architectural changes to the LLM are needed.&lt;/p&gt;

&lt;p&gt;LLaVA's training happens in two stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-training the projection.&lt;/strong&gt; The vision encoder and LLM are frozen; only the MLP is trained on image-caption pairs to align the visual feature space with the language embedding space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual instruction tuning.&lt;/strong&gt; The MLP and LLM are fine-tuned together on instruction-following data — conversations about images, visual question answering, and complex reasoning tasks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;LLaVA proved that you don't need exotic architectural innovations. A well-trained projection layer and good instruction-tuning data can produce remarkably capable multimodal models. Its open-source nature made it enormously influential — LLaVA-NeXT, LLaVA-OneVision, and many derivative models followed.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Natively Multimodal (The Gemini Approach)
&lt;/h3&gt;

&lt;p&gt;Gemini (Google, 2023) takes the most ambitious approach: train a single transformer from scratch on interleaved text, images, audio, and video. There is no separate vision encoder bolted on — the model natively processes all modalities through a unified architecture.&lt;/p&gt;

&lt;p&gt;Images are tokenized using SentencePiece-style visual tokenization or learned patch embeddings, and these visual tokens are interleaved with text tokens during both training and inference. The model processes everything through the same transformer layers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[text tokens, image tokens, text tokens, audio tokens, ...] ──→ Unified Transformer ──→ Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The advantage is deep fusion: visual and textual understanding develop together during training, rather than being stitched together after the fact. The model can reason across modalities in a way that adapter-based approaches struggle with.&lt;/p&gt;

&lt;p&gt;The disadvantage is cost. Training a natively multimodal model from scratch requires enormous compute budgets and carefully curated multimodal training data. This is why only a handful of labs (Google, OpenAI, Anthropic) have built models in this category.&lt;/p&gt;




&lt;h2&gt;
  
  
  Major VLM Models: A Landscape Overview
&lt;/h2&gt;

&lt;p&gt;The VLM space has exploded. Here's a tour of the most important models and what makes each one notable.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLIP (OpenAI, 2021)
&lt;/h3&gt;

&lt;p&gt;The model that started the modern VLM era. Trained on 400M image-text pairs using contrastive learning. CLIP is not generative — it aligns images and text in a shared space — but it became the backbone vision encoder for dozens of subsequent models (LLaVA, BLIP-2, and more all use CLIP's ViT).&lt;/p&gt;

&lt;h3&gt;
  
  
  BLIP and BLIP-2 (Salesforce, 2022-2023)
&lt;/h3&gt;

&lt;p&gt;BLIP introduced "bootstrapping" — using a model to generate and then filter its own training captions, creating higher-quality data. BLIP-2 took this further with the &lt;strong&gt;Q-Former&lt;/strong&gt; (Querying Transformer), a lightweight module that bridges a frozen image encoder and a frozen LLM. The Q-Former uses a set of learnable query tokens that interact with visual features through cross-attention, then pass the result to the LLM. This made it possible to combine powerful pre-trained components with minimal training.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flamingo (DeepMind, 2022)
&lt;/h3&gt;

&lt;p&gt;The few-shot champion. By interleaving cross-attention layers into a frozen Chinchilla LLM, Flamingo showed that you could give a language model vision capabilities without retraining it. Its few-shot performance on visual QA benchmarks was remarkable — you could show it 4-8 example image-text pairs and it would generalize effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLaVA / LLaVA-NeXT (University of Wisconsin, 2023-2024)
&lt;/h3&gt;

&lt;p&gt;The open-source workhorse. LLaVA proved that a simple projection MLP between a CLIP ViT and a Vicuna/LLaMA LLM, combined with high-quality visual instruction-tuning data, could match or exceed far more complex architectures. LLaVA-NeXT improved resolution handling with dynamic image partitioning — splitting high-resolution images into tiles, encoding each tile, and concatenating the features.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPT-4V / GPT-4o (OpenAI, 2023-2024)
&lt;/h3&gt;

&lt;p&gt;GPT-4V brought multimodal capabilities to the most capable commercial LLM. Architectural details are not published, but it handles complex visual reasoning, OCR, chart understanding, and multi-image comparisons. GPT-4o ("omni") extended this to audio and real-time interaction, processing text, images, and audio natively rather than through separate pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini (Google, 2023-2024)
&lt;/h3&gt;

&lt;p&gt;Google's natively multimodal family. Available in Ultra, Pro, and Nano sizes. Gemini processes text, images, audio, and video through a unified transformer trained from scratch on multimodal data. Gemini 1.5 Pro introduced a 1M+ token context window, enabling processing of hour-long videos or hundreds of pages of documents alongside text queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude (Anthropic, 2024-present)
&lt;/h3&gt;

&lt;p&gt;Anthropic's Claude models support image understanding with strong performance on document analysis, chart reading, and visual reasoning. Like GPT-4V and Gemini, the exact architecture is proprietary, but Claude demonstrates particularly strong performance on tasks requiring careful analysis and reduced hallucination.&lt;/p&gt;

&lt;h3&gt;
  
  
  PaliGemma (Google, 2024)
&lt;/h3&gt;

&lt;p&gt;An open-weight, lightweight VLM combining a SigLIP vision encoder with a Gemma language model. Designed for fine-tuning on specific tasks — OCR, visual QA, object detection, image segmentation. At 3B parameters, PaliGemma showed that you don't need massive models for practical VLM applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwen-VL (Alibaba, 2023-2024)
&lt;/h3&gt;

&lt;p&gt;Alibaba's open-source multimodal model. Supports image, video, and text inputs. Qwen2-VL introduced &lt;strong&gt;Naive Dynamic Resolution&lt;/strong&gt; — handling images at their native resolution by dynamically adjusting the number of visual tokens rather than resizing all images to a fixed size. This was a meaningful advance for tasks requiring fine-grained detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  InternVL (Shanghai AI Lab, 2023-2024)
&lt;/h3&gt;

&lt;p&gt;A strong open-source contender, combining an InternViT vision encoder with an InternLM language model. InternVL 2.0 scaled to 108B parameters and achieved competitive performance with commercial models on benchmarks. Notable for its progressive training strategy — scaling both the vision encoder and LLM together.&lt;/p&gt;




&lt;h2&gt;
  
  
  VLM Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Architecture Type&lt;/th&gt;
      &lt;th&gt;Open/Closed&lt;/th&gt;
      &lt;th&gt;Key Strength&lt;/th&gt;
      &lt;th&gt;Parameters&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;CLIP&lt;/td&gt;
      &lt;td&gt;Dual encoder (contrastive)&lt;/td&gt;
      &lt;td&gt;Open&lt;/td&gt;
      &lt;td&gt;Zero-shot classification, backbone encoder&lt;/td&gt;
      &lt;td&gt;~400M&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;BLIP-2&lt;/td&gt;
      &lt;td&gt;Q-Former bridge&lt;/td&gt;
      &lt;td&gt;Open&lt;/td&gt;
      &lt;td&gt;Efficient frozen model connection&lt;/td&gt;
      &lt;td&gt;~3-12B&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Flamingo&lt;/td&gt;
      &lt;td&gt;Cross-attention fusion&lt;/td&gt;
      &lt;td&gt;Closed&lt;/td&gt;
      &lt;td&gt;Few-shot multimodal learning&lt;/td&gt;
      &lt;td&gt;80B&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;LLaVA / LLaVA-NeXT&lt;/td&gt;
      &lt;td&gt;Projection into LLM&lt;/td&gt;
      &lt;td&gt;Open&lt;/td&gt;
      &lt;td&gt;Simple, effective, easy to reproduce&lt;/td&gt;
      &lt;td&gt;7-34B&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-4V / GPT-4o&lt;/td&gt;
      &lt;td&gt;Natively multimodal&lt;/td&gt;
      &lt;td&gt;Closed&lt;/td&gt;
      &lt;td&gt;Strongest general reasoning&lt;/td&gt;
      &lt;td&gt;Undisclosed&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini&lt;/td&gt;
      &lt;td&gt;Natively multimodal&lt;/td&gt;
      &lt;td&gt;Closed (API)&lt;/td&gt;
      &lt;td&gt;Long context, video understanding&lt;/td&gt;
      &lt;td&gt;Undisclosed&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude&lt;/td&gt;
      &lt;td&gt;Natively multimodal&lt;/td&gt;
      &lt;td&gt;Closed (API)&lt;/td&gt;
      &lt;td&gt;Document analysis, reduced hallucination&lt;/td&gt;
      &lt;td&gt;Undisclosed&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;PaliGemma&lt;/td&gt;
      &lt;td&gt;SigLIP + Gemma projection&lt;/td&gt;
      &lt;td&gt;Open&lt;/td&gt;
      &lt;td&gt;Lightweight, fine-tunable&lt;/td&gt;
      &lt;td&gt;3B&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Qwen2-VL&lt;/td&gt;
      &lt;td&gt;Dynamic resolution + LLM&lt;/td&gt;
      &lt;td&gt;Open&lt;/td&gt;
      &lt;td&gt;Native resolution, multilingual&lt;/td&gt;
      &lt;td&gt;2-72B&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;InternVL 2.0&lt;/td&gt;
      &lt;td&gt;Progressive scaling ViT + LLM&lt;/td&gt;
      &lt;td&gt;Open&lt;/td&gt;
      &lt;td&gt;Competitive with closed models&lt;/td&gt;
      &lt;td&gt;1-108B&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Real-World Applications
&lt;/h2&gt;

&lt;p&gt;VLMs have moved well beyond research benchmarks into production systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual Question Answering and Conversational AI
&lt;/h3&gt;

&lt;p&gt;The most visible application: upload an image and ask questions about it. This powers customer support (photograph a broken product and describe the issue), education (point at a math problem and get step-by-step solutions), and accessibility (describe scenes for visually impaired users).&lt;/p&gt;

&lt;h3&gt;
  
  
  Document Understanding and OCR
&lt;/h3&gt;

&lt;p&gt;VLMs excel at understanding the &lt;em&gt;structure&lt;/em&gt; of documents, not just the text. They can read invoices, parse tables, understand forms, and extract information from complex layouts that traditional OCR systems struggle with. Financial services, legal, and healthcare all benefit here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autonomous Driving and Robotics
&lt;/h3&gt;

&lt;p&gt;Self-driving systems need to understand scenes in context: "Is that person about to cross the street?" requires combining visual perception with semantic reasoning. VLMs can provide this contextual understanding as part of the driving stack. In robotics, VLMs enable robots to follow natural language instructions in the physical world — "pick up the red cup next to the keyboard."&lt;/p&gt;

&lt;h3&gt;
  
  
  Medical Imaging
&lt;/h3&gt;

&lt;p&gt;Radiologists can use VLMs as a second-opinion tool — upload a chest X-ray and ask about potential findings. Models like Med-PaLM M (Google) were specifically fine-tuned for medical multimodal tasks. The combination of visual understanding and natural language output makes findings more accessible to non-specialists.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creative and Design Work
&lt;/h3&gt;

&lt;p&gt;VLMs can critique designs ("What's wrong with this UI layout?"), provide alt-text for images at scale, analyze competitors' visual branding, and help with content moderation by understanding both images and their context.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges and Open Problems
&lt;/h2&gt;

&lt;p&gt;Despite rapid progress, VLMs still have significant limitations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hallucination
&lt;/h3&gt;

&lt;p&gt;This is the biggest problem. VLMs confidently describe objects that don't exist in the image, misread text, or invent details. A model might claim there are three people in an image that shows two, or describe a red car as blue. The language model's tendency to generate plausible-sounding text sometimes overrides what it actually "sees." Reducing multimodal hallucination is an active research area, with approaches like RLHF on visual tasks and grounding mechanisms showing promise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spatial Reasoning
&lt;/h3&gt;

&lt;p&gt;VLMs struggle with precise spatial relationships. "Is the cup to the left or right of the book?" or "How many windows are on the second floor?" often produce wrong answers. The image-to-tokens pipeline loses fine-grained spatial information, and current training data doesn't emphasize spatial reasoning enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fine-Grained Understanding
&lt;/h3&gt;

&lt;p&gt;Counting objects accurately, reading small text in images, distinguishing between visually similar items (different bird species, similar product models) — these remain difficult. Higher image resolutions help, but processing 4K images means thousands of visual tokens, which strains context windows and compute budgets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Temporal Reasoning in Video
&lt;/h3&gt;

&lt;p&gt;While models like Gemini can process video, true temporal reasoning ("What happened right before the person fell?") remains limited. Most video VLMs sample frames rather than processing continuous video, losing temporal dynamics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safety and Bias
&lt;/h3&gt;

&lt;p&gt;VLMs inherit biases from both their visual and textual training data. They may generate stereotypical descriptions, fail to recognize people from underrepresented groups, or be manipulated through adversarial images. Multimodal safety is harder than text-only safety because the attack surface is larger — adversarial patterns can be embedded in images in ways that are invisible to humans but manipulate model outputs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Multimodal AI Is Headed
&lt;/h2&gt;

&lt;p&gt;The trajectory is clear: modality barriers are dissolving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified models are winning.&lt;/strong&gt; The trend is away from adapter-based approaches (bolt a vision encoder onto an LLM) and toward natively multimodal training. Future models will likely process text, images, video, audio, 3D data, and sensor readings through a single architecture, trained together from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning over visual information is improving fast.&lt;/strong&gt; Chain-of-thought prompting is being extended to visual reasoning — models that can "think step by step" about what they see, breaking complex visual scenes into sequential reasoning steps. This addresses spatial reasoning and counting weaknesses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smaller, specialized models are becoming practical.&lt;/strong&gt; Not every application needs GPT-4V. Models like PaliGemma and Qwen2-VL show that focused, open-weight models in the 2-7B parameter range can handle specific visual tasks effectively. Expect more task-specific VLMs that can run on edge devices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agentic future is multimodal.&lt;/strong&gt; AI agents that can browse the web, interact with GUIs, and operate in physical environments need vision-language understanding as a core capability. VLMs are the perceptual backbone of autonomous AI systems — from computer-use agents that navigate screens to robots that manipulate objects based on verbal instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time multimodal interaction is arriving.&lt;/strong&gt; GPT-4o's ability to process audio, video, and text simultaneously in real-time conversation points toward a future where AI assistants see, hear, and respond as naturally as a human conversation partner.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The three-part journey from transformers to VLMs tells a coherent story about how a single architectural idea — self-attention over sequences — scaled from text to images to true multimodal understanding.&lt;/p&gt;

&lt;p&gt;Transformers gave us attention. Vision Transformers showed that images are just sequences of patches. And Vision Language Models proved that you can unify perception and language in a single model.&lt;/p&gt;

&lt;p&gt;We are still early. Current VLMs hallucinate, struggle with spatial reasoning, and require enormous compute. But the rate of improvement is remarkable — the gap between the best VLMs in early 2024 and late 2025 is larger than the gap between 2018 and 2023. The models are getting smaller, faster, more capable, and more accessible.&lt;/p&gt;

&lt;p&gt;The endgame isn't just a model that can look at a picture and answer a question. It's an AI that perceives the world as richly as humans do — that can watch a video, read a document, listen to a conversation, and reason across all of it simultaneously. VLMs are the foundation that makes this possible.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>llm</category>
    </item>
    <item>
      <title>Vision Transformers — How Transformers Learned to See (Part 2 of 3)</title>
      <dc:creator>Vahid Aghajani</dc:creator>
      <pubDate>Sat, 04 Jul 2026 18:11:58 +0000</pubDate>
      <link>https://dev.to/vahid_aghajani_60ce9dbec9/vision-transformers-how-transformers-learned-to-see-part-2-of-3-gn4</link>
      <guid>https://dev.to/vahid_aghajani_60ce9dbec9/vision-transformers-how-transformers-learned-to-see-part-2-of-3-gn4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on &lt;a href="https://software-engineer-blog.com/content/vision-transformers-how-transformers-learned-to-see-part-2-of-3?id=50" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;. Cross-posted here with a canonical link.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/1YitypkPz7U"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Recap: The Transformer Revolution (Part 1)
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://software-engineer-blog.com/content/transformers-the-architecture-that-changed-ai-part-1-of-3?id=49" rel="noopener noreferrer"&gt;Part 1 of this series&lt;/a&gt;, we explored how the Transformer architecture — introduced in Google's 2017 paper &lt;em&gt;"Attention Is All You Need"&lt;/em&gt; — upended natural language processing. The key ideas were &lt;strong&gt;self-attention&lt;/strong&gt; (letting every token attend to every other token), &lt;strong&gt;positional encodings&lt;/strong&gt; (injecting sequence order without recurrence), and &lt;strong&gt;multi-head attention&lt;/strong&gt; (learning multiple relationship patterns in parallel). Transformers replaced RNNs and LSTMs as the backbone of language models, eventually powering GPT, BERT, and everything that followed.&lt;/p&gt;

&lt;p&gt;But Transformers were designed for sequences of tokens — words, subwords, characters. Images are not sequences. They are 2D grids of pixels with spatial structure, local patterns, and hierarchical features. For decades, a completely different family of architectures dominated vision: convolutional neural networks.&lt;/p&gt;

&lt;p&gt;So how did Transformers learn to see? That is the story of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CNN Era: What Worked and What Did Not
&lt;/h2&gt;

&lt;p&gt;Convolutional Neural Networks (CNNs) have been the workhorse of computer vision since AlexNet won ImageNet in 2012. The architecture is elegant: small learnable filters slide across the image, detecting local patterns like edges, textures, and shapes. Stacking convolutional layers builds a hierarchy — early layers detect edges, middle layers detect parts (eyes, wheels), and deep layers detect entire objects.&lt;/p&gt;

&lt;p&gt;CNNs come with strong &lt;strong&gt;inductive biases&lt;/strong&gt; baked into their design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Locality&lt;/strong&gt;: Each filter looks at a small patch of the image. A 3x3 convolution only sees 9 pixels at a time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translation equivariance&lt;/strong&gt;: The same filter is applied everywhere, so a cat detected in the top-left corner uses the same weights as a cat in the bottom-right.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical feature extraction&lt;/strong&gt;: Pooling layers progressively reduce spatial resolution, forcing the network to build abstract representations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These biases are a gift when data is limited. They tell the network &lt;em&gt;how&lt;/em&gt; to look at images before it sees a single training example. Models like ResNet, EfficientNet, and ConvNeXt achieved remarkable accuracy and efficiency.&lt;/p&gt;

&lt;p&gt;But CNNs have limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited receptive field&lt;/strong&gt;: Even deep CNNs struggle to capture long-range dependencies. A pixel in the top-left corner has no direct connection to a pixel in the bottom-right until very late in the network. This matters for understanding scene-level context — knowing that a person is holding a surfboard requires relating distant image regions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fixed geometric structure&lt;/strong&gt;: Convolutions are rigid. They process fixed-size local neighborhoods regardless of content. They cannot dynamically decide which parts of the image are most relevant to each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling bottlenecks&lt;/strong&gt;: While CNNs scale reasonably, the relationship between model size, data, and performance plateaus compared to what Transformers achieve in NLP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Researchers asked: could Transformers, with their ability to let any part of the input attend to any other part, do better?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Insight: Images as Sequences of Patches
&lt;/h2&gt;

&lt;p&gt;The breakthrough idea behind Vision Transformers is disarmingly simple: &lt;strong&gt;treat an image as a sequence of patches, just like a sentence is a sequence of words.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take a 224x224 pixel image. Divide it into a grid of non-overlapping 16x16 patches. You get 14 x 14 = 196 patches. Each patch is a small image region containing 16 x 16 x 3 = 768 pixel values (for RGB images). Flatten each patch into a vector, project it through a linear layer, and you have a sequence of 196 "tokens" — each one representing a patch of the image.&lt;/p&gt;

&lt;p&gt;Now you can feed this sequence into a standard Transformer encoder. Self-attention lets every patch attend to every other patch, regardless of spatial distance. The patch in the top-left corner can directly interact with the patch in the bottom-right corner in a single layer. No need to stack dozens of layers to build a large receptive field.&lt;/p&gt;

&lt;p&gt;This is the core idea of the &lt;strong&gt;Vision Transformer (ViT)&lt;/strong&gt;, published by Google Research in late 2020.&lt;/p&gt;

&lt;h2&gt;
  
  
  ViT Architecture: A Step-by-Step Walkthrough
&lt;/h2&gt;

&lt;p&gt;Let us trace how ViT processes a single image from input to classification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Patch Embedding
&lt;/h3&gt;

&lt;p&gt;The input image (e.g., 224x224x3) is divided into a grid of P x P patches (typically P=16). Each patch is flattened into a vector of length P^2 x C (where C is the number of channels), giving a 768-dimensional vector for 16x16 RGB patches. A learned linear projection maps each flattened patch to the model's hidden dimension D (e.g., 768). The result: a sequence of N patch embeddings, where N = (224/16)^2 = 196.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Image (224x224x3)
  → Split into 196 patches of 16x16x3
  → Flatten each patch to 768-dim vector
  → Linear projection to D-dim embedding
  → Sequence of 196 token embeddings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: The [CLS] Token
&lt;/h3&gt;

&lt;p&gt;ViT prepends a special learnable &lt;strong&gt;[CLS] token&lt;/strong&gt; to the patch sequence, borrowed directly from BERT. This token does not correspond to any image patch. Instead, it serves as an aggregation point: through self-attention across all layers, it collects information from every patch. After the final Transformer layer, the [CLS] token's representation is used for classification.&lt;/p&gt;

&lt;p&gt;The sequence is now 197 tokens long: 1 [CLS] token + 196 patch tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Position Embeddings
&lt;/h3&gt;

&lt;p&gt;Unlike convolutions, the Transformer has no built-in notion of spatial position. If you shuffle the patch order, the self-attention output is identical (it is permutation-equivariant). To encode spatial information, ViT adds &lt;strong&gt;learnable 1D position embeddings&lt;/strong&gt; to each token. Position 0 is the [CLS] token, positions 1-196 correspond to patches in raster order (left-to-right, top-to-bottom).&lt;/p&gt;

&lt;p&gt;Interestingly, the original ViT paper found that 1D positional embeddings work just as well as explicit 2D positional encodings. The model learns the 2D structure from data — nearby position embeddings end up with similar values, effectively reconstructing a 2D grid.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Transformer Encoder
&lt;/h3&gt;

&lt;p&gt;The sequence of 197 position-encoded embeddings is fed into a standard Transformer encoder — the exact same architecture from the NLP world. Each layer consists of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Layer Normalization&lt;/strong&gt; (applied before attention, following the Pre-Norm convention)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Head Self-Attention (MHSA)&lt;/strong&gt;: Every token attends to every other token. For 197 tokens, this means a 197x197 attention matrix per head. Each head can learn different relationships — some might focus on nearby patches, others on semantically related distant patches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Residual Connection&lt;/strong&gt;: Add the attention output back to the input&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Layer Normalization&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLP (Feed-Forward Network)&lt;/strong&gt;: Two linear layers with GELU activation, expanding the dimension by 4x then projecting back&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Residual Connection&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;ViT-Base uses 12 such layers, ViT-Large uses 24, and ViT-Huge uses 32.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Classification Head
&lt;/h3&gt;

&lt;p&gt;After the final encoder layer, the [CLS] token's output representation is extracted and passed through a simple MLP head (one hidden layer during pre-training, a single linear layer during fine-tuning) to produce class logits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input Image
  → Patch Embedding (196 patches)
  → Prepend [CLS] token (197 tokens)
  → Add Position Embeddings
  → Transformer Encoder (L layers of MHSA + MLP)
  → Extract [CLS] token output
  → MLP Classification Head
  → Class Prediction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The elegance is striking: no pooling layers, no convolutions, no hand-crafted feature extractors. Just patches, linear projections, and attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Hunger Problem
&lt;/h2&gt;

&lt;p&gt;Here is the catch. When ViT was trained on ImageNet alone (1.3 million images), it performed &lt;em&gt;worse&lt;/em&gt; than comparable CNNs like ResNet. The Transformer's lack of inductive bias is both its strength and its weakness.&lt;/p&gt;

&lt;p&gt;CNNs "know" to look locally and share weights spatially. ViT knows nothing — it must learn everything from data, including the fact that nearby pixels are related and that patterns can appear anywhere in the image. Learning these priors from scratch requires enormous amounts of data.&lt;/p&gt;

&lt;p&gt;The original ViT paper showed that the picture flips dramatically with scale: when pre-trained on &lt;strong&gt;JFT-300M&lt;/strong&gt; (Google's internal dataset of 300 million images), ViT-Huge outperformed every CNN on ImageNet, CIFAR-100, and other benchmarks. The takeaway was clear: Transformers for vision work, but they are data-hungry.&lt;/p&gt;

&lt;p&gt;This raised a practical question: most researchers and companies do not have 300 million labeled images. Can Vision Transformers work without massive private datasets?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ecosystem: Vision Transformer Variants
&lt;/h2&gt;

&lt;p&gt;The original ViT paper sparked an explosion of follow-up work addressing its limitations. Here are the most significant models and what they contribute.&lt;/p&gt;

&lt;h3&gt;
  
  
  DeiT — Data-Efficient Image Transformers (Facebook, 2021)
&lt;/h3&gt;

&lt;p&gt;DeiT proved that ViT can be trained effectively on ImageNet alone (1.3M images) — no JFT needed. The key innovations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strong data augmentation&lt;/strong&gt; (RandAugment, Mixup, CutMix, random erasing) to compensate for the lack of inductive bias&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regularization techniques&lt;/strong&gt; (stochastic depth, repeated augmentation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge distillation&lt;/strong&gt; from a CNN teacher (RegNet): a special distillation token learns to mimic the CNN's predictions, effectively transferring the CNN's inductive bias to the Transformer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DeiT-Base matched ViT-Base performance while training only on ImageNet with 4 GPUs — a massive reduction in compute requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Swin Transformer — Shifted Windows (Microsoft, 2021)
&lt;/h3&gt;

&lt;p&gt;Swin Transformer addressed ViT's two biggest architectural issues: quadratic attention cost and single-scale representation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical feature maps&lt;/strong&gt;: Like a CNN, Swin produces feature maps at multiple resolutions (1/4, 1/8, 1/16, 1/32 of input size) by merging patches between stages. This makes it a drop-in backbone replacement for CNNs in detection and segmentation frameworks like FPN and UPerNet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window-based attention&lt;/strong&gt;: Instead of global self-attention over all patches (O(N^2) cost), Swin computes attention within local windows of fixed size (e.g., 7x7 patches). This reduces complexity to O(N).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shifted windows&lt;/strong&gt;: In alternating layers, the window partition is shifted by half the window size, allowing cross-window information flow without the cost of global attention.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Swin became the de facto backbone for dense prediction tasks (object detection, semantic segmentation, instance segmentation) and won the "best paper" at ICCV 2021.&lt;/p&gt;

&lt;h3&gt;
  
  
  BEiT — BERT Pre-Training for Images (Microsoft, 2021)
&lt;/h3&gt;

&lt;p&gt;BEiT brought BERT-style &lt;strong&gt;masked pre-training&lt;/strong&gt; to vision. During pre-training, random image patches are masked, and the model must predict the visual tokens of the masked patches (using a discrete visual codebook from a tokenizer called dVAE). This self-supervised objective lets ViT learn powerful representations without any labels.&lt;/p&gt;

&lt;p&gt;BEiT showed that self-supervised pre-training dramatically improves ViT's performance when fine-tuned on smaller labeled datasets, partially solving the data hunger problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  CvT and CoAtNet — Hybrid CNN + Transformer
&lt;/h3&gt;

&lt;p&gt;These models combine CNN and Transformer strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CvT (Convolutional Vision Transformer)&lt;/strong&gt;: Replaces the linear patch embedding with convolutional token embeddings and uses depthwise convolutions inside the attention projection. This injects locality bias into the Transformer while keeping the global attention mechanism.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CoAtNet (Google, 2021)&lt;/strong&gt;: Stacks depthwise convolution layers (for local patterns in early stages) with Transformer layers (for global attention in later stages). By systematically combining convolutions and attention, CoAtNet achieves state-of-the-art ImageNet accuracy (90.88% top-1) with strong efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hybrid approach is pragmatic: use convolutions where locality matters most (early layers processing raw pixels) and attention where global reasoning matters (later layers composing high-level features).&lt;/p&gt;

&lt;h3&gt;
  
  
  DINO and DINOv2 — Self-Supervised Learning (Meta)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DINO&lt;/strong&gt; (Self-DIstillation with NO Labels, 2021) showed that ViT trained with self-supervised distillation learns remarkably structured features. The attention maps of self-supervised ViTs spontaneously learn to segment objects — without ever seeing a segmentation label. The model learns by having a student network match the output of a momentum-updated teacher network on different augmented views of the same image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DINOv2&lt;/strong&gt; (2023) scaled this approach with curated data, larger models, and improved training recipes. DINOv2 features are so general that they work as frozen feature extractors for depth estimation, segmentation, classification, and retrieval — often matching or beating supervised models without any fine-tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  EVA and InternImage — Pushing Scale
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EVA&lt;/strong&gt; (Exploring the Limits of Masked Visual Representation Learning, 2022): Combined masked image modeling with CLIP-style vision-language alignment at billion-parameter scale. EVA-02 demonstrated that scaling ViT with the right pre-training recipe achieves new state-of-the-art results across many benchmarks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;InternImage&lt;/strong&gt; (2023): Took a different path — a large-scale model based on deformable convolutions (not Transformers) that matched or exceeded ViT performance, proving that the CNN vs. Transformer debate is not settled. InternImage uses dynamic sparse attention patterns through deformable convolutions, getting some of the benefits of attention without the architecture.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SAM — Segment Anything (Meta, 2023)
&lt;/h3&gt;

&lt;p&gt;SAM is a foundation model for image segmentation. Its image encoder is a ViT-Huge pre-trained with MAE (Masked Autoencoder). Given an image and a prompt (point, box, or text), SAM produces high-quality segmentation masks for any object — including objects it has never seen before.&lt;/p&gt;

&lt;p&gt;SAM demonstrated that ViT backbones, trained at scale with the right objectives, can power general-purpose visual understanding that transfers to virtually any segmentation task. SAM 2 extended this to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison: CNN vs. ViT vs. Hybrid
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;CNN (e.g., ResNet, ConvNeXt)&lt;/th&gt;
&lt;th&gt;ViT (Pure Transformer)&lt;/th&gt;
&lt;th&gt;Hybrid (e.g., CoAtNet, CvT)&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inductive bias&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strong (locality, translation equivariance)&lt;/td&gt;
&lt;td&gt;Minimal (learns from data)&lt;/td&gt;
&lt;td&gt;Moderate (conv early, attention late)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good with small datasets&lt;/td&gt;
&lt;td&gt;Poor without large-scale pre-training&lt;/td&gt;
&lt;td&gt;Good — best of both worlds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate — diminishing returns at scale&lt;/td&gt;
&lt;td&gt;Excellent — performance scales with data and compute&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Global context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited — requires deep stacking&lt;/td&gt;
&lt;td&gt;Full — every patch sees every patch&lt;/td&gt;
&lt;td&gt;Progressive — local to global&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compute cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Efficient (linear in image size)&lt;/td&gt;
&lt;td&gt;Expensive (quadratic in patch count)&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dense prediction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natural multi-scale features&lt;/td&gt;
&lt;td&gt;Single-scale (ViT) — needs adaptation&lt;/td&gt;
&lt;td&gt;Multi-scale (Swin, CvT)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transfer learning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;td&gt;Exceptional at scale&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Comparison of Major Vision Transformer Models
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Key Innovation&lt;/th&gt;
&lt;th&gt;Pre-training&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ViT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;Pure Transformer for vision&lt;/td&gt;
&lt;td&gt;Supervised (JFT-300M)&lt;/td&gt;
&lt;td&gt;Classification at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeiT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;Data-efficient training + distillation&lt;/td&gt;
&lt;td&gt;Supervised (ImageNet-1K)&lt;/td&gt;
&lt;td&gt;Classification without massive data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Swin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;Hierarchical + shifted windows&lt;/td&gt;
&lt;td&gt;Supervised (ImageNet)&lt;/td&gt;
&lt;td&gt;Detection, segmentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BEiT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;Masked image modeling&lt;/td&gt;
&lt;td&gt;Self-supervised&lt;/td&gt;
&lt;td&gt;Low-data fine-tuning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CoAtNet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;Conv + Attention hybrid staging&lt;/td&gt;
&lt;td&gt;Supervised (JFT)&lt;/td&gt;
&lt;td&gt;Top accuracy on ImageNet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DINO/DINOv2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2021/2023&lt;/td&gt;
&lt;td&gt;Self-supervised distillation&lt;/td&gt;
&lt;td&gt;Self-supervised&lt;/td&gt;
&lt;td&gt;General-purpose features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EVA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2022&lt;/td&gt;
&lt;td&gt;Scaled masked modeling + CLIP alignment&lt;/td&gt;
&lt;td&gt;Self-supervised + CLIP&lt;/td&gt;
&lt;td&gt;Vision-language tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;InternImage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Large-scale deformable convolutions&lt;/td&gt;
&lt;td&gt;Supervised&lt;/td&gt;
&lt;td&gt;Dense prediction at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Promptable segmentation foundation model&lt;/td&gt;
&lt;td&gt;MAE + SA-1B dataset&lt;/td&gt;
&lt;td&gt;Zero-shot segmentation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When to Use CNN vs. ViT in Practice
&lt;/h2&gt;

&lt;p&gt;Choosing between a CNN and a Vision Transformer is not about which is "better" — it depends on your constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use a CNN (ResNet, EfficientNet, ConvNeXt) when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a small to medium dataset (under 100K images) and no pre-trained ViT is available for your domain&lt;/li&gt;
&lt;li&gt;You need real-time inference on edge devices or mobile — CNNs are still more efficient at small model sizes&lt;/li&gt;
&lt;li&gt;Your task is straightforward classification or detection with well-established CNN pipelines&lt;/li&gt;
&lt;li&gt;You want a battle-tested, well-understood architecture with extensive tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use a ViT (or Swin/DeiT) when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can leverage a strong pre-trained checkpoint (ImageNet-21K, CLIP, DINOv2, etc.)&lt;/li&gt;
&lt;li&gt;Your task benefits from global context (scene understanding, medical imaging with long-range dependencies, satellite imagery)&lt;/li&gt;
&lt;li&gt;You are working at scale — more data and compute reliably improve ViT performance&lt;/li&gt;
&lt;li&gt;You need a backbone that connects to modern multimodal systems (CLIP, LLaVA, GPT-4V)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use a hybrid (CoAtNet, CvT, or ConvNeXt + attention) when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want the best accuracy-efficiency tradeoff&lt;/li&gt;
&lt;li&gt;You need multi-scale features for dense prediction (detection, segmentation) without Swin's complexity&lt;/li&gt;
&lt;li&gt;You are building a production system where you need both the efficiency of convolutions and the expressiveness of attention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A practical note: in 2025-2026, the default starting point for most vision tasks is a &lt;strong&gt;pre-trained ViT or Swin backbone&lt;/strong&gt;, fine-tuned on your data. The pre-training handles the data hunger problem. If you are training from scratch on a small dataset with no relevant pre-trained model, CNNs remain the safer choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Vision Transformers did more than improve accuracy numbers. They unified the architecture across modalities. The same Transformer that processes text can now process images — and critically, the same architecture can process both at the same time.&lt;/p&gt;

&lt;p&gt;This unification is what enables the next wave: models that see and read simultaneously. CLIP learns to align images and text in a shared embedding space. Flamingo, LLaVA, and GPT-4V combine a ViT image encoder with a language model decoder to answer questions about images, describe scenes, and reason visually.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Next: Part 3 — Vision Language Models
&lt;/h2&gt;

&lt;p&gt;In Part 3 of this series, we will explore &lt;strong&gt;Vision Language Models (VLMs)&lt;/strong&gt; — the architectures that combine Vision Transformers with Large Language Models. We will cover how models like CLIP, LLaVA, and GPT-4V bridge the gap between seeing and understanding, enabling AI systems that can describe images, answer visual questions, and reason about the visual world in natural language.&lt;/p&gt;

&lt;p&gt;The Transformer learned to read in 2017. It learned to see in 2020. Now it is learning to do both at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read Part 3: &lt;a href="https://software-engineer-blog.com/content/vision-language-models-when-ai-learns-to-see-and-talk-part-3-of-3?id=51" rel="noopener noreferrer"&gt;Vision-Language Models — When AI Learns to See and Talk&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Semantic vs Keyword vs Hybrid Search: What Every RAG Demo Skips</title>
      <dc:creator>Vahid Aghajani</dc:creator>
      <pubDate>Sat, 04 Jul 2026 18:02:34 +0000</pubDate>
      <link>https://dev.to/vahid_aghajani_60ce9dbec9/semantic-vs-keyword-vs-hybrid-search-what-every-rag-demo-skips-5fhk</link>
      <guid>https://dev.to/vahid_aghajani_60ce9dbec9/semantic-vs-keyword-vs-hybrid-search-what-every-rag-demo-skips-5fhk</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on &lt;a href="https://software-engineer-blog.com/content/semantic-vs-keyword-vs-hybrid-search-what-every-rag-demo-skips?id=59" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;. Cross-posted here with a canonical link.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/4JLxlInaFA0"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Prefer audio? &lt;a href="https://open.spotify.com/episode/3PxcvpaVIyfEqA42vAVdUM" rel="noopener noreferrer"&gt;Spotify episode&lt;/a&gt; · &lt;a href="https://t.me/SoftwareEngineerBlog" rel="noopener noreferrer"&gt;Telegram&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every RAG tutorial starts the same way: chunk your docs, embed them, throw them in a vector store, query with cosine similarity. Done.&lt;/p&gt;

&lt;p&gt;It's a great demo. It's also not how serious search systems work.&lt;/p&gt;

&lt;p&gt;The moment a user types &lt;code&gt;error code E_1042&lt;/code&gt; or &lt;code&gt;Llama-3.1-70B&lt;/code&gt; or a product SKU, pure semantic search starts quietly failing — because an embedding of &lt;code&gt;E_1042&lt;/code&gt; is a vector of noise. Meanwhile, keyword search has the opposite problem: type &lt;em&gt;"how do I cancel my subscription"&lt;/em&gt; and it misses the document titled &lt;em&gt;"ending your membership"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So real systems use both. This post is about what each one is actually doing, why hybrid beats either alone, and how to build it in Postgres in ~40 lines.&lt;/p&gt;




&lt;h2&gt;
  
  
  Keyword Search: BM25 in One Paragraph
&lt;/h2&gt;

&lt;p&gt;Keyword search finds documents containing your query terms. The question is how to &lt;em&gt;rank&lt;/em&gt; them.&lt;/p&gt;

&lt;p&gt;BM25 (the de-facto ranking function since the 90s) is basically three ideas stacked:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Term frequency&lt;/strong&gt; — the more a word appears in a doc, the more relevant, but with diminishing returns (the 20th occurrence of "database" doesn't help much).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inverse document frequency&lt;/strong&gt; — rare words count more. &lt;em&gt;"the"&lt;/em&gt; is useless, &lt;em&gt;"pgvector"&lt;/em&gt; is a strong signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Length normalization&lt;/strong&gt; — longer documents would otherwise win by accident, so scores are normalized by length.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No machine learning, no GPU, no training. Under the hood it's an inverted index (word → list of documents containing it), which makes it blindingly fast even at billions of documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it's good at:&lt;/strong&gt; exact matches, IDs, rare tokens, acronyms, product names, filenames, version numbers, anything literal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it fails at:&lt;/strong&gt; synonyms (&lt;em&gt;"car"&lt;/em&gt; ≠ &lt;em&gt;"automobile"&lt;/em&gt;), paraphrase, conceptual queries, cross-language.&lt;/p&gt;




&lt;h2&gt;
  
  
  Semantic Search: Embeddings in One Paragraph
&lt;/h2&gt;

&lt;p&gt;An embedding model turns text into a vector — say, 768 floats — such that texts with similar meaning end up close in that vector space. &lt;em&gt;"How do I cancel?"&lt;/em&gt; and &lt;em&gt;"ending your subscription"&lt;/em&gt; become near-neighbors.&lt;/p&gt;

&lt;p&gt;To search, you embed the query and find the nearest vectors. Naively that's O(N) — check distance to every doc — which doesn't scale past a few million. So we use &lt;strong&gt;approximate nearest neighbor (ANN)&lt;/strong&gt; indexes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HNSW&lt;/strong&gt; (Hierarchical Navigable Small World) — a graph where each node links to its near neighbors at multiple resolutions. Fast, accurate, memory-hungry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IVF&lt;/strong&gt; (Inverted File) — cluster all vectors first, only search the nearest clusters at query time. Lighter, slightly less accurate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it's good at:&lt;/strong&gt; paraphrase, synonyms, conceptual similarity, cross-language, fuzzy intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it fails at:&lt;/strong&gt; rare tokens (the embedding has never seen them), acronyms, identifiers, numbers, exact-match requirements. Also expensive — every query means an embedding model call plus a vector search.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Hybrid Beats Both
&lt;/h2&gt;

&lt;p&gt;Run the same query through both systems and you get two ranked lists. How do you merge them?&lt;/p&gt;

&lt;p&gt;The surprisingly simple answer that keeps winning benchmarks: &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;. For each document &lt;code&gt;d&lt;/code&gt; that appears in any ranker &lt;code&gt;r&lt;/code&gt;, sum &lt;code&gt;1 / (k + rank_r(d))&lt;/code&gt; across all the rankers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;           ┌── keyword ranker (BM25) ──┐
query ────┤                            ├──► RRF merge ──► final ranking
           └── semantic ranker (vec) ──┘

           score(d) =  Σ   1 / (k + rank_r(d))
                      r∈R
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;k&lt;/code&gt; is a &lt;strong&gt;smoothing constant&lt;/strong&gt; — typically &lt;code&gt;60&lt;/code&gt;. It dampens the gap between rank 1 and rank 2 so the top result doesn't utterly dominate the fused score. Raise &lt;code&gt;k&lt;/code&gt; and lower ranks contribute more; lower &lt;code&gt;k&lt;/code&gt; and the top ranks dominate. The default works for almost everyone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rrf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ranked_lists&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ranking&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranked_lists&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ranking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No calibration between score scales, no tuning, no training. A document that ranks well in either list bubbles up; a document that ranks in &lt;em&gt;both&lt;/em&gt; lists bubbles up strongly. That's why it works — the two systems have different failure modes, and RRF exploits that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build It In Postgres
&lt;/h2&gt;

&lt;p&gt;Postgres does both keyword (via &lt;code&gt;tsvector&lt;/code&gt; + GIN index) and semantic (via &lt;code&gt;pgvector&lt;/code&gt; + HNSW index) natively. One table, two indexes, one hybrid query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content_tsv&lt;/span&gt; &lt;span class="n"&gt;TSVECTOR&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;STORED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;VECTOR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;docs_tsv_idx&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content_tsv&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;docs_vec_idx&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;HNSW&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heads-up:&lt;/strong&gt; &lt;code&gt;HNSW&lt;/code&gt; requires &lt;strong&gt;pgvector ≥ 0.5.0&lt;/strong&gt;. On older versions, swap &lt;code&gt;HNSW&lt;/code&gt; for &lt;code&gt;IVFFLAT&lt;/code&gt; — slightly less accurate but identical from the query side.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now a hybrid query with RRF, all in one SQL statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ts_rank_cd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content_tsv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rnk&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plainto_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'how to cancel subscription'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content_tsv&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
  &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rnk&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;
  &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rnk&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rnk&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt;  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;$1&lt;/code&gt; is the query embedding — a &lt;code&gt;VECTOR(768)&lt;/code&gt; your &lt;strong&gt;application layer&lt;/strong&gt; (Python, Node, Go) has already computed by calling the embedding model. Postgres doesn't embed on its own; it just stores and searches the result. So your app does: (1) embed the query, (2) send the vector as a parameter to this SQL. That's your hybrid retriever — one database, one query, no extra infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What About Elasticsearch, Qdrant, Weaviate?
&lt;/h2&gt;

&lt;p&gt;Postgres is a fine choice for most teams. The dedicated tools matter when scale, flexibility, or specific features push you past it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Best at&lt;/th&gt;
&lt;th&gt;Weak at&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Postgres (pgvector + tsvector)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams already on Postgres, moderate scale (&amp;lt; ~10M vectors), transactional data next to embeddings&lt;/td&gt;
&lt;td&gt;Billion-scale vector search, complex BM25 tuning, multi-tenant reranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Elasticsearch / OpenSearch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mature BM25, aggregations, faceting, geo, log search; native hybrid via RRF since 8.x&lt;/td&gt;
&lt;td&gt;Embeddings are a second-class citizen; heavier ops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qdrant&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure vector workloads, clean Rust implementation, fast filters on payload, simple to run&lt;/td&gt;
&lt;td&gt;Keyword search is basic (no BM25) — you'll pair it with something else&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weaviate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in hybrid (BM25 + vectors) as a first-class feature, strong schema + modules for embedding pipelines&lt;/td&gt;
&lt;td&gt;Opinionated architecture; lock-in on their query language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vespa&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The only one that plays seriously in all three axes — scale, keyword quality, vector quality — at FAANG scale&lt;/td&gt;
&lt;td&gt;Steep learning curve; overkill for most teams&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; start with Postgres. Move to Qdrant or Weaviate when vector count crosses ~10M or you need low-latency ANN at high QPS. Use Elasticsearch/OpenSearch when keyword quality and faceting are the main product. Reach for Vespa when &lt;em&gt;all three&lt;/em&gt; dimensions matter and nothing else scales.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your query looks like…&lt;/th&gt;
&lt;th&gt;Reach for&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Product SKUs, error codes, version numbers, filenames&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Keyword&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Embeddings have never seen these exact tokens; BM25 treats them as high-IDF signals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conceptual, paraphrased, cross-language&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Semantic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Different words, same meaning — keyword can't see the connection, vectors can&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real user queries in a product — mixed and messy&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Hybrid&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Users type both kinds in the same session; hybrid has no downside when one side is weak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Most similar to this other document"&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Semantic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure vector problem — rank docs by distance in embedding space&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log search, structured fields, faceting&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Keyword / Elasticsearch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Exact matches + aggregations matter more than meaning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're not sure — and in production you're usually not — just use hybrid. RRF has no downside: if one side is useless for a particular query, it simply contributes near-zero and the other side wins the ranking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotchas Nobody Tells You
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chunking matters more than the embedding model.&lt;/strong&gt; A great model on badly chunked docs loses to a mediocre model on well-chunked docs. Start with ~300-token chunks with overlap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual content breaks keyword search.&lt;/strong&gt; &lt;code&gt;to_tsvector('english', ...)&lt;/code&gt; silently butchers non-English text. Either detect language and use the right dictionary, or lean more on semantic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop words cut both ways.&lt;/strong&gt; Removing &lt;em&gt;"the"&lt;/em&gt; helps keyword search. Removing &lt;em&gt;"not"&lt;/em&gt; changes the meaning completely for semantic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rerankers beat tuning.&lt;/strong&gt; After your hybrid retrieves 50 candidates, a cross-encoder reranker (e.g., &lt;code&gt;bge-reranker&lt;/code&gt;) re-scores them pairwise against the query. It's the single biggest quality lift you can add in an afternoon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency budget is a product decision.&lt;/strong&gt; Pure keyword: ~5ms. Pure semantic with HNSW: ~20ms. Hybrid with a reranker: ~200ms. That last one can't live in a typeahead; it can live in RAG.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The "semantic vs keyword" framing is a false choice. They're complementary — each covers where the other is blind. Hybrid retrieval with RRF is almost free to add, needs no training, and works across every vector store that also supports BM25.&lt;/p&gt;

&lt;p&gt;If you take one thing away: before you pick a new vector database, check whether Postgres already does what you need. For most teams, it does.&lt;/p&gt;

&lt;p&gt;Next time someone shows you a RAG demo with just embeddings, you'll know what's missing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Transformers — The Architecture That Changed AI (Part 1 of 3)</title>
      <dc:creator>Vahid Aghajani</dc:creator>
      <pubDate>Sat, 04 Jul 2026 17:53:35 +0000</pubDate>
      <link>https://dev.to/vahid_aghajani_60ce9dbec9/transformers-the-architecture-that-changed-ai-part-1-of-3-29ac</link>
      <guid>https://dev.to/vahid_aghajani_60ce9dbec9/transformers-the-architecture-that-changed-ai-part-1-of-3-29ac</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on &lt;a href="https://software-engineer-blog.com/content/transformers-the-architecture-that-changed-ai-part-1-of-3?id=49" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;. Cross-posted here with a canonical link.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/es5o83E67jE"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In June 2017, a team at Google published a paper with a deceptively simple title: &lt;strong&gt;"Attention Is All You Need."&lt;/strong&gt; Eight authors, fourteen pages, and one architecture that would go on to power GPT-4, Claude, Gemini, DALL-E, Stable Diffusion, AlphaFold, and virtually every breakthrough in AI since.&lt;/p&gt;

&lt;p&gt;The Transformer didn't just improve on existing models. It replaced the entire paradigm. Recurrent neural networks, LSTMs, sequence-to-sequence models with attention — all of them became legacy architectures almost overnight.&lt;/p&gt;

&lt;p&gt;This is Part 1 of a 3-part series. Here we cover the Transformer itself — the core architecture, the intuition behind each component, and why it scales so remarkably well. Part 2 will cover Vision Transformers (how this architecture learned to see), and Part 3 will cover Vision-Language Models (when AI learned to see &lt;em&gt;and&lt;/em&gt; talk).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Why RNNs Hit a Wall
&lt;/h2&gt;

&lt;p&gt;To understand why Transformers matter, you need to understand what came before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recurrent Neural Networks (RNNs)&lt;/strong&gt; process sequences one token at a time, left to right. Each step takes the previous hidden state and the current input, produces a new hidden state, and passes it forward. This is elegant in theory: the hidden state is a compressed summary of everything the model has seen so far.&lt;/p&gt;

&lt;p&gt;In practice, it has three devastating problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The bottleneck problem.&lt;/strong&gt; By the time an RNN reaches the 500th word in a paragraph, the information from the 1st word has been compressed through 499 sequential transformations. Important early context gets diluted or lost entirely. Imagine trying to remember the first sentence of a book after reading 500 pages, where each page partially overwrites your memory of the previous one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No parallelization.&lt;/strong&gt; Because each step depends on the previous step's output, you cannot process tokens in parallel. Training is inherently sequential. On modern GPUs with thousands of cores designed for parallel computation, this is a catastrophic bottleneck.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vanishing and exploding gradients.&lt;/strong&gt; During backpropagation through time, gradients must flow backwards through every sequential step. Over long sequences, they either shrink to near-zero (vanishing) or blow up to infinity (exploding), making it extremely hard to learn long-range dependencies.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;LSTMs and GRUs&lt;/strong&gt; partially addressed problem 3 by adding gating mechanisms — explicit "remember" and "forget" controls. They helped, but they didn't solve the fundamental sequential nature of the computation (problem 2) or the information bottleneck (problem 1).&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;sequence-to-sequence model with attention&lt;/strong&gt; (Bahdanau et al., 2014) made a crucial step forward. Instead of forcing the decoder to work from a single compressed context vector, it allowed the decoder to "look back" at all encoder hidden states and attend to the most relevant ones at each decoding step. This was the birth of attention as a mechanism.&lt;/p&gt;

&lt;p&gt;But even seq2seq with attention still relied on an RNN backbone. The encoder still processed tokens sequentially. The Transformer's radical insight was: &lt;strong&gt;what if we throw away the recurrence entirely and use only attention?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea: Attention as the Only Mechanism
&lt;/h2&gt;

&lt;p&gt;The Transformer computes relationships between all tokens in a sequence simultaneously. Instead of passing information through a chain of hidden states, every token can directly attend to every other token in a single operation.&lt;/p&gt;

&lt;p&gt;Think of it this way. An RNN is like a game of telephone — each person whispers the message to the next, and by the end of the line, the message is garbled. A Transformer is like a round table where everyone can hear everyone else directly. No information loss from sequential passing. No bottleneck.&lt;/p&gt;

&lt;p&gt;This has a profound consequence: &lt;strong&gt;the entire sequence can be processed in parallel.&lt;/strong&gt; During training, all tokens are known in advance, so every attention computation can happen simultaneously across the GPU. This is why Transformers train orders of magnitude faster than RNNs on the same hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Component by Component
&lt;/h2&gt;

&lt;p&gt;The original Transformer uses an &lt;strong&gt;encoder-decoder&lt;/strong&gt; structure, designed for sequence-to-sequence tasks like machine translation (English to German, for example). Let's walk through each component.&lt;/p&gt;

&lt;h3&gt;
  
  
  Encoder-Decoder Overview
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;encoder&lt;/strong&gt; takes the input sequence (e.g., an English sentence) and produces a rich representation of it — a set of vectors that capture meaning and context. The &lt;strong&gt;decoder&lt;/strong&gt; takes that representation and generates the output sequence (e.g., the German translation) one token at a time.&lt;/p&gt;

&lt;p&gt;The encoder is a stack of 6 identical layers. Each layer has two sub-components: a multi-head self-attention mechanism and a position-wise feed-forward network. The decoder is also 6 layers, but each layer has three sub-components: masked multi-head self-attention, multi-head cross-attention (attending to the encoder output), and a feed-forward network.&lt;/p&gt;

&lt;p&gt;Every sub-component is wrapped with a &lt;strong&gt;residual connection&lt;/strong&gt; and &lt;strong&gt;layer normalization&lt;/strong&gt;. We'll cover each piece.&lt;/p&gt;

&lt;h3&gt;
  
  
  Input Embeddings and Positional Encoding
&lt;/h3&gt;

&lt;p&gt;Before anything else, input tokens are converted to dense vectors via a learned embedding table. If the model dimension is &lt;code&gt;d_model = 512&lt;/code&gt;, each token becomes a 512-dimensional vector.&lt;/p&gt;

&lt;p&gt;But here's a problem the RNN never had: since the Transformer processes all tokens simultaneously, it has &lt;strong&gt;no inherent notion of order&lt;/strong&gt;. The sentence "the cat sat on the mat" and "mat the on sat cat the" would produce identical attention patterns without some way to encode position.&lt;/p&gt;

&lt;p&gt;The solution is &lt;strong&gt;positional encoding&lt;/strong&gt; — adding a position-dependent signal to each token embedding. The original paper uses sinusoidal functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PE(pos, 2i)     = sin(pos / 10000^(2i/d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each position gets a unique pattern of sine and cosine values across the embedding dimensions. The key properties: (1) each position has a unique encoding, (2) the encoding is deterministic (no learned parameters), and (3) the model can generalize to sequence lengths longer than those seen during training because the functions are continuous.&lt;/p&gt;

&lt;p&gt;The analogy: think of positional encoding as a unique "address" stamped onto each word. The model learns to read these addresses and factor position into its attention decisions.&lt;/p&gt;

&lt;p&gt;Modern Transformers often use &lt;strong&gt;learned positional embeddings&lt;/strong&gt; (just another embedding table indexed by position) or &lt;strong&gt;Rotary Position Embeddings (RoPE)&lt;/strong&gt;, which encode relative position directly into the attention computation. But the core insight remains the same: you must inject position information explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scaled Dot-Product Attention: The Q/K/V Framework
&lt;/h3&gt;

&lt;p&gt;This is the heart of the Transformer. Every attention mechanism in the architecture is built on this single operation.&lt;/p&gt;

&lt;p&gt;For each token, we compute three vectors from its embedding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query (Q):&lt;/strong&gt; "What am I looking for?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key (K):&lt;/strong&gt; "What do I contain?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value (V):&lt;/strong&gt; "What information do I provide if you attend to me?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are produced by multiplying the input by three learned weight matrices: &lt;code&gt;W_Q&lt;/code&gt;, &lt;code&gt;W_K&lt;/code&gt;, and &lt;code&gt;W_V&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The attention computation works in three steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Compute compatibility scores.&lt;/strong&gt; Multiply each query by all keys (dot product). This produces a score matrix: how relevant is each key to each query. High dot product = the query and key are aligned = this token is relevant to that token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Scale and normalize.&lt;/strong&gt; Divide scores by the square root of the key dimension (&lt;code&gt;sqrt(d_k)&lt;/code&gt;). This scaling prevents the dot products from growing too large in magnitude, which would push the softmax into regions with tiny gradients. Then apply softmax row-wise to get attention weights that sum to 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Weighted sum of values.&lt;/strong&gt; Multiply the attention weights by the value vectors. Each token's output is a weighted combination of all value vectors, with weights determined by how relevant each key was to that token's query.&lt;/p&gt;

&lt;p&gt;In matrix form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The analogy: imagine you're at a library (the sequence). Your query is the question you're researching. Each book has a title (key) and content (value). You scan all the titles, figure out which books are most relevant to your question, and then read those books more carefully — weighting your reading time based on relevance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Head Attention: Attending to Different Things Simultaneously
&lt;/h3&gt;

&lt;p&gt;A single attention head learns one kind of relationship. But language has many simultaneous relationships: syntactic, semantic, coreference, positional, topical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-head attention&lt;/strong&gt; runs multiple attention operations in parallel, each with its own learned Q/K/V projections. The original Transformer uses 8 heads with &lt;code&gt;d_k = d_v = 64&lt;/code&gt; each (total: &lt;code&gt;8 * 64 = 512 = d_model&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Each head can specialize. Research has shown that different heads learn to capture different linguistic phenomena:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One head might track subject-verb agreement&lt;/li&gt;
&lt;li&gt;Another might focus on adjacent word relationships&lt;/li&gt;
&lt;li&gt;Another might capture long-range coreference ("she" referring to "Dr. Smith" from three sentences ago)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The outputs of all heads are concatenated and linearly projected back to &lt;code&gt;d_model&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(Q * W_Qi, K * W_Ki, V * W_Vi)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is one of the Transformer's most powerful design choices — it gets multiple "perspectives" on the same data for the cost of one full-dimensional attention computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Types of Attention in the Transformer
&lt;/h3&gt;

&lt;p&gt;The encoder-decoder Transformer uses attention in three distinct ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Encoder self-attention.&lt;/strong&gt; Each token in the input attends to all other tokens in the input. This builds a contextual representation where each word's embedding is enriched by the full context of the sentence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Masked decoder self-attention.&lt;/strong&gt; Each token in the output sequence attends to all &lt;em&gt;previous&lt;/em&gt; output tokens (but not future ones). The masking prevents the model from "cheating" by looking ahead during training. Future positions are set to negative infinity before the softmax, zeroing out their attention weights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Encoder-decoder cross-attention.&lt;/strong&gt; Each decoder token attends to all encoder outputs. The queries come from the decoder; the keys and values come from the encoder. This is how the decoder "reads" the input sentence to produce the translation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Feed-Forward Network
&lt;/h3&gt;

&lt;p&gt;After each attention sub-layer, every token passes independently through the same two-layer feed-forward network:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FFN(x) = ReLU(x * W_1 + b_1) * W_2 + b_2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The inner dimension is expanded (typically 4x, from 512 to 2048 in the original), then projected back down. This gives the model per-token nonlinear processing capacity — attention handles &lt;em&gt;inter-token&lt;/em&gt; relationships, while the FFN handles &lt;em&gt;intra-token&lt;/em&gt; transformation.&lt;/p&gt;

&lt;p&gt;Recent research suggests the FFN layers serve as the model's "memory," storing factual knowledge, while the attention layers handle relational reasoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Residual Connections and Layer Normalization
&lt;/h3&gt;

&lt;p&gt;Every sub-layer (attention and FFN) is wrapped with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output = LayerNorm(x + Sublayer(x))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;residual connection&lt;/strong&gt; (&lt;code&gt;x + Sublayer(x)&lt;/code&gt;) allows gradients to flow directly through the network without degradation, enabling very deep stacks (6, 12, 24, 96+ layers). Without residuals, training deep Transformers would be nearly impossible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer normalization&lt;/strong&gt; stabilizes the hidden state magnitudes, preventing the distribution of activations from drifting as the signal passes through many layers. It normalizes across the feature dimension for each token independently.&lt;/p&gt;

&lt;p&gt;These aren't glamorous components, but they're essential. The Transformer's depth — and therefore its capacity — depends on them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Design Scales So Well
&lt;/h2&gt;

&lt;p&gt;The Transformer has a property that no previous architecture achieved to the same degree: &lt;strong&gt;predictable, smooth scaling.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In 2020, Kaplan et al. (OpenAI) published the &lt;strong&gt;scaling laws&lt;/strong&gt; paper, showing that Transformer performance improves as a smooth power law with respect to three factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Model size&lt;/strong&gt; (number of parameters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset size&lt;/strong&gt; (number of training tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute budget&lt;/strong&gt; (FLOPs spent on training)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Double the parameters, and you get a predictable improvement. Double the data, same thing. This is remarkably different from previous architectures where scaling often hit diminishing returns or instabilities.&lt;/p&gt;

&lt;p&gt;Why do Transformers scale so well?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full parallelism&lt;/strong&gt; means more compute directly translates to faster training and larger batch sizes. No sequential bottleneck limits how much hardware you can throw at the problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attention across all positions&lt;/strong&gt; means the model's capacity to capture relationships grows with sequence length and model width, without architectural changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Depth composability&lt;/strong&gt; — stacking more identical layers adds representational power smoothly, thanks to residual connections and layer normalization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No information bottleneck&lt;/strong&gt; — unlike RNNs, there's no fixed-size hidden state that all information must pass through.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Chinchilla paper (Hoffmann et al., 2022) later refined these laws, showing that models should be trained on roughly 20 tokens per parameter for optimal compute efficiency. This led to a shift from the "bigger model" paradigm to the "more data" paradigm.&lt;/p&gt;




&lt;h2&gt;
  
  
  Transformer Variants: A Family of Architectures
&lt;/h2&gt;

&lt;p&gt;The original Transformer is encoder-decoder. But researchers quickly discovered that using &lt;em&gt;parts&lt;/em&gt; of the architecture for specific tasks yielded remarkable results. Three major paradigms emerged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Encoder-Only: BERT and Its Descendants
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;BERT&lt;/strong&gt; (Bidirectional Encoder Representations from Transformers, 2018) uses only the encoder stack. During pre-training, it masks random tokens in the input and trains the model to predict them — this is &lt;strong&gt;Masked Language Modeling (MLM)&lt;/strong&gt;. Because there's no autoregressive generation, every token can attend to every other token bidirectionally.&lt;/p&gt;

&lt;p&gt;BERT excels at understanding tasks: classification, named entity recognition, question answering, semantic similarity. It produces rich contextual embeddings where the same word gets different representations depending on context ("bank" in "river bank" vs. "bank account").&lt;/p&gt;

&lt;p&gt;Key descendants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RoBERTa&lt;/strong&gt; — Same architecture, better training: more data, longer training, no next-sentence prediction objective. Showed BERT was significantly undertrained.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ALBERT&lt;/strong&gt; — Parameter-efficient BERT with cross-layer parameter sharing and factorized embeddings. Smaller model, competitive performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeBERTa&lt;/strong&gt; — Disentangles content and position into separate attention streams, then combines them late. Achieved human-level performance on the SuperGLUE benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XLNet&lt;/strong&gt; — Uses a permutation-based training objective to capture bidirectional context while maintaining autoregressive formulation. Avoids the pretrain-finetune discrepancy of BERT's masking.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Decoder-Only: GPT and the Autoregressive Revolution
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GPT&lt;/strong&gt; (Generative Pre-trained Transformer, 2018) uses only the decoder stack. It's trained to predict the next token given all previous tokens — pure autoregressive language modeling. The masked self-attention ensures each position can only attend to earlier positions.&lt;/p&gt;

&lt;p&gt;This paradigm turned out to be the one that scales the furthest. The progression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-1&lt;/strong&gt; (2018): 117M parameters. Showed that unsupervised pre-training + supervised fine-tuning works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-2&lt;/strong&gt; (2019): 1.5B parameters. Showed that scale enables zero-shot task performance. "Too dangerous to release" (it wasn't, but the PR worked).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-3&lt;/strong&gt; (2020): 175B parameters. Showed that in-context learning emerges at scale — the model can perform tasks from a few examples in the prompt, no fine-tuning needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4&lt;/strong&gt; (2023): Architecture undisclosed, rumored mixture-of-experts. Multimodal (text + vision). State-of-the-art across dozens of benchmarks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other major decoder-only models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PaLM&lt;/strong&gt; (Google, 540B) — Showed breakthrough performance on reasoning tasks with chain-of-thought prompting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLaMA&lt;/strong&gt; (Meta) — Open-weight models proving that smaller, well-trained models can match much larger ones. LLaMA 2 (7B-70B) and LLaMA 3 catalyzed the open-source AI ecosystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mistral&lt;/strong&gt; — Efficient open models using Grouped-Query Attention and Sliding Window Attention. Mistral 7B outperformed LLaMA 2 13B.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; (Anthropic) — Constitutional AI approach with RLHF. Strong reasoning and instruction-following with emphasis on safety and helpfulness.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Encoder-Decoder: T5 and Unified Frameworks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;T5&lt;/strong&gt; (Text-to-Text Transfer Transformer, 2019) keeps the full encoder-decoder architecture but reframes every NLP task as a text-to-text problem. Classification? Input: "classify: this movie was great", output: "positive". Translation? Input: "translate English to German: Hello", output: "Hallo".&lt;/p&gt;

&lt;p&gt;This unified framing is elegant — one architecture, one training procedure, one format for everything. T5 also systematically studied every architectural choice (model size, pre-training objective, dataset), making it one of the most thorough papers in the field.&lt;/p&gt;




&lt;h2&gt;
  
  
  Model Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Key Innovation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Original Transformer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encoder-Decoder&lt;/td&gt;
&lt;td&gt;2017&lt;/td&gt;
&lt;td&gt;65M&lt;/td&gt;
&lt;td&gt;Self-attention replacing recurrence entirely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BERT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encoder-only&lt;/td&gt;
&lt;td&gt;2018&lt;/td&gt;
&lt;td&gt;110M / 340M&lt;/td&gt;
&lt;td&gt;Bidirectional pre-training with masked language modeling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decoder-only&lt;/td&gt;
&lt;td&gt;2018&lt;/td&gt;
&lt;td&gt;117M&lt;/td&gt;
&lt;td&gt;Unsupervised pre-training + fine-tuning paradigm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decoder-only&lt;/td&gt;
&lt;td&gt;2019&lt;/td&gt;
&lt;td&gt;1.5B&lt;/td&gt;
&lt;td&gt;Zero-shot task transfer via scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encoder-Decoder&lt;/td&gt;
&lt;td&gt;2019&lt;/td&gt;
&lt;td&gt;220M - 11B&lt;/td&gt;
&lt;td&gt;Unified text-to-text framing for all NLP tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XLNet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Autoregressive&lt;/td&gt;
&lt;td&gt;2019&lt;/td&gt;
&lt;td&gt;340M&lt;/td&gt;
&lt;td&gt;Permutation-based training for bidirectional context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RoBERTa&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encoder-only&lt;/td&gt;
&lt;td&gt;2019&lt;/td&gt;
&lt;td&gt;355M&lt;/td&gt;
&lt;td&gt;Optimized BERT training procedure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ALBERT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encoder-only&lt;/td&gt;
&lt;td&gt;2019&lt;/td&gt;
&lt;td&gt;12M - 235M&lt;/td&gt;
&lt;td&gt;Cross-layer parameter sharing, factorized embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeBERTa&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encoder-only&lt;/td&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;134M - 1.5B&lt;/td&gt;
&lt;td&gt;Disentangled attention for content and position&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decoder-only&lt;/td&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;175B&lt;/td&gt;
&lt;td&gt;In-context learning, few-shot capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PaLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decoder-only&lt;/td&gt;
&lt;td&gt;2022&lt;/td&gt;
&lt;td&gt;540B&lt;/td&gt;
&lt;td&gt;Pathways system, breakthrough reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLaMA 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decoder-only&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;7B - 70B&lt;/td&gt;
&lt;td&gt;Open-weight, efficient training, GQA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mistral 7B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decoder-only&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;Sliding window attention, grouped-query attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decoder-only (MoE?)&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;Undisclosed&lt;/td&gt;
&lt;td&gt;Multimodal, state-of-the-art reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude 3.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decoder-only&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;Undisclosed&lt;/td&gt;
&lt;td&gt;Constitutional AI, strong reasoning + safety&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLaMA 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decoder-only&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;8B - 405B&lt;/td&gt;
&lt;td&gt;15T tokens training data, extended context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Why Decoder-Only Won (For Now)
&lt;/h2&gt;

&lt;p&gt;A natural question: if the original Transformer is encoder-decoder, why are the largest and most capable models decoder-only?&lt;/p&gt;

&lt;p&gt;Several factors converged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplicity.&lt;/strong&gt; One stack is easier to scale than two. Fewer architectural decisions, fewer hyperparameters, simpler training pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unification of understanding and generation.&lt;/strong&gt; Encoder-only models are great at understanding but cannot generate. Decoder-only models can do both — they understand context &lt;em&gt;through&lt;/em&gt; the process of predicting what comes next.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Emergent capabilities.&lt;/strong&gt; As decoder-only models scaled, unexpected abilities appeared: chain-of-thought reasoning, in-context learning, instruction following. These emergent behaviors were less pronounced in encoder-only or encoder-decoder models at similar scales.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training efficiency.&lt;/strong&gt; Next-token prediction is a dense supervision signal — every token in the training data provides a training signal. Masked language modeling only trains on the ~15% of tokens that are masked.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That said, encoder-decoder architectures aren't dead. They excel at tasks where you have a clear input-output mapping (translation, summarization), and models like T5 and its successors remain competitive in many benchmarks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Lasting Impact
&lt;/h2&gt;

&lt;p&gt;The Transformer didn't just change NLP. It became the &lt;strong&gt;universal architecture&lt;/strong&gt; for deep learning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Computer Vision:&lt;/strong&gt; Vision Transformers (ViT) treat image patches as tokens and achieve state-of-the-art classification, detection, and segmentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio:&lt;/strong&gt; Whisper uses Transformers for speech recognition. MusicLM generates music.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protein Structure:&lt;/strong&gt; AlphaFold 2 uses a modified attention mechanism to predict 3D protein structures, solving a 50-year-old biology problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robotics:&lt;/strong&gt; RT-2 uses Transformers to translate language instructions into robot actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code:&lt;/strong&gt; Codex, CodeLlama, and StarCoder generate and understand programming languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is so general that the main research question shifted from "what architecture should we use?" to "how much data and compute should we invest?"&lt;/p&gt;




&lt;h2&gt;
  
  
  What Comes Next: From Text to Vision
&lt;/h2&gt;

&lt;p&gt;The Transformer started with language, but it didn't stay there. In Part 2 of this series, we'll explore &lt;strong&gt;Vision Transformers (ViT)&lt;/strong&gt; — how researchers adapted the attention mechanism to work with images, why it works so well, and how it dethroned CNNs as the dominant architecture in computer vision.&lt;/p&gt;

&lt;p&gt;From pixels to patches to attention maps — the next chapter of the Transformer story is just as transformative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next up: &lt;a href="https://software-engineer-blog.com/content/vision-transformers-how-transformers-learned-to-see-part-2-of-3?id=50" rel="noopener noreferrer"&gt;Part 2 — Vision Transformers: How Transformers Learned to See&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>nlp</category>
    </item>
    <item>
      <title>How LLM Function Calling Actually Works — From Tokens to Tool Orchestration</title>
      <dc:creator>Vahid Aghajani</dc:creator>
      <pubDate>Sat, 04 Jul 2026 17:42:25 +0000</pubDate>
      <link>https://dev.to/vahid_aghajani_60ce9dbec9/how-llm-function-calling-actually-works-from-tokens-to-tool-orchestration-27fb</link>
      <guid>https://dev.to/vahid_aghajani_60ce9dbec9/how-llm-function-calling-actually-works-from-tokens-to-tool-orchestration-27fb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on &lt;a href="https://software-engineer-blog.com/content/how-llm-function-calling-actually-works-from-tokens-to-tool-orchestration?id=42" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;. Cross-posted here with a canonical link.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When you ask an LLM "Compare the weather in Tokyo and Berlin," what actually happens? The model can't browse the internet — but it &lt;em&gt;can&lt;/em&gt; decide to call a weather API. Twice. In the same turn.&lt;/p&gt;

&lt;p&gt;This article covers how function calling works, how the LLM returns structured data despite generating tokens one by one, and what happens when the model needs to orchestrate multiple tool calls to answer a single question.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: What Is Function Calling?
&lt;/h2&gt;

&lt;p&gt;"Function calling" is one of several ways to get output from an LLM API. Here are the three main methods:&lt;/p&gt;

&lt;h3&gt;
  
  
  Method A: Plain Text Completion (Simplest)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash-lite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classify this email: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="c1"&gt;# "This is a job_search email because..."
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get back &lt;strong&gt;free-form text&lt;/strong&gt;. Then you'd have to parse it yourself — maybe with regex, or hoping the LLM follows instructions like "respond with JSON only". This is fragile because the LLM might say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I think this email is in the job_search category because..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;...and now your regex breaks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method B: JSON Mode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# force JSON output
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM is &lt;strong&gt;forced to output valid JSON&lt;/strong&gt;, but you still have no guarantee of the schema — it might return &lt;code&gt;{"cat": "job"}&lt;/code&gt; instead of &lt;code&gt;{"category": "job_search"}&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method C: Function Calling (What We Use)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classify_email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spam&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;newsletter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classify_email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You define the &lt;strong&gt;exact schema&lt;/strong&gt; you want — field names, types, enums, required fields. The API forces the LLM to fill in that schema. The result comes back as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"job_search"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LinkedIn job alert for Senior Python Developer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sender is LinkedIn, contains job listings..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the most reliable way to get structured output. The LLM &lt;strong&gt;cannot&lt;/strong&gt; deviate from the schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Schema Guarantee&lt;/th&gt;
&lt;th&gt;Reliability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Plain text&lt;/td&gt;
&lt;td&gt;Free-form string&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Low — requires manual parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON mode&lt;/td&gt;
&lt;td&gt;Valid JSON&lt;/td&gt;
&lt;td&gt;No schema enforcement&lt;/td&gt;
&lt;td&gt;Medium — valid JSON but unpredictable keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Function calling&lt;/td&gt;
&lt;td&gt;Schema-constrained JSON&lt;/td&gt;
&lt;td&gt;Full schema + types + enums&lt;/td&gt;
&lt;td&gt;High — enforced at token generation level&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Part 2: How Does an LLM Return a Dict If It Generates Tokens?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The LLM still generates tokens one by one.&lt;/strong&gt; It doesn't "natively" return a Python dictionary. Here's what actually happens under the hood.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the LLM Actually Generates
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Tokens:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"  category  "&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"  job  _  search  "&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"  confidence  "&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;↑&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="err"&gt;token&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;token&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;token&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(still&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;just&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tokens)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM is still producing &lt;strong&gt;text&lt;/strong&gt; — it's just text that happens to be valid JSON. The API layer does the magic.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Constrained Decoding Process
&lt;/h3&gt;

&lt;p&gt;When you use function calling, the API applies &lt;strong&gt;constrained decoding&lt;/strong&gt; (also called "guided generation"):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The API receives your &lt;code&gt;tools&lt;/code&gt; schema definition&lt;/li&gt;
&lt;li&gt;It &lt;strong&gt;constrains&lt;/strong&gt; the LLM's token generation — at each step, only tokens that would produce valid JSON matching your schema are allowed&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;"enum": ["job_search", "spam", ...]&lt;/code&gt;, the LLM can &lt;strong&gt;literally only pick&lt;/strong&gt; from those values&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;"type": "number"&lt;/code&gt;, only numeric tokens are valid at that position&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is fundamentally different from just asking "please reply in JSON" in the prompt. The constraints are enforced at the &lt;strong&gt;token-generation level&lt;/strong&gt;, not via prompt instructions.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Full Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM brain
  │
  ▼ (generates tokens, constrained by schema)
'{"category":"job_search","confidence":0.95,"summary":"..."}'
  │
  ▼ (API parses &amp;amp; validates)
response.choices[0].message.tool_calls[0].function.arguments
  │  (this is still a STRING)
  ▼
json.loads(arguments)
  │
  ▼ (now it's a Python dict)
{"category": "job_search", "confidence": 0.95, "summary": "..."}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  In Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The API returns tool_calls as part of the response
&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# .arguments is a STRING containing JSON
&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;
&lt;span class="c1"&gt;# '{"category":"job_search","confidence":0.95,...}'
&lt;/span&gt;
&lt;span class="c1"&gt;# We parse it into a Python dict
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# {"category": "job_search", "confidence": 0.95, ...}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: The LLM is still generating text/tokens. Function calling constrains &lt;em&gt;which&lt;/em&gt; tokens it can generate (must match your schema), and the API wraps the result in a structured format. We then &lt;code&gt;json.loads()&lt;/code&gt; that string into a Python dict.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: Multiple Tools — "Compare the Weather in Tokyo and Berlin"
&lt;/h2&gt;

&lt;p&gt;So far we've seen one function called once. But what happens when the user's question requires &lt;strong&gt;multiple tool calls&lt;/strong&gt;?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Setup: Defining a Weather Tool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get current weather for a city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;units&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celsius&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fahrenheit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Temperature units&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have &lt;strong&gt;one&lt;/strong&gt; tool definition — &lt;code&gt;get_weather&lt;/code&gt;. Now watch what happens when the user asks a question that requires it twice.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Request
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compare the weather in Tokyo and Berlin right now&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What the LLM Returns: Two Tool Calls
&lt;/h3&gt;

&lt;p&gt;The LLM doesn't return a text answer. Instead, it returns &lt;strong&gt;two&lt;/strong&gt; tool calls in a single response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;

&lt;span class="c1"&gt;# message.content is None — no text response
# message.tool_calls has TWO entries:
&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Function: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Args: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;call_abc123&lt;/span&gt;
&lt;span class="na"&gt;Function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get_weather&lt;/span&gt;
&lt;span class="na"&gt;Args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tokyo"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;units"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celsius"&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;

&lt;span class="na"&gt;ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;call_def456&lt;/span&gt;
&lt;span class="na"&gt;Function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get_weather&lt;/span&gt;
&lt;span class="na"&gt;Args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Berlin"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;units"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celsius"&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM &lt;strong&gt;decided on its own&lt;/strong&gt; to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Call the same function twice with different arguments&lt;/li&gt;
&lt;li&gt;Pick "celsius" as the unit (reasonable default for these cities)&lt;/li&gt;
&lt;li&gt;Return both calls in the same turn (parallel tool calls)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Your Code Executes the Tools
&lt;/h3&gt;

&lt;p&gt;Now you run both calls and feed the results back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Execute each tool call
&lt;/span&gt;&lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Call your actual weather API
&lt;/span&gt;    &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_weather_from_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;units&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celsius&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# must match the ID from the LLM
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Send results back to the LLM
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compare the weather in Tokyo and Berlin right now&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# the assistant's tool_calls response
&lt;/span&gt;    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# both tool results
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Final Answer
&lt;/h3&gt;

&lt;p&gt;The LLM now has both weather results and generates a natural comparison:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Right now, Tokyo is 22°C with partly cloudy skies, while Berlin is 8°C and raining. Tokyo is 14 degrees warmer. If you're choosing between the two today, Tokyo has the better weather."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Complete Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Compare weather in Tokyo and Berlin"
  │
  ▼
LLM (Turn 1): I need weather for both cities
  │
  ├─→ tool_call: get_weather(city="Tokyo")    ──→ Your code calls API ──→ {"temp": 22, ...}
  │
  └─→ tool_call: get_weather(city="Berlin")   ──→ Your code calls API ──→ {"temp": 8, ...}
  │
  ▼
LLM (Turn 2): Now I have both results
  │
  └─→ "Tokyo is 22°C, Berlin is 8°C. Tokyo is 14 degrees warmer..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Parallel vs Sequential Tool Calls
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Parallel&lt;/strong&gt; (what happened above): The LLM returns multiple &lt;code&gt;tool_calls&lt;/code&gt; in a single response. Both calls are independent — your code can execute them concurrently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tools_parallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;execute_single_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sequential&lt;/strong&gt;: Sometimes the LLM needs the result of one call before making the next. For example: "What's the weather in the capital of France?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Turn 1: LLM calls get_capital(country="France")
  → Your code returns "Paris"
Turn 2: LLM calls get_weather(city="Paris")
  → Your code returns weather data
Turn 3: LLM generates final answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM decides which pattern to use based on whether the calls depend on each other.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: Different Tools in One Turn
&lt;/h2&gt;

&lt;p&gt;The LLM can also call &lt;strong&gt;different&lt;/strong&gt; tools in the same turn. Suppose you define two tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get current weather for a city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_exchange_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get currency exchange rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from_currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from_currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the user asks: &lt;em&gt;"I'm traveling from NYC to Tokyo next week. What's the weather like and how much is 1 USD in Yen?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The LLM returns &lt;strong&gt;two different tool calls&lt;/strong&gt; in one turn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tool_calls[0]
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arguments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tokyo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# tool_calls[1]
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_exchange_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arguments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from_currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JPY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your code routes each call to the right function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tool_handlers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;handle_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_exchange_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;handle_exchange_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_handlers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ... send result back
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is essentially the &lt;strong&gt;registry pattern&lt;/strong&gt; — a dictionary maps function names to handlers. No if/else chains needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Controls This Behavior?
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;tool_choice&lt;/code&gt; parameter controls whether and how the LLM uses tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;tool_choice&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"auto"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM decides whether to call tools or respond with text&lt;/td&gt;
&lt;td&gt;General-purpose agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"required"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM must call at least one tool&lt;/td&gt;
&lt;td&gt;When you always need structured output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{"type": "function", "function": {"name": "..."}}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM must call this specific function&lt;/td&gt;
&lt;td&gt;Email classification (always classify)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"none"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM cannot call any tools&lt;/td&gt;
&lt;td&gt;Force a text-only response&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the weather comparison, we use &lt;code&gt;"auto"&lt;/code&gt; — the LLM decides on its own that it needs to call &lt;code&gt;get_weather&lt;/code&gt; twice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Function calling &amp;gt; JSON mode &amp;gt; plain text&lt;/strong&gt; for getting structured data from LLMs. Function calling enforces your schema at the token generation level, not just via prompt instructions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLMs still generate tokens&lt;/strong&gt; — they don't natively return dicts. The API layer applies constrained decoding to ensure the token output matches your schema, then you &lt;code&gt;json.loads()&lt;/code&gt; the resulting string.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One question can trigger multiple tool calls.&lt;/strong&gt; The LLM decides whether to call the same tool with different arguments (Tokyo + Berlin) or different tools entirely (weather + exchange rate) — all in a single turn.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parallel vs sequential is decided by the LLM.&lt;/strong&gt; Independent calls (two cities) come back in one turn. Dependent calls (get capital → get weather) happen across multiple turns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Route tool calls with a registry, not if/else.&lt;/strong&gt; A dictionary mapping function names to handlers keeps your code clean and extensible.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
