<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: delimitter</title>
    <description>The latest articles on DEV Community by delimitter (@delimitter_8b9077911a3848).</description>
    <link>https://dev.to/delimitter_8b9077911a3848</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1878195%2F7d6143ad-5ffb-4b2f-bd25-94c0e9b21b1d.png</url>
      <title>DEV Community: delimitter</title>
      <link>https://dev.to/delimitter_8b9077911a3848</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/delimitter_8b9077911a3848"/>
    <language>en</language>
    <item>
      <title>Executable Documentation: When Your Comments Become Tests</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:06:16 +0000</pubDate>
      <link>https://dev.to/delimitter_8b9077911a3848/executable-documentation-when-your-comments-become-tests-2p53</link>
      <guid>https://dev.to/delimitter_8b9077911a3848/executable-documentation-when-your-comments-become-tests-2p53</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Synoema stores documentation as executable state in the AST. Doc comments (&lt;code&gt;---&lt;/code&gt;) and their &lt;code&gt;example:&lt;/code&gt; assertions are parsed, tested, and rendered from a single source of truth. Stale docs fail the test suite. 56% fewer tokens than the Python equivalent.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; If you've ever found an outdated docstring that claimed a function returned a string when it actually returns a list, this article is for you. Whether you maintain a library, write code with AI assistants, or just want documentation that can't lie, read on.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Documentation lies. Not intentionally, but inevitably. A developer changes a function's behavior, forgets to update the docstring, and now the documentation describes code that no longer exists. This drift is not a personal failing; it's a structural problem. Traditional programming languages treat documentation as metadata: a passive comment attached to code, never executed, never verified.&lt;/p&gt;

&lt;p&gt;What if the language itself made stale documentation structurally impossible?&lt;/p&gt;

&lt;p&gt;In this article, Part 14 of &lt;em&gt;Token Economics of Code&lt;/em&gt;, I'll describe a paradigm where documentation is stored as executable state directly in the AST, verified on every test run, and consumed by both humans and LLMs from a single source of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scale of the Problem
&lt;/h2&gt;

&lt;p&gt;The disconnect between documentation and code is well-documented (ironically). A 2023 study by Wen et al. found that 25.5% of Python docstrings in popular open-source projects are inconsistent with their corresponding function signatures. One in four.&lt;/p&gt;

&lt;p&gt;The cost isn't just confusion. When an LLM reads a stale docstring to understand your codebase, it generates code based on incorrect context. That code fails. The failure triggers a retry. The retry consumes tokens. The tokens cost money and energy. Documentation debt becomes inference debt.&lt;/p&gt;

&lt;p&gt;Three dominant approaches exist today. None solve the problem:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Verification&lt;/th&gt;
&lt;th&gt;Drift risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docstrings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;doctest&lt;/code&gt; (opt-in, fragile)&lt;/td&gt;
&lt;td&gt;High (separate from tests)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSDoc / TSDoc&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JS / TS&lt;/td&gt;
&lt;td&gt;None (comments only)&lt;/td&gt;
&lt;td&gt;Very high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Haddock&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Haskell&lt;/td&gt;
&lt;td&gt;None (rendered to HTML)&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python's &lt;code&gt;doctest&lt;/code&gt; module comes closest, but it has a fundamental limitation: it compares string representations of output, not semantic values. A change in &lt;code&gt;__repr__&lt;/code&gt; breaks every doctest. And doctest extraction relies on regex-level parsing, not the language's own AST.&lt;/p&gt;
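&lt;p&gt;To see that fragility concretely, here is a small, self-contained Python sketch (illustrative, not from the Synoema codebase): the function is refactored to return a float, the value is still numerically correct, and the doctest fails anyway because it compares printed representations.&lt;/p&gt;

```python
import doctest

def fact(n):
    """Compute factorial, refactored to return a float.

    >>> fact(5)
    120
    """
    result = 1.0
    for i in range(1, n + 1):
        result *= i
    return result

# doctest compares the printed repr ("120.0") against the expected text
# ("120"), so a semantically equal result still fails.
runner = doctest.DocTestRunner(verbose=False)
for test in doctest.DocTestFinder().find(fact, "fact", globs={"fact": fact}):
    runner.run(test)
print(runner.failures)  # 1 failure, even though fact(5) == 120
```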

&lt;h2&gt;
  
  
  The Paradigm: Documentation as AST State
&lt;/h2&gt;

&lt;p&gt;Synoema takes a different approach. Documentation is a &lt;strong&gt;first-class syntactic element&lt;/strong&gt;: not a comment convention, but a token type recognized by the lexer, preserved in the AST, and consumed by the compiler toolchain.&lt;/p&gt;

&lt;p&gt;The syntax uses triple-dash &lt;code&gt;---&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- Compute factorial.
--- example: fact 5 == 120
fact 0 = 1
fact n = n * fact (n - 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things happen when the parser encounters &lt;code&gt;---&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The lexer emits a &lt;code&gt;Token::DocComment(String)&lt;/code&gt;, distinct from &lt;code&gt;--&lt;/code&gt; (a regular comment, stripped during tokenization)&lt;/li&gt;
&lt;li&gt;The parser collects consecutive doc lines and attaches them to the next declaration as &lt;code&gt;doc: Vec&amp;lt;String&amp;gt;&lt;/code&gt; in the AST&lt;/li&gt;
&lt;li&gt;Lines starting with &lt;code&gt;example:&lt;/code&gt; are flagged as executable assertions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not a wrapper around regular comments. It's a distinct token class, occupying exactly 1 BPE token in the &lt;code&gt;cl100k_base&lt;/code&gt; vocabulary. Regular comments (&lt;code&gt;--&lt;/code&gt;) are invisible to the AST. Doc comments (&lt;code&gt;---&lt;/code&gt;) persist through the entire compilation pipeline.&lt;/p&gt;
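&lt;p&gt;The distinction can be sketched in a few lines of Python (a simplified model of the real Rust lexer; names are illustrative):&lt;/p&gt;

```python
# Simplified model of the lexer rule: "---" produces a DocComment token that
# survives into the AST, while "--" comments are dropped at tokenization.
def scan_line(line):
    stripped = line.lstrip()
    if stripped.startswith("---"):
        return ("DocComment", stripped[3:].strip())  # kept, attached to next decl
    if stripped.startswith("--"):
        return None                                  # regular comment: discarded
    return ("Code", line)

print(scan_line("--- example: fact 5 == 120"))  # ('DocComment', 'example: fact 5 == 120')
print(scan_line("-- scratch note"))             # None
```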

&lt;p&gt;Here is the key difference from traditional approaches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional:  Source вЖТ [strip comments] вЖТ AST вЖТ Compile
Synoema:      Source вЖТ AST (with doc: Vec&amp;lt;String&amp;gt;) вЖТ Compile + Test + Doc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Documentation is not stripped. It travels with the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works: From Lexer to Test Runner
&lt;/h2&gt;

&lt;p&gt;Let me trace the full pipeline for a single doctest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Lexing.&lt;/strong&gt; The scanner encounters &lt;code&gt;---&lt;/code&gt; and calls &lt;code&gt;scan_doc_comment()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input:  "--- example: fact 5 == 120\n"
Output: Token::DocComment("example: fact 5 == 120")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The text after &lt;code&gt;---&lt;/code&gt; is captured verbatim, with leading whitespace trimmed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Parsing.&lt;/strong&gt; The parser's &lt;code&gt;collect_doc_comments()&lt;/code&gt; method gathers consecutive &lt;code&gt;DocComment&lt;/code&gt; tokens into a vector. When it hits a function declaration, it attaches the vector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nn"&gt;Decl&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Func&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"fact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;equations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Compute factorial."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"example: fact 5 == 120"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both the human-readable description ("Compute factorial") and the executable assertion ("example: fact 5 == 120") live in the same &lt;code&gt;Vec&amp;lt;String&amp;gt;&lt;/code&gt;. No separate metadata structure. No JSON sidecar. One field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Test extraction.&lt;/strong&gt; When you run &lt;code&gt;synoema test&lt;/code&gt;, the &lt;code&gt;extract_doctests()&lt;/code&gt; function walks every declaration, finds lines starting with &lt;code&gt;example:&lt;/code&gt;, and splits them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"example: fact 5 == 120"
         ^^^^^^    ^^^
         expr      expected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The split respects bracket nesting: &lt;code&gt;example: head [1 2 3] == 1&lt;/code&gt; correctly identifies &lt;code&gt;head [1 2 3]&lt;/code&gt; as the expression and &lt;code&gt;1&lt;/code&gt; as the expected value.&lt;/p&gt;
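&lt;p&gt;A minimal Python sketch of that bracket-aware split (the real implementation is in Rust; this only illustrates the depth-tracking idea):&lt;/p&gt;

```python
# Split "example: expr == expected" on the first top-level "==",
# tracking bracket depth so "==" inside brackets is ignored.
def split_example(line):
    body = line.removeprefix("example:").strip()
    depth = 0
    for i in range(len(body) - 1):
        ch = body[i]
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        elif depth == 0 and body[i:i+2] == "==":
            return body[:i].strip(), body[i+2:].strip()
    return None  # no top-level "==" found

print(split_example("example: head [1 2 3] == 1"))  # ('head [1 2 3]', '1')
```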

&lt;p&gt;&lt;strong&gt;Step 4: Execution.&lt;/strong&gt; Each doctest is evaluated by appending it to the full module source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Original source loaded here --
__doctest_val = fact 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is compared against the expected value (also evaluated in the same context). If they match, the test passes; if not, it fails with a diagnostic showing the expression, the expected value, and the actual value.&lt;/p&gt;

&lt;p&gt;This means doctests have access to every definition in the file. They run in the real evaluation environment, not a sandboxed mock. If the function changes behavior, the doctest catches it.&lt;/p&gt;
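&lt;p&gt;The execution strategy is easy to model in Python (a sketch of the approach, not the Synoema evaluator): load the whole module into one namespace, then evaluate the doctest expression and the expected value in that same namespace and compare the resulting values.&lt;/p&gt;

```python
# Evaluate both sides of the doctest in the real module environment,
# then compare values (not strings).
module_src = "def fact(n):\n    return 1 if n == 0 else n * fact(n - 1)\n"
env = {}
exec(module_src, env)            # load every definition in the "file"
actual = eval("fact(5)", env)    # the doctest expression
expected = eval("120", env)      # the expected value, same context
print(actual == expected)        # True: semantic comparison
```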

&lt;h2&gt;
  
  
  Three Testing Tiers, One Pipeline
&lt;/h2&gt;

&lt;p&gt;Synoema unifies three kinds of verification into a single &lt;code&gt;synoema test&lt;/code&gt; command:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1: Doctests&lt;/strong&gt;, inline assertions in doc comments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- Reverse a list.
--- example: reverse [1 2 3] == [3 2 1]
reverse [] = []
reverse (x:xs) = reverse xs ++ [x]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tier 2: Unit tests&lt;/strong&gt;, named boolean assertions using the &lt;code&gt;test&lt;/code&gt; keyword.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test "fact base" = fact 0 == 1
test "fact 10" = fact 10 == 3628800
test "sort then reverse" = reverse (qsort [3 1 2]) == [3 2 1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tier 3: Property tests&lt;/strong&gt;, generative testing with the &lt;code&gt;prop&lt;/code&gt; keyword.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test "reverse involution" = prop xs -&amp;gt; reverse (reverse xs) == xs
test "sort idempotent" = prop xs -&amp;gt; qsort (qsort xs) == qsort xs
test "fact positive" = prop n -&amp;gt; fact n &amp;gt;= 1 when n &amp;gt;= 0 &amp;amp;&amp;amp; n &amp;lt;= 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Property tests use Hindley-Milner type inference to determine what values to generate. The variable &lt;code&gt;xs&lt;/code&gt; in &lt;code&gt;prop xs -&amp;gt; reverse (reverse xs) == xs&lt;/code&gt; is inferred as &lt;code&gt;List a&lt;/code&gt;, so the test runner generates random lists. The variable &lt;code&gt;n&lt;/code&gt; in &lt;code&gt;prop n -&amp;gt; fact n &amp;gt;= 1&lt;/code&gt; is inferred as &lt;code&gt;Int&lt;/code&gt;, so it generates random integers. No manual type annotations required.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;when&lt;/code&gt; clause filters generated values: &lt;code&gt;when n &amp;gt;= 0 &amp;amp;&amp;amp; n &amp;lt;= 10&lt;/code&gt; discards any &lt;code&gt;n&lt;/code&gt; outside that range before evaluating the property. The runner executes 100 valid trials per property, with a deterministic seed for reproducibility.&lt;/p&gt;
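&lt;p&gt;The trial loop can be sketched in Python (assumed semantics taken from the description above: 100 valid trials, a fixed seed, and a &lt;code&gt;when&lt;/code&gt; predicate that discards inputs before the property is evaluated):&lt;/p&gt;

```python
import random

# Minimal property-test loop: generate, filter with `when`, check, repeat
# until `trials` valid cases have passed or a counterexample is found.
def check_prop(prop, gen, when=None, trials=100, seed=42):
    rng = random.Random(seed)    # deterministic seed for reproducibility
    passed = 0
    while trials > passed:
        value = gen(rng)
        if when is not None and not when(value):
            continue             # discarded: does not count as a trial
        if not prop(value):
            return False, value  # counterexample found
        passed += 1
    return True, None

# reverse (reverse xs) == xs, with random lists standing in for `List a`
gen_list = lambda rng: [rng.randint(-50, 50) for _ in range(rng.randint(0, 8))]
ok, cex = check_prop(lambda xs: list(reversed(list(reversed(xs)))) == xs, gen_list)
print(ok)  # True
```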

&lt;p&gt;All three tiers run together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;synoema &lt;span class="nb"&gt;test &lt;/span&gt;examples/testing.sno

  testing.sno
    doctests:    4 passed, 0 failed
    unit tests:  4 passed, 0 failed
    properties:  5 passed, 0 failed &lt;span class="o"&gt;(&lt;/span&gt;500 trials&lt;span class="o"&gt;)&lt;/span&gt;

  Total: 13 passed, 0 failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Documentation Generation: Same Source, Different Output
&lt;/h2&gt;

&lt;p&gt;The same &lt;code&gt;doc: Vec&amp;lt;String&amp;gt;&lt;/code&gt; that drives testing also drives documentation generation. The &lt;code&gt;synoema doc&lt;/code&gt; command reads the AST and renders it, with no re-parsing, no separate doc format, and no Markdown source files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Markdown output&lt;/strong&gt; (&lt;code&gt;synoema doc --format md&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;Interleaves doc lines as prose and declarations as code blocks. Lines starting with &lt;code&gt;example:&lt;/code&gt; are rendered as highlighted code snippets. Metadata lines (&lt;code&gt;guide:&lt;/code&gt;, &lt;code&gt;order:&lt;/code&gt;, &lt;code&gt;requires:&lt;/code&gt;) control page title, ordering, and dependency tracking, but are invisible in the rendered output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON output&lt;/strong&gt; (&lt;code&gt;synoema doc --format json&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;Exports structured metadata for tooling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"examples/testing.sno"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"functions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Compute factorial."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"example: fact 5 == 120"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This JSON is consumed by the MCP server, which exposes Synoema documentation to LLM agents. The documentation that LLMs read is the same documentation that tests verify. There is no gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why LLMs Care
&lt;/h2&gt;

&lt;p&gt;When an LLM generates or modifies Synoema code, it reads doc comments as part of the source context. Those comments are guaranteed to be accurate, because if they weren't, &lt;code&gt;synoema test&lt;/code&gt; would have failed.&lt;/p&gt;

&lt;p&gt;This creates a feedback loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM reads doc вЖТ generates code вЖТ code changes behavior вЖТ
  synoema test catches stale docs вЖТ developer updates docs вЖТ
    LLM reads updated docs вЖТ ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In traditional languages, the loop has a silent gap: nothing catches stale docs. The LLM operates on incorrect context, generates incorrect code, and the developer blames the LLM rather than the documentation.&lt;/p&gt;

&lt;p&gt;There's a second benefit specific to token economics. Doc comments in Synoema use &lt;code&gt;---&lt;/code&gt; (1 BPE token) instead of Python's &lt;code&gt;"""..."""&lt;/code&gt; (at least 2 tokens for delimiters) or JSDoc's &lt;code&gt;/** ... */&lt;/code&gt; (3+ tokens). Each &lt;code&gt;example:&lt;/code&gt; line is 1 token for the keyword. The documentation syntax itself is token-efficient, consistent with the language's design principle of minimizing BPE token count.&lt;/p&gt;

&lt;h2&gt;
  
  
  Side-by-Side: Python vs Synoema
&lt;/h2&gt;

&lt;p&gt;Let's compare equivalent documented, tested code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python (32 tokens):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compute factorial.
&lt;/span&gt;&lt;span class="gp"&gt;
    &amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="mi"&gt;120&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Synoema (14 tokens):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- Compute factorial.
--- example: fact 5 == 120
fact 0 = 1
fact n = n * fact (n - 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same function. Same documentation. Same executable test. 56% fewer tokens. But the meaningful difference isn't token count; it's that Python's doctest compares string output (&lt;code&gt;"120"&lt;/code&gt;) while Synoema's compares evaluated values (&lt;code&gt;120 == 120&lt;/code&gt;). Change &lt;code&gt;fact&lt;/code&gt; to return a float, and Python's doctest breaks on &lt;code&gt;"120.0"&lt;/code&gt;. Synoema's doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Install and run doctests on the example suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Synoema&lt;/span&gt;
cargo run &lt;span class="nt"&gt;-p&lt;/span&gt; synoema-repl &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Run all tests (doctests + unit + property)&lt;/span&gt;
synoema &lt;span class="nb"&gt;test &lt;/span&gt;examples/

&lt;span class="c"&gt;# Run tests for a specific file&lt;/span&gt;
synoema &lt;span class="nb"&gt;test &lt;/span&gt;examples/testing.sno

&lt;span class="c"&gt;# Generate documentation&lt;/span&gt;
synoema doc examples/testing.sno
synoema doc &lt;span class="nt"&gt;--format&lt;/span&gt; json examples/testing.sno
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Write your own documented function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- Double every element in a list.
--- example: double_all [1 2 3] == [2 4 6]
double_all xs = [x * 2 | x &amp;lt;- xs]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save it as &lt;code&gt;my_funcs.sno&lt;/code&gt; and run &lt;code&gt;synoema test my_funcs.sno&lt;/code&gt;. The example assertion becomes a test. The description becomes documentation. One source, two outputs, zero drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In the next article, we'll explore the &lt;strong&gt;future of code generation&lt;/strong&gt;: how compilation, type inference, and executable documentation combine into an agentic pipeline where LLMs don't just write code, but verify it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part 14 of "Token Economics of Code" by @andbubnov. Synoema is open-source: &lt;a href="https://github.com/Delimitter/synoema" rel="noopener noreferrer"&gt;github.com/Delimitter/synoema&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AST&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree: the parsed structure of source code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BPE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Byte Pair Encoding: how LLMs split text into tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Doctest&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An executable example embedded in documentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Doc comment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A &lt;code&gt;---&lt;/code&gt; line in Synoema that persists in the AST&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hindley-Milner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Type inference algorithm that determines types without annotations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Model Context Protocol: connects LLM agents to external tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Property test&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A test that verifies a property holds for random inputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Smallest text unit for an LLM, roughly 3-4 characters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>documentation</category>
      <category>rust</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Real Cost: Token Savings Calculator for Engineering Teams</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Sat, 04 Apr 2026 16:29:11 +0000</pubDate>
      <link>https://dev.to/delimitter_8b9077911a3848/the-real-cost-token-savings-calculator-for-engineering-teams-1286</link>
      <guid>https://dev.to/delimitter_8b9077911a3848/the-real-cost-token-savings-calculator-for-engineering-teams-1286</guid>
      <description>&lt;h2&gt;
  
  
  How Much Is Your Team Actually Spending on Syntactic Overhead?
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; Engineering managers, team leads, and developers who pay for LLM API tokens. This article turns benchmark data into dollar amounts for teams of different sizes.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;We've shown that Synoema uses up to 33% fewer tokens than Python on functional code and that every token costs quadratically more than you think. Now let's do the math for real teams.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Part of &lt;a href="https://dev.to/delimitter/series/37505"&gt;Token Economics of Code&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Formula
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monthly cost = requests/day x tokens/request x price/token x 30 x quadratic_factor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
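&lt;p&gt;As a runnable Python sketch (the inputs below are placeholders, not the benchmark's numbers; &lt;code&gt;quadratic_factor&lt;/code&gt; defaults to 1.0, i.e. no extra scaling):&lt;/p&gt;

```python
# Direct translation of the formula above; price is dollars per token.
def monthly_cost(requests_per_day, tokens_per_request, price_per_token,
                 quadratic_factor=1.0, days=30):
    return (requests_per_day * tokens_per_request * price_per_token
            * days * quadratic_factor)

# Example: 200 requests/day, 2,150 tokens each, $2.50 per million tokens.
print(monthly_cost(200, 2150, 2.50 / 1_000_000))
```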



&lt;h2&gt;
  
  
  Current API Pricing (April 2026)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/M tokens)&lt;/th&gt;
&lt;th&gt;Output ($/M tokens)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$0.80&lt;/td&gt;
&lt;td&gt;$4.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Savings: Functional Code (-33% tokens)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Synoema&lt;/th&gt;
&lt;th&gt;Saving&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System + prompt&lt;/td&gt;
&lt;td&gt;250&lt;/td&gt;
&lt;td&gt;250&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code context&lt;/td&gt;
&lt;td&gt;1,500&lt;/td&gt;
&lt;td&gt;1,005&lt;/td&gt;
&lt;td&gt;495&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;268&lt;/td&gt;
&lt;td&gt;132&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total per request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,150&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,523&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;627 (29%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Dollar Savings by Team Size (GPT-4o)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Team size&lt;/th&gt;
&lt;th&gt;Python monthly&lt;/th&gt;
&lt;th&gt;Synoema monthly&lt;/th&gt;
&lt;th&gt;Monthly saving&lt;/th&gt;
&lt;th&gt;Annual saving&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5 devs&lt;/td&gt;
&lt;td&gt;$424&lt;/td&gt;
&lt;td&gt;$301&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$123&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,476&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25 devs&lt;/td&gt;
&lt;td&gt;$2,118&lt;/td&gt;
&lt;td&gt;$1,504&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$614&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$7,368&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 devs&lt;/td&gt;
&lt;td&gt;$8,470&lt;/td&gt;
&lt;td&gt;$6,014&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2,456&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$29,472&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500 devs&lt;/td&gt;
&lt;td&gt;$42,350&lt;/td&gt;
&lt;td&gt;$30,069&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$12,281&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$147,372&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Beyond Direct Token Cost
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Latency Savings
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Synoema&lt;/th&gt;
&lt;th&gt;Time saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Output generation&lt;/td&gt;
&lt;td&gt;8.0s&lt;/td&gt;
&lt;td&gt;5.4s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.6s per request&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team of 25, per month&lt;/td&gt;
&lt;td&gt;55.8 hrs&lt;/td&gt;
&lt;td&gt;37.1 hrs&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18.7 hrs/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Quadratic Compute
&lt;/h3&gt;

&lt;p&gt;29% fewer tokens translates to roughly a &lt;strong&gt;50% reduction&lt;/strong&gt; in attention compute, since self-attention scales as O(n^2): 0.71^2 ≈ 0.50.&lt;/p&gt;
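&lt;p&gt;A quick sanity check of that arithmetic in Python, using the per-request totals from the savings table above:&lt;/p&gt;

```python
# Attention compute scales with the square of sequence length, so the
# compute ratio is (new_tokens / old_tokens) squared.
python_tokens = 2150   # per-request total, Python
synoema_tokens = 1523  # per-request total, Synoema
compute_reduction = 1 - (synoema_tokens / python_tokens) ** 2
print(f"{compute_reduction:.1%}")  # about 50%
```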

&lt;h3&gt;
  
  
  Error Rate Reduction
&lt;/h3&gt;

&lt;p&gt;Type-guided constrained decoding yields &lt;strong&gt;74.8% fewer type errors&lt;/strong&gt;. Fewer retries mean fewer total tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Break-Even Analysis
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Team size&lt;/th&gt;
&lt;th&gt;Monthly saving (GPT-4o)&lt;/th&gt;
&lt;th&gt;Break-even&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5 devs&lt;/td&gt;
&lt;td&gt;$123/mo&lt;/td&gt;
&lt;td&gt;~3 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25 devs&lt;/td&gt;
&lt;td&gt;$614/mo&lt;/td&gt;
&lt;td&gt;~1 month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 devs&lt;/td&gt;
&lt;td&gt;$2,456/mo&lt;/td&gt;
&lt;td&gt;&amp;lt; 2 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Delimitter/synoema
&lt;span class="nb"&gt;cd &lt;/span&gt;synoema/lang
cargo run &lt;span class="nt"&gt;-p&lt;/span&gt; synoema-repl &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="s2"&gt;"map f [] = []; map f (x:xs) = f x : map f xs; map (&lt;/span&gt;&lt;span class="se"&gt;\x&lt;/span&gt;&lt;span class="s2"&gt; -&amp;gt; x * 2) [1 2 3]"&lt;/span&gt;

&lt;span class="c"&gt;# Full benchmark:&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; .. &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cargo run &lt;span class="nt"&gt;--manifest-path&lt;/span&gt; benchmarks/runner/Cargo.toml &lt;span class="nt"&gt;--&lt;/span&gt; run &lt;span class="nt"&gt;--phases&lt;/span&gt; tokens &lt;span class="nt"&gt;-v&lt;/span&gt;

&lt;span class="c"&gt;# MCP integration:&lt;/span&gt;
npx synoema-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Build Your Own Estimate
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monthly saving = team x requests x 22 x token_saving x price/M

Where:
  token_saving (input)  = context_tokens x 0.33
  token_saving (output) = output_tokens x 0.33
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
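&lt;p&gt;A direct translation of the formula above into Python. The input values in the example call are illustrative assumptions, not the exact figures behind the break-even table:&lt;/p&gt;

```python
def monthly_saving(team, requests_per_day, context_tokens, output_tokens,
                   in_price_per_m, out_price_per_m, saving=0.33):
    """Monthly saving = team x requests x 22 working days x token_saving
    x price per million tokens, applied to input and output separately."""
    calls = team * requests_per_day * 22
    saved_in = calls * context_tokens * saving    # input tokens saved
    saved_out = calls * output_tokens * saving    # output tokens saved
    return (saved_in * in_price_per_m + saved_out * out_price_per_m) / 1_000_000

# Hypothetical team: 5 devs, 50 requests/day, 4K context, 800 output tokens,
# example per-million prices (assumed, check your provider's rate card)
print(round(monthly_saving(5, 50, 4000, 800, 2.50, 10.00), 2))  # 32.67
```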



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Next: all the pieces together -- getting started, architecture, benchmarks, and the project roadmap.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of &lt;a href="https://dev.to/delimitter/series/37505"&gt;Token Economics of Code&lt;/a&gt; series by @andbubnov. Pricing: public API rates, April 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>llm</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Hindley-Milner for LLMs: Type Inference Without Annotations</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Sat, 04 Apr 2026 16:26:30 +0000</pubDate>
      <link>https://dev.to/delimitter_8b9077911a3848/hindley-milner-for-llms-type-inference-without-annotations-306e</link>
      <guid>https://dev.to/delimitter_8b9077911a3848/hindley-milner-for-llms-type-inference-without-annotations-306e</guid>
      <description>&lt;h2&gt;
  
  
  Polymorphic Typing: Fewer Tokens, Stronger Guarantees
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; If you've wondered whether it's possible to have strict typing without verbose annotations like Java or TypeScript — the answer is yes. This article explains how, and why it's critical for LLMs.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;33.6% of all LLM-generated code failures are type errors. Can we eliminate them without making the LLM generate type annotations?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Part of &lt;a href="https://dev.to/delimitter/series/37505"&gt;Token Economics of Code&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Types Cost Tokens
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// TypeScript: ~50% of tokens are type annotations&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each annotation means more tokens, more context consumed (quadratic attention cost), and more opportunities for mistakes.&lt;/p&gt;

&lt;p&gt;The ideal: &lt;strong&gt;100% type safety with zero type annotations.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Hindley-Milner: A 1960s Solution to a 2020s Problem
&lt;/h2&gt;

&lt;p&gt;The Hindley-Milner algorithm infers types automatically, requiring zero annotations. Used in Haskell, OCaml, F#, Elm.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Synoema: zero annotations, 100% type safety&lt;/span&gt;
&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="c1"&gt;-- Compiler infers: Int -&amp;gt; Int -&amp;gt; Int&lt;/span&gt;

&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;span class="c1"&gt;-- Compiler infers: forall a. a -&amp;gt; a&lt;/span&gt;

&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="kt"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;
&lt;span class="c1"&gt;-- Compiler infers: forall a b. (a -&amp;gt; b) -&amp;gt; List a -&amp;gt; List b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Algorithm W works in three steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Constraint generation.&lt;/strong&gt; For each expression, create type variables and record constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Unification.&lt;/strong&gt; Solve the constraint system — replace type variables with concrete types. Conflicts = type errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Generalization.&lt;/strong&gt; Remaining free variables become polymorphic: &lt;code&gt;id : a -&amp;gt; a&lt;/code&gt; becomes &lt;code&gt;id : forall a. a -&amp;gt; a&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Key property: &lt;strong&gt;HM always finds the most general type.&lt;/strong&gt;&lt;/p&gt;
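&lt;p&gt;The unification step can be sketched in a few lines. This is a toy model (type variables as strings, arrows as tuples), not Synoema's 1,908-line Rust implementation, and it omits the occurs check:&lt;/p&gt;

```python
# Toy HM unification: type variables are strings like "t0",
# arrow types are tuples like ("->", arg, ret), base types are "Int" etc.

def is_var(t):
    return isinstance(t, str) and t.startswith("t")

def resolve(t, subst):
    """Follow substitution chains until a non-bound term is reached."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    """Extend `subst` so that a and b become equal, or raise TypeError."""
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
        return subst
    raise TypeError(f"cannot unify {a} with {b}")

# add x y = x + y : the (+) constraint forces both arguments to Int
s = unify(("->", "t0", ("->", "t1", "t2")),
          ("->", "Int", ("->", "Int", "Int")), {})
print(resolve("t0", s), resolve("t1", s), resolve("t2", s))  # Int Int Int
```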

&lt;h2&gt;
  
  
  Let-Polymorphism
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;span class="n"&gt;main&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;        &lt;span class="c1"&gt;-- id used as Int -&amp;gt; Int&lt;/span&gt;
  &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;       &lt;span class="c1"&gt;-- id used as Bool -&amp;gt; Bool&lt;/span&gt;
  &lt;span class="n"&gt;a&lt;/span&gt;                 &lt;span class="c1"&gt;-- No error! id is polymorphic.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Interaction with Constrained Decoding
&lt;/h2&gt;

&lt;p&gt;At each generation step, the compiler determines &lt;strong&gt;valid types&lt;/strong&gt; for the next expression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- LLM generates: map ??? [1 2 3]&lt;/span&gt;
&lt;span class="c1"&gt;-- Compiler knows: ??? : Int -&amp;gt; t  (function from Int)&lt;/span&gt;
&lt;span class="c1"&gt;-- Valid: \x -&amp;gt; x + 1, \x -&amp;gt; x * 2&lt;/span&gt;
&lt;span class="c1"&gt;-- Invalid: \x -&amp;gt; x ++ "hello" (String != Int)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grammar + type constraints narrow valid continuations by orders of magnitude.&lt;/p&gt;
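&lt;p&gt;Conceptually, the masking step looks like this. The candidate set and the type tags below are hypothetical; in Synoema the constraint comes from the inference engine, not a hand-written table:&lt;/p&gt;

```python
# Each candidate continuation is tagged with its inferred argument type.
candidates = {
    r"\x -> x + 1":        "Int",
    r"\x -> x * 2":        "Int",
    r'\x -> x ++ "hello"': "String",
}

# Compiler's constraint for `map ??? [1 2 3]`: a function from Int
required_arg = "Int"
valid = [src for src, arg in candidates.items() if arg == required_arg]
print(valid)  # ['\\x -> x + 1', '\\x -> x * 2']
```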

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Type guarantees&lt;/th&gt;
&lt;th&gt;Tokens on types&lt;/th&gt;
&lt;th&gt;Runtime errors&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python (duck typing)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Many&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;~30-50% of code&lt;/td&gt;
&lt;td&gt;Few&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;~40-60% of code&lt;/td&gt;
&lt;td&gt;Few&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synoema (HM)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;None&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Delimitter/synoema
&lt;span class="nb"&gt;cd &lt;/span&gt;synoema/lang
cargo run &lt;span class="nt"&gt;-p&lt;/span&gt; synoema-repl &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="s2"&gt;"id x = x; id 42"&lt;/span&gt;
cargo run &lt;span class="nt"&gt;-p&lt;/span&gt; synoema-repl &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="s2"&gt;"map f [] = []; map f (x:xs) = f x : map f xs; map (&lt;/span&gt;&lt;span class="se"&gt;\x&lt;/span&gt;&lt;span class="s2"&gt; -&amp;gt; x * 2) [1 2 3]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TypeScript vs Synoema — same guarantees, &lt;strong&gt;44% fewer tokens&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// TypeScript: 25 tokens&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;B&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="nx"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Synoema: 14 tokens&lt;/span&gt;
&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="kt"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Impact on LLM Code Generation
&lt;/h2&gt;

&lt;p&gt;With HM, LLMs generate &lt;strong&gt;only semantics&lt;/strong&gt; — the compiler handles types. This is why Synoema achieves 74.8% fewer type errors than syntax-only constrained decoding.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Next: we measured every token across 16 algorithms in 5 languages.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of &lt;a href="https://dev.to/delimitter/series/37505"&gt;Token Economics of Code&lt;/a&gt; series by @andbubnov. HM type inference: 1,908 lines of Rust, 61 tests.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>computerscience</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>JIT vs Interpreters: Benchmarking LLM-Generated Code Execution</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Sat, 04 Apr 2026 16:22:09 +0000</pubDate>
      <link>https://dev.to/delimitter_8b9077911a3848/jit-vs-interpreters-benchmarking-llm-generated-code-execution-1486</link>
      <guid>https://dev.to/delimitter_8b9077911a3848/jit-vs-interpreters-benchmarking-llm-generated-code-execution-1486</guid>
      <description>&lt;h2&gt;
  
  
  Your AI Agent Writes Python. What If It Compiled to Native?
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; If you're building agentic workflows where LLMs generate and execute code — the execution speed of that code directly affects your agent's throughput. This article measures it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Token efficiency is half the story. The other half: how fast does the generated code actually &lt;strong&gt;run&lt;/strong&gt;? We benchmarked Synoema's Cranelift JIT against Python, Node.js, TypeScript (tsx), and C++ (-O2) across 12 algorithmic tasks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Part of &lt;a href="https://dev.to/delimitter/series/37505"&gt;Token Economics of Code&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt; Apple Silicon (macOS Darwin 25.3.0)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtimes:&lt;/strong&gt; Synoema JIT (Cranelift, --release), CPython 3.12, Node.js (V8), TypeScript via tsx, C++ (g++ -O2)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measurement:&lt;/strong&gt; 3 warm-up runs discarded, 5 measured runs, &lt;strong&gt;median&lt;/strong&gt; reported with p5/p95 percentiles.&lt;/p&gt;
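&lt;p&gt;The measurement protocol, sketched in Python (the actual harness is Rust; this just mirrors the warm-up-then-median logic):&lt;/p&gt;

```python
import statistics

def measure(run, warmup=3, samples=5):
    """Discard warm-up runs, then report the median of the measured runs."""
    for _ in range(warmup):
        run()
    return statistics.median(run() for _ in range(samples))

# Deterministic stand-in for a timer: the first runs are slow (cold caches)
clock = iter([9.0, 9.0, 9.0, 5.2, 5.0, 5.3, 5.1, 5.2])
print(measure(lambda: next(clock)))  # 5.2
```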

&lt;p&gt;&lt;strong&gt;Fairness:&lt;/strong&gt; Identical algorithms across all languages. No language-specific optimizations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo run &lt;span class="nt"&gt;--manifest-path&lt;/span&gt; benchmarks/runner/Cargo.toml &lt;span class="nt"&gt;--&lt;/span&gt; run &lt;span class="nt"&gt;--phases&lt;/span&gt; runtime &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results: Overview
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Avg median (ms)&lt;/th&gt;
&lt;th&gt;vs Synoema&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C++ (-O2)&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;td&gt;2.6x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Synoema JIT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;baseline&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python 3.12&lt;/td&gt;
&lt;td&gt;27.6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.3x slower&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Results: Per-Task (12 tasks)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;C++ (ms)&lt;/th&gt;
&lt;th&gt;Synoema (ms)&lt;/th&gt;
&lt;th&gt;Python (ms)&lt;/th&gt;
&lt;th&gt;Synoema vs Python&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;binary_search&lt;/td&gt;
&lt;td&gt;2.1&lt;/td&gt;
&lt;td&gt;7.4&lt;/td&gt;
&lt;td&gt;16.7&lt;/td&gt;
&lt;td&gt;2.3x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;collatz&lt;/td&gt;
&lt;td&gt;2.3&lt;/td&gt;
&lt;td&gt;5.7&lt;/td&gt;
&lt;td&gt;16.4&lt;/td&gt;
&lt;td&gt;2.9x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;factorial&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;td&gt;&lt;em&gt;JIT fail&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;17.2&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fibonacci&lt;/td&gt;
&lt;td&gt;3.7&lt;/td&gt;
&lt;td&gt;&lt;em&gt;JIT fail&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;145.6&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;filter_map&lt;/td&gt;
&lt;td&gt;2.3&lt;/td&gt;
&lt;td&gt;5.2&lt;/td&gt;
&lt;td&gt;16.6&lt;/td&gt;
&lt;td&gt;3.2x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fizzbuzz&lt;/td&gt;
&lt;td&gt;1.7&lt;/td&gt;
&lt;td&gt;5.7&lt;/td&gt;
&lt;td&gt;16.8&lt;/td&gt;
&lt;td&gt;3.0x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gcd&lt;/td&gt;
&lt;td&gt;2.4&lt;/td&gt;
&lt;td&gt;5.6&lt;/td&gt;
&lt;td&gt;16.8&lt;/td&gt;
&lt;td&gt;3.0x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matrix_mult&lt;/td&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;td&gt;8.4&lt;/td&gt;
&lt;td&gt;17.6&lt;/td&gt;
&lt;td&gt;2.1x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mergesort&lt;/td&gt;
&lt;td&gt;2.1&lt;/td&gt;
&lt;td&gt;6.6&lt;/td&gt;
&lt;td&gt;17.4&lt;/td&gt;
&lt;td&gt;2.6x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;quicksort&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;td&gt;6.0&lt;/td&gt;
&lt;td&gt;16.7&lt;/td&gt;
&lt;td&gt;2.8x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;string_ops&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;td&gt;5.1&lt;/td&gt;
&lt;td&gt;16.3&lt;/td&gt;
&lt;td&gt;3.2x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tree_traverse&lt;/td&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;td&gt;6.5&lt;/td&gt;
&lt;td&gt;17.0&lt;/td&gt;
&lt;td&gt;2.6x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;factorial&lt;/strong&gt; and &lt;strong&gt;fibonacci&lt;/strong&gt; fail in JIT mode (known limitation -- being addressed).&lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  JIT Compilation Overhead
&lt;/h3&gt;

&lt;p&gt;Synoema's times include Cranelift JIT compilation (10-50ms one-time cost). For short tasks, this overhead is visible. For longer computations, it's negligible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insight: JIT overhead is constant, interpreter overhead is proportional to work.&lt;/strong&gt;&lt;/p&gt;
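&lt;p&gt;A toy cost model makes the break-even point concrete. The per-unit costs below are made-up illustrative numbers, not measurements:&lt;/p&gt;

```python
COMPILE_MS = 30.0      # one-time JIT compilation cost (assumed)
JIT_PER_UNIT = 0.5     # ms per unit of work, native code (assumed)
INTERP_PER_UNIT = 2.5  # ms per unit of work, interpreter (assumed)

def jit_total(units):
    return COMPILE_MS + units * JIT_PER_UNIT

def interp_total(units):
    return units * INTERP_PER_UNIT

# Break-even point: fixed compile cost divided by the per-unit advantage
break_even = COMPILE_MS / (INTERP_PER_UNIT - JIT_PER_UNIT)
print(break_even)                          # 15.0
print(jit_total(100) < interp_total(100))  # True: JIT wins once work grows
```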

&lt;h3&gt;
  
  
  Where Synoema Wins
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recursive algorithms&lt;/strong&gt;: no interpreter loop overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tight numeric loops&lt;/strong&gt; (collatz, gcd): native integer operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern matching&lt;/strong&gt;: compiled to jump tables&lt;/li&gt;
&lt;/ul&gt;
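&lt;p&gt;The jump-table idea in miniature: dispatch on a constructor tag through a table lookup instead of a chain of tests. The shapes here are hypothetical, chosen only to show the form of the compiled dispatch:&lt;/p&gt;

```python
# Pattern-match dispatch as a jump table: one lookup selects the branch,
# instead of testing each pattern in sequence.
def area(shape):
    tag, *fields = shape
    return AREA_TABLE[tag](*fields)

AREA_TABLE = {
    "circle": lambda r: 3.14159 * r * r,
    "rect":   lambda w, h: w * h,
}

print(area(("rect", 3, 4)))  # 12
```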

&lt;h3&gt;
  
  
  Where Synoema Loses
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;String-heavy operations&lt;/strong&gt;: Python's C-implemented string routines are highly optimized, so the gap narrows as string work grows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Very short programs&lt;/strong&gt;: JIT overhead dominates when computation &amp;lt; 10ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vs C++, always&lt;/strong&gt;: Cranelift generates code at roughly 86% of LLVM/GCC quality&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Honest Comparison
&lt;/h3&gt;

&lt;p&gt;The comparison that matters for AI agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Synoema (JIT, type-safe, fewer tokens on functional code)
    vs
Python (interpreted, duck-typed, dominant in LLM generation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implications for AI Agents
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python:   generate (1.5s) -&amp;gt; interpret (Nms)
Synoema:  generate (0.8s, fewer tokens) -&amp;gt; JIT (50ms) -&amp;gt; native (N/4 ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real question: &lt;strong&gt;what's the total cost of the generate -&amp;gt; execute -&amp;gt; analyze cycle?&lt;/strong&gt; Token efficiency + compilation speed + type guarantees create compound savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/synoema/synoema
&lt;span class="nb"&gt;cd &lt;/span&gt;synoema
cargo run &lt;span class="nt"&gt;--manifest-path&lt;/span&gt; benchmarks/runner/Cargo.toml &lt;span class="nt"&gt;--&lt;/span&gt; run &lt;span class="nt"&gt;--phases&lt;/span&gt; runtime &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Next: we sent the same prompts to 10 LLM models and measured who generates correct Synoema code.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of &lt;a href="https://dev.to/delimitter/series/37505"&gt;Token Economics of Code&lt;/a&gt; series by @andbubnov.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>llm</category>
      <category>performance</category>
      <category>python</category>
    </item>
    <item>
      <title>Compilation for LLMs: Why a Language for Models Needs Native Code</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Sat, 04 Apr 2026 15:59:08 +0000</pubDate>
      <link>https://dev.to/delimitter_8b9077911a3848/compilation-for-llms-why-a-language-for-models-needs-native-code-3jal</link>
      <guid>https://dev.to/delimitter_8b9077911a3848/compilation-for-llms-why-a-language-for-models-needs-native-code-3jal</guid>
      <description>&lt;h2&gt;
  
  
  Cranelift JIT, 2.8--5.9x Faster Than Python, and Why It Matters for AI Agents
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; If you're building AI agents that generate and execute code, or want to understand why compiled LLM output isn't science fiction but working technology -- read on. All terms explained inline and in the glossary.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;In &lt;a href="https://dev.to/delimitter/series/37505"&gt;previous articles&lt;/a&gt;, we showed how to &lt;a href="https://dev.to/delimitter/token-economics-of-code-why-llm-generated-code-costs-too-much-and-how-to-fix-it-hhn"&gt;cut tokens by 46%&lt;/a&gt; and &lt;a href="https://dev.to/delimitter/constrained-decoding-how-to-guarantee-llm-generated-code-is-syntactically-correct-4jl4"&gt;guarantee syntactic correctness&lt;/a&gt;. But there's a third problem: generated code must not only be short and correct -- it must be &lt;strong&gt;fast&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context: LLM Agents Write and Run Code
&lt;/h2&gt;

&lt;p&gt;Claude Code, Cursor, Devin, OpenAI Codex -- these tools don't just generate code. They &lt;strong&gt;execute&lt;/strong&gt; it: run tests, process data, call APIs. The cycle "generate -&amp;gt; run -&amp;gt; analyze result -&amp;gt; repeat" is the foundation of agentic workflows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agentic workflows&lt;/strong&gt; -- an approach where an LLM acts as an autonomous "agent": receives a task, breaks it into steps, writes code, runs it, analyzes the result, and adjusts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem: almost all agents generate &lt;strong&gt;Python&lt;/strong&gt;. And Python is interpreted.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Interpreted language&lt;/strong&gt; -- a language whose code is executed "line by line" by an interpreter, without prior compilation to machine code. Interpreted languages are simpler but 10--100x slower than compiled ones.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means: every run goes through the CPython interpreter (slow, single-threaded), no code optimization (Python doesn't know types until runtime via duck typing), and serious computation requires C-based libraries (NumPy, pandas).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Duck typing&lt;/strong&gt; -- Python's principle: "if it walks like a duck and quacks like a duck, it's a duck." Type errors are discovered only at runtime.&lt;/p&gt;
&lt;/blockquote&gt;
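&lt;p&gt;A two-line illustration of why this matters for generated code: the bug below passes every static check Python performs, and only surfaces when the bad call actually executes:&lt;/p&gt;

```python
def add(a, b):
    return a + b          # looks fine to Python until runtime

print(add(2, 3))          # 5
try:
    add(2, "3")           # the type error surfaces only here
except TypeError:
    print("TypeError at runtime")
```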

&lt;h2&gt;
  
  
  The Solution: JIT Compilation
&lt;/h2&gt;

&lt;p&gt;What if LLM-generated code &lt;strong&gt;compiles to native machine code&lt;/strong&gt; in milliseconds and runs at C speed?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;JIT (Just-In-Time) compilation&lt;/strong&gt; -- compiling code to machine instructions immediately before execution, "on the fly." No separate build step. LLM generates code -&amp;gt; JIT compiles in milliseconds -&amp;gt; native execution speed.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM generates code (.sno)
    |
Parser -&amp;gt; AST -&amp;gt; Type Check -&amp;gt; Core IR
    |
Cranelift JIT -&amp;gt; native machine code (x86-64 / AArch64)
    |
Execution at C/Rust speed (no interpreter)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire cycle -- from text to native code -- takes &lt;strong&gt;&amp;lt; 100 ms&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Cranelift, Not LLVM
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LLVM&lt;/strong&gt; -- the industry standard. Used in Clang (C/C++), Rust, Swift, Julia. Generates very fast code but compiles &lt;strong&gt;slowly&lt;/strong&gt;. Written in C++, pulls gigabytes of dependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cranelift&lt;/strong&gt; -- written in pure Rust. Compiles &lt;strong&gt;10x faster&lt;/strong&gt; than LLVM. Generates code ~86% the quality of LLVM. Ideal for JIT.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;LLVM&lt;/th&gt;
&lt;th&gt;Cranelift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;C++&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compilation speed&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code quality&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;~86%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;Gigabytes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cargo build&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ideal for&lt;/td&gt;
&lt;td&gt;AOT compilation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;JIT compilation&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Benchmarks: Synoema JIT vs Python vs C++
&lt;/h2&gt;

&lt;p&gt;Methodology: median of 5 runs, 3 warm-up discarded. All times include process startup; Synoema times include JIT compilation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Suite (10 tasks)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;C++&lt;/th&gt;
&lt;th&gt;Synoema JIT&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Synoema vs Python&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;quicksort&lt;/td&gt;
&lt;td&gt;1.4 ms&lt;/td&gt;
&lt;td&gt;6.0 ms&lt;/td&gt;
&lt;td&gt;16.7 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.8x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mergesort&lt;/td&gt;
&lt;td&gt;2.1 ms&lt;/td&gt;
&lt;td&gt;6.6 ms&lt;/td&gt;
&lt;td&gt;17.4 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.6x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;binary_search&lt;/td&gt;
&lt;td&gt;2.1 ms&lt;/td&gt;
&lt;td&gt;7.4 ms&lt;/td&gt;
&lt;td&gt;16.7 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.3x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tree_traverse&lt;/td&gt;
&lt;td&gt;1.5 ms&lt;/td&gt;
&lt;td&gt;6.5 ms&lt;/td&gt;
&lt;td&gt;17.0 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.6x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;filter_map&lt;/td&gt;
&lt;td&gt;2.3 ms&lt;/td&gt;
&lt;td&gt;5.2 ms&lt;/td&gt;
&lt;td&gt;16.6 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;collatz&lt;/td&gt;
&lt;td&gt;2.3 ms&lt;/td&gt;
&lt;td&gt;5.7 ms&lt;/td&gt;
&lt;td&gt;16.4 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.9x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gcd&lt;/td&gt;
&lt;td&gt;2.4 ms&lt;/td&gt;
&lt;td&gt;5.6 ms&lt;/td&gt;
&lt;td&gt;16.8 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fizzbuzz&lt;/td&gt;
&lt;td&gt;1.7 ms&lt;/td&gt;
&lt;td&gt;5.7 ms&lt;/td&gt;
&lt;td&gt;16.8 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matrix_mult&lt;/td&gt;
&lt;td&gt;1.5 ms&lt;/td&gt;
&lt;td&gt;8.4 ms&lt;/td&gt;
&lt;td&gt;17.6 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.1x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;string_ops&lt;/td&gt;
&lt;td&gt;2.0 ms&lt;/td&gt;
&lt;td&gt;5.1 ms&lt;/td&gt;
&lt;td&gt;16.3 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.9 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.2 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.8 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.8x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Compute-Heavy Tasks
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Synoema JIT&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;fib(30)&lt;/td&gt;
&lt;td&gt;277 ms&lt;/td&gt;
&lt;td&gt;47 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.9x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;collatz (10K)&lt;/td&gt;
&lt;td&gt;505 ms&lt;/td&gt;
&lt;td&gt;90 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.6x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gcd (100K)&lt;/td&gt;
&lt;td&gt;143 ms&lt;/td&gt;
&lt;td&gt;83 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.7x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.4x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What the Numbers Mean
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Micro-benchmarks:&lt;/strong&gt; 2.1--3.2x faster. Startup overhead dominates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute-heavy tasks:&lt;/strong&gt; up to 5.9x faster. JIT-compiled native code pulls ahead as startup cost amortizes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C++ context:&lt;/strong&gt; C++ runs about 3x faster than Synoema JIT on average -- expected, since Cranelift generates code at roughly 86% of LLVM's quality. The trade-off: Synoema compiles in &amp;lt; 100 ms, with no separate build step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source code (.sno)
  |
  +-- Lexer (735 lines, 82 tests)
  +-- Parser (1,672 lines, 43 tests) -- Pratt parser -&amp;gt; AST
  +-- Type Checker (1,908 lines, 61 tests) -- Hindley-Milner
  +-- Core IR (1,536 lines, 44 tests) -- System F
  +-- Diagnostics -- structured errors, LLM hints
  +-- Backend:
      +-- Interpreter (1,894 lines, 119 tests)
      +-- Cranelift JIT (3,044 lines, 126 tests)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;8 crates, ~12,000 lines of Rust, 890+ tests, 0 errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for AI Agents
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;With Python:&lt;/strong&gt; LLM generates script (200 tokens, 1.5s) -&amp;gt; Python processes (12s) -&amp;gt; total ~15 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With Synoema:&lt;/strong&gt; LLM generates sno code (108 tokens, 0.8s) -&amp;gt; JIT (50ms) -&amp;gt; native (3s) -&amp;gt; total ~4 seconds.&lt;/p&gt;

&lt;p&gt;Savings: &lt;strong&gt;73% time&lt;/strong&gt;, &lt;strong&gt;46% tokens&lt;/strong&gt;, &lt;strong&gt;zero dependencies&lt;/strong&gt;.&lt;/p&gt;
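&lt;p&gt;The headline percentages follow directly from the rounded totals above:&lt;/p&gt;

```python
# Back-of-envelope check of the pipeline numbers quoted above.
python_total = 15.0   # ~1.5 s generation + ~12 s interpreted run, rounded
synoema_total = 4.0   # ~0.8 s generation + 50 ms JIT + ~3 s native run
time_saving = 1 - synoema_total / python_total
token_saving = 1 - 108 / 200
print(f"{time_saving:.0%} time, {token_saving:.0%} tokens")
```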

&lt;h2&gt;
  
  
  What's Changed Since We Started
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;890+ tests&lt;/strong&gt; (from 264), all passing, 0 warnings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JIT supports&lt;/strong&gt;: closures, records, ADTs, pattern matching, modules, TCO, string stdlib, float arithmetic, type class dispatch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prelude&lt;/strong&gt;: Result type with combinators (map_ok, unwrap, and_then)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP server&lt;/strong&gt;: &lt;code&gt;npx synoema-mcp&lt;/code&gt; integrates into LLM toolchains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Region inference&lt;/strong&gt;: memory management without GC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diagnostics&lt;/strong&gt;: structured errors with LLM-friendly hints&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; synoema-repl
cargo run &lt;span class="nt"&gt;-p&lt;/span&gt; synoema-repl &lt;span class="nt"&gt;--&lt;/span&gt; jit examples/quicksort.sno
cargo run &lt;span class="nt"&gt;-p&lt;/span&gt; synoema-repl &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="s2"&gt;"6 * 7"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source: &lt;a href="https://github.com/Delimitter/synoema" rel="noopener noreferrer"&gt;github.com/Delimitter/synoema&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Next: &lt;strong&gt;Hindley-Milner type inference&lt;/strong&gt; -- full type safety with zero annotations. This is what makes type-guided constrained decoding possible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of &lt;a href="https://dev.to/delimitter/series/37505"&gt;Token Economics of Code&lt;/a&gt; series by @andbubnov.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>llm</category>
      <category>programming</category>
      <category>compiler</category>
    </item>
    <item>
      <title>Token Efficiency: 16 Algorithms, 5 Languages, Zero Guesswork</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Sat, 04 Apr 2026 15:36:19 +0000</pubDate>
      <link>https://dev.to/delimitter_8b9077911a3848/token-efficiency-16-algorithms-5-languages-zero-guesswork-4k04</link>
      <guid>https://dev.to/delimitter_8b9077911a3848/token-efficiency-16-algorithms-5-languages-zero-guesswork-4k04</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; If you use LLMs to generate code — or pay for API tokens — this article shows exactly where your budget goes. Every number is reproducible. No opinions, just data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In previous articles, we explained &lt;a href="https://dev.to/delimitter/why-every-token-costs-more-than-you-think-4a5g"&gt;why tokens are expensive&lt;/a&gt; (quadratic attention cost) and &lt;a href="https://dev.to/delimitter/the-anatomy-of-bpe-why-python-wastes-46-of-tokens-bfj"&gt;how BPE tokenization works&lt;/a&gt;. Now we show the full data: 16 algorithms implemented in 5 languages, every token counted with tiktoken (cl100k_base).&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tokenizer:&lt;/strong&gt; tiktoken cl100k_base (used by GPT-4, GPT-4o, Claude).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Languages:&lt;/strong&gt; Synoema, Python, JavaScript, TypeScript, C++.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt; 16 algorithmic problems covering recursion, higher-order functions, data structures, string operations, pattern matching, error handling, and custom types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fairness rules:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identical algorithms (same logic, not idiomatic rewrites)&lt;/li&gt;
&lt;li&gt;SPDX license headers stripped before counting&lt;/li&gt;
&lt;li&gt;No comments counted&lt;/li&gt;
&lt;li&gt;Minimal imports (only what's needed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reproducibility:&lt;/strong&gt; &lt;code&gt;cargo run --manifest-path benchmarks/runner/Cargo.toml -- run --phases token&lt;/code&gt;&lt;/p&gt;
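&lt;p&gt;For intuition, here is a sketch of the preprocessing those fairness rules imply: strip SPDX headers and comments, then hand the remainder to the tokenizer. The &lt;code&gt;prepare&lt;/code&gt; helper is hypothetical -- the real counting is done by the Rust benchmark runner.&lt;/p&gt;

```python
import re

# Sketch of the fairness-rule preprocessing: drop SPDX license headers
# and comment lines, and strip trailing comments, before token counting.
def prepare(source: str, comment_prefix: str = "#") -> str:
    lines = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("SPDX-License-Identifier") or \
           stripped.startswith(comment_prefix):
            continue
        # remove trailing comments too
        line = re.sub(re.escape(comment_prefix) + r".*$", "", line).rstrip()
        if line:
            lines.append(line)
    return "\n".join(lines)

src = "# SPDX-License-Identifier: MIT\ndef gcd(a, b):  # Euclid\n    return a if b == 0 else gcd(b, a % b)\n"
print(prepare(src))
```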

&lt;h2&gt;
  
  
  Results: Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;Language&lt;/span&gt;          &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Avg&lt;/span&gt; &lt;span class="k"&gt;tokens&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;task&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;vs&lt;/span&gt; &lt;span class="k"&gt;Synoema&lt;/span&gt;
&lt;span class="err"&gt;------------------|-----------------|----------&lt;/span&gt;
&lt;span class="k"&gt;Synoema&lt;/span&gt;           &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;92.5&lt;/span&gt;            &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;baseline&lt;/span&gt;
&lt;span class="k"&gt;Python&lt;/span&gt;            &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;92.9&lt;/span&gt;            &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;JavaScript&lt;/span&gt;        &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;94.9&lt;/span&gt;            &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;3&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;TypeScript&lt;/span&gt;        &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;114.6&lt;/span&gt;           &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;24&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="err"&gt;++&lt;/span&gt;               &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;166.6&lt;/span&gt;           &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;80&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wait — +0% vs Python?&lt;/strong&gt; Yes. The headline "46% savings" from our earlier work was measured on 12 purely functional programs. With 16 diverse tasks including imperative-style algorithms, the picture changes. Keep reading — the per-category breakdown tells the real story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: Per-Task Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Synoema Wins: Functional &amp;amp; Pattern-Heavy Tasks
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;Task&lt;/span&gt;             &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Synoema&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Python&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;JS&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;TS&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="err"&gt;++&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;vs&lt;/span&gt; &lt;span class="k"&gt;Python&lt;/span&gt;
&lt;span class="err"&gt;-----------------|---------|--------|-----|-----|-----|----------&lt;/span&gt;
&lt;span class="k"&gt;factorial&lt;/span&gt;        &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;25&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;32&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;35&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;38&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;58&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;22&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;fibonacci&lt;/span&gt;        &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;38&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;49&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;52&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;55&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;75&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;22&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;gcd&lt;/span&gt;              &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;26&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;35&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;38&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;43&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;61&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;26&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;quicksort&lt;/span&gt;        &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;77&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;124&lt;/span&gt;    &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;111&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;115&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;245&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;38&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;json&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;build&lt;/span&gt;       &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;32&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;67&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;60&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;81&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;156&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;52&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;pattern&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;match&lt;/span&gt;    &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;136&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;225&lt;/span&gt;    &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;182&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;261&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;277&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;40&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;definition&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;83&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;118&lt;/span&gt;    &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;127&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;189&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;204&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;30&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Average saving on functional tasks: &lt;strong&gt;-33% vs Python&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Near-Equal: General Algorithms
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;Task&lt;/span&gt;             &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Synoema&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Python&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;JS&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;TS&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="err"&gt;++&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;vs&lt;/span&gt; &lt;span class="k"&gt;Python&lt;/span&gt;
&lt;span class="err"&gt;-----------------|---------|--------|-----|-----|-----|----------&lt;/span&gt;
&lt;span class="k"&gt;collatz&lt;/span&gt;          &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;55&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;60&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;63&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;66&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;87&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;8&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;fizzbuzz&lt;/span&gt;         &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;59&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;63&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;66&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;69&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;97&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;6&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;tree&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;traverse&lt;/span&gt;    &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;129&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;130&lt;/span&gt;    &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;117&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;163&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;338&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;error&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;handling&lt;/span&gt;   &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;95&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;90&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;101&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;144&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;179&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;6&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;mergesort&lt;/span&gt;        &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;194&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;179&lt;/span&gt;    &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;179&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;189&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;320&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;8&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;filter&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;       &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;32&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;28&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;62&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;76&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;73&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;14&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Synoema Loses: Imperative &amp;amp; Index-Heavy Tasks
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;Task&lt;/span&gt;             &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Synoema&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Python&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;JS&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;TS&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;C&lt;/span&gt;&lt;span class="err"&gt;++&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;vs&lt;/span&gt; &lt;span class="k"&gt;Python&lt;/span&gt;
&lt;span class="err"&gt;-----------------|---------|--------|-----|-----|-----|----------&lt;/span&gt;
&lt;span class="k"&gt;binary&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;search&lt;/span&gt;    &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;159&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;120&lt;/span&gt;    &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;129&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;134&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;175&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;33&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;string&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;ops&lt;/span&gt;       &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;28&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;15&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;17&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;19&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;52&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;87&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;matrix&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;mult&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;312&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;152&lt;/span&gt;    &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;180&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;191&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;269&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;105&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where the Savings Come From
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Function Definitions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python: 6 tokens
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Synoema: 2 tokens for the definition syntax&lt;/span&gt;
&lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;def&lt;/code&gt;, &lt;code&gt;(&lt;/code&gt;, &lt;code&gt;)&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;return&lt;/code&gt; — 5 syntactic tokens that carry zero semantic information. Synoema uses pattern matching: &lt;code&gt;fac n =&lt;/code&gt; is 3 tokens total (name, parameter, equals).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Conditionals
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python: if/elif/else = 3 keyword tokens + colons
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Synoema: ? -&amp;gt; : = 3 single-character tokens&lt;/span&gt;
&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Lists
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python: commas between elements
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# 9 tokens (5 numbers + 4 commas)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Synoema: space-separated&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;-- 7 tokens (5 numbers + 2 brackets)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every comma is a wasted token. In a list of N elements, Python wastes N-1 tokens on commas.&lt;/p&gt;
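&lt;p&gt;The comma tax is easy to check mechanically. A tiny sketch, assuming one token per comma, as in the counts above:&lt;/p&gt;

```python
# A comma-separated N-element list literal spends N-1 tokens on
# separators that a space-separated rendering skips entirely.
def comma_overhead(n):
    return max(n - 1, 0)

print(comma_overhead(5))   # the [1, 2, 3, 4, 5] example above
print(comma_overhead(100))
```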

&lt;h3&gt;
  
  
  4. Type Annotations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// TypeScript: ~50% of tokens are type information&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;B&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="nx"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Synoema: zero type tokens, compiler infers everything&lt;/span&gt;
&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="kt"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;
&lt;span class="c1"&gt;-- Inferred: forall a b. (a -&amp;gt; b) -&amp;gt; List a -&amp;gt; List b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. C++ Ceremony
&lt;/h3&gt;

&lt;p&gt;C++ pays the highest tax: &lt;code&gt;#include&lt;/code&gt;, &lt;code&gt;template&amp;lt;typename T&amp;gt;&lt;/code&gt;, &lt;code&gt;std::vector&amp;lt;int&amp;gt;&lt;/code&gt;, &lt;code&gt;{&lt;/code&gt;, &lt;code&gt;}&lt;/code&gt;, and a &lt;code&gt;;&lt;/code&gt; after every statement. These structural tokens dominate -- hence the +80% average overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Picture
&lt;/h2&gt;

&lt;p&gt;The data tells a nuanced story:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synoema saves big (-22% to -52%)&lt;/strong&gt; when the task is naturally functional:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pattern matching (&lt;code&gt;case&lt;/code&gt; expressions vs Python's if/elif chains)&lt;/li&gt;
&lt;li&gt;List processing (comprehensions, cons, no commas)&lt;/li&gt;
&lt;li&gt;Custom type definitions (ADTs vs Python classes)&lt;/li&gt;
&lt;li&gt;Recursive algorithms (no &lt;code&gt;def&lt;/code&gt;/&lt;code&gt;return&lt;/code&gt; overhead)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Synoema loses (+33% to +105%)&lt;/strong&gt; when the task is imperative:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;matrix_mult&lt;/strong&gt;: no array indexing, must simulate with &lt;code&gt;get xs n&lt;/code&gt; helper (+105%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;binary_search&lt;/strong&gt;: same problem — linked-list traversal vs &lt;code&gt;xs[mid]&lt;/code&gt; (+33%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;string_ops&lt;/strong&gt;: Python's built-in methods (&lt;code&gt;s.upper()&lt;/code&gt;) are extremely concise (+87%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The critical insight:&lt;/strong&gt; Synoema is optimized for the code patterns LLMs most commonly generate — function definitions, recursion, data transformation. The tasks where Synoema loses (imperative loops, indexed access) are patterns where LLMs already generate efficient Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Your LLM Budget
&lt;/h2&gt;

&lt;p&gt;For &lt;strong&gt;functional-style code&lt;/strong&gt; (the majority of LLM-generated algorithms), the savings compound via quadratic attention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;Metric&lt;/span&gt;             &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Python&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Synoema&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Saving&lt;/span&gt;
&lt;span class="err"&gt;-------------------|--------|---------|-------&lt;/span&gt;
&lt;span class="k"&gt;Tokens&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="mf"&gt;93&lt;/span&gt;    &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="mf"&gt;60&lt;/span&gt;     &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;33&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;span class="k"&gt;Attention&lt;/span&gt; &lt;span class="k"&gt;compute&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;649&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;600&lt;/span&gt;   &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="mf"&gt;58&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For &lt;strong&gt;mixed workloads&lt;/strong&gt; (all 16 tasks equally weighted), savings are negligible on tokens but Synoema still wins on type safety and compilation speed.&lt;/p&gt;
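&lt;p&gt;The compounding works because attention cost is quadratic in sequence length: a token saving of &lt;code&gt;t&lt;/code&gt; yields a compute saving of &lt;code&gt;1 - (1 - t)^2&lt;/code&gt;. A quick check against the numbers above:&lt;/p&gt;

```python
# Quadratic attention: compute scales with the square of sequence
# length, so relative savings compound.
def compute_saving(token_saving):
    return 1 - (1 - token_saving) ** 2

print(f"{compute_saving(0.30):.0%}")       # the 30-percent rule of thumb
print(f"{compute_saving(1 - 60/93):.0%}")  # the table's 93 vs 60 tokens
```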

&lt;p&gt;The takeaway: &lt;strong&gt;choose the right tool for the task.&lt;/strong&gt; Synoema excels at exactly the code patterns where LLMs benefit most from token reduction — function definitions, data transformations, pattern matching. For imperative array manipulation, Python remains competitive on token efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Delimitter/synoema
&lt;span class="nb"&gt;cd &lt;/span&gt;synoema

&lt;span class="c"&gt;# Run token benchmarks (no runtime deps needed)&lt;/span&gt;
cargo run &lt;span class="nt"&gt;--manifest-path&lt;/span&gt; benchmarks/runner/Cargo.toml &lt;span class="nt"&gt;--&lt;/span&gt; run &lt;span class="nt"&gt;--phases&lt;/span&gt; token

&lt;span class="c"&gt;# Results saved to benchmarks/results/&amp;lt;date&amp;gt;/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Next in the series: runtime benchmarks — token efficiency is half the story. How fast does the generated code actually run?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of &lt;a href="https://dev.to/delimitter/series/37505"&gt;Token Economics of Code&lt;/a&gt; series. Token counts: tiktoken cl100k_base, reproducible via open-source benchmark suite.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;BPE (Byte Pair Encoding)&lt;/strong&gt; — Tokenization algorithm used by GPT-4, Claude. Splits text into subword tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cl100k_base&lt;/strong&gt; — OpenAI's BPE vocabulary (~100K tokens). Used by GPT-4, and the reference vocabulary for all token counts in this series.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tiktoken&lt;/strong&gt; — Python library for exact BPE token counting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quadratic attention&lt;/strong&gt; — Transformer attention cost scales as O(n²) with sequence length, so 30% fewer tokens means roughly 51% less attention compute (0.7² ≈ 0.49).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Syntactic overhead&lt;/strong&gt; — Tokens required by language grammar that carry no semantic information.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>benchmark</category>
      <category>llm</category>
      <category>programming</category>
      <category>rust</category>
    </item>
    <item>
      <title>Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Fri, 03 Apr 2026 08:18:05 +0000</pubDate>
      <link>https://dev.to/delimitter_8b9077911a3848/type-guided-constrained-decoding-how-to-stop-llms-from-hallucinating-code-5hbc</link>
      <guid>https://dev.to/delimitter_8b9077911a3848/type-guided-constrained-decoding-how-to-stop-llms-from-hallucinating-code-5hbc</guid>
      <description>&lt;h2&gt;
  
  
  From GBNF Grammars to Type-Directed Generation: Guarantees Instead of Hope
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; If you've ever had ChatGPT generate code that doesn't compile — this article explains how to eliminate that completely. All technical terms explained in footnotes and the glossary at the end.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;In previous articles, we showed that reducing tokens saves money, energy, and compute. But there's a more serious problem: LLMs generate &lt;strong&gt;incorrect&lt;/strong&gt; code. And every retry doubles the token spend.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scale of the Problem
&lt;/h2&gt;

&lt;p&gt;Type errors account for 33.6% of all failures in LLM-generated code (Mündler et al., PLDI 2025&lt;sup id="fnref1"&gt;1&lt;/sup&gt;). These aren't typos — they're structural errors: wrong argument types, incompatible return values, accessing nonexistent fields.&lt;/p&gt;

&lt;p&gt;When an LLM generates a sort function that doesn't compile, the cost doubles — either a human fixes it (time) or an agent retries (tokens).&lt;/p&gt;

&lt;p&gt;But what if the model &lt;strong&gt;physically cannot&lt;/strong&gt; generate syntactically invalid code?&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Levels of Constraints
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Level 1: Grammar (Syntactic Correctness)
&lt;/h3&gt;

&lt;p&gt;At each generation step, the set of grammatically&lt;sup id="fnref2"&gt;2&lt;/sup&gt; valid tokens is determined. All others are masked — probability set to zero.&lt;/p&gt;

&lt;p&gt;Example: if the model just generated &lt;code&gt;[&lt;/code&gt;, then the next token can be a number, identifier, &lt;code&gt;]&lt;/code&gt;, or &lt;code&gt;[&lt;/code&gt; — but not &lt;code&gt;+&lt;/code&gt;, not &lt;code&gt;=&lt;/code&gt;, not &lt;code&gt;)&lt;/code&gt;.&lt;/p&gt;
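&lt;p&gt;In miniature, the masking step looks like this. This is a toy sketch with a hypothetical two-state grammar table, not any real engine's API:&lt;/p&gt;

```python
# Toy grammar-constrained masking: per grammar state, only the listed
# tokens keep their probability; everything else is zeroed out.
# The two-state grammar table is a hypothetical illustration.

NEXT_ALLOWED = {
    "after_lbracket": {"NUMBER", "IDENT", "]", "["},  # model just emitted '['
    "after_number":   {",", "]", "+", "-"},
}

def mask_probs(probs, state):
    """Zero the probability of every token the grammar forbids here."""
    allowed = NEXT_ALLOWED[state]
    return {tok: (p if tok in allowed else 0.0) for tok, p in probs.items()}

probs = {"NUMBER": 0.4, "+": 0.3, "]": 0.2, ")": 0.1}
masked = mask_probs(probs, "after_lbracket")
# '+' and ')' are invalid after '[' and end up with probability 0
```

&lt;p&gt;Real engines apply the same idea to the model's logits before sampling, so an invalid token can never be drawn.&lt;/p&gt;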

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;XGrammar&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;/strong&gt; — default backend in SGLang&lt;sup id="fnref4"&gt;4&lt;/sup&gt;, vLLM, TensorRT-LLM&lt;sup id="fnref5"&gt;5&lt;/sup&gt;. Works with context-free grammars (CFG&lt;sup id="fnref6"&gt;6&lt;/sup&gt;). Approaches zero overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outlines&lt;sup id="fnref7"&gt;7&lt;/sup&gt;&lt;/strong&gt; — structured generation via finite state machines (FSM&lt;sup id="fnref8"&gt;8&lt;/sup&gt;). Supports regex and CFG.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp&lt;sup id="fnref9"&gt;9&lt;/sup&gt;&lt;/strong&gt; — built-in GBNF grammar&lt;sup id="fnref10"&gt;10&lt;/sup&gt; support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guidance&lt;/strong&gt; (Microsoft) — template-based generation with constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: &lt;strong&gt;100% syntactic correctness.&lt;/strong&gt; Every generated fragment is a valid program.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 2: Types (Semantic Correctness)
&lt;/h3&gt;

&lt;p&gt;Grammar guarantees that &lt;code&gt;f x = x + 1&lt;/code&gt; is syntactically valid. But not that &lt;code&gt;x&lt;/code&gt; is a number. Type-constrained decoding&lt;sup id="fnref11"&gt;11&lt;/sup&gt; adds a second layer: only tokens compatible with the current type context are allowed.&lt;/p&gt;

&lt;p&gt;Mündler et al. (PLDI 2025) showed that type-constrained decoding reduces compilation errors by &lt;strong&gt;74.8%&lt;/strong&gt; compared to 9.0% for syntax-only constraints.&lt;/p&gt;

&lt;p&gt;This requires type inference&lt;sup id="fnref12"&gt;12&lt;/sup&gt; — so the compiler can determine valid types at every generation point without explicit annotations.&lt;/p&gt;
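&lt;p&gt;A toy sketch of the idea follows; the token-to-type table is hypothetical, whereas a real system derives it from type inference over the partial program:&lt;/p&gt;

```python
# Toy type-constrained filter: drop candidate tokens whose inferred type
# clashes with the type expected at the current generation point.
# The token-to-type table is hypothetical.

TOKEN_TYPES = {
    "0": "Int",
    "1": "Int",
    '"hi"': "String",
    "True": "Bool",
}

def type_filter(candidates, expected_type):
    """Keep only candidates whose type matches the expectation."""
    return [t for t in candidates if TOKEN_TYPES.get(t) == expected_type]

# If the signature demands an Int argument, string and boolean literals
# are masked before sampling.
allowed = type_filter(["0", '"hi"', "True", "1"], "Int")
```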

&lt;h3&gt;
  
  
  Level 3: Specification (Logical Correctness)
&lt;/h3&gt;

&lt;p&gt;The most powerful level: constraints based on formal specification. A sort function doesn't just have the right type — it actually sorts. This is an area of active research (dependent types, refinement types). Not yet in production tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  How XGrammar Works
&lt;/h2&gt;

&lt;p&gt;XGrammar's key optimization: &lt;strong&gt;splitting the vocabulary into two classes:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context-independent tokens (~80%+).&lt;/strong&gt; Validity determined at preprocessing, before generation. For each grammar state, a bitmask of valid tokens is precomputed. O(1) per token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context-dependent tokens (~20%).&lt;/strong&gt; Validity depends on the current PDA&lt;sup id="fnref13"&gt;13&lt;/sup&gt; state. Checked at runtime, but few in number.&lt;/p&gt;
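&lt;p&gt;A minimal sketch of the two-class idea, assuming a toy vocabulary and per-state bitmasks (not XGrammar's actual data structures):&lt;/p&gt;

```python
# Sketch of the two-class split (toy data, not XGrammar's real structures):
# context-independent tokens get a validity bitmask precomputed per grammar
# state, so the runtime check is a single table lookup.

VOCAB = ["if", "then", "0", "1", "+", ")"]

# Precomputed at preprocessing time, before any generation happens.
PRECOMPUTED = {
    "expr_start": [True, False, True, True, False, False],
    "after_expr": [False, True, False, False, True, True],
}

def valid_tokens(state):
    """O(1) per token at runtime: read the precomputed bitmask."""
    mask = PRECOMPUTED[state]
    return [tok for tok, ok in zip(VOCAB, mask) if ok]
```

&lt;p&gt;The context-dependent minority (tokens whose validity depends on the stack) would still need a runtime PDA check on top of this.&lt;/p&gt;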

&lt;p&gt;Result: &lt;strong&gt;near-zero overhead.&lt;/strong&gt; Constrained decoding adds no measurable overhead to TPOT&lt;sup id="fnref14"&gt;14&lt;/sup&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  BPE Misalignment Breaks Constrained Decoding
&lt;/h2&gt;

&lt;p&gt;This is where language design becomes critical.&lt;/p&gt;

&lt;p&gt;When a language grammar isn't aligned to BPE boundaries, constrained decoding faces the &lt;strong&gt;bridge token&lt;/strong&gt; problem — a BPE token spanning two grammatical symbols.&lt;/p&gt;
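&lt;p&gt;A toy greedy tokenizer makes the problem concrete; the merge table below is hypothetical:&lt;/p&gt;

```python
# Toy greedy BPE encoder (illustrative; real tokenizers are more complex).
# The merge table is hypothetical. The point: the learned token '="' spans
# two grammar symbols (the '=' operator and the opening quote of a string
# literal), producing a bridge token.

MERGES = ['="', '==', 'name']  # toy learned vocabulary, longest match wins

def toy_encode(text):
    tokens = []
    i = 0
    while i != len(text):
        for merge in sorted(MERGES, key=len, reverse=True):
            if text.startswith(merge, i):
                tokens.append(merge)
                i += len(merge)
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

# 'x="a"' tokenizes so that '="' is ONE token crossing two grammar symbols.
```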

&lt;p&gt;&lt;strong&gt;Domino&lt;/strong&gt; (ICML 2024&lt;sup id="fnref15"&gt;15&lt;/sup&gt;) showed that bridge tokens distort the model's probability distribution. &lt;strong&gt;Grammar-Aligned Decoding&lt;/strong&gt; (NeurIPS 2024&lt;sup id="fnref16"&gt;16&lt;/sup&gt;) formalized the problem and proposed a fix — but with added overhead.&lt;/p&gt;

&lt;p&gt;If a language is designed so bridge tokens &lt;strong&gt;never arise&lt;/strong&gt; — every grammatical symbol coincides with one BPE token — the problem disappears entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deterministic CFG = Zero Overhead
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Nondeterministic CFG&lt;/strong&gt; — multiple rules may apply at a given parse step, so the parser must backtrack&lt;sup id="fnref17"&gt;17&lt;/sup&gt;. Expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic CFG (DCFG)&lt;sup id="fnref18"&gt;18&lt;/sup&gt;&lt;/strong&gt; — exactly one rule applies at each step, so the grammar compiles to an efficient deterministic automaton. No backtracking. No ambiguity.&lt;/p&gt;

&lt;p&gt;Tian et al. (CoLM 2024&lt;sup id="fnref19"&gt;19&lt;/sup&gt;) proved that for DCFGs, constrained decoding compiles in &lt;strong&gt;closed form&lt;/strong&gt; — overhead approaches zero.&lt;/p&gt;

&lt;p&gt;If a language has a DCFG grammar with BPE-aligned operators, constrained decoding is &lt;strong&gt;free&lt;/strong&gt;: zero overhead + zero bridge tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Practice: GBNF Grammar
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root    ::= program
program ::= (decl newline)* decl newline?
decl    ::= func-def | type-sig | type-def

func-def ::= lower-id ws (pattern ws)* "=" ws expr
cond-expr ::= "?" ws expr ws "-&amp;gt;" ws expr ws ":" ws expr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plugging into SGLang:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write factorial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;extra_body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ebnf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synoema.gbnf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Result: GUARANTEED syntactically valid code
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or llama.cpp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./main &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf &lt;span class="nt"&gt;--grammar-file&lt;/span&gt; synoema.gbnf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"-- Quicksort:"&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="nt"&gt;--temp&lt;/span&gt; 0.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Economic Impact
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lever&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BPE-aligned grammar&lt;/td&gt;
&lt;td&gt;46% fewer tokens&lt;/td&gt;
&lt;td&gt;-46% direct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quadratic attention&lt;/td&gt;
&lt;td&gt;54% length → 29% cost&lt;/td&gt;
&lt;td&gt;-71% on attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Constrained decoding&lt;/td&gt;
&lt;td&gt;0 invalid code → 0 retries&lt;/td&gt;
&lt;td&gt;-10–30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type constraints&lt;/td&gt;
&lt;td&gt;-74.8% type errors&lt;/td&gt;
&lt;td&gt;-5–15% additional&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Combined: &lt;strong&gt;50–70%&lt;/strong&gt; savings in cost and energy vs unoptimized Python generation.&lt;/p&gt;
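&lt;p&gt;As a sanity check, the combined range can be reproduced with back-of-the-envelope arithmetic. The lever values are the ones from the table; how they compose (multiplicatively) is a simplifying assumption:&lt;/p&gt;

```python
# Rough arithmetic behind the combined estimate (illustrative; assumes
# the levers compose multiplicatively).
token_savings = 0.46                       # BPE-aligned grammar
length_ratio = 1 - token_savings           # 0.54 of the original length

# Quadratic attention: compute scales with the square of sequence length.
attention_savings = 1 - length_ratio ** 2  # ~71% less attention compute

# Retry elimination and type constraints compound on top of the
# shorter sequences.
retry_low, retry_high = 0.10, 0.30         # constrained decoding
types_low, types_high = 0.05, 0.15         # type constraints

low = 1 - length_ratio * (1 - retry_low) * (1 - types_low)
high = 1 - length_ratio * (1 - retry_high) * (1 - types_high)
# low and high land roughly in the 50-70% band
```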

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In the next article, we'll introduce &lt;strong&gt;Synoema&lt;/strong&gt; — a language with all three levers: BPE-aligned grammar (33 single-token operators), Hindley-Milner&lt;sup id="fnref20"&gt;20&lt;/sup&gt; type inference, and Cranelift&lt;sup id="fnref21"&gt;21&lt;/sup&gt; JIT for native speed.&lt;/p&gt;







&lt;h2&gt;
  
  
  Series: Token Economics of Code
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/delimitter_8b9077911a3848/why-every-token-costs-more-than-you-think-3i00"&gt;Why Every Token Costs More Than You Think&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/delimitter_8b9077911a3848/the-anatomy-of-bpe-why-python-wastes-46-of-tokens-4e0k"&gt;The Anatomy of BPE: Why Python Wastes 46% of Tokens&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code ← &lt;em&gt;you are here&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Compilation for LLMs: Cranelift JIT, 4.4× Faster Than Python &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Hindley-Milner for LLMs: Type Inference Without Annotations &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Show HN: Synoema — The First Programming Language Designed for LLMs &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The Future of Code Generation: From Prompts to Compilation &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Constrained decoding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Technique that masks invalid tokens during generation, guaranteeing grammatically valid output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XGrammar&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Constrained decoding engine, de facto standard for LLM inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SGLang&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source LLM inference engine from UC Berkeley&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source LLM inference engine with memory optimization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TensorRT-LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NVIDIA's inference engine, fastest on their GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GBNF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grammar description format for llama.cpp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Popular project for running LLMs on commodity hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CFG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context-Free Grammar — formal grammar describing language syntax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DCFG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deterministic CFG — unambiguous parsing, enables zero-overhead constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FSM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Finite State Machine — model for fast token validity checking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pushdown Automaton — FSM with a stack for nested structures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TPOT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time Per Output Token — main LLM inference speed metric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bridge token&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BPE token spanning two grammar symbol boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Type inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatic type deduction by the compiler, no annotations needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hindley-Milner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Most powerful type inference algorithm. Used in Haskell, OCaml&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cranelift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rust-based compiler backend. Fast JIT compilation to native code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PLDI / ICML / NeurIPS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Top academic conferences on PL, ML, and AI respectively&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backtracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parsing by trial-and-error with rollbacks. Slow but universal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;strong&gt;PLDI (Programming Language Design and Implementation)&lt;/strong&gt; — one of the most prestigious academic conferences on programming languages. Papers undergo rigorous peer review. If a result is published at PLDI, it's trustworthy. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;strong&gt;Formal grammar&lt;/strong&gt; — a set of rules describing which sequences of symbols are valid in a language. For example, "after &lt;code&gt;[&lt;/code&gt;, the next symbol can be a number, identifier, or &lt;code&gt;]&lt;/code&gt;" is part of a grammar. Python has a complex grammar (hundreds of rules); JSON has a simple one (~10 rules). ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;strong&gt;XGrammar&lt;/strong&gt; — constrained decoding engine from the MLC-AI team. The de facto standard for LLM inference. Its key innovation: splitting the vocabulary into "easy" tokens (80%+, checked at preprocessing) and "hard" tokens (20%, checked at runtime), yielding near-zero overhead. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;strong&gt;SGLang&lt;/strong&gt; — open-source LLM inference engine from UC Berkeley. One of the fastest ways to serve LLMs. Supports constrained decoding via XGrammar out of the box. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;strong&gt;TensorRT-LLM&lt;/strong&gt; — NVIDIA's inference engine, optimized for their GPUs. Fastest on NVIDIA hardware, but complex to set up. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;strong&gt;CFG (Context-Free Grammar)&lt;/strong&gt; — a class of grammars where each rule has the form "symbol → sequence of symbols." Most programming languages are described by CFGs. JSON, XML, HTML, Python, JavaScript all have CFGs. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;strong&gt;Outlines&lt;/strong&gt; — open-source library for structured generation. Compiles a grammar or regex into a finite state machine that filters tokens on the fly. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;&lt;strong&gt;FSM (Finite State Machine)&lt;/strong&gt; — a mathematical model that at any moment is in one of a finite number of states and transitions between them on input symbols. Used for fast checking of whether the next token is valid. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn9"&gt;
&lt;p&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt; — the most popular open-source project for running LLMs on commodity hardware (CPU, Mac M1/M2, budget GPUs). Written in C++, works without Python. Supports GBNF grammars for constrained decoding. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn10"&gt;
&lt;p&gt;&lt;strong&gt;GBNF (GGML BNF)&lt;/strong&gt; — grammar description format used in llama.cpp. Extension of standard Backus-Naur Form. Example: &lt;code&gt;expr ::= number | expr "+" expr&lt;/code&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn11"&gt;
&lt;p&gt;&lt;strong&gt;Type-constrained decoding&lt;/strong&gt; — an extension of constrained decoding that checks types in addition to grammar. If a function expects &lt;code&gt;Int&lt;/code&gt;, the model can't substitute &lt;code&gt;String&lt;/code&gt;. Requires type inference — automatic type deduction by the compiler. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn12"&gt;
&lt;p&gt;&lt;strong&gt;Type inference&lt;/strong&gt; — the compiler's ability to determine types of all expressions automatically, without programmer annotations. Instead of &lt;code&gt;int add(int x, int y)&lt;/code&gt; (as in C), you write just &lt;code&gt;add x y = x + y&lt;/code&gt;, and the compiler deduces &lt;code&gt;Int → Int → Int&lt;/code&gt;. The most powerful type inference algorithm is Hindley-Milner, used in Haskell and ML. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn13"&gt;
&lt;p&gt;&lt;strong&gt;PDA (Pushdown Automaton)&lt;/strong&gt; — an extension of a finite state machine with a stack. Needed for grammars with nesting (brackets, code blocks). A regular FSM can't count brackets — a PDA can. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn14"&gt;
&lt;p&gt;&lt;strong&gt;TPOT (Time Per Output Token)&lt;/strong&gt; — the time to generate one output token. The main metric for LLM inference speed. For GPT-4: ~20–30 ms; for small models on powerful GPUs: 5–10 ms. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn15"&gt;
&lt;p&gt;&lt;strong&gt;ICML (International Conference on Machine Learning)&lt;/strong&gt; — one of the top three ML conferences (with NeurIPS and ICLR). Publication at ICML signals high-quality research. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn16"&gt;
&lt;p&gt;&lt;strong&gt;NeurIPS (Neural Information Processing Systems)&lt;/strong&gt; — the largest AI/ML conference. ~15,000 attendees annually. Publication at NeurIPS is the gold standard for ML research. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn17"&gt;
&lt;p&gt;&lt;strong&gt;Backtracking&lt;/strong&gt; — a parsing method where the parser tries one rule, and if it fails, backtracks and tries another. Slow because the same text may be parsed multiple times. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn18"&gt;
&lt;p&gt;&lt;strong&gt;DCFG (Deterministic Context-Free Grammar)&lt;/strong&gt; — a subclass of CFG where parsing is unambiguous at every step. Compiles to an efficient automaton. Most real programming languages are DCFGs (or close). Python technically isn't due to indentation, but with the offside rule it approximates one. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn19"&gt;
&lt;p&gt;&lt;strong&gt;CoLM (Conference on Language Modeling)&lt;/strong&gt; — a newer conference dedicated to language modeling research, including work at its intersection with formal methods and compiler theory. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn20"&gt;
&lt;p&gt;&lt;strong&gt;Hindley-Milner&lt;/strong&gt; — the most powerful automatic type inference algorithm, developed in the 1960s–80s. Allows the compiler to determine types of all expressions &lt;strong&gt;without a single annotation&lt;/strong&gt;. Used in Haskell, OCaml, F#, Elm. Detailed in the fifth article. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn21"&gt;
&lt;p&gt;&lt;strong&gt;Cranelift&lt;/strong&gt; — a compiler backend written in Rust. Converts intermediate representation (IR) to native machine code (x86-64, ARM). Alternative to LLVM: compiles 10× faster, though generated code is ~14% slower. Ideal for JIT compilation where compilation speed matters more than peak optimization. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>codequality</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Anatomy of BPE: Why Python Wastes 46% of Tokens</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Thu, 02 Apr 2026 15:11:31 +0000</pubDate>
      <link>https://dev.to/delimitter_8b9077911a3848/the-anatomy-of-bpe-why-python-wastes-46-of-tokens-4e0k</link>
      <guid>https://dev.to/delimitter_8b9077911a3848/the-anatomy-of-bpe-why-python-wastes-46-of-tokens-4e0k</guid>
      <description>&lt;h2&gt;
  
  
  How BPE Tokenization Works and What It Means for Language Design
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; If you want to understand how ChatGPT "sees" your code and why the same program costs different amounts in different languages — read on. All terms explained in footnotes and the glossary at the end.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;In the previous article, we established that inference cost grows quadratically with token count. The natural question: can we reduce token count without losing semantics?&lt;/p&gt;

&lt;p&gt;To answer that, we need to understand how LLMs see code. Not as text — as a sequence of tokens. And between how a programmer sees &lt;code&gt;def factorial(n):&lt;/code&gt; and how GPT-4 sees it, there's a chasm.&lt;/p&gt;

&lt;h2&gt;
  
  
  How BPE Works
&lt;/h2&gt;

&lt;p&gt;BPE (Byte Pair Encoding)&lt;sup id="fnref1"&gt;1&lt;/sup&gt; is the algorithm that converts text into sequences of integers (tokens). It underlies all modern LLMs: GPT-4 uses the cl100k_base&lt;sup id="fnref2"&gt;2&lt;/sup&gt; vocabulary, Claude uses a modified BPE, and Llama uses SentencePiece&lt;sup id="fnref3"&gt;3&lt;/sup&gt; BPE.&lt;/p&gt;

&lt;p&gt;The algorithm is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with an alphabet of individual bytes (256 characters).&lt;/li&gt;
&lt;li&gt;Find the most frequent pair of adjacent symbols in the corpus&lt;sup id="fnref4"&gt;4&lt;/sup&gt;.&lt;/li&gt;
&lt;li&gt;Create a new symbol for that pair and add it to the vocabulary.&lt;/li&gt;
&lt;li&gt;Repeat steps 2–3 until the desired vocabulary size (~100K).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is a vocabulary of ~100,000 "subwords" of variable length. Short, frequent words (&lt;code&gt;the&lt;/code&gt;, &lt;code&gt;is&lt;/code&gt;, &lt;code&gt;def&lt;/code&gt;) encode as a single token. Rare words get split: &lt;code&gt;tokenization&lt;/code&gt; → &lt;code&gt;token&lt;/code&gt; + &lt;code&gt;ization&lt;/code&gt;.&lt;/p&gt;
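&lt;p&gt;The training loop can be sketched in miniature (simplified: characters instead of bytes, a single string instead of a corpus, two merges instead of ~100K):&lt;/p&gt;

```python
# Toy BPE training: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def train_bpe(corpus, num_merges):
    symbols = list(corpus)                 # step 1: start from characters
    vocab = set(symbols)
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]   # step 2: most frequent pair
        merged = a + b                     # step 3: the pair becomes one symbol
        vocab.add(merged)
        out, i = [], 0
        while i != len(symbols):           # re-segment using the new merge
            if i + 1 != len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out                      # step 4: repeat
    return vocab, symbols

vocab, segmented = train_bpe("aaabdaaabac", 2)
# the frequent pair 'aa' is now a single vocabulary symbol
```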

&lt;p&gt;&lt;strong&gt;The critical property:&lt;/strong&gt; BPE is trained on &lt;strong&gt;natural language&lt;/strong&gt;, not code. So it's optimized for English prose, not Python syntax.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Misalignment Problem
&lt;/h2&gt;

&lt;p&gt;The grammatical units of a programming language — operators, keywords, delimiters — &lt;strong&gt;don't align&lt;/strong&gt; with BPE token boundaries. This creates two types of waste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type 1: Redundant tokens on syntax.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take a simple Python function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;BPE (cl100k_base) splits this into &lt;strong&gt;29 tokens&lt;/strong&gt;. Semantically significant: &lt;code&gt;factorial&lt;/code&gt;, &lt;code&gt;n&lt;/code&gt;, &lt;code&gt;0&lt;/code&gt;, &lt;code&gt;1&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt;, &lt;code&gt;-&lt;/code&gt;. The remaining 23 tokens are syntactic overhead: &lt;code&gt;def&lt;/code&gt;, spaces, &lt;code&gt;(&lt;/code&gt;, &lt;code&gt;)&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;if&lt;/code&gt;, &lt;code&gt;==&lt;/code&gt;, &lt;code&gt;return&lt;/code&gt; (twice), indentation, newlines.&lt;/p&gt;

&lt;p&gt;The equivalent program in a minimal-syntax functional language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;16 tokens.&lt;/strong&gt; Same semantics. 45% fewer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type 2: Bridge tokens&lt;sup id="fnref5"&gt;5&lt;/sup&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes a single BPE token spans two grammatical symbols. For example, &lt;code&gt;"name"&lt;/code&gt; in JSON may become one token, even though grammatically it's space + quote + identifier + quote. This creates problems for constrained decoding&lt;sup id="fnref6"&gt;6&lt;/sup&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark: 12 Programs, 3 Languages
&lt;/h2&gt;

&lt;p&gt;I compared token counts for equivalent programs in three languages: Python, Haskell&lt;sup id="fnref7"&gt;7&lt;/sup&gt;, and an optimized language where every operator is exactly one BPE token.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Program&lt;/th&gt;
&lt;th&gt;Optimized&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Haskell&lt;/th&gt;
&lt;th&gt;Saving vs Python&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Factorial&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;45%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Map&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QuickSort&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;39%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FizzBuzz&lt;/td&gt;
&lt;td&gt;44&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filter&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fibonacci&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;43%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sum&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Length&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;47%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reverse&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;49%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compose&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;49%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maximum&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;58&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zip&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;332&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;615&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;373&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;46%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The optimized language uses 46% fewer tokens than Python&lt;/strong&gt;, and 11% fewer than Haskell.&lt;/p&gt;

&lt;p&gt;Given the quadratic cost of attention, 46% fewer tokens means roughly &lt;strong&gt;71% less computation&lt;/strong&gt; in the attention layers (0.54² ≈ 0.29 of the original cost).&lt;/p&gt;
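&lt;p&gt;Both figures can be re-derived from the per-program counts in the table (a quick sanity check, not new data):&lt;/p&gt;

```python
# Re-derive the totals and savings from the raw token counts.
# Pairs are (optimized, python) as listed in the table above.
counts = {
    "Factorial": (16, 29), "Map": (20, 42), "QuickSort": (51, 83),
    "FizzBuzz": (44, 64), "Filter": (27, 67), "Fibonacci": (26, 46),
    "Sum": (16, 33), "Length": (16, 30), "Reverse": (18, 35),
    "Compose": (38, 75), "Maximum": (28, 58), "Zip": (32, 53),
}

opt_total = sum(o for o, _ in counts.values())   # 332
py_total = sum(p for _, p in counts.values())    # 615
saving = 1 - opt_total / py_total                # ~0.46

# With quadratic attention, keeping 54% of the tokens keeps
# 0.54^2 = 29% of the attention cost, i.e. ~71% less computation.
attention_saving = 1 - (1 - saving) ** 2
print(f"{saving:.0%} fewer tokens, {attention_saving:.0%} less attention")
```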

&lt;h2&gt;
  
  
  Where the Savings Come From
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Function Definition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python: 6 tokens of boilerplate
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;

&lt;span class="c1"&gt;# Optimized: 0 boilerplate tokens
&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Conditional
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python: 6 overhead tokens
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;

&lt;span class="c1"&gt;# Optimized: 3 tokens
&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lists
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# 9 tokens (commas are waste)
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;       &lt;span class="c1"&gt;# 7 tokens (no commas needed)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pattern Matching&lt;sup id="fnref8"&gt;8&lt;/sup&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python: 29 tokens
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;fac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Optimized: 16 tokens
&lt;/span&gt;&lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;fac &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  BPE-Aligned Grammar
&lt;/h2&gt;

&lt;p&gt;The savings above aren't just "terse syntax." They come from deliberate grammar design that accounts for how BPE tokenizers split text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The BPE-aligned grammar&lt;sup id="fnref9"&gt;9&lt;/sup&gt; principle:&lt;/strong&gt; every language operator must be exactly one BPE token.&lt;/p&gt;

&lt;p&gt;For the optimized language, every operator was verified to encode to exactly 1 BPE token on both cl100k_base (GPT-4) and o200k_base&lt;sup id="fnref10"&gt;10&lt;/sup&gt; (GPT-4o):&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Two chars, 1 token:    -&amp;gt; &amp;lt;- |&amp;gt; ++ &amp;gt;&amp;gt; == != &amp;lt;= &amp;gt;= &amp;amp;&amp;amp; || ..
One char, 1 token:     ? : . = @ | \ _ , + - * / % &amp;lt; &amp;gt;
Delimiters, 1 token:   ( ) [ ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't coincidence — it's a &lt;strong&gt;design constraint&lt;/strong&gt;. If an operator doesn't fit in one BPE token, it gets replaced by one that does.&lt;/p&gt;
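&lt;p&gt;The verification loop itself is trivial. Here is a toy sketch, using a stand-in vocabulary set in place of a real tokenizer (with the tiktoken library, the real check would be that each operator encodes to a single token id):&lt;/p&gt;

```python
# Toy sketch of the one-token verification loop. VOCAB is a stand-in;
# a real check would segment each operator with an actual BPE tokenizer.
OPERATORS = ["->", "<-", "|>", "++", ">>", "==", "!=", "<=", ">=",
             "&&", "||", "..",
             "?", ":", ".", "=", "@", "|", "\\", "_", ",",
             "+", "-", "*", "/", "%", "<", ">",
             "(", ")", "[", "]"]

def tokens_for(op, vocab):
    """Greedy longest-match BPE-style segmentation over a vocabulary."""
    count, i = 0, 0
    while i < len(op):
        for j in range(len(op), i, -1):   # try the longest candidate first
            if op[i:j] in vocab:
                count += 1
                i = j
                break
        else:
            count += 1                    # unknown character: one token
            i += 1
    return count

VOCAB = set(OPERATORS)   # stand-in: every operator is one vocabulary entry
violations = [op for op in OPERATORS if tokens_for(op, VOCAB) != 1]
print(violations)        # []  (no multi-token operators)
```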

&lt;h2&gt;
  
  
  What This Means for LLMs
&lt;/h2&gt;

&lt;p&gt;When an LLM generates code in the optimized language instead of Python, it emits 46% fewer tokens (faster, cheaper), spends ~71% less compute on attention (larger codebases fit in context), produces no bridge tokens (cleaner constrained decoding), and can't "babble": the minimal syntax leaves no room for bloat.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In the next article, we'll look at &lt;strong&gt;constrained decoding&lt;/strong&gt; — the technology that guarantees 100% syntactic correctness. And we'll show why BPE-aligned grammar makes constrained decoding &lt;strong&gt;free&lt;/strong&gt;.&lt;/p&gt;







&lt;h2&gt;
  
  
  Series: Token Economics of Code
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/delimitter_8b9077911a3848/why-every-token-costs-more-than-you-think-3i00"&gt;Why Every Token Costs More Than You Think&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The Anatomy of BPE: Why Python Wastes 46% of Tokens ← &lt;strong&gt;you are here&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Compilation for LLMs: Cranelift JIT, 4.4× Faster Than Python &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Hindley-Milner for LLMs: Type Inference Without Annotations &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Show HN: Synoema — The First Programming Language Designed for LLMs &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The Future of Code Generation: From Prompts to Compilation &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BPE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Byte Pair Encoding — algorithm splitting text into tokens by merging frequent character pairs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;cl100k_base&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-4's BPE vocabulary with ~100K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;o200k_base&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-4o's newer BPE vocabulary with ~200K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SentencePiece&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google's BPE alternative used in Llama and open models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Corpus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive text dataset for training BPE and LLMs (web, books, code)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bridge token&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BPE token spanning the boundary of two grammar symbols&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BPE-aligned grammar&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grammar where every operator = exactly 1 BPE token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pattern matching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Defining functions by examples: &lt;code&gt;fac 0 = 1&lt;/code&gt; instead of &lt;code&gt;if n == 0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Constrained decoding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Technology forbidding invalid tokens during generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Haskell&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Functional language with minimal syntax; used here as the brevity benchmark&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;strong&gt;BPE (Byte Pair Encoding)&lt;/strong&gt; — a text compression algorithm invented in 1994 and adapted for LLMs. The idea: find the most frequent pairs of characters in a huge text corpus and merge them into a new symbol. Repeat ~100,000 times. The result is a vocabulary of "subwords" that the model thinks in. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;strong&gt;cl100k_base&lt;/strong&gt; — the specific BPE token vocabulary used by GPT-4 and GPT-3.5 Turbo. Contains ~100,000 tokens. Trained primarily on English internet text. GPT-4o uses a newer vocabulary called o200k_base with ~200,000 tokens; Claude uses Anthropic's own tokenizer. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;strong&gt;SentencePiece&lt;/strong&gt; — Google's alternative BPE implementation, used in Llama and other open models. Works at the Unicode character level instead of bytes, which is better for non-English languages. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;strong&gt;Corpus&lt;/strong&gt; — the massive text dataset used to train the BPE vocabulary (and the LLM itself). Includes web pages, books, articles, GitHub code. Usually hundreds of billions of tokens. Code makes up only ~5–15% of a typical corpus — which is why BPE is optimized for English prose, not Python syntax. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;strong&gt;Bridge token&lt;/strong&gt; — a BPE token that spans the boundary between two grammatical symbols. For example, BPE might merge a space and keyword &lt;code&gt;if&lt;/code&gt; into one token. This creates problems for constrained decoding engines, which must "split" the token, distorting the model's probability distribution. More details in the third article. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;strong&gt;Constrained decoding&lt;/strong&gt; — technology that forbids invalid tokens at each generation step. Guarantees syntactically valid output. Covered in detail in the third article. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;strong&gt;Haskell&lt;/strong&gt; — a functional programming language with minimal syntax. Used as the "brevity benchmark" among existing languages. &lt;code&gt;fac 0 = 1&lt;/code&gt; in Haskell and the optimized language look nearly identical, but the optimized language additionally accounts for BPE boundaries. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;&lt;strong&gt;Pattern matching&lt;/strong&gt; — defining a function by "examples." Instead of &lt;code&gt;if n == 0: return 1&lt;/code&gt;, you write &lt;code&gt;fac 0 = 1&lt;/code&gt; — literally: "factorial of zero is one." The compiler generates the check automatically. Shorter, clearer, and eliminates an entire class of errors. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn9"&gt;
&lt;p&gt;&lt;strong&gt;BPE-aligned grammar&lt;/strong&gt; — a language design principle where every operator, keyword, and delimiter encodes to exactly one BPE token. This means: no "wasted" tokens on syntax and no bridge tokens. Conventional languages don't account for BPE — they were created long before LLMs. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn10"&gt;
&lt;p&gt;&lt;strong&gt;o200k_base&lt;/strong&gt; — the newer BPE vocabulary used by GPT-4o. Contains ~200,000 tokens (twice cl100k_base). Better coverage of code and non-English languages, but same underlying principles. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>science</category>
    </item>
    <item>
      <title>Why Every Token Costs More Than You Think</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Thu, 02 Apr 2026 15:07:06 +0000</pubDate>
      <link>https://dev.to/delimitter_8b9077911a3848/why-every-token-costs-more-than-you-think-3i00</link>
      <guid>https://dev.to/delimitter_8b9077911a3848/why-every-token-costs-more-than-you-think-3i00</guid>
      <description>&lt;h1&gt;
  
  
  Why Every Token Costs More Than You Think
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Quadratic Price of Attention: How Context Length Is Killing Your AI Budget
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; If you use ChatGPT, Claude, Copilot, or Cursor to write code, this article explains why the same tasks can cost 2–4× less. No technical background required — all terms are explained inline and in the glossary at the end.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;When you ask Claude or GPT to write a sorting function, the model generates ~50 tokens&lt;sup id="fnref1"&gt;1&lt;/sup&gt; per second. Each token costs fractions of a cent. Seems cheap.&lt;/p&gt;

&lt;p&gt;But behind that simplicity lies an engineering reality most people overlook: the cost of each token grows &lt;strong&gt;quadratically&lt;/strong&gt; with context length&lt;sup id="fnref2"&gt;2&lt;/sup&gt;. If you're working with codebases spanning thousands of lines, this quadratic relationship transforms from a theoretical abstraction into a line item that can double your AI budget.&lt;/p&gt;

&lt;p&gt;In this article, I'll show where this cost comes from, why inference — not training — is the dominant consumer of resources, and what can be done about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inference Consumes 90%+ of All Energy
&lt;/h2&gt;

&lt;p&gt;There's a common misconception: the major cost of LLMs&lt;sup id="fnref3"&gt;3&lt;/sup&gt; is training. Training GPT-4 reportedly cost $50–100M. An impressive number.&lt;/p&gt;

&lt;p&gt;But training is a one-time capital expense. Inference&lt;sup id="fnref4"&gt;4&lt;/sup&gt; is an ongoing operational cost that occurs with every request, every second, for every user.&lt;/p&gt;

&lt;p&gt;According to AWS, inference consumes more than 90% of total energy in the LLM lifecycle. The AI inference market is valued at $106 billion in 2025, projected to exceed $250 billion by 2030 at a 19.2% compound annual growth rate.&lt;/p&gt;

&lt;p&gt;Every token ChatGPT generates costs OpenAI approximately $0.00012. Sounds negligible. But at billions of daily requests, this adds up to hundreds of millions of dollars per year — and terajoules of electricity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quadratic Trap
&lt;/h2&gt;

&lt;p&gt;Here's the key fact that changes everything.&lt;/p&gt;

&lt;p&gt;In a standard transformer&lt;sup id="fnref5"&gt;5&lt;/sup&gt; with self-attention&lt;sup id="fnref6"&gt;6&lt;/sup&gt;, the computational cost of processing a sequence of &lt;em&gt;n&lt;/em&gt; tokens is:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost(n) = O(n² · d)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where &lt;em&gt;d&lt;/em&gt; is the model dimension. This is not a linear relationship. It's &lt;strong&gt;quadratic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What this means in practice:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Relative attention cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1,000 tokens&lt;/td&gt;
&lt;td&gt;1× (baseline)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2,000 tokens&lt;/td&gt;
&lt;td&gt;4×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4,000 tokens&lt;/td&gt;
&lt;td&gt;16×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8,000 tokens&lt;/td&gt;
&lt;td&gt;64×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32,000 tokens&lt;/td&gt;
&lt;td&gt;1,024×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Doubling the context increases attention cost &lt;strong&gt;fourfold&lt;/strong&gt;, not twofold. This means reducing context by 50% saves not 50%, but &lt;strong&gt;75%&lt;/strong&gt; of attention computation.&lt;/p&gt;
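&lt;p&gt;Both ratios fall straight out of the n² term (the model dimension d is fixed for a given model, so it cancels when comparing context lengths):&lt;/p&gt;

```python
# Relative attention cost at context length n vs a 1,000-token baseline.
# Cost(n) = O(n^2 * d); d cancels when comparing lengths on one model.
def relative_cost(n, baseline=1_000):
    return (n / baseline) ** 2

for n in (1_000, 2_000, 4_000, 8_000, 32_000):
    print(f"{n:>6} tokens: {relative_cost(n):>5.0f}x")

# Halving the context leaves 0.5^2 = 25% of the cost: a 75% saving.
halving_saving = 1 - relative_cost(500, baseline=1_000)
print(f"{halving_saving:.0%}")
```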

&lt;p&gt;When a developer sends a 2,000-line Python file (~8,000 tokens) to an LLM for refactoring, the attention cost is 64× higher than for a simple 1,000-token question. And that's just one request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Money
&lt;/h2&gt;

&lt;p&gt;Let's calculate for a typical team.&lt;/p&gt;

&lt;p&gt;A team of 10 developers uses an AI assistant (Cursor, Copilot, Claude Code). Each makes an average of 100 requests per day. Average request context: 2,000 input tokens. Average response: 500 output tokens.&lt;/p&gt;

&lt;p&gt;At Claude Sonnet 4 pricing ($3/M input, $15/M output):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input:  10 × 100 × 2,000 = 2M tokens/day × $3/M  = $6/day
Output: 10 × 100 × 500   = 500K tokens/day × $15/M = $7.50/day
Total: ~$13.50/day = ~$405/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now imagine expressing the same programs with 46% fewer tokens (I'll show in the next article that this is achievable):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input:  2M × 0.54 = 1.08M tokens/day × $3/M  = $3.24/day
Output: 500K × 0.54 = 270K tokens/day × $15/M = $4.05/day
Total: ~$7.29/day = ~$219/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Savings: &lt;strong&gt;$186/month&lt;/strong&gt; for 10 people, or &lt;strong&gt;$2,200/year&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For 100 developers: &lt;strong&gt;$22,000/year&lt;/strong&gt;. For 1,000: &lt;strong&gt;$220,000&lt;/strong&gt;. And this is a conservative estimate with a relatively affordable model and moderate workload.&lt;/p&gt;
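&lt;p&gt;The arithmetic above, spelled out (team size, request volumes, and per-million-token prices are the figures already quoted):&lt;/p&gt;

```python
# Monthly API cost: daily token volumes priced per million tokens,
# scaled to a 30-day month. Prices are the quoted Sonnet rates.
def monthly_cost(devs, reqs_per_day, in_tok, out_tok,
                 in_price=3.0, out_price=15.0, days=30):
    inp = devs * reqs_per_day * in_tok / 1e6 * in_price
    out = devs * reqs_per_day * out_tok / 1e6 * out_price
    return (inp + out) * days

baseline = monthly_cost(10, 100, 2_000, 500)                # $405
compact = monthly_cost(10, 100, 2_000 * 0.54, 500 * 0.54)   # ~$219
print(round(baseline), round(compact), round(baseline - compact))
```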

&lt;h2&gt;
  
  
  The Energy Dimension
&lt;/h2&gt;

&lt;p&gt;Measurements on LLaMA-65B&lt;sup id="fnref7"&gt;7&lt;/sup&gt; (A100 GPUs&lt;sup id="fnref8"&gt;8&lt;/sup&gt;) show energy consumption in the range of 3–4 joules per output token. On modern H100s with optimized inference engines like vLLM&lt;sup id="fnref9"&gt;9&lt;/sup&gt;, efficiency has improved roughly 10×, down to ~0.39 J per token. But usage scale has grown even faster.&lt;/p&gt;

&lt;p&gt;ChatGPT processes an estimated one billion requests daily. At an average response of 500 tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1B requests × 500 tokens × 0.39 J ≈ 195 GJ/day ≈ 54,000 kWh/day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the energy consumption of a small town — every single day. Reducing token count isn't just about saving money. It's a direct reduction in energy consumption and carbon footprint.&lt;/p&gt;
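&lt;p&gt;The same estimate in code (1B requests/day, 500-token responses, and 0.39 J/token are the assumptions from above):&lt;/p&gt;

```python
# Daily energy estimate: requests/day x tokens/response x joules/token.
requests_per_day = 1e9
tokens_per_response = 500
joules_per_token = 0.39               # H100-era figure quoted above

total_joules = requests_per_day * tokens_per_response * joules_per_token
gj_per_day = total_joules / 1e9       # gigajoules
kwh_per_day = total_joules / 3.6e6    # 1 kWh = 3.6 MJ
print(f"{gj_per_day:.0f} GJ/day, {kwh_per_day:,.0f} kWh/day")
```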

&lt;h2&gt;
  
  
  The Babbling Problem
&lt;/h2&gt;

&lt;p&gt;The study "Towards Green AI" (2026) found that 3 out of 10 tested models exhibit "babbling" behavior — generating significantly more text than necessary. Suppressing this yielded energy savings of 44% to 89%.&lt;/p&gt;

&lt;p&gt;But what if the language the LLM writes code in is designed so that "babbling" is physically impossible?&lt;/p&gt;

&lt;p&gt;Python code is inherently verbose. &lt;code&gt;def&lt;/code&gt;, &lt;code&gt;return&lt;/code&gt;, &lt;code&gt;if/elif/else&lt;/code&gt;, commas in lists — all syntactic overhead&lt;sup id="fnref10"&gt;10&lt;/sup&gt; that consumes tokens without carrying semantic information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Optimization Levers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lever 1: Representation compression.&lt;/strong&gt; Express the same program with fewer tokens. This isn't obfuscation — it's grammar design optimized for BPE tokenizers&lt;sup id="fnref11"&gt;11&lt;/sup&gt;. Potential: 35–50%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lever 2: Constrained decoding&lt;sup id="fnref12"&gt;12&lt;/sup&gt;.&lt;/strong&gt; Prevent the model from generating syntactically invalid code. Every error = retry = double token spend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lever 3: Type guarantees.&lt;/strong&gt; Type errors account for 33.6% of all failures in LLM-generated code. Type-guided generation&lt;sup id="fnref13"&gt;13&lt;/sup&gt; reduces them by 74.8%.&lt;/p&gt;

&lt;p&gt;Combining all three levers can yield 60–80% cumulative savings in tokens, money, energy, and time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In the next article, we'll examine &lt;strong&gt;how BPE tokenization actually works&lt;/strong&gt; and why Python syntax wastes 46% of tokens on structural noise.&lt;/p&gt;







&lt;h2&gt;
  
  
  Series: Token Economics of Code
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Why Every Token Costs More Than You Think &lt;strong&gt;← you are here&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The Anatomy of BPE: Why Python Wastes 46% of Tokens &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Compilation for LLMs: Cranelift JIT, 4.4× Faster Than Python &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Hindley-Milner for LLMs: Type Inference Without Annotations &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Show HN: Synoema — The First Programming Language Designed for LLMs &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The Future of Code Generation: From Prompts to Compilation &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Smallest text unit for an LLM. Roughly ¾ of a word or 3–4 characters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large Language Model — neural network that generates text/code (GPT-4, Claude, Llama)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generating a response from a trained model. Happens with every request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Everything the model "sees" — prompt, chat history, files. Measured in tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transformer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Neural network architecture underlying all LLMs. Uses attention mechanism&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-attention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mechanism where every token considers all others. Cost: O(n²)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BPE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Byte Pair Encoding — algorithm that splits text into tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Constrained decoding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Technology forbidding invalid tokens during generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Graphics card for AI computation. NVIDIA H100 is standard for LLM inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source engine for fast LLM serving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parts of code/computation carrying no useful payload&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;strong&gt;Token&lt;/strong&gt; — the smallest unit of text an LLM processes. Not a letter, not a word, but a "chunk" of text 1–15 characters long. The word "hello" is 1 token; the code &lt;code&gt;def factorial(n):&lt;/code&gt; is 6 tokens. The model doesn't see characters — it sees a sequence of tokens. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;strong&gt;Context (context window)&lt;/strong&gt; — everything the model "sees" at once: your question, previous messages, attached files. Measured in tokens. GPT-4 has a context of up to 128K tokens, Claude up to 200K. The longer the context, the more computation the model needs. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;strong&gt;LLM (Large Language Model)&lt;/strong&gt; — a neural network trained on massive amounts of text that can generate text, code, and answer questions. Examples: GPT-4, Claude, Llama, Gemini. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;strong&gt;Inference&lt;/strong&gt; — the process of using an already-trained model to generate responses. When you type a prompt into ChatGPT and get an answer, that's inference. Unlike training (which happens once), inference happens billions of times per day. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;strong&gt;Transformer&lt;/strong&gt; — the neural network architecture underlying all modern LLMs. Invented at Google in 2017 ("Attention Is All You Need" paper). Its key feature is the "attention" mechanism, which lets the model consider relationships between any words in the text, even distant ones. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;strong&gt;Self-attention&lt;/strong&gt; — a mechanism where every token "looks at" every other token in the context to understand their relationships. This gives transformers their power — but also creates quadratic cost: if there are &lt;em&gt;n&lt;/em&gt; tokens, there are &lt;em&gt;n × n&lt;/em&gt; pairs to compare. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;strong&gt;LLaMA&lt;/strong&gt; — a family of open-source language models from Meta (Facebook). Available for download and self-hosted deployment, unlike GPT-4. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;&lt;strong&gt;GPU (Graphics Processing Unit)&lt;/strong&gt; — originally a graphics card, now used for AI computation. NVIDIA A100 and H100 are specialized GPUs for LLM inference and training. A single H100 costs ~$30–40K and draws 700 watts. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn9"&gt;
&lt;p&gt;&lt;strong&gt;vLLM&lt;/strong&gt; — an open-source engine for fast LLM serving. Optimizes GPU memory usage through PagedAttention, enabling more simultaneous requests. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn10"&gt;
&lt;p&gt;&lt;strong&gt;Syntactic overhead&lt;/strong&gt; — parts of code required by the language's syntax but carrying no meaning. For example, Python's &lt;code&gt;def&lt;/code&gt; before a function definition and &lt;code&gt;return&lt;/code&gt; before a return value are mandatory but contain no information about &lt;em&gt;what&lt;/em&gt; the function does. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn11"&gt;
&lt;p&gt;&lt;strong&gt;BPE (Byte Pair Encoding)&lt;/strong&gt; — the algorithm that splits text into tokens. Used in all modern LLMs. Finds the most frequent pairs of characters in a huge text corpus and merges them into new "subwords." Result: a vocabulary of ~100,000 tokens. Covered in detail in the second article. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn12"&gt;
&lt;p&gt;&lt;strong&gt;Constrained decoding&lt;/strong&gt; — a technology that forbids the model from choosing invalid tokens at each generation step. If the model is generating JSON, it ensures brackets are closed and commas are in the right places. The same can be done for any language with a formal grammar. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn13"&gt;
&lt;p&gt;&lt;strong&gt;Type-guided generation&lt;/strong&gt; — an extension of constrained decoding where the model is additionally prevented from generating code with type errors. A second layer of guarantees on top of syntactic ones. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>science</category>
    </item>
  </channel>
</rss>
