<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Scott Corley</title>
    <description>The latest articles on DEV Community by Scott Corley (@scott-corley).</description>
    <link>https://dev.to/scott-corley</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3895009%2Fc35608d3-ed62-4ac0-94d3-9c547bc164fa.png</url>
      <title>DEV Community: Scott Corley</title>
      <link>https://dev.to/scott-corley</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/scott-corley"/>
    <language>en</language>
    <item>
      <title>A rabbit hole turned into a systems programming language. Initial benchmarks show 60% fewer (AI) tokens consumed with real code.</title>
      <dc:creator>Scott Corley</dc:creator>
      <pubDate>Fri, 24 Apr 2026 00:40:30 +0000</pubDate>
      <link>https://dev.to/scott-corley/a-rabbit-hole-turned-into-a-systems-programming-language-initial-benchmarks-show-60-fewer-ai-15kc</link>
      <guid>https://dev.to/scott-corley/a-rabbit-hole-turned-into-a-systems-programming-language-initial-benchmarks-show-60-fewer-ai-15kc</guid>
      <description>&lt;p&gt;Fair Warning: I tend to ramble, not finish thoughts and I am great at constructing long run-on sentences. Working with AI Agents has helped me be more productive and organize my thoughts and efforts. So I worked with AI on this project and, to be fair to people with more organized thought processes I put all my thoughts together and had AI generate a more concise post below. I will not use AI to answer any questions you had but an initial post I feel should have more structure than my typical brain dump.  &lt;/p&gt;




&lt;p&gt;I want to be upfront about something: I'm not a professional programmer. I'm an automation engineer. I've written scripts and code on and off throughout my career — enough to get things done, not enough to call it my day job.&lt;/p&gt;

&lt;p&gt;What changed recently is agentic AI. I started using AI agents to help automate parts of my work and the efficiency gains were real and immediate. That experience sent me down a rabbit hole I didn't expect.&lt;/p&gt;

&lt;p&gt;The rabbit hole didn't start with programming languages. It started with human language — specifically, information density in spoken communication. I was reading research on how much semantic content different languages pack into a given unit of speech. Some languages are denser than others. The information-per-syllable ratio varies significantly and in measurable ways. That got me thinking: if we can measure information density in spoken language, what does that look like for programming languages? And more specifically — what does it look like for the tokens an AI actually processes?&lt;/p&gt;

&lt;p&gt;From there the question narrowed fast. AI agents were helping me write code. But I kept watching them make a particular class of mistake — not syntax errors, those are boring. The subtler ones. Hidden side effects. Silently swallowed errors. Preconditions buried in a comment the model will never find when it needs them. The code was syntactically correct and semantically wrong, and the language gave the agent no way to know better. The ambiguity was structural.&lt;/p&gt;

&lt;p&gt;So I started building Candor. Not because I had a plan to build a programming language, but because the rabbit hole bottomed out there.&lt;/p&gt;




&lt;h2&gt;The idea: reduce ambiguity, reduce cost&lt;/h2&gt;

&lt;p&gt;The original goal was simple: make a language where neither humans nor AI agents can hide what code does. Every side effect declared. Every error handled. Every precondition machine-readable. If the language is unambiguous for a human reviewer, it is unambiguous for the model writing it — for the same reasons.&lt;/p&gt;

&lt;p&gt;What I didn't anticipate — and what became obvious the moment I started using AI agents to build the compiler itself — is that ambiguity and token cost are the same problem from two different angles.&lt;/p&gt;

&lt;p&gt;When a language forces you to write 24 tokens of boilerplate to propagate an error, those 24 tokens carry zero semantic information. The model has to read through them to learn one thing: "if this fails, return the error." Every one of those tokens costs compute. Every one of them costs electricity. Every one of them is GPU cycles and memory bandwidth and heat. At the scale AI-assisted development is moving toward — agentic loops, multi-model pipelines, continuous AI-driven iteration — those costs compound into something real.&lt;/p&gt;

&lt;p&gt;That's when the token efficiency question moved from "nice to have" to the center of the design.&lt;/p&gt;




&lt;h2&gt;I measured it&lt;/h2&gt;

&lt;p&gt;I used the Anthropic &lt;code&gt;count_tokens&lt;/code&gt; API against &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; — the same tokenizer Claude actually uses when processing code. Not an approximation. Real API calls, baseline overhead subtracted, results saved as timestamped JSON.&lt;/p&gt;

&lt;p&gt;I measured every keyword in Candor. Every operator. Every common signature pattern.&lt;/p&gt;
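&lt;p&gt;For anyone who wants to reproduce the setup: a minimal sketch of that measurement loop, assuming the Anthropic Python SDK. The &lt;code&gt;count_tokens&lt;/code&gt; call is the real SDK method; the actual harness in the repo may differ in its details.&lt;/p&gt;

```python
# Minimal sketch of the measurement described above (my reconstruction,
# not the repo's actual harness). client is an anthropic.Anthropic() instance.
def measure(client, snippet, model="claude-sonnet-4-6"):
    # Real SDK call: counts the tokens Claude would see for this message.
    resp = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": snippet}],
    )
    return resp.input_tokens

def net_tokens(raw_count, baseline_count):
    # Subtract the fixed message-wrapper overhead so only the code is counted.
    return raw_count - baseline_count
```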

&lt;p&gt;&lt;strong&gt;36 of 37 core keywords = 1 BPE token each.&lt;/strong&gt; That's not luck — I verified each one before committing to it. The one failure is &lt;code&gt;refmut&lt;/code&gt; (3 tokens), which has an Agent Form alias that reduces the common case by 33%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;pure&lt;/code&gt; — the most important annotation in the language — is 1 token.&lt;/strong&gt; Free.&lt;/p&gt;

&lt;p&gt;Then I measured what the &lt;code&gt;?&lt;/code&gt; propagation operator actually saves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// This is what error propagation looks like in full syntax — 24 tokens
match open(path) { ok(v) =&amp;gt; v   err(e) =&amp;gt; return err(e) }

// This is what it looks like with ? — 4 tokens
open(path)?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Full syntax&lt;/th&gt;
&lt;th&gt;With &lt;code&gt;?&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single propagation site&lt;/td&gt;
&lt;td&gt;24 tok&lt;/td&gt;
&lt;td&gt;4 tok&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 sites — typical IO function&lt;/td&gt;
&lt;td&gt;72 tok&lt;/td&gt;
&lt;td&gt;14 tok&lt;/td&gt;
&lt;td&gt;58&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 sites — complex pipeline&lt;/td&gt;
&lt;td&gt;120 tok&lt;/td&gt;
&lt;td&gt;24 tok&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 5 propagation sites: &lt;strong&gt;96 tokens eliminated from a single function.&lt;/strong&gt; Every one of those 96 tokens was routing boilerplate. No signal. Pure overhead.&lt;/p&gt;
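&lt;p&gt;The percentages in the table fall out of one line of arithmetic over the measured counts:&lt;/p&gt;

```python
# Reproduces the "Savings" column from the measured token counts above.
def percent_saved(full_tokens, short_tokens):
    return round(100 * (full_tokens - short_tokens) / full_tokens)

print(percent_saved(24, 4))    # single site  -- 83
print(percent_saved(72, 14))   # 3 sites      -- 81
print(percent_saved(120, 24))  # 5 sites      -- 80
```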




&lt;h2&gt;The complete function comparison&lt;/h2&gt;

&lt;p&gt;Here's the same IO function in both forms, measured end to end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Full syntax (Verification Form) — 106 tokens
fn process(path: str) -&amp;gt; result&amp;lt;str, str&amp;gt; effects(io) {
    let f = match open(path) { ok(v) =&amp;gt; v   err(e) =&amp;gt; return err(e) }
    let s = match read(f)  { ok(v) =&amp;gt; v   err(e) =&amp;gt; return err(e) }
    let r = match parse(s) { ok(v) =&amp;gt; v   err(e) =&amp;gt; return err(e) }
    return ok(r)
}

// Agent Form — 42 tokens
fn process(path: str) -&amp;gt; ?str io {
    let f = open(path)?
    let s = read(f)?
    let r = parse(s)?
    return ok(r)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;60% fewer tokens. Same program. Same semantics. Same compiled output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The transformation from Agent Form to full syntax is mechanical and one-pass. &lt;code&gt;?T&lt;/code&gt; expands to &lt;code&gt;result&amp;lt;T, str&amp;gt;&lt;/code&gt;. &lt;code&gt;io&lt;/code&gt; expands to &lt;code&gt;effects(io)&lt;/code&gt;. &lt;code&gt;expr?&lt;/code&gt; expands to the full match block. No inference. Every rule is a substitution.&lt;/p&gt;
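&lt;p&gt;To make the "every rule is a substitution" claim concrete, here is a toy illustration of two of those rewrites as plain textual substitutions. This is my sketch, not the real Candor expander, and it omits the &lt;code&gt;?T&lt;/code&gt; rule:&lt;/p&gt;

```python
import re

# Toy illustration of two Agent Form rewrite rules as plain substitutions.
# My sketch, not the Candor compiler; it skips the ?T-to-result rule.
def expand_effects(signature):
    # A bare trailing effect name expands to an explicit effects(...) clause.
    return re.sub(r"\b(io|net)\s*\{", r"effects(\1) {", signature)

def expand_question(expr):
    # expr? expands to the full match block that propagates the error.
    call = expr.rstrip("?")
    return "match " + call + " { ok(v) => v   err(e) => return err(e) }"

print(expand_question("open(path)?"))
print(expand_effects("fn process(path: str) -> ?str io {"))
```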

&lt;p&gt;This means AI writes the dense form. Humans review the full form. Both are the same program.&lt;/p&gt;




&lt;h2&gt;What the savings mean in practice&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frontend:&lt;/strong&gt; If you're using AI in a code generation workflow — IDE assistant, AI review bot, agent writing components — every token the language doesn't need is context the model can spend on your actual problem. A 60% reduction in function-level token overhead means the same context window fits roughly 2.5× as many function definitions. That's not a developer-experience improvement. That's a capability increase.&lt;/p&gt;
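&lt;p&gt;The 2.5× figure is just the reciprocal of what remains after the reduction:&lt;/p&gt;

```python
# If functions cost 60% fewer tokens, each one costs 0.4x as much,
# so the same context window fits 1 / 0.4 = 2.5x as many.
reduction = 0.60
print(1 / (1 - reduction))  # 2.5
```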

&lt;p&gt;&lt;strong&gt;Backend:&lt;/strong&gt; At 100 concurrent AI requests on a 70B model, each token in the KV cache costs approximately 327 KB of VRAM. A 56% savings in function signature tokens across a coding context frees significant memory per request — which means more requests per GPU, lower latency, or smaller hardware for the same throughput.&lt;/p&gt;
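&lt;p&gt;The 327 KB figure is consistent with a 70B-class model using grouped-query attention. A back-of-envelope check, where the layer and head parameters are my assumption (a Llama-2-70B-style layout), not taken from the benchmark:&lt;/p&gt;

```python
# Back-of-envelope KV cache cost per token for a 70B-class model.
# Assumed layout (Llama-2-70B-style, my assumption): 80 layers,
# 8 grouped-query KV heads, head dim 128, fp16 (2 bytes per value).
layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value  # K plus V
print(kv_bytes / 1000)  # 327.68 -- roughly 327 KB per token
```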

&lt;p&gt;&lt;strong&gt;Real-world electrical and mechanical:&lt;/strong&gt; This is the part people don't talk about enough. GPU compute is not free. It is electricity, heat, cooling systems, mechanical stress on hardware. A token that doesn't need to be processed doesn't burn a watt. At the scale AI is scaling to — inference farms, always-on coding agents, continuous pipeline generation — token efficiency is energy efficiency. A language that eliminates 80% of error-routing boilerplate per function isn't just cleaner. It is a measurably lower power draw per unit of work.&lt;/p&gt;




&lt;h2&gt;Trust and verification: the other side of the coin&lt;/h2&gt;

&lt;p&gt;Token efficiency was the measurement that surprised me. But it's not why I started building Candor.&lt;/p&gt;

&lt;p&gt;The deeper problem is trust. People are split on AI-generated code — not because AI writes bad syntax, but because they can't see what the code is doing. Hidden side effects. Silently swallowed errors. Preconditions buried in a comment the agent never reads. The fear is reasonable. And reassurance doesn't answer it. Structure does.&lt;/p&gt;

&lt;p&gt;Candor is built so that neither a human nor an AI agent can hide what a function does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every side effect is declared and compiler-enforced:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fn send_report(data: str) -&amp;gt; unit effects(io, net) {
    write_file("log.txt", data)
    http_post("https://api.example.com/report", data)
    return unit
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;effects(io, net)&lt;/code&gt; is not a comment. It is enforced by the type checker across the entire call graph. A pure function that tries to call &lt;code&gt;send_report&lt;/code&gt; is a compile error. An AI agent cannot write a function that silently touches the network without declaring it. A human reviewer sees it in the signature without reading the body.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every error must be handled — silence is a compile error:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let config = load_config("settings.cnd") must {
    ok(cfg) =&amp;gt; cfg
    err(e)  =&amp;gt; {
        print(str_concat("error: ", e))
        return unit
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Discarding a &lt;code&gt;result&amp;lt;T,E&amp;gt;&lt;/code&gt; without handling both arms is rejected by the compiler. There is no way for AI-generated code to silently swallow a failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And then there's the LLVM layer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the part that goes deeper than type checking. Candor's &lt;code&gt;pure&lt;/code&gt; annotation doesn't just tell the compiler to check the call graph. It emits a machine-verifiable attribute in LLVM IR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight llvm"&gt;&lt;code&gt;&lt;span class="c1"&gt;; pure function — LLVM's own verifier rejects this if it contains a load or store&lt;/span&gt;
&lt;span class="k"&gt;define&lt;/span&gt; &lt;span class="kt"&gt;i64&lt;/span&gt; &lt;span class="vg"&gt;@add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;i64&lt;/span&gt; &lt;span class="nv"&gt;%a.in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;i64&lt;/span&gt; &lt;span class="nv"&gt;%b.in&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;none&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;nounwind&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;; effectful function — bare define, no machine-verifiable purity claim&lt;/span&gt;
&lt;span class="k"&gt;define&lt;/span&gt; &lt;span class="kt"&gt;ptr&lt;/span&gt; &lt;span class="vg"&gt;@process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;ptr&lt;/span&gt; &lt;span class="nv"&gt;%path.in&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;memory(none) nounwind&lt;/code&gt; is not a Candor invention. It is an LLVM attribute that LLVM's own verifier enforces independently of the Candor compiler. If a &lt;code&gt;pure&lt;/code&gt; function emits a load or store at the IR level, LLVM rejects it. The guarantee travels all the way from source to hardware without requiring trust in any single tool.&lt;/p&gt;

&lt;p&gt;The transparency chain looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI writes:       fn add(a: i64, b: i64) -&amp;gt; i64 pure { return a + b }
                        ↓ EFFECTS-001 (type checker)
Candor checks:   pure callers may not call effectful code
                        ↓ emit_llvm
LLVM IR:         define i64 @add(...) memory(none) nounwind { ... }
                        ↓ LLVM verifier
Hardware:        guaranteed — no memory side effects
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No step in that chain requires trust. Each layer is independently auditable. That is the goal: a language where the safety guarantees aren't words in a README, they're machine-checkable at every level.&lt;/p&gt;




&lt;h2&gt;The honest &lt;code&gt;|&amp;gt;&lt;/code&gt; result&lt;/h2&gt;

&lt;p&gt;I measured the pipeline operator too, and I'm including this because I think honesty about negative results matters:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;|&amp;gt;&lt;/code&gt; is 2 tokens. BPE splits &lt;code&gt;|&lt;/code&gt; and &lt;code&gt;&amp;gt;&lt;/code&gt; separately. It costs tokens compared to nested calls:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Nested&lt;/th&gt;
&lt;th&gt;Pipeline&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3-step, snake_case&lt;/td&gt;
&lt;td&gt;14 tok&lt;/td&gt;
&lt;td&gt;16 tok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5-step, snake_case&lt;/td&gt;
&lt;td&gt;23 tok&lt;/td&gt;
&lt;td&gt;26 tok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I kept it anyway. The value is structural, not arithmetic: &lt;code&gt;x |&amp;gt; parse |&amp;gt; filter |&amp;gt; render&lt;/code&gt; is linear. &lt;code&gt;render(filter(parse(x)))&lt;/code&gt; requires parsing depth before understanding sequence. A model generates and reads code token by token, left to right, so linear structure matches that order. That's worth 2–3 tokens per pipeline. But it's a reasoning benefit, not a token savings, and I document it as such.&lt;/p&gt;

&lt;p&gt;If a design decision doesn't save tokens, you shouldn't claim it does.&lt;/p&gt;




&lt;h2&gt;This was built with AI, for a world that uses AI&lt;/h2&gt;

&lt;p&gt;I want to credit the tools that made this possible, because it's relevant to the story. Claude and Gemini have been significant collaborators on this project — not just autocomplete, but architectural reasoning, debugging, and in some cases pushing back on my design decisions in ways that made the language better. The experience of building a language alongside AI is part of why the language exists. Working that way makes the gaps in existing languages visible fast.&lt;/p&gt;

&lt;p&gt;Local models contributed too, along with GitHub Copilot and several others. The whole point of Candor is that it should be a better surface for that kind of collaboration.&lt;/p&gt;




&lt;h2&gt;Where to find everything&lt;/h2&gt;

&lt;p&gt;The benchmark tool, the raw JSON data, and the full methodology are all in the repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full benchmark report: &lt;a href="https://github.com/candor-core/candor/blob/main/docs/token_benchmark.md" rel="noopener noreferrer"&gt;docs/token_benchmark.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Three Forms spec (Agent / Verification / Machine): &lt;a href="https://github.com/candor-core/candor/blob/main/docs/three_forms.md" rel="noopener noreferrer"&gt;docs/three_forms.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Measurement tool: &lt;code&gt;benchmarks/tokenizer/token_analysis.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Raw results: &lt;code&gt;benchmarks/tokenizer/results/2026-04-23_claude-sonnet-4-6_3.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Source: &lt;a href="https://github.com/candor-core/candor" rel="noopener noreferrer"&gt;github.com/candor-core/candor&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tool re-runs against any model. I'll post updated baselines when major models ship — measurements shift as tokenizers evolve, and that's by design.&lt;/p&gt;

&lt;p&gt;I may be off track with parts of this. I hope it inspires something useful regardless.&lt;/p&gt;

&lt;p&gt;— Scott W. Corley&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>computerscience</category>
      <category>claude</category>
    </item>
  </channel>
</rss>
