<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Neelagiri65</title>
    <description>The latest articles on DEV Community by Neelagiri65 (@neelagiri65).</description>
    <link>https://dev.to/neelagiri65</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864660%2F20cf6369-86d1-490a-bf20-29e74c50e9ac.png</url>
      <title>DEV Community: Neelagiri65</title>
      <link>https://dev.to/neelagiri65</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neelagiri65"/>
    <language>en</language>
    <item>
      <title>AI spend is a black box. Trust is the meter.</title>
      <dc:creator>Neelagiri65</dc:creator>
      <pubDate>Wed, 24 Jun 2026 21:49:27 +0000</pubDate>
      <link>https://dev.to/neelagiri65/the-unsigned-meter-what-i-learned-trying-to-audit-ai-token-bills-2gah</link>
      <guid>https://dev.to/neelagiri65/the-unsigned-meter-what-i-learned-trying-to-audit-ai-token-bills-2gah</guid>
      <description>&lt;p&gt;An electricity meter is sealed. It is calibrated by a body that does not work for the utility, it can be read by the person paying, and a disputed bill has a physical artifact to point at. Metered billing works for one reason. The meter is trustworthy independently of the seller.&lt;/p&gt;

&lt;p&gt;An AI bill has no such meter. The counter sits inside the provider's serving stack. It reports how many tokens were used, the invoice is paid on that number, and most of what it counts is never returned. On a frontier reasoning model the bulk of the spend is reasoning and cache tokens, billed but never shown.&lt;/p&gt;

&lt;p&gt;There is no sealed meter. There is a number and a request to trust it.&lt;/p&gt;

&lt;p&gt;The pattern is universal. Every major model is metered by the token, from &lt;a href="https://openai.com" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; and &lt;a href="https://www.anthropic.com" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; to &lt;a href="https://ai.google" rel="noopener noreferrer"&gt;Google&lt;/a&gt;, &lt;a href="https://ai.meta.com" rel="noopener noreferrer"&gt;Meta&lt;/a&gt;, &lt;a href="https://mistral.ai" rel="noopener noreferrer"&gt;Mistral&lt;/a&gt;, &lt;a href="https://www.deepseek.com" rel="noopener noreferrer"&gt;DeepSeek&lt;/a&gt;, &lt;a href="https://x.ai" rel="noopener noreferrer"&gt;xAI&lt;/a&gt; and &lt;a href="https://qwenlm.github.io" rel="noopener noreferrer"&gt;Alibaba's Qwen&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The hyperscalers reselling them, &lt;a href="https://aws.amazon.com/bedrock" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;, &lt;a href="https://azure.microsoft.com/en-us/products/ai-services" rel="noopener noreferrer"&gt;Microsoft Azure&lt;/a&gt; and &lt;a href="https://cloud.google.com/vertex-ai" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt;, bill the same way, and gateways like &lt;a href="https://openrouter.ai" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; and &lt;a href="https://huggingface.co" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; pass the meter straight through. All of them keep the meter on their own side of the glass.&lt;/p&gt;

&lt;p&gt;I spent a few weeks building the independent meter reader. This is what the build found, including the part that proved the premise wrong, which turned out to be the most useful finding of all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The research said the problem was real
&lt;/h2&gt;

&lt;p&gt;Before writing code I ran the same adversarial research method I use for app store intelligence. Fan out across sources, extract falsifiable claims, then verify each with a three-vote pass where two refutations kill the claim. One hundred and three agents, ninety-seven claims, three killed under scrutiny. The kill list is the point. The goal is to be as hard on my own conclusions as the tool is meant to be on token counts. Hold that thought, because the same discipline later saved the project.&lt;/p&gt;

&lt;p&gt;The verdict was a real, academically validated white space.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CoIn (arXiv 2505.13778): users are billed for invisible reasoning tokens, which often account for the majority of the cost, yet have no way to verify their authenticity.&lt;/li&gt;
&lt;li&gt;Invisible Tokens, Visible Bills (arXiv 2505.18471): users are billed for operations they cannot observe, verify or contest.&lt;/li&gt;
&lt;li&gt;PALACE (arXiv 2508.00912): commercial services conceal internal reasoning traces while still charging for every generated token.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A preprint on token inflation put a number on the exposure. Hidden reasoning usage is inflatable by roughly 1,469% on average without detection. A hundred-dollar honest bill becomes about fifteen hundred. Treat the magnitudes as directional rather than settled, but three independent groups agree on the shape. The spend is largely for work that cannot be observed.&lt;/p&gt;

&lt;p&gt;Buried in that research was one sentence I read, nodded at, and did not actually absorb. &lt;/p&gt;

&lt;p&gt;The trust paradox:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Every audit must trust some artifact, but current frameworks trust exactly the ones a provider has the strongest reason to manipulate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Remember that one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The build, and the finding that stopped me
&lt;/h2&gt;

&lt;p&gt;The leading academic approach, CoIn, is cooperative. It needs the provider to build a Merkle tree of token fingerprints, commit the root, and serve proofs on audit. Elegant and commercially dead on arrival. No provider volunteers to make its own meter auditable.&lt;/p&gt;

&lt;p&gt;So the build went the other way. Passive and outside in. Retokenise the delivered output locally with the model's own tokenizer, and reconcile it against the reported number. No provider cooperation, nothing leaves the machine. Label every figure by confidence. EXACT when re-counted with the real tokenizer, BOUNDED when estimated within a band, UNVERIFIABLE for reasoning and cache, which are billed but never returned.&lt;/p&gt;

&lt;p&gt;Pointed at &lt;a href="https://www.byteplus.com" rel="noopener noreferrer"&gt;BytePlus&lt;/a&gt;, the finding that stopped me was video. A five-second clip, billed 246,840 tokens. Video is metered by a published formula, width times height times frames, divided by 1024, so the bill is re-derivable from the delivered file with ffprobe. It matched to the token. Gap zero. A second clip at a different resolution, 108,900 tokens, gap zero. Four live text completions, gap zero on all four.&lt;/p&gt;
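
&lt;p&gt;The re-derivation is plain arithmetic, sketched below. The function name and the floor division are my assumptions, and the clip dimensions are illustrative only: the article does not state them, and they were picked because they happen to reproduce the two billed figures. In practice the width, height and frame count would be read from the file with ffprobe rather than typed in.&lt;/p&gt;

```python
def video_tokens(width, height, frames):
    """Billable tokens for a clip: width * height * frames / 1024.

    Floor division is an assumption; the quoted formula does not say
    how a fractional result would be rounded.
    """
    return (width * height * frames) // 1024


def gap(reported, width, height, frames):
    """Reported minus re-derived tokens. Zero means the bill ties to the file."""
    return reported - video_tokens(width, height, frames)


# Illustrative inputs only: hypothetical dimensions and frame count.
print(video_tokens(1920, 1088, 121))        # 246840
print(video_tokens(1280, 720, 121))         # 108900
print(gap(246840, 1920, 1088, 121))         # 0
```

&lt;p&gt;A zero here says the reported number matches the delivered file. It says nothing about tokens that were billed and never delivered.&lt;/p&gt;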

&lt;p&gt;A wall of gap-zero results from live, paid calls. It looked like proof. A launch started forming around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question that broke it
&lt;/h2&gt;

&lt;p&gt;Someone asked a one-sentence question. Are these not the same number from two sides?&lt;/p&gt;

&lt;p&gt;And the trust paradox came back to collect.&lt;/p&gt;

&lt;p&gt;Here is the uncomfortable part. Re-counting the delivered output and matching the reported number is a consistency check. It binds the bill to the artifact that was handed over. It is not an independent measure of true cost. The provider counts the tokens the model generated. The tool counts the canonical encoding of the text the model chose to return. Two computation paths, so a match means generation was canonical and nothing was dropped in transit. &lt;/p&gt;

&lt;p&gt;A real check, the class of a checksum.&lt;/p&gt;

&lt;p&gt;Against an honest provider that check is a near-guaranteed pass. Against a rational one set on overcharging it is toothless, because nobody inflates the one bucket that can be recomputed. The inflation lives in the buckets that cannot be: reasoning, cache, the rate. The gap-zero wall was demonstrating the one thing that was never the risk.&lt;/p&gt;

&lt;p&gt;Worse was the reflex. On a small text gap, the instinct was to swap tokenizers until it vanished. That is the trust paradox in miniature, tuning the audit until it agrees with the bill. An audit that can be adjusted until it passes is not an audit. A gap, it turns out, has three causes that the number alone cannot separate: over-billing, the wrong tokenizer, or legitimate non-canonical generation. So a gap is a flag to investigate, never a verdict, exactly as a match is never proof.&lt;/p&gt;

&lt;p&gt;The research had said this in one sentence. Building the wrong thing was the cost of understanding it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually survives
&lt;/h2&gt;

&lt;p&gt;Killing the headline claim left the real one standing, and it is stronger.&lt;/p&gt;

&lt;p&gt;An AI bill cannot be independently verified. The verifiable part verifies itself, and the rest is structurally out of reach. What can be done is to measure how much of the bill has any ground truth at all. That is the honest, novel signal.&lt;/p&gt;

&lt;p&gt;The delivered part can be bound to the artifact, which catches a 1080p-billed-but-480p-delivered swap and catches metering bugs, including the prompt-cache failures that have over-billed real users by ten to twenty times. The undelivered part, reasoning and cache, most of a modern bill, is unverifiable by anyone, and the right move is to say so, loudly, with a number.&lt;/p&gt;

&lt;p&gt;The product is not "we check the bill". It is a measurement of how much of the bill nobody can check, and of how small the sliver is that can be. For a five-second video that sliver is a six-figure token count that at least ties to the file. For a reasoning call the sliver is almost nothing and the honest output is the size of the dark.&lt;/p&gt;
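
&lt;p&gt;A minimal sketch of that measurement, assuming each billed bucket has already been given one of the three confidence labels. The bucket names, the label assigned to each bucket and every token figure are made up for illustration. This is not TokenLedger's actual data model.&lt;/p&gt;

```python
from dataclasses import dataclass

EXACT = "EXACT"
BOUNDED = "BOUNDED"
UNVERIFIABLE = "UNVERIFIABLE"


@dataclass
class Bucket:
    name: str
    billed_tokens: int
    confidence: str


def coverage(buckets):
    """Share of billed tokens in each confidence class, as fractions of 1."""
    total = sum(b.billed_tokens for b in buckets)
    shares = {EXACT: 0.0, BOUNDED: 0.0, UNVERIFIABLE: 0.0}
    if total == 0:
        return shares
    for b in buckets:
        shares[b.confidence] += b.billed_tokens / total
    return shares


# Hypothetical reasoning call: all figures invented for the example.
bill = [
    Bucket("visible output", 500, EXACT),
    Bucket("prompt", 1500, BOUNDED),
    Bucket("reasoning", 6000, UNVERIFIABLE),
    Bucket("cache", 2000, UNVERIFIABLE),
]
shares = coverage(bill)
print(f"unverifiable share: {shares[UNVERIFIABLE]:.0%}")
```

&lt;p&gt;The headline output is the unverifiable share, not a pass or fail on the bill.&lt;/p&gt;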

&lt;h2&gt;
  
  
  The discipline is admitting where verification ends
&lt;/h2&gt;

&lt;p&gt;The thing that saved this project is the thing that built it. An adversarial pass that kills the claims which do not survive. It killed three of ninety-seven claims in the literature. It should have been run on the headline before anyone got attached to gap zero. When a one-sentence question can dismantle the strongest demo, the demo was the problem.&lt;/p&gt;

&lt;p&gt;It is also the house style. The app-store work is outside-in. Read anyone's public reviews, cooperate with no one, and treat those reviews as the only honest trend source. This is the same philosophy aimed at billing. Read the artifact that was handed over, cooperate with no one, and be ruthless about the line between what can be known and what is being asked on trust. The discipline is not the verification. The discipline is admitting where verification ends.&lt;/p&gt;

&lt;p&gt;The AI meter is unsigned and a large part of it reads in the dark. The useful move is not to pretend the dark can be read. It is to measure exactly how much of it there is.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TokenLedger is open source under Apache-2.0: &lt;code&gt;pip install retoken&lt;/code&gt;. The known limitations, including the rule that a gap is a flag and never a verdict, are written up in the repository, because a tool about trust should hold itself to its own standard.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>finops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The App Store's silent giants: AI assistants reply to almost none of their reviewers</title>
      <dc:creator>Neelagiri65</dc:creator>
      <pubDate>Sun, 21 Jun 2026 15:22:46 +0000</pubDate>
      <link>https://dev.to/neelagiri65/the-app-stores-silent-giants-ai-assistants-reply-to-almost-none-of-their-reviewers-hj9</link>
      <guid>https://dev.to/neelagiri65/the-app-stores-silent-giants-ai-assistants-reply-to-almost-none-of-their-reviewers-hj9</guid>
      <description>&lt;p&gt;An App Store rating looks like a verdict. It behaves more like a monument, built over years and slow to move. It says very little about how this month's users feel.&lt;/p&gt;

&lt;p&gt;I took the 12 most-rated Productivity apps on the US App Store, 32 million ratings between them, and split the headline star into the two numbers it hides: how far recent sentiment has fallen below the lifetime average, and whether the developer replies when users complain.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it is measured
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Population truth.&lt;/strong&gt; Lifetime ratings and the star histogram come from Apple's full ratings data, every rating an app has ever received.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent sentiment.&lt;/strong&gt; A fixed window of the most recent reviews by date, so an app captured to a depth of thousands is not compared on a multi-year average against an app with a few hundred. Same window for everyone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer response.&lt;/strong&gt; Reply share and median latency over that recent window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complaints&lt;/strong&gt; are bucketed with a rule-based taxonomy. It is a heuristic, not a trained classifier, and I treat it as one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What turned up
&lt;/h2&gt;

&lt;p&gt;The AI assistants now own this chart, and they reply to almost no one.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;App&lt;/th&gt;
&lt;th&gt;Lifetime&lt;/th&gt;
&lt;th&gt;Recent&lt;/th&gt;
&lt;th&gt;Reply share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;4.8&lt;/td&gt;
&lt;td&gt;4.18&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;4.7&lt;/td&gt;
&lt;td&gt;3.06&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok&lt;/td&gt;
&lt;td&gt;4.9&lt;/td&gt;
&lt;td&gt;3.77&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Perplexity&lt;/td&gt;
&lt;td&gt;4.8&lt;/td&gt;
&lt;td&gt;3.60&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Gemini&lt;/td&gt;
&lt;td&gt;4.7&lt;/td&gt;
&lt;td&gt;3.65&lt;/td&gt;
&lt;td&gt;13%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dropbox&lt;/td&gt;
&lt;td&gt;4.8&lt;/td&gt;
&lt;td&gt;2.75&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gmail&lt;/td&gt;
&lt;td&gt;4.7&lt;/td&gt;
&lt;td&gt;2.40&lt;/td&gt;
&lt;td&gt;26%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Drive&lt;/td&gt;
&lt;td&gt;4.8&lt;/td&gt;
&lt;td&gt;3.90&lt;/td&gt;
&lt;td&gt;23%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft Authenticator&lt;/td&gt;
&lt;td&gt;4.7&lt;/td&gt;
&lt;td&gt;2.18&lt;/td&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The older tools are the ones still in the trenches: Dropbox answers 58% of recent reviewers, Gmail 26%, Drive 23%. The steepest recent drops belong to Microsoft Authenticator (4.7 to 2.18), Gmail (4.7 to 2.40) and Dropbox (4.8 to 2.75).&lt;/p&gt;

&lt;p&gt;Plotted on two axes, backlash against response, every app falls into one of four archetypes: Firefighters, Ghost Ships, Complacent Giants and Resilient Leaders. Eight of the twelve are Ghost Ships, taking a recent hit in near silence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest limits
&lt;/h2&gt;

&lt;p&gt;Recent reviewers self-select toward the dissatisfied. A person who hits a bug is far more likely to leave a review than a contented one, so a low recent average blends genuine decline with that bias, and this data cannot cleanly separate the two. I tie no drop to a specific app release, because the version data is too sparse to support that claim. The lifetime figure is population truth; the recent figure is a biased sample; I never present one as the other.&lt;/p&gt;

&lt;p&gt;The full interactive Friction Matrix, the per-app complaint archetypes, and the method in detail are here: &lt;a href="https://nativerse-ventures.com/productivity-friction-matrix" rel="noopener noreferrer"&gt;https://nativerse-ventures.com/productivity-friction-matrix&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Independent research from the Nativerse lab. Figures are public App Store data, cited, not invented.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>appstore</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The day a refactor passed on my laptop and failed on yours</title>
      <dc:creator>Neelagiri65</dc:creator>
      <pubDate>Sat, 13 Jun 2026 09:48:34 +0000</pubDate>
      <link>https://dev.to/neelagiri65/the-day-a-refactor-passed-on-my-laptop-and-failed-on-yours-1o31</link>
      <guid>https://dev.to/neelagiri65/the-day-a-refactor-passed-on-my-laptop-and-failed-on-yours-1o31</guid>
      <description>&lt;p&gt;Most of the code being written right now is not being written. It is being&lt;br&gt;
generated, glanced at, then merged. The reviewer is tired. The diff is large.&lt;br&gt;
Increasingly the reviewer is itself a language model summarising the work of&lt;br&gt;
another language model. Somewhere in that loop there is supposed to be a moment&lt;br&gt;
where someone confirms the change did what it claimed. Often there isn't.&lt;/p&gt;

&lt;p&gt;I wanted a small, boring tool to fill that gap. Take a function from before a&lt;br&gt;
refactor and after. Run both on the same inputs. Tell me plainly whether the&lt;br&gt;
behaviour changed. Not an opinion. Not a confidence score. A result I could&lt;br&gt;
rerun next week to the same answer, byte for byte. If a teammate ran it on their&lt;br&gt;
machine they should get my exact result, not something close.&lt;/p&gt;

&lt;p&gt;That last sentence sounds trivial. It is the entire problem. This is the story of&lt;br&gt;
where it broke and why the fix turned out to be the most important design decision&lt;br&gt;
in the whole tool.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why rerunning it is the only claim worth making
&lt;/h2&gt;

&lt;p&gt;There is no shortage of tools that review your pull request. The newer ones are&lt;br&gt;
language models with a nice interface. They are useful. They are also the same&lt;br&gt;
kind of thing that wrote the code: a probabilistic system giving you its&lt;br&gt;
impression. Ask the same one twice and you can get two different reviews. In a&lt;br&gt;
world where a model wrote the diff, a model reviewing the diff is the same fallible&lt;br&gt;
loop checking its own work.&lt;/p&gt;

&lt;p&gt;So I did not want to add another opinion. I wanted a verdict with a property no&lt;br&gt;
opinion has: you can reproduce it. Run the check. Get a result. That result is a&lt;br&gt;
function of the inputs and nothing else. No wall clock. No network. No luck&lt;br&gt;
particular to one machine. Same inputs in, same answer out, on any computer.&lt;/p&gt;

&lt;p&gt;If you have that, you can sign it and hand it to someone who does not trust you.&lt;br&gt;
They rerun it and confirm it themselves. The trust comes from reproduction, not&lt;br&gt;
from my reputation or my model's confidence. That is the whole pitch. It only works&lt;br&gt;
if the reproduction is real.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where it broke: a function that returned a float
&lt;/h2&gt;

&lt;p&gt;Early on the tool handled integers, strings, lists of integers. Clean, exact, the&lt;br&gt;
same on every machine. Then I pointed it at a numerical function. A refactor of an&lt;br&gt;
averaging routine. The kind of change an AI assistant produces ten times a day.&lt;/p&gt;

&lt;p&gt;On my Mac the check said the two versions diverged on one input. On a Linux box in&lt;br&gt;
CI it said they were identical. Same code. Same inputs. Two different verdicts.&lt;/p&gt;

&lt;p&gt;This is the nightmare for a tool whose only selling point is reproducibility. A&lt;br&gt;
verdict that depends on the machine is not a verdict. It is a rumour.&lt;/p&gt;

&lt;p&gt;The cause is not a bug in my tool. It is the nature of floating point arithmetic.&lt;br&gt;
It is worth understanding, since almost every "we test your AI code" tool will hit&lt;br&gt;
it and most will quietly paper over it.&lt;/p&gt;
&lt;h2&gt;
  
  
  What IEEE 754 promises and what it does not
&lt;/h2&gt;

&lt;p&gt;Floating point numbers follow a standard called IEEE 754. The standard is precise&lt;br&gt;
about which operations are guaranteed to give the same answer everywhere. That&lt;br&gt;
guarantee is narrower than people assume.&lt;/p&gt;

&lt;p&gt;The basic operations are correctly rounded. Addition. Subtraction. Multiplication.&lt;br&gt;
Division. Square root. The fused multiply add. Each is required to return the&lt;br&gt;
single nearest representable result, every time, on every conforming machine. At&lt;br&gt;
double precision with the default rounding mode these operations are identical bit&lt;br&gt;
for bit whether you run them on an Apple chip or an Intel server. There is no&lt;br&gt;
ambiguity. There is no luck of the platform. Two different expressions built only&lt;br&gt;
from these operations will either agree on every machine or disagree on every&lt;br&gt;
machine. Which of the two never depends on the platform.&lt;/p&gt;

&lt;p&gt;The functions you reach for next are not covered. Sine. Cosine. Exponential.&lt;br&gt;
Logarithm. Raising to a fractional power. For these the standard only recommends&lt;br&gt;
correct rounding. It does not require it. The reason is a genuinely hard maths&lt;br&gt;
problem, sometimes called the table maker's dilemma: computing the last bit&lt;br&gt;
correctly for these functions can need enormous intermediate precision.&lt;br&gt;
Implementations make different tradeoffs. The C maths library on macOS and the&lt;br&gt;
one on Linux can legitimately return results that differ in the final bit.&lt;/p&gt;

&lt;p&gt;That final bit is exactly what bit me. My averaging refactor touched a function&lt;br&gt;
whose two versions agreed to the last bit under one maths library and disagreed&lt;br&gt;
under another. Neither machine was wrong. The standard permits both. My tool was&lt;br&gt;
trying to render a global verdict on a quantity that is, by design, local.&lt;/p&gt;
&lt;h2&gt;
  
  
  The decision: refuse what you cannot reproduce, by name
&lt;/h2&gt;

&lt;p&gt;There were two tempting fixes. Both are traps.&lt;/p&gt;

&lt;p&gt;The first is to round the results before comparing. Compare to twelve decimal&lt;br&gt;
places and call it equal. This feels reasonable. It is not safe. A real difference&lt;br&gt;
in the last bit can sit right on the rounding boundary. One machine rounds up. The&lt;br&gt;
other rounds down. Rounding does not remove the disagreement. It hides it sometimes&lt;br&gt;
and invents it other times. You have traded a guarantee for a coin flip.&lt;/p&gt;
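
&lt;p&gt;The boundary case is easy to construct. The sketch below takes two adjacent doubles, one ulp apart, that straddle a twelve-decimal rounding boundary, so rounding turns a last-bit difference into a visible one. The starting value is an arbitrary choice of mine, and &lt;code&gt;math.nextafter&lt;/code&gt; needs Python 3.9 or later.&lt;/p&gt;

```python
import math

DECIMALS = 12

# The exact decimal midpoint 1.0000000000005 is not representable as a
# double, so the nearest doubles sit on either side of it.
x = 1.0000000000005
below = math.nextafter(x, 0.0)  # one ulp towards zero
above = math.nextafter(x, 2.0)  # one ulp towards two

# x lies on one side of the midpoint; pair it with the neighbour on the other.
if round(below, DECIMALS) != round(x, DECIMALS):
    a, b = below, x
else:
    a, b = x, above

print(a.hex(), "rounds to", round(a, DECIMALS))
print(b.hex(), "rounds to", round(b, DECIMALS))
print("still equal after rounding:", round(a, DECIMALS) == round(b, DECIMALS))
```

&lt;p&gt;If one machine produces the first value and another produces the second, comparing rounded results reports a difference in the twelfth decimal place where the raw difference was a single bit.&lt;/p&gt;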

&lt;p&gt;The second is to compare with a tolerance. Equal if within some epsilon. Now your&lt;br&gt;
tool no longer answers the question it was asked. "Did this refactor preserve the&lt;br&gt;
behaviour" has quietly become "is the new behaviour close enough for my taste." For&lt;br&gt;
a tool whose only asset is a precise reproducible verdict, that is the asset gone.&lt;/p&gt;

&lt;p&gt;The fix that actually holds is less clever and more honest. The tool admits a&lt;br&gt;
floating point function only when its computation stays inside the correctly&lt;br&gt;
rounded operations. Those are reproducible across machines, since the standard&lt;br&gt;
makes them so. The moment a function reaches for a transcendental, the tool does&lt;br&gt;
not guess and does not round. It refuses, by name. It says so:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clamp_average  REFUSED  depends on a platform-variable transcendental (math.exp);
                        a cross-host reproducible verdict is not possible here.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
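
&lt;p&gt;One way such a refusal could be decided is a static scan for calls the standard does not require to be correctly rounded. The sketch below is a hypothetical illustration using Python's &lt;code&gt;ast&lt;/code&gt; module, not how equiv itself works. The deny-list is my assumption and is incomplete (it ignores the &lt;code&gt;**&lt;/code&gt; operator and aliased imports), and the body of &lt;code&gt;clamp_average&lt;/code&gt; is invented for the example.&lt;/p&gt;

```python
import ast

# Assumed deny-list: math-module functions that IEEE 754 only recommends,
# not requires, to be correctly rounded. Hypothetical and incomplete.
PLATFORM_VARIABLE = {"sin", "cos", "tan", "exp", "log", "log2", "log10", "pow"}


def refusal_reasons(source):
    """Return the sorted names of platform-variable math calls in source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "math"
            and func.attr in PLATFORM_VARIABLE
        ):
            found.add("math." + func.attr)
    return sorted(found)


# Invented function body, for illustration only.
SOURCE = '''
import math

def clamp_average(xs):
    return min(1.0, math.exp(sum(xs) / len(xs)))
'''

reasons = refusal_reasons(SOURCE)
print("REFUSED" if reasons else "ADMITTED", reasons)
```

&lt;p&gt;For the sample source this prints a refusal naming &lt;code&gt;math.exp&lt;/code&gt;. A function built only from addition, subtraction, multiplication and division comes back with an empty list.&lt;/p&gt;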



&lt;p&gt;Agreement across machines comes from restriction, not from cleverness. Inside the&lt;br&gt;
admissible set the raw bits are already identical everywhere. The tool records the&lt;br&gt;
result as its exact bit pattern, with no rounding and no massaging. A NaN is&lt;br&gt;
normalised to a single canonical form, because a NaN payload is not observable&lt;br&gt;
behaviour. The sign of a zero is preserved exactly, because it is observable:&lt;br&gt;
dividing a positive number by positive zero and by negative zero gives positive&lt;br&gt;
and negative infinity. The details matter. The rule behind all of them is one&lt;br&gt;
sentence.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A value is admissible only if the verdict it produces is identical on every&lt;br&gt;
machine. Everything else is refused, out loud.&lt;/p&gt;
&lt;/blockquote&gt;
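
&lt;p&gt;A minimal sketch of recording a double as its exact bit pattern, with NaNs collapsed and the sign of zero kept. The function name and the choice of canonical NaN are my assumptions, not equiv's. Python raises &lt;code&gt;ZeroDivisionError&lt;/code&gt; on float division by zero, so the sign of zero is observed here with &lt;code&gt;math.copysign&lt;/code&gt; instead.&lt;/p&gt;

```python
import math
import struct

# Assumed canonical form: quiet NaN, zero payload, sign bit clear.
CANONICAL_NAN_BITS = 0x7FF8000000000000


def canonical_bits(x):
    """Return the 64-bit pattern of a double as an int, with NaNs collapsed."""
    if math.isnan(x):
        return CANONICAL_NAN_BITS
    # "!" selects network (big-endian) byte order, fixed on every host.
    (bits,) = struct.unpack("!Q", struct.pack("!d", x))
    return bits


print(hex(canonical_bits(0.0)))   # 0x0
print(hex(canonical_bits(-0.0)))  # 0x8000000000000000
print(0.0 == -0.0)                # True, yet the bit patterns differ
print(math.copysign(1.0, -0.0))   # -1.0, so the sign is observable
```

&lt;p&gt;Comparing the integers rather than the floats is what keeps positive and negative zero distinct, since the two compare equal as floats.&lt;/p&gt;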

&lt;h2&gt;
  
  
  Why refusing is a feature, not a weakness
&lt;/h2&gt;

&lt;p&gt;It is uncomfortable to ship a tool that says "I will not judge this." The instinct&lt;br&gt;
is to maximise coverage so the tool looks capable. That instinct is how you end up&lt;br&gt;
with a tool that confidently lies a small fraction of the time, which is worse than&lt;br&gt;
useless for anything you would actually rely on.&lt;/p&gt;

&lt;p&gt;The refusal is the thing that makes every other answer trustworthy. When the tool&lt;br&gt;
says two versions are equivalent, it is staking that claim on a verdict it can&lt;br&gt;
reproduce anywhere. When it cannot make that promise it tells you. Then you reach&lt;br&gt;
for a human or a different technique. You are never handed a green light that was&lt;br&gt;
really a shrug.&lt;/p&gt;

&lt;p&gt;This is the opposite of the marketing reflex, which is to claim more. The claim&lt;br&gt;
here is deliberately small and completely solid: these specific behaviours were&lt;br&gt;
checked on these specific inputs, the result reproduces to identical bytes&lt;br&gt;
everywhere, here is everything I declined to check. Small and true beats broad and&lt;br&gt;
shaky. That is true above all for the one job where you are trying to replace a&lt;br&gt;
rubber stamp with something you can stand behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is, plainly
&lt;/h2&gt;

&lt;p&gt;The tool is called equiv. It runs a changed function and its previous version on&lt;br&gt;
the same generated inputs. It reports whether they diverged, with the exact input&lt;br&gt;
that broke them when they do. It produces a signed receipt of what was checked,&lt;br&gt;
addressed by its content, which anyone can rerun to the same bytes. It is not a&lt;br&gt;
prover. It is bounded testing: a pass means no divergence was found on the inputs&lt;br&gt;
it tried, not that none exists. It says so. It checks mechanical behaviour, never&lt;br&gt;
intent or architecture. It tells you that too.&lt;/p&gt;
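
&lt;p&gt;The core loop of bounded testing can be sketched in a few lines. This is an illustration in Python under my own assumptions (a seeded generator, integer lists as inputs, invented function names), not equiv's implementation. It shows the shape of the claim: same inputs to both versions, the first diverging input reported, and &lt;code&gt;None&lt;/code&gt; meaning only that nothing was found.&lt;/p&gt;

```python
import random


def first_divergence(old, new, trials=1000, seed=0):
    """Run both versions on the same seeded inputs.

    Returns the first input where the results differ, or None if no
    divergence was found within the given number of trials.
    """
    rng = random.Random(seed)  # fixed seed: the same inputs on every run
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(1, 5))]
        if old(xs) != new(xs):
            return xs
    return None


def total_old(xs):
    return sum(xs)


def total_new(xs):
    # A hypothetical refactor that silently drops the last element.
    return sum(xs[:-1])


print(first_divergence(total_old, total_new))  # a list where the versions differ
print(first_divergence(total_old, total_old))  # None
```

&lt;p&gt;Integer inputs keep the example inside the exactly reproducible domain. A &lt;code&gt;None&lt;/code&gt; result is bounded by the trial count and the input generator, which is why a pass is not a proof.&lt;/p&gt;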

&lt;p&gt;That is the honest shape of it. In a field full of tools that review your code by&lt;br&gt;
having a model form an impression, the contribution here is not intelligence. It is&lt;br&gt;
the refusal to pretend. A verdict you can reproduce. A clear list of what was not&lt;br&gt;
checked. A flat "no" whenever a yes would not survive being run on a different&lt;br&gt;
machine.&lt;/p&gt;

&lt;p&gt;The hard part was never generating inputs or comparing outputs. It was deciding,&lt;br&gt;
before writing the code, exactly which questions the tool is allowed to answer with&lt;br&gt;
certainty, then being willing to say nothing about the rest.&lt;/p&gt;




&lt;p&gt;equiv is open source under the Apache 2.0 licence and runs as a GitHub Action:&lt;br&gt;
&lt;a href="https://github.com/Neelagiri65/equiv" rel="noopener noreferrer"&gt;github.com/Neelagiri65/equiv&lt;/a&gt;. If you work on&lt;br&gt;
numerical or cross language equivalence and I have got a detail wrong, I would&lt;br&gt;
genuinely like to hear it.&lt;/p&gt;

&lt;p&gt;Built at &lt;a href="https://nativerse-ventures.com" rel="noopener noreferrer"&gt;Nativerse Ventures&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>testing</category>
      <category>programming</category>
    </item>
    <item>
      <title>Bharataddress v0.2 — The Complete Open Source Indian Address Toolkit</title>
      <dc:creator>Neelagiri65</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:21:10 +0000</pubDate>
      <link>https://dev.to/neelagiri65/why-indian-address-parsing-is-broken-and-what-i-built-to-fix-it-2pne</link>
      <guid>https://dev.to/neelagiri65/why-indian-address-parsing-is-broken-and-what-i-built-to-fix-it-2pne</guid>
      <description>&lt;p&gt;Built a Python toolkit for Indian addresses. 26,700+ pincodes, no standard format, landmarks instead of street names, multiple scripts. The usual chaos.&lt;/p&gt;

&lt;p&gt;bharataddress handles parsing, formatting, validation, geocoding, address similarity, batch processing and DIGIPIN encoding. All offline. No API keys. No ML. 4.3MB total.&lt;/p&gt;

&lt;p&gt;62.5% exact match on a public 200-address gold set. Tested head to head against Shiprocket's 760MB TinyBERT NER model on the same test set. bharataddress wins on 6 of 9 fields. Fully reproducible.&lt;/p&gt;

&lt;p&gt;What you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parse() turns messy address strings into structured JSON&lt;/li&gt;
&lt;li&gt;geocode() gives you lat/lng from pincode centroids for 16,400+ pincodes&lt;/li&gt;
&lt;li&gt;encode_digipin() generates India Post's new 10-char geo-code&lt;/li&gt;
&lt;li&gt;format() outputs India Post / single-line / shipping label styles&lt;/li&gt;
&lt;li&gt;validate() checks consistency and flags whether an address is deliverable&lt;/li&gt;
&lt;li&gt;address_similarity() gives you a 0-1 score for dedup&lt;/li&gt;
&lt;li&gt;parse_csv() and parse_dataframe() for bulk processing&lt;/li&gt;
&lt;li&gt;extract_state_from_gstin() pulls state from GST numbers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;pip install bharataddress&lt;br&gt;
&lt;a href="https://github.com/Neelagiri65/bharataddress" rel="noopener noreferrer"&gt;https://github.com/Neelagiri65/bharataddress&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;100 tests. MIT licensed. First open-source Indian address parser with DIGIPIN support.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>opensource</category>
      <category>beginners</category>
      <category>python</category>
    </item>
  </channel>
</rss>
