DEV Community: Neelagiri65

The AI everyone talks about is not the one developers run

Neelagiri65 — Mon, 06 Jul 2026 14:14:05 +0000

Ask most people who is winning AI and you get the same answer. OpenAI. Anthropic. Google. The trillion-dollar labs and their flagship models at five dollars a million tokens. That is the conversation. It is not the usage.

The board

OpenRouter publishes a live ranking of the models developers actually send traffic to, ordered by real use. As I write this, the number one model is DeepSeek V4 Flash, at nine cents per million tokens. The whole top of the board is cheap models, most of them open-weight. The celebrated American flagships, Claude Opus among them, sit at seventh and eighth, at five dollars. More than fifty times the price of the model outranking them.

Here is the actual board, printed by a command you can run yourself in the next thirty seconds:

You remember the DeepSeek moment. In early 2025 a cheap open model wiped a trillion dollars off Nvidia in a single day and the story became "China caught up". The headlines moved on within a fortnight. The board did not. On the surface where developers pick models for real work, the cheap open models never gave the lead back.

One caveat, up front, because it matters

This is OpenRouter's traffic. OpenRouter is where developers route across many providers to optimise price and performance, so it over-represents the cost-conscious and the open. Teams that call the OpenAI or Anthropic APIs directly are not counted here. So this is not "nobody uses Claude". Plenty do, for good reasons. The honest claim is narrower and more interesting: on the one public board where developers vote with real spend on price and performance, the vote is not going where the headlines point.

I lead with that caveat on purpose. It is the first thing a sharp reader would raise, and a number is only worth showing if you are honest about what it does and does not measure.

Why the gap costs you

If you are choosing what to build on, the distance between the conversation and the usage is exactly the thing that bites. The conversation is optimised for launches and benchmark charts. The usage is optimised for your bill and your latency. Those are different objectives, and only one of them shows up in a keynote. A team that standardised on a five-dollar flagship because it was the name in the room, when a nine-cent model would have carried most of the load, is paying a tax on the discourse.
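
To put rough numbers on that tax, here is a minimal sketch. The two prices are the ones quoted above. The monthly volume and the 80% share routed to the cheaper model are hypothetical, picked only for illustration, and the function name is mine.

```python
# Prices per million tokens, in US cents, as quoted in this post.
FLAGSHIP_CENTS = 500   # five dollars
CHEAP_CENTS = 9        # nine cents

def monthly_cost_cents(million_tokens: int, cheap_share_pct: int) -> int:
    """Cost in cents when cheap_share_pct percent of traffic goes to the cheap model."""
    cheap = million_tokens * cheap_share_pct // 100
    flagship = million_tokens - cheap
    return cheap * CHEAP_CENTS + flagship * FLAGSHIP_CENTS

# Hypothetical workload: 1,000 million tokens a month.
print(monthly_cost_cents(1000, 0) / 100)    # 5000.0 dollars, everything on the flagship
print(monthly_cost_cents(1000, 80) / 100)   # 1072.0 dollars, 80% on the cheap model
```

Whether 80% of a workload can really move is the judgment call. The arithmetic only shows what is at stake if it can.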

And this is the shape of almost every number in AI right now. A benchmark a lab chose to report. A ranking with no method attached. A "fastest model" with nothing to compare it to. A status badge that is a claim, not a measurement. You cannot navigate an ecosystem on numbers you cannot trace.

Insight you can check, in the terminal

That is why we built gawk. It reads the public signals of the AI world and gives you the live pulse, with one rule above the rest: cited, not invented. Which models are actually used, which developer SDKs are rising and falling, which tools are healthy right now, what shipped today. Every figure links to the source it came from. And it is in the place you already work.

npx gawk-cli models

Every card prints its source, its age, and a URL. It derives nothing beyond arithmetic on numbers it also shows you. The ranking above is not a screenshot from a design tool. It is the live board.

Be clear about the limit. gawk will not tell you which model is best for your problem. That is your judgment, on your workload, and no leaderboard can make it for you. What it will tell you, with a source attached to every figure, is what the ecosystem is actually doing, so your judgment starts from facts instead of from a feed.

The board is public. The numbers trace. Do not take the conversation's word for it, and do not take mine. Run it.

npx gawk-cli

Originally published at nativerse-ventures.com. gawk is free and open to run: npx gawk-cli · gawk.dev · github.com/Neelagiri65/gawk-cli

AI spend is a black box. Trust is the meter.

Neelagiri65 — Wed, 24 Jun 2026 21:49:27 +0000

An electricity meter is sealed. It is calibrated by a body that does not work for the utility, it can be read by the person paying, and a disputed bill has a physical artifact to point at. Metered billing works for one reason. The meter is trustworthy independently of the seller.

An AI bill has no such meter. The counter sits inside the provider's serving stack. It reports how many tokens were used, the invoice is paid on that number, and most of what it counts is never returned. On a frontier reasoning model the bulk of the spend is reasoning and cache tokens, billed but never shown.

There is no sealed meter. There is a number and a request to trust it.

The pattern is universal. Every major model meters by the token, from OpenAI and Anthropic to Google, Meta, Mistral, DeepSeek, xAI and Alibaba's Qwen.

The hyperscalers reselling them, AWS, Microsoft Azure and Google Cloud, bill the same way, and gateways like OpenRouter and Hugging Face pass the meter straight through. All of them keep the meter on their own side of the glass.

I spent a few weeks building the independent meter reader. This is what the build found, including the part that proved the premise wrong, which turned out to be the most useful finding of all.

The research said the problem was real

Before writing code I ran the same adversarial research method I use for app store intelligence. Fan out across sources, extract falsifiable claims, then verify each with a three-vote pass where two refutations kill the claim. One hundred and three agents, ninety-seven claims, three killed under scrutiny. The kill list is the point. The goal is to be as hard on my own conclusions as the tool is meant to be on token counts. Hold that thought, because the same discipline later saved the project.

The verdict was a real, academically validated white space.

CoIn (arXiv 2505.13778): users are billed for invisible reasoning tokens, which often account for the majority of the cost, yet have no way to verify their authenticity.
Invisible Tokens, Visible Bills (arXiv 2505.18471): users are billed for operations they cannot observe, verify or contest.
PALACE (arXiv 2508.00912): commercial services conceal internal reasoning traces while still charging for every generated token.

A preprint on token inflation put a number on the exposure. Hidden reasoning usage is inflatable by roughly 1,469% on average without detection. A hundred-dollar honest bill becomes about fifteen hundred. Treat the magnitudes as directional rather than settled, but three independent groups agree on the shape. The spend is largely for work that cannot be observed.

Buried in that research was one sentence I read, nodded at, and did not actually absorb.

The trust paradox:

Every audit must trust some artifact, but current frameworks trust exactly the ones a provider has the strongest reason to manipulate.

Remember that one.

The build, and the finding that stopped me

The leading academic approach, CoIn, is cooperative. It needs the provider to build a Merkle tree of token fingerprints, commit the root, and serve proofs on audit. Elegant and commercially dead on arrival. No provider volunteers to make its own meter auditable.

So the build went the other way. Passive and outside-in. Retokenise the delivered output locally with the model's own tokenizer, and reconcile it against the reported number. No provider cooperation, nothing leaves the machine. Label every figure by confidence. EXACT when re-counted with the real tokenizer, BOUNDED when estimated within a band, UNVERIFIABLE for reasoning and cache, which are billed but never returned.
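
The reconcile step can be sketched as below. This is an illustration under stated assumptions, not TokenLedger's actual code: the `Bucket` type, the `reconcile` function and the 5% band are hypothetical, and the tokenizer is passed in as a plain callable so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Bucket:
    name: str                      # e.g. "output", "reasoning", "cache"
    reported: int                  # tokens the provider billed for this bucket
    delivered_text: Optional[str]  # None when the content was never returned

def reconcile(bucket: Bucket,
              count_tokens: Callable[[str], int],
              exact_tokenizer: bool,
              band: float = 0.05) -> dict:
    """Label one billed bucket and flag any gap for investigation."""
    if bucket.delivered_text is None:
        # Billed but never returned: there is nothing local to count.
        return {"bucket": bucket.name, "label": "UNVERIFIABLE",
                "gap": None, "flag": False}
    recounted = count_tokens(bucket.delivered_text)
    gap = bucket.reported - recounted
    if exact_tokenizer:
        label, flag = "EXACT", gap != 0
    else:
        # An approximate tokenizer only bounds the count within a band.
        label, flag = "BOUNDED", abs(gap) > band * recounted
    return {"bucket": bucket.name, "label": label, "gap": gap, "flag": flag}

# Stand-in tokenizer for the demo only: whitespace split, not a real model tokenizer.
words = lambda text: len(text.split())
print(reconcile(Bucket("output", 4, "one two three four"), words, True))
print(reconcile(Bucket("reasoning", 9000, None), words, True))
```

The `flag` field is deliberately only a flag. A non-zero gap says "look here", and it cannot say why.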

Pointed at BytePlus, the finding that stopped me was video. A five-second clip, billed 246,840 tokens. Video is metered by a published formula, width times height times frames, divided by 1024, so the bill is re-derivable from the delivered file with ffprobe. It matched to the token. Gap zero. A second clip at a different resolution, 108,900 tokens, gap zero. Four live text completions, gap zero on all four.
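
The video formula is simple enough to check by hand. A minimal sketch follows. The post does not state the clips' dimensions or frame counts, so the inputs below are hypothetical, chosen only because they reproduce the two billed figures. In practice the three values would be read from the delivered file with ffprobe. Whether the provider floors or rounds a non-integer result is also an assumption here.

```python
def video_tokens(width: int, height: int, frames: int) -> int:
    """Tokens for a clip: width x height x frames, divided by 1024."""
    return width * height * frames // 1024

# Hypothetical inputs (not taken from the post) that reproduce the billed counts.
print(video_tokens(1920, 1088, 121))   # 246840
print(video_tokens(1280, 720, 121))    # 108900
```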

A wall of gap-zero results from live, paid calls. It looked like proof. A launch started forming around it.

The question that broke it

Someone asked one sentence. Are these not the same number from two sides?

And the trust paradox came back to collect.

Here is the uncomfortable part. Re-counting the delivered output and matching the reported number is a consistency check. It binds the bill to the artifact that was handed over. It is not an independent measure of true cost. The provider counts the tokens the model generated. The tool counts the canonical encoding of the text the model chose to return. Two computation paths, so a match means generation was canonical and nothing was dropped in transit.

A real check, the class of a checksum.

Against an honest provider that check is a near-guaranteed pass. Against a rational one set on overcharging it is toothless, because nobody inflates the one bucket that can be recomputed. The inflation lives in the buckets that cannot be: reasoning, cache, the rate. The gap-zero wall was demonstrating the one thing that was never the risk.

Worse was the reflex. On a small text gap, the instinct was to swap tokenizers until it vanished. That is the trust paradox in miniature, tuning the audit until it agrees with the bill. An audit that can be adjusted until it passes is not an audit. A gap, it turns out, has three causes that the number alone cannot separate: over-billing, the wrong tokenizer, or legitimate non-canonical generation. So a gap is a flag to investigate, never a verdict, exactly as a match is never proof.

The research had said this in one sentence. Building the wrong thing was the cost of understanding it.

What actually survives

Killing the headline claim left the real one standing, and it is stronger.

An AI bill cannot be independently verified. The verifiable part verifies itself, and the rest is structurally out of reach. What can be done is to measure how much of the bill has any ground truth at all. That is the honest, novel signal.

The delivered part can be bound to the artifact, which catches a 1080p-billed-but-480p-delivered swap and catches metering bugs, including the prompt-cache failures that have over-billed real users by ten to twenty times. The undelivered part, reasoning and cache, most of a modern bill, is unverifiable by anyone, and the right move is to say so, loudly, with a number.

The product is not "we check the bill". It is a measurement of how much of the bill nobody can check and how small the sliver that can be. For a five-second video that sliver is a six-figure token count that at least ties to the file. For a reasoning call the sliver is almost nothing and the honest output is the size of the dark.
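
As a rough sketch of that measurement: take the billed token counts per bucket, mark which buckets came with a delivered artifact, and report the checkable fraction. The function name and the example bill are hypothetical. The split below is invented to show the shape, not taken from a real invoice.

```python
def verifiable_share(billed: dict[str, int], delivered: set[str]) -> float:
    """Fraction of billed tokens that can be tied to a delivered artifact."""
    total = sum(billed.values())
    if total == 0:
        return 0.0
    checkable = sum(n for name, n in billed.items() if name in delivered)
    return checkable / total

# Hypothetical reasoning call: only the visible output was returned.
bill = {"output": 500, "reasoning": 9000, "cache": 500}
share = verifiable_share(bill, delivered={"output"})
print(f"checkable: {share:.0%}, dark: {1 - share:.0%}")   # checkable: 5%, dark: 95%
```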

The discipline is admitting where verification ends

The thing that saved this project is the thing that built it. An adversarial pass that kills the claims which do not survive. It killed three of ninety-seven claims in the literature. It should have been run on the headline before anyone got attached to gap zero. When a one-sentence question can dismantle the strongest demo, the demo was the problem.

It is also the house style. The app-store work is outside-in. Read anyone's public reviews, cooperate with no one, and treat the only honest trend source as the only honest trend source. This is the same philosophy aimed at billing. Read the artifact that was handed over, cooperate with no one and be ruthless about the line between what can be known and what is being asked on trust. The discipline is not the verification. The discipline is admitting where verification ends.

The AI meter is unsigned and a large part of it reads in the dark. The useful move is not to pretend the dark can be read. It is to measure exactly how much of it there is.

TokenLedger is open source under Apache-2.0: pip install retoken. The known limitations, including the rule that a gap is a flag and never a verdict, are written up in the repository, because a tool about trust should hold itself to its own standard.

The App Store's silent giants: AI assistants reply to almost none of their reviewers

Neelagiri65 — Sun, 21 Jun 2026 15:22:46 +0000

An App Store rating looks like a verdict. It behaves more like a monument, built over years and slow to move. It says very little about how this month's users feel.

I took the 12 most-rated Productivity apps on the US App Store, 32 million ratings between them, and split the headline star into the two numbers it hides: how far recent sentiment has fallen below the lifetime average, and whether the developer replies when users complain.

How it is measured

Population truth. Lifetime ratings and the star histogram come from Apple's full ratings data, every rating an app has ever received.
Recent sentiment. A fixed window of the most recent reviews by date, so an app captured to a depth of thousands is not compared on a multi-year average against an app with a few hundred. Same window for everyone.
Developer response. Reply share and median latency over that recent window.
Complaints are bucketed with a rule-based taxonomy. It is a heuristic, not a trained classifier, and I treat it as one.
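
A minimal sketch of the recent-window metrics, under assumptions: the `Review` shape and the function name are hypothetical, and the "fixed window" is read here as the N most recent reviews by date. The real pipeline may define the window differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Optional

@dataclass
class Review:
    posted: datetime
    stars: int
    replied: Optional[datetime] = None   # time of the developer reply, if any

def recent_metrics(reviews: list[Review], window: int) -> dict:
    """Average stars, reply share and median reply latency over the newest reviews."""
    recent = sorted(reviews, key=lambda r: r.posted, reverse=True)[:window]
    if not recent:
        raise ValueError("no reviews in window")
    latencies = [r.replied - r.posted for r in recent if r.replied is not None]
    return {
        "recent_avg": mean(r.stars for r in recent),
        "reply_share": len(latencies) / len(recent),
        "median_latency": median(latencies) if latencies else None,
    }

t0 = datetime(2026, 6, 1)
sample = [
    Review(t0 - timedelta(days=400), 5),                           # outside the window
    Review(t0 - timedelta(days=3), 1, replied=t0 - timedelta(days=1)),
    Review(t0 - timedelta(days=2), 3),
]
print(recent_metrics(sample, window=2))
```

Using the same `window` for every app is what keeps a deeply captured app from being compared on a multi-year average against a shallow one.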

What turned up

The AI assistants now own this chart, and they reply to almost no one.

App	Lifetime	Recent	Reply share
ChatGPT	4.8	4.18	0%
Claude	4.7	3.06	0%
Grok	4.9	3.77	0%
Perplexity	4.8	3.60	0%
Google Gemini	4.7	3.65	13%
Dropbox	4.8	2.75	58%
Gmail	4.7	2.40	26%
Google Drive	4.8	3.90	23%
Microsoft Authenticator	4.7	2.18	1%

The older tools are the ones still in the trenches: Dropbox answers 58% of recent reviewers, Gmail 26%, Drive 23%. The steepest recent drops belong to Microsoft Authenticator (4.7 to 2.18), Gmail (4.7 to 2.40) and Dropbox (4.8 to 2.75).

Plotted on two axes, backlash against response, every app falls into one of four archetypes: Firefighters, Ghost Ships, Complacent Giants and Resilient Leaders. Eight of the twelve are Ghost Ships, taking a recent hit in near silence.
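
The two-axis split can be sketched as a small classifier. Treat this as an assumption-heavy illustration: both thresholds are hypothetical, and only the Ghost Ship quadrant (recent hit, near silence) is defined in this post. The mapping of the other three names to quadrants is my reading of the labels, so this will not necessarily reproduce the eight-of-twelve count.

```python
def archetype(lifetime: float, recent: float, reply_share: float,
              drop_threshold: float = 0.5, reply_threshold: float = 0.25) -> str:
    """Place an app in one of four quadrants: backlash against response."""
    backlash = (lifetime - recent) >= drop_threshold    # hypothetical cut-off
    responsive = reply_share >= reply_threshold         # hypothetical cut-off
    if backlash and responsive:
        return "Firefighter"         # assumed quadrant
    if backlash:
        return "Ghost Ship"          # recent hit, near silence (as described above)
    if responsive:
        return "Resilient Leader"    # assumed quadrant
    return "Complacent Giant"        # assumed quadrant

# Figures from the table above, reply share as a fraction.
print(archetype(4.7, 3.06, 0.00))   # Ghost Ship
print(archetype(4.8, 2.75, 0.58))   # Firefighter
```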

The honest limits

Recent reviewers self-select toward the dissatisfied. A person who hits a bug is far more likely to leave a review than a contented one, so a low recent average blends genuine decline with that bias, and this data cannot cleanly separate the two. I tie no drop to a specific app release, because the version data is too sparse to support that claim. The lifetime figure is population truth; the recent figure is a biased sample; I never present one as the other.

The full interactive Friction Matrix, the per-app complaint archetypes, and the method in detail are here: https://nativerse-ventures.com/productivity-friction-matrix

Independent research from the Nativerse lab. Figures are public App Store data, cited, not invented.

The day a refactor passed on my laptop and failed on yours

Neelagiri65 — Sat, 13 Jun 2026 09:48:34 +0000

Most of the code being written right now is not being written. It is being
generated, glanced at, then merged. The reviewer is tired. The diff is large.
Increasingly the reviewer is itself a language model summarising the work of
another language model. Somewhere in that loop there is supposed to be a moment
where someone confirms the change did what it claimed. Often there isn't.

I wanted a small, boring tool to fill that gap. Take a function from before a
refactor and after. Run both on the same inputs. Tell me plainly whether the
behaviour changed. Not an opinion. Not a confidence score. A result I could
rerun next week to the same answer, byte for byte. If a teammate ran it on their
machine they should get my exact result, not something close.

That last sentence sounds trivial. It is the entire problem. This is the story of
where it broke and why the fix turned out to be the most important design decision
in the whole tool.

Why rerunning it is the only claim worth making

There is no shortage of tools that review your pull request. The newer ones are
language models with a nice interface. They are useful. They are also the same
kind of thing that wrote the code: a probabilistic system giving you its
impression. Ask the same one twice and you can get two different reviews. In a
world where a model wrote the diff, a model reviewing the diff is the same fallible
loop checking its own work.

So I did not want to add another opinion. I wanted a verdict with a property no
opinion has: you can reproduce it. Run the check. Get a result. That result is a
function of the inputs and nothing else. No wall clock. No network. No luck
particular to one machine. Same inputs in, same answer out, on any computer.

If you have that, you can sign it and hand it to someone who does not trust you.
They rerun it and confirm it themselves. The trust comes from reproduction, not
from my reputation or my model's confidence. That is the whole pitch. It only works
if the reproduction is real.

Where it broke: a function that returned a float

Early on the tool handled integers, strings, lists of integers. Clean, exact, the
same on every machine. Then I pointed it at a numerical function. A refactor of an
averaging routine. The kind of change an AI assistant produces ten times a day.

On my Mac the check said the two versions diverged on one input. On a Linux box in
CI it said they were identical. Same code. Same inputs. Two different verdicts.

This is the nightmare for a tool whose only selling point is reproducibility. A
verdict that depends on the machine is not a verdict. It is a rumour.

The cause is not a bug in my tool. It is the nature of floating point arithmetic.
It is worth understanding, since almost every "we test your AI code" tool will hit
it and most will quietly paper over it.

What IEEE 754 promises and what it does not

Floating point numbers follow a standard called IEEE 754. The standard is precise
about which operations are guaranteed to give the same answer everywhere. That
guarantee is narrower than people assume.

The basic operations are correctly rounded. Addition. Subtraction. Multiplication.
Division. Square root. The fused multiply add. Each is required to return the
single nearest representable result, every time, on every conforming machine. At
double precision with the default rounding mode these operations are identical bit
for bit whether you run them on an Apple chip or an Intel server. There is no
ambiguity. There is no luck of the platform. Two different expressions built only
from these operations will either agree on every machine or disagree on every
machine. The verdict never depends on where it ran.

The functions you reach for next are not covered. Sine. Cosine. Exponential.
Logarithm. Raising to a fractional power. For these the standard only recommends
correct rounding. It does not require it. The reason is a genuinely hard maths
problem, sometimes called the table maker's dilemma: computing the last bit
correctly for these functions can need enormous intermediate precision.
Implementations make different tradeoffs. The C maths library on macOS and the
one on Linux can legitimately return results that differ in the final bit.

That final bit is exactly what bit me. My averaging refactor touched a function
whose two versions agreed to the last bit under one maths library and disagreed
under another. Neither machine was wrong. The standard permits both. My tool was
trying to render a global verdict on a quantity that is, by design, local.

The decision: refuse what you cannot reproduce, by name

There were two tempting fixes. Both are traps.

The first is to round the results before comparing. Compare to twelve decimal
places and call it equal. This feels reasonable. It is not safe. A real difference
in the last bit can sit right on the rounding boundary. One machine rounds up. The
other rounds down. Rounding does not remove the disagreement. It hides it sometimes
and invents it other times. You have traded a guarantee for a coin flip.
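
A small sketch of that boundary case, using Python and needing version 3.9 or later
for math.nextafter. It builds two doubles one unit in the last place apart, standing
in for the two machines' results, and shows that rounding to twelve places sends them
to different values. The boundary value and the function name are chosen for
illustration only.

```python
import math
from decimal import Decimal

# Exactly halfway between two twelve-decimal values; not representable as a double.
BOUNDARY = Decimal("1.0000000000005")

def straddling_pair() -> tuple[float, float]:
    """Two adjacent doubles that sit on opposite sides of the rounding boundary."""
    x = float(BOUNDARY)                          # nearest double to the boundary
    # Step one unit in the last place toward the other side of the boundary.
    toward = 0.0 if Decimal(x) > BOUNDARY else 2.0
    return x, math.nextafter(x, toward)

x, y = straddling_pair()
print(abs(x - y))                        # the two inputs differ only in the final bit
print(round(x, 12) == round(y, 12))      # False: rounding made the gap bigger
```

The difference did not disappear under rounding. It grew from one bit to a visible
digit.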

The second is to compare with a tolerance. Equal if within some epsilon. Now your
tool no longer answers the question it was asked. "Did this refactor preserve the
behaviour" has quietly become "is the new behaviour close enough for my taste." For
a tool whose only asset is a precise reproducible verdict, that is the asset gone.

The fix that actually holds is less clever and more honest. The tool admits a
floating point function only when its computation stays inside the correctly
rounded operations. Those are reproducible across machines, since the standard
makes them so. The moment a function reaches for a transcendental, the tool does
not guess and does not round. It refuses, by name. It says so:

clamp_average  REFUSED  depends on a platform-variable transcendental (math.exp);
                        a cross-host reproducible verdict is not possible here.
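
One way such a gate could work is a static scan of the function's source. This is a
sketch under assumptions, not equiv's actual mechanism, which the post does not
describe: the name list and the function name are hypothetical, and the scan only
catches calls written as math.<name>(...). It would miss `from math import exp` or a
fractional `**`.

```python
import ast

# Assumed list: math functions the standard does not require to be correctly rounded.
# math.sqrt is left out on purpose, since square root is correctly rounded.
PLATFORM_VARIABLE = {"sin", "cos", "tan", "exp", "log", "log2", "log10", "pow"}

def platform_variable_calls(source: str) -> list[str]:
    """Names of platform-variable math calls found in the given source code."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "math"
                and node.func.attr in PLATFORM_VARIABLE):
            found.add(f"math.{node.func.attr}")
    return sorted(found)

src = "import math\ndef clamp_average(xs):\n    return math.exp(sum(xs) / len(xs))\n"
calls = platform_variable_calls(src)
print("REFUSED" if calls else "ADMITTED", calls)   # REFUSED ['math.exp']
```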

Agreement across machines comes from restriction, not from cleverness. Inside the
admissible set the raw bits are already identical everywhere. The tool records the
result as its exact bit pattern, with no rounding and no massaging. A NaN is
normalised to a single canonical form. A NaN payload is not observable behaviour.
The sign of a zero is preserved exactly. The sign of a zero is observable: dividing
a positive number by positive zero and by negative zero gives positive and negative
infinity. The details matter. The rule behind all of them is one sentence.

A value is admissible only if the verdict it produces is identical on every
machine. Everything else is refused, out loud.
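
Recording a result as its exact bit pattern can be sketched with the standard struct
module. The function name and the particular canonical NaN pattern are my
assumptions. The post only says NaN is collapsed to a single form and the sign of
zero is kept.

```python
import math
import struct

CANONICAL_NAN = 0x7FF8000000000000   # assumed choice: quiet NaN with an empty payload

def canonical_bits(x: float) -> int:
    """The 64-bit IEEE 754 pattern of x, with every NaN collapsed to one form."""
    if math.isnan(x):
        return CANONICAL_NAN
    # Big-endian pack so the integer does not depend on the host byte order.
    return struct.unpack(">Q", struct.pack(">d", x))[0]

print(hex(canonical_bits(0.0)))     # 0x0
print(hex(canonical_bits(-0.0)))    # 0x8000000000000000
print(canonical_bits(0.1 + 0.2) == canonical_bits(0.3))   # False
```

Comparing these integers is exact equality on the bits, with no epsilon anywhere.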

Why refusing is a feature, not a weakness

It is uncomfortable to ship a tool that says "I will not judge this." The instinct
is to maximise coverage so the tool looks capable. That instinct is how you end up
with a tool that confidently lies a small fraction of the time, which is worse than
useless for anything you would actually rely on.

The refusal is the thing that makes every other answer trustworthy. When the tool
says two versions are equivalent, it is staking that claim on a verdict it can
reproduce anywhere. When it cannot make that promise it tells you. Then you reach
for a human or a different technique. You are never handed a green light that was
really a shrug.

This is the opposite of the marketing reflex, which is to claim more. The claim
here is deliberately small and completely solid: these specific behaviours were
checked on these specific inputs, the result reproduces to identical bytes
everywhere, here is everything I declined to check. Small and true beats broad and
shaky. That is true above all for the one job where you are trying to replace a
rubber stamp with something you can stand behind.

What this is, plainly

The tool is called equiv. It runs a changed function and its previous version on
the same generated inputs. It reports whether they diverged, with the exact input
that broke them when they do. It produces a signed receipt of what was checked,
addressed by its content, which anyone can rerun to the same bytes. It is not a
prover. It is bounded testing: a pass means no divergence was found on the inputs
it tried, not that none exists. It says so. It checks mechanical behaviour, never
intent or architecture. It tells you that too.
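
The loop described above can be sketched in a few lines. This is a simplified
illustration, not equiv's implementation: the function names, the input generator and
the receipt fields are hypothetical, exceptions are not handled, and the seeded
generator is assumed to be run on the same Python version everywhere.

```python
import hashlib
import json
import random
from typing import Callable, Optional

def generate_inputs(seed: int, count: int) -> list[list[int]]:
    """Seeded input generation, so a rerun with the same seed sees the same inputs."""
    rng = random.Random(seed)
    return [[rng.randint(-100, 100) for _ in range(rng.randint(1, 5))]
            for _ in range(count)]

def first_divergence(old: Callable, new: Callable, inputs: list) -> Optional[list]:
    """Return the first input on which the two versions disagree, else None."""
    for args in inputs:
        if old(args) != new(args):
            return args
    return None   # no divergence found on these inputs; not a proof of equivalence

def receipt_id(seed: int, count: int, witness: Optional[list]) -> str:
    """Content address: hash of a canonical JSON description of the check."""
    body = json.dumps({"seed": seed, "count": count, "witness": witness},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

old_avg = lambda xs: sum(xs) // len(xs)        # floors toward minus infinity
new_avg = lambda xs: int(sum(xs) / len(xs))    # truncates toward zero
witness = first_divergence(old_avg, new_avg, generate_inputs(seed=7, count=200))
print(witness, receipt_id(7, 200, witness)[:12])
```

The two averaging routines agree on every non-negative input and part ways on
negative sums with a remainder, the kind of change that reads as harmless in a diff.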

That is the honest shape of it. In a field full of tools that review your code by
having a model form an impression, the contribution here is not intelligence. It is
the refusal to pretend. A verdict you can reproduce. A clear list of what was not
checked. A flat "no" whenever a yes would not survive being run on a different
machine.

The hard part was never generating inputs or comparing outputs. It was deciding,
before writing the code, exactly which questions the tool is allowed to answer with
certainty, then being willing to say nothing about the rest.

equiv is open source under the Apache 2.0 licence and runs as a GitHub Action:
github.com/Neelagiri65/equiv. If you work on numerical or cross-language
equivalence and I have got a detail wrong, I would genuinely like to hear it.

Built at Nativerse Ventures.

Bharataddress v0.2 — The Complete Open Source Indian Address Toolkit

Neelagiri65 — Mon, 06 Apr 2026 23:21:10 +0000

Built a Python toolkit for Indian addresses. 26,700+ pincodes, no standard format, landmarks instead of street names, multiple scripts. The usual chaos.

bharataddress handles parsing, formatting, validation, geocoding, address similarity, batch processing and DIGIPIN encoding. All offline. No API keys. No ML. 4.3MB total.

62.5% exact match on a public 200-address gold set. Tested head to head against Shiprocket's 760MB TinyBERT NER model on the same test set. bharataddress wins on 6 of 9 fields. Fully reproducible.

What you get:

parse() turns messy address strings into structured JSON
geocode() gives you lat/lng from pincode centroids for 16,400+ pincodes
encode_digipin() generates India Post's new 10-char geo-code
format() outputs India Post / single-line / shipping label styles
validate() checks consistency and flags whether an address is deliverable
address_similarity() gives you a 0-1 score for dedup
parse_csv() and parse_dataframe() for bulk processing
extract_state_from_gstin() pulls state from GST numbers

pip install bharataddress
https://github.com/Neelagiri65/bharataddress

100 tests. MIT licensed. First open-source Indian address parser with DIGIPIN support.