DEV Community: Vikrant Shukla

Where AI Coding Is Actually Headed (Not the Hype Version)

Vikrant Shukla — Sun, 17 May 2026 03:30:00 +0000

The future of AI coding gets discussed in two registers: the utopian one where AI writes all the code and developers are free to think about "higher-level problems," and the dismissive one where AI is just autocomplete that gets things wrong at the worst moments.

Both are wrong in similar ways. Here is what I actually think is happening, based on watching production teams work with these tools for the last couple of years.

We are still in the autocomplete phase, but it's ending

The dominant paradigm right now is still assistant-style generation: a developer writes intent, the model writes implementation, the developer accepts or revises. This is useful and already changes productivity measurably — but it's not structurally different from any other productivity tool. It makes the current process faster, not different.

What's shifting is the scope of what "a task" means. Tools that started as single-function generators now complete work across multiple files, run tests, read error output, and iterate on their own suggestions. This is qualitatively different: not faster autocomplete, but a different relationship to the work unit.

The next phase is CI-native agents

The most consequential change coming in the near term is not chat-based coding assistance. It is agents embedded directly in the continuous integration pipeline.

Right now, CI is where code goes to be verified. In 18–36 months, CI will increasingly be where code goes to be partially written — where an agent ingests a failing test, proposes a fix, runs validation, and opens a PR if the fix passes. A developer reviews and merges.
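The loop described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a description of any particular product: `propose_fix`, `run_tests`, and `open_pr` are hypothetical callables that you would back with your own model client, test runner, and code host, and the retry limit of three is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ValidationResult:
    passed: bool
    output: str  # raw test-runner output, fed back to the agent on failure


def ci_fix_loop(
    failing_output: str,
    propose_fix: Callable[[str], str],             # hypothetical: error output -> patch
    run_tests: Callable[[str], ValidationResult],  # hypothetical: patch -> result
    open_pr: Callable[[str], str],                 # hypothetical: patch -> PR reference
    max_attempts: int = 3,
) -> Optional[str]:
    """Try up to max_attempts patches; open a PR only for one that passes."""
    feedback = failing_output
    for _ in range(max_attempts):
        patch = propose_fix(feedback)
        result = run_tests(patch)
        if result.passed:
            return open_pr(patch)  # a human still reviews and merges
        feedback = result.output   # iterate on the new failure output
    return None  # give up and leave the failure for a developer
```

The important design choice is that the agent never merges: the only success path ends in a PR, and running out of attempts hands the failure back to a person.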

This is not a far-future speculation. It's already deployed in some form at companies running large monorepos with well-defined test contracts. The technical prerequisites — deterministic test suites, clean dependency graphs, reliable build environments — are the same prerequisites for high-quality code at scale generally. Teams that invest in test quality now are building infrastructure that agent-assisted CI will amplify.

Boilerplate is largely solved

For anything with a well-defined schema (REST API clients, data models, ORM queries, configuration parsers, serialisation code, test fixtures), AI generation is already faster and often better than human-written code. The domain is regular, the patterns are well established, and the failure modes are obvious enough to catch in review, which leaves little for hand-writing to add.

This is not "AI will take developer jobs." It is "a significant fraction of what junior developers spend time on is going to be automated, and that will change the shape of what junior developers are for."

What replaces the boilerplate time is unclear. The optimistic view is that engineers get to spend more time on architecture, product thinking, and edge-case hardening. The realistic view is that velocity expectations will rise to absorb the freed capacity, and some of the judgment development that comes from writing the boilerplate will need to come from somewhere else.

The shrinking value of syntax knowledge, the rising value of specification

One of the clearest patterns I see is the declining marginal value of knowing language syntax and standard library APIs precisely. If you can describe what you want clearly and precisely, you can get working code in languages you don't fluently write. This is genuinely new.

What rises in value is the ability to specify intent precisely: to write test contracts that capture the real requirements, to describe edge cases clearly enough that they can be verified, to recognise when generated code is plausible-but-wrong because you have a clear mental model of what correct looks like.

The developers who are most effective with AI tools, in my observation, are not the ones who are best at prompting. They are the ones with the strongest models of what the code needs to do — which is also what made them good developers before.

Where human judgment remains the scarce resource

A few domains remain genuinely hard for current AI tools, not because the capability isn't there in principle, but because the problem is underspecified or the signal is weak:

Security review. Not automated scanning — that's increasingly fine — but the judgment about whether a system's threat model is correctly framed, whether the trust boundaries make sense, whether the authentication flow has subtle flaws in its assumptions.

Distributed systems correctness. Reasoning about concurrent, partially-failing systems at the design level. The AI can generate an implementation of your consensus algorithm. It cannot tell you whether your consensus algorithm is the right choice for your failure model.

Domain translation. Taking a messy, ambiguous real-world problem and converting it into a well-defined computational problem. This is the hardest part of software engineering and the part where AI assistance is currently least useful.

These are the domains worth investing in. Not because AI won't eventually make progress there — it will — but because the timeline is longer and the human advantage is currently most durable.

The future of AI coding is not autonomous systems writing all software while developers sip coffee and approve PRs. It is a material shift in where human judgment is applied, a compression of the time between intent and implementation, and a significant change in the skill mix that makes an engineer effective.

That's already happening. The interesting question is whether your team's practices are evolving at the same rate as the tools.

How Companies Actually Get Tiered Token Pricing (and Why You Probably Can't Yet)

Vikrant Shukla — Sat, 16 May 2026 03:30:00 +0000

Everyone in developer tooling knows that the published price isn't the only price. But the mechanics of how you get from "paying retail" to "paying something better" are rarely explained clearly.

This is an attempt to do that.

The public tier is real, and it's fine for most teams

Let me start honestly: for most teams spending under $50–80k/year on LLM API costs, the public pricing tier is adequate. The margins built into it are significant, but the absolute cost at low volumes is low enough that optimising it is not the highest-leverage work you can do.

Where it starts to matter is when you're scaling a production system, when LLM cost is a meaningful line in your operating budget, or when you're building a product where per-call economics affect your own pricing.

The mechanisms that actually unlock better pricing

Volume commitments with annual contracts. The single most straightforward path. Moving from pay-as-you-go to a contractual annual commitment — with a floor on spend — shifts the conversation from "customer" to "account." The discount on committed spend varies by provider and by the size of the commitment, but it exists, it's negotiable, and it starts at lower spend levels than most people assume.

The key insight is that providers are optimising for two things: revenue predictability and capacity planning. A customer who commits to $200k/year is worth considerably more to them, in planning terms, than a customer who might spend $200k or might spend $20k. That predictability has real value to the provider, and some of it can be captured by the customer.
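The trade-off behind a commitment can be made concrete with a little arithmetic. This is a sketch with made-up numbers: the $200k floor and the 15% discount are hypothetical, and it assumes the floor applies to discounted spend, which will differ between contracts.

```python
def committed_cost(usage_at_list_price: float, floor: float, discount: float) -> float:
    """Annual cost under a commitment: discounted usage, but never below the floor."""
    return max(floor, usage_at_list_price * (1.0 - discount))


def annual_saving(usage_at_list_price: float, floor: float, discount: float) -> float:
    """Positive when the contract beats pay-as-you-go, negative when it costs more."""
    return usage_at_list_price - committed_cost(usage_at_list_price, floor, discount)


# Hypothetical contract: $200k floor, 15% off list price.
savings = {
    usage: round(annual_saving(usage, floor=200_000, discount=0.15))
    for usage in (150_000, 200_000, 300_000)
}
```

Under these assumptions the contract costs $50k more than pay-as-you-go at $150k of usage, breaks even at $200k, and saves $45k at $300k. The risk the provider sheds by getting the floor is the risk you take on if your usage forecast is wrong.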

Reserved capacity tiers. Separate from pricing, some providers offer reserved capacity — guaranteed throughput and latency SLAs — at a premium. For teams where the LLM is in the critical path of a user-facing product, this is often worth paying for independently of any discount discussion. But it's also a negotiating lever: committing to reserved capacity is a strong signal of production intent, which improves your position in the pricing conversation.

Benchmark and evaluation partnerships. This is less commonly discussed. Some providers actively want customers who are building domain-specific evaluation suites — test sets that assess model performance on real-world tasks in their vertical. If you have a well-defined problem domain and a meaningful evaluation set, approaching a provider as a benchmark partner (rather than just a customer) can open a different kind of conversation. You are offering them something they genuinely need: signal on where the model performs well and where it doesn't, in your specific context.

Reference customer arrangements. The marketing and sales value of a credible reference customer — a company willing to be named publicly, discuss their use case, and appear in case study material — is real. Providers, especially in the 12–24 months after launching an enterprise-facing product, actively want these relationships. If you're willing to participate (and have governance approval to do so), it's worth raising in procurement discussions.

Cloud provider bundling. If your primary infrastructure is on Azure, Google Cloud, or AWS, there are often mechanisms to apply API credits or cloud commit spend to LLM API costs. These arrangements are typically structured at the cloud provider level rather than directly with the model provider, but the economics can be meaningful if you already have large cloud commitments.

What doesn't work

Asking for a discount without any of the above. A sales team has no mechanism to offer reduced pricing to a small pay-as-you-go account without some quid pro quo: commitment, volume, or strategic value. A cold email asking for "startup pricing" or "special rates" without substance behind it will be declined.

Also: waiting until you're already at high spend. The time to negotiate is before you've locked in your architecture and before the provider knows you have no viable alternative. Negotiate early, when you have options.

The practical implication for small teams

If you're a small team currently at or approaching $8–10k/month in API spend, you're at the threshold where these conversations start to be worth having. Not because the absolute savings are transformational, but because the relationship and the contract structure you establish now will shape your cost trajectory as you scale.

The first step is simple: contact your provider's sales team and ask directly what the contracted pricing tiers look like at your current and projected usage level. The worst outcome is you learn there's nothing available at your scale. The better outcome is you get a number that's worth committing to.

Beyond Pay-Per-Token: How Enterprises Barter Architecture for AI Access

Vikrant Shukla — Fri, 15 May 2026 03:30:00 +0000

The public pricing page is what most developers see: input tokens, output tokens, price per million, maybe a cached-input tier. A clean table with a currency symbol.

What the table doesn't show is the deals that don't fit in it.

At the highest level of enterprise AI procurement, some of the most interesting commercial relationships don't look like token purchases at all. They look more like barter — except what's being exchanged is architectural commitment rather than currency.

What gets traded instead of tokens

Large organisations have several things model providers want badly: data, compute, distribution, and validation.

Data partnerships. A healthcare network with ten years of de-identified clinical records, a financial institution with transaction pattern data, a logistics company with real-world routing optimisation problems — these are extraordinarily valuable for training and fine-tuning. A provider who gets access to a curated, domain-specific dataset in exchange for preferential API access is trading token capacity for something they couldn't otherwise easily acquire.

Reserved compute commitments. The hyperscale cloud providers — Microsoft Azure, Google Cloud, AWS — have deeply intertwined relationships with the frontier model labs. Enterprise customers who commit to significant cloud infrastructure spend often find that their AI API costs are structured very differently. The "token price" becomes almost a rounding error against a larger infrastructure negotiation.

Reference architecture partnerships. Being a named reference customer — agreeing to publish a case study, participate in an advisory council, demo the integration at a conference — has real value to a provider building enterprise credibility. It doesn't show up as a line item on your invoice, but it influences the price that goes on the invoice.

Co-development agreements. Some enterprises are not just customers but contributors: funding fine-tuning runs, contributing domain-specific evaluation benchmarks, co-authoring research on enterprise use cases. These arrangements blur the line between customer and partner in ways that make the "per-token" framing essentially meaningless.

Why this matters for everyone, not just enterprise procurement

If you are a small team or independent developer, this isn't just a trivia fact about how the other half lives. It has practical implications.

The pricing you see is calibrated to a market that includes these high-level deals. The public tier pricing reflects a cost structure that is partly subsidised by the strategic value the provider gets from its largest relationships. In a world where the biggest customers are trading architecture for access, the marginal economics of serving the long tail of pay-as-you-go developers are shaped by that dynamic.

It also means the pricing tier you're on is not the floor. There is a below-published-pricing tier — it's just not accessible via a self-serve signup flow.

What triggers access to better pricing

This is worth being specific about, because the answer is actionable even at a mid-sized organisation:

Volume commitment with a contract is the first unlock. Moving from pay-as-you-go to a committed annual spend — even at levels that feel modest compared to enterprise deals — typically moves you into a negotiable tier. The number is usually somewhere north of $100k/year in API spend, but it varies.

Workload visibility is the second factor. Providers want to understand what you're building. A one-page technical summary of your use case, expected traffic patterns, and growth trajectory is often the difference between the published tier and a conversation about a custom arrangement. They're not being charitable — they want to understand the shape of your future revenue and whether you're a reference customer they'd want.

Production deployment signals seriousness. A proof-of-concept customer is worth less to a provider than a production customer. If you can demonstrate that you are building something real and intend to scale it, the conversation changes.

The structural implication

We are in an early period for AI infrastructure pricing, and the structures are not yet settled. The token-as-unit-of-account model made sense in the early API phase. As the market matures, I'd expect the pricing models to evolve in the direction of the patterns already visible in the top tier: capacity reservations, tiered service agreements, outcome-based pricing in some verticals, and bundling with adjacent cloud services.

The developers who understand this now — who know that "the public price is not the only price" and know how to have that procurement conversation — will be at a structural cost advantage as their usage scales.

Microsoft Says 50% AI Code. Here's What That Actually Means for Engineers

Vikrant Shukla — Thu, 14 May 2026 03:30:00 +0000

I was in a conversation recently with a senior engineering leader at Microsoft. He mentioned, almost in passing, that their development teams now carry an internal target: generate 50% of production code using AI tools.

Not a stretch goal. A target.

I've been thinking about what that actually means for the working engineer — not the headline version, but the granular, day-to-day reality of what changes and what doesn't.

What the 50% number is measuring

First, let's be precise about what "50% AI-generated code" means, because the metric itself is underspecified.

Is it 50% of lines committed? 50% of characters? 50% of files touched in a sprint? Does a human-edited AI suggestion count as 100% AI, 50% AI, or 0%?

At most orgs tracking this, "AI-generated" means code that was initially produced by a copilot or agent tool and accepted by the engineer — even if subsequently edited. On that definition, the number at many teams is already 30–40% for greenfield work. Getting to 50% is less a technological leap than a cultural one: making it the default path rather than the optional tool.
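How much the definition matters is easy to show with a toy calculation. This is a sketch only: the `Change` record and the sprint data are hypothetical, and real tooling would have to infer origin and edits from editor telemetry or commit metadata.

```python
from dataclasses import dataclass


@dataclass
class Change:
    lines: int
    origin: str        # "ai" if a tool produced the first draft, else "human"
    edited_lines: int  # lines the engineer rewrote after accepting (0 for human code)


def ai_share_accepted(changes: list[Change]) -> float:
    """Share of lines whose first draft came from a tool, later edits ignored."""
    total = sum(c.lines for c in changes)
    ai = sum(c.lines for c in changes if c.origin == "ai")
    return ai / total if total else 0.0


def ai_share_surviving(changes: list[Change]) -> float:
    """Stricter share: only AI lines the engineer left untouched."""
    total = sum(c.lines for c in changes)
    ai = sum(c.lines - c.edited_lines for c in changes if c.origin == "ai")
    return ai / total if total else 0.0


# Hypothetical sprint: one heavily edited AI change, one human change, one clean AI change.
sprint = [Change(120, "ai", 60), Change(80, "human", 0), Change(40, "ai", 0)]
accepted = ai_share_accepted(sprint)    # 160 of 240 lines
surviving = ai_share_surviving(sprint)  # 100 of 240 lines
```

The same sprint reads as roughly 67% AI-generated under the "accepted" definition and roughly 42% under the "surviving" one, so a team can sit on either side of a 50% target depending on which definition is used.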

What actually changes at 50%

Code review changes shape, and if anything the burden increases. AI-generated code tends to be syntactically correct and structurally conventional, which makes reviewers lower their guard. The bugs that slip through are not the obvious ones; they are the subtle semantic ones: the hidden invariant violations, the race conditions that only appear at scale. Teams that hit 50% without updating their review practices will ship more of these.

The senior/junior dynamic changes. Junior developers using AI tools can produce code at a velocity that previously required years of pattern recognition. But velocity without judgment is dangerous. The role of the senior engineer is no longer primarily to write code faster — it is to provide the judgment layer that AI cannot: understanding what invariants the system is actually maintaining, catching the plausible-but-wrong output, knowing when the generated code misunderstands the domain.

Accountability attribution gets murky. When a bug ships in AI-generated code, who owns it? The engineer who accepted it? The team lead who approved it in review? The org that set the 50% target? This is not an abstract question — it will show up in post-mortems, in performance reviews, and eventually in liability discussions. Orgs haven't caught up with the governance implications.

Documentation pressure increases. AI tools generate code from context. The less context your codebase has — through comments, docstrings, README files, meaningful variable names — the worse the generated output. Teams chasing the 50% target without investing in documentation quality will find the AI generating plausible code that increasingly doesn't fit the actual system.

What doesn't change

The actual hard parts of software engineering remain human. Deciding what to build. Understanding the user's real problem, which is usually not the stated problem. Navigating the organisational constraints that shape what's technically possible. Making judgment calls under uncertainty with incomplete information.

These are not prompt-engineering challenges. No amount of AI code generation changes them.

The thing worth worrying about

The target creates an incentive structure. If teams are measured on "% of code AI-generated," engineers will find ways to hit the number — including accepting AI output they'd otherwise revise, or framing human-written code as AI-assisted.

Metrics shape behaviour. The 50% target is a reasonable directional signal but a poor KPI. The better question is: what are the defect rate, cycle time, and system reliability of the code being shipped, regardless of how it was produced?

The exec I spoke to understood this. His point was not "AI should write half your code." His point was "if your team isn't regularly using AI as a core part of the workflow, that's a capability gap we need to close."

That's a more nuanced position than the headline number suggests — and it's the right one to hold onto as these targets propagate across the industry.

The LLM Code Bugs Nobody Talks About

Vikrant Shukla — Wed, 13 May 2026 03:30:00 +0000

Every developer I know has a story about AI-generated code that looked completely right and was completely wrong. Not "wrong" in an obvious way — wrong in the way that costs you a Tuesday.

After shipping production systems where AI wrote a meaningful portion of the codebase, here are the failure modes I've stopped being surprised by.

1. Hallucinated parameters that pass linting

The model confidently reaches for pandas.DataFrame.to_parquet(engine='pyarrow', schema_evolution=True). That parameter does not exist. The code passes a static linter because the import resolves and to_parquet accepts arbitrary extra keyword arguments, which it forwards to the engine. It fails at runtime, in production, on the one path you didn't test.

Why it happens: the model has seen thousands of DataFrame snippets and infers plausible-sounding parameters from patterns across libraries. It doesn't distinguish "I've seen this exact call" from "this feels right given everything I've read."

Fix: for any library call involving optional parameters, run the generated code against the actual library docs or a real interpreter before committing. Don't trust the linter alone.
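One cheap check along these lines is to compare the keyword arguments in a generated call against the function's explicit parameters. This is a minimal sketch: `unverified_kwargs` is a hypothetical helper, and it uses `json.dumps` from the standard library as a stand-in because, like `to_parquet`, it has a catch-all for extra keywords.

```python
import inspect
import json


def unverified_kwargs(func, kwargs: dict) -> list[str]:
    """Return keyword names that are not explicit parameters of func.

    A function with a **kwargs catch-all accepts these names silently,
    which is why a linter or a plain signature check does not flag them.
    """
    params = inspect.signature(func).parameters
    explicit = {
        name
        for name, p in params.items()
        if p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
    }
    return sorted(name for name in kwargs if name not in explicit)


# json.dumps has a real `sort_keys` parameter; `sort_key` is a plausible
# near-miss that its **kw catch-all absorbs until the call actually runs.
suspicious = unverified_kwargs(json.dumps, {"indent": 2, "sort_key": True})
```

A flagged name is not automatically wrong, since some libraries legitimately forward extra keywords to a backend. It is a prompt to look that name up in the documentation of whatever receives it, or to run the call once in a real interpreter.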

2. Confident wrong refactoring

You ask the model to "refactor this function to be more readable." It hands back something cleaner. You merge it. Three weeks later, a subtle change in variable scope or early-return logic breaks a downstream assertion nobody noticed.

The model optimised for the appearance of correctness — shorter, more idiomatic code — without tracking all the invariants the original author was quietly maintaining.

Fix: treat AI refactors like any external PR. Require a diff review and a passing run of the existing test suite, not just a visual scan.

3. Stale context poisoning

Long sessions are particularly dangerous. By message 40, the model's working context has accumulated contradictory instructions, out-of-date schema definitions, and revised requirements that weren't cleanly updated. It starts synthesising code from a model of your codebase that no longer matches reality.

This is different from hallucination. The model is using your instructions — just the version from 30 messages ago.

Fix: for complex tasks, periodically start a fresh context with a clean system prompt rather than accumulating state across a single long session. Treat context windows like short-term memory, not a persistent source of truth.

4. Non-determinism breaking CI

AI-generated tests that rely on set or hash-map iteration order, floating-point equality, or datetime.now() will pass locally and fail in CI at a rate that's maddening to debug. The model writes assertions that read as correct without accounting for environment-specific non-determinism.

Fix: ban == comparisons on floats and timestamps in any AI-generated test. Add a lint rule. Make the model explain why each assertion is deterministic before you accept it.
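The deterministic alternatives are short. This is a sketch assuming a hypothetical `is_expired` function under test; the point is the shape of the assertions, not the function itself.

```python
import math
from datetime import datetime, timedelta, timezone


def is_expired(issued_at: datetime, ttl: timedelta, now: datetime) -> bool:
    """Hypothetical function under test; the clock is passed in, not read inside."""
    return now - issued_at >= ttl


def test_float_sum_uses_tolerance():
    total = 0.1 + 0.2
    assert total != 0.3              # exact equality fails in binary floating point
    assert math.isclose(total, 0.3)  # tolerance-based comparison is stable


def test_expiry_with_fixed_clock():
    issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
    fixed_now = issued + timedelta(hours=1)  # no datetime.now() in the test
    assert is_expired(issued, timedelta(minutes=30), now=fixed_now)
    assert not is_expired(issued, timedelta(hours=2), now=fixed_now)


def test_tags_compare_in_sorted_order():
    tags = {"beta", "alpha"}
    assert sorted(tags) == ["alpha", "beta"]  # sort before comparing unordered data
```

Passing the clock in as a parameter is the design choice that does most of the work here: the test controls time instead of depending on when CI happens to run.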

5. The "working" code that doesn't handle scale

The model has seen plenty of sequential, synchronous Python. It has seen far less production async code under real concurrency. Generated async handlers frequently contain subtle race conditions — shared state accessed across coroutines without locks, cancellation paths that leave resources open, retry logic that doesn't account for partial writes.

The code works perfectly in your local test. Under 50 concurrent requests, it quietly corrupts state.

Fix: any AI-generated concurrency code gets a mandatory review against the asyncio documentation and your race condition checklist before merge. If you don't have a checklist, make one.
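The simplest version of this bug fits in a few lines. The sketch below is an illustration, not code from a real system: a counter in a shared dict stands in for application state, and `asyncio.sleep(0)` stands in for any awaited I/O between the read and the write.

```python
import asyncio


async def unsafe_increment(state: dict) -> None:
    current = state["count"]      # read shared state
    await asyncio.sleep(0)        # yield to the event loop (stands in for I/O)
    state["count"] = current + 1  # write back a value that may be stale


async def safe_increment(state: dict, lock: asyncio.Lock) -> None:
    async with lock:              # the read-await-write sequence is now exclusive
        current = state["count"]
        await asyncio.sleep(0)
        state["count"] = current + 1


async def run(n: int) -> tuple[int, int]:
    unsafe = {"count": 0}
    await asyncio.gather(*(unsafe_increment(unsafe) for _ in range(n)))

    safe = {"count": 0}
    lock = asyncio.Lock()
    await asyncio.gather(*(safe_increment(safe, lock) for _ in range(n)))
    return unsafe["count"], safe["count"]


unsafe_total, safe_total = asyncio.run(run(50))
```

With 50 concurrent coroutines the locked version reaches 50, while the unlocked version ends well short of it because every coroutine reads the counter before any of them writes it back. A single sequential call to `unsafe_increment` behaves correctly, which is exactly why this class of bug survives a local test.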

The meta-bug

All of these share a root cause: the model is optimising for plausibility, not correctness. Code that looks like correct code scores well in training. Code that subtly fails under edge conditions is hard to distinguish from working code in text form.

This doesn't mean AI coding is broken. It means the review practices most teams inherited from the "all code is human-written" era are the wrong shape for "30–50% of this was generated."

Your reviewers need to be looking for a different class of bug. The obvious ones, the model largely gets right. It's the plausible-but-wrong — the confident hallucination, the silent invariant violation, the context-poisoned refactor — that will cost you time until you build specific habits around catching them.

What patterns have you hit that aren't on this list? I'd genuinely like to know.

The Softmax Bottleneck: Why Making LLMs Bigger Doesn't Always Make Them Smarter

Vikrant Shukla — Tue, 12 May 2026 03:30:00 +0000

When researchers scale a language model — more parameters, more layers, wider hidden dimensions — there's an implicit assumption: a bigger model can represent more things. More expressiveness, more knowledge, better predictions. Mostly this is true. But there's a structural ceiling that scaling alone can't raise, and it sits right at the final layer of the network. It's called the softmax bottleneck.

Understanding it explains why some models hit a performance wall that raw compute can't fix, and why certain architectural choices (mixture of experts, output factorisation, mixture of softmaxes) exist beyond just increasing model size.

What the Softmax Bottleneck Actually Is

At the final step of a language model, you need to produce a probability distribution over every token in the vocabulary — typically 30,000 to 200,000 tokens. The model does this by taking the hidden state vector h (dimension d), multiplying by an output embedding matrix W (shape d × V, where V is vocabulary size), and applying softmax.

The problem: stacked across all contexts, the logits produced by that matrix multiplication form a matrix of rank at most d. If the "true" next-token distribution you're trying to approximate requires higher effective rank than d allows, you can't represent it, no matter how well you've trained.

Formally: the log-probability matrix log P(x_{t+1} | context) across all contexts and all tokens has a certain rank in the ideal case. If that rank exceeds the hidden dimension d, the softmax layer is a bottleneck. The model is being asked to express a high-rank function through a low-rank projection.
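The rank argument can be written out in a few lines. The symbols N (number of contexts), H, A, and Z are introduced here for illustration; d, V, W, and h are as defined above. Strictly, the normalisation term adds one to the bound, which does not change the conclusion.

```latex
\begin{aligned}
H &\in \mathbb{R}^{N \times d} \ \text{(one hidden state } h_c^{\top} \text{ per row)}, \qquad W \in \mathbb{R}^{d \times V}, \\
A &= H W \in \mathbb{R}^{N \times V}, \qquad \operatorname{rank}(A) \le \min(N, d, V) \le d, \\
\log P(x \mid c) &= A_{c,x} - \log Z_c, \qquad Z_c = \sum_{x'=1}^{V} \exp\!\left(A_{c,x'}\right), \\
\log P &= A - (\log Z)\,\mathbf{1}^{\top} \ \Longrightarrow\ \operatorname{rank}(\log P) \le d + 1 .
\end{aligned}
```

Here \(\log Z \in \mathbb{R}^{N}\) collects the per-context normalisers and \(\mathbf{1} \in \mathbb{R}^{V}\) is the all-ones vector, so the correction term has rank one.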

Why This Shows Up in Practice

A vocabulary of 100,000 tokens has an enormous number of contextual distinctions. Consider how differently the word "bank" should distribute probability across next tokens when preceded by "river" vs. "financial" vs. "blood" vs. "memory". Across all possible preceding contexts, the full distribution matrix has potentially very high rank — each context creates a distinct probability distribution over the vocabulary, and those distributions may be nearly linearly independent of one another.

A model with hidden dimension d = 4096 can produce log-probability vectors that span at most d + 1 = 4097 linear dimensions (the d directions from the projection plus one for the normalisation term), regardless of the number of parameters in the model body. The transformer blocks can be arbitrarily deep and powerful; they eventually produce a d-dimensional vector, and that vector can only express a limited diversity of next-token distributions.

This was formalised in "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model" (Yang et al., 2017/2018), which showed empirically that even very large hidden dimensions were often insufficient for natural language, and that the rank constraint was genuinely binding.

How the Field Has Responded

Several architectural responses have emerged:

Mixture of Softmaxes (MoS): Instead of computing a single softmax, compute K softmax distributions from K different context vectors and mix them with learned, context-dependent weights. Because the mixing happens in probability space, the resulting log-probabilities are a nonlinear function of the hidden states and are no longer bounded by rank d. This is Yang et al.'s own proposed solution; it works, but adds output-layer cost roughly proportional to K.

Tied input/output embeddings: An interesting side effect: tying the input embedding matrix to the output matrix (a widely used trick that reduces parameter count) actually helps with the bottleneck in some configurations, because the input embeddings encode richer token-token relationships that the output projection then inherits.

Mixture of Experts (MoE): Different experts activate for different inputs, which increases the capacity of the model body. In the standard design, though, the final hidden state is still d-dimensional and passes through a single output projection, so MoE relaxes the rank constraint only if the output stage itself is routed or mixed (as in MoS). Treat it as a related capacity technique rather than a direct fix.

Larger hidden dimensions in the final layers: Some architectures deliberately widen the final few transformer blocks or use a different (wider) projection head, recognising that the bottleneck is sharpest at the output stage.
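To make the Mixture of Softmaxes item above concrete, here is a minimal pure-Python sketch of the output computation. It takes the K sets of component logits and the K mixture logits as given; how those are produced from the hidden state is left out, and the function names are illustrative rather than taken from any library.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_softmaxes(
    component_logits: list[list[float]],  # K rows, each with V logits
    mixture_logits: list[float],          # K logits for the mixing weights
) -> list[float]:
    """Mix K softmax distributions in probability space, not in logit space."""
    weights = softmax(mixture_logits)
    components = [softmax(row) for row in component_logits]
    vocab_size = len(components[0])
    return [
        sum(w * comp[v] for w, comp in zip(weights, components))
        for v in range(vocab_size)
    ]


probs = mixture_of_softmaxes(
    component_logits=[[2.0, 0.0, -1.0], [-1.0, 0.0, 2.0]],
    mixture_logits=[0.0, 0.0],
)
```

The result is still a valid distribution over the vocabulary, but its logarithm is no longer a linear function of any single hidden vector, which is what removes the rank bound. The cost is K softmax evaluations over the vocabulary instead of one.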

What This Means for Practitioners

If you're fine-tuning a base model and finding that validation loss plateaus at a value that seems unreasonably high for the task, the bottleneck may be architectural rather than a data or training issue. This is more likely to bite you when:

Your task requires fine-grained token-level discrimination across a large vocabulary (code generation, multilingual tasks)
You're working with a model whose hidden dimension is small relative to vocabulary size
You've added vocabulary tokens (domain-specific terms) without adjusting the output architecture

The fix is rarely "train longer." It's either increasing d, applying output factorisation, or accepting that the model has a structural ceiling on its token distribution expressiveness.

The Bigger Picture

The softmax bottleneck is a clean example of a class of architectural constraints that don't show up in parameter counts or FLOP estimates, but which fundamentally cap what a model can express. The field tends to fixate on scaling laws — more data, more compute, better performance — and those laws are real. But they operate within architectural envelopes. When you're near the ceiling of one of those envelopes, more compute doesn't help.

Understanding where the ceilings are is what separates architecture intuition from benchmark-chasing.

Lost in the Middle: Why LLMs Quietly Ignore the Centre of Their Own Context Window

Vikrant Shukla — Mon, 11 May 2026 04:33:36 +0000

Every time you hand a long document to an LLM and ask it to summarise or answer a question, something quietly goes wrong. The model reads the whole thing — or appears to — but its answers disproportionately reflect what was at the beginning and the end. Whatever sat in the middle? Largely ignored.

This isn't a rumour. It was rigorously documented in a 2023 paper titled "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., Stanford/UC Berkeley), and it remains one of the most practically important — and underappreciated — findings in applied LLM science.

The Shape of the Problem

The researchers ran a controlled experiment: they placed the answer to a multi-document QA question inside a set of retrieved documents, then varied which position the relevant document occupied — first, middle, or last. Performance dropped sharply when the relevant document was positioned in the middle of the context, even when the total context length was well within the model's stated window.

The performance curve is U-shaped: high accuracy when the answer is near position 1 (beginning) or the final position, with a pronounced dip for everything in between. On some configurations, accuracy in the middle fell by 20+ percentage points compared to the edges.
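To check whether your own model and prompt format show the same curve, the experiment is simple to reproduce in outline. This is a sketch of the idea, not the authors' harness: `answer_is_correct` is a hypothetical callable that builds a prompt from the documents, queries your model, and grades the answer.

```python
from typing import Callable, Iterator


def position_sweep(gold: str, distractors: list[str]) -> Iterator[tuple[int, list[str]]]:
    """Yield (position, documents) with the gold document at every slot in turn."""
    for position in range(len(distractors) + 1):
        documents = distractors[:position] + [gold] + distractors[position:]
        yield position, documents


def accuracy_by_position(
    gold: str,
    distractors: list[str],
    answer_is_correct: Callable[[list[str]], bool],  # hypothetical: documents -> graded result
) -> dict[int, bool]:
    """Record whether the model answered correctly for each gold position."""
    return {
        position: answer_is_correct(documents)
        for position, documents in position_sweep(gold, distractors)
    }
```

Averaging these per-position results over many questions gives the accuracy-versus-position curve; a dip in the middle positions is the effect described above.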

Why Does This Happen?

The mechanism is rooted in how attention distributes across long sequences. Two forces pull in opposite directions:

Recency bias: The final tokens are closest to where the model is generating its next token. In autoregressive transformers, recent tokens tend to receive higher attention weights, both because they're positionally proximate and because many training tasks (next-token prediction, instruction following) implicitly reward sensitivity to recent context.

Primacy bias: The very first tokens in a prompt, especially system instructions, tend to receive unusually high attention because that is where training data places the framing for everything that follows. Instruction-tuned models are heavily conditioned on treating the beginning of context as authoritative.

The middle gets neither benefit. It's not recent enough for recency bias, and it doesn't sit where instructions and framing usually appear in training data. In the attention score distribution, middle-sequence tokens often receive lower aggregate attention than their informational value warrants.

What This Means in Practice

If you're building a RAG pipeline, this has concrete implications for your retrieval and context construction strategy:

Placement matters as much as retrieval. A pipeline that leaves the most relevant chunk in position 3 of 5 will underperform one that puts the same chunk in position 1, even if the model technically "sees" all five. Getting the order right isn't just aesthetics; it's accuracy.

Don't bury your needle. If you're doing something like "here are 10 excerpts, answer based on them," and the answer lives in excerpt 6, you're playing against the model's attention distribution. Front-load or back-load your most relevant context.

The "more context = better" assumption breaks down. Adding more retrieved chunks to a prompt can actually reduce accuracy if it pushes the relevant chunk deeper into a crowded middle. Precision of retrieval often beats recall of retrieval for this reason.

Longer context windows don't fix it. The effect has been reported in models with 128K+ context windows. A bigger window raises capacity, but it doesn't by itself change how attention is distributed across positions, so a 1M-token model can still show a U-shaped curve. The middle is just a much bigger valley.
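One way to act on these points is to reorder reranked chunks so the strongest ones sit at the two edges and the weakest land in the middle. The sketch below assumes the input is already sorted most-relevant-first; the function names and the `top_k` trim are illustrative, not taken from any particular library.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def order_for_edges(ranked: Sequence[T]) -> List[T]:
    """Place the best items at both ends of the list and the weakest in the middle.

    `ranked` must be sorted most-relevant-first. Items at even ranks fill the
    front in order; items at odd ranks fill the back in reverse order.
    """
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


def select_and_order(ranked: Sequence[T], top_k: int) -> List[T]:
    """Keep only the top_k chunks (precision over recall), then push them to the edges."""
    return order_for_edges(ranked[:top_k])


chunks = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(order_for_edges(chunks))  # ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

The best chunk ends up first, the second best ends up last, and the lowest-ranked chunk sits in the middle where a miss costs the least. Whether this helps depends on the model, so it is worth checking against a position sweep on your own task.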

Does Instruction Tuning Help?

Somewhat. Models fine-tuned explicitly to use entire contexts (e.g., with long-document QA training data) show a shallower U-curve. Training that targets long-context use does push some improvement. But the effect doesn't disappear; it attenuates.

There's also a prompt-engineering mitigation: explicitly instructing the model to "carefully consider all documents before answering" or "pay equal attention to all parts of the provided context" can partially counteract primacy/recency bias in some setups, possibly by eliciting behaviour learned during fine-tuning. Results vary by model, so measure it on your own task. It's not a fix, but it's free.

The Deeper Lesson

This finding is a good reminder that LLMs don't "read" context the way a human reads a document — scanning linearly, building a uniform mental model. They process all tokens in parallel, but the attention mechanism creates an implicit weighting that's shaped by training distribution, position, and architecture. The model's nominal context window is the ceiling; its effective context is shaped by where you put things.

If your application depends on the model reliably using specific information — legal clauses, numerical specs, code sections — position is not neutral. Put critical content at the edges, rerank your retrieval results before injection, and don't assume that visible-in-context means attended-to.

I built a local proxy to track exact LLM API costs per project

Vikrant Shukla — Fri, 08 May 2026 17:16:11 +0000

The problem was simple: I run a small software studio that relies
heavily on Claude for client work. Every project ended with the same
awkward conversation: "what did the AI actually cost?"

Token estimates drift from the real bill. Nothing attributed costs
per project. I was either undercharging or handwaving, neither of
which builds client trust.

So I built Halton Meter.

What it does

Halton Meter is a local mitmproxy-based daemon that intercepts
outbound LLM API traffic, attributes each request to a project,
computes exact cost from published pricing, and writes everything
to a local SQLite database. Nothing about how you call the API changes.

pipx install halton-meter
halton-meter init --apps

Three processes come up:

Edge listener on 127.0.0.1:8081
Proxy interceptor on 127.0.0.1:8090
Loopback API on 127.0.0.1:8765

Every LLM API call you make — via the Anthropic SDK, OpenAI SDK,
raw HTTP, whatever — gets intercepted, tagged, costed, and logged.
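As a rough illustration of the "costed and logged" step, here is a minimal sketch of pricing one request from its token counts and writing it to SQLite. This is not Halton Meter's actual code or schema: the table layout, the function names, the model name, and the prices are all placeholders I made up for the example.

```python
import sqlite3
from datetime import date
from decimal import Decimal

# Placeholder prices in USD per million tokens. These are NOT real published rates;
# a real table would be filled in from each provider's pricing page.
PRICES_PER_MTOK = {
    "example-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Compute cost in USD with Decimal arithmetic to avoid float rounding drift."""
    price = PRICES_PER_MTOK[model]
    total = price["input"] * input_tokens + price["output"] * output_tokens
    return total / Decimal(1_000_000)


def log_request(conn, project, model, input_tokens, output_tokens, day=None) -> Decimal:
    """Insert one costed request into a hypothetical `requests` table and return its cost."""
    cost = request_cost(model, input_tokens, output_tokens)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests ("
        "day TEXT, project TEXT, model TEXT, "
        "input_tokens INTEGER, output_tokens INTEGER, cost_usd TEXT)"
    )
    conn.execute(
        "INSERT INTO requests VALUES (?, ?, ?, ?, ?, ?)",
        (day or date.today().isoformat(), project, model,
         input_tokens, output_tokens, str(cost)),  # store Decimal as text to keep it exact
    )
    conn.commit()
    return cost
```

Usage would look like `log_request(sqlite3.connect("usage.sqlite"), "client-a", "example-model", 1000, 500)`. Storing the cost as text is a deliberate choice in this sketch: SQLite has no decimal type, and a float column would reintroduce the drift the tool is meant to remove.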

Why a proxy and not an SDK wrapper?

SDK wrappers only catch calls you make directly. They miss ChatGPT,
Gemini Code Assist, and anything going through a tool you don't
control. A proxy captures everything on the wire without touching
your code.
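For readers who haven't used mitmproxy, this is roughly what capturing usage "on the wire" looks like as an addon. It is a simplified sketch, not Halton Meter's interceptor: the class name and host list are my assumptions, and it only handles non-streaming JSON responses that carry a `usage` object with `input_tokens` and `output_tokens`, which is the shape Anthropic's Messages API returns.

```python
import json

# Assumption: the set of API hosts worth inspecting. Extend as needed.
LLM_HOSTS = {"api.anthropic.com"}


class UsageLogger:
    """Collects token usage from LLM API responses passing through the proxy."""

    def __init__(self):
        self.records = []

    def response(self, flow):
        """mitmproxy calls this hook once a complete HTTP response has been received."""
        host = flow.request.pretty_host
        if host not in LLM_HOSTS:
            return
        try:
            body = json.loads(flow.response.get_text(strict=False) or "")
        except json.JSONDecodeError:
            return  # streaming (SSE) or non-JSON bodies are skipped in this sketch
        usage = body.get("usage") if isinstance(body, dict) else None
        if not usage:
            return
        self.records.append({
            "host": host,
            "model": body.get("model"),
            "input_tokens": usage.get("input_tokens", 0),
            "output_tokens": usage.get("output_tokens", 0),
        })


# mitmproxy picks up addons from this module-level list.
addons = [UsageLogger()]
```

Saved as a file, this could be loaded with `mitmdump -s usage_logger.py`. The client still has to route traffic through the proxy and trust the mitmproxy CA certificate for HTTPS interception to work, which is the setup work a packaged tool takes care of.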

Project attribution

The daemon attributes each request using a three-step chain.

Claude Code sessions, scripts, notebooks, and direct SDK calls all
get attributed correctly with zero code changes.

Terminal report

halton-meter report

Breakdown by project, model, and date. Numbers come directly from
the provider's published pricing — no estimates, no hidden margins.
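If you want to picture how a breakdown like this could be produced, here is a small sketch that groups logged costs by project, model, and date. It assumes a hypothetical `requests` table with `day`, `project`, `model`, and a text `cost_usd` column; the real database schema and report format may differ.

```python
import sqlite3
from collections import defaultdict
from decimal import Decimal


def cost_report(conn) -> dict:
    """Sum cost per (project, model, day) using Decimal so totals stay exact."""
    totals = defaultdict(Decimal)  # Decimal() starts each bucket at 0
    rows = conn.execute("SELECT project, model, day, cost_usd FROM requests")
    for project, model, day, cost in rows:
        totals[(project, model, day)] += Decimal(cost)
    return dict(sorted(totals.items()))


def format_report(totals: dict) -> str:
    """Render the totals as a fixed-width table for the terminal."""
    lines = [f"{'project':<16}{'model':<20}{'day':<12}{'usd':>10}"]
    for (project, model, day), usd in totals.items():
        lines.append(f"{project:<16}{model:<20}{day:<12}{usd:>10.4f}")
    return "\n".join(lines)
```

Summing in Python rather than with SQL `SUM()` is a choice made for this sketch: it keeps the arithmetic in Decimal instead of letting SQLite convert the text column to floating point.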

What's supported

Six adapters across four providers: Claude, OpenAI, Gemini, and Grok.
Direct API surfaces and OAuth surfaces (ChatGPT, Gemini Code Assist)
are both intercepted.

Local only

No cloud. No tracking. API keys never leave your machine. Everything
lives in ~/.halton-meter/db.sqlite. The bundled dashboard is open
source (Apache 2.0) and runs locally.

Docs and full architecture at haltonmeter.com.

Happy to answer questions on the proxy architecture, the attribution
chain, or the cost calculation in the comments.