
Saulo Linares

I was paying 3x too much for Claude API calls...

I was three weeks into building an Agent for my work (a productivity helper for data analysts) when I noticed certain flows were costing noticeably more than others. I assumed it was response length — longer answers, more output tokens, higher bill. So I added a system prompt instruction to be concise, watched the costs barely move, and moved on.

Two weeks later I finally token-counted the inputs. The problem wasn't the output. The problem was me passing raw JSON data as context on every single request. The same information serialized as plain prose used 60% fewer tokens. I had been paying a 2.5x markup on every API call that touched the data — for weeks — because I never checked what I was actually sending.

That sent me back to the transformer paper. Not to feel bad about the cost, but to understand why this happens at an architectural level. What I found turned several things I treated as configuration choices into things I now understand as architectural requirements.

Why JSON costs more than prose

The model never sees your text. It sees tokens — integer IDs produced by Byte-Pair Encoding (BPE). BPE builds a vocabulary of subword units by iteratively merging frequent character pairs in the training corpus. Plain English prose compresses well: common words and subwords get their own tokens, so a typical sentence runs around 4–5 characters per token.

JSON doesn't compress the same way. Every structural character — {, }, ", :, , — is a potential token boundary. For example, in my FinMentor multi-agent architecture, a key-value pair like "ticker": "AAPL" tokenizes to roughly 8 tokens; the prose equivalent, "AAPL", is 1. I ran both through tiktoken (OpenAI's BPE tokenizer, which takes the same approach as Claude's) on equivalent portfolio payloads. The JSON used 2.6x the tokens.

The practical fix is simple: serialize to prose where you can, and compact JSON where you can't. Remove whitespace, use short key names, avoid redundant nesting. The model doesn't need your JSON to be human-readable — it needs it to be short.
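
Here's the shape of the comparison I ran, as a minimal sketch. The payload and the prose line below are illustrative stand-ins rather than my actual FinMentor data, and tiktoken's cl100k_base encoding is a proxy for Claude's tokenizer, so the absolute counts will differ; the relative gap is the point.

import json
import tiktoken  # pip install tiktoken; BPE tokenizer, used here as a stand-in for Claude's

enc = tiktoken.get_encoding("cl100k_base")

# Illustrative position record, not the real payload
position = {"ticker": "AAPL", "quantity": 50, "average_cost": 172.35, "currency": "USD"}

pretty_json = json.dumps(position, indent=2)
compact_json = json.dumps(position, separators=(",", ":"))
prose = "50 shares of AAPL at an average cost of 172.35 USD"

for label, text in [("pretty JSON", pretty_json), ("compact JSON", compact_json), ("prose", prose)]:
    print(f"{label:>12}: {len(enc.encode(text))} tokens")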

The first thing to check when a client says "our API costs are too high" is not the system prompt length or the response verbosity. It's what format their data is arriving in.

Implementing attention from scratch

I wanted to see the math directly, so I implemented scaled dot-product attention in pure NumPy:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract the row max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]               # dimensionality of the key/query vectors
    scores = Q @ K.T                # raw dot-product scores, shape (n, n)
    scaled = scores / np.sqrt(d_k)  # scale so softmax doesn't saturate
    weights = softmax(scaled)       # each row is a probability distribution over tokens
    return weights @ V, weights     # weighted sum of values, plus the weights themselves

The formula is softmax(QK^T / sqrt(d_k)) @ V. Each token has three vectors: a Query (what it's looking for), a Key (what it offers), and a Value (what information it passes forward). The dot product of a query against all keys gives raw attention scores — how relevant is each other token to this one. Softmax converts those scores to a probability distribution. The weighted sum of values is the output.

The scaling factor sqrt(d_k) is the part that's easy to skip over and wrong to skip. Without it, raw dot products grow in magnitude as embedding dimension increases. Push those large values through softmax and the distribution collapses: one token captures nearly all the weight, everything else approaches zero. Attention becomes winner-take-all. The model loses the ability to synthesize information from multiple positions simultaneously.

I ran the demo without the scaling factor on the same 4-token sequence. The max attention weight went from 0.52 to 0.97. Three tokens effectively disappeared from the computation. That's not a subtle degradation — it's a broken architecture. The scaling factor isn't a hyperparameter you tune; it's load-bearing math.
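
A rough sketch of that comparison, reusing softmax from the block above. The inputs here are random, so the exact numbers won't match the 0.52 and 0.97 I measured on my own vectors; what reproduces reliably is the collapse toward winner-take-all once the scaling is dropped.

# Compare scaled vs. unscaled attention on a random 4-token sequence
rng = np.random.default_rng(0)
d_k = 64
Q = rng.normal(size=(4, d_k))
K = rng.normal(size=(4, d_k))

scaled_weights = softmax(Q @ K.T / np.sqrt(d_k))
unscaled_weights = softmax(Q @ K.T)

print("max attention weight, scaled:  ", round(float(scaled_weights.max()), 2))
print("max attention weight, unscaled:", round(float(unscaled_weights.max()), 2))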

Why RAG is architecturally required

Attention is computed across every pair of tokens in the sequence. For a sequence of length n, that's n² attention computations. Double the context, quadruple the compute. At 1,000 tokens the cost is manageable. At 100,000 tokens it's 10,000× more expensive than at 1,000.

The curve makes two things obvious that I previously treated as preferences.

First, context windows have hard limits for economic reasons, not just technical ones. You cannot solve the context problem by extending the window indefinitely. The cost curve makes that infeasible long before any memory limit does.

Second, RAG is not a retrieval preference — it's the engineering solution to this constraint. Instead of putting a 50GB knowledge base into context (impossible), you embed it into a vector index, retrieve the 2–3K most relevant tokens at query time, and inject only those. You convert an O(n²) problem into an O(k²) problem where k is small and fixed. Once you see the scaling chart, RAG stops being a technique to evaluate and starts being an obvious architectural decision.
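
The arithmetic is easy to sanity-check. This is the back-of-the-envelope version only: it treats pairwise attention cost as proportional to the square of sequence length and ignores everything else that varies with context size.

def relative_attention_cost(n_tokens, baseline=1_000):
    # Pairwise attention grows with n^2, so cost relative to a 1,000-token baseline
    return (n_tokens / baseline) ** 2

for n in [1_000, 3_000, 10_000, 100_000]:
    print(f"{n:>7} tokens -> {relative_attention_cost(n):>9,.0f}x the attention cost of 1,000 tokens")

# Retrieving ~3K relevant tokens (the RAG case) stays cheap; stuffing 100K into context does not.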

The related failure mode is the lost-in-the-middle problem. Attention weights aren't uniformly distributed across position — the model reliably attends to content at the beginning and end of long contexts but loses weight on content buried in the middle. If you have critical instructions in a system prompt, don't bury them in paragraph 8 of 12.

What this means if you're deploying Claude

Three things that became obvious once I understood the architecture:

Token-count your inputs before diagnosing any cost problem. Response length is visible; input bloat is invisible. The token counter is the first tool to reach for, not the last.

Put critical instructions at the start or end of your system prompt. The lost-in-the-middle effect is a documented attention behavior, not a quirk. If your deployment has a key constraint — "always disclaim that this is not financial advice" — it belongs in the first paragraph or the last, not buried between personality instructions and formatting rules.

RAG isn't optional for large knowledge bases. If your deployment involves more than a few thousand tokens of reference material that changes over time, RAG is architecturally required. Not a nice-to-have. The quadratic scaling curve makes the alternative unworkable at any meaningful scale.
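
On the first point, the check is cheap to automate. Here's a minimal sketch using the Anthropic Python SDK's token-counting endpoint; the model ID is a placeholder, so verify the method and available models against the current SDK docs before relying on it.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def input_token_count(system_prompt, user_content, model="claude-sonnet-4-20250514"):
    # Counts input tokens server-side without running a completion
    result = client.messages.count_tokens(
        model=model,
        system=system_prompt,
        messages=[{"role": "user", "content": user_content}],
    )
    return result.input_tokens

# Compare the same context serialized two ways before it ever hits production
as_json = '{"ticker": "AAPL", "quantity": 50, "average_cost": 172.35}'
as_prose = "50 shares of AAPL at an average cost of 172.35"
print("JSON: ", input_token_count("You are a portfolio assistant.", as_json))
print("prose:", input_token_count("You are a portfolio assistant.", as_prose))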

Honest take

Most LLM tutorials skip the architecture entirely. You get "here's how to call the API," "here's how to write a system prompt," and "here's how to do RAG." That works until you hit a cost spike, a failure mode you can't reproduce, or a client asking why their AI assistant stops following instructions when the context gets long.

The architecture isn't academic. It's the explanation for every non-obvious production behavior you'll encounter. JSON costs more because of how BPE tokenization works. RAG exists because of quadratic scaling. Prompt position matters because attention weights aren't uniform across context length. These aren't mysterious emergent properties — they follow directly from how transformers are built.

Understanding the architecture doesn't make you a researcher. It makes you a better engineer.


Notebook with all the code: https://github.com/saulolinares10/anthropic-alignment-notes

Top comments (1)

Stoyan Minchev

You might consider usage of TOON (Token-Oriented Object Notation) as well ;)