DEV Community: Sergii Khomenko

LLM Cost Tracking for Rails

Sergii Khomenko — Thu, 28 May 2026 05:17:26 +0000

A Rails app starts calling OpenAI or Anthropic. A few months later someone in finance asks "who's burning $X a month on this and on what?" The answer requires per-user, per-feature, per-tenant attribution — and the obvious solutions all want you to give up something I wasn't willing to give up.

This is the design rationale behind llm_cost_tracker, a Rails Engine I've been building. It's not the only way to solve this problem; it's the way that fit the constraints I cared about.

The constraint set

Three non-negotiables shaped every other choice:

No new infra. A Rails app already has a database, a request lifecycle, an authentication layer, a dashboard pattern. Anything I bolt on should reuse those, not duplicate them.
No prompt storage. Prompt content is regulated data in a lot of contexts — PII, customer transcripts, medical, legal. The tracker has no business holding it.
No traffic redirection. Direct calls to OpenAI / Anthropic / Gemini are the simplest path and the one with the fewest failure modes. A proxy adds a hop, a key rotation surface, and a vendor relationship.

Those three rules ruled out most of the existing landscape.

Why not a proxy

The first instinct for "track LLM spend" is to put a proxy in front. Helicone, Portkey, LiteLLM Proxy, OpenRouter — they all model the problem this way: route OpenAI traffic through proxy.example.com, the proxy sees the request and response, logs cost, forwards to OpenAI.

It's a clean separation. It also means your API keys live in their config, not yours; their downtime is your downtime; their TLS and data-residency posture is yours by default; their rate limiting sits between your code and the provider; and a new SDK feature waits on their proxy supporting it.

For some teams that trade is fine. It wasn't fine for me — the LLM call is already the most expensive and most reliability-sensitive thing in the request, and putting another hop in front of it felt like the wrong move.

The alternative: capture what we need inside the Ruby process, on the way out. Patch the official SDK methods at boot, or wrap the underlying Faraday client. The call still goes straight to OpenAI / Anthropic / Gemini; we just observe the request and response as they pass through.
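
A minimal sketch of the "patch at boot" idea, using Module#prepend on a stand-in client. FakeSdkClient and UsageObserver are hypothetical names for illustration, not the gem's internals, and the fake client returns canned token counts instead of making a network call:

```ruby
# Stand-in for a vendor SDK client; a real SDK would make an HTTP request here.
class FakeSdkClient
  Response = Struct.new(:model, :input_tokens, :output_tokens)

  def create(model:, input:)
    Response.new(model, input.split.size, 5)
  end
end

# Hypothetical observer: records usage metadata from the response,
# never the prompt text itself.
module UsageObserver
  def self.events
    @events ||= []
  end

  def create(**params)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    response = super # the original SDK method runs unchanged
    UsageObserver.events << {
      model: response.model,
      input_tokens: response.input_tokens,
      output_tokens: response.output_tokens,
      latency_s: Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    }
    response
  end
end

# "At boot": place the observer in front of the SDK method.
FakeSdkClient.prepend(UsageObserver)
```

Callers keep using FakeSdkClient#create exactly as before; the observer only sees the call pass through and hands the response back untouched.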

Why ActiveRecord, not a TSDB

The second fork: where does the data go? Cost tracking is shaped like a time series — append-mostly rows, aggregations over time windows. A TSDB (Timescale, ClickHouse, Influx) is the textbook answer.

I picked Postgres / MySQL via ActiveRecord anyway, for one reason: the data is operational, not analytical. It needs to join to your users table, your subscriptions table, your tenants table. It needs to live behind the same RLS and the same backups as the rest of your app data. Standing up a separate TSDB to query "show me LLM cost for tenant 42 last month" makes that join harder, not easier.

Three tables ship in the install generator: llm_cost_tracker_calls (one row per LLM call, with token counts and total cost), llm_cost_tracker_call_line_items (the per-component breakdown: input, output, cache reads, hosted tool charges), and llm_cost_tracker_call_tags (the attribution rows). For the LLM volumes most Rails apps see today, a single Postgres instance handles this fine.

Why block-scoped tags

Attribution is the whole game. Tokens × the model's rate gives you a total; tags answer "whose total is it?"

The mechanism is a block:

LlmCostTracker.with_tags(user_id: current_user.id, feature: "support_chat") do
  client.chat.completions.create(model: "gpt-4o-mini", messages: ...)
end

Anything that hits a tracked SDK or Faraday client inside that block picks up the tags. You put the block in an around_action in a controller, around perform in a job, or around a feature module's entry point. The SDK call itself doesn't change.

The reason it's not a kwarg on the SDK call: I don't control the SDK call. The OpenAI gem's client.chat.completions.create has its own signature; threading a tag through it would mean either monkey-patching the call shape or asking every caller to use a wrapper. Block-scoped context fits Ruby's grain — same shape as ActiveSupport::CurrentAttributes, same shape as Rails request-store patterns.

Tags merge across nested blocks (inner wins), get sanitized for high-cardinality or secret-shaped values, and end up as one row per (call, key, value) in the database, ready to group by, filter, and break down.
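
The nesting rule is easy to sketch in plain Ruby. TagContext below is a hypothetical stand-in, not the gem's implementation; it leaves sanitization out and only shows the merge, the restore on exit, and the flattening into (call, key, value) rows:

```ruby
module TagContext
  KEY = :tag_context_tags # hypothetical storage key

  # Tags in effect right now. Thread.current[] is fiber-local in Ruby,
  # so concurrent requests and jobs do not see each other's tags.
  def self.current
    Thread.current[KEY] || {}
  end

  # Merges tags for the duration of the block; inner keys override outer ones.
  def self.with_tags(**tags)
    previous = current
    Thread.current[KEY] = previous.merge(tags).freeze
    yield
  ensure
    Thread.current[KEY] = previous # restore even if the block raises
  end

  # Flattens a tag hash into one row per (call, key, value).
  def self.rows_for(call_id, tags = current)
    tags.map { |key, value| { call_id: call_id, key: key.to_s, value: value.to_s } }
  end
end
```

With an outer block setting user_id and feature and an inner block setting only feature, a call made inside the inner block carries the outer user_id and the inner feature; once the inner block exits, the outer feature is back.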

Why a frozen pricing snapshot per call

Prices change. OpenAI cut prompt caching rates twice in the last year; Anthropic introduced 1-hour cache TTL with its own rate; Gemini rolled out context-length-tiered pricing. If you compute cost lazily — "the rate is whatever the current price table says" — you have a moving floor under historical reports.

So every call freezes its pricing snapshot at write time: the exact per-component rate that produced its cost, stamped on the row. Rerun a three-month-old report today and you get what it cost then. Update the price table tomorrow and the historical numbers don't shift.

The trade-off is storage: a few hundred bytes per call for the snapshot. At the volumes we're talking about for LLM calls, that's invisible next to the message bodies themselves (which we don't store).
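
Here is a small sketch of the freeze-at-write idea. The PricingSnapshot module, the "example-model" name, and the rates are all made up for illustration; the gem's real schema and price table will differ:

```ruby
require "bigdecimal"

module PricingSnapshot
  CallRecord = Struct.new(:model, :usage, :pricing_snapshot, :total_cost)
  TOKENS_PER_UNIT = 1_000_000 # rates are quoted per one million tokens

  # Copies the current rates onto the record and computes cost from that copy,
  # so later edits to the price table cannot change this call's numbers.
  def self.record_call(model:, usage:, prices:)
    snapshot = prices.fetch(model).dup.freeze
    total = usage.sum do |component, tokens|
      snapshot.fetch(component) * tokens / TOKENS_PER_UNIT
    end
    CallRecord.new(model, usage.dup.freeze, snapshot, total)
  end
end

# Hypothetical price table in USD per million tokens.
prices = {
  "example-model" => { input: BigDecimal("0.50"), output: BigDecimal("1.50") }
}

call = PricingSnapshot.record_call(
  model: "example-model",
  usage: { input: 2_000_000, output: 1_000_000 },
  prices: prices
)

# A later price cut touches only the table, not the stored record.
prices["example-model"][:input] = BigDecimal("0.25")
```

After the price cut, call.pricing_snapshot[:input] still holds the old rate and call.total_cost is unchanged, which is the property the historical reports rely on.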

What's there now

Version 0.11.0 instruments three official SDKs (OpenAI, Anthropic, RubyLLM) and ships Faraday middleware for everything else — OpenAI-compatible APIs like Groq, DeepSeek, OpenRouter; Azure OpenAI on both endpoint styles; Gemini; custom gateways. The mounted dashboard at /llm-costs has pages for cost overview, top models, the call ledger, tag breakdowns, data-quality signals, and a pricing reference. Budget guardrails block calls before send when an estimate would cross a configured monthly, daily, or per-call cap.
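
The guardrail behaviour can be sketched as a pre-send check. BudgetGuard and BudgetExceeded are hypothetical names, amounts are integer cents to keep the arithmetic exact, and only a per-call and a monthly cap are modelled; how the gem estimates cost before sending is not shown here:

```ruby
class BudgetExceeded < StandardError; end

class BudgetGuard
  attr_reader :spent_cents

  def initialize(monthly_cap_cents:, per_call_cap_cents:, spent_cents: 0)
    @monthly_cap_cents = monthly_cap_cents
    @per_call_cap_cents = per_call_cap_cents
    @spent_cents = spent_cents
  end

  # Raises before the block (the provider call) runs if the estimate
  # would cross either cap; otherwise runs the block and books the spend.
  def guard(estimated_cents)
    if estimated_cents > @per_call_cap_cents
      raise BudgetExceeded, "per-call cap exceeded"
    end
    if @spent_cents + estimated_cents > @monthly_cap_cents
      raise BudgetExceeded, "monthly cap exceeded"
    end

    result = yield
    # Simplification: books the estimate; a real tracker would book actual cost.
    @spent_cents += estimated_cents
    result
  end
end
```

Because the check happens before yield, a blocked call never reaches the provider and costs nothing.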

What it deliberately isn't: prompt or completion storage, trace replay, eval framework, model-routing logic, sidecar service, OpenTelemetry exporter. Each of those would justify a separate gem.

If your shape is "Rails app, direct API calls to one or two providers, finance asking where the spend goes" — this is the layer I wanted to exist.

Repo: github.com/sergey-homenko/llm_cost_tracker

Writing code got cheap. Being responsible for it didn't. (What shipping an AI-generated gem taught me)

Sergii Khomenko — Sat, 23 May 2026 20:15:30 +0000

Quick warning before you read: this isn't a launch announcement. It's a story about getting something wrong in public, and changing how I work because of it. If that's not your thing, no hard feelings.

The confession

A little over a month ago I published a Ruby gem called llm_cost_tracker. Here's the honest part: almost all of it was written by an LLM. I prompted, it produced, I shipped. I felt pretty good about myself for about a week. Then I posted it to r/ruby asking for feedback, and people gave me exactly that.

They were right, too. A chunk of the code was just hallucinated. The example I keep coming back to, because it's so on the nose: my gemspec listed activesupport and activerecord as hard dependencies through add_dependency. The gem can't even load without them. And yet, inside the gem, there was code carefully checking at runtime whether ActiveSupport and ActiveRecord were present. It was defending against the absence of the two things it literally cannot run without.

Nobody writes that on purpose. The model produced something that looked careful, and I shipped it without noticing, because I hadn't really read it. Once you spot one of those, you start seeing them everywhere: guard clauses for impossible states, checks that check the checks, abstractions wrapped around abstractions. Paranoia mode. Over-engineering for situations that can't happen.

None of the feedback was mean. It was just correct, and it stung because I couldn't argue with any of it. You can't defend code you never understood in the first place.

The idea was fine. I wasn't.

Worth saying: the gem itself solves a real problem. If your Rails app calls an LLM, the monthly bill tells you what you spent and basically nothing about who spent it. You get totals per model, maybe per API key. What you don't get is your own world: which feature made the call, which tenant it belongs to, whether that prompt tweak you shipped last Tuesday quietly doubled your token usage.

That information only exists for a moment, at call time, inside your app. The provider has no idea what feature: "chat" means to you. Miss it there and it's gone for good.

The tools that already exist aim higher than I needed. Langfuse is full-blown observability, and self-hosting it means running Postgres and ClickHouse and Redis and S3. Helicone sits as a proxy on the path of every call. LiteLLM lives over in Python. I just wanted a small Rails-native thing that answered one question, which feature spent the money, and then got out of my way.

Good idea, bad execution. Not telling those two apart is what nearly made me delete the whole thing.

What I changed, and what I didn't

Here's where I want to be careful, because the easy version of this story is a lie.

I did not throw it all out and lovingly rewrite it by hand. That would make a nicer arc and it isn't true. The truth is messier. There is still a pile of generated code in this gem, and I'm still working through it, file by file.

What actually changed is how I work. Some of it I rewrite myself now. Plenty of it I still hand to agents, because swearing off the tools would be silly and I don't believe in it. But I delegate completely differently than I used to. I read every diff. I review it the way I'd review a coworker's PR, suspicious by default. And I don't commit anything I can't explain out loud. The gem that got roasted and the gem today aren't "AI-written" versus "human-written." They're "shipped blind" versus "actually mine."

I'm in the middle of that cleanup right now, today. Tearing out the paranoid guards, the pointless self-checks, the engineering that existed to look thorough rather than to do anything. It's slow and it's boring and nobody claps for it. I'm doing it anyway, because quality is the part I can actually control, and I'd rather tell you the cleanup is still happening than pretend I've crossed a line I haven't.

The lesson: writing code got cheap, so the job moved

This is the bit I'd hand to a younger me.

You can't give the machine the keyboard and walk off. What it gives you back is a draft, not a finished thing. The second I treated a draft as done, I'd quietly handed off my own understanding, and it took a stranger about thirty seconds to notice the hole. There's no faking that you understand your own code once someone starts asking questions.

But the reframe that actually stuck with me is this. Writing code is cheap now. It doesn't cost what it used to in hours or effort. And when the cost of making something drops, the value just moves somewhere else. Here it moved to quality, and to ownership. The hours I used to spend typing, I now spend reading, shaping, and standing behind whatever goes out under my name. That is not a smaller job. For a solo maintainer it might honestly be the entire job now.

The roast didn't talk me out of using AI. It talked me into raising my review bar to match how fast the code shows up. Cheap to generate, expensive to be responsible for. And the responsibility was always going to be mine.

Where it honestly stands

What it does: it logs every LLM call into your own Postgres or MySQL. Provider, model, tokens, cost, latency, a per-component breakdown, and a pricing snapshot so old numbers don't silently change when the provider updates prices. No proxy, calls go straight to the provider. No second datastore to run. Attribution is just tags you wrap around a call:

LlmCostTracker.with_tags(user_id: Current.user&.id, feature: "chat") do
  client = OpenAI::Client.new(api_key: ENV["OPENAI_API_KEY"])
  client.responses.create(model: "gpt-4o", input: "Hello")
end

What it's deliberately not: no prompt capture, no traces, no replay, not invoice-grade. If what you need is observability, go use Langfuse. I'll say that to your face.

And two things I'm not going to dress up:

It still has a lot of generated code in it that I'm reviewing and reworking. The direction is right. It is not finished, and I won't pretend otherwise.
It is not running in anyone's production yet, mine included. I've tested it against a separate test app with real OpenAI and Anthropic keys, paying for the calls out of my own pocket, streaming and all. Capture works, tags attribute correctly, the dashboard renders. But "works on my test app" is a long way from "battle-tested," and I know the difference.

What I'm actually asking for

I want a handful of early adopters. People with real LLM calls in a real Rails app who'll drop this in and tell me, bluntly, what falls over. Blunt clearly works on me. The repo and docs are on GitHub, it's MIT, and an issue telling me what broke is worth more to me than a star.

And if you build with AI too, which is most of us by now, maybe that's the whole takeaway: the tools made writing code cheap, but they didn't make being responsible for it cheap. That part is still on us. I learned it the embarrassing way. You don't have to.

So, genuinely curious: how does your team attribute LLM spend right now? Every answer I get is some flavor of "we hacked something together ourselves." Which is exactly the itch I'm still scratching.

And yes, full disclosure: AI helped me put this post together too. The difference, this time, is that I was actually in the room for it. ;)