Writing code got cheap. Being responsible for it didn't. (What shipping an AI-generated gem taught me)

#ai #rails #ruby #opensource

Quick warning before you read: this isn't a launch announcement. It's a story about getting something wrong in public, and changing how I work because of it. If that's not your thing, no hard feelings.

The confession

A little over a month ago I published a Ruby gem called llm_cost_tracker. Here's the honest part: almost all of it was written by an LLM. I prompted, it produced, I shipped. I felt pretty good about myself for about a week. Then I posted it to r/ruby asking for feedback, and people gave me exactly that.

They were right, too. A chunk of the code was just hallucinated. The example I keep coming back to, because it's so on the nose: my gemspec listed activesupport and activerecord as hard dependencies through add_dependency. The gem can't even load without them. And yet, inside the gem, there was code carefully checking at runtime whether ActiveSupport and ActiveRecord were present. It was defending against the absence of the two things it literally cannot run without.
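To make the contradiction concrete, here is a minimal sketch of the pattern. It is an illustration, not the gem's actual source: the gem name `example_tracker`, the version, and the `active_record_available?` helper are all made up for the example.

```ruby
require "rubygems"

# Hypothetical gemspec (names and version are invented for illustration).
# Both libraries are declared as hard runtime dependencies, so RubyGems or
# Bundler will refuse to install the gem without them.
EXAMPLE_SPEC = Gem::Specification.new do |s|
  s.name    = "example_tracker"
  s.version = "0.1.0"
  s.summary = "Illustration only"
  s.authors = ["example"]

  s.add_dependency "activesupport"
  s.add_dependency "activerecord"
end

# The kind of guard that contradicts the gemspec above. In an installed gem
# that requires ActiveRecord at load time, this can only ever return true,
# so every branch that handles `false` is dead code.
def active_record_available?
  !!defined?(::ActiveRecord)
end
```

The fix is to pick one story. Either the libraries are hard dependencies and the guard goes away, or they are optional, in which case they should not be listed with `add_dependency`.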

Nobody writes that on purpose. The model produced something that looked careful, and I shipped it without noticing, because I hadn't really read it. Once you spot one of those, you start seeing them everywhere: guard clauses for impossible states, checks that check the checks, abstractions wrapped around abstractions. Paranoia mode. Over-engineering for situations that can't happen.

None of the feedback was mean. It was just correct, and it stung because I couldn't argue with any of it. You can't defend code you never understood in the first place.

The idea was fine. I wasn't.

Worth saying: the gem itself solves a real problem. If your Rails app calls an LLM, the monthly bill tells you what you spent and basically nothing about who spent it. You get totals per model, maybe per API key. What you don't get is your own world: which feature made the call, which tenant it belongs to, whether that prompt tweak you shipped last Tuesday quietly doubled your token usage.

That information only exists for a moment, at call time, inside your app. The provider has no idea what feature: "chat" means to you. Miss it there and it's gone for good.

The tools that already exist aim higher than I needed. Langfuse is full-blown observability, and self-hosting it means running Postgres and ClickHouse and Redis and S3. Helicone sits as a proxy on the path of every call. LiteLLM lives over in Python. I just wanted a small Rails-native thing that answered one question, which feature spent the money, and then got out of my way.

Good idea, bad execution. Not telling those two apart is what nearly made me delete the whole thing.

What I changed, and what I didn't

Here's where I want to be careful, because the easy version of this story is a lie.

I did not throw it all out and lovingly rewrite it by hand. That would make a nicer arc and it isn't true. The truth is messier. There is still a pile of generated code in this gem, and I'm still working through it, file by file.

What actually changed is how I work. Some of it I rewrite myself now. Plenty of it I still hand to agents, because swearing off the tools would be silly and I don't believe in it. But I delegate completely differently than I used to. I read every diff. I review it the way I'd review a coworker's PR, suspicious by default. And I don't commit anything I can't explain out loud. The gem that got roasted and the gem today aren't "AI-written" versus "human-written." They're "shipped blind" versus "actually mine."

I'm in the middle of that cleanup right now, today. Tearing out the paranoid guards, the pointless self-checks, the engineering that existed to look thorough rather than to do anything. It's slow and it's boring and nobody claps for it. I'm doing it anyway, because quality is the part I can actually control, and I'd rather tell you the cleanup is still happening than pretend I've crossed a line I haven't.

The lesson: writing code got cheap, so the job moved

This is the bit I'd hand to a younger me.

You can't give the machine the keyboard and walk off. What it gives you back is a draft, not a finished thing. The second I treated a draft as done, I'd quietly handed off my own understanding, and it took a stranger about thirty seconds to notice the hole. There's no faking that you understand your own code once someone starts asking questions.

But the reframe that actually stuck with me is this. Writing code is cheap now. It doesn't cost what it used to in hours or effort. And when the cost of making something drops, the value just moves somewhere else. Here it moved to quality, and to ownership. The hours I used to spend typing, I now spend reading, shaping, and standing behind whatever goes out under my name. That is not a smaller job. For a solo maintainer it might honestly be the entire job now.

The roast didn't talk me out of using AI. It talked me into raising my review bar to match how fast the code shows up. Cheap to generate, expensive to be responsible for. And the responsibility was always going to be mine.

Where it honestly stands

What it does: it logs every LLM call into your own Postgres or MySQL. Each record stores provider, model, tokens, cost, latency, a per-component breakdown, and a pricing snapshot, so old numbers don't silently change when the provider updates prices. No proxy: calls go straight to the provider. No second datastore to run. Attribution is just tags you wrap around a call:

LlmCostTracker.with_tags(user_id: Current.user&.id, feature: "chat") do
  client = OpenAI::Client.new(api_key: ENV["OPENAI_API_KEY"])
  client.responses.create(model: "gpt-4o", input: "Hello")
end
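If you're wondering how block-scoped tags like this can work at all, here is a rough sketch of one common approach: keep the current tags in fiber-local storage and restore the previous value when the block exits. This is an assumption about the general technique, not a copy of the gem's internals. `TagContext` and `record_call` are hypothetical names, and the token counts are placeholders.

```ruby
# Hypothetical module illustrating block-scoped tags.
module TagContext
  KEY = :example_tag_context

  # Tags currently in effect, or an empty hash outside any block.
  def self.current
    Thread.current[KEY] || {}
  end

  # Merge the given tags over the outer ones for the duration of the block,
  # then restore the outer tags even if the block raises.
  # Note: Thread.current[] is fiber-local in Ruby, so each fiber sees its own tags.
  def self.with_tags(**tags)
    previous = current
    Thread.current[KEY] = previous.merge(tags).freeze
    yield
  ensure
    Thread.current[KEY] = previous
  end
end

# Stand-in for whatever runs when an LLM call finishes: it reads the tags
# at call time, which is the only moment they exist.
def record_call(model:, input_tokens:, output_tokens:)
  {
    model: model,
    input_tokens: input_tokens,
    output_tokens: output_tokens,
    tags: TagContext.current
  }
end

# Nested blocks: the inner tag overrides `feature`, the outer `tenant_id` is kept.
EVENT = TagContext.with_tags(feature: "chat", tenant_id: 42) do
  TagContext.with_tags(feature: "chat/summarize") do
    record_call(model: "example-model", input_tokens: 1000, output_tokens: 500)
  end
end
```

After the outer block returns, `TagContext.current` is empty again, so tags can't leak into unrelated calls made later on the same thread.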

What it's deliberately not: no prompt capture, no traces, no replay, not invoice-grade. If what you need is observability, go use Langfuse. I'll say that to your face.

And two things I'm not going to dress up:

It still has a lot of generated code in it that I'm reviewing and reworking. The direction is right. It is not finished, and I won't pretend otherwise.
It is not running in anyone's production yet, mine included. I've tested it against a separate test app with real OpenAI and Anthropic keys, paying for the calls out of my own pocket, streaming and all. Capture works, tags attribute correctly, the dashboard renders. But "works on my test app" is a long way from "battle-tested," and I know the difference.

What I'm actually asking for

I want a handful of early adopters. People with real LLM calls in a real Rails app who'll drop this in and tell me, bluntly, what falls over. Blunt clearly works on me. The repo and docs are on GitHub, it's MIT, and an issue telling me what broke is worth more to me than a star.

And if you build with AI too, which is most of us by now, maybe that's the whole takeaway: the tools made writing code cheap, but they didn't make being responsible for it cheap. That part is still on us. I learned it the embarrassing way. You don't have to.

So, genuinely curious: how does your team attribute LLM spend right now? Every answer I get is some flavor of "we hacked something together ourselves." Which is exactly the itch I'm still scratching.

And yes, full disclosure: AI helped me put this post together too. The difference, this time, is that I was actually in the room for it. ;)