A year ago, I was knee-deep in launching a B2B SaaS tool that leaned hard on OpenAI's APIs for smart features like automated report generation. At first, it felt like magic — our users loved the AI boost. But then the bills started rolling in. What began as a manageable $300 monthly OpenAI tab ballooned to over $3,000 in just four months as our active user base grew.
The real kicker? I had zero visibility into who was driving those costs. Was it our enterprise clients running massive data analyses, or a handful of hobbyist users experimenting wildly? We had three fixed monthly pricing tiers, so I was basically footing the bill for the heavy hitters with revenue from the casual crowd. It hit me like a ton of bricks: if we didn't fix this, our margins would evaporate, and we'd be another AI startup burned by unchecked costs.
That frustration sparked a deep dive into the world of AI usage tracking. Turns out, I'm not alone — plenty of SaaS founders are grappling with the same "AI tax." But the tools out there? They left me wanting. That's why I rolled up my sleeves and started building UsagePilot, an open-core SDK tailored for multi-tenant SaaS apps. Let me walk you through the journey, the gaps I uncovered, and how I'm tackling them head-on.
Uncovering the Multi-Tenant Blind Spot
I spent late nights poring over docs for tools like Helicone, Langfuse, Portkey.ai, and even enterprise beasts like Datadog's LLM Observability. On paper, they sounded perfect: token counting, cost breakdowns, real-time dashboards. Helicone's proxy setup was a breeze — just swap your API endpoint, and boom, you're logging everything. Langfuse impressed with its open-source tracing for complex chains, and Portkey's gateway handled a whopping 250+ providers with slick features like caching and failover.
But here's where the cracks showed: none were truly built for multi-tenant SaaS chaos. Picture this — you've got a shared backend serving hundreds of customers (tenants) simultaneously. Each request needs pristine isolation: track tokens for Tenant A's chat completions without leaking into Tenant B's analytics. Langfuse, for all its strengths, relies on a global singleton in its SDK, forcing clunky workarounds like per-request client swaps or metadata hacks. It's workable, but messy, especially at scale where concurrency bites back.
Helicone? Their proxy treats your app as one giant entity. You can tag requests with user IDs, but true tenant dashboards or isolated quotas? Forget it—it's more bolt-on than baked-in. Portkey comes closest with virtual keys and metadata filtering, but even they stop short of seamless billing workflows or embeddable per-tenant views. And don't get me started on the big APM players like New Relic or Datadog; they're powerhouses for overall monitoring but feel like overkill for AI-specific metering, with pricing that spirals as your data grows.
The pattern was clear: these tools excel at observability for internal teams debugging their own AI experiments. But for SaaS folks like me, who need to turn AI usage into fair, revenue-aligned billing? It was like using a Swiss Army knife to carve a statue—possible, but far from ideal.
From Frustration to Foundation: Prioritizing Billing-Grade Precision
After patching together a Frankenstein setup with Langfuse and custom scripts (which broke twice during peak hours), I hit my breaking point. Why wasn't there a tool that treated billing as the North Star? Observability is great for devs, but invoicing demands unyielding accuracy — no "close enough" estimates when customer trust (and your revenue) is on the line.
That's the genesis of UsagePilot: a lightweight, runtime-agnostic SDK that puts multi-tenancy and billing-grade tracking front and center. I drew inspiration from the best bits of competitors — Helicone's easy proxy vibes, Langfuse's tracing depth, Portkey's broad provider support—while fixing their SaaS blind spots.
At its heart, UsagePilot enforces tenant context via typed dimensions, making isolation feel natural:
```typescript
import { UsagePilot } from "usagepilot";
import { PostgresStorage } from "usagepilot/postgres";

type Dimensions = {
  user_id: string;
  organization_id?: string;
  feature?: "chat" | "analysis"; // For granular breakdowns
};

export const usage = new UsagePilot<Dimensions>({
  storage: new PostgresStorage({
    batchSize: 50, // Smaller batches for quicker flushes in high-traffic apps
    flushInterval: 60_000, // Every minute to keep data fresh
  }),
});
```
This isn't just syntax sugar—it's compile-time safety ensuring you never forget a tenant ID. Integrating with AI calls? Dead simple, wrapping your fetch to auto-capture context from headers:
```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { streamText } from "ai";

export const openAI = createOpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  fetch: usage.createFetch((req) => ({
    user_id: req.headers.get("x-user-id") ?? "anonymous",
    organization_id: req.headers.get("x-org-id") ?? undefined, // header may be absent
    feature: (req.headers.get("x-feature") as Dimensions["feature"]) ?? undefined,
  })),
});

const stream = await streamText({
  model: openAI("gpt-4o"),
  prompt: "Analyze this quarterly report...",
  headers: {
    "x-user-id": "u123",
    "x-org-id": "o456",
    "x-feature": "analysis",
  },
});
```
No manual logging hassles—just seamless interception. And for accuracy? I went pluggable all the way: model-specific tokenizers (bye-bye tiktoken mismatches for Claude or Gemini) ensure penny-perfect counts, pulling from provider APIs where possible and falling back to vetted alternatives.
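To make the pluggable-tokenizer idea concrete, here's a minimal sketch of what such a registry could look like. The interface, function names, and per-token ratios below are all illustrative assumptions, not UsagePilot's actual API:

```typescript
// Hypothetical tokenizer registry: model-specific counters with a
// rough fallback. Ratios here are illustrative, not real tokenizer math.
type Tokenizer = (text: string) => number;

const tokenizers = new Map<string, Tokenizer>();

// Naive fallback: ~4 characters per token, a common rough heuristic.
const approxTokenizer: Tokenizer = (text) => Math.ceil(text.length / 4);

function registerTokenizer(modelPrefix: string, fn: Tokenizer): void {
  tokenizers.set(modelPrefix, fn);
}

function countTokens(model: string, text: string): number {
  // Prefer an exact, model-specific tokenizer when one is registered.
  for (const [prefix, fn] of tokenizers) {
    if (model.startsWith(prefix)) return fn(text);
  }
  return approxTokenizer(text); // vetted fallback when no exact match exists
}

// Register a model-specific tokenizer (a stub standing in for a real one).
registerTokenizer("gpt-4o", (text) => Math.ceil(text.length / 3.8));

console.log(countTokens("gpt-4o", "Analyze this quarterly report..."));
```

The point of the registry shape is that adding Claude or Gemini support becomes a one-line registration rather than a fork of the counting logic.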
Hard-Won Lessons on Scale
Building this wasn't smooth sailing. My v1 prototype tanked under simulated load—every token event triggered a DB hit, choking our Postgres instance. The fix? Intelligent batching: buffer in memory, flush on thresholds, but update in-memory quotas instantly for real-time enforcement. It slashed overhead by 85% while keeping things snappy for edge deploys on Vercel or Cloudflare.
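A minimal sketch of that buffer-then-flush pattern, heavily simplified (the real batching also has to handle flush intervals, retries, and process shutdown; class and field names here are hypothetical):

```typescript
// Buffer events in memory, write to storage once per batch, but keep
// in-memory per-tenant totals current for real-time quota enforcement.
type UsageEvent = { tenantId: string; tokens: number };

class EventBatcher {
  private buffer: UsageEvent[] = [];
  private quotas = new Map<string, number>(); // live running totals

  constructor(
    private batchSize: number,
    private flush: (events: UsageEvent[]) => void,
  ) {}

  record(event: UsageEvent): void {
    // Update the in-memory quota immediately so enforcement stays real-time.
    const used = this.quotas.get(event.tenantId) ?? 0;
    this.quotas.set(event.tenantId, used + event.tokens);

    // Buffer the event; hit the database only once per batch.
    this.buffer.push(event);
    if (this.buffer.length >= this.batchSize) this.drain();
  }

  usage(tenantId: string): number {
    return this.quotas.get(tenantId) ?? 0;
  }

  private drain(): void {
    const batch = this.buffer.splice(0);
    if (batch.length > 0) this.flush(batch); // one DB write for N events
  }
}

// Example: three events, one flush, and quotas readable the whole time.
const writes: UsageEvent[][] = [];
const batcher = new EventBatcher(3, (events) => writes.push(events));
batcher.record({ tenantId: "o456", tokens: 120 });
batcher.record({ tenantId: "o456", tokens: 80 });
batcher.record({ tenantId: "o789", tokens: 40 });
console.log(batcher.usage("o456"), writes.length); // → 200 1
```

The design choice worth noting: correctness for billing lives in the flushed batches, while the in-memory totals trade a little durability for instant quota checks on the hot path.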
Another eye-opener: AI costs aren't uniform. Input tokens? Predictable, your prompts set the baseline. Outputs? Wild cards, especially in streaming where verbosity varies. One user's verbose essay generator cost us $50 in a day before we added per-request caps. Then there's model evolution — same prompt on GPT-4o today might chew fewer tokens than six months ago due to optimizations. Tracking historical trends became essential, not optional.
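That input/output asymmetry, plus the per-request caps mentioned above, can be sketched like this. The per-token rates and the cap threshold are illustrative placeholders, not real or current prices:

```typescript
// Asymmetric pricing: output tokens typically cost several times more
// than input tokens. Rates below are illustrative, not real prices.
type Pricing = { inputPerToken: number; outputPerToken: number };

const pricing: Record<string, Pricing> = {
  "gpt-4o": { inputPerToken: 0.000005, outputPerToken: 0.000015 },
};

function requestCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = pricing[model];
  if (!p) throw new Error(`no pricing for ${model}`);
  return inputTokens * p.inputPerToken + outputTokens * p.outputPerToken;
}

// Per-request cap check: with streaming, call this as output tokens
// accumulate and abort the stream once the projected spend crosses the cap.
function exceedsCap(
  model: string,
  inputTokens: number,
  outputTokens: number,
  capUsd: number,
): boolean {
  return requestCost(model, inputTokens, outputTokens) > capUsd;
}

console.log(requestCost("gpt-4o", 1000, 2000)); // roughly $0.035 at these rates
console.log(exceedsCap("gpt-4o", 1000, 2000, 0.01)); // over a 1-cent cap
```

Because outputs dominate the bill and are the unpredictable half, capping on projected cost mid-stream is what actually stops the $50-a-day essay generator.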
Through it all, I leaned on real stories from other founders. One shared how unchecked AI usage tanked their runway; another how misattributed costs sparked customer churn. These anecdotes shaped UsagePilot into more than code — a toolkit for turning AI from a liability into a profit center.
Open-Core: Free Foundations, Paid Power-Ups
Sustainability mattered too. Inspired by Supabase and Grafana, I went open-core under MIT: the SDK, storage adapters, and core metering are free forever, fostering community tweaks for new providers. But for SaaS magic? Paid tiers unlock Stripe/Chargebee syncs (usage to invoices in one flow), embeddable tenant dashboards, and enterprise extras like SSO.
It's a win-win: devs prototype freely, businesses scale confidently. Free tier caps at 100k events/month with basic Postgres; Pro ($49/mo) bumps to 5M events plus billing hooks; Enterprise gets custom everything.
Peering Ahead: From Tokens to AI Economics
UsagePilot is still in early development, and I'm planning to push more updates as I get closer to v1. Next up: full Anthropic/Google support, anomaly alerts (spot that rogue user early), and quality-cost correlations (link token spend to output scores for smarter optimizations). The dream? Evolve beyond tokens to meter GPU hours, image gens, or voice processing — whatever the AI future throws at us.
But the real thrill is connecting with other SaaS makers and exchanging ideas on what optimal AI token usage tracking needs to look like. If you're wrestling AI costs in your SaaS, hit me up. Share your war stories — what's your biggest metering headache? The GitHub's open, and I'm building in public, iterating on real feedback. Let's make AI profitable, not painful.
Konstantin is a principal engineer, crafting tools like UsagePilot to tame AI chaos in SaaS. Track the project on GitHub (github.com/usagepilot), snag it via NPM (npmjs.com/package/usagepilot), or follow updates on X (@usagepilot). Got a tale of AI billing woes? DM me — I'm all ears.