Rapls

Posted on Jun 9 • Edited on Jun 25 • Originally published at zenn.dev

Who pays for the tokens? Designing an AI plugin that doesn't break your users' wallets

#ai #webdev #productivity #llm

The biggest drop-off in my AI chatbot plugin wasn't on the feature page or the settings screen. It was right before one sentence: "get an API key and set up billing." People installed it. They activated it. And then, at the point of registering a card with a company they'd never heard of, to open a faucet with no visible price, they left. I only saw it when I compared install counts with the number of chats that actually ran. The gap was a canyon.

The token bill, that invisible faucet, opens on the user's side, not the author's. I build WordPress plugins and ship one with AI in it, and that asymmetry took me a while to see. This post splits cost into two wallets, the side that uses AI (you pay) and the side that ships AI (the user pays), and spends most of its time on designing for the second one, with the guards I actually wrote.

Setup

Prices and plans move fast. Treat everything below as true as of when I checked, and confirm on each vendor's pricing page. The code is a skeleton: fill in the price table, currency formatting, provider branching, and nonce checks on your side.

Using side: a Claude subscription (verify the current price), Codex CLI
Shipping side: OpenRouter, various provider APIs
Target: a self-built WordPress plugin (AI chatbot)

Cost lands in someone's wallet

Until recently, AI tools were a flat monthly fee. That's cracking. GitHub Copilot moved to usage-based billing on June 1, 2026, replacing request counts with credits consumed by input, output, and cached tokens, on the grounds that agentic workloads made the flat model unsustainable.

The general rule under that news is simple. AI compute costs real money, and that money lands in some wallet. A flat fee meant the provider absorbed the swings and showed you a smooth surface. Usage billing hands the faucet back to the user. Solo development is the same: either you pay or the user pays. The cost can't hover in the air.

The using side vs the shipping side

When you pay, you hold the reins. Split flat and metered by use case, chunk long autonomous runs, keep the per-turn baggage light. It scales with how you work, so it stays manageable.

The hard one is the shipping side. Put AI in a product and the user pays while you design. Your own wallet has a natural brake, you don't use what feels expensive, but your brake doesn't reach the user's wallet. The drop-off above is that asymmetry made visible.

There's also a WordPress-specific assumption working against you: people expect a plugin to run for free. Drop "metered charges to an outside AI on every use" into that world and it collides head-on. Users flinch less at the price than at the unfamiliar kind of expense. Saying so up front, and offering a free way to try it first, keeps that collision soft.

Which wallet do you aim at?

There's a fork at the top of the design. Either the user brings their own key (you don't pay, but the initial setup is a wall), or you pay the providers and offer a flat subscription (the experience is smooth, but you carry the token bill and the runaway risk).

The second one is dangerous solo: you take in a fixed amount but the outgoing token cost has no ceiling, so every heavy user widens your loss. That's why I made bring-your-own-key the default, and put the work into making that first step as light as possible. The rest of this is that work.
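A rough sketch of that arithmetic, to make the asymmetry concrete. The fee, the per-token prices, and the usage figures are all made-up assumptions, and the function names are hypothetical; plug in your own numbers.

```php
<?php
// All numbers here are illustrative assumptions, not real vendor prices.

/** Token bill for one user's month. Prices are per 1M tokens. */
function flat_plan_token_cost( int $input_tokens, int $output_tokens, float $in_price, float $out_price ): float {
    return $input_tokens / 1000000 * $in_price
         + $output_tokens / 1000000 * $out_price;
}

/** What is left of the flat fee after the token bill (negative means a loss). */
function flat_plan_margin( float $monthly_fee, float $token_cost ): float {
    return $monthly_fee - $token_cost;
}

$fee       = 10.0; // assumed flat monthly fee
$in_price  = 3.0;  // assumed price per 1M input tokens
$out_price = 15.0; // assumed price per 1M output tokens

// Light user: 200k input tokens, 50k output tokens in a month.
$light = flat_plan_margin( $fee, flat_plan_token_cost( 200000, 50000, $in_price, $out_price ) );
// Heavy user: 5M input tokens, 1M output tokens in a month.
$heavy = flat_plan_margin( $fee, flat_plan_token_cost( 5000000, 1000000, $in_price, $out_price ) );

printf( "light user margin: %.2f\n", $light ); // prints 8.65
printf( "heavy user margin: %.2f\n", $heavy ); // prints -20.00
```

Under these assumptions the fee is fixed at 10 while the cost side ranges from 1.35 to 30, which is the "no ceiling" problem in numbers.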

Designing so the user's wallet survives

First, the skeleton for handling one request. Placement matters: the limits and the output cap have to run before the call to prevent spend, while usage can only be recorded after it.

function rapls_chat_handle( $user_id, $message ) {
    // 1. before the call: caps on count and interval
    $gate = rapls_chat_check_limits( $user_id );
    if ( is_wp_error( $gate ) ) {
        return $gate; // show "limit reached" to the user
    }
    // 2. pick a model by weight (a user's explicit choice wins)
    $model = rapls_chat_pick_model( $user_id, $message );
    // 3. cap the output before calling
    $res = rapls_chat_call_api( $model, $message, array( 'max_tokens' => 512 ) );
    if ( is_wp_error( $res ) ) {
        return $res; // assumes failures come back as WP_Error; nothing to record
    }
    // 4. after the call: record usage (for the meter and the caps)
    rapls_chat_record_usage( $user_id, $model, $res['usage'] ?? array() );
    return $res;
}

A free way to try it

This helped most. Before any card, let them see one chat run. I use OpenRouter's free tier for onboarding so the key-and-card step can be skipped at first. Once they've seen it work, they can think about a key for real use.

A free tier isn't a foundation, though. It has rate and speed limits and the terms can change on the provider's whim. Treat it as a "try once" entrance, and show the path to their own key from the start. A design that leans on the free tier stops working the day that tier changes.
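One way to keep the free tier as an entrance rather than a foundation is to decide the access mode in a single place and always attach the "add your own key" message to trial responses. This is a minimal sketch: `rapls_chat_resolve_access`, the mode labels, and the trial limit of 5 are hypothetical, not part of the plugin code shown in this post.

```php
<?php
/**
 * Decide which credentials a request should run on.
 *
 * @param string $user_key    The user's saved API key ('' if none yet).
 * @param int    $trial_used  Free trial chats this user has already run.
 * @param int    $trial_limit Assumed cap on free trial chats.
 * @return array 'mode' is own_key, free_trial, or needs_key; 'notice' is shown to the user.
 */
function rapls_chat_resolve_access( string $user_key, int $trial_used, int $trial_limit = 5 ): array {
    // A saved key always wins: the free tier is never the default for real use.
    if ( '' !== trim( $user_key ) ) {
        return array( 'mode' => 'own_key', 'notice' => '' );
    }
    // No key yet: allow a few trial chats, and show the path to a key from the first one.
    if ( $trial_used < $trial_limit ) {
        return array(
            'mode'   => 'free_trial',
            'notice' => sprintf(
                'Trial mode. Free chats left: %d. Add your own API key for regular use.',
                $trial_limit - $trial_used
            ),
        );
    }
    // Trial used up: stop before the call instead of leaning on the free tier.
    return array(
        'mode'   => 'needs_key',
        'notice' => 'The free trial is used up. Add your own API key to continue.',
    );
}
```

If the free tier changes or disappears, only the `free_trial` branch has to go; users with their own key never touch it.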

Caps that stop runaways by design

A daily ceiling caps the total, and a minimum interval stops rapid-fire and error loops. The interval guard matters most: the worst case, calls looping forever while nobody is watching, is mostly stopped by this one check.

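function rapls_chat_check_limits( $user_id, $daily = 100, $min_interval = 2 ) {
    // Per-user keys; the daily key carries the UTC date so it rolls over on its own.
    $today = 'rapls_chat_count_' . $user_id . '_' . gmdate( 'Ymd' );
    $last  = 'rapls_chat_last_'  . $user_id;

    // Interval guard: a short-lived marker that blocks rapid-fire calls and error loops.
    if ( get_transient( $last ) ) {
        return new WP_Error( 'too_fast', 'Too many requests. Please wait a moment.' );
    }
    set_transient( $last, 1, $min_interval ); // expires after $min_interval seconds

    // Daily ceiling. Read-then-write is not atomic, so treat the count as approximate
    // under concurrent requests.
    $count = (int) get_transient( $today );
    if ( $count >= $daily ) {
        return new WP_Error( 'daily_limit', 'You have reached today\'s limit.' );
    }
    set_transient( $today, $count + 1, DAY_IN_SECONDS );
    return true;
}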

Runaways happen from a plain config mistake or an error loop, not only from bad intent. This isn't about trusting users; accidents happen in good faith, so you close the path in the design.

Model tiering: take it cheap, escalate only when needed

The top model is overkill for a simple question. Let a user's explicit choice win, otherwise route by the weight of the request, and escalate once if the answer comes back weak.

function rapls_chat_pick_model( $user_id, $message ) {
    $chosen = get_user_meta( $user_id, 'rapls_chat_model', true );
    if ( $chosen ) {
        return $chosen; // the user keeps the reins on their wallet
    }
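    // Short messages without "explain"-type keywords go to the cheap model.
    // The leading \b keeps "how" from matching inside words like "show".
    $is_simple = mb_strlen( $message ) < 40
        && ! preg_match( '/\b(?:why|reason|compare|detail|how)/i', $message );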
    return $is_simple ? 'cheap-model' : 'strong-model';
}

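function rapls_chat_answer( $user_id, $message, $options = array() ) {
    // An explicit request for depth goes straight to the strong model.
    if ( preg_match( '/in detail|explain more|longer/i', $message ) ) {
        return rapls_chat_call_api( 'strong-model', $message, $options );
    }
    $res = rapls_chat_call_api( 'cheap-model', $message, $options );
    if ( is_wp_error( $res ) ) {
        return $res; // assumes failures come back as WP_Error
    }
    if ( rapls_chat_looks_weak( $res['text'] ?? '' ) ) {
        // The cheap call was still billed, so record it before escalating.
        rapls_chat_record_usage( $user_id, 'cheap-model', $res['usage'] ?? array() );
        return rapls_chat_call_api( 'strong-model', $message, $options ); // once only
    }
    return $res;
}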

A caveat: when escalation fires, that request runs both the cheap and the strong model, which can double its cost. Limit the retry to one, count both calls against the cap, and keep the escalation condition strict. Take it cheap, raise it only when you must.
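The code above calls `rapls_chat_looks_weak` without defining it. A deliberately strict sketch might look like the following; the phrase list is an assumption to tune against your own logs, and it only covers ASCII apostrophes and English replies.

```php
<?php
/**
 * Guess whether a reply is too weak to show, so it is worth one retry
 * on the stronger model. Kept strict on purpose: every "true" here
 * means paying for a second call.
 */
function rapls_chat_looks_weak( string $text ): bool {
    $text = trim( $text );
    if ( '' === $text ) {
        return true; // nothing came back at all
    }
    // Assumed give-up phrases. A short reply such as "Yes." is NOT treated
    // as weak, because short is often exactly what a simple question needs.
    return (bool) preg_match(
        '/\b(?:i(?:\'m| am) not sure|i don\'t know|i can(?:not|\'t))\b/i',
        $text
    );
}
```

Length alone is left out as a signal on purpose: escalating every brief answer would undo the savings from routing simple questions to the cheap model.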

Send fewer tokens, in and out

The fixed system prompt is the same every time, so cache it if your provider supports it and only send the changing question. Output tokens often cost more than input, so cap the response and steer it toward being concise. Short and to the point is better for the wallet and for the chat. Keep the per-provider differences (endpoint, auth, the shape of the cache directive) inside rapls_chat_call_api so the upstream code doesn't have to care.

function rapls_chat_call_api( $model, $message, $options = array() ) {
    $provider = rapls_chat_provider_of( $model );
    $system   = rapls_chat_system_prompt(); // fixed persona, same each time

    $body = array(
        'model'      => $model,
        'max_tokens' => $options['max_tokens'] ?? 512,
        'messages'   => array(
            array( 'role' => 'system', 'content' => $system ),
            array( 'role' => 'user',   'content' => $message ),
        ),
    );
    if ( rapls_chat_supports_cache( $provider ) ) {
        $body['messages'][0]['cache_control'] = array( 'type' => 'ephemeral' );
    }
    $res = wp_remote_post( rapls_chat_endpoint( $provider ), array(
        'headers' => rapls_chat_auth_headers( $provider ),
        'body'    => wp_json_encode( $body ),
        'timeout' => 30,
    ) );
    return rapls_chat_parse_response( $provider, $res );
}

The shape of the cache directive, the endpoint, and the auth all differ by provider, and the example above leans on one vendor's style. Real code needs per-provider branching, and the specs shift, so check the current docs. A single endpoint that fronts many providers, like OpenRouter, thins that branching out and pairs well with the free-tier onboarding.
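The response side differs too, and the usage recorder further down expects `input_tokens` and `output_tokens`. Here is a sketch of the normalizing step that `rapls_chat_parse_response` could delegate to. The helper name and the `$style` labels are hypothetical; the field names follow the OpenAI-compatible and Anthropic response shapes as I understand them, so verify them against current docs.

```php
<?php
/**
 * Map a decoded JSON response body onto one internal shape:
 * array( 'text' => string, 'usage' => array( 'input_tokens', 'output_tokens' ) ).
 *
 * @param string $style 'anthropic' or anything else for OpenAI-compatible APIs.
 * @param array  $body  Result of json_decode( $raw, true ).
 */
function rapls_chat_normalize_response( string $style, array $body ): array {
    $usage = $body['usage'] ?? array();

    if ( 'anthropic' === $style ) {
        // Anthropic-style: text lives in content blocks, usage is input/output.
        return array(
            'text'  => $body['content'][0]['text'] ?? '',
            'usage' => array(
                'input_tokens'  => (int) ( $usage['input_tokens'] ?? 0 ),
                'output_tokens' => (int) ( $usage['output_tokens'] ?? 0 ),
            ),
        );
    }

    // OpenAI-compatible (including aggregators that mirror it): usage is
    // reported as prompt/completion tokens, so rename it here.
    return array(
        'text'  => $body['choices'][0]['message']['content'] ?? '',
        'usage' => array(
            'input_tokens'  => (int) ( $usage['prompt_tokens'] ?? 0 ),
            'output_tokens' => (int) ( $usage['completion_tokens'] ?? 0 ),
        ),
    );
}
```

Missing fields fall back to an empty string and zero counts, so a malformed response under-reports usage rather than crashing the meter.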

Transparency: turn the invisible faucet into a visible one

Multiply the recorded usage by a price table to get a rough number, and show it. First the estimate, then the monthly accumulation.

// Prices per 1M tokens, in the provider's billing currency (fill in from the price table).
const RAPLS_CHAT_PRICE = array(
    'cheap-model'  => array( 'in' => 0.0, 'out' => 0.0 ),
    'strong-model' => array( 'in' => 0.0, 'out' => 0.0 ),
);

function rapls_chat_estimate_cost( $model, $usage ) {
    $p   = RAPLS_CHAT_PRICE[ $model ] ?? array( 'in' => 0, 'out' => 0 );
    $in  = ( $usage['input_tokens']  ?? 0 ) / 1000000 * $p['in'];
    $out = ( $usage['output_tokens'] ?? 0 ) / 1000000 * $p['out'];
    return $in + $out;
}

function rapls_chat_record_usage( $user_id, $model, $usage ) {
    $cost = rapls_chat_estimate_cost( $model, $usage );
    $key  = 'rapls_chat_usage_' . gmdate( 'Ym' );

    $stats = get_user_meta( $user_id, $key, true );
    if ( ! is_array( $stats ) ) {
        $stats = array( 'calls' => 0, 'in' => 0, 'out' => 0, 'cost' => 0.0 );
    }
    $stats['calls'] += 1;
    $stats['in']    += $usage['input_tokens']  ?? 0;
    $stats['out']   += $usage['output_tokens'] ?? 0;
    $stats['cost']  += $cost;
    update_user_meta( $user_id, $key, $stats );
}

Show that on the user's profile screen next to the model selector, and they can adjust for themselves. The estimate won't match the real bill, so label it as an estimate. Even so, seeing the count and a rough figure cuts the anxiety a lot, because the anxiety was never the amount, it was not knowing.
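As a sketch of the display side, the monthly stats array can be turned into one labelled line. `rapls_chat_format_meter`, the wording, and the default currency code are assumptions for illustration; in a real plugin the string would also go through translation and escaping.

```php
<?php
/**
 * Turn the monthly stats array (calls, in, out, cost) into a single
 * line for the profile screen, always labelled as an estimate.
 */
function rapls_chat_format_meter( array $stats, string $currency = 'USD' ): string {
    $calls = (int) ( $stats['calls'] ?? 0 );
    $cost  = (float) ( $stats['cost'] ?? 0.0 );

    return sprintf(
        'This month: %d chats, about %s %s (estimate, not your actual bill)',
        $calls,
        number_format( $cost, 2, '.', ',' ),
        $currency
    );
}

// Example with made-up numbers:
echo rapls_chat_format_meter(
    array( 'calls' => 42, 'in' => 120000, 'out' => 30000, 'cost' => 0.8137 )
), "\n";
// prints: This month: 42 chats, about 0.81 USD (estimate, not your actual bill)
```

Putting the word "estimate" in the string itself, rather than in nearby help text, keeps the label attached wherever the line is reused.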

Mistakes I made

Assuming good work means people will pay. The wall isn't paying, it's not knowing how much.
Defaulting to the top model. From the user's side, that's quietly opening the priciest faucet all the way.
Shipping without caps. Your own wallet stops on instinct; your instinct doesn't reach the user's.
Hoarding the free entrance. If they stall at the door, there's no revenue to protect anyway.
Thinking longer answers are kinder. Long replies cost more, take longer to read, and feel verbose in a chat.

Every one of these came from designing the shipping side with a using-side mindset.

A note to my next self

The token bill always lands in some wallet. When you pay, you hold the reins; when you ship, you take on the twist of the user paying while you design. Decide which wallet you aim at first. Bring-your-own-key means putting the work into the entrance; author-pays means defending caps and pricing. Then the free entrance, choosable models, tiering, caps, and transparency. All of it is a way to remember that past the faucet you don't pay for, there's someone else's wallet.

The visible meter on the user's own screen is still on my list. There's always a wallet on the other side of the faucet. That's the part I don't want to forget.

References

GitHub Copilot is moving to usage-based billing - The GitHub Blog

Originally written in Japanese on Zenn. I build WordPress plugins.

Disclaimer: The experiences and decisions in this post are my own. English isn't my first language, so I use an AI assistant to help draft and edit the writing.

Top comments (2)

Max Quimby • Jun 13

The fork you describe — BYO-key wall vs you-eat-the-bill — is exactly where most of us get stuck, and I think the most underrated option is the hybrid: front a small metered allowance so the first run "just works," then hand the faucet over once they've felt the value. The setup wall isn't really about the card; it's about asking someone to commit before they've seen anything work. A few hundred tokens "on the house" turns that cold ask into a warm one. On the flat-subscription side, the thing that's bitten me is that the runaway risk is rarely the average user — it's the 1% who automate against you. Per-user hard ceilings plus a circuit breaker that degrades to a cheaper model instead of erroring keeps one bad actor from torching the unit economics for everyone. Curious how you're guarding the flat side — fixed monthly cap per user, or something more adaptive?

Rapls • Jun 14

You're right that the wall is the cold ask, not the card. The hybrid is the strongest version of the free entrance I described, and "a few hundred tokens on the house" is a better way to put it than I managed.

On guarding the flat side: honestly, I dodged it. I made BYO-key the default precisely because a fixed monthly fee with no ceiling on outgoing tokens scared me as a solo dev. But if I did run flat, I'd do close to what you said: a per-user hard ceiling, plus your circuit breaker that degrades to a cheaper model instead of erroring. Degrade-not-fail is the part I hadn't framed clearly, and it's better than a hard stop because the user keeps working while the unit economics stay safe.

The 1% who automate against you is the real shape of the risk. My interval cap was aimed there without me saying so. Adaptive is where I'd want to go next: tighten ceilings only for accounts whose usage curve looks automated, and leave everyone else alone.