Jijo Bose

Posted on May 26 • Originally published at jijobose.com

Building an Error Monitoring Tool Without Pricing Overages

#postgres #rails #saas #monitoring

The bill that arrives after the incident

Picture the worst version of a Tuesday. You ship a deploy, a downstream API starts timing out, and your retry logic turns one failure into forty. A single broken code path is now throwing the same exception in a hot loop. By the time you have rolled back, your app has emitted two million error events in ninety minutes.

Your error monitoring tool ingested every single one of them. It was very good at its job.

Then, a few days later, the second incident arrives: an invoice. The $26/month plan you signed up for has quietly become a $390 bill, because you blew through your included event volume and the meter kept running at some fraction of a cent per event. Nobody asked you. Nobody could ask you, because the events arrived faster than any human could approve them.

This is the part of usage-based monitoring that I find genuinely backwards. The pricing is anti-correlated with your wellbeing. The tool charges you the most at the exact moment you are already having your worst day. A traffic spike, a bad release, a noisy dependency, a retry storm: every one of these is both an operational emergency and a billing event. The product that is supposed to help you through the incident is, at the same time, metering you for the privilege.

I am building an error tracker called ErrSight, and early on I decided it would not work this way. No overage charges. Not "low overage charges," not "overage charges with a generous buffer." None. The ceiling on your plan is a real ceiling, not the starting line for a surprise invoice.

That turns out to be a more interesting engineering problem than it sounds, so let me walk through how it actually works.

Overage billing is a choice, not a law of physics

Before the architecture, it is worth being honest about why overage pricing is so common. It is not because it is the only way. It is because it is the easiest and most profitable way.

It is the easiest because the implementation is trivial: count what arrives, multiply by a rate, send the total at the end of the month. You never have to make a decision in the hot path. You never have to tell a customer "no." You just let everything in and reconcile later.

It is the most profitable because the meter runs before the bill arrives. By the time the customer sees the number, the spend already happened. They cannot decline it. The asymmetry is the entire business model.

Once you see it that way, "no overages" stops being a pricing gimmick and becomes a design constraint. It means moving the decision to the front, into the request path, while the customer can still be protected. Concretely, I wrote down three rules:

1. When an account is out of quota, stop ingesting and say so clearly. Do not silently accept the data and invoice for it.
2. Make it architecturally impossible to overshoot the cap, even under a burst of concurrent requests, because bursts are exactly when this matters.
3. If a customer genuinely needs more capacity, make getting it a deliberate, opt-in decision with a known price, not a default that happens to them.

Everything below is in service of those three rules.

Mechanic 1: stop, do not bill

The ingestion endpoint is the front door. Before it does any real work, it asks one question: is this project allowed to ingest right now? The answer comes from a single method that collapses every "no" reason into one place.

def drop_reason
  return "ingestion_paused"        if ingestion_paused?
  return "events_over_limit"       if organization.over_events_limit?
  return "storage_limit_exceeded"  if storage_limit_exceeded?
  nil
end

If there is a reason to drop, the controller returns an HTTP 429 Too Many Requests with a machine-readable code, and that is the end of it.

when "events_over_limit"
  notify_once(@project.organization_id, "events")   # one email, debounced
  render json: {
    error: "Monthly event limit reached",
    code:  "EVENTS_LIMIT_EXCEEDED"
  }, status: :too_many_requests

Notice what is not here. There is no branch that says "over limit, so accept the event and tack it onto the overage counter." Going over your limit is a 429, not a bigger invoice. The client SDK receives a clear, documented code (EVENTS_LIMIT_EXCEEDED) and can back off, buffer, or surface a warning in your own dashboards. The signal is honest: your data is being dropped, here is exactly why, and your bill is not moving.
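On the client side, reacting to that response might look like the sketch below. This is not the ErrSight SDK: `ingest_decision` and the symbols it returns are hypothetical names chosen for illustration. Only the 429 status and the EVENTS_LIMIT_EXCEEDED code come from the controller above.

```ruby
require "json"

# Hypothetical helper: decide what a client should do with an ingestion
# response. Only the 429 status and EVENTS_LIMIT_EXCEEDED come from the post.
def ingest_decision(status, body)
  return :delivered if status.between?(200, 299)
  return :retry_later unless status == 429

  code = begin
    JSON.parse(body)["code"]
  rescue JSON::ParserError, TypeError
    nil # unreadable body: treat it as a generic, retryable 429
  end

  # Quota exhaustion will not clear on a short retry, so stop sending and
  # surface a warning instead of hammering the endpoint.
  code == "EVENTS_LIMIT_EXCEEDED" ? :pause_and_warn : :retry_later
end
```

The design point is that the machine-readable code, not the status alone, tells the client whether retrying soon is worthwhile.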

The customer also gets one email when they hit the wall. Exactly one. A client hammering the endpoint while over quota could otherwise enqueue thousands of identical "you are over your limit" notifications per minute, so the notification is debounced with an atomic cache write. Only the first caller in a one-hour window wins the write and enqueues the email, which assumes a shared cache store (such as Redis or Memcached) where unless_exist is applied atomically:

def notify_once(org_id, kind)
  key = "quota_notified:#{org_id}:#{kind}"
  if Rails.cache.write(key, true, unless_exist: true, expires_in: 1.hour)
    NotifyQuotaOverageJob.perform_later(org_id, kind)
  end
end

The customer gets told. The customer does not get charged.
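To make the "first caller wins" semantics concrete, here is a small in-memory stand-in for a write-if-absent cache with a TTL. `OnceStore` and its injectable `clock` are hypothetical and exist only for this illustration. A real deployment needs a store shared by every worker, which this single-process Hash is not.

```ruby
# In-memory stand-in for a cache that supports "write only if absent".
# Hypothetical class for illustration; not a replacement for a shared store.
class OnceStore
  def initialize(clock: -> { Time.now.to_f })
    @clock   = clock
    @expires = {}
    @mutex   = Mutex.new
  end

  # Returns true only for the first caller per key within the TTL window.
  def write_unless_exist(key, ttl:)
    @mutex.synchronize do
      now    = @clock.call
      expiry = @expires[key]
      next false if expiry && expiry > now   # someone already won this window
      @expires[key] = now + ttl
      true
    end
  end
end
```

A thousand calls inside the same hour yield a single `true`, so only one job would be enqueued. Once the TTL lapses, the next caller wins again.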

Mechanic 2: you literally cannot overshoot the cap

Here is the part that took the most care, and it is the reason "no overages" is harder to build than overage billing.

The naive version of a quota check has a race condition that bursts will find immediately:

# WRONG: two concurrent requests can both pass this check
if organization.total_events_this_month + count <= organization.events_limit
  accept(events)
end

Imagine a project sitting at 49,950 events against a 50,000 limit, so there are 50 events of real headroom left. Two batches of 40 events arrive at the same millisecond, handled by two different Puma workers, possibly on two different replicas. Each batch on its own fits comfortably. But both workers read the same starting count of 49,950, both compute 49,950 + 40 = 49,990, both see that as under the limit, and both commit. The project lands at 50,030. The two batches were 80 events against 50 events of headroom, and the cap leaked by 30. Multiply that by a real burst across many workers and your "hard" limit leaks by thousands of events. Each leaked event is either free (you eat the cost) or billed (the customer eats it). There is no version of the leak that is fair.
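The interleaving can be replayed deterministically without any threads. The helper below is a hypothetical sketch (`replay_stale_reads` is not part of ErrSight) that models the worst case where every worker reads the counter before any worker writes.

```ruby
# Replays the unsafe check when every worker reads the same starting count.
# Hypothetical helper, for illustration only.
def replay_stale_reads(start:, limit:, batches:)
  # Each check uses `start`, because nobody has committed yet.
  accepted = batches.select { |count| start + count <= limit }
  # All accepted writes still land on the shared counter.
  start + accepted.sum
end

final = replay_stale_reads(start: 49_950, limit: 50_000, batches: [40, 40])
puts "final=#{final} overshoot=#{final - 50_000}"   # prints "final=50030 overshoot=30"
```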

A guarantee has to actually be a guarantee, so the reservation happens inside a transaction, serialized by a Postgres advisory lock keyed to the organization and billing period:

def reserve_events!(count:)
  org      = organization
  month    = org.quota_period_start
  lock_key = Zlib.crc32("errsight:quota:#{org.id}:#{month}") % 2**31

  transaction do
    # Every project in this org/period serializes through this lock,
    # so two concurrent bursts cannot both read "under limit" and
    # both commit. The lock releases automatically at transaction end.
    connection.execute("SELECT pg_advisory_xact_lock(#{lock_key})")

    current = Usage.where(organization_id: org.id, month: month).sum(:events_count)
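    # `next` ends the block normally; how a non-local `return` out of a
    # transaction block is treated has varied across Rails versions.
    next false if current + count > org.events_limit   # the ceiling holds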

    # reserve `count` against this month's usage, then return true
    bump_usage!(count)
    true
  end
end

pg_advisory_xact_lock gives me a mutex that lives in the database, not in any single Ruby process, which is the only place it can live if the limit is going to hold across many workers and replicas. Two bursts hitting the same account at the same instant now line up behind the lock. The first one reserves its quota and commits. The second one reads the post-commit total, sees there is no room, and gets false. The controller turns that false into a 429. The cap is exact, even at the millisecond boundary, even during the spike that an overage model would have cashed in on.
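The same lock-and-reserve shape can be tried without a database. In this sketch a Ruby `Mutex` stands in for the advisory lock, and `QuotaLedger` and `quota_lock_key` are hypothetical names. The Mutex only protects threads inside one process, so it demonstrates the pattern rather than replacing the Postgres lock.

```ruby
require "zlib"

# Same key derivation as in the post: a stable integer per org and period.
def quota_lock_key(org_id, month)
  Zlib.crc32("errsight:quota:#{org_id}:#{month}") % 2**31
end

# In-process analogue of lock-and-reserve. The Mutex plays the role of the
# advisory lock; it does NOT hold across workers or replicas.
class QuotaLedger
  attr_reader :used

  def initialize(limit:, used: 0)
    @limit = limit
    @used  = used
    @lock  = Mutex.new
  end

  def reserve(count)
    @lock.synchronize do
      next false if @used + count > @limit   # the ceiling holds
      @used += count
      true
    end
  end
end

ledger  = QuotaLedger.new(limit: 50_000, used: 49_950)
results = Array.new(8) { Thread.new { ledger.reserve(40) } }.map(&:value)
puts "accepted=#{results.count(true)} used=#{ledger.used}"   # prints "accepted=1 used=49990"
```

Whichever thread gets the lock first, exactly one batch of 40 fits in the remaining 50 events, so the result does not depend on scheduling.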

This is the trade at the heart of "no overages." Overage billing never needs this lock, because it never needs to say no. Choosing to say no means choosing to build the machinery that can say it correctly under load.

Mechanic 3: more capacity is a decision, not an accident

A hard cap with no escape hatch is just a worse product. The point is not to punish growth, it is to make growth a choice the customer makes on purpose, with the price known in advance.

So the limit a project is actually checked against is never just the plan limit. It is the plan limit plus any capacity the customer has deliberately added:

def events_limit
  plan_record.events_limit + active_pack_event_credit
end

def active_pack_event_credit(at: Time.current)
  purchased_packs.where(status: "active")
                 .where("expires_at > ?", at)
                 .sum(:events_credit)
end

There are two ways to add capacity, and both are opt-in:

1. Upgrade the plan. The tiers are flat monthly prices with included volume: Free is 5,000 events a month, Pro is $29 for 50,000, Growth is $79 for 200,000, Business is $199 for 750,000. You always know what the next step costs before you take it.
2. Buy an add-on pack. If you are mostly fine but had one heavy month, a $9 pack adds 50,000 events and 2 GB of storage on a 30-day rolling window. It is a one-time purchase, not a recurring commitment, and it stacks if you need a few.

The crucial difference from an overage line item is when the decision happens. An overage charge is a decision the system makes for you, after the spend, that you discover on an invoice. A pack or an upgrade is a decision you make for yourself, before the spend, at a price you agreed to. Same outcome of "you needed more and you paid for more," opposite relationship with the customer.
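As a worked example of the limit arithmetic, here is a plain-Ruby mirror of `events_limit` that needs no database. The `Pack` struct, `effective_events_limit`, and the dates are hypothetical. The 50,000-event Pro allowance and the 50,000-event pack are the figures quoted above.

```ruby
# Hypothetical plain-Ruby mirror of events_limit: plan allowance plus the
# credit from packs that are active and not yet expired.
Pack = Struct.new(:status, :expires_at, :events_credit, keyword_init: true)

def effective_events_limit(plan_limit, packs, at: Time.now)
  active = packs.select { |p| p.status == "active" && p.expires_at > at }
  plan_limit + active.sum(&:events_credit)
end

now   = Time.utc(2025, 6, 15)   # illustrative date
packs = [
  Pack.new(status: "active", expires_at: Time.utc(2025, 7, 1), events_credit: 50_000),
  Pack.new(status: "active", expires_at: Time.utc(2025, 6, 1), events_credit: 50_000) # already expired
]
puts effective_events_limit(50_000, packs, at: now)   # prints 100000
```

An expired pack simply stops counting, so the ceiling drops back to the plan limit without any billing event.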

A second dial: capping the burn rate, not just the total

A hard monthly ceiling solves the billing problem, but go back to the retry storm from the top of this post: two million events in ninety minutes. Even with overage charges off the table, a spike like that can burn through an entire month of quota before lunch, and then ingestion is capped for the rest of the month and you are flying blind through the part of the incident that matters most. A ceiling on the total is not the same as a ceiling on the rate.

So every project also has a per-minute rate limit, and on paid plans the customer sets it themselves. The plan defines the maximum you are allowed to choose, and you pick any number underneath it. It is enforced by a fixed-window limiter that lives in Postgres rather than in process memory, because a per-worker counter cannot hold a real limit once you are running several Puma workers across replicas:

rate = IngestionRateLimiter.check!(@project, count: events_data.length)
unless rate.allowed
  response.headers["Retry-After"]       = rate.retry_after.to_s
  response.headers["X-RateLimit-Limit"] = rate.limit.to_s
  return render json: {
    error:       "Rate limit exceeded, retry in #{rate.retry_after}s",
    code:        "RATE_LIMIT_EXCEEDED",
    retry_after: rate.retry_after
  }, status: :too_many_requests
end

Now a runaway loop can spend at most the configured number of events per minute. The bad deploy still hurts, but it cannot vaporize your whole month in the first ninety minutes, and the Retry-After header tells a well-behaved SDK exactly how long to back off. The customer ends up inside two ceilings at once: the monthly total they are billed against, and the per-minute rate they chose. As a side effect, it also shields my ingestion path from a single misbehaving client, which is the first thing standing between a customer's spike and my own infrastructure bill. Which brings me to the cost side.
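The window arithmetic behind a fixed-window limiter is short enough to sketch. This version is an assumption-laden illustration: `FixedWindowLimiter` is a hypothetical class that keeps counters in a Hash, so unlike the Postgres-backed limiter described above it only holds within one process.

```ruby
# Minimal in-memory fixed-window limiter. Illustrates the window arithmetic
# only; a real limiter needs storage shared by all workers.
class FixedWindowLimiter
  Result = Struct.new(:allowed, :limit, :retry_after)

  def initialize(limit_per_minute:, clock: -> { Time.now.to_i })
    @limit  = limit_per_minute
    @clock  = clock
    @counts = Hash.new(0)
    @mutex  = Mutex.new
  end

  def check(project_id, count:)
    @mutex.synchronize do
      now    = @clock.call
      window = now - (now % 60)                       # start of the current minute
      @counts.delete_if { |(_pid, w), _n| w < window } # forget finished windows
      key = [project_id, window]
      if @counts[key] + count > @limit
        # Reject the whole batch and report seconds until the next window.
        Result.new(false, @limit, window + 60 - now)
      else
        @counts[key] += count
        Result.new(true, @limit, 0)
      end
    end
  end
end
```

A fixed window is the simplest choice and can admit up to twice the limit across a window boundary. A sliding window or token bucket would smooth that at the cost of more state.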

The economics that let me say yes to this

There is a reason a lot of founders would call "no overages" financially reckless, and they would be right if you ignore the cost side. If your own costs scale linearly and without bound, then capping the customer's bill while your infrastructure bill runs free is a great way to go broke on your most successful day. "No overages" only works if you have first made your costs predictable.

For an error tracker, the dominant cost driver is storage. Error events are write-heavy, append-mostly, time-ordered, and they pile up fast. So the events table is a TimescaleDB hypertable partitioned on time, with columnar compression that kicks in automatically after a week:

SELECT create_hypertable('events', 'occurred_at', migrate_data => true);

ALTER TABLE events SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'project_id',
  timescaledb.compress_orderby   = 'occurred_at DESC, id'
);

-- Compress any chunk older than 7 days.
SELECT add_compression_policy('events', INTERVAL '7 days');

Segmenting by project_id and ordering by time means the recent, hot data stays fast to query for the dashboard, while everything older than a week gets squeezed into compressed columnar chunks. Error events compress extremely well, because they are full of repeating values: the same fingerprints, the same stack frames, the same environment strings, over and over. That repetition is exactly what columnar compression eats for breakfast.

The second lever is retention. Every plan has a retention window (7 days on Free, up to 90 on the higher tiers), and a background job prunes anything past it and re-derives usage so the numbers stay honest:

cutoff = org.retention_days.days.ago
count, bytes = EventRepository.prune_older_than!(project_id: id, cutoff: cutoff)

Compression bounds the cost of the data you keep. Retention bounds how much data you keep at all. Together they turn storage from an unbounded liability into a known, modeled number per plan. Once I can predict my cost per account, I can confidently promise a fixed price to the account.
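One way to turn that into a number is a back-of-envelope ceiling per account. The sketch below is an assumption, not the author's cost model: `max_retained_events` and `worst_case_bytes` are hypothetical helpers, periods are treated as roughly 30 days, and `avg_event_bytes` and `compression_ratio` are placeholders to be measured on real data.

```ruby
# Rough ceiling on how many events one account can have on disk at once.
# Assumes the monthly cap is enforced and periods are about 30 days long;
# a retention window can straddle one more period than it spans.
def max_retained_events(events_limit:, retention_days:, period_days: 30)
  periods = (retention_days.to_f / period_days).ceil + 1
  events_limit * periods
end

# Converts that ceiling to bytes. avg_event_bytes and compression_ratio are
# placeholders you would measure; neither number comes from the post.
def worst_case_bytes(events:, avg_event_bytes:, compression_ratio: 1.0)
  (events * avg_event_bytes / compression_ratio).ceil
end

# Free plan from the post: 5,000 events a month, 7-day retention.
puts max_retained_events(events_limit: 5_000, retention_days: 7)   # prints 10000
```

The point is that both inputs to the ceiling, the event cap and the retention window, are fixed by the plan, so the bound does not move during a spike.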

I extended the same logic to hosting. The app runs on a platform with a hard spending cap and per-second billing, which suits a workload that is quiet most of the time and spiky during incidents. I am not going to ask customers to live with a predictable bill while I refuse to give myself one. The predictability has to go all the way down, or the promise on the pricing page is just optimism.

What this costs me, honestly

I want to be straight about the trade-offs, because "no overages" is not free for the person offering it.

I leave money on the table. Every overage charge I do not send is revenue I did not collect. The spiky months that would have been the most lucrative under metered billing are exactly the months I am choosing to cap. That is real money, and pretending otherwise would be dishonest.

Hitting the wall is a worse short-term outcome for the customer than being billed silently. Dropping events during an incident is a genuinely bad moment. I mitigate it with clear 429 codes, an immediate email, and one-click add-on packs, but the honest version is that a hard cap can bite. The bet is that being told "you are out of room, here is the button" is more respectful than being billed for data you never agreed to pay for, and that developers, of all customers, would rather have the explicit signal.

I had to build the hard version. The advisory lock, the atomic reservation, the debounced notifications, the usage reconciliation after pruning: none of that exists in a system that just counts and multiplies at month end. Saying "no" correctly is more code than never saying it at all.

I think it is worth every bit of that, because of one rule of thumb I keep coming back to:

Your billing model should never be correlated with your customer's worst day.

If the only way your pricing makes its best money is by charging customers more during their outages, their spikes, and their emergencies, then your incentives are quietly pointed away from theirs. I would rather have a model where my best day and my customer's calm month are the same thing, and where their disaster does not show up as a line item on my invoice to them.

Wrapping up

"No overages" sounded like a marketing decision when I started. It turned into an architecture: a hard quota ceiling enforced by a database-level lock so it cannot leak under load, an honest 429 instead of a silent meter, opt-in capacity for the people who genuinely need it, and TimescaleDB compression plus retention to keep my own costs bounded enough that I can afford the promise.

If you have built quota or billing systems that try to stay on the customer's side, I would love to hear how you handled the boundary cases. The lock-and-reserve pattern is the cleanest answer I found, but I doubt it is the only one.

If you would rather see the result than the plumbing, the tool is live at errsight.com.
