How We Score Tools: The Rubric Behind Every pickuma Review


Every review on this site ends with a number, and a number with no method behind it is just a vibe wearing a lab coat. So here is the method. This is the rubric we run each tool through before it gets a score, the weights we attach to each part, and the cases where we throw the number out entirely because it would mislead you.

We write this down for two reasons. First, so you can argue with it — if you think we weight pricing too lightly for solo developers, you now have something concrete to push against. Second, so we hold ourselves to it. A rubric you publish is a rubric you can be caught violating.

The five things every score measures

We score every tool across five dimensions. Each one gets a 1-to-10 sub-score, and the headline number you see is a weighted blend of the five. The dimensions are fixed; the weights are not, which we'll get to in the next section.
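As a rough sketch of that arithmetic, here is one way the blend could be computed. It assumes the "weighted blend" is a plain weighted arithmetic mean, which this article does not spell out. The dimension keys, the `blend` function, and the equal weighting are all hypothetical and chosen only for illustration.

```python
# Hypothetical keys for the five dimensions named in this article.
DIMENSIONS = (
    "capability",
    "time_to_value",
    "reliability",
    "pricing_honesty",
    "lock_in_cost",
)


def blend(sub_scores, weights):
    """Weighted arithmetic mean of the five sub-scores (an assumed formula)."""
    for dim in DIMENSIONS:
        if dim not in sub_scores or dim not in weights:
            raise ValueError(f"missing dimension: {dim}")
        if not 1 <= sub_scores[dim] <= 10:
            raise ValueError(f"{dim} sub-score must be between 1 and 10")
    total_weight = sum(weights[dim] for dim in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(sub_scores[dim] * weights[dim] for dim in DIMENSIONS)
    # Dividing by the total means the weights do not have to sum to 1.
    return weighted_sum / total_weight


# Illustrative sub-scores, not taken from any real review.
example_scores = {
    "capability": 9,
    "time_to_value": 6,
    "reliability": 7,
    "pricing_honesty": 8,
    "lock_in_cost": 5,
}
equal_weights = {dim: 1.0 for dim in DIMENSIONS}

print(round(blend(example_scores, equal_weights), 1))  # 7.0
```

Note that a higher lock-in sub-score means leaving is cheaper, so all five dimensions point the same way and can be averaged directly.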

Capability is the obvious one, but it's also where most marketing pages lie by omission. We don't score the feature list. We score whether the feature survives contact with a messy, real workload — the kind you'd actually throw at it on a Wednesday afternoon.

Time-to-value is the dimension readers underrate most. A tool that scores a 9 on capability but takes two days to configure is, for most people, worse than a 7 that works in ten minutes. We measure this from a cold start: new account, no prior setup, clock running.

Reliability is the dimension we can't fully test in a single sitting, and we say so in the review when that's the case. A tool we've run for three months and a tool we've run for three days do not get the same reliability confidence, even if they behave identically in the demo. When our exposure is short, we cap the reliability sub-score rather than guess high.
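To make the capping idea concrete, here is a minimal sketch. The thresholds and cap values are invented for illustration; the article states only that the reliability sub-score is capped when exposure is short, not where the cut-offs sit. The function name `capped_reliability` is likewise hypothetical.

```python
def capped_reliability(observed_score, days_of_use):
    """Limit the reliability sub-score by how long the tool was actually run.

    The thresholds below are hypothetical, not a published policy.
    """
    if days_of_use < 0:
        raise ValueError("days_of_use cannot be negative")
    if days_of_use < 7:
        cap = 6.0   # a few days of use: little evidence either way
    elif days_of_use < 30:
        cap = 8.0   # a few weeks: some confidence, not full
    else:
        cap = 10.0  # a month or more: no cap applied
    # The cap only ever lowers a score; it never raises a poor one.
    return min(observed_score, cap)


print(capped_reliability(9, days_of_use=3))   # 6.0
print(capped_reliability(9, days_of_use=90))  # 9
```

Two tools that behave identically in a demo come out with different sub-scores here, which matches the three-days versus three-months distinction above.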

Pricing honesty is separate from price. A tool can be expensive and honest, or cheap and dishonest. We penalize the gap between the number on the pricing page and the number on your invoice — seat minimums you discover at checkout, an export locked behind the next tier up, a free plan that throttles the one feature you came for.

Lock-in cost asks a single question: if you wanted to leave in a year, how much would it hurt? Tools that export clean, open formats score well here. Tools that trap your data in a shape only they can read score badly, no matter how good the rest of the experience is.

How we weight them (and why the weights move)

A fixed weighting would be easier to defend and worse for you. The right weight depends on what the tool is for and who's using it.

For an infrastructure tool a team will run in production, reliability and lock-in cost carry the most weight — a flaky database or a proprietary log format is a problem you live with for years. For a quick AI utility a solo developer might use for a single project, time-to-value and pricing honesty matter more, and lock-in barely registers because you're not betting your stack on it.

So the weights shift by category. We publish the weighting we used at the top of each review's scorecard, so a 7.5 in one category and a 7.5 in another aren't pretending to be the same measurement. They're not.
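Here is a small sketch of how category-dependent weights change the result for identical sub-scores. The two weight tables are made up to mirror the direction described above (reliability and lock-in heavy for infrastructure, time-to-value and pricing honesty heavy for a quick AI utility). They are not the weights used in any actual review, and the key names and `headline` function are hypothetical.

```python
import math

# Hypothetical weight tables; each is expected to sum to 1.
CATEGORY_WEIGHTS = {
    "infrastructure": {
        "capability": 0.20,
        "time_to_value": 0.10,
        "reliability": 0.30,
        "pricing_honesty": 0.10,
        "lock_in_cost": 0.30,
    },
    "ai_utility": {
        "capability": 0.25,
        "time_to_value": 0.35,
        "reliability": 0.10,
        "pricing_honesty": 0.25,
        "lock_in_cost": 0.05,
    },
}


def headline(sub_scores, category):
    """Blend sub-scores with the weights for one category, to one decimal."""
    weights = CATEGORY_WEIGHTS[category]
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError(f"weights for {category} do not sum to 1")
    blended = sum(sub_scores[dim] * weight for dim, weight in weights.items())
    return round(blended, 1)


# One illustrative set of sub-scores: quick to start, weak on lock-in.
scores = {
    "capability": 8,
    "time_to_value": 10,
    "reliability": 6,
    "pricing_honesty": 8,
    "lock_in_cost": 4,
}

for category in CATEGORY_WEIGHTS:
    print(category, headline(scores, category))
```

The same five sub-scores land noticeably lower under the infrastructure weights than under the AI-utility weights, which is why a headline number only makes sense alongside the weighting that produced it.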

A single headline number is a compression, and compression loses information. Two tools can land on the same 8.0 for opposite reasons — one is brilliant but expensive, the other is cheap but shallow. Always read the five sub-scores, not just the blend. The number on top is a starting point for your decision, not the end of it.

We keep the rubric, the per-category weights, and every tool's sub-scores in a single shared workspace so the scoring stays consistent from one review to the next. If you're building your own evaluation process — for a team tool bake-off, a vendor shortlist, or your own writing — a structured doc that forces every option through the same columns beats a folder of scattered notes.

Where scores fall short

A rubric is a tool, and like every tool it has a range outside of which it produces nonsense. We'd rather tell you where ours breaks than pretend it doesn't.

The first limit is taste. Some tools are technically strong and genuinely unpleasant to use, and "unpleasant" resists a 1-to-10 score. We fold it into capability when it affects real work, but a review's prose will always carry nuance the number can't.

The second limit is timing. Scores are snapshots. A tool we rated a 6 last quarter may ship the exact feature that was dragging it down, and until we re-test, the published number is stale. We date every score and re-review when something material changes — but between those points, trust the date as much as the digit.

The third limit is you. Our weights encode an average reader who doesn't exist. If you're cost-sensitive, mentally raise the pricing weight. If you're building something you'll maintain for five years, raise reliability and lock-in. The sub-scores are there precisely so you can re-blend them for your own situation instead of inheriting ours.
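If you want to do that re-blending with more than mental arithmetic, a sketch like the following would work. It assumes the headline is a weighted mean and that "raising" a weight means multiplying it and renormalizing; both are assumptions on our part here, and the `reblend` function, key names, and numbers are hypothetical.

```python
def reblend(sub_scores, base_weights, emphasis):
    """Re-blend sub-scores after scaling selected weights.

    emphasis maps a dimension to a multiplier, e.g. {"pricing_honesty": 3.0}.
    Dimensions not listed keep their base weight.
    """
    adjusted = {
        dim: weight * emphasis.get(dim, 1.0)
        for dim, weight in base_weights.items()
    }
    total = sum(adjusted.values())
    if total <= 0:
        raise ValueError("adjusted weights must sum to a positive number")
    # Renormalize so the result stays on the same 1-to-10 scale.
    return sum(sub_scores[dim] * w for dim, w in adjusted.items()) / total


# Illustrative numbers only: a capable tool with murky pricing.
scores = {
    "capability": 9,
    "time_to_value": 8,
    "reliability": 8,
    "pricing_honesty": 4,
    "lock_in_cost": 6,
}
base = {dim: 0.2 for dim in scores}

as_published = reblend(scores, base, {})
cost_sensitive = reblend(scores, base, {"pricing_honesty": 3.0})
print(round(as_published, 1), round(cost_sensitive, 1))
```

Tripling the pricing weight pulls this tool's blend down, because its weakest sub-score now counts for more. The published sub-scores stay the same; only your weighting of them changes.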

A fast way to use any pickuma review: ignore the headline number on the first read. Go straight to the five sub-scores, find the one that matters most for your use case, and start there. The blended score is for skimming; the sub-scores are for deciding.

The goal was never to hand you a single digit and call it objectivity. It's to make our judgment legible — to show the inputs, the weights, and the seams — so you can take what's useful and override the rest.

Originally published at pickuma.com. Subscribe to the RSS feed or follow @pickuma.bsky.social for new reviews.