The worst thirty seconds of my week used to happen in review. Someone senior would put a finger on one row of my output, a row scored 60 out of 100, and ask why it was a 60 and not a 40. And I did not know. Not on the spot.
What I did instead was open the scoring code, re-read the branches, do the arithmetic in my head against that input row, and reconstruct the answer live while the room waited. Half the time I got it right. The other half I said "let me check and get back to you," which in front of people who make decisions on that number is the same as admitting the number is a guess.
It took me a couple of those meetings to see the real problem. A score you cannot defend the moment you are asked is not a feature. It is a liability. If the person consuming your output has to trust you instead of the number, you did not ship a scoring system. You shipped yourself as a dependency, and you will be that dependency at 6pm on a Friday.
Two code paths, one quiet lie
The failure is almost always the same shape. The scoring function adds points. Somewhere else, usually a different file, a second function assembles the human-readable reason. Both encode the same thresholds. They agree on the day you write them, the tests pass, everyone moves on.
Then a threshold changes. It changes for a good reason, in the scoring function, and it ships. Nobody touches the reason builder, because why would they. It is in another file, the tests are still green, the diff looked complete. Now the number moves and the story stays put. The reason string is not merely stale. It is confidently, specifically wrong.
Take one concrete row and hold it fixed for the rest of this post:
lead = {"monthly_spend": 6000, "seats": 30, "renewed": True}
Here is the code that scores it, split across the two paths:
# BEFORE: the number and its justification live in two functions
# that happen to share a set of magic thresholds.
def score_lead(lead):
score = 0
if lead["monthly_spend"] >= 8000: # bumped from 5000 last quarter
score += 40
if lead["seats"] >= 25:
score += 35
if lead["renewed"]:
score += 25
return min(score, 100)
def explain(lead):
reasons = []
if lead["monthly_spend"] >= 5000: # nobody bumped this one
reasons.append("high spend")
if lead["seats"] >= 25:
reasons.append("large team")
if lead["renewed"]:
reasons.append("renewed before")
return ", ".join(reasons)
Run our row through both:
>>> score_lead(lead)
60
>>> explain(lead)
'high spend, large team, renewed before'
The spend threshold moved to 8000, so 6000 no longer earns the 40 points and the lead lands at 60. But explain still tests against the old 5000 line, so it leads with "high spend" as the top reason. The number went down because spend stopped counting, and the explanation says it is high because of spend. You are now debugging your own explanation against your own arithmetic, out loud, in front of the person who asked.
Build the reason where you add the points
The fix is boring and it is the whole point. Build the reason on the same line where you add the points. Same if, same condition, same line of sight. The reason is not documentation written after the fact. It is a byproduct of the decision, emitted at the instant the decision is made.
# AFTER: one branch owns both the points and the words for them.
# You cannot change the number without seeing the sentence.
def score_lead(lead):
score = 0
parts = []
if lead["monthly_spend"] >= 8000:
score += 40
parts.append(f"+40 monthly spend {lead['monthly_spend']} >= 8000")
if lead["seats"] >= 25:
score += 35
parts.append(f"+35 team of {lead['seats']} seats >= 25")
if lead["renewed"]:
score += 25
parts.append("+25 renewed at least once")
final = min(score, 100) # weights top out at 100, so the cap never bites
reason = f"{final}/100: " + ("; ".join(parts) if parts else "no signals fired")
return final, reason
Same row, one call now:
>>> score_lead(lead)
(60, '60/100: +35 team of 30 seats >= 25; +25 renewed at least once')
It is a 60 and not a 40 because team size and a prior renewal fired and spend did not, and the numbers add up in front of you. There is no second file to forget, because there is no second file. The if that grants 40 points is the only place that can claim the 40.
One caveat about that cap. Here 40 plus 35 plus 25 lands exactly on 100, so min never changes the total. If your raw points can overshoot the ceiling, either design the weights to land on the cap, or name the cap in the reason and reconcile against the pre-cap total. The mismatch has to stay visible, never quietly absorbed.
A test that catches the drift before the meeting does
"Just build it in the same place" is a discipline, and disciplines rot the first sprint you are in a hurry. So I pin it down as a property: the point values named in the reason must sum to the score. If they ever stop summing, the audit trail has started lying, and I want CI to fail instead of a stakeholder to notice.
import re
def test_reason_points_sum_to_score():
for lead in SAMPLE_LEADS:
score, reason = score_lead(lead)
named = [int(p) for p in re.findall(r"\+(\d+)", reason)]
assert sum(named) == score, reason
The test parses its own audit trail and checks the arithmetic against the score the function returned. Add a signal that bumps the score but forget to append its line, and the sum comes up short. Append a line but fat-finger the number, and it comes up wrong. Either way it goes red. Notice what the test does not do: it says nothing about whether the reason is well written. It checks that the reason is not lying. Those are different jobs, and in a review only the second one matters.
Sometimes the defensible score is zero
The pattern pushed me somewhere I did not expect. In crosswatch, the small harness I pulled this out of, I corroborate every number across two independent providers before scoring anything. Agree, and the region is CONFIRMED. Drift a little, REVIEW. Contradict each other hard, and the region is EXCLUDED and scores zero on purpose.
Not the average of the two numbers. Zero. When two independent sources disagree that badly, the honest answer to "what is the value here" is "we do not know," and the average of two contradictions looks exactly as confident as a good number while meaning nothing. A zero I can stand behind beats a 55 I cannot, and the EXCLUDED row carries the reason for it: the providers disagreed by this many points, so no score was more honest than one built on a contradiction.
The same welding buys something operational. Because every reason ships attached to its number, and the raw readings are stored with provenance (which provider, which run, which timestamp), I can rescore months of stored data under new thresholds without collecting anything again. Collection is expensive, judgment is free. Change a weight, rescore in milliseconds, and every reason moves in lockstep with every number, reconciliation test still standing guard.
So the next time someone points at a row and asks why it is a 60 and not a 40, the answer is already sitting in the cell next to the 60, and a test has already confirmed it adds up. Everything after that is just deciding where your if statements go.
crosswatch is on GitHub, MIT licensed, stdlib only, 63 tests: github.com/vinimabreu/crosswatch
Top comments (0)