Vinicius Pereira

Posted on Jul 5

Why is this a 60 and not a 40?

#python #testing #reliability #softwareengineering

The worst thirty seconds of my week used to happen in review. Someone senior would put a finger on one row of my output, a row scored 60 out of 100, and ask why it was a 60 and not a 40. And I did not know. Not on the spot.

What I did instead was open the scoring code, re-read the branches, do the arithmetic in my head against that input row, and reconstruct the answer live while the room waited. Half the time I got it right. The other half I said "let me check and get back to you," which in front of people who make decisions on that number is the same as admitting the number is a guess.

It took me a couple of those meetings to see the real problem. A score you cannot defend the moment you are asked is not a feature. It is a liability. If the person consuming your output has to trust you instead of the number, you did not ship a scoring system. You shipped yourself as a dependency, and you will be that dependency at 6pm on a Friday.

Two code paths, one quiet lie

The failure is almost always the same shape. The scoring function adds points. Somewhere else, usually a different file, a second function assembles the human-readable reason. Both encode the same thresholds. They agree on the day you write them, the tests pass, everyone moves on.

Then a threshold changes. It changes for a good reason, in the scoring function, and it ships. Nobody touches the reason builder, because why would they? It is in another file, the tests are still green, the diff looked complete. Now the number moves and the story stays put. The reason string is not merely stale. It is confidently, specifically wrong.

Take one concrete row and hold it fixed for the rest of this post:

lead = {"monthly_spend": 6000, "seats": 30, "renewed": True}

Here is the code that scores it, split across the two paths:

# BEFORE: the number and its justification live in two functions
# that happen to share a set of magic thresholds.

def score_lead(lead):
    score = 0
    if lead["monthly_spend"] >= 8000:   # bumped from 5000 last quarter
        score += 40
    if lead["seats"] >= 25:
        score += 35
    if lead["renewed"]:
        score += 25
    return min(score, 100)

def explain(lead):
    reasons = []
    if lead["monthly_spend"] >= 5000:   # nobody bumped this one
        reasons.append("high spend")
    if lead["seats"] >= 25:
        reasons.append("large team")
    if lead["renewed"]:
        reasons.append("renewed before")
    return ", ".join(reasons)

Run our row through both:

>>> score_lead(lead)
60
>>> explain(lead)
'high spend, large team, renewed before'

The spend threshold moved to 8000, so 6000 no longer earns the 40 points and the lead lands at 60. But explain still tests against the old 5000 line, so it leads with "high spend" as the top reason. The number went down because spend stopped counting, and the explanation says it is high because of spend. You are now debugging your own explanation against your own arithmetic, out loud, in front of the person who asked.

Build the reason where you add the points

The fix is boring and it is the whole point. Build the reason in the same branch where you add the points. Same if, same condition, same line of sight. The reason is not documentation written after the fact. It is a byproduct of the decision, emitted at the instant the decision is made.

# AFTER: one branch owns both the points and the words for them.
# You cannot change the number without seeing the sentence.

def score_lead(lead):
    score = 0
    parts = []
    if lead["monthly_spend"] >= 8000:
        score += 40
        parts.append(f"+40 monthly spend {lead['monthly_spend']} >= 8000")
    if lead["seats"] >= 25:
        score += 35
        parts.append(f"+35 team of {lead['seats']} seats >= 25")
    if lead["renewed"]:
        score += 25
        parts.append("+25 renewed at least once")
    final = min(score, 100)   # weights top out at 100, so the cap never bites
    reason = f"{final}/100: " + ("; ".join(parts) if parts else "no signals fired")
    return final, reason

Same row, one call now:

>>> score_lead(lead)
(60, '60/100: +35 team of 30 seats >= 25; +25 renewed at least once')

It is a 60 and not a 40 because team size and a prior renewal fired and spend did not, and the numbers add up in front of you. There is no second file to forget, because there is no second file. The if that grants 40 points is the only place that can claim the 40.

One caveat about that cap. Here 40 plus 35 plus 25 lands exactly on 100, so min never changes the total. If your raw points can overshoot the ceiling, either design the weights to land on the cap, or name the cap in the reason and reconcile against the pre-cap total. The mismatch has to stay visible, never quietly absorbed.
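A minimal sketch of the second option, assuming a hypothetical fourth signal (`referred`, worth 20 points) that lets the raw total overshoot the ceiling. The signal, its weight, and the name `score_lead_capped` are made up for illustration and are not part of the scorer above. The cap gets its own signed line in the reason, so the parts still add up to the final score.

```python
CAP = 100

def score_lead_capped(lead):
    """Variant of score_lead whose raw points can exceed CAP."""
    raw = 0
    parts = []
    if lead["monthly_spend"] >= 8000:
        raw += 40
        parts.append(f"+40 monthly spend {lead['monthly_spend']} >= 8000")
    if lead["seats"] >= 25:
        raw += 35
        parts.append(f"+35 team of {lead['seats']} seats >= 25")
    if lead["renewed"]:
        raw += 25
        parts.append("+25 renewed at least once")
    if lead.get("referred"):  # hypothetical extra signal, for illustration only
        raw += 20
        parts.append("+20 referred by a customer")
    final = min(raw, CAP)
    if raw > CAP:
        # Name the cap explicitly so the overshoot stays visible in the reason.
        parts.append(f"-{raw - CAP} capped at {CAP} (raw total {raw})")
    reason = f"{final}/{CAP}: " + ("; ".join(parts) if parts else "no signals fired")
    return final, reason

big = {"monthly_spend": 9000, "seats": 30, "renewed": True, "referred": True}
score, reason = score_lead_capped(big)
print(score)
print(reason)
```

With this format the reason carries a negative term, so any reconciliation check has to read signed terms instead of only the `+N` ones.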

A test that catches the drift before the meeting does

"Just build it in the same place" is a discipline, and disciplines rot the first sprint you are in a hurry. So I pin it down as a property: the point values named in the reason must sum to the score. If they ever stop summing, the audit trail has started lying, and I want CI to fail instead of a stakeholder to notice.

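import re

SAMPLE_LEADS = [  # illustrative fixtures, not from a real dataset
    {"monthly_spend": 6000, "seats": 30, "renewed": True},   # the row from this post
    {"monthly_spend": 9000, "seats": 30, "renewed": True},   # every signal fires
    {"monthly_spend": 1000, "seats": 5, "renewed": False},   # no signal fires
]

def test_reason_points_sum_to_score():
    for lead in SAMPLE_LEADS:
        score, reason = score_lead(lead)
        named = [int(p) for p in re.findall(r"\+(\d+)", reason)]
        assert sum(named) == score, reason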

The test parses its own audit trail and checks the arithmetic against the score the function returned. Add a signal that bumps the score but forget to append its line, and the sum comes up short. Append a line but fat-finger the number, and it comes up wrong. Either way it goes red. Notice what the test does not do: it says nothing about whether the reason is well written. It checks that the reason is not lying. Those are different jobs, and in a review only the second one matters.
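To watch the property fail, here is a sketch with a deliberately broken scorer that grants the renewal points and never records them. Both `reconciles` and `score_lead_forgetful` are hypothetical names invented for this demonstration. The check is the same arithmetic as the test, pulled into a helper so it can be called directly.

```python
import re

def reconciles(score, reason):
    """True when the +N terms named in the reason sum to the score."""
    named = [int(p) for p in re.findall(r"\+(\d+)", reason)]
    return sum(named) == score

def score_lead_forgetful(lead):
    """Deliberately broken: adds renewal points without appending a reason."""
    score = 0
    parts = []
    if lead["seats"] >= 25:
        score += 35
        parts.append(f"+35 team of {lead['seats']} seats >= 25")
    if lead["renewed"]:
        score += 25  # bug on purpose: no parts.append(...) for these points
    reason = f"{score}/100: " + ("; ".join(parts) if parts else "no signals fired")
    return score, reason

lead = {"monthly_spend": 6000, "seats": 30, "renewed": True}
score, reason = score_lead_forgetful(lead)
print(score, reason)
print(reconciles(score, reason))  # False: the reason names 35 points, the score is 60
```

The score looks plausible and the reason reads fine on its own. Only the sum gives it away.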

Sometimes the defensible score is zero

The pattern pushed me somewhere I did not expect. In crosswatch, the small harness I pulled this out of, I corroborate every number across two independent providers before scoring anything. Agree, and the region is CONFIRMED. Drift a little, REVIEW. Contradict each other hard, and the region is EXCLUDED and scores zero on purpose.

Not the average of the two numbers. Zero. When two independent sources disagree that badly, the honest answer to "what is the value here" is "we do not know," and the average of two contradictions looks exactly as confident as a good number while meaning nothing. A zero I can stand behind beats a 55 I cannot, and the EXCLUDED row carries the reason for it: the providers disagreed by this many points, so withholding a score was more honest than building one on a contradiction.
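A rough sketch of that three-way verdict, under loud assumptions: the tolerances (5 and 15 points), the function name `corroborate`, and the choice to use the mean when the providers agree are all invented here for illustration. None of it is taken from crosswatch's actual code or API.

```python
# Hypothetical tolerances, chosen for illustration; not crosswatch's real values.
CONFIRM_WITHIN = 5
REVIEW_WITHIN = 15

def corroborate(reading_a, reading_b):
    """Collapse two provider readings into (status, score, reason) at one site."""
    gap = abs(reading_a - reading_b)
    if gap <= CONFIRM_WITHIN:
        score = (reading_a + reading_b) / 2  # assumption: use the mean when they agree
        return "CONFIRMED", score, f"providers agree within {gap} points"
    if gap <= REVIEW_WITHIN:
        score = (reading_a + reading_b) / 2
        return "REVIEW", score, f"providers drift by {gap} points, needs a look"
    # Hard contradiction: abstain instead of averaging, and say why.
    return "EXCLUDED", 0, f"providers disagree by {gap} points, no score assigned"

print(corroborate(62, 60))
print(corroborate(70, 40))  # averaging would give 55; the verdict is 0 with a reason
```

The status, the score, and the reason all come out of the same branch, so the abstention is explained by the same code that decided it.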

Building the reason where the points are added buys something operational too. Because every reason ships attached to its number, and the raw readings are stored with provenance (which provider, which run, which timestamp), I can rescore months of stored data under new thresholds without collecting anything again. Collection is expensive; judgment is free. Change a weight, rescore in milliseconds, and every reason moves in lockstep with every number, with the reconciliation test still standing guard.
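One way to picture the rescoring, as a sketch only: thresholds and weights live in a rule table, stored rows keep raw inputs plus provenance, and the scorer takes the rules as an argument. The row layout, the `renewals` count field, and the names `score_with` and `rescore` are assumptions made up for this example, and the sample rows are invented.

```python
def score_with(rules, row):
    """Score one stored row under a rule set; the reason is built in the same branch."""
    score = 0
    parts = []
    for field, (threshold, points) in rules.items():
        if row[field] >= threshold:
            score += points
            parts.append(f"+{points} {field} {row[field]} >= {threshold}")
    reason = f"{score}/100: " + ("; ".join(parts) if parts else "no signals fired")
    return score, reason

def rescore(rows, rules):
    """Re-judge stored readings without collecting anything again."""
    return [(row["provider"], row["run"], *score_with(rules, row)) for row in rows]

# Made-up stored readings: raw inputs and provenance, no stored scores.
STORED = [
    {"provider": "a", "run": 1, "monthly_spend": 6000, "seats": 30, "renewals": 1},
    {"provider": "b", "run": 1, "monthly_spend": 9000, "seats": 10, "renewals": 0},
]

# Rule tables map field -> (threshold, points); only the spend threshold differs.
OLD_RULES = {"monthly_spend": (5000, 40), "seats": (25, 35), "renewals": (1, 25)}
NEW_RULES = {"monthly_spend": (8000, 40), "seats": (25, 35), "renewals": (1, 25)}

for label, rules in (("old", OLD_RULES), ("new", NEW_RULES)):
    for result in rescore(STORED, rules):
        print(label, result)
```

Under the old rules the first row scores 100 and its reason cites the 5000 threshold. Under the new rules it scores 60 and the spend line is gone from the reason, with no separate text to update.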

So the next time someone points at a row and asks why it is a 60 and not a 40, the answer is already sitting in the cell next to the 60, and a test has already confirmed it adds up. Everything after that is just deciding where your if statements go.

crosswatch is on GitHub, MIT licensed, stdlib only, 63 tests: github.com/vinimabreu/crosswatch

Top comments (6)

Bobai Kato • Jul 6

Sharp write-up. The best takeaway for me is this: explanations should be generated, not narrated; the moment your reason text can drift from scoring logic, trust starts decaying.

Also loved the “defensible zero beats a fake-precise average” point; that’s the kind of decision that prevents a lot of downstream bad calls.

Vinicius Pereira • Jul 6

"generated, not narrated" is a cleaner name for it than the one i used, i'm stealing that. and the decay you point at is the nasty part: a narrated reason never announces when it starts lying, it stays confidently specific while going wrong, so the failure looks identical to the healthy case right up until someone checks the arithmetic. generation doesn't make the drift easier to catch, it makes it impossible, which is the only version you can trust at 6pm on a friday.

funny thing is this is the same law you're building into OTA's receipts one level up. a governance receipt assembled after the fact from a second read of state is a narrated reason wearing json, it can drift from what actually fired the same way my explain() drifted from score_lead(). the fix rhymes: emit the receipt on the line where the decision happens, not from a later pass. we came at it from different altitudes, scoring vs execution governance, and landed on the same rule, the explanation has to be a byproduct of the decision or it's just a story about it. appreciate you reading it this closely.

Bobai Kato • Jul 6

And that is probably the deeper product lesson too: trust does not come from richer explanation text, it comes from removing the possibility that explanation and decision can diverge.

That is why I think this matters beyond scoring or receipts. The same rule should apply to CI verdicts, policy checks, agent safety boundaries, even “why was this PR blocked?” surfaces. If the explanation is a second pass, it is already on the path to drift.

Appreciate the way you framed it. “A narrated reason wearing JSON” is going to stick with me.

Vinicius Pereira • Jul 6

trust as removing the possibility of divergence is the product thesis, yeah, better than how i said it. one prerequisite worth naming before anyone applies this to CI verdicts or PR-blocked surfaces: the rule only works if the decision has a single site to generate from. the failure that precedes every narrated reason is a decision smeared across three modules where no line actually owns the verdict, and then there is nothing to weld the explanation to. so half the work is refactoring until the decision exists in one place, and once it does the receipt almost writes itself. the reason is downstream of the architecture, which is why bolting explanations onto a system that never localized its decisions produces exactly the confident drift we both keep hitting.

Bobai Kato • Jul 7

Exactly.

A generated explanation only works if the decision itself has one owner.
If the verdict is smeared across multiple modules, the explanation is already a reconstruction and drift is almost inevitable.

So the sequence is:

1. localize the decision
2. emit the record from that decision site
3. then trust the result

Vinicius Pereira • Jul 8

exactly right, and i'd only add a fourth step because 'then trust the result' quietly assumes the weld holds as the code moves. it doesn't on its own. once the decision lives in one place the receipt writes itself, but the next person edits the branch and forgets the string, and now the reason is a confident lie instead of confident drift, which is worse. so the weld has to be pinned by a test that fails when the granted points and the stated reason stop reconciling. the discipline of keeping them on the same line rots, the assertion doesn't.

one refinement on localize too: it's the verdict you localize, not the inputs. you can have many readers feeding the decision, the quorum case literally runs two independent ones, but there's exactly one site where they collapse into a verdict, and that collapse point is the thing that has to own the reason and the abstention both. plural inputs, singular decision site. otherwise 'localize' reads as centralize everything, and that's not it.