Why we calibrate the indicator Total, not the raw scores

#hrtech #opensource #fastapi #nextjs

A 360 review collects answers from three to ten people about one person. Then the reviewer sits down to "calibrate" the result. The industry default is one of two paths: rewrite individual respondent scores, or override the per-competence percent at the very end. We tried both. Neither aged well. This week we shipped a third option, and we're not done thinking about it.

This post is half a release note, half an open question.

The raw-score calibration trap

The natural place to put calibration is on the raw answer. Respondent A gave 3, B gave 5. The reviewer thinks the real answer is 4, so they edit one of those numbers and recompute.

Two problems with that.

First, you've now lied about respondent A. Anyone who reopens the survey sees a number A never gave. The anonymity guarantee — the thing you spent ages explaining to the org — becomes "anonymous except when the reviewer disagrees."

Second, the indicator average is the wrong place to override anyway. A competence has five to ten indicators across three skill levels. The reviewer's gut isn't "the average of these answers should be 4.2." It's "this indicator at this level: the person clears it." Different math.

The other common shape is to leave raw answers alone and override the per-competence percent at the bottom. Cleaner. But you've now hidden which indicators the reviewer disagreed with. The output is correct, the audit trail is gone.

What we built

Reviewers pin a Total per indicator. Not on the raw answers. Not on the competence percent. On the indicator itself.

python
class CalibratedIndicatorTotal(Base):
    __tablename__ = "assessment_calibrated_totals"
    assessment_id: Mapped[UUID] = mapped_column(ForeignKey("assessments.id"))
    indicator_id: Mapped[UUID] = mapped_column(ForeignKey("competence_indicators.id"))
    skill_level_id: Mapped[UUID] = mapped_column(ForeignKey("skill_levels.id"))
    total: Mapped[int]  # 0..100
    calibrated_by: Mapped[UUID] = mapped_column(ForeignKey("users.id"))
    calibrated_at: Mapped[datetime]

Raw answers stay intact and stay visible in the same UI. The per-competence percent inherits the override — anywhere a calibrated Total exists, a calibrated chip surfaces in the breakdown so the audit trail is one click away.

Two operational rules around it:

Submissions lock while calibration is in progress. Late respondents can't trickle answers into a result the reviewer is already pinning. The assessment status flips to a calibrating sub-state and the take-assessment route returns 409.
Cancel calibration wipes every Total and restores raw averages. Not a soft delete — clean wipe. Cancel and you're back to the view a respondent saw.

The math: if a competence has indicators I1..I4 at skill level Intermediate and the reviewer pins Totals for I1 and I3 only, the competence percent at Intermediate is mean(pinned(I1), raw_mean(I2), pinned(I3), raw_mean(I4)). We don't silently propagate one pinned indicator to a whole level.

What we don't know yet

Three places where we made a call and aren't sure it survives contact with real teams.

Should the lock be hard or soft? Right now we hard-lock the whole assessment during calibration. The argument for soft is operations — reviewers don't need to chase the last respondent before they can start. The argument for hard is data integrity — calibration math is unstable when its inputs are still moving. We picked hard. Open question whether that's right for orgs with long review windows.

Should calibration carry a written rationale? The data model has room. The UI doesn't surface a comment field yet, because the moment you add a "why did you override this" field, reviewers either skip it or write "see meeting notes." Neither helps the person being reviewed. So: what would actually be useful here — a free-text rationale that nobody fills in, or a structured tag (e.g. "respondent bias", "context the survey missed") that constrains the input?

Re-entry semantics after Cancel. If a respondent submits after calibration was cancelled, their answer re-enters the raw average automatically. The alternative was an explicit "include late submissions" toggle. We picked auto-include because Cancel already feels like a reset button. But that's a guess.

DEV Community

Why we calibrate the indicator Total, not the raw scores

The raw-score calibration trap

What we built

What we don't know yet

Top comments (0)