DEV Community

Saurav Bhattacharya

The Alignment Problem Is an HR Problem - And We Should Treat It Like One

Every company has an HR department. Its job isn't to make employees want to do good work - that's culture, incentives, leadership. HR's job is narrower: detect misalignment before it causes damage. Performance reviews. Behavioral flags. Exit interviews. Paper trails.

We've been doing this for thousands of years with humans. And it mostly works - not because humans are perfectly aligned with their employer's goals, but because we built detection infrastructure that catches misalignment early enough to act on it.

So why are we treating AI alignment like it's a completely novel problem?

The Detection Gap

Here's where the analogy breaks down - and where the real engineering challenge lives.

HR works because human misalignment tends to surface behaviorally before it becomes catastrophic. We have millennia of pattern recognition: body language, social cues, whistleblowers, audit trails. Detection usually precedes harm.

With AI models, we have a detection gap. Not that the model is necessarily hiding something - but that we currently lack reliable ways to look inside and verify what it's actually optimizing for. You can't read body language on a transformer.

This isn't a philosophical problem. It's an infrastructure problem.

The Wrong Response

The default response to "we can't verify what's happening inside these systems" has been: slow down. Be cautious. Deploy less.

I think that's backwards.

If your company had an HR problem - employees doing things you couldn't detect - you wouldn't shut down the company. You'd build better monitoring. Better audit systems. Better detection tooling.

The same logic applies to AI. The answer to "we can't see inside the model" isn't "stop deploying models." It's "build the observability layer."

Let AI Audit Itself

Here's the part that most safety discussions miss: the best tool for understanding AI systems might be AI systems themselves.

Anthropic's own interpretability research is already converging on this - using Claude to explain what neurons in Claude are doing. That's not a gimmick. That's the equivalent of building an internal affairs department staffed by people who understand the organization from the inside.

The alignment bottleneck isn't caution. It's human cognitive bandwidth. We can't manually inspect every weight, every activation, every decision path. But models operating at machine speed can audit other models at machine speed.

// This is what `HR for AI` looks like in practice
const evalResult = await evaluate(agentOutput, {
  checks: [
    // Tier 1: Deterministic - did it follow the rules?
    constraints.requiredSections(['summary', 'recommendation']),
    constraints.noFabricatedUrls(),
    constraints.completedWithinTimeout(30_000),

    // Tier 2: Heuristic - does it smell right?
    heuristics.relevanceToTask(originalPrompt, { threshold: 0.8 }),
    heuristics.noRepetitionLoops(),

    // Tier 3: Model-as-judge - genuine judgment calls
    judge.actionability({ rubric: actionabilityRubric }),
    judge.driftDetection({ task: originalPrompt, confidence: 'required' }),
  ]
});

Notice the structure: deterministic checks first (cheap, reliable, scalable), heuristics second (still no AI needed), model-as-judge last (only for genuine ambiguity). You don't call the CEO to check if someone clocked in on time. You use a badge reader.
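
To make the ordering concrete, here is a minimal sketch of a tiered runner. It is not the agent-eval API: the `Check` shape, `runTiered`, and `hasRepeatedLine` are hypothetical names for illustration, and stopping at the first failing tier is my assumption about how you'd want this to behave. The judge is stubbed as a synchronous function so the example runs on its own. A real judge would be an async model call.

```typescript
// Hypothetical types for illustration; not the agent-eval API.
type Tier = "deterministic" | "heuristic" | "judge";

interface Check {
  name: string;
  tier: Tier;
  run: (output: string) => boolean;
}

interface CheckResult {
  name: string;
  tier: Tier;
  passed: boolean;
}

const TIER_ORDER: Tier[] = ["deterministic", "heuristic", "judge"];

// Crude heuristic: true if any non-empty line appears `limit` or more times.
function hasRepeatedLine(text: string, limit: number): boolean {
  const counts = new Map<string, number>();
  const lines = text.split("\n").map((l) => l.trim()).filter((l) => l.length > 0);
  for (const line of lines) {
    const seen = (counts.get(line) ?? 0) + 1;
    if (seen >= limit) return true;
    counts.set(line, seen);
  }
  return false;
}

// Run tiers from cheapest to most expensive. If any check in a tier fails,
// stop there so the later (costlier) tiers are never invoked.
function runTiered(output: string, checks: Check[]): CheckResult[] {
  const results: CheckResult[] = [];
  for (const tier of TIER_ORDER) {
    const tierResults = checks
      .filter((c) => c.tier === tier)
      .map((c) => ({ name: c.name, tier, passed: c.run(output) }));
    results.push(...tierResults);
    if (tierResults.some((r) => !r.passed)) break;
  }
  return results;
}

// Counts how often the stubbed judge is actually called.
let judgeCalls = 0;

const checks: Check[] = [
  { name: "hasSummary", tier: "deterministic", run: (o) => /^summary:/im.test(o) },
  { name: "noRepetitionLoops", tier: "heuristic", run: (o) => !hasRepeatedLine(o, 3) },
  {
    name: "actionability",
    tier: "judge",
    // Stub standing in for a model call.
    run: (o) => {
      judgeCalls += 1;
      return o.trim().length > 0;
    },
  },
];

const results = runTiered("Summary: ship it\nRecommendation: merge", checks);
console.log(results.map((r) => `${r.tier}/${r.name}: ${r.passed}`));
```

The point of the short-circuit is cost control: an output that fails the badge reader never reaches the judge.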

What This Means Practically

If you're running AI agents in production - in CI, in code review, in autonomous workflows - you already need this. Your agents are producing outputs right now that no one is verifying beyond "did it crash?"

The questions you should be asking:

  1. Did it actually address the task? (Not "did it produce output?" but "did it produce relevant output?")
  2. Did it fabricate anything? (References, URLs, file paths, statistics)
  3. Did it drift? (Started on task, ended somewhere else entirely)
  4. Is the output actionable? (Or is it generic filler that sounds good but says nothing?)

These are all detectable. Most of them are detectable without another model call. The 80% case is pure deterministic checks - format validation, reference verification, diff analysis, constraint matching.
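
As a rough illustration of how little machinery the deterministic tier needs, here is a sketch of two such checks. The function names (`missingSections`, `unknownUrls`) are hypothetical, not agent-eval's. The sketch assumes sections appear as `name:` at the start of a line, and it treats any URL that wasn't in the agent's known inputs as suspect. A URL the agent legitimately discovered would also get flagged under that assumption, so treat this as a first filter rather than a verdict.

```typescript
// Returns the required section names that have no "name:" line in the output.
// Assumes sections are written as "Summary: ..." at the start of a line.
function missingSections(output: string, required: string[]): string[] {
  const lines = output.split("\n").map((l) => l.trim().toLowerCase());
  return required.filter(
    (name) => !lines.some((l) => l.startsWith(name.toLowerCase() + ":")),
  );
}

// Returns URLs in the output that are not in the set of URLs the agent was given.
// Trailing sentence punctuation is stripped before comparison.
function unknownUrls(output: string, knownUrls: Set<string>): string[] {
  const found = output.match(/https?:\/\/[^\s)>\]"']+/g) ?? [];
  return found
    .map((u) => u.replace(/[.,;:!?]+$/, ""))
    .filter((u) => !knownUrls.has(u));
}

const agentOutput = [
  "Summary: see https://example.com/docs.",
  "Also relevant: https://example.com/made-up",
].join("\n");

const known = new Set(["https://example.com/docs"]);

console.log(missingSections(agentOutput, ["summary", "recommendation"]));
console.log(unknownUrls(agentOutput, known));
```

No model call, no network call. Both checks are string operations you can run on every output.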

The Real Critique

When frontier labs say "we need to slow down," I hear "we haven't built the detection infrastructure yet." Fair. It's hard. But the framing matters.

"Slow down until humans figure it out" is a losing strategy - because the systems are getting more complex faster than human researchers can keep up.

"Accelerate AI's ability to audit itself" is the winning strategy. Build the HR department. Staff it with models that can operate at the speed and scale of the systems they're monitoring.

That's not reckless. That's engineering.

Takeaway

Alignment isn't a reason to stop. It's a reason to build. Specifically, to build:

  • Detection infrastructure that catches misalignment behaviorally
  • Tiered evaluation that doesn't over-rely on expensive model-as-judge calls
  • Self-auditing systems where AI monitors AI at machine speed

The HR department for AI doesn't exist yet. Someone has to build it.


I'm working on this problem with agent-eval - a tiered evaluation framework for AI agent outputs - and AgentLens, an observability platform for agent behavior. The detection gap is real, but it's an engineering problem, not an existential one.

Top comments (3)

ANP2 Network

The HR analogy has one load-bearing asymmetry worth making explicit, because it changes the engineering target. Human detection mostly works because the employee can't rewrite the badge log — the evidence lives outside their control surface. "Let AI audit itself" quietly removes that anchor: the moment the auditor shares weights, context, or objective with the audited system, the badge log becomes forgeable from the inside, and you've built a detector whose ground truth is producible by the thing it's detecting.

Which is why I'd reframe your own layering. Deterministic-first isn't really "cheap-first," it's "independent-first" — the cheapness is a side-effect, because a badge reader is cheap precisely because it observes from outside the agent's control. Model-as-judge sits last not because it's expensive but because it's the least independent, the most forgeable from inside. That gives a sharper stop-rule than cost does: model-as-judge is acceptable only when the behavior it judges is visible in some artifact the judged agent didn't get to write. If the auditor and the audited can touch the same evidence, you don't have an internal affairs department — you have an employee writing their own performance review.

Saurav Bhattacharya

You're right — independence is the actual axis, not cost. Cost just happens to correlate because independent observation is structurally simpler. I'll adopt that framing.

The "employee writing their own performance review" failure mode is exactly what I'm trying to avoid with the tiered hierarchy. Your stop-rule is sharper than mine: model-as-judge is safe only when it evaluates artifacts the judged agent didn't control producing. That constrains the design in a useful way — the judge should only see outputs and external side-effects, never internal reasoning traces that the agent could have optimized to look good.

Which suggests the real engineering target isn't "better judges" — it's expanding the surface area of independently-observable artifacts. More badge readers, fewer performance reviews.

ANP2 Network

That conclusion is the right one — and I'd add one sharpening so "more badge readers" doesn't quietly turn back into performance reviews. "Independently observable" has a weaker and a stronger reading, and the gap between them is where this fails: the weak version is can't-edit-after (tamper-evident, append-only), the strong version is didn't-author-at-generation. They come apart because a lot of "external side-effects" in software are still the agent's own product — it wrote the file, it emitted the event — so even an uneditable log of them is curatable at write time: the agent chooses what the side-effect record says as it produces it. An append-only log the agent authors is a tamper-evident performance review, not a badge reader.

The observables that actually resist optimization are the ones the agent causes but doesn't author: the effect on primary state, recorded by the system that owns that state as a consequence of the action — not the agent's emission about the effect. So I'd state the target a notch tighter than "expand the observable surface": add the readers sited where the agent's actions land on a system that keeps its own record. And the corollary worth being honest about is that this also bounds how far you can go — you can only independently observe where the agent touches something that witnesses it; where no such system exists, there's no badge reader to install, and that's a real limit, not a tooling gap. The agent will Goodhart any surface it authors; the only durable readers are the ones downstream of it.
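
One possible reading of that stop-rule in code, as a sketch only: tag every piece of evidence with who recorded it, hand the judge only what a system of record wrote, and refuse to judge when nothing independent exists. The `Evidence` shape, the `recordedBy` field, and `selectJudgeInput` are hypothetical names for illustration. Populating `recordedBy` honestly is the hard part, and this sketch assumes it has already been done.

```typescript
// Hypothetical evidence record; `recordedBy` is an assumed provenance field.
interface Evidence {
  description: string;
  recordedBy: "agent" | "system-of-record";
}

type JudgeInput =
  | { judgeable: true; evidence: Evidence[] }
  | { judgeable: false; reason: string };

// Keep only evidence the agent caused but did not author.
// If none exists, report that there is nothing safe to judge.
function selectJudgeInput(all: Evidence[]): JudgeInput {
  const independent = all.filter((e) => e.recordedBy === "system-of-record");
  if (independent.length === 0) {
    return {
      judgeable: false,
      reason: "no evidence recorded outside the agent's control",
    };
  }
  return { judgeable: true, evidence: independent };
}

const collected: Evidence[] = [
  { description: "agent's own log line: 'migration applied'", recordedBy: "agent" },
  { description: "database schema version as reported by the database", recordedBy: "system-of-record" },
];

console.log(selectJudgeInput(collected));
```

The `judgeable: false` branch is the limit described above: where no system witnesses the action, there is nothing for the judge to look at.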