Hiring, evaluation, 1:1s, retrospectives, roadmap decisions, team design, and AI usage often look like different problems.
They are not.
In practice, they are all judgment problems:
- what you observe,
- what you treat as evidence,
- what level of autonomy is allowed,
- and what you record so decisions can be explained and improved later.
That is why I think a large part of organizational operations can be implemented with the same architectural spine I have been describing in the determinism series.
The claim here is not:
“Automate all human judgment.”
The claim is narrower and more useful:
Move the observation, verification, typed execution, and audit parts of organizational operations into reproducible boundaries.
In other words, let proposals remain flexible, but make commitment paths legible.
One clarification is important here: although I call this a deterministic architecture, it is not a replacement for your existing application stack, org chart, data platform, or model stack.
It is more precise to think of it as a protocol layer for judgment that cuts across existing systems: a way to make observation, verification, typed execution, and audit more legible without requiring you to rebuild everything from scratch.
The core idea
In the determinism series, I have been arguing for a simple but important split:
- proposals may vary,
- but verification and execution should be stable.
That usually means:
- fix the input schema,
- let humans or LLMs generate proposals,
- run those proposals through a verifier,
- return ACCEPT, REJECT, or DEGRADE,
- execute only typed actions,
- and pin the grounds in logs.
This applies surprisingly well to organizational work.
Take hiring.
The risky version is obvious: an interviewer writes a freeform impression, and that impression quietly turns into a hiring decision.
A better version is:
- define the capability dimensions the role actually needs,
- observe signals against those dimensions,
- distinguish strong signals from weak signals,
- route missing evidence into a re-entry path,
- and only commit when the required observations are present.
That maps almost directly to:
- an input schema,
- a scorecard policy,
- a verifier,
- DEGRADE when evidence is insufficient,
- typed next actions,
- and a fixed decision log.
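Mapped into code, the scorecard check is tiny. This is a sketch under assumptions: the names (`HiringScorecardPolicy`, `verify_hiring_evidence`) and the signal labels are hypothetical, not an existing API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HiringScorecardPolicy:
    # Hypothetical shape: the capability dimensions the role actually needs.
    must_have_signals: tuple[str, ...]


def verify_hiring_evidence(
    observed_strong_signals: set[str],
    policy: HiringScorecardPolicy,
) -> tuple[str, list[str]]:
    """Return (verdict, dimensions still missing strong evidence)."""
    missing = [s for s in policy.must_have_signals if s not in observed_strong_signals]
    if not missing:
        return "ACCEPT", []
    # Missing evidence routes into a re-entry path (request more observation),
    # not into a quiet hire/no-hire decision.
    return "DEGRADE", missing
```

The point of the sketch is only the shape: a freeform impression never reaches the verdict; only observed signals checked against an explicit scorecard do.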
The same is true for 1:1s.
A 1:1 should not just be a vague conversation about how things feel.
It can be structured around:
- current observation,
- blockers,
- gap against role expectations,
- next practice task,
- success condition,
- support needed.
Once those fields exist, a verifier can check whether the session produced something actionable or whether the record is still too vague to move forward.
A simple way to think about the stack
One useful compression is this:
- growth and mentoring practices define the observation templates,
- organizational design practices define the judgment policies,
- deterministic architecture defines the execution and audit runtime.
That gives you a practical implementation stack:
1. Observation layer
This is where you collect structured input:
- daily reflection entries,
- weekly growth snapshots,
- 1:1 records,
- hiring notes,
- review notes,
- roadmap decision memos,
- AI usage events.
The point is not to collect more text.
The point is to collect observation in a form that can later be checked.
2. Policy layer
This is where you define what counts as a good decision:
- role expectation policies,
- decision lane policies,
- hiring scorecards,
- evaluation policies,
- AI usage policies.
This layer decides what is acceptable, what needs review, and what must not commit.
3. Proposal layer
Humans or LLMs can generate:
- candidate questions for the next 1:1,
- a proposed next practice task,
- follow-up interview questions,
- a draft evaluation summary,
- a proposed AI usage change,
- a possible retro action item.
This layer is allowed to be flexible.
4. Verifier layer
This is the important part.
The verifier does not ask whether the proposal sounds nice.
It asks whether the required structure is present.
That means checking things like:
- is the observation sufficient,
- is the success condition concrete,
- is the decision within the allowed lane,
- is the role expectation gap grounded in evidence,
- is human review mandatory for this action,
- is the proposal forbidden under policy.
The output is not prose.
The output is a machine-readable result:
- ACCEPT
- DEGRADE
- REJECT
5. Typed execution layer
Freeform text should not be executed directly.
Only typed actions should be routed forward:
- request_more_observation
- set_next_practice_task
- schedule_followup_review
- request_additional_interview
- escalate_for_approval
- block_high_risk_ai_usage
This is the same principle as “don’t execute the LLM.”
Do not execute freeform organizational language either.
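One way to enforce that boundary is a strict allowlist at the execution gate. A minimal sketch (the action names mirror the list above; `assert_executable` is a hypothetical helper):

```python
# Allowlist of executable action types; anything else is refused,
# so freeform text can never silently become an executed action.
EXECUTABLE_ACTIONS: frozenset[str] = frozenset({
    "request_more_observation",
    "set_next_practice_task",
    "schedule_followup_review",
    "request_additional_interview",
    "escalate_for_approval",
    "block_high_risk_ai_usage",
})


def assert_executable(action: dict) -> dict:
    """Gate between proposal and execution: only typed, allowlisted actions pass."""
    action_type = action.get("action_type")
    if action_type not in EXECUTABLE_ACTIONS:
        raise ValueError(f"untyped or unknown action refused: {action_type!r}")
    return action
```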
6. Audit and learning loop
Finally, pin the decision to logs:
- input digest,
- policy version,
- verifier version,
- verdict,
- reason codes,
- missing fields,
- normalized action plan digest.
Then use those results to improve the system over time:
- golden cases,
- gap registers,
- policy updates,
- template updates,
- verifier updates.
Why DEGRADE matters so much
A lot of organizational failure is not caused by explicit bad decisions.
It is caused by vague continuation.
- “Let’s revisit this later.”
- “We had a good talk.”
- “Needs more ownership.”
- “Please think more strategically.”
- “We should improve communication.”
These are not decisions.
They are unresolved placeholders.
That is why DEGRADE matters.
In this architecture, DEGRADE is not a soft shrug.
It is a first-class state for re-entry.
It means:
- the observation is incomplete,
- the role boundary is unclear,
- the success condition is missing,
- required reviewers are absent,
- the evidence is too weak,
- the proposal is too abstract to commit.
And it should always point to what is missing.
That is the difference between “not deciding yet” and “stopping in a reusable way.”
A minimal object set
You do not need a huge platform to start.
A very small set of objects is enough:
- daily_reflection_entry
- weekly_growth_snapshot
- one_on_one_record
- role_expectation_policy
- decision_lane_policy
- hiring_scorecard_policy
- evaluation_policy
- ai_usage_policy
- proposal_packet
- verification_result
- typed_action_plan
- decision_audit_log
That is enough to complete one loop:
- capture observation,
- generate proposals,
- verify against policy,
- route typed actions,
- pin the result to logs.
Example: a 1:1 verifier
The runtime does not need to start large.
Even a small 1:1 verifier is enough to show the pattern:
- define the required fields,
- check whether they are present and concrete enough,
- return ACCEPT, DEGRADE, or REJECT,
- and emit only typed next actions.
For example, a 1:1 record might require:
- current observation,
- current blockers,
- gap against role expectations,
- next practice task,
- success condition,
- support needed.
If the next practice task exists but the success condition is missing, the verifier should not produce a vague “looks promising” result.
It should return something like:
```json
{
  "verdict": "DEGRADE",
  "reason_codes": ["missing_success_condition"],
  "missing_fields": ["success_condition"],
  "normalized_plan": [
    {
      "action_type": "request_more_observation",
      "params": {
        "questions": [
          "What would count as progress by the next session?"
        ]
      }
    }
  ]
}
```
That is the whole point.
The verifier is not trying to be creative.
It is only checking whether the structure required for commitment actually exists.
The same is true for execution.
Do not let freeform text silently become action.
Route only typed actions such as:
- request_more_observation
- set_next_practice_task
- schedule_followup_review
- request_additional_interview
- escalate_for_approval
And once the decision is made, do not preserve only a nice-sounding explanation.
Preserve the actual grounds:
- input digest,
- policy version,
- verifier version,
- verdict,
- reason codes,
- missing fields,
- normalized plan digest.
That is what makes the judgment replayable.
Readers who want the concrete implementation sketch can continue to the appendix below.
It includes a minimal architecture, example schemas, a small verifier, typed routing, pinned audit logs, golden cases, and an MVP build order.
The same runtime works for hiring, evaluation, and AI usage
Once the architecture exists, the pattern repeats.
Hiring
- input: interview notes, scorecards, observed signals
- policy: hiring scorecard policy
- proposal: follow-up questions, risk notes, hire/no-hire draft
- verify: are the must-have signals present
- output: ACCEPT, DEGRADE, or REJECT
- action: proceed, request more observation, stop
Evaluation
- input: weekly snapshots, 1:1 records, work signals
- policy: role expectation policy + evaluation policy
- proposal: draft summary
- verify: is the rating grounded in observable evidence
- output: ACCEPT, DEGRADE, or REJECT
- action: finalize, gather more evidence, escalate
AI usage
- input: AI usage event
- policy: AI usage policy
- proposal: summary, draft, recommendation, generated change
- verify: proposal-only, human-review-required, or forbidden
- output: ACCEPT, DEGRADE, or REJECT
- action: allow, hold for review, block
The same runtime can support all of them.
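As a sketch of how thin that shared runtime can be, here is a hypothetical AI usage check; the category names and the classification table are illustrative, not a real policy:

```python
# Hypothetical classification table: which AI usage categories are
# proposal-only, which require human review, and which are forbidden.
AI_USAGE_POLICY: dict[str, str] = {
    "draft_summary": "proposal_only",
    "draft_evaluation_text": "human_review_required",
    "commit_evaluation": "forbidden",
}


def verify_ai_usage(event_category: str) -> tuple[str, str]:
    """Map an AI usage event category to (verdict, typed action)."""
    # Unknown categories default to human review rather than silent allow.
    classification = AI_USAGE_POLICY.get(event_category, "human_review_required")
    if classification == "proposal_only":
        return "ACCEPT", "allow"
    if classification == "human_review_required":
        return "DEGRADE", "escalate_for_approval"
    return "REJECT", "block_high_risk_ai_usage"
```

The verdict vocabulary and typed actions are identical to the hiring and evaluation cases; only the policy table changes.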
What to build first
Do not start with a giant platform.
Start small.
Phase 1
Build:
- observation schemas,
- role expectation policy,
- a verifier,
- typed actions,
- audit logs.
That is already enough for growth loops and 1:1s.
Phase 2
Add:
- decision lane policy,
- retro records,
- gap registers,
- roadmap decision memos.
That gives you better team-level governance.
Phase 3
Add:
- AI usage events,
- AI usage policy,
- proposal packets,
- golden cases.
That gives you proposal/verification separation for AI operations too.
What makes this an actual MVP
To make this usable in practice, I would add:
- stronger DEGRADE / REJECT detection for vague language,
- input forms,
- a lightweight policy editor,
- history and trend analysis,
- multi-user roles and permissions,
- integration with existing tools,
- and then, on top of that, LLM proposal features.
That order matters.
LLM proposal generation is a good product feature.
It is not the foundation.
The foundation is:
- structured input,
- editable policy,
- stable verifier behavior,
- and auditable history.
Without that, the LLM layer is just a nice demo on top of unclear operations.
What should not be automated
This architecture is not meant to turn organizations into auto-approval machines.
Quite the opposite.
The value of the organization should remain on the verifier side:
- what counts as commit,
- what requires human review,
- what is high risk,
- what is forbidden,
- who has authority,
- and which boundaries must not be crossed automatically.
That means things like these should usually remain proposal-only or human-gated:
- final hiring decisions,
- final evaluation outcomes,
- high-risk delegation,
- major organizational restructuring.
The point is not to automate sovereignty.
The point is to make observation, verification, routing, and audit much stronger.
So what is this, really?
In one line:
a decision operating system that makes organizational judgment, development, and AI usage replayable.
That is the practical bridge.
Not “AI replacing management.”
Not “yet another workflow tool.”
Not “prompting your org chart.”
A runtime where:
- observations are structured,
- policies are explicit,
- proposals are separated from commitments,
- verifiers return stable outputs,
- actions are typed,
- logs are pinned,
- and failures become material for the next verifier improvement.
That is what makes organizational operations more reproducible.
And in the AI era, that matters much more than simply making things faster.
Because the real question is not whether somebody can generate text quickly.
It is whether that speed can be turned into decisions that remain explainable, reviewable, and correctable after the fact.
If you only wanted the architectural argument, you can stop here.
The appendix below is for readers who want the implementation sketch: minimal objects, example schemas, a small verifier, typed routing, pinned audit logs, golden cases, and a practical MVP sequence.
Appendix — A Minimal Runtime for Organizational Operations
The main article focused on the architectural idea.
This appendix makes that idea more concrete.
The point is not to define a giant enterprise platform from day one.
The point is to show that a surprisingly small runtime is enough to start:
- structured observation inputs,
- explicit policies,
- a verifier that returns ACCEPT / REJECT / DEGRADE,
- typed actions,
- pinned audit logs,
- and golden cases to keep verifier behavior stable.
That is already enough to turn a large part of organizational operations into a replayable system.
A minimal architecture
```mermaid
flowchart TD
    A["Human Inputs / Work Signals<br/>daily reflection<br/>weekly reflection<br/>1:1 notes<br/>PR / review notes<br/>roadmap memos<br/>retrospectives<br/>AI usage logs"] --> B["Normalization Layer<br/>schema validation<br/>ID assignment<br/>context binding"]
    B --> C["Proposal Layer<br/>human proposer<br/>LLM proposer<br/>question drafts<br/>next-task drafts<br/>interview follow-up drafts<br/>retro issue drafts"]
    B --> D["Deterministic Verifier Layer<br/>role expectations<br/>decision lanes<br/>responsibility boundaries<br/>hiring scorecards<br/>evaluation rules<br/>AI usage boundaries"]
    C --> D
    D -->|ACCEPT| E["Typed Actions<br/>set next practice task<br/>request review<br/>update lane<br/>register gap<br/>request interview<br/>change AI usage restriction"]
    D -->|DEGRADE| F["Re-entry Queue<br/>ask for more observation<br/>ask follow-up questions<br/>request more evidence<br/>request more approvals<br/>define re-entry conditions"]
    D -->|REJECT| G["Stop / Escalate<br/>stop execution<br/>escalate to reviewer<br/>return for human judgment"]
    E --> H["Execution Layer<br/>human execution<br/>meeting workflow<br/>HR workflow<br/>development workflow<br/>AI operations control"]
    F --> H
    G --> H
    H --> I["Pinned Logs / Replay Store<br/>input snapshot<br/>policy version<br/>reason codes<br/>missing fields<br/>normalized plan<br/>decision log"]
    I --> J["Learning Loop<br/>golden cases<br/>gap register<br/>policy updates<br/>template updates<br/>verifier updates"]
    J --> D
    J --> C
```
The most important thing here is that hiring, evaluation, 1:1s, retrospectives, roadmap decisions, and AI usage can all ride on the same spine.
The other important point is DEGRADE.
In many organizations, “we’ll think about it later” is an untyped fog state.
In this runtime, DEGRADE is a first-class re-entry state:
- not enough observation,
- no concrete success condition,
- missing reviewer,
- missing approval,
- insufficient strong signals,
- or an action proposal that is still too abstract to commit.
That makes pause states reusable instead of vague.
A minimal object set
You do not need a huge schema family to get started.
A small initial object set is enough:
- daily_reflection_entry
- weekly_growth_snapshot
- one_on_one_record
- role_expectation_policy
- decision_lane_policy
- hiring_scorecard_policy
- evaluation_policy
- ai_usage_policy
- proposal_packet
- verification_result
- typed_action_plan
- decision_audit_log
That may look like a lot, but notice the pattern:
- observation objects,
- policy objects,
- runtime objects,
- audit objects.
That is all.
Sketching the minimum schemas
You do not need fully formal JSON Schema files on day one.
A design-note-level schema is enough as long as the fields are stable.
Example: daily reflection entry
```yaml
kind: daily_reflection_entry
version: v1
id: dre_2026_04_09_user_001
person_id: user_001
created_at: 2026-04-09T20:15:00+09:00
what_done:
  - "Reviewed three pull requests"
  - "Investigated the user-list API"
  - "Used an AI agent to draft test code"
why_chosen:
  - "Review priority was high"
  - "I wanted to check dependencies before starting the next task"
insights:
  - "Giving the AI an explicit direction worked better than delegating everything"
  - "A vague answer exposed a shallow part of my own understanding"
judgment_reflection:
  good:
    - "Using waiting time for review work was a good decision"
  improve:
    - "I answered before checking the design document"
reusable_thoughts:
  - "Break requests into smaller pieces before handing them to AI"
  - "Check grounds before answering"
next_day_plan:
  - "Prioritize review work"
  - "Create a dependency-mapping sheet first"
share_or_consult:
  - "Share the AI usage insight with the team"
```
Example: weekly growth snapshot
```yaml
kind: weekly_growth_snapshot
version: v1
id: wgs_2026_w15_user_001
person_id: user_001
week_range:
  from: 2026-04-06
  to: 2026-04-12
created_at: 2026-04-12T18:00:00+09:00
summary:
  what_why:
    - "I balanced implementation, review work, and AI usage experiments"
  learning_and_gaps:
    - "Review quality improved, but dependency mapping is still slow"
self_eval:
  autonomy:
    task_ownership: partial
    blocker_handling: partial
  org_adaptation:
    implicit_norms: good
    communication: partial
  strategic_thinking:
    technical_depth: partial
    product_view: partial
next_week_focus:
  - "Make dependencies explicit earlier"
  - "Write down my own prioritization logic"
```
Example: role expectation policy
```yaml
kind: role_expectation_policy
version: v1
id: rep_senior_ic_v1
role_id: senior_ic
role_name: "Senior IC"
created_at: 2026-04-01T00:00:00+09:00
expected_outcomes:
  - "Clarify ambiguous issues and make forward progress possible"
  - "Move design and implementation decisions forward in the owned domain"
decision_scope:
  can_decide:
    - "choice of implementation approach"
    - "technical trade-off clarification"
  must_escalate:
    - "cross-unit platform changes"
    - "high-risk customer-impacting changes"
influence_patterns:
  - "make decision criteria explicit in review"
  - "separate mixed issues when discussion is confused"
reproducibility_expectation:
  - "leave behind notes or templates that preserve the judgment logic"
  - "make decisions reusable by others later"
evidence_examples:
  - "review comments"
  - "design notes"
  - "dependency mapping sheet"
```
Example: decision lane policy
```yaml
kind: decision_lane_policy
version: v1
policy_id: dlp_product_unit_v1
created_at: 2026-04-01T00:00:00+09:00
lanes:
  - lane_id: lane_1
    name: "local decision"
    can_decide_by_self: true
    requires_review: false
    requires_escalation: false
    required_inputs:
      - "current_scope"
      - "affected_component"
    forbidden_without_approval: []
    examples:
      - "small implementation change"
      - "improvement within an existing policy"
  - lane_id: lane_2
    name: "review required"
    can_decide_by_self: false
    requires_review: true
    requires_escalation: false
    required_inputs:
      - "design_note"
      - "reviewer"
    forbidden_without_approval:
      - "cross_team_policy_change"
    examples:
      - "design change across dependencies"
      - "a change with multiple valid interpretations"
  - lane_id: lane_3
    name: "escalation required"
    can_decide_by_self: false
    requires_review: true
    requires_escalation: true
    required_inputs:
      - "risk_summary"
      - "approver"
      - "rollback_plan"
    forbidden_without_approval:
      - "important_customer_impact"
      - "evaluation_commit"
      - "hiring_commit"
      - "high_risk_ai_usage"
    examples:
      - "important customer impact"
      - "evaluation, hiring, or authority transfer"
      - "high-risk AI usage"
```
Example: verification result
```json
{
  "kind": "verification_result",
  "version": "v1",
  "verification_id": "vr_001",
  "target_object_id": "one_on_one_record_2026_04_09_001",
  "policy_refs": [
    "rep_senior_ic_v1",
    "dlp_product_unit_v1"
  ],
  "verdict": "DEGRADE",
  "reason_codes": [
    "practice_task_too_abstract",
    "role_gap_not_grounded"
  ],
  "missing_fields": [
    "next_practice_task.success_condition",
    "current_blockers.dependency_scope"
  ],
  "normalized_plan": [
    {
      "action_type": "request_more_observation",
      "params": {
        "questions": [
          "What would count as progress by the next session?",
          "Which dependency is actually causing the blocker?"
        ]
      }
    }
  ],
  "reviewer_refs": [
    "manager_001"
  ],
  "created_at": "2026-04-09T21:00:00+09:00"
}
```
Example: typed action plan
```json
{
  "kind": "typed_action_plan",
  "version": "v1",
  "plan_id": "tap_001",
  "status": "READY",
  "actions": [
    {
      "action_type": "set_next_practice_task",
      "params": {
        "person_id": "user_001",
        "task": "Create a one-page dependency mapping sheet before the next session",
        "success_condition": "Dependencies, consultation targets, and unresolved issues are explicitly listed"
      },
      "authority_required": "manager",
      "execution_channel": "growth_plan",
      "rollback_hint": "replace_next_practice_task"
    },
    {
      "action_type": "schedule_followup_review",
      "params": {
        "person_id": "user_001",
        "date": "2026-04-16"
      },
      "authority_required": "manager",
      "execution_channel": "calendar",
      "rollback_hint": "cancel_followup_review"
    }
  ]
}
```
Example: decision audit log
```json
{
  "kind": "decision_audit_log",
  "version": "v1",
  "log_id": "dal_001",
  "ordering_key": "seq_00000125",
  "input_digest": "sha256:abc123...",
  "snapshot_refs": [
    "wgs_2026_w15_user_001",
    "dre_2026_04_09_user_001"
  ],
  "policy_versions": {
    "role_expectation_policy": "rep_senior_ic_v1",
    "decision_lane_policy": "dlp_product_unit_v1"
  },
  "verifier_version": "orgos-verifier-0.1.0",
  "verdict": "DEGRADE",
  "reason_codes": [
    "practice_task_too_abstract"
  ],
  "missing": [
    "next_practice_task.success_condition"
  ],
  "normalized_plan_digest": "sha256:def456...",
  "actor_refs": [
    "manager_001",
    "user_001"
  ],
  "created_at": "2026-04-09T21:00:01+09:00"
}
```
The key idea is simple:
do not preserve a nice-sounding explanation as the source of truth.
Preserve:
- input digest,
- policy version,
- verifier version,
- verdict,
- reason codes,
- missing fields,
- normalized plan digest.
That is what makes the judgment replayable.
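Replayability can be checked mechanically: recompute the input digest and compare it with the pinned log. A minimal sketch using stdlib hashing (the `canonical_json` convention and `replay_matches` helper here are assumptions, not a fixed API):

```python
import hashlib
import json


def canonical_json(obj: object) -> str:
    # Stable serialization: same input object, same bytes, same digest.
    return json.dumps(obj, ensure_ascii=False, sort_keys=True, separators=(",", ":"))


def digest(obj: object) -> str:
    return "sha256:" + hashlib.sha256(canonical_json(obj).encode("utf-8")).hexdigest()


def replay_matches(input_object: dict, audit_log: dict) -> bool:
    """True when the pinned log still corresponds to this exact input."""
    return audit_log.get("input_digest") == digest(input_object)
```

If the digest no longer matches, either the input was altered after the fact or the log refers to a different observation, and either case is worth surfacing.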
A minimal repository layout
You do not need microservices first.
One repository is enough.
```
orgos/
  schemas/
    daily_reflection_entry.yaml
    weekly_growth_snapshot.yaml
    role_expectation_policy.yaml
    decision_lane_policy.yaml
  policies/
    role_expectation/
      senior_ic_v1.yaml
    decision_lanes/
      product_unit_v1.yaml
    ai_usage/
      default_v1.yaml
  runtime/
    models.py
    verifier.py
    action_router.py
    audit_log.py
    pipeline.py
  golden/
    cases/
      01_1on1_accept.json
      02_1on1_degrade_missing_success_condition.json
      03_ai_usage_reject_forbidden_commit.json
    run_golden.py
```
That is enough to start.
The operating idea stays the same:
- humans or LLMs produce proposals,
- the verifier returns ACCEPT / REJECT / DEGRADE,
- only typed actions are executable,
- and golden cases freeze verifier behavior.
Minimal Python models
```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class Verdict(str, Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"
    DEGRADE = "DEGRADE"


@dataclass(frozen=True)
class ProposalPacket:
    proposal_id: str
    proposal_type: str
    target_object_id: str
    candidates: list[dict[str, Any]]
    producer_type: str  # "human" | "llm"
    producer_id: str


@dataclass(frozen=True)
class VerificationResult:
    verification_id: str
    target_object_id: str
    verdict: Verdict
    reason_codes: tuple[str, ...] = ()
    missing_fields: tuple[str, ...] = ()
    normalized_plan: tuple[dict[str, Any], ...] = ()


@dataclass(frozen=True)
class OneOnOneRecord:
    session_id: str
    person_id: str
    manager_id: str
    current_observation: str
    current_blockers: list[str]
    role_expectation_gap: list[str]
    next_practice_task: str | None
    success_condition: str | None
    support_needed: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class RoleExpectationPolicy:
    role_id: str
    expected_outcomes: tuple[str, ...]
    must_show_evidence: tuple[str, ...]


@dataclass(frozen=True)
class DecisionLane:
    lane_id: str
    can_decide_by_self: bool
    requires_review: bool
    requires_escalation: bool


@dataclass(frozen=True)
class DecisionLanePolicy:
    policy_id: str
    lanes: tuple[DecisionLane, ...]
```
Minimal verifier
Here is a small verifier for a 1:1 record.
```python
from __future__ import annotations


def verify_one_on_one_record(
    record: OneOnOneRecord,
    role_policy: RoleExpectationPolicy,
) -> VerificationResult:
    reason_codes: list[str] = []
    missing_fields: list[str] = []
    normalized_plan: list[dict] = []

    if not record.next_practice_task:
        reason_codes.append("missing_next_practice_task")
        missing_fields.append("next_practice_task")

    if not record.success_condition:
        reason_codes.append("missing_success_condition")
        missing_fields.append("success_condition")

    if len(record.current_blockers) == 0:
        reason_codes.append("missing_blocker_context")
        missing_fields.append("current_blockers")

    if len(record.support_needed) == 0:
        reason_codes.append("missing_support_needed")
        missing_fields.append("support_needed")

    if role_policy.expected_outcomes and len(record.role_expectation_gap) == 0:
        reason_codes.append("missing_role_expectation_gap")
        missing_fields.append("role_expectation_gap")

    if missing_fields:
        # DEGRADE path: point at what is missing and route a re-entry question set.
        questions: list[str] = []
        if "next_practice_task" in missing_fields:
            questions.append("What should be tried before the next session?")
        if "success_condition" in missing_fields:
            questions.append("What would count as progress?")
        if "current_blockers" in missing_fields:
            questions.append("What is the actual blocker right now?")
        if "support_needed" in missing_fields:
            questions.append("What support is needed to move forward?")
        if "role_expectation_gap" in missing_fields:
            questions.append("Against the current role expectation, what is actually weak?")
        normalized_plan.append(
            {
                "action_type": "request_more_observation",
                "params": {"questions": questions},
            }
        )
        return VerificationResult(
            verification_id=f"vr_{record.session_id}",
            target_object_id=record.session_id,
            verdict=Verdict.DEGRADE,
            reason_codes=tuple(reason_codes),
            missing_fields=tuple(missing_fields),
            normalized_plan=tuple(normalized_plan),
        )

    # ACCEPT path: the structure required for commitment exists.
    normalized_plan.append(
        {
            "action_type": "set_next_practice_task",
            "params": {
                "person_id": record.person_id,
                "task": record.next_practice_task,
                "success_condition": record.success_condition,
            },
        }
    )
    return VerificationResult(
        verification_id=f"vr_{record.session_id}",
        target_object_id=record.session_id,
        verdict=Verdict.ACCEPT,
        reason_codes=(),
        missing_fields=(),
        normalized_plan=tuple(normalized_plan),
    )
```
The point here is not that the verifier is “intelligent.”
The point is that it is checking a known structure:
- observation,
- blocker,
- gap against expectation,
- next practice task,
- success condition,
- support needed.
That is exactly what makes the runtime reproducible.
Typed action routing
Do not execute freeform text.
Route only typed actions.
```python
from __future__ import annotations

from typing import Any


def route_actions(result: VerificationResult) -> list[dict[str, Any]]:
    routed: list[dict[str, Any]] = []
    for action in result.normalized_plan:
        action_type = action["action_type"]
        if action_type == "request_more_observation":
            routed.append(
                {
                    "channel": "one_on_one_followup",
                    "authority_required": "manager",
                    "payload": action["params"],
                }
            )
        elif action_type == "set_next_practice_task":
            routed.append(
                {
                    "channel": "growth_plan",
                    "authority_required": "manager",
                    "payload": action["params"],
                }
            )
        else:
            # Unknown action types fall back to manual review, never execution.
            routed.append(
                {
                    "channel": "manual_review",
                    "authority_required": "manager",
                    "payload": action,
                }
            )
    return routed
```
This is the organizational version of:
Don’t execute the LLM.
Pinned audit logs
```python
from __future__ import annotations

import hashlib
import json
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def canonical_json(obj: object) -> str:
    return json.dumps(obj, ensure_ascii=False, sort_keys=True, separators=(",", ":"))


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_audit_log(
    input_object: dict,
    policy_refs: dict,
    result: VerificationResult,
    verifier_version: str,
    ordering_key: str,
) -> dict:
    input_digest = "sha256:" + sha256_hex(canonical_json(input_object))
    normalized_plan_digest = "sha256:" + sha256_hex(canonical_json(list(result.normalized_plan)))
    return {
        "kind": "decision_audit_log",
        "version": "v1",
        "log_id": f"dal_{ordering_key}",
        "ordering_key": ordering_key,
        "input_digest": input_digest,
        "policy_versions": policy_refs,
        "verifier_version": verifier_version,
        "verdict": result.verdict.value,
        "reason_codes": list(result.reason_codes),
        "missing": list(result.missing_fields),
        "normalized_plan_digest": normalized_plan_digest,
        "created_at": datetime.now(JST).isoformat(),
    }
```
The pipeline itself
Once those parts exist, the pipeline becomes simple.
```python
from __future__ import annotations


def run_one_on_one_pipeline(
    record: OneOnOneRecord,
    role_policy: RoleExpectationPolicy,
    ordering_key: str,
) -> tuple[VerificationResult, list[dict], dict]:
    result = verify_one_on_one_record(record, role_policy)
    actions = route_actions(result)
    audit_log = make_audit_log(
        input_object={
            "session_id": record.session_id,
            "person_id": record.person_id,
            "current_observation": record.current_observation,
            "current_blockers": record.current_blockers,
            "role_expectation_gap": record.role_expectation_gap,
            "next_practice_task": record.next_practice_task,
            "success_condition": record.success_condition,
            "support_needed": record.support_needed,
        },
        policy_refs={
            "role_expectation_policy": role_policy.role_id,
        },
        result=result,
        verifier_version="orgos-verifier-0.1.0",
        ordering_key=ordering_key,
    )
    return result, actions, audit_log
```
That already completes one full cycle:
- observation input,
- verification,
- typed action routing,
- and audit logging.
Golden cases: grow the verifier, not the prompt
A key point from the determinism framing is that you should stabilize verifier outputs, not LLM phrasing.
That applies here too.
A minimal golden case for a 1:1 verifier could look like this:
```json
{
  "name": "one_on_one_degrade_missing_success_condition",
  "input": {
    "session_id": "sess_001",
    "person_id": "user_001",
    "manager_id": "mgr_001",
    "current_observation": "Dependency mapping is slow",
    "current_blockers": ["Dependencies are not explicit"],
    "role_expectation_gap": ["Weak issue clarification under ambiguity"],
    "next_practice_task": "Create a dependency mapping sheet",
    "success_condition": null,
    "support_needed": ["Review the sheet format together"]
  },
  "expect": {
    "verdict": "DEGRADE",
    "reason_codes": ["missing_success_condition"],
    "missing_fields": ["success_condition"]
  }
}
```

Note that the input fills `support_needed`: with it empty, the verifier would also flag `missing_support_needed`, and the expected reason codes would no longer match.
And a minimal harness can be this small:
```python
from __future__ import annotations

import json
from pathlib import Path


def run_golden_case(path: Path) -> None:
    with path.open("r", encoding="utf-8") as f:
        case = json.load(f)

    input_data = case["input"]
    expected = case["expect"]

    record = OneOnOneRecord(
        session_id=input_data["session_id"],
        person_id=input_data["person_id"],
        manager_id=input_data["manager_id"],
        current_observation=input_data["current_observation"],
        current_blockers=input_data["current_blockers"],
        role_expectation_gap=input_data["role_expectation_gap"],
        next_practice_task=input_data["next_practice_task"],
        success_condition=input_data["success_condition"],
        # Default to empty when the golden case omits the field.
        support_needed=input_data.get("support_needed", []),
    )
    policy = RoleExpectationPolicy(
        role_id="senior_ic",
        expected_outcomes=("clarify ambiguous issues",),
        must_show_evidence=("review_comment", "design_note"),
    )

    result = verify_one_on_one_record(record, policy)

    assert result.verdict.value == expected["verdict"]
    assert list(result.reason_codes) == expected["reason_codes"]
    assert list(result.missing_fields) == expected["missing_fields"]
```
This pattern generalizes directly to hiring, evaluation, and AI usage policies.
What you gain from this
This is not just about efficiency.
It changes what kind of organizational knowledge you can actually preserve.
It helps with things like:
- daily reflection and 1:1s not ending as vague conversation,
- role expectations not remaining fuzzy language,
- hiring and evaluation becoming easier to explain afterward,
- AI usage operating with a real boundary between proposal and commit,
- retrospectives feeding into structural updates,
- and missing information turning into the next policy update or the next golden case.
In other words:
it becomes easier to convert personal tricks and implicit judgment into replayable organizational knowledge.
Where to start
Do not build everything at once.
A practical sequence is:
Phase 1
Start with:
- daily_reflection_entry
- weekly_growth_snapshot
- one_on_one_record
- role_expectation_policy
- verification_result
- decision_audit_log
That is enough for a growth and 1:1 loop.
Phase 2
Add:
- decision_lane_policy
- retro_record
- gap_register_entry
- roadmap_decision_memo
That gives you team improvement and decision tracking.
Phase 3
Add:
- ai_usage_event
- ai_usage_policy
- proposal_packet
- typed_action_plan
- golden_case
That gives you proposal/verification separation for AI operations too.
What makes it a real MVP
A proof of concept is easy.
A usable MVP needs a few more things.
The order matters.
1. Stronger DEGRADE / REJECT detection
Especially for 1:1s and evaluation text, you want better detection of vague language such as:
- “show more ownership,”
- “move things forward properly,”
- “do it well,”
- “consult when necessary.”
These phrases often hide missing structure.
A useful verifier should be able to stop them and translate them into concrete failure modes such as:
- success condition missing,
- action granularity too coarse,
- weak connection to the role expectation,
- impression-based evaluation without evidence.
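A first version of that detection does not need a model; even a lookup table of known vague phrases mapped to failure modes catches a lot. The phrase table below is illustrative, not an exhaustive policy:

```python
# Hypothetical mapping from vague management language to concrete failure modes.
VAGUE_PHRASES: dict[str, str] = {
    "show more ownership": "success_condition_missing",
    "move things forward properly": "action_granularity_too_coarse",
    "do it well": "success_condition_missing",
    "consult when necessary": "action_granularity_too_coarse",
}


def detect_vague_language(text: str) -> list[str]:
    """Return sorted reason codes for known vague phrases found in the text."""
    lowered = text.lower()
    return sorted({code for phrase, code in VAGUE_PHRASES.items() if phrase in lowered})
```

Any non-empty result is a reason to return DEGRADE instead of letting the phrase pass as feedback.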
2. Input UI
Even good schemas fail if input is painful.
You will usually want forms for:
- daily reflection,
- weekly reflection,
- 1:1 records,
- interview notes,
- retro notes.
3. Policy editor
If role expectations, decision lanes, hiring scorecards, and AI usage rules only live in code, operating them gets slow.
A lightweight policy editor matters sooner than most teams expect.
4. History and trend analysis
Audit logs are not enough by themselves.
Soon you will want to see:
- where DEGRADE is frequent,
- which reason codes are increasing,
- which role expectation produces repeated friction,
- which team or phase has the most missing fields.
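A first pass at trend analysis can be this simple: count reason codes across pinned logs. A sketch, assuming logs shaped like the decision_audit_log objects described earlier:

```python
from collections import Counter


def reason_code_trends(audit_logs: list[dict]) -> dict[str, int]:
    """Count how often each reason code appears across decision audit logs."""
    counter: Counter[str] = Counter()
    for log in audit_logs:
        counter.update(log.get("reason_codes", []))
    return dict(counter)
```

Even this crude count tells you where the verifier keeps stopping the same way, which is exactly where the next policy or template update should go.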
5. Multi-user roles and permissions
To move from a personal tool to an organizational runtime, you need at least role separation across:
- individual contributor,
- manager or mentor,
- interviewer,
- evaluator,
- administrator.
6. Existing workflow integration
Real adoption gets easier once the system connects to existing surfaces like:
- chat,
- calendar,
- docs,
- GitHub / pull requests,
- task management,
- HR tooling.
7. LLM proposal features
This is the eye-catching feature layer:
- draft 1:1 questions,
- draft next practice tasks,
- draft deeper interview questions,
- draft evaluation text,
- extract structural retro issues,
- propose policy changes.
But this should come last.
LLM proposal features are not the foundation of the MVP.
They are the visible layer on top of the foundation.
That foundation is:
- structured input,
- editable policy,
- stable verifier behavior,
- history and audit.
Without that, the LLM layer is mostly a nice demo.
What should not be automated
This architecture is not for replacing organizational judgment with AI.
It is for strengthening the observation, verification, recording, and re-entry parts of organizational judgment.
That means many high-stakes commitments should remain proposal-only or human-gated:
- final hiring decisions,
- final evaluation outcomes,
- high-risk authority transfer,
- major organizational restructuring.
The value of the organization should remain on the verifier side:
- what counts as commit,
- what requires review,
- what is high risk,
- what is forbidden,
- and who has authority.
Final compression
If you want the shortest version of the appendix, it is this:
- make observation structured,
- make policy explicit,
- separate proposal from commit,
- freeze verifier outputs with golden cases,
- execute only typed actions,
- keep the grounds in pinned logs.
That is enough to start building a decision operating system for organizational work.