"I've been building agent onboarding for months. Here's what everyone gets wrong about the interview approach — and what the system actually needs to look like."
There's a growing consensus in the agent space that the real bottleneck isn't installation or model selection — it's getting the human's operational knowledge out of their head and into a format an agent can use. I agree with that diagnosis completely. I've been saying it for months. Others are saying it now too, and that's a good sign.
But here's where I break from the pack: most people stop at "we should interview the user." They build a questionnaire, maybe a conversational flow, capture some answers, generate some config files, and call it done.
That's not a solution. That's a first draft of a solution. And I know because I've been deep in the build — not theorizing about what agent onboarding should look like, but actually writing the code, hitting the walls, and discovering what the system needs to survive contact with real humans.
What I've found is that the interview itself is only about 30% of the problem. The other 70% is what happens after — and almost nobody is talking about it.
The Dirty Secret: People Can't Answer the Questions
Everyone building an "interviewer agent" makes the same assumption: ask the right questions, get usable answers.
You won't.
I don't say this to be cynical. I say it because I've watched it happen. You ask someone "what workflows does this agent participate in?" and you get:
"It needs to handle the tickets that come in and make sure they get to the right person and follow up if nobody responds."
That's real intent. It's also completely unusable as an agent specification. Three distinct workflow steps are compressed into one sentence, with no decision logic, no escalation rules, no success criteria, and no scope boundaries.
Ask "what does success look like?" and you get "it does a good job." Ask "what tools does this agent need?" and you get "whatever it needs." Ask "how should it handle edge cases?" and you get the answer that haunts every agent builder alive: "it should use its best judgment."
This isn't the human's fault. This is the tacit knowledge problem doing what it always does. The more experienced someone is, the more their expertise has been compressed into instinct. A senior product manager doesn't think "cross-reference the revenue dashboard with churn data before forming an opinion." They open three tabs, glance at the numbers, and just know. The actual decision involved a hundred micro-evaluations reflecting thousands of hours of pattern recognition. That doesn't get articulated because it doesn't feel like a process. It feels like seeing.
So if your interviewer agent asks great questions and accepts whatever the human says at face value, you end up with agent specs full of vague intent and zero operational teeth.
The interview isn't enough. You need a refinement engine.
Elicit, Refine, Lock
Every question in the onboarding process I'm building follows a three-step cycle.
Elicit. The system asks the question in plain language and shows a formatted example of what a good answer looks like. Not to constrain — to show the shape. The human responds however they want. Broken English, bullet fragments, stream of consciousness, a voice-to-text dump from their car. Anything.
Refine. The AI takes that raw input and restructures it into what the agent spec actually requires. This isn't cosmetic reformatting. It's substantive work. The AI decomposes compound statements into discrete, addressable items. It identifies gaps and asks targeted follow-ups — "you mentioned ticket routing, but what determines which person gets which ticket?" It converts vague language into specific conditions. "Handle it quickly" becomes "respond within 2 hours for P3, 30 minutes for P1-P2" after asking what "quickly" means in their context. It separates what the agent does from what it doesn't do.
Lock. The refined version is presented alongside the original. Three options: approve and move on, edit and have the AI re-refine, or redo from scratch. Nothing locks without explicit approval. And once it's locked, later answers are cross-referenced against earlier ones. If you say the agent "never modifies billing records" in one track but describe a workflow with billing adjustments in another, the system flags the contradiction before anything ships.
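In code, the cycle is roughly a loop that only exits on explicit approval. A minimal sketch, where `ask`, `refine`, and `review` are hypothetical stand-ins for the UI prompt and the LLM call, not real ArgentOS functions:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    raw: str              # the human's freeform response, any shape
    refined: str          # the AI-structured version of the same intent
    locked: bool = False  # nothing locks without explicit approval

def elicit_refine_lock(question, ask, refine, review) -> Answer:
    raw = ask(question)                       # Elicit: plain language in
    answer = Answer(raw=raw, refined=refine(raw))
    while True:
        decision, edit = review(answer)       # show refined next to original
        if decision == "approve":
            answer.locked = True              # only now does it enter the spec
            return answer
        if decision == "edit":                # human edits, AI re-refines
            answer.refined = refine(edit)
        elif decision == "redo":              # start over from scratch
            answer.raw = ask(question)
            answer.refined = refine(answer.raw)
```

The loop shape is the point: there is no code path where raw input reaches the spec without passing through refinement and approval.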
Here's a concrete example. The system asks: "What rules does this agent always follow?"
The human types:
"don't close tickets without asking the customer, always cc the account manager on big accounts, don't mess with billing stuff"
The AI refines that into:
Always: Check customer tier before responding — enterprise gets priority. Log every action on a ticket with timestamp. CC the account manager on any enterprise ticket response.
Never: Close a ticket without customer confirmation. Share internal pricing with customers. Modify billing records directly — always route to billing team.
When uncertain: Default to escalating rather than guessing. Ask the customer for clarification rather than assuming.
Same intent. Completely different usability. The human provided the knowledge. The AI provided the structure.
That example reflects the principle the entire refinement engine is built on: agents don't have judgment, they have rules. Every instance of "use best judgment" has to be converted into explicit decision logic with defined conditions and outcomes. If the human can't articulate the rule, the AI helps extract it through follow-up questions. Vague intent never gets locked in. Only structured rules do.
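A locked rule set can be as simple as a structure with explicit always/never/when-uncertain buckets that downstream checks can actually test against. A sketch, with illustrative field names and a deliberately naive substring check:

```python
from dataclasses import dataclass, field

@dataclass
class RuleSet:
    """Structured rules locked during the interview; shape is illustrative."""
    always: list = field(default_factory=list)
    never: list = field(default_factory=list)
    when_uncertain: list = field(default_factory=list)

    def permits(self, action: str) -> bool:
        # An action is allowed only if no "never" rule matches it.
        # Real matching would be richer than substrings; this shows the idea.
        return not any(forbidden in action for forbidden in self.never)

rules = RuleSet(
    always=["check customer tier before responding",
            "log every ticket action with timestamp"],
    never=["close ticket without customer confirmation",
           "modify billing records"],
    when_uncertain=["escalate rather than guess"],
)
```

The value isn't the data structure itself; it's that "don't mess with billing stuff" is now a condition a runtime can evaluate instead of a sentence a model has to interpret.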
Tool Gaps Shouldn't Be Discovered After Deployment
There's another problem the interview surfaces that nobody's building for: tool discovery.
When you ask someone "what tools does this agent need?" they answer in terms of what the agent needs to do, not what tools exist in the catalog. They say "it needs to check our CRM" or "it has to pull data from our billing API." They don't know whether a tool for that already exists, what it's called, or whether it covers their specific use case.
So the interview includes a real-time tool discovery step. When the human describes a capability, the system checks the tool catalog. Full match? Here's the tool. Partial match? Here's what's covered and what isn't — the Salesforce read tool handles contacts and deals but doesn't pull activity history. No match? The system generates a tool specification that feeds directly into the CLI tool creation pipeline.
An agent should never be deployed and then discover it can't do half its job because the tools don't exist. That gap gets identified during the interview, and the system produces the spec that makes building the missing tool straightforward instead of a guessing game.
Agents can still be provisioned while a tool is pending. Some gaps are blocking — the agent literally can't function without that tool and stays in a pending_tools state until it's built. Others are non-blocking — the agent operates with reduced capability and a noted gap. The operator decides which is which during the interview, before anything reaches production.
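The matching step can be sketched as set coverage against the catalog. The shapes below are assumptions for illustration, not the ArgentOS API:

```python
def match_capability(need: set, catalog: dict):
    """Match a described capability (a set of required operations) against a
    tool catalog mapping tool name -> operations it covers. Returns the match
    kind, the best tool, and the uncovered gap that feeds spec generation."""
    best_name, best_covered = None, set()
    for name, ops in catalog.items():
        covered = need & ops
        if len(covered) > len(best_covered):
            best_name, best_covered = name, covered
    if best_covered == need:
        return ("full", best_name, set())
    if best_covered:
        return ("partial", best_name, need - best_covered)  # gap to build
    return ("none", None, need)  # whole capability feeds the tool spec pipeline

catalog = {"salesforce_read": {"contacts", "deals"}}
match_capability({"contacts", "deals", "activity_history"}, catalog)
# → ("partial", "salesforce_read", {"activity_history"})
```

The partial case is the interesting one: the returned gap set is exactly what the CLI tool creation pipeline needs as input.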
One Pipeline. Every Agent. No Exceptions.
Here's something I learned the hard way building ArgentOS: if you have two onboarding paths, they will diverge, and the divergence will create agents with inconsistent identity states.
We had it. A core agent creation path that was careful — normalized IDs, checked for duplicates, wrote full config entries, bootstrapped workspaces. And a worker agent path that came later, probably under time pressure, and took shortcuts. Silent upserts that could overwrite existing agents without confirmation. Skeletal config entries missing half the fields. Legacy workspace paths pointing at directories that no longer existed.
Two paths, one system, growing quietly out of sync.
The fix wasn't patching the second path to match the first. The fix was eliminating the second path entirely. One pipeline. Every agent goes through it. Family agents and worker agents share the same identity validation, the same provisioning, the same base onboarding. They diverge only at governance — after identity is established, not before.
Family agents get lighter governance. They're flexible, reassignable, broader in scope. Worker agents get the business-class layer: department alignment, restricted tool access, compliance controls, operator-tier governance. But both enter existence through the same front door.
The interview naturally adapts to the type. A family agent interview might be four tracks — identity, operational context, tools, and communication style. A worker agent in a Business edition deployment gets those four plus organizational alignment, compliance and governance, tool governance, and operational boundaries. It's additive, not separate. Same process, deeper for workers.
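The additive track structure is almost trivial to express, which is the point. One pipeline, with worker tracks appended rather than forked; track names come from the text above, the function shape is assumed:

```python
BASE_TRACKS = ["identity", "operational_context", "tools", "communication_style"]
WORKER_TRACKS = ["organizational_alignment", "compliance_governance",
                 "tool_governance", "operational_boundaries"]

def interview_tracks(agent_type: str) -> list:
    """Every agent starts from the same base tracks; worker agents get the
    governance tracks appended after identity, never a separate path."""
    tracks = list(BASE_TRACKS)
    if agent_type == "worker":
        tracks += WORKER_TRACKS   # additive divergence, post-identity only
    return tracks
```

Because there is one function, there is one place where divergence can happen, and it is structurally impossible for the two agent types to drift apart on identity.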
The Part Nobody's Talking About: What Happens After the Interview
This is where I think the entire conversation about agent onboarding goes off the rails, and it's the thing I care about most.
Every approach I've seen treats the interview output as the destination. You capture the knowledge, you generate the config files, you provision the agent. Done. Ship it.
But that treats the agent's operational context as a snapshot. A moment in time. What you knew about your job on the day you sat down for the interview. And here's the problem with snapshots: they start decaying the moment you take them. You learn something new next week. A process changes. A tool gets replaced. A client relationship shifts. The snapshot doesn't know any of that.
If your agent is running on a static extraction from three months ago, it's operating on stale context. It's the same problem as the outdated wiki that nobody maintains — except now the outdated wiki is actively making decisions on your behalf.
This is the gap I've been engineering around in ArgentOS, and it required rethinking what memory actually means in an agent system.
Most agent memory systems are filing cabinets. You store things. You retrieve things. Maybe you search semantically. That's useful, but it's passive. The memory sits there until someone asks for it.
What I built is a closed loop. It's not my second brain. It's the agent's brain. And it works the way a brain should work — not just storing, but actively learning.
The loop has seven stages: act, prove, remember, reflect, compress, promote, reuse.
Act. The agent does work. Executes tasks. Uses tools. Produces outputs.
Prove. This is the honesty layer most people skip. Tool-claim validation compares what the agent said it did against what tools actually ran. If the agent claims it completed a task but no tools fired, the system knows. Fake productivity doesn't become learning. Only verified actions enter the memory system.
Remember. Verified actions, outcomes, operator corrections, task results — all captured into the memory system. Not just what happened, but the context around it.
Reflect. Contemplation cycles load recent tasks, memories, and lessons. They produce structured episodes — not raw logs, but synthesized reflections on what worked, what didn't, and why.
Compress. The Self-Improving System (SIS) looks across episodes and tool outcomes to extract lessons: mistakes, successes, workarounds, discoveries. Pattern recognition across experience, not just single-event recall.
Promote. This is the part that changes everything. Strong patterns don't just stay as memories or lessons. They get promoted into Personal Skills — operator-specific learned procedures that carry higher authority than generic skills because they were earned in context. A Personal Skill isn't something someone wrote in a config file. It's something the agent learned from doing the work, making mistakes, getting corrected, and proving the correction worked.
Reuse. When the agent encounters a similar situation, Personal Skills match ahead of generic skills. The agent doesn't just remember what happened last time. It has a promoted, evidence-tested procedure for handling it. And the system surfaces which type of skill was matched — generic or personal — so the operator can see the agent's learning in action.
Then the cycle repeats. The reuse generates new actions, which get proven, remembered, reflected on, and potentially compressed into new or refined Personal Skills.
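Of the seven stages, Prove is the easiest to make concrete. A minimal sketch of tool-claim validation, with hypothetical record shapes standing in for the real event log:

```python
def prove(claimed_actions: list, tool_log: list) -> list:
    """Keep only actions whose claimed tool calls actually appear in the
    execution log. An action with no tool calls earns no credit at all:
    fake productivity never enters the memory system."""
    fired = {(e["tool"], e["task_id"]) for e in tool_log}
    return [a for a in claimed_actions
            if a["claimed_tools"]  # claimed work with zero tool calls fails
            and all((t, a["task_id"]) in fired for t in a["claimed_tools"])]
```

Everything downstream of this filter, memories, episodes, lessons, promotions, inherits its honesty, which is why skipping this layer quietly poisons the whole loop.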
The key innovation is the Personal Skills layer. Before Personal Skills, reuse mostly meant: recall the memory, inject the lesson, maybe match a generic skill. That helped, but it didn't fully promote repeated experience into durable procedural authority. The agent could remember that something happened, but there wasn't a strong enough promoted layer for "I have done this before, this is now one of my learned procedures, and I should favor this over a generic skill because it came from repeated real work."
Personal Skills close that gap. They're operator-specific learned procedures derived from memories, episodes, SIS lessons, task outcomes, and operator corrections. Each candidate goes through review for procedurality, evidence, recurrence, and specificity. Only strong candidates get promoted. And promoted Personal Skills match ahead of generic skills — not because someone configured it that way, but because the agent earned that authority through validated experience.
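The current deterministic gate might look something like this. The four review criteria come from the text; the thresholds are purely illustrative:

```python
def review_candidate(c: dict) -> bool:
    """Conservative gate for promoting a candidate into a Personal Skill.
    Only strong candidates pass; everything else stays a lesson or memory."""
    return (c["is_procedural"]             # a repeatable procedure, not a one-off
            and c["evidence_count"] >= 3   # backed by multiple verified outcomes
            and c["recurrence"] >= 2       # the pattern has actually recurred
            and c["specificity"] >= 0.7)   # scoped, not a vague generality
```

Note that this is binary by construction, which is exactly the limitation discussed in the pressure points section later.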
The distinction between generic and personal skills is surfaced visibly to the operator. When the agent handles a situation using a Personal Skill, you can see it. You can see the difference between "I'm following a generic procedure" and "I'm following something I learned from working with you." That transparency builds trust, and it creates a feedback loop — the operator sees the learning happening, which encourages them to invest more in corrections and refinements, which generates more learning material, which produces more Personal Skills.
Contemplation cycles don't just produce reflections — they also review pending promotions, emerging Personal Skill candidates, recent lessons, and recent task outcomes. That review feeds back into storage so the agent keeps getting better at what it does. The memory system isn't passive storage. It's an active learning engine with a promotion pipeline.
Why This Changes the Onboarding Conversation
Go back to the interview now and think about what it's actually doing in this context.
The interview isn't generating config files. It's seeding the memory system. It's providing the initial substrate — the foundational knowledge that the agent's brain will build on from day one.
The Elicit → Refine → Lock process captures the operator's explicit knowledge: their rules, workflows, decision patterns, tool needs. That becomes the base layer. But the moment the agent starts working, the implicit knowledge starts accumulating. The corrections the operator makes. The edge cases that come up. The tools that work better than expected. The workflows that turn out to be wrong.
All of that feeds back through the loop. Some of it becomes lessons. Some of those lessons get promoted into Personal Skills. The agent doesn't just know what the operator told it during the interview. It knows what it learned from working with that operator over weeks and months.
The interview captures what you know today. The memory system captures what both of you learn tomorrow. And the Personal Skills layer means that learning isn't just remembered — it's operationalized. It becomes procedure. It becomes the agent's earned expertise.
That's the difference between a static snapshot and a living system. And it's why the interview, as important as it is, is just the seed. What grows from it is the actual product.
The Organizational Layer
For individual operators, this loop is powerful on its own. But for organizations deploying agents across teams, there's another dimension that multiplies the value.
In ArgentOS Business edition, onboarding works in layers. Before any agents are provisioned, the system interviews the operator — the person deploying ArgentOS into the organization. How is your company structured? What departments exist? What compliance requirements apply? What are your security policies? That becomes the governance foundation every agent inherits automatically.
Then each department head or team lead goes through their own interview. How does your team work? What decisions require approval? Where are your friction points? That produces department-scoped governance.
Then individual agents get provisioned within that structure. Each one inherits organizational constraints from the operator interview, departmental context from the team interview, and its own role-specific identity from the base interview. Three layers of context, all captured through conversation, all enforced through a governance hierarchy.
And here's the part that matters for the enterprise sale: this is institutional memory as infrastructure. The company's operating knowledge isn't trapped in people's heads or scattered across wikis nobody reads. It's structured, actively governing agent behavior, and — through the Personal Skills loop — continuously evolving as the agents learn the actual nuances of how each team operates.
When someone leaves, their knowledge doesn't walk out the door. It's already in the memory system. Already shaping agent behavior. Already part of the organizational fabric.
And the governance model enforces something most agent systems ignore entirely: inheritance that only flows in one direction. Organizational constraints flow down automatically. A department can add restrictions on top of organizational policy — but it can never remove them. An individual agent can add restrictions on top of department policy — but it can never override what the department or organization has set. The agent can't grant itself permissions the organization hasn't approved.
This maps directly to what I call the three-tier intent hierarchy. Operator intent sits at the top — organizational policies, compliance, security. Agent intent sits in the middle — role-specific scope, workflows, decision rules. Task intent sits at the bottom — specific instructions for a given piece of work. Each tier must operate within the constraints of the tier above it. The interview captures all three layers. The governance system enforces them at runtime.
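One-directional inheritance falls out naturally if restrictions only ever accumulate. A sketch, with illustrative policy names:

```python
def effective_restrictions(org: set, dept: set, agent: set) -> set:
    """Constraints flow down and only accumulate: each tier can add
    restrictions but has no operation for removing what a tier above set."""
    return org | dept | agent   # union only; subtraction doesn't exist here

org = {"no_phi_export"}               # illustrative HIPAA-style org policy
dept = {"no_direct_billing_writes"}   # department adds on top
agent = {"read_only_crm"}             # agent narrows itself further
```

The design choice is that the API simply provides no way to express "remove an inherited restriction"; enforcement by construction rather than by runtime check.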
For an MSP deploying this into a client environment, the pitch writes itself. A healthcare client gets HIPAA-aware governance baked into every agent at onboarding time, not bolted on after the fact. A financial services client gets SOC2 compliance constraints inherited by every worker agent in every department. Nobody has to remember to check the box. The system already knows.
Where This Can Still Break
I'm not going to pretend this is solved. The memory loop works. Personal Skills are real and they're producing measurable improvements in agent behavior. But parts of the system are still fragile, and I want to be honest about the pressure points because I think they matter more than the polished version of the story.
Promotion thresholds. Right now we have a conservative deterministic gate for promoting a skill candidate into a Personal Skill. That's acceptable as a first pass, but it's not sufficient long-term. The next evolution needs real confidence scoring, reinforcement when a skill performs well, decay when it hasn't been used or validated recently, active demotion when evidence turns against it, and contradiction review when a new lesson conflicts with an existing promoted skill. Without those, the promotion layer is binary — in or out — and reality isn't binary.
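A non-binary promotion score could combine reinforcement, penalty, and decay in a single update. Everything here, especially the constants, is an assumption about how such a score might work, not the shipped mechanism:

```python
def update_confidence(conf: float, succeeded: bool,
                      days_since_use: int, decay: float = 0.01) -> float:
    """Sketch of non-binary skill confidence: reinforce on success, penalize
    harder on failure, and decay with disuse. Below some demotion threshold,
    the skill would drop back to being a lesson."""
    conf += 0.1 if succeeded else -0.2   # failures cost more than successes earn
    conf -= decay * days_since_use       # unused, unvalidated skills lose authority
    return max(0.0, min(1.0, conf))      # clamp to [0, 1]
```

The asymmetry (failures penalized more than successes rewarded) is a deliberate conservative bias: procedural authority should be hard to earn and easy to lose.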
Skill collision and drift. As the agent accumulates Personal Skills over time, some of them are going to conflict. A procedure learned in January might contradict a correction made in April. Without lineage tracking, the agent can't tell which one should win. A Personal Skill needs to be able to declare relationships: this supersedes that one, this narrows the scope of that one, this contradicts that one, this was merged from these two. Without that, you get an agent accumulating conflicting procedural authority — and that's how you get brittle, unpredictable behavior.
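Lineage can be modeled as explicit relationships on the skill itself, so conflict resolution consults declared history instead of guessing from timestamps. The relationship names come from the text; the rest is a sketch:

```python
from dataclasses import dataclass, field

@dataclass
class PersonalSkill:
    """A promoted skill carrying its own lineage metadata."""
    skill_id: str
    supersedes: list = field(default_factory=list)
    narrows: list = field(default_factory=list)
    contradicts: list = field(default_factory=list)
    merged_from: list = field(default_factory=list)

def resolve(a: PersonalSkill, b: PersonalSkill) -> PersonalSkill:
    """Pick a winner from declared lineage; anything undeclared is escalated
    to contradiction review rather than silently guessed."""
    if b.skill_id in a.supersedes:
        return a
    if a.skill_id in b.supersedes:
        return b
    raise ValueError("unresolved conflict: needs contradiction review")
```

The April correction wins over the January procedure only because the promotion step recorded that relationship, not because it's newer.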
Injection weighting. Once memories, lessons, session summaries, and Personal Skills are all entering the agent's runtime context, you need an explicit precedence model. What wins when a memory says one thing and a Personal Skill says another? What happens when three different knowledge sources all claim relevance to the current task? Without deliberate injection weighting, prompt composition becomes accidental steering. The agent's behavior becomes a function of whichever piece of context happened to land closest to the instruction, not which one should actually govern the decision.
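An explicit precedence model is ultimately a deliberate sort key. A sketch with an assumed ordering and item shape; the real weighting would be richer, but the principle is that source authority ranks before raw relevance:

```python
PRECEDENCE = {"personal_skill": 4, "generic_skill": 3,
              "lesson": 2, "memory": 1, "session_summary": 0}

def compose_context(items: list, budget: int = 3) -> list:
    """Rank candidate context by source precedence first, relevance second,
    instead of letting prompt position decide what steers the agent."""
    ranked = sorted(items,
                    key=lambda i: (PRECEDENCE[i["kind"]], i["relevance"]),
                    reverse=True)
    return ranked[:budget]   # only the top few ever enter the prompt
```

With this in place, a highly relevant memory can inform a decision but can never outrank a Personal Skill that directly governs it.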
The procedural execution gap. Right now, promoted Personal Skills are still guidance — high-authority context injected into the prompt. They're not executable procedures. That's the difference between procedural memory and procedural runtime. The agent knows how it should handle something, but the handling still flows through the general reasoning path. The next step is execution mode for high-confidence Personal Skills — where the agent can follow a verified procedure directly when preconditions are met, rather than reasoning from scratch every time. But that only works after conflict resolution is solid. Executable procedures without contradiction handling are how you get bad behavior at speed.
Memory growth at scale. As the memory system accumulates months and years of experience, retrieval becomes a governance problem, not just a storage problem. You need compression, archival tiering, stronger recency-vs-importance balancing, and better review loops. An agent with a thousand Personal Skills and ten thousand memories needs to find the right three pieces of context for any given moment — and "right" is doing a lot of work in that sentence.
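One common shape for that recency-vs-importance balance is importance weighted by exponential decay. A sketch with an illustrative half-life, not a tuned retrieval function:

```python
def retrieval_score(importance: float, days_since_use: float,
                    half_life_days: float = 30.0) -> float:
    """A memory's retrieval weight: its importance, discounted by how long
    it has gone unused. After one half-life the weight drops by half."""
    recency = 0.5 ** (days_since_use / half_life_days)
    return importance * recency
```

Under this scoring, a moderately important memory used yesterday beats a critical one untouched for months, which is roughly the behavior you want from a system that should favor living context over stale extractions.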
I'm listing these not because they're unsolvable, but because they're the honest next frontier. The learning loop works. The Personal Skills architecture is sound. But the difference between "works" and "works reliably at scale over time" lives in these pressure points. And that's where the build is right now.
The Divide That's Coming
I want to end on something uncomfortable.
Agents are about to create a visible divide — not between people who have access to AI and people who don't, but between people who can externalize their expertise and people who can't.
The people who invest the time — who sit through the interview, who correct the agent when it's wrong, who let the learning loop do its work — will get compounding returns. Their agents will get better every week. By the tenth agent they provision, the system already knows enough about how they work that setup takes minutes.
The people who skip all of that will install an agent, play with it for a weekend, hit the wall, and conclude agents are hype. They'll be wrong. The agent was fine. The problem was never the agent.
This divide already existed. People who could delegate well got promoted. People who could document their processes preserved institutional knowledge. Agents just make the divide visible. And they create a new, selfish incentive to close it — because for the first time, externalizing what you know doesn't just help the next person who takes your job. It gives you leverage right now. Your corrections become the agent's Personal Skills. Your knowledge becomes earned procedure. The more you teach, the more capable your system becomes.
That's not documentation as a chore. That's documentation as compound interest.
I'm building this into ArgentOS — a self-hosted AI operating system where the agent's memory isn't a filing cabinet, it's a brain. The unified onboarding pipeline and interview harness are next on the roadmap, building on a memory system that already knows how to learn.
This is a Frontier Operations dispatch. I write about what I'm actually building, not what looks good on a slide deck.
Jason Brashear has been building software since 1994 and is a partner at Titanium Computing in Austin, TX. He currently runs multiple AI-powered SaaS products and writes about the operational reality of AI-native development in the Frontier Operations Series.