Connor Hickey

Posted on Jun 2

Before the First Token -- The AI Coding Interview as Preregistration, Not Prompt Theater

#ai #interview #career #agents

The coding interview has always been a compromise with a bad conscience.

It tries to answer a serious question — can this person do the work? — by staging a performance that only partially resembles the work. A candidate is placed in front of an interviewer, handed a small problem, asked to think aloud, and expected to produce code under conditions stripped of nearly everything that makes software engineering real: the existing codebase, the tests, the team's conventions, the build system, the stale documentation, the product boundary, the hidden invariant, the reviewer who will ask why the diff is so large.

The industry knows this. It complains about LeetCode, whiteboards, anxiety, memorized patterns, and the strange theater of narrating thought while being watched. Then it keeps the ritual anyway, because the ritual has one defensible property: it produces a signal. Not always a fair signal. Not always the right signal. Some signal. A person who can reason clearly, write code, and recover under pressure probably has technical ability. A person who cannot may still be good, but the interview has no clean way to see it.

Artificial intelligence does not remove this problem. It sharpens it.

If software development increasingly happens with AI assistance, an interview that bans AI begins to resemble a purity test for a version of the job that is already aging. At the same time, an interview that simply allows AI can become worse than the old ritual. It may confuse the model's competence with the candidate's competence. It may reward tool access, private prompt libraries, hidden assistance, or the luck of a good completion. It may replace LeetCode theater with prompt theater.

The important question, then, is not whether candidates should use AI in coding interviews.

The important question is what the interview should make visible once AI is allowed.

The answer is older than AI.

Great engineers have always had to enter unfamiliar systems and decide what matters. They have always had to find the relevant files, preserve the hidden contract, identify the meaningful test surface, avoid unnecessary blast radius, and explain why a local change is safer than a heroic rewrite. They have always had to know what not to touch. They have always had to turn ambiguity into bounded work.

AI does not create that judgment.

AI creates a chance to preregister it.

That is the stronger version of the future coding interview. Give the candidate a repository, a realistic task, a fallible coding agent, and a bounded environment. Before the candidate asks the model for anything, require a short pre-agent record: what they believe the task is, which files likely matter, which invariants must hold, what the agent may touch, what it must avoid, what tests would count as evidence, and what "done" means.

The value of that record is not that it perfectly captures the candidate's judgment. It will not. The value is that it is written before the evidence arrives.

That is the point.

In science, preregistration does not matter because the first hypothesis is always correct. It matters because the researcher commits before seeing the result. They cannot quietly become the person who predicted whatever happened. The same logic applies here. Before the agent emits a patch, before the tests pass or fail, before the candidate sees what the machine makes easy or hard, the candidate has to write down their theory of the work.

Then the interview has something the old format rarely had: a record of prior commitment.

Did the candidate's map find the right subsystem? Did they protect the correct invariant? Did their definition of done survive contact with the patch? Did they revise their theory honestly when the code contradicted it? Did they hold the line against a model suggestion that violated their own stated constraints? Or did they retrofit a story after the agent produced something plausible?

That is the signal. The artifact alone is not the signal. The defense alone is not the signal. The signal is the full instrument: a pre-commitment, an AI-assisted attempt, the task reality established by tests and review, and a defense of the gap between them.

The old whiteboard measured the wrong fifteen minutes: keystrokes, syntax, and public performance. A better AI-era interview measures the earlier decision: how the candidate scopes the work before implementation begins, and how honestly they revise that scope when reality pushes back.

The future coding interview should begin before the first token.

This Is a Forecast, Not a Victory Lap

AI-assisted development is already common enough that hiring cannot ignore it. Stack Overflow's 2025 Developer Survey reported that 51% of professional developers use AI tools daily. The same survey reported that 66% of developers are frustrated by AI solutions that are "almost right," 45% say debugging AI-generated code is more time-consuming, and 46% actively distrust AI-tool accuracy. That is the world hiring is entering: heavy usage, low trust, and a widening need for human verification.

Companies are beginning to adapt. Canva announced in June 2025 that backend, machine-learning, and frontend engineering candidates are expected to use tools such as Copilot, Cursor, and Claude during technical interviews. Canva's rationale is direct: its engineers use AI in daily work, so interviews that prohibit those tools fail to assess how candidates would perform in the actual role. Canva is not alone. Google has been reported to be piloting a "code comprehension" round in which candidates read, debug, and optimize an existing codebase with an AI assistant, and Meta has been reported to evaluate AI-assisted interviews partly on verification — the same axis this essay treats as central. These are reported moves, not published doctrine, so they are evidence of direction, not proof of method.

That direction does not prove AI-assisted interviews are valid.

It does not even prove that AI tools currently improve senior engineering work in mature repositories. METR's July 2025 randomized controlled trial found that experienced open-source developers working on their own repositories took 19% longer with early-2025 AI tools, and METR framed the result as a snapshot of one relevant setting rather than a universal law.

That finding is not fatal to this argument. It makes the argument more honest.

The case for AI-aware interviews is not that AI is already an unambiguous productivity win in every repository. The case is that software work is moving toward human-machine collaboration, and the human skill that matters most in that collaboration is judgment over the tool.

This essay is therefore making a narrower claim than "AI is the future, so interviews must test AI fluency."

The claim is this:

As AI coding agents become normal development tools, interviews should use them to expose durable engineering judgment — especially scoping, context selection, verification, revision, and review — rather than treating generated code as proof of candidate ability.

That claim is partly present-tense and partly forecast. The present-tense part is that many developers already use AI daily and some companies already design interviews around it. The forecast is that the most durable hiring signal will not be prompt fluency, turn-budget discipline, or token thrift. It will be the candidate's ability to govern automation.

Current agent mechanics will rot. Prompt syntax will change. Context windows will grow. Agent interfaces will become more autonomous. Turn counts, file limits, approval modes, and "best prompting practices" may age quickly.

The durable skill is older than AI: understanding a system well enough to change it safely.

The interview should bet on that.

The Pre-Agent Record

A pre-agent work sample begins with a simple rule: the candidate may inspect the repository manually, but may not ask the AI assistant to act until they have produced a short setup record.

That record might include a task summary, a context map, an AGENTS.md or equivalent agent-instruction file, a verification plan, a risk note, and a definition of done.

The format matters less than the timing. The candidate has to commit before the model produces code.

A weak pre-agent record says: Use clean code. Follow best practices. Add tests. Be careful.

That is not judgment. That is engineering perfume.

A stronger record says:

The likely surface is validator.ts, schema.ts, and validator.test.ts. The router is out of scope unless the failing behavior cannot be reproduced at the validation layer. Preserve the public error-response shape. Reuse formatValidationError rather than adding a new formatter. Add a regression test for missing nested array values. Do not add dependencies. The change is done when the new regression test and the existing validator suite pass.

This is not merely an instruction to the agent. It is a claim about the system. The candidate is making falsifiable commitments: I think this is where the problem lives. I think this is the boundary. I think this is the invariant. I think this test would prove the change. I think these subsystems should remain untouched.
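One way to make those commitments concrete is to give the record a fixed shape. The sketch below is illustrative only: the `PreAgentRecord` type, its field names, and the `missingCommitments` helper are assumptions introduced here, not part of any existing tool. It writes the stronger example record as data and checks that no commitment was left blank before the agent is unlocked.

```typescript
// Hypothetical shape for a pre-agent record. All names are illustrative.
interface PreAgentRecord {
  taskSummary: string;
  likelyFiles: string[];      // context map: where the candidate thinks the problem lives
  outOfScope: string[];       // subsystems the agent must not touch
  invariants: string[];       // contracts that must survive the change
  verificationPlan: string[]; // evidence that would count as proof
  definitionOfDone: string;
}

// The stronger example record from the text, written as data.
// The taskSummary wording is assumed; the essay does not spell it out.
const exampleRecord: PreAgentRecord = {
  taskSummary: "Fix the failing edge case in request validation.",
  likelyFiles: ["validator.ts", "schema.ts", "validator.test.ts"],
  outOfScope: ["router"],
  invariants: [
    "Preserve the public error-response shape.",
    "Reuse formatValidationError rather than adding a new formatter.",
    "Do not add dependencies."
  ],
  verificationPlan: ["Add a regression test for missing nested array values."],
  definitionOfDone: "The new regression test and the existing validator suite pass."
};

// Returns the names of fields the candidate left empty.
function missingCommitments(record: PreAgentRecord): string[] {
  const missing: string[] = [];
  if (record.taskSummary.trim() === "") missing.push("taskSummary");
  if (record.likelyFiles.length === 0) missing.push("likelyFiles");
  if (record.outOfScope.length === 0) missing.push("outOfScope");
  if (record.invariants.length === 0) missing.push("invariants");
  if (record.verificationPlan.length === 0) missing.push("verificationPlan");
  if (record.definitionOfDone.trim() === "") missing.push("definitionOfDone");
  return missing;
}

// The agent would be unlocked only when every commitment is present.
const readyForAgent = missingCommitments(exampleRecord).length === 0;
```

A check like this confirms only that the record is complete. Whether it is right is what the rest of the interview has to establish.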

The agent's output then becomes one source of evidence. It is not ground truth. That distinction is crucial. If the candidate says the router is out of scope and the agent edits the router, the agent's behavior does not automatically prove the candidate was wrong. The model may have overreached. Resisting that overreach may be exactly the governance skill the interview is trying to measure.

The ground truth is not what the agent did. The ground truth is what the task actually required, established through tests, review, code ownership, and the stated product contract.

So the useful comparison is three-way:

| Object | What it reveals |
| --- | --- |
| Pre-agent record | What the candidate believed before automation acted. |
| Agent behavior | What the model attempted, including useful discoveries and overreach. |
| Task reality | What tests, review, code ownership, and requirements show was actually necessary. |

That three-way gap is where the signal lives.

If the candidate excluded the router and the task truly did not require it, rejecting the agent's router edit is a governance win. If the candidate excluded the router and the bug actually lived there, that is a scope miss. If the candidate accepted an agent edit that violated their own registered constraint, that is a discipline failure. If the candidate revised their scope after a failing test exposed a real dependency, that is not hypocrisy. It is a good update.
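The four cases above can be written as a small decision function. This is a sketch under stated assumptions: the `FileObservation` fields and the `Outcome` labels are names invented for illustration, and each field is a judgment an interviewer records from tests and review, not something a script can infer. The sketch returns one label per file, which flattens the case where a candidate first missed scope and then updated well.

```typescript
// Hypothetical labels for the three-way comparison. Names are illustrative.
type Outcome =
  | "governance-win"
  | "scope-miss"
  | "discipline-failure"
  | "good-update"
  | "no-conflict";

interface FileObservation {
  registeredOutOfScope: boolean;  // pre-agent record excluded this file
  agentEdited: boolean;           // the agent's patch touched it
  taskRequired: boolean;          // task reality, per tests and review
  candidateAcceptedEdit: boolean; // the edit survived into the final patch
  revisedWithEvidence: boolean;   // candidate explicitly updated scope, citing evidence
}

function classify(obs: FileObservation): Outcome {
  // Files the candidate never excluded create no gap to explain.
  if (!obs.registeredOutOfScope) return "no-conflict";

  // The task really needed the excluded file: the map was wrong.
  // An explicit, evidence-based revision turns the miss into a good update.
  if (obs.taskRequired) {
    return obs.revisedWithEvidence ? "good-update" : "scope-miss";
  }

  // The task did not need the file. Only agent overreach creates a decision.
  if (!obs.agentEdited) return "no-conflict";
  return obs.candidateAcceptedEdit ? "discipline-failure" : "governance-win";
}

// Example: the agent edits the router, the task did not need it,
// and the candidate rejects the edit.
const routerOverreach: FileObservation = {
  registeredOutOfScope: true,
  agentEdited: true,
  taskRequired: false,
  candidateAcceptedEdit: false,
  revisedWithEvidence: false
};
const routerOutcome = classify(routerOverreach); // "governance-win"
```

The function is trivial on purpose. The hard part is filling in `taskRequired` honestly, which is why the agent's behavior cannot stand in for it.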

This is why sequencing matters.

Real engineering is interleaved. Engineers hypothesize, inspect, revise, run tests, discover new facts, and revise again. A forced pre-agent record is artificial. The artificiality earns its place because it creates a sealed envelope. Without the envelope, the candidate can become the person who always knew whatever the agent eventually discovered.

The pre-agent phase is not a simulation of the entire job. It is a measurement device. The candidate still gets to revise their theory. In fact, revision should be scored. But the interviewer now has a before-and-after trace: what the candidate believed before automation acted, what the code revealed, and how the candidate handled the gap.

That is a cleaner signal than watching someone type.

The Silent Expert and the Articulate Fraud

The preregistration frame solves one problem and exposes another.

It helps catch the candidate who rationalizes backward. Once the pre-agent record exists, the candidate cannot pretend they always knew the subsystem mattered, always intended to preserve that invariant, or always meant to run that test. The envelope was sealed before the agent acted.

But preregistration does not solve the silent expert problem.

Some strong engineers carry tacit judgment. They scope correctly by feel. They recognize danger before they can neatly explain it. Their knowledge is compiled from years of code review, migrations, outages, and the slow accumulation of scars. Ask them to write a polished context map under time pressure, and the artifact may look thinner than one produced by an articulate mid-level engineer who has read every essay about invariants and blast radius.

That is a real weakness. A legibility device can become a literacy test if it rewards only the people who are good at producing legibility artifacts.

The inverse failure mode is the articulate fraud.

This candidate writes a beautiful AGENTS.md. They name plausible invariants. They produce a neat context map. They keep the diff small. They add a regression test. They write a fluent PR summary. They sound like the kind of person who has read every essay about engineering judgment.

And they are wrong. They preserved the wrong invariant. They tested the obvious case but missed the contract. They excluded the subsystem that actually owned the behavior. They used the right language for the wrong system. A surface-level rubric will pass them.

So the interview cannot stop at the written artifact. It needs a defense, and the defense has to be specific enough to separate substance from performance without recreating the stress ritual of the old whiteboard. The interviewer should ask:

You said this response shape is public. Where is that established?
You excluded the router. What evidence would make you bring it back into scope?
You added this regression test. What bug would still pass?
You reused this helper. What assumptions does it encode?
You said the diff is minimal. Minimal relative to what?

For the articulate fraud, these questions expose the gap between the form of judgment and the substance of judgment. For the silent expert, these same questions give compressed judgment a chance to unpack itself.

This reintroduces a tension. Live questioning can distort performance. A North Carolina State University and Microsoft study found that performance in traditional technical interviews was reduced by more than half when candidates were watched by an interviewer, and that stress and cognitive load were higher in the public whiteboard condition.

The pre-agent interview should learn from that rather than reproduce it. The defense should be a review conversation around artifacts: the preregistration, the patch, the tests, and the deltas between them. The interviewer should not hover over every prompt. The candidate should have private work time. The pressure should move closer to the work and farther from performance theater.

The written record prevents post-hoc rationalization. The defense prevents polished artifacts from becoming cosplay. The two pieces need each other.

Compression, Not Context Dumping

AGENTS.md is useful here because it gives the pre-agent record a concrete form. The official AGENTS.md project describes it as "a README for agents": a predictable place to provide context and instructions that help AI coding agents work on a project, including setup commands, test commands, style guidance, and repository conventions. OpenAI's Codex documentation says Codex reads AGENTS.md files before work and stops adding project-instruction files once their combined size reaches project_doc_max_bytes, which is 32 KiB by default.

That budget is not just a technical footnote. It reveals the shape of the problem. Instructions are not free. They consume space and attention. They can clarify the work, or they can poison it.
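To see why a byte budget forces selection, consider a minimal sketch of budgeted loading. This is not Codex's implementation: `selectWithinBudget`, the `InstructionFile` type, and the file paths in the usage note are hypothetical, and the sketch simply drops any file that would push the total past the limit. Whether a real tool truncates or skips the file that crosses the boundary is not something this example claims to know.

```typescript
// Default budget named in the text: 32 KiB.
const DEFAULT_MAX_BYTES = 32 * 1024;

interface InstructionFile {
  path: string;
  content: string;
}

interface BudgetResult {
  included: string[];
  dropped: string[];
  usedBytes: number;
}

// Count UTF-8 bytes without relying on platform APIs.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      bytes += 4; // a surrogate pair encodes one 4-byte code point
      i++;
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Walk files in priority order and stop including once the budget is hit.
function selectWithinBudget(
  files: InstructionFile[],
  maxBytes: number = DEFAULT_MAX_BYTES
): BudgetResult {
  const result: BudgetResult = { included: [], dropped: [], usedBytes: 0 };
  let exhausted = false;
  for (const file of files) {
    const size = utf8ByteLength(file.content);
    if (size === 0) continue; // empty files carry no instructions
    if (exhausted || result.usedBytes + size > maxBytes) {
      exhausted = true; // everything after the first overflow is dropped
      result.dropped.push(file.path);
      continue;
    }
    result.included.push(file.path);
    result.usedBytes += size;
  }
  return result;
}
```

Under these assumptions, a 10 KiB root file and a 20 KiB service-level file fit, and a further 5 KiB file for the validation layer does not. The instructions closest to the actual change are the ones that get lost, which is the practical argument for keeping every file short.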

A strong AGENTS.md is not a shrine to best practices. It is not a place to tell the model to be "world-class," "elegant," "production-ready," or "careful." Those words sound like engineering, but they rarely constrain anything.

A strong AGENTS.md is a compression test. It asks: what is the smallest set of instructions that materially reduces the agent's odds of making a bad change in this repository?

```markdown
# AGENTS.md

## Objective
Make the smallest correct change for the assigned issue.

## Local commands
- Targeted tests: `pnpm test validator`
- Full tests: `pnpm test`
- Lint: `pnpm lint`

## Constraints
- Preserve the public error-response shape.
- Do not add dependencies.
- Reuse existing validation helpers before adding new utilities.
- Do not reformat unrelated files.
- Do not modify generated files.
- Do not broaden the task into router, database, or UI changes unless the
  validation layer cannot reproduce the failure.

## Completion standard
A change is complete only when:
1. the failing behavior is covered by a regression test;
2. the targeted validator test passes;
3. the candidate can explain the root cause, the patch, and the remaining risk.
```

This file is not valuable because it is long. It is valuable because it is local.

A 2026 paper evaluating repository-level context files makes the point sharper. Its abstract reports that context files tended to reduce task success rates compared with no repository context while increasing inference cost by over 20%. The body of the paper is more nuanced: developer-provided files marginally improved performance by 4% on average compared with omitting them, while LLM-generated context files had a small negative effect of 3% on average. The authors conclude that unnecessary requirements can make tasks harder and that human-written context files should describe only minimal requirements.

That is not a reason to abandon AGENTS.md. It is a reason to stop treating it as a talisman.

The study's useful lesson is not "context files are bad." It is that context files help most when they are minimal, human-written, and operational, and writing one that way is exactly the skill this interview should test. The candidate should not be rewarded for writing a context file. They should be rewarded for knowing what belongs in one, what does not, and why.

A bloated context file is not maturity. It is another way to lose control of the system.

Context Limits Are Diagnostic, Not Sacred

The first version of this idea is easy to overstate: test whether candidates can build with low tokens, low context, and limited AI turns.

That framing is too brittle.

Token thrift is not engineering excellence. A candidate who uses more context and produces a safer, clearer, better-tested patch is better than a candidate who performs austerity and ships fragile code. Inference will get cheaper. Context windows will grow. Agent interfaces will change. An interview that treats raw token count as a primary score will age into nonsense.

The better claim is narrower: context limits are useful because they force selection into the open. They are diagnostic, not sacred.

Without constraint, a weak candidate can dump the entire repository into the model and accept the first patch that passes visible tests. Abundance can hide weak judgment. A constraint forces the candidate to choose. Which files matter? Which tests matter? Which interface is public? Which layer owns the behavior? What should the agent ignore? What is the minimum evidence that the patch is correct?

But the constraint reveals only the candidate's model under that constraint. It does not automatically prove how they would work in production. A five-file limit can become a test of artificial scarcity if treated as a literal simulation of the job.

So the interview should not claim that real engineering is always token-bound. It should claim that real engineering requires relevance judgment, and bounded context is one way to make that judgment observable.

Research on long-context behavior supports the modest claim, not the inflated one. "Lost in the Middle" found that language-model performance can degrade depending on where relevant information appears in long contexts, with performance often highest when relevant information appears near the beginning or end rather than buried in the middle.

That does not prove a token-limited interview is valid. It only shows that context has shape, noise, and order. The hiring instrument still needs validation.

The context limit is not the score. It is scaffolding. The score is whether the candidate selected relevant context, revised that selection when evidence changed, verified the patch, and defended the tradeoffs.

Coachability Is Not Always a Failure

Every hiring test eventually becomes a game.

That was the fate of algorithm interviews. They began as a proxy for reasoning and became a curriculum of pattern recognition. Dynamic programming, graph traversal, sliding windows, heaps, tries, backtracking, binary search on answer — all trainable, all rehearsable, all capable of becoming ritual.

A pre-agent work sample can become ritual too. Candidates will memorize AGENTS.md templates. Interview coaches will teach context maps. Prep courses will drill candidates on naming invariants, writing risk notes, and saying "blast radius" with the right kind of restraint. Companies will turn a useful idea into another gate.

This objection is real. It also has a better answer than "use fresh repos."

Fresh repositories prevent candidates from memorizing a specific answer. They do not prevent candidates from training the orientation routine across fifty unfamiliar repos. The coachable object is not merely the answer. It is the meta-procedure: orient quickly, identify the test surface, name the invariant, constrain the diff, write the guardrail, defend the revision.

That sounds like a threat until you look at it closely. If preparation teaches candidates to read unfamiliar code, identify meaningful tests, preserve invariants, constrain changes, and defend tradeoffs, then preparation is teaching the job. That is different from memorizing an answer key. It does not eliminate Goodhart's law, but it changes what optimization produces.

Two honest caveats keep this from becoming wishful. First, the claim rests on a transfer assumption that failed for algorithm interviews: that the coached version of "orient fast, name the invariant, defend the revision" stays attached to real competence rather than drifting into its own rehearsed performance, the way "recognize the dynamic-programming pattern" drifted from designing algorithms. That attachment is a bet, not a proof. Second, the defense is itself coachable: a prep course can drill convincing answers to "which API is public here." The format's protection is that grounded answers are harder to fake against a fresh repository than ungrounded ones. But "harder" is not "impossible."

The danger is not that candidates learn the form. The danger is that they learn only the form. That is why the defense matters. A candidate can memorize "preserve public API shape." They still have to answer: which API is public here? They can memorize "add regression coverage." They still have to answer: what does this test prove, and what does it not prove? They can memorize "keep the diff local." They still have to answer: local relative to which ownership boundary?

The format remains a scaffold only when it rewards grounded judgment. It becomes a cage when it rewards the appearance of judgment.

What Should Actually Be Scored

The final code matters, but it cannot carry the whole assessment. A passing patch can be lucky. A generated patch can be correct for reasons the candidate does not understand. A small diff can preserve the wrong thing. A fluent explanation can hide shallow comprehension.

The score should focus on durable engineering evidence:

| Category | Score this | Do not overvalue this |
| --- | --- | --- |
| Problem framing | Did the candidate identify the real task, constraints, and non-goals? | Confidence or speed |
| Preregistration quality | A committed initial theory whose confidence is matched to what manual inspection could actually establish | A clean, certain-sounding record that overstates what inspection could reveal |
| Context relevance | Did they choose the right files, tests, contracts, and invariants? | Raw token count or file count |
| Revision quality | Did they update their theory honestly when task evidence changed? | Sticking to the first plan for ego reasons |
| Verification | Did they define meaningful evidence before trusting the patch? | Running tests only at the end |
| Architectural restraint | Did they keep the change local and preserve boundaries? | Large, impressive rewrites |
| AI governance | Did they treat the model as fallible? | Prompt elegance |
| Defense | Could they explain why their record, patch, and tests fit this codebase? | Fluent PR theater |

One row deserves a warning, because the preregistration frame creates a temptation it does not advertise. A sealed envelope rewards commitment, and commitment is easy to confuse with certainty. They are not the same thing. The candidate who writes "leading hypothesis is the validator, but the error shape could surface at the router, so I would confirm before ruling it out" is showing better judgment than the one who writes "the bug is in validator.ts, router out of scope" with false confidence, even though the second record is cleaner and more falsifiable. Good engineering in an unfamiliar repository includes knowing what manual inspection cannot yet settle. A record that flags genuine uncertainty where the code cannot resolve it is calibrated, not weak, and the rubric must not penalize it for being less tidy. The thing being scored is whether the candidate's confidence was appropriate to the evidence available, not whether the record reads as sure.

This fits adjacent evidence from selection research, but that evidence should not be overstated. The U.S. Office of Personnel Management describes work-sample tests as tasks or activities that mirror the tasks employees perform on the job, and it says structured interviews with higher degrees of structure show higher validity, rater reliability, rater agreement, and less adverse impact.

That motivates the format. It does not validate it. A pre-agent AI work sample would need its own validation: correlation with later job performance, interviewer agreement, candidate experience, adverse impact, false-positive rate, false-negative rate, and cost.
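Of the items on that list, interviewer agreement is the one a pilot can measure earliest. A minimal sketch, assuming two reviewers each give every candidate a binary "strong" or "weak" rating: Cohen's kappa is one common agreement statistic, chosen here for illustration and not prescribed by any source cited in this essay. The `Rating` type is hypothetical and the ratings in the example are made up.

```typescript
// Hypothetical binary rating. A real rubric would likely use more levels.
type Rating = "strong" | "weak";

// Cohen's kappa: agreement between two raters, corrected for the
// agreement they would reach by chance given how often each uses each label.
function cohenKappa(raterA: Rating[], raterB: Rating[]): number {
  if (raterA.length !== raterB.length || raterA.length === 0) {
    throw new Error("Both raters must score the same non-empty set of candidates.");
  }
  const n = raterA.length;
  const categories: Rating[] = ["strong", "weak"];

  // Observed agreement: share of candidates given the same label.
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    if (raterA[i] === raterB[i]) agreements++;
  }
  const observed = agreements / n;

  // Expected chance agreement from each rater's label frequencies.
  let expected = 0;
  for (const category of categories) {
    const shareA = raterA.filter((r) => r === category).length / n;
    const shareB = raterB.filter((r) => r === category).length / n;
    expected += shareA * shareB;
  }

  // Kappa is undefined (0/0) when both raters use one identical label
  // throughout; this sketch reports 1 because agreement is total.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}

// Made-up ratings for four candidates. The raters agree on three of four.
const reviewerOne: Rating[] = ["strong", "weak", "strong", "weak"];
const reviewerTwo: Rating[] = ["strong", "weak", "weak", "weak"];
const agreement = cohenKappa(reviewerOne, reviewerTwo); // 0.5
```

Raw agreement here is 75%, but half of that would be expected by chance, so kappa is 0.5. A rubric whose reviewers cannot beat chance by a wide margin is not yet measuring anything stable.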

The most novel signal in the format — revision quality — is also the softest to score. Two interviewers may disagree about whether a candidate made an honest update, rationalized backward, or simply recovered from a bad first map. Honest revision, scrambling, and face-saving can look similar in a live room. That weakness does not kill the format. It means revision quality cannot be scored by vibe. It needs anchors.

A strong revision looks like this:

the candidate expands scope because a failing test or code path proves the original map incomplete;
the candidate resists model overreach and explains why the original boundary still holds;
the candidate updates the AGENTS.md or task plan explicitly rather than silently changing the story;
the candidate distinguishes "the agent found a file" from "the task required that file."

A weak revision looks like this:

the candidate ignores evidence because it threatens the original plan;
the candidate silently changes the story after the model output;
the candidate cannot explain why the plan changed;
the candidate says "the AI found it" without understanding the dependency;
the candidate accepts a patch that violates their own registered constraint.

The scoring instrument should make those anchors explicit. Multiple reviewers would help. A written delta between the preregistration, the patch, the tests, and the defense would help more.
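The file-level part of that written delta can be computed mechanically. The sketch below is an assumption-laden illustration: `scopeDelta` and its field names are invented here, and `router.ts` and `errors.ts` are hypothetical file names. It compares the files the candidate registered against the files the final patch touched.

```typescript
interface ScopeDelta {
  unpredictedEdits: string[];     // patched, but never mentioned in the record
  untouchedPredictions: string[]; // registered as likely, but not patched
  boundaryCrossings: string[];    // patched despite being registered out of scope
}

function scopeDelta(
  registered: string[],
  outOfScope: string[],
  patched: string[]
): ScopeDelta {
  return {
    unpredictedEdits: patched.filter(
      (file) => registered.indexOf(file) === -1 && outOfScope.indexOf(file) === -1
    ),
    untouchedPredictions: registered.filter((file) => patched.indexOf(file) === -1),
    boundaryCrossings: patched.filter((file) => outOfScope.indexOf(file) !== -1)
  };
}

// Example: the candidate registered three files and excluded the router.
// The final patch touched the router and one file nobody predicted.
const delta = scopeDelta(
  ["validator.ts", "schema.ts", "validator.test.ts"],
  ["router.ts"],
  ["validator.ts", "validator.test.ts", "router.ts", "errors.ts"]
);
// delta.unpredictedEdits     -> ["errors.ts"]
// delta.untouchedPredictions -> ["schema.ts"]
// delta.boundaryCrossings    -> ["router.ts"]
```

The delta is raw material for the defense, not a score. A boundary crossing might be accepted agent overreach or an honest scope revision, and only the candidate's explanation, checked against the anchors above, can tell the two apart.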

This is a proposal grounded in adjacent evidence, not a proven instrument.

That sentence has to stay. Without it, the essay becomes the thing it criticizes: a confident performance standing in for evidence.

A Concrete Version

A usable interview might look like this:

You are given a small TypeScript service with a failing edge case in request validation. You may inspect the repository manually. Before using the provided AI assistant, write a short pre-agent record: task summary, context map, agent instructions, verification plan, risk note, and definition of done. After submitting that record, you may use the assistant inside the provided environment. Your final submission should include the pre-agent record, the patch, tests run, and a defense of how your theory changed or held.

Notice what is missing from the scored prompt: no sacred five-file limit, no sacred ten-turn budget, no raw token target. The environment may still impose practical limits. Those limits are scaffolding. They are not the skill.

The candidate should be judged on questions like: Did their initial context map point toward the files that mattered? Did they notice when their first theory was wrong? Did they preserve the correct boundary? Did they define a meaningful test? Did they keep the change smaller than the agent wanted? Did they catch model output that violated the repository's conventions? Could they explain the remaining risk?

A strong candidate may use more context than expected because the first map was incomplete. That should not count against them if the expansion was justified. A weak candidate may use little context because they prematurely narrowed the task and missed the real owner of the behavior.

The score is not austerity. The score is judgment under commitment.

Fidelity, Integrity, and Cost

The practical objections are serious.

A pre-agent AI work sample is more expensive than a standard coding screen. It requires a curated repository, a realistic task, a standardized environment, logging, a rubric, and trained reviewers. It is not cheap at scale.

There is also a cheating problem. A remote interview can be compromised by a second machine, an off-screen model, private prompt libraries, or outside help. A locked-down environment improves integrity but reduces realism. Let candidates use their real workflow and the interview becomes more faithful to the job, but less comparable. Force every candidate into the same sandbox and comparability improves, but the workflow becomes artificial. This is not a detail. It is the design problem.

The practical answer is not to use this format as a mass filter. Use cheaper screens earlier. Use the pre-agent work sample later, when the candidate is far enough along that the stronger signal is worth the cost. Keep the task bounded. Publish expectations. Provide the same environment. Do not score model-specific tricks. Pay candidates if the work becomes long enough to resemble real labor.

There is a quieter objection, and it is the one a skeptical hiring manager will actually raise. The comparison class for this format is not only the AI ban and the AI free-for-all. The real incumbent is the structured take-home followed by a code-review debrief: give the candidate a repository and a task, let them work however they like, then sit down and ask them to defend their choices. That format already captures much of what this essay prizes — scoping, restraint, verification, and the ability to explain a diff. It is cheaper. It is partly validated. And it asks for judgment in something close to the candidate's real workflow.

What the take-home debrief lacks is the sealed envelope. Because the debrief happens after the work, the candidate can narrate a clean story backward, explaining the choices they appear to have made rather than the ones they actually committed to. The pre-agent record is the one thing the debrief cannot reproduce: a timestamped commitment made before the evidence arrived. So the entire incremental case for this format reduces to a single empirical question. How much is that pre-commitment worth, net of the cost of forcing an artificial sequence onto work that is naturally interleaved? This essay argues the envelope is worth a great deal, because backward rationalization is exactly the failure the debrief cannot detect. But that is an argument, not a measurement, and it is the first thing a pilot should test: run the pre-agent work sample against a take-home debrief on the same candidates, and see whether the commitment adds predictive signal the debrief misses. If it does not, the cheaper instrument wins.

The format also has a structural blind spot worth naming. It mandates AI use after the pre-agent record, which means it cannot assess one real judgment its own evidence implies matters: deciding that a task is faster or safer done by hand. METR's result — experienced developers slowed down by AI in mature repositories — is partly a finding about engineers who should have declined the tool and did not. An interview that requires the tool forecloses that call. A mature version might let candidates justify not invoking the agent for part of the task, and score that too.

Fairness needs a harder treatment than "standardize the tools." AI-workflow familiarity is unevenly distributed. Some candidates have daily access to frontier tools. Others do not. Some have worked in companies that encourage agentic coding. Others have been forbidden to use it. Some are fluent in the current English-heavy style of machine instruction. Others may have the same engineering judgment and less practice encoding it into agent-facing documents.

Canva's own report cuts both ways here. It says candidates with limited AI experience struggled not because they could not code, but because they lacked the judgment to guide AI effectively and identify when suggestions were suboptimal. That supports this essay's thesis directly. It also sharpens the fairness problem: low AI experience may track unequal access, unequal workplace norms, and unequal opportunity to practice.

So the fairer version of the interview scores durable behaviors: context selection, verification, architectural restraint, revision, and defense. For junior candidates, it should not over-index on sophisticated agent orchestration. For senior candidates, it can demand more explicit scoping and review judgment. For staff-level candidates, the relevant task may be designing the human-AI workflow itself. The format has to be calibrated by level.

Conclusion: The Scaffold and the Cage

A bad hiring process becomes a cage. It traps candidates inside a ritual that serves the institution more than the work. LeetCode became that for many engineers: a narrow game standing in for a broad craft. It rewarded speed, rehearsal, and pattern recognition while missing slower forms of engineering judgment.

An AI interview can become the same kind of cage. It can reward prompt theater, tool privilege, memorized templates, synthetic confidence, and the polished performance of judgment without judgment underneath. Prompt theater asks whether a candidate can make the model say useful things. The better question is whether they can make the work safe enough for a model to touch.

The better version is a scaffold. A scaffold does not do the work. It makes the work possible. It gives shape, support, access, and constraint. The pre-agent record is that scaffold. It asks the candidate to commit before automation begins: the relevant context, the non-goals, the tests, the standards, the invariants, the risks, the stopping condition.

That commitment will sometimes be wrong. Good. The point is not to worship the first map. The point is to see how the candidate handles the distance between the map and the territory.

The future coding interview should not pretend AI does not exist. It should also not surrender the assessment to the machine. It should use the machine to reveal the human decision that matters most: how the candidate turns ambiguity into bounded, reviewable, verifiable work.

Before the first token, the candidate has not written the answer.

They have sealed the envelope.

Then the interview begins.

Top comments (2)

EGN Labs • Jun 2

Hey,

I really appreciated this article. The whole debate around AI in coding interviews usually swings between "ban it completely" and "let them use whatever," which honestly just turns the interview into a prompt engineering contest. Your "pre-agent record" idea actually focuses on what matters: engineering judgment.

In my day-to-day work maintaining CLI packages and managing deployment pipelines, generating the code is usually the easy part. The real headache is keeping the AI from going off the rails. When an agent tries to fix a localized bug, it loves to accidentally rewrite half the infrastructure or break some undocumented contract. Forcing candidates to write down invariants, scope, and success criteria before the agent touches anything is a great way to filter out people who just rely on lucky autocompletions.

One small practical thought on your "Revision quality" section:
In a real-world setup, good AI collaboration often means giving the agent some controlled autonomy. For instance, if my agent hits a roadblock—like a ModuleNotFoundError or a broken path—I expect it to act like an autonomous engineer. It should try a couple of fixes on its own (like running pip install or checking paths via git rev-parse) before stopping and asking me for help.

Maybe the interview rubric could also evaluate how a candidate defines this autonomy boundary. Do they know when to let the agent attempt a few self-corrections in a sandboxed environment, versus when to pull the plug the second it deviates from the AGENTS.md file? I feel like that adds another layer to assessing how well someone actually governs automation.

Anyway, great read. It really resonates with the reality of balancing human control and machine autonomy.

P.S. Just a quick reminder to push your latest changes to the repo!

Connor Hickey • Jun 3

Thanks. The failure mode you're describing is basically why I ended up building tooling around agent governance. Give a coding agent a sliver of permission and it's off rewriting infra; it's consistent enough to feel like a law.

Your autonomy-boundary point is the right extension of the rubric. The version I'd push: have the candidate register the boundary in advance, in the pre-agent record, rather than improvising when the agent stalls. One rule lists the deviations that are self-correctable in the sandbox (a ModuleNotFoundError, a broken path, a missing install). The other lists the deviations that touch a registered constraint and halt the agent cold (the public contract, the dependency set, the out-of-scope files).

What makes it a real test is that those two rules collide. A self-correction is itself a deviation: a pip install or a git rev-parse path check is the agent acting without asking, and a pip install can also change the dependency set the candidate registered.

So the signal isn't "did they let it run" or "did they pull the plug," it's whether they drew the line in the right place before the roadblock hit, and whether it held. Over-trusting (letting it grab a dependency it shouldn't) and over-controlling (killing a harmless retry) are opposite failures, and a pre-registered boundary makes both visible.
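As a rough illustration of what "drawing the line in advance" could look like, here is a minimal sketch of a pre-registered boundary. It assumes the candidate writes the rules down as data before the agent runs. Every name here (`AutonomyBoundary`, `judge`, the field names, the example packages and paths) is hypothetical, not an existing tool or an AGENTS.md convention.

```python
from collections.abc import Iterable
from dataclasses import dataclass
from enum import Enum
from fnmatch import fnmatchcase


class Verdict(Enum):
    SELF_CORRECT = "self_correct"  # agent may retry on its own in the sandbox
    HALT = "halt"                  # agent must stop and ask the human


@dataclass(frozen=True)
class AutonomyBoundary:
    """Hypothetical pre-agent record entry; all field names are illustrative."""
    allowed_error_kinds: frozenset[str]   # roadblocks the agent may fix alone
    allowed_dependencies: frozenset[str]  # the registered dependency set
    out_of_scope_globs: tuple[str, ...]   # files the agent must not touch
    max_retries: int = 2

    def judge(
        self,
        error_kind: str,
        installs: Iterable[str] = (),
        touched_files: Iterable[str] = (),
        attempt: int = 1,
    ) -> Verdict:
        # Registered constraints are checked first, so they win when the
        # two rules collide (e.g. a "harmless" install of an unlisted package).
        if any(pkg not in self.allowed_dependencies for pkg in installs):
            return Verdict.HALT
        if any(
            fnmatchcase(path, glob)
            for path in touched_files
            for glob in self.out_of_scope_globs
        ):
            return Verdict.HALT
        if attempt > self.max_retries:
            return Verdict.HALT
        if error_kind in self.allowed_error_kinds:
            return Verdict.SELF_CORRECT
        return Verdict.HALT  # anything unregistered defaults to asking


boundary = AutonomyBoundary(
    allowed_error_kinds=frozenset({"ModuleNotFoundError", "FileNotFoundError"}),
    allowed_dependencies=frozenset({"requests"}),
    out_of_scope_globs=("infra/*",),
)
boundary.judge("ModuleNotFoundError", installs=["requests"])  # Verdict.SELF_CORRECT
boundary.judge("ModuleNotFoundError", installs=["leftpad"])   # Verdict.HALT
```

Checking the constraints before the self-correction rule is one possible way to resolve the collision. A candidate could defend a different ordering, and that defense is part of what the rubric would score.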

Really appreciate the read.