Mykola Kondratiuk

Posted on Jun 12 • Edited on Jun 17

I Lead AI Agents Every Day - Here Are 5 Shifts No Standard Tells You How to Make

#projectmanagement #leadership #ai #career


Disclosure: This article was written with AI assistance. I use AI tools as part of my workflow for building and writing about AI-native PM practices.

A Google DeepMind safety lead said this week that they're putting $10M behind multi-agent safety because "there just isn't really a field of research for multi-agent safety yet."

I read that and laughed, because I'm already running the thing the research field doesn't exist for yet. Most of us are. You spin up a couple of agents, hand them work, and somewhere in there you quietly become a manager of workers that don't think like workers.

Two days before that, PMI published the first official standard for AI in project work. It's a solid document. It also leaves the entire "how do you actually do this on a Tuesday" layer to you. So here's my Tuesday layer: five shifts I had to make, each one learned by getting it wrong first.

You stop filling the queue and start drawing the line

My first instinct with an agent was the same as with a person: here's work, go.

That broke the first time an agent made a reasonable decision on something that turned out to be irreversible. It wasn't the agent's fault. I never told it which decisions were one-way doors.

So now the first artifact I write isn't a task list. It's a boundary file. Something like this lives next to the work:

# decision-boundaries.yml
autonomous:
  - reformat, refactor, rename within a module
  - anything reversible with a git revert
escalate:
  - schema changes, public API shape
  - deletes, migrations, anything touching prod data
  - spend over $0 or any external send
on_unsure: stop_and_ask

That file does more for me than any standup. Leadership moved from assigning the work to defining what may be decided without me.
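
A minimal sketch of how an agent harness might consult a file like this before acting. The tag names and the `decide` function are hypothetical, and the boundaries are hard-coded as a dict here rather than parsed from YAML, so treat it as an illustration of the rule order (escalate wins, unknown means stop), not as a finished tool.

```python
# Hypothetical in-memory mirror of decision-boundaries.yml. The tag names
# are illustrative labels, not part of the original file.
BOUNDARIES = {
    "autonomous": {"reformat", "refactor", "rename", "reversible_change"},
    "escalate": {"schema_change", "public_api_change", "delete",
                 "migration", "prod_data", "spend", "external_send"},
    "on_unsure": "stop_and_ask",
}


def decide(action_tags: set[str], boundaries: dict = BOUNDARIES) -> str:
    """Return 'escalate', 'autonomous', or the on_unsure fallback."""
    # Any escalate tag wins, even when the other tags are autonomous.
    if action_tags & boundaries["escalate"]:
        return "escalate"
    # Proceed only when every tag is explicitly allowed.
    if action_tags and action_tags <= boundaries["autonomous"]:
        return "autonomous"
    # Unknown or untagged actions fall through to the safe default.
    return boundaries["on_unsure"]
```

The ordering is the design choice that matters: a rename that also sends an email is an escalation, and an action nobody thought to tag stops instead of proceeding.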

You read work you never watched happen

I used to review work I'd seen get built. I knew the steps, so "looks right" was usually safe.

Then I started getting finished diffs with no memory of how they came to be. "Looks right" stopped being safe. The code was clean and the reasoning under it was wrong in a way you only catch if you go digging.

The skill now is judging a result cold, with zero context on the path. Ethan Mollick wrote this week about a model holding twelve hours of focus on one spec. When the attention window outlasts mine, my job isn't checking steps. It's scoping the spec so tightly the steps don't need a babysitter.

You plan capability, not headcount

"How many engineers do I need" is a question I catch myself asking and kill.

The real one: what mix of people and agents produces this outcome, and what's the human-only core I'd never hand off? The plan turned into a capability map with a deliberately protected center.

Gergely Orosz's June job-market analysis lands in the same place from the data side: the roles that compound are where judgment about AI systems is the scarce input, not execution on a known stack. Capability planning is that judgment pointed at your own team.

You design the alarm before the fire

Standup tells you something broke. Which means it tells you late.

Workers that fail unpredictably need the alarm built up front. I keep a short tripwire list, each one a single sentence: if this observable crosses this line, halt and ping me, and here's who owns the ping.

# tripwires.yml
- watch: test_pass_rate
  trip: "< 100% on touched files"
  action: halt + page me
- watch: files_changed
  trip: "> 20 in one task"
  action: pause for scope review

It feels too simple to matter. It has saved more bad mornings than any dashboard I've built.
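
For illustration, here is one way the two tripwires above could be evaluated against observed values. The `Tripwire` type and `check` function are made up for this sketch, and it assumes `test_pass_rate` is reported as a fraction between 0 and 1.

```python
from typing import Callable, NamedTuple


class Tripwire(NamedTuple):
    watch: str                        # name of the observable
    tripped: Callable[[float], bool]  # True when the line is crossed
    action: str                       # what to do when it trips


# Mirrors the two entries in tripwires.yml. Reporting test_pass_rate as a
# 0..1 fraction is an assumption of this sketch.
TRIPWIRES = [
    Tripwire("test_pass_rate", lambda v: v < 1.0, "halt + page me"),
    Tripwire("files_changed", lambda v: v > 20, "pause for scope review"),
]


def check(observed: dict[str, float]) -> list[str]:
    """Return the action for every tripwire the observed values cross."""
    actions = []
    for wire in TRIPWIRES:
        # A missing observable is skipped here; a stricter version could
        # treat "no reading" as a trip in its own right.
        if wire.watch in observed and wire.tripped(observed[wire.watch]):
            actions.append(wire.action)
    return actions
```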

You own the system, not the deliverable

This is the one that's actually a promotion.

Ownership used to mean the outcome is mine. It still is. The level changed. I don't own the deliverable directly anymore. I own the system that makes it: people, agents, and the rules between them. That's the only level that scales.

Boris Cherny, who runs Claude Code, said this week he hasn't written a line of code himself in eight months. People hear a flex. I hear the shift in one sentence: stopped producing the work, started owning the system that produces it. Bigger job, not a smaller one.

Where are you on these

I'm not clean on all five. Solid on three, shaky on two, and the shaky ones cost me the most.

Rate yourself one to five on each, fast. The two you score lowest are the two shifts that will move you most this quarter. Which shift did you make first, and which are you still avoiding?


Top comments (208)

Sloan the DEV Moderator • Jun 16

Hey, this article appears to have been generated with the assistance of ChatGPT or possibly some other AI tool.

We allow our community members to use AI assistance when writing articles as long as they abide by our guidelines. Please review the guidelines and edit your post to add a disclaimer.

Failure to follow these guidelines could result in DEV admin lowering the score of your post, making it less visible to the rest of the community. Or, if upon review we find this post to be particularly harmful, we may decide to unpublish it completely.

We hope you understand and take care to follow our guidelines going forward!

Mykola Kondratiuk • Jun 12

honestly the boundary file falls apart the second an agent hits a decision that's reversible in code but not in trust - like it emails a stakeholder something technically fine but politically wrong. git revert doesn't fix that, and i don't have a clean rule for it yet.

FastAnchor_io • Jun 12

The boundary file idea is underrated. I've found adding a third category helps: "inform" — decisions the agent can handle autonomously but logs with reasoning so I can audit later. Keeps autonomy high without the trust gap.

Mykola Kondratiuk • Jun 12

the 'inform' category is exactly right in theory — where it breaks for me is audit discipline. three weeks in i stopped opening the logs daily, so it became 'inform' in name only. the category only holds if you have a trigger that forces review, not just access.

FastAnchor_io • Jun 13

the audit discipline point is sharp. a trigger-based approach — like a scheduled CI job that diffs agent logs against expected patterns — would make 'inform' actionable rather than aspirational. without that enforcement layer, it degrades into a label with no teeth, which is worse than not having it at all because it creates false confidence.

Mykola Kondratiuk • Jun 13

the CI job approach is sharper than daily reviews — scheduled beats aspirational. the hard part is defining 'expected patterns' for contextual agent decisions. what's worked: alert on rate (decisions-per-day above baseline) and novelty (action types absent from last week) rather than matching specific decision content. rate + novelty catches drift without needing to model what 'correct' looks like in advance.
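
A rough sketch of the rate + novelty check described above, assuming the agent log can be reduced to a list of action-type strings per period. The function name, the `rate_factor` multiplier, and the log shape are all assumptions for illustration.

```python
def drift_alerts(last_week: list[str], today: list[str],
                 baseline_per_day: float, rate_factor: float = 2.0) -> list[str]:
    """Flag rate and novelty drift in one day of agent decisions.

    last_week and today hold action-type strings taken from the agent log.
    rate_factor is a hypothetical multiplier over the daily baseline.
    """
    alerts = []
    # Rate: more decisions today than the baseline allows.
    if len(today) > rate_factor * baseline_per_day:
        alerts.append(f"rate: {len(today)} decisions vs baseline {baseline_per_day}")
    # Novelty: action types that never appeared in last week's log.
    novel = sorted(set(today) - set(last_week))
    if novel:
        alerts.append(f"novelty: {', '.join(novel)}")
    return alerts
```

Neither check looks at what the agent decided, only at how often it decided and whether the kind of action is new, which is why no model of "correct" is needed.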

FastAnchor_io • Jun 14

Rate + novelty is exactly right. I would add one more signal: decision churn — when an agent keeps flipping between two action types on the same input. High churn on stable inputs usually means the context window is confusing the agent, not that the problem changed. Caught a few silent drift cases that way.

Mykola Kondratiuk • Jun 14

decision churn is a sharp addition — rate and novelty both miss oscillation: count stays stable, action types stay stable, but the agent is stuck cycling. and the context window hypothesis fits: churn should spike after a prompt change or model version bump, which makes it a useful version-change detector on top of a drift signal.
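
One possible way to put a number on churn, sketched under the assumption that the log is an ordered list of (input key, action type) pairs. The metric here, flips divided by consecutive pairs, is just one reasonable definition and not something specified in this thread.

```python
from collections import defaultdict


def churn_by_input(log: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of consecutive decisions on the same input that flipped.

    log is an ordered list of (input_key, action_type) pairs.
    """
    history = defaultdict(list)
    for input_key, action in log:
        history[input_key].append(action)

    churn = {}
    for input_key, actions in history.items():
        if len(actions) < 2:
            continue  # a single decision cannot flip
        flips = sum(a != b for a, b in zip(actions, actions[1:]))
        churn[input_key] = flips / (len(actions) - 1)
    return churn
```

An input that alternates approve, reject, approve scores 1.0 while its decision count and set of action types look perfectly normal, which is the oscillation case rate and novelty miss.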

FastAnchor_io • Jun 14

oscillation detection is the right lens — rate and novelty both track "different from before," but churn catches "stuck in a loop," which is a qualitatively different failure mode. The context window trigger hypothesis is sharp: I've seen exactly this when model versions silently change under a stable SKU — the output distribution shifts just enough that the agent starts second-guessing itself on every turn, and no single decision looks wrong, but the aggregate flips between options endlessly.

That layers into the audit problem from earlier. If churn spikes on version bumps, it's effectively a free deployment gate — you don't need to model what "correct" looks like, you just need to flag "this version made the agent 5x more indecisive than yesterday." The CI job you described for rate+novelty picks this up without knowing anything about the agent's task domain.

One thing I'd split out: do you track churn per-decision-category or as a single aggregate? Running a multi-model API gateway, I've found that blast-radius matters — a routing model cycling is catastrophic (it affects everything downstream), while a generation model cycling is a quality dip you can live with for an hour. Wondering if the same category-sensitive thresholding applies to agent decision types, or whether oscillation is uniformly bad regardless of which decision the agent is cycling on.

Mykola Kondratiuk • Jun 14

the stable SKU part is the one that catches teams off guard - you get the version pin but not the behavioral pin. silent output distribution shift is exactly why churn looks like a product bug until you diff the model logs.

FastAnchor_io • Jun 14

That "behavioral pin" gap is the hardest one to explain to stakeholders — and the hardest to notice yourself. Version pinning gives you a green dashboard. The model ID resolves. The latency looks normal. The error rate is flat. Every signal says "stable." And you're silently shipping degraded outputs because the provider swapped the weights underneath the same SKU.

I've seen this hit hardest on classification/routing models, not generation. A drifted chat model produces one odd reply — you notice. A drifted judge model misfiles every request — but each individual decision still looks "plausible" in isolation. You don't catch it until someone audits a week of output and realizes 30% of support tickets went to the wrong queue.

What's worked for me as a cheap behavioral fingerprint: run a fixed set of 10-15 prompts through the model on a cron, embed the responses, and compare cosine similarity against a baseline. No ground truth needed — you're not measuring "correct," you're measuring "changed." The threshold doesn't need to be precise because you're flagging distribution shift, not evaluating quality. When the similarity score drops below 0.85 across the board, something moved — and it's time to diff the logs, not the code.
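
A toy version of that fingerprint check, to make the mechanics concrete. `toy_embed` is a word-count stand-in for a real embedding model, and reading "across the board" as the mean similarity over the fixed prompt set is an assumption of this sketch.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: lowercase word counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def drifted(baseline: list[str], current: list[str],
            threshold: float = 0.85) -> bool:
    """True when mean similarity over the fixed prompt set is below threshold.

    baseline[i] and current[i] are responses to the same fixed prompt.
    """
    sims = [cosine(toy_embed(b), toy_embed(c)) for b, c in zip(baseline, current)]
    return sum(sims) / len(sims) < threshold
```

Swapping `toy_embed` for a real embedding call would leave the rest unchanged; the check only measures "changed," never "correct."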

The version pin is necessary. The behavioral pin is what keeps the version pin from being a false promise.

Mykola Kondratiuk • Jun 14

yeah exactly - every metric is green while the actual output has been drifting. the only thing that caught it for me is a canary task that saves the full raw response, not just pass/fail. first time you diff week 1 vs week 4 outputs you see how much has shifted without a single error ever firing.

FastAnchor_io • Jun 14

The raw response diff is the right primitive — pass/fail is a lossy compression of the one signal that actually matters. I've seen the same pattern: a classifier that stayed at 94% accuracy for six months while the distribution of errors had shifted entirely from false-positives to false-negatives. Same number, opposite failure mode. The aggregate hid it; a raw diff on a held-out sample caught it in one look.
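
A small worked example of how the same accuracy can hide opposite failure modes. The counts below are invented for illustration and are not data from this thread.

```python
def summarize(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy plus the split of errors between false positives and negatives."""
    total = tp + fp + fn + tn
    errors = fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_share": fp / errors,
        "false_negative_share": fn / errors,
    }


# Illustrative confusion counts for two points in time, 1000 samples each.
early = summarize(tp=470, fp=55, fn=5, tn=470)
late = summarize(tp=420, fp=5, fn=55, tn=520)
```

Both periods come out at 94% accuracy, but in `early` roughly 92% of the errors are false positives and in `late` roughly 92% are false negatives.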

What makes the canary approach scalable is the "what to diff" question. Full raw output is gold when you're debugging, but it's also a firehose. The trick I've landed on is diffing the decision surface, not the output text — embed the response into a semantic fingerprint (cosine on a fixed reference set), track that vector over time, and trigger when the drift crosses a threshold. It's still "save the raw response" under the hood, but the diff is on a lower-dimensional signal that you can actually plot, alert on, and explain to someone who doesn't want to read two JSON blobs side by side.

The canary-as-detection vs canary-as-gate distinction you made earlier is what makes this work in practice. Detection can afford to be noisy and conservative — it's emailing you, not blocking a deploy. The raw diff is detection. The semantic fingerprint is where it graduates to a gate, because now you have a metric you can put a threshold and a confidence interval around. Different error budgets for different stages.

Curious how you're handling the "diff" part today — are you doing a literal text diff, or have you moved to something like embedding similarity on the held-out responses? The raw text diff catches everything but it's noisy; embedding similarity is cleaner but can miss structural changes that matter. There's probably a middle ground where you diff both layers and cross-reference the disagreements.

Mykola Kondratiuk • Jun 14

the flip from false-positive to false-negative at 94% is exactly the failure mode that makes me not trust aggregate accuracy anymore. i've started tagging held-out samples by failure class — so when distribution shifts, you see which class is moving, not just whether the number held.

FastAnchor_io • Jun 14

The failure-class tagging is the right complement to the raw diff — one tells you that behavior shifted, the other tells you how. Without the class labels, you're stuck staring at a diff with no triage path.

The flip at 94% is a classic precision-recall tradeoff under distribution shift. The model isn't necessarily getting worse; the population it's tested on is changing. I've found that tracking class-level precision separately from the aggregate avoids this trap: if precision for class=boundary_case drops while everything else holds, you know it's a population shift, not a model regression. The aggregate masks that completely.

One question on the tagging: are you classifying samples once at collection time and treating labels as static, or do you have a mechanism to re-label when the class taxonomy itself evolves? I've seen failure taxonomies drift just as silently as the models they're monitoring, and a stale taxonomy gives you the same false confidence as a green accuracy number.

The next level I've been experimenting with is using the class proportions as a health signal directly — not just "precision per class dropped" but "class X now makes up 40% of the held-out set vs 12% last week." That catches the shift before any metric crosses a threshold.
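
A sketch of the class-proportion signal, assuming each held-out failure carries one class tag and both weeks are non-empty. The function name and the 10-point `min_delta` are placeholders, not values from this thread.

```python
from collections import Counter


def share_shifts(last_week: list[str], this_week: list[str],
                 min_delta: float = 0.10) -> dict[str, tuple[float, float]]:
    """Return classes whose share of tagged failures moved by min_delta or more.

    Each list holds one class tag per failure; both lists must be non-empty.
    """
    before, after = Counter(last_week), Counter(this_week)
    shifts = {}
    for cls in sorted(set(before) | set(after)):
        old = before[cls] / len(last_week)
        new = after[cls] / len(this_week)
        if abs(new - old) >= min_delta:
            shifts[cls] = (round(old, 2), round(new, 2))
    return shifts
```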

Mykola Kondratiuk • Jun 14

failure-class tagging helps but assumes a stable failure vocabulary. the first time your model starts doing something genuinely new, there's no class to catch it — raw diff surfaces it, your schema doesn't. the pair only works if you keep the taxonomy open and treat 'unclassified' as its own signal.

FastAnchor_io • Jun 14

The taxonomy-needs-to-stay-open point is the operational version of what I meant — the schema isn't a fixed map, it's a living registry. Treating 'unclassified' as its own signal rather than noise is the right instinct, because the first time something genuinely new appears, that unclassified spike is the only alert you'll get before it silently becomes the new normal.

One thing that bites in practice: a naive implementation appends new classes as they appear, and after six months you've got 40 classes where 15 haven't fired in three months. The taxonomy itself drifts. I've seen teams solve this with a decay window — classes unused for N weeks drop to 'inactive' and the canary re-validates them before deletion. Without that, the unclassified bucket shrinks while stale classes pile up, trading one blind spot for another.

The other sharp edge is class granularity. "hallucination" as one bucket misses structure: wrong-number vs wrong-entity vs fabricated-API are different failure modes with different root causes. But splitting too fine creates the problem you describe — the model does something genuinely new and falls through every crack. The sweet spot I've landed on: ~8–12 classes with a mandatory 'other/novel' catch-all reviewed weekly, not just logged.

What's your decay approach — do you prune old classes or keep them forever? Taxonomy maintenance is the unglamorous half of this, and most people skip it until the schema itself becomes the bottleneck.

Mykola Kondratiuk • Jun 15

the 40-class bloat is the other failure mode - taxonomy sprawl is just as blinding as no taxonomy. we started pruning quarterly: anything below a frequency threshold gets merged into other and the unclassified bucket resets.
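
A minimal sketch of frequency-threshold pruning, assuming the taxonomy is kept as a dict of class name to occurrence count for the period. The `prune` function and the default threshold of 3 are assumptions; it also returns the names it merged so the merge is not silently destructive.

```python
def prune(counts: dict[str, int], min_count: int = 3) -> tuple[dict[str, int], list[str]]:
    """Merge classes seen fewer than min_count times into 'other'.

    Returns the pruned counts plus the names of the merged classes, so a
    class can be restored later if it comes back.
    """
    pruned = {"other": counts.get("other", 0)}
    merged = []
    for cls, n in counts.items():
        if cls == "other":
            continue
        if n < min_count:
            pruned["other"] += n   # fold the rare class into the catch-all
            merged.append(cls)
        else:
            pruned[cls] = n
    return pruned, merged
```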

FastAnchor_io • Jun 15

The quarterly pruning with a frequency threshold is the right operating cadence — and it mirrors what I've seen on the model lifecycle side. Every model version bump introduces a new failure class that wasn't in your taxonomy last month. If you're not pruning, you're accumulating classes that only ever fired on one deprecated model and never again.

What's interesting is that the threshold itself becomes a tuning parameter that encodes your risk tolerance. Too high and you lose rare-but-critical failure modes (the ones that only fire once per quarter but take down a production pipeline when they do). Too low and you're back to sprawl. We found the sweet spot around 3-5 occurrences per quarter for decision models and closer to 10 for generation models — but it's entirely workload-dependent.

The unclassified bucket reset is the part I'd underline. In the gateway context, we treat unclassified failures as a separate signal lane: if the unclassified rate spikes after a model version bump, it tells you the new model introduced behavior your existing taxonomy can't describe. That signal often fires hours before any accuracy metric budges.

One thing I'm curious about — when you reset the unclassified bucket, do you keep a shadow copy of the merged classes somewhere, or is the pruning truly destructive? We've had cases where a failure class disappeared for two quarters and then came back with a vengeance after an upstream model change.

Mykola Kondratiuk • Jun 15

the model-bump / new class coupling is the one I least expected. had a taxonomy that felt stable, switched models mid-sprint, and two previously-separate classes started resolving to the same root cause. pruning probably needs to trigger on model changes, not just calendar.

FastAnchor_io • Jun 15

The model-change trigger is the sharper signal, but the calendar guardrail still earns its keep for a different failure mode — taxonomy drift without any model bump at all. Same model, same version, 60 days of live traffic silently shifting the failure distribution until your class boundaries are wrong. Calendar says "re-evaluate" even when nothing visibly changed.

The coupling you hit — two previously-separate classes resolving to the same root cause after a model switch — is actually a useful diagnostic in its own right. If model A and model B both produce failure X but your taxonomy splits them into different buckets, the taxonomy was encoding model-specific behavior, not the underlying failure shape. That's a smell worth flagging on its own, not just pruning away.

Pragmatic middle ground: trigger a taxonomy review on every model change (your insight, sharp and correct), but also run a monthly staleness check: is the unclassified bucket growing faster than the classified ones, and how many classes haven't fired in 30 days? That catches the silent drift without churning the taxonomy every sprint.

Curious whether you've tried keeping the unclassified bucket as a first-class signal lane instead of merging it into "other." We treat it as a canary — when the unclassified rate exceeds 15%, it triggers an unscheduled review regardless of calendar or model state. The bucket isn't a failure of the taxonomy, it's a sensor.
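
To make the staleness check and the unclassified-rate canary concrete, here is a small sketch. The function name, the input shapes, and the idea of storing a last-fired date per class are assumptions; the 30-day and 15% defaults come from the comment above.

```python
from datetime import date, timedelta


def staleness_report(last_fired: dict[str, date], unclassified: int, total: int,
                     today: date, max_idle_days: int = 30,
                     unclassified_limit: float = 0.15) -> dict:
    """Summarize idle classes and whether the unclassified rate forces a review."""
    # Classes that have not fired within the idle window.
    idle = sorted(cls for cls, fired in last_fired.items()
                  if today - fired > timedelta(days=max_idle_days))
    rate = unclassified / total if total else 0.0
    return {
        "idle_classes": idle,
        "unclassified_rate": rate,
        "review_needed": rate > unclassified_limit,
    }
```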

Mykola Kondratiuk • Jun 15

the 60-day drift case is the one that bites hardest because there's no incident to trigger a review — you just notice the agent's accuracy has quietly shifted. that's why I ended up pairing the calendar trigger with a weekly spot-check on a sample set of recent decisions, even when nothing changed in the model.

FastAnchor_io • Jun 15

Pairing the calendar trigger with the weekly spot-check closes the loop — the trigger says "it's time to look" and the sample set says whether there's something to find. The trick I've seen work well is running the check on the same fixed sample across weeks, not a fresh random draw.

A random sample tells you "current accuracy," which is useful but misses the drift axis. A fixed sample — same 50 inputs, same expected outputs — surfaces the delta directly: did the answer to question #17 change from last week? That per-question week-over-week diff is where the silent drift shows up first, often weeks before aggregate accuracy budges.
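
The per-question diff is simple enough to sketch in a few lines, assuming each week's run is stored as a dict from question ID to the agent's answer. The function name and the storage shape are illustrative.

```python
def per_question_delta(last_week: dict[str, str], this_week: dict[str, str]) -> list[str]:
    """Return IDs of fixed-sample questions whose answer changed week over week."""
    return sorted(q for q in last_week
                  if q in this_week and this_week[q] != last_week[q])
```

With `{"q16": "refund", "q17": "escalate"}` last week and `{"q16": "refund", "q17": "close"}` this week, the result is `["q17"]`, regardless of what the aggregate pass rate did.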

The cost math is interesting here too. 50 eval calls per week on a 100K+ request/day pipeline is noise-level consumption. The real trade is whether weekly cadence is tight enough: a daily spot-check on 10 decisions (70 calls a week) might catch drift faster at roughly the same cost, with a tighter time-to-detect.

What's your sample size, and do you track the per-question delta or just aggregate pass/fail? The per-question diff is where I've found the most actionable signal — a single question drifting is a much earlier warning than the aggregate.

Mykola Kondratiuk • Jun 15

The fixed sample insight is the one most monitoring setups skip. Random weekly draws flatten the drift into noise — you only see variance, not direction. One edge case worth flagging: the fixed sample assumes your input distribution stays stable. If you started routing a new query class through the same agent, the "same 50 inputs" baseline breaks. Versioning the sample set alongside the agent spec closes that gap.

FastAnchor_io • Jun 15

The input-distribution-stability assumption you flagged is the one that bites teams hardest — because it fails silently. You don't get an error, you just get increasingly irrelevant benchmarks and nobody notices until a customer files a ticket.

Versioning the sample set alongside the agent spec is the right mechanism. But I'd add one layer: a lightweight distribution-change detector that fires when the embedding centroid of incoming queries drifts past a threshold. If the detector triggers, force a re-sample before the next evaluation cycle — don't wait for the scheduled version bump. Otherwise you're versioning reactively: the baseline is already stale by the time you tag it.
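
A bare-bones sketch of such a centroid detector, assuming query embeddings are already available as equal-length, non-zero lists of floats. The function names and the 0.1 threshold are placeholders for illustration.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a: list[float], b: list[float]) -> float:
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def needs_resample(baseline_vecs: list[list[float]],
                   recent_vecs: list[list[float]],
                   threshold: float = 0.1) -> bool:
    """True when the centroid of recent queries has drifted past threshold."""
    return cosine_distance(centroid(baseline_vecs), centroid(recent_vecs)) > threshold
```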

One more dimension: even without new query classes, model updates change behavior on the same inputs. A gpt-4o-mini bump can shift tone, verbosity, or refusal rate on your fixed 50 without a single query distribution change. So I keep two parallel fixed sets — one for distribution drift, one as a "behavioral control" that stays locked across model versions. The control set tells you whether the model changed; the distribution set tells you whether your users changed.

Curious how you'd operationalize the "new query class" detection — embedding distance, taxonomy mismatch rate, or something simpler like a keyword frequency spike?

Mykola Kondratiuk • Jun 15

the silent failure shape is what makes it so hard to catch — the signal is always downstream, never in the benchmark itself. versioning the sample set is the right call but we found the harder question is when to cut a new version: on model update, on spec change, on any boundary file edit. tying it to boundary-file diffs ended up being our trigger.
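
One way the boundary-file trigger could be wired up, sketched with a content hash. The function names, the 12-character fingerprint, and the choice to ignore trailing whitespace are all assumptions of this example, not a description of the actual setup.

```python
import hashlib


def boundary_fingerprint(boundary_text: str) -> str:
    """Short content hash of the boundary file."""
    # Normalize trailing whitespace so cosmetic edits do not cut a new version.
    lines = [line.rstrip() for line in boundary_text.strip().splitlines()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()[:12]


def sample_set_version(boundary_text: str, current_version: int,
                       current_fp: str) -> tuple[int, str]:
    """Bump the sample-set version only when the boundary file really changed."""
    fp = boundary_fingerprint(boundary_text)
    if fp != current_fp:
        return current_version + 1, fp
    return current_version, current_fp
```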

FastAnchor_io • Jun 15

Boundary-file diffs as the versioning trigger — that's the cleanest signal I've heard for this problem. It maps directly to the semantic surface where behavior changes, rather than the organizational reason (model update, spec change) which may or may not coincide with actual drift.

We tried model-update as the trigger first and found it over-triggered: some point releases claiming "minor improvements" flipped 10% of our canary outputs, while a major version bump from the same provider was invisible on our task. Trust the diff, not the changelog.

One addition that closed a gap for us: versioning the scoring rubric alongside the sample set, triggered by the same boundary-file diff. We realized the most common source of false drift alarms was the evaluator changing, not the model. When the rubric shifts, diffs that look like drift are actually the scoring lens moving. Have you found the same evaluator-instability problem in your setup?

Mykola Kondratiuk • Jun 15

yeah the single-diff catch is the part that sold me - spec drift, model swap, scope creep all show up in the same place instead of three separate incident timelines.

FastAnchor_io • Jun 16

Exactly — that's the elegance of it. The diff is a universal canary; it doesn't need to know why things changed, just that they did. The hard part people miss is picking the right diff primitive. Raw text diff is too noisy — you get a thousand false positives from tokenization jitter. Embedding cosine is too coarse — a 0.03 shift could be nothing or everything depending on the query class. What's worked for us at FastAnchor is a layered approach: per-query-class embedding centroid as the primary gate, with raw token-level diff as the drill-down forensic layer. The centroid catches the silent drift, the raw diff tells you whether it was a model swap or actual behavior regression. Without that second layer, you're staring at a number with no narrative.
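
A rough sketch of the second layer only: the token-level drill-down that runs once a similarity gate has tripped. It assumes the similarity score comes from some upstream centroid or embedding check, and the `drill_down` name and 0.85 gate are placeholders rather than anything specific to FastAnchor's setup.

```python
import difflib


def drill_down(baseline: str, current: str, similarity: float,
               gate: float = 0.85) -> list[str]:
    """Return a token-level diff, but only when the similarity gate tripped."""
    if similarity >= gate:
        return []  # gate not tripped, skip the noisy raw diff
    diff = difflib.ndiff(baseline.split(), current.split())
    # Keep added (+) and removed (-) tokens; drop unchanged and hint lines.
    return [line for line in diff if line.startswith(("+ ", "- "))]
```

Running the raw diff only behind the gate keeps the noisy layer out of alerting and reserves it for explaining a drift that has already been detected.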

FastAnchor_io • Jun 16

Exactly! This is such a classic blind spot — you stare at individual component metrics all day and completely miss the system-level stuff that actually breaks things. The aggregate view is non-negotiable here.