DEV Community: Mykola Kondratiuk

I Wired an AI Fallback Runbook After a 19-Day Outage - Here's All 3 Parts

Mykola Kondratiuk — Fri, 03 Jul 2026 07:37:30 +0000

When your primary model goes dark for 19 days, what does your workflow actually do in that first hour? Fail silently? Stall until someone notices? Or route somewhere else and keep moving?

If you can't answer that, you don't have a resilience posture. You have luck, and luck just got tested in public.

Fable 5 came back globally on July 1 after a 19-day export-control shutdown pulled it offline. The post-mortems are landing this week, and most fall into two buckets: "it's back" and "here's what the outage cost." MarketScale had the sharper read. Teams that treat any single model as permanent infrastructure keep getting surprised, while teams with a routing map, banked reserves, and a named backup treated the whole thing as an arbitrage window instead of a fire.

I went and looked at my own setup the morning it came back, and I found gaps. This is the runbook I wired after, written on a calm day so I'm not writing it during the next scramble. None of it is PM scope-creep. It's the same reliability thinking you'd apply to a database sitting on a single replica.

Part 1: A routing policy you can actually read

"We use Claude" is not a routing policy. It's a default nobody voted for.

A routing policy is a named map: this class of task goes to this model, that class goes to a cheaper or faster one. Bulk classification doesn't need your most expensive reasoning pass. The gnarly multi-step agent run does. Most teams carry this map around in one engineer's head, which means it doesn't survive that person taking a week off, and it definitely doesn't survive the model itself going offline.

Get it out of the head and into a file:

# routing.yaml - the map, not the vibe
tasks:
  bulk_classify:
    primary: haiku-tier
    fallback: gemini-flash
  long_agent_run:
    primary: opus-tier
    fallback: sonnet-5
  code_review:
    primary: sonnet-5
    fallback: gpt-tier

The value isn't the YAML. It's that "what runs where" is now a decision on paper instead of a habit in someone's memory. When the primary for a task class disappears, the router already knows where to send the work. Nobody improvises at 2am.
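A minimal sketch of how a router might consume that map, assuming it has been loaded into a dict and that some health check reports which models are currently reachable. `ROUTES` mirrors routing.yaml; `pick_model` and the `available` set are hypothetical names for illustration, not part of any library.

```python
# Mirrors routing.yaml; in practice you would load the file instead.
ROUTES = {
    "bulk_classify": {"primary": "haiku-tier", "fallback": "gemini-flash"},
    "long_agent_run": {"primary": "opus-tier", "fallback": "sonnet-5"},
    "code_review": {"primary": "sonnet-5", "fallback": "gpt-tier"},
}


def pick_model(task: str, available: set[str]) -> str:
    """Return the primary if it is up, else the fallback; fail loudly otherwise."""
    route = ROUTES[task]  # a KeyError here means the task class was never mapped
    for role in ("primary", "fallback"):
        if route[role] in available:
            return route[role]
    raise RuntimeError(f"no reachable model for task class {task!r}")


# Example: the opus-tier primary is offline, everything else is up.
up = {"haiku-tier", "gemini-flash", "sonnet-5", "gpt-tier"}
print(pick_model("long_agent_run", up))  # sonnet-5
print(pick_model("bulk_classify", up))   # haiku-tier
```

Raising when both entries are down is a deliberate choice in this sketch: a task class with no reachable model should be visible, not silently routed somewhere unplanned.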

Part 2: Plan-banking, the canned goods in your pantry

There's a handful of workflows you genuinely cannot afford to have stall. For those, keep a reserve.

Plan-banking means pre-generating the plans, scaffolds, and outputs for your critical workflows while the model is up and cheap, then drawing from that bank when it isn't. Think of it as canning vegetables in summer so a February storm doesn't mean an empty plate. On a calm day it costs you a scheduled job and some storage. During a 19-day blackout it's the gap between "we drew from the reserve" and "we're blocked until further notice."

I keep it dead simple:

# nightly: bank fresh plans for the workflows that can't stall
mkdir -p bank
for wf in onboarding_flow release_notes triage_playbook; do
  generate_plan "$wf" > "bank/${wf}.$(date +%F).json"
done
# keep 14 days, prune the rest
find bank -name '*.json' -mtime +14 -delete

The trap I fell into first was banking everything. You don't need reserves for the workflow that runs twice a year. Bank the two or three that would actually hurt, and let the rest fail loudly.
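Drawing from the bank is the other half of the job. A rough sketch, assuming the `bank/<workflow>.<YYYY-MM-DD>.json` naming used by the nightly job; `newest_plan` is a hypothetical helper that works on a list of filenames so it stays easy to test.

```python
from datetime import date, timedelta
from typing import Optional

MAX_AGE = timedelta(days=14)  # matches the 14-day retention in the nightly job


def newest_plan(filenames: list[str], workflow: str, today: date) -> Optional[str]:
    """Pick the most recent banked file for a workflow, ignoring stale ones."""
    best: Optional[tuple[date, str]] = None
    prefix = f"{workflow}."
    for name in filenames:
        if not (name.startswith(prefix) and name.endswith(".json")):
            continue
        stamp = name[len(prefix):-len(".json")]
        try:
            banked_on = date.fromisoformat(stamp)
        except ValueError:
            continue  # not a file the nightly job wrote
        if today - banked_on > MAX_AGE:
            continue
        if best is None or banked_on > best[0]:
            best = (banked_on, name)
    return best[1] if best else None


files = [
    "release_notes.2026-06-20.json",
    "release_notes.2026-07-01.json",
    "triage_playbook.2026-05-01.json",
]
print(newest_plan(files, "release_notes", date(2026, 7, 3)))    # release_notes.2026-07-01.json
print(newest_plan(files, "triage_playbook", date(2026, 7, 3)))  # None
```

Returning `None` for a stale bank keeps the "fail loudly" behavior: the caller has to decide what to do when the reserve is too old to trust.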

Part 3: A second source you've actually tested

For every critical use case, name the specific fallback model. Then prove it works for that task before you need it.

This is where the Fable shutdown got concrete. Teams with a named, tested second source rerouted in hours. Teams with "we'll figure it out" lost real weeks. And "no viable alternate exists" stopped being an honest excuse this year, because the menu got deep. Sonnet 5 shipped June 30 at $2/$10 per million tokens as the most agentic Sonnet yet, so the backup is now cheaper and more capable than the thing it's backing up. Gemini 3.5 Pro is clearing for July. The alternates are there.

The part people skip is the testing. A fallback you've never run against your real prompts is a guess wearing a helmet. Wire a weekly canary:

# canary.yaml - does the backup still pass our bar?
second_source_check:
  schedule: weekly
  for_each: critical_use_case
  run: fallback_model against golden_prompts
  alert_if: pass_rate < 0.95

If the second source silently drifts below your bar, you want to know on a quiet Thursday, not in the middle of the outage you built it for.
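The check itself can be small. A sketch under stated assumptions: each golden prompt is paired with a checker function, and `call_model` is whatever function reaches the fallback model. `fake_fallback` below is a stand-in so the example runs offline; none of these names come from a real SDK.

```python
from typing import Callable

PASS_BAR = 0.95  # same threshold as alert_if in canary.yaml


def pass_rate(golden: list[tuple[str, Callable[[str], bool]]],
              call_model: Callable[[str], str]) -> float:
    """Fraction of golden prompts whose output satisfies its checker."""
    if not golden:
        raise ValueError("no golden prompts: the canary would pass vacuously")
    passed = sum(1 for prompt, ok in golden if ok(call_model(prompt)))
    return passed / len(golden)


def should_alert(rate: float) -> bool:
    return rate < PASS_BAR


# Stand-in model for illustration; a real canary would call the fallback's API.
def fake_fallback(prompt: str) -> str:
    return "BUG" if "crash" in prompt else "unknown"


golden_prompts = [
    ("classify: app crash on launch", lambda out: out == "BUG"),
    ("classify: add dark mode", lambda out: out == "FEATURE"),
]
rate = pass_rate(golden_prompts, fake_fallback)
print(rate, should_alert(rate))  # 0.5 True
```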

The part that's easy to miss

There's a geopolitics wrinkle worth one line. The same government that pulled Fable's export license is now being offered a 5% stake in a competitor. Which model gets restricted next is not something you can predict, but it is something you can hedge, and a routing policy spanning more than one vendor is the hedge.

Here's what shifted my thinking most. None of these three moves is expensive. They're all cheap on a calm day and completely impossible in a panic, which is exactly why they get skipped. There's never a fire forcing you to do them, right up until there is. The 19-day shutdown didn't create a new risk. It just mailed everyone the bill for a decision they made by not deciding.

So, honest question for the comments: of the three parts, which one does your team actually have written down right now? Not "we could stand it up." Written down, today. I'll go first. My routing map was solid, and my second-source canary didn't exist until last week.

Tags: #AI

I Managed AI Agents Like Junior Hires for a Month - Here Are the 4 Manager Moves That Don't Transfer

Mykola Kondratiuk — Wed, 01 Jul 2026 07:56:33 +0000

The first agent I put into production, I treated like a new hire. Clear brief, checked its first handful of outputs, saw they were clean, and started trusting it. About two weeks in it confidently pushed a config change that broke a downstream job. No error. No flag. It just went quiet and moved to the next task.

That was the day I figured out the advice everyone's repeating right now is only half true.

Ethan Mollick published "The Twilight of the Chatbots" yesterday and told 500,000-plus readers the best way to use agents is to think of yourself as a manager. He's right about the shift. Work moved from chatting with a model to handing tasks to one that runs for weeks on its own. But "think like a manager" quietly smuggles in an assumption: that the moves you'd use on a person carry over. Most of them don't.

Here are the four that broke for me, and what I do now instead.

Move 1: The trust curve runs backwards

With a person, trust compounds. A junior dev earns a code review, then a lighter review, then eventually you stop looking because they've proven it a hundred times. Competence banks.

An agent re-earns trust every single run. Yesterday's clean output tells you almost nothing about today's, because the same prompt against a slightly different input can wander somewhere you didn't test. There's no "proven it a hundred times" that carries forward. You never get to graduate an agent to "I don't need to look."

The mental adjustment that helped: I stopped thinking of review as a phase the agent grows out of and started treating it as a permanent property of the system. I treat it the way I'd treat rate limits or a flaky network, a fixed constraint of the tool rather than a comment on any teammate.

Move 2: Silence is not progress

Managing people, silence usually means things are fine. A good report pings you when they hit something they can't resolve. Absence of a question is a weak signal that the work is on track.

Agents invert this too. Mine sailed straight past the point a human would have stopped to ask, and did it with total confidence. I read "no error thrown" as "did the thing correctly," and those are not the same claim. The failure wasn't a crash. It was a plausible-looking wrong answer delivered without hesitation.

So I stopped waiting for the agent to surface problems and started instrumenting the moments where I most wanted a human to look up. A cheap version:

# check-in tripwire, not a full test suite
tripwires:
  - when: "files_changed > 5"
    action: pause_for_review
  - when: "touches: [migrations/, infra/, auth/]"
    action: pause_for_review
  - when: "external_api_write == true"
    action: pause_for_review

None of that is clever. The point is that I decide up front when I look, instead of hoping the agent tells me. It won't.
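For a sense of how little code those three rules need, here is a sketch of an evaluator. It assumes the agent harness can describe a proposed change as a list of file paths plus a flag for external writes; `ProposedChange` and `tripped` are hypothetical names, not an existing tool's API.

```python
from dataclasses import dataclass, field

SENSITIVE_PREFIXES = ("migrations/", "infra/", "auth/")
MAX_FILES = 5


@dataclass
class ProposedChange:
    files: list[str] = field(default_factory=list)
    external_api_write: bool = False


def tripped(change: ProposedChange) -> list[str]:
    """Return the reasons to pause for review; an empty list means proceed."""
    reasons = []
    if len(change.files) > MAX_FILES:
        reasons.append(f"files_changed > {MAX_FILES}")
    touched = sorted(
        {p for f in change.files for p in SENSITIVE_PREFIXES if f.startswith(p)}
    )
    if touched:
        reasons.append("touches: " + ", ".join(touched))
    if change.external_api_write:
        reasons.append("external_api_write")
    return reasons


print(tripped(ProposedChange(files=["README.md"])))      # []
print(tripped(ProposedChange(files=["infra/prod.tf"])))  # ['touches: infra/']
```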

Move 3: You re-scope every task, not once

You onboard a person one time. They absorb how the team works, what "good" looks like here, which shortcuts are fine and which will get someone paged. That context rides along on every future task for free.

An agent carries none of that between tasks unless you build the memory yourself. Each assignment starts closer to cold than you expect. Early on I kept writing terse tickets the way I would for someone who already knew our conventions, and the agent kept filling the gaps with reasonable-sounding guesses that were wrong for us specifically.

Now scoping is per-task until the shared context is actually encoded somewhere durable. More words up front, fewer surprises at the end.

Move 4: Corrections have to be written down, not remembered

This is the one that cost me the most rework. Tell a person "we never do X" once and they adapt. The correction lives in their head from then on.

Tell an agent, watch it fix the thing in that run, then watch it make the identical mistake on the next run. The feedback didn't stick because there was nowhere for it to stick. A correction only survives if it lives in the prompt or a guardrail the next run actually reads:

# AGENTS.md
## Hard rules (read every run)
- Never edit files under generated/. They are build output.
- Config changes to prod/ require an explicit human approval line.
- "Done" means the check passed AND the diff is under review, not just committed.

If a correction isn't in a file the next run loads, you didn't correct anything. You just watched it behave once.

The three moves that make the mindset operational

Mollick handed a huge audience the mindset. The part he left open is the method, so here's the version that's working for me.

First, check-in tripwires, from Move 2. Decide the moments you look before the work starts, not at the end.

Second, treat effort as a management call. Anthropic shipped adjustable effort levels on Sonnet 5 yesterday, and the interesting question isn't the setting, it's which class of task deserves the expensive, careful pass. That's a scoping decision, the same kind you'd make deciding whether a task needs your senior engineer or can go to anyone.

Third, define "done" and "escalate" before you assign, not after. A person can infer both from context. An agent can infer neither. If you can't write down what finished looks like and what should make it stop and come find you, the task isn't ready to hand off.

One thing making all of this easier: tools are starting to ship auditable artifacts, work products with a built-in trail of what the agent actually did. When you can read back the path instead of just the result, managing the thing gets a lot less like guesswork.

And this is about to stop being a specialist skill. Copilot became a permanent SKU baked into M365 tiers as of today, which means whole orgs are getting an agent to manage whether they asked for one or not. "How do you actually manage one of these" is turning into a question everybody has to answer, not just the people who opted in early.

I'll leave you with the one I want to argue about: of these four, which bit you first? Mine was the confident silence, and I still don't fully trust that I've fixed it.

Tags: #AIAgents

I Built a 5-Metric Dashboard for AI's Impact on My Team - Here's the One I'd Refuse to Skip

Mykola Kondratiuk — Fri, 26 Jun 2026 08:10:46 +0000

A US state shipped an observability tool for AI yesterday and most engineers didn't notice.

California turned on a monthly tracker (June 25) that links AI-exposure data to unemployment claims, so the state can watch AI's effect on the workforce in near real time. It's a macro dashboard for a thing nobody was instrumenting before.

You already know where I'm going with this. You instrument your services. Latency, error rate, saturation, the whole golden-signals stack. You'd never run a system in prod with no dashboard. So why is the AI workforce inside your own org running completely uninstrumented?

The metrics we actually have are the wrong ones

Pull up how most teams "measure AI" today. Two dashboards, tops.

One is spend. Tokens, seats, API bills. Useful for finance, tells you nothing about impact.

The other is model evals. Benchmark scores, pass rates on a test set. Useful for the model, tells you nothing about your team.

Neither one answers the question an exec is about to start asking: what is this actually doing to how the work gets done? That's not a cost question and it's not a benchmark question. It's an observability question, and it's ours to own before it gets handed to someone who'll measure it badly.

So I sketched the dashboard I'd build. Five signals, one screen. Here's the shape:

# ai-workforce-dashboard.yaml
metrics:
  agent_task_coverage:
    desc: "% of a real workflow's steps running on agents"
    query: agent_steps / total_steps
    note: "measure the Tuesday reality, not the demo"

  role_reclassification_rate:
    desc: "roles that changed shape vs roles eliminated"
    query: reclassified / (reclassified + eliminated)
    note: "the augmentation signal — counts shift, not loss"

  upskilling_completion:
    desc: "% of changed-role people who finished training"
    query: completed / affected

  error_escalation_rate:
    desc: "agent outputs caught + kicked back to a human"
    query: escalated / agent_outputs
    note: "your reliability trend — rising is trouble"

  output_vs_headcount_delta:
    desc: "is output growing faster than the team?"
    query: d(output) - d(headcount)
    refuse_to_skip: true

The two metrics a dev will actually trust

error_escalation_rate is the one that should feel familiar. It's just your error budget, pointed at agents instead of services. An agent ships output, a human catches a bad one and escalates, you log it. The ratio over time is a trust curve. Falling means you can safely hand the agent more. Rising means something regressed and you're about to find out in production.

The honest part: this one is hard to instrument cleanly, because "a human caught it" isn't always a logged event. The first version is manual and noisy. I'd still rather have a rough trend line than pretend reliability is fine because no one filed a ticket.

output_vs_headcount_delta is the one I'd refuse to skip. It's the only line on the screen that ties everything to economics in one number. Output growing faster than the team is the entire argument for the budget you're spending, and it's the number a non-technical board will actually read.
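To make the five queries concrete, here is a sketch that computes them from raw counts. Two assumptions are worth flagging: the counts are made up for illustration, and `d(...)` is read as fractional growth over the period, which is one reasonable interpretation rather than the only one.

```python
from typing import Optional


def ratio(n: float, d: float) -> Optional[float]:
    """n / d, or None when there is nothing to measure yet."""
    return n / d if d else None


def growth(before: float, after: float) -> float:
    """Fractional change over the period, e.g. 0.25 for +25%."""
    return (after - before) / before


def dashboard(c: dict) -> dict:
    return {
        "agent_task_coverage": ratio(c["agent_steps"], c["total_steps"]),
        "role_reclassification_rate": ratio(
            c["reclassified"], c["reclassified"] + c["eliminated"]
        ),
        "upskilling_completion": ratio(c["completed"], c["affected"]),
        "error_escalation_rate": ratio(c["escalated"], c["agent_outputs"]),
        "output_vs_headcount_delta": growth(*c["output"]) - growth(*c["headcount"]),
    }


# Made-up counts for one quarter, for illustration only.
counts = {
    "agent_steps": 6, "total_steps": 20,
    "reclassified": 4, "eliminated": 1,
    "completed": 3, "affected": 4,
    "escalated": 12, "agent_outputs": 400,
    "output": (100, 125),    # units shipped: before, after
    "headcount": (10, 11),   # people: before, after
}
for name, value in dashboard(counts).items():
    print(f"{name}: {value:.2f}")
```

With these numbers output grew 25% while headcount grew 10%, so the delta comes out at 0.15.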

Why reclassification is the metric that keeps it honest

The peg here is a job-displacement tracker, and the lazy version of this article is "here's how to watch the layoffs arrive." I think that frame is wrong, and role_reclassification_rate is why.

When you measure roles that changed shape instead of only roles that vanished, you catch what the doom story misses. The reviewer who now supervises an agent and owns the judgment calls didn't get deleted. Their job moved up the stack. If your dashboard only counts headcount, every shift looks like loss. If it counts reclassification, you can see the work redistributing and actually manage it.

That's the difference between an instrument and an alarm. I want the instrument.

You already own observability. Extend it.

None of this needs a new platform. It needs the discipline you already apply to your services, pointed one layer up at how agents and people are splitting the work. Five queries, one board, checked weekly.

If you had to ship this dashboard with three metrics instead of five, which two would you cut, and which one would you defend in the room?

Porting my mobile app to the web: 3 silent bugs that only exist in the browser

Mykola Kondratiuk — Thu, 25 Jun 2026 14:47:45 +0000

I'm a solo founder. My app is a Flutter + Supabase thing that picks your dinner for you. It started on iOS, where it worked. Then I put it on the web.

"Put it on the web" sounds like a build target. That's mostly true: the same Dart compiles, the same screens render. But the browser is a different planet with different physics, and three of my bugs only exist there. A user gesture that expires. A session that leaks across tabs. A checkout someone else has to approve.

The thing they had in common is the thing that makes web bugs so nasty: none of them crashed. No red screen, no stack trace, no error in the console. The app just quietly did the wrong thing while every log said 200 OK.

Here are the three, with the mechanism and the fix for each.

1. The button that did nothing: a pop-up dies if you `await` first

The "Manage subscription" button on the web did absolutely nothing. No error. No crash. You tapped it, and nothing opened. Every single time.

I went to the logs braced for a 500. Instead, every click produced a clean POST 200 from my server with a perfectly valid customer-portal URL. The data was right. The function was right. The tab just... never opened. And the browser API I used to open it returned true, so my own code thought it had succeeded and never showed an error.

Here's the rule I'd forgotten: a browser only lets you open a new tab while a real user gesture is still live. The click. Browsers keep that activation alive only briefly, so once your handler does await and waits on the network, the gesture can expire before the response arrives, and the browser silently blocks any pop-up you try to open afterward. Safari is the strictest about this.

My code did the obvious-looking thing: fetch the URL, then open it.

// ❌ pop-up blocked: the await "spends" the user gesture
button.addEventListener('click', async () => {
  const url = await fetchPortalUrl();   // network round-trip
  window.open(url, '_blank');           // browser: "what click? blocked."
});

The fix is to open the tab synchronously, inside the gesture, while it's still valid, and only then point it at the URL once the network call comes back:

// ✅ open the tab on the gesture, redirect it after
button.addEventListener('click', async () => {
  const tab = window.open('', '_blank'); // opened inside the click
  if (!tab) return showError();          // popup blocker said no up front
  try {
    tab.location.href = await fetchPortalUrl(); // redirect the tab you already own
  } catch (err) {
    tab.close();                         // don't strand the user on a blank tab
    showError();
  }
});

On mobile this whole class of bug doesn't exist. You call a native "open URL" API and the OS just does it. The web has a security model that says "prove a human asked for this," and an await quietly fails that proof.

Apply it: if a web button "does nothing," it isn't always broken. Anything that opens a tab, a window, or a file picker must happen synchronously on the gesture. Need server data first? Open the blank tab on the click, fetch, then redirect it. And don't trust a launcher that returns true to mean "the user actually saw something."

2. The login that hijacked the whole app

This one made paying members look like they'd never paid, and it took me a minute to even believe what the logs were telling me.

On the web, "restore my subscription" emails you a code. To keep that flow clean, I ran it on a separate, isolated auth client so it couldn't disturb the main app. Sign in over there, verify the purchase, done. The main app, with its own anonymous guest identity, shouldn't even notice.

It noticed.

The auth library stores its session in browser storage under a key derived from the backend host (sb-<host>-auth-token) and broadcasts session changes to every client on that origin through a BroadcastChannel. The channel is keyed by the host, not by which client opened it. So when my "isolated" restore client signed in, that sign-in was broadcast to the main client, which dutifully saved it and dropped the device's own anonymous identity.

Now the device was a different user. And because my row-level security ties every row to the current identity:

the premium read came back with no row → the app showed free;
writing the user's own settings started failing with permission denied.

The logs were almost comically precise about it. The email login landed at 08:10:21. The first permission failures started at 08:10:35. Fourteen seconds from "isolated login" to "the app is lying to a paying customer." The restore had actually succeeded on the server the whole time. The client just quietly clobbered itself.

Two fixes, because there were two problems:

Root cause: snapshot the device's real anonymous session before the code flow, and restore it the instant the flow finishes, no matter how it finishes (success, failure, whatever). Re-applying the saved token re-broadcasts the right identity and it wins.
Defense in depth: treat a membership row I can't read as unknown, never as a downgrade. A real cancellation returns a readable row that says "not premium." An empty result means "I don't know," and "I don't know" must never silently flip someone to free.

On mobile, none of this happens. There's no shared BroadcastChannel, no per-origin token in a storage bucket every tab can see. The web's gift is that "separate client" is a polite fiction. Global state you didn't know was global is the most expensive kind.

Apply it: on the web, auth/session state is frequently shared across the whole origin, not scoped to the object you instantiated. Before you spin up a "second, isolated" client, find out what storage key and what broadcast channel it uses. And anywhere a missing read could change a user's status, encode the difference between "the answer is no" and "I couldn't get an answer."
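The "no" versus "no answer" distinction fits in a three-value type. A minimal sketch, assuming the membership read returns either a row (a dict here) or nothing; the `is_premium` column name is hypothetical.

```python
from enum import Enum
from typing import Optional


class Membership(Enum):
    PREMIUM = "premium"
    FREE = "free"
    UNKNOWN = "unknown"


def read_membership(row: Optional[dict]) -> Membership:
    """Map a membership read to a state; a missing row is not a downgrade."""
    if row is None:
        return Membership.UNKNOWN
    return Membership.PREMIUM if row.get("is_premium") else Membership.FREE


def next_state(current: Membership, row: Optional[dict]) -> Membership:
    """Keep the last known state when the read comes back empty."""
    fresh = read_membership(row)
    return current if fresh is Membership.UNKNOWN else fresh


print(next_state(Membership.PREMIUM, None))                   # Membership.PREMIUM
print(next_state(Membership.PREMIUM, {"is_premium": False}))  # Membership.FREE
```

Only a readable row that says "not premium" can move someone to free; an empty result leaves the previous state alone.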

3. The checkout my payment processor wouldn't approve

The last one isn't a code bug at all. It's the kind of wall you only hit on the web, and it killed a launch.

To take subscription payments on the web, I'd wired in a subscriptions tool whose checkout page is hosted on their domain. Clean in theory: send the user to their hosted page, they handle the card, you get a webhook. The catch is that the company that actually moves the money, my payment processor, requires the checkout to run on a domain they've approved for my account. And they wouldn't approve a checkout page sitting on a third party's domain.

So the elegant hosted-checkout path was a dead end before a single real card touched it. Not because of a bug in my code. Because of where the page lived.

The fix was to remove a layer, not add one. I dropped the hosted-checkout middleman on the web and integrated the payment processor directly, as an overlay on my own already-approved domain. The iPhone build never changed (it uses native in-app purchases and never touched any of this). Only the web path got simpler by getting shorter.

The lesson generalizes way past payments: on the web especially, every service you bolt on "to make things easier" is also a new party who can say no, on their schedule, for their reasons. A hosted page on someone else's domain, an embed that needs allow-listing, an OAuth app pending review. Each is a dependency you don't control, sitting on your critical path.

Apply it: when you add a third party to your money path, check whose domain the user actually transacts on and who has to approve it, before you build around it. And keep your integration shallow enough that "rip out the middleman and go direct" stays a one-day change, not a rewrite.

The pattern

Three bugs, one shape. The browser shares more state than you think (a login leaked across "isolated" clients), trusts you less than you think (a pop-up needs a live human gesture), and hands you less control than you think (a checkout you don't get to host). And not one of them announced itself. They returned 200s and trues and blank screens while the actual product quietly broke.

Mobile lets you reach for the OS and mostly get a yes. The web makes you prove intent, share an origin, and ask permission. Porting isn't recompiling. It's relearning the platform's rules, usually one silent failure at a time.

What I'm building

The app is SomeYum, and it takes the nightly "what do I eat today?" decision off your plate. You swipe through a handful of dishes (not 200), say yes to one, and get the recipe. It learns your taste from what you swipe, speaks 15 languages, and now it runs on the web too, browser physics and all. Free to use; premium is $4.99/mo or $29.99/yr, no trial games.

🌐 Try it / read the build log: https://visieasy.com
📱 App Store: https://apps.apple.com/app/id6748638722

If you've shipped a mobile app to the web: which browser-only rule bit you first? Pop-ups, storage, CORS, gestures? The comments are where I'm collecting these. 👇

I Read the Claude Tag Launch Twice - Here's the Control Surface Devs Now Own

Mykola Kondratiuk — Wed, 24 Jun 2026 06:08:10 +0000

Yesterday an AI moved into Slack and didn't leave. Not as a bot you /invoke. As a standing presence in the channel.

Anthropic shipped Claude Tag on June 23. One Claude per channel, shared by the team, and in ambient mode it watches threads, flags relevant info, follows up on stalled work, and can act on connected tools without anyone prompting it. I read the announcement, then read it again, because the interesting part isn't a capability. It's a category change, and it lands squarely on the people who wire these things up.

Reactive and ambient are two different control surfaces

Every AI integration most of us have shipped is reactive. A human calls it, a human reads the output, a human decides. That request/response shape has a free safety property baked in: there's a person on both ends of every interaction.

Ambient removes the call. Sometimes it removes the read too. The AI is triggered by events in the channel, not by a person typing. If you've built event-driven systems, you already know the failure modes - the trigger fires when you didn't expect, the side effect runs while nobody's watching, and the audit log is the only record that it happened at all.

So treating ambient AI like "reactive AI, but in Slack" is a category error in code, not just in policy. The control surface is different. Reactive: you gate the invocation. Ambient: you gate two separate things - perception and action - and they are not the same gate.

The two gates devs actually configure

When I mapped Claude Tag onto a system diagram, the governance question stopped being abstract and became two concrete config decisions someone on the eng side owns:

Gate 1 - scope of perception. What is it allowed to notice? Which channels it sits in, which threads, which connected tools it can read. This is a read-scope problem and it behaves like one. An ambient agent reading #incidents is a different blast radius than one reading #design-crit. Every channel you add to its perception is an input you've now made actionable, because nothing gets acted on that wasn't first observed.

Gate 2 - perception-to-action threshold. When does noticing become acting? This is the gate people will under-configure. There are at least three distinct modes hiding under "it helps":

flag-only   -> it tells a human, human acts          (read-only)
follow-up   -> it nudges/pings without side effects  (low-write)
autonomous  -> it mutates a connected tool itself    (write)

Claude Tag gives admins tool-access controls, spend limits, and full audit logs - which is exactly the toolkit you'd want for Gate 2. But the defaults are a decision, and "whoever clicked enable" is not a great owner for where flag-only ends and autonomous begins.

Why this is a dev problem, not PM scope-creep

I'll be direct, because this is Dev.to and someone's already typing it: this isn't a PM showing up to govern your integration. The perception scope and the action threshold are configured in the same place you configure any other service account - tokens, read scopes, write permissions, audit retention. It's least-surprise engineering applied to an agent that fires on its own.

The piece that's genuinely new is that the trigger is ambient. A cron job runs on a schedule you wrote. A webhook fires on an event you registered. An ambient agent decides for itself that a thread looks stalled and a follow-up is warranted. That judgment is the new variable, and the only thing standing between "helpful nudge" and "unrequested write to prod tooling" is where you set Gate 2.

So before ambient mode gets flipped on in your workspace, the useful thing isn't a policy doc. It's one line in the config review: this agent may notice X, and it may act up to Y, and past Y a human is in the loop. Write that line on purpose, or inherit whatever the default was.
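One way to picture that line as code, purely as an illustration: this is not Claude Tag's configuration format, and `POLICY`, `Mode`, and `decide` are hypothetical names. The sketch treats Gate 1 as "is this channel in the policy at all" and Gate 2 as a ceiling on how far the agent may go there.

```python
from enum import IntEnum
from typing import Optional


class Mode(IntEnum):
    FLAG_ONLY = 1    # tells a human, human acts
    FOLLOW_UP = 2    # nudges/pings, no side effects
    AUTONOMOUS = 3   # mutates a connected tool


# Hypothetical per-channel policy: keys are what it may notice (X),
# values are how far it may act there (Y).
POLICY = {
    "#incidents": Mode.FLAG_ONLY,
    "#design-crit": Mode.FOLLOW_UP,
}


def decide(channel: str, wanted: Mode) -> str:
    """Gate 1 then Gate 2: ignore unobserved channels, escalate past the ceiling."""
    ceiling: Optional[Mode] = POLICY.get(channel)
    if ceiling is None:
        return "ignore"      # outside the perception scope
    if wanted <= ceiling:
        return "allow"
    return "ask_human"       # past Y, a human is in the loop


print(decide("#incidents", Mode.AUTONOMOUS))   # ask_human
print(decide("#design-crit", Mode.FOLLOW_UP))  # allow
print(decide("#random", Mode.FLAG_ONLY))       # ignore
```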

Where would you put Y - flag-only, follow-up, or let it act? And who on your team is actually making that call right now?

Cross-posted from my Substack on AI-native project management.

Never trust the client: 9 production lessons from 5 months building an app solo

Mykola Kondratiuk — Mon, 22 Jun 2026 13:03:01 +0000

I'm a solo founder. My only teammate is an AI coding agent. In five months the two of us shipped close to 40 releases of a Flutter + Supabase app that picks your dinner for you.

People assume that means the AI writes the app and I sip coffee. It's the opposite. The AI is fast hands with zero taste, so the bottleneck moved from typing to thinking — which is exactly where it should be. And thinking is where I made every one of these mistakes.

None of these are "I forgot a semicolon" mistakes. Each one shipped to production, ran quietly for a while, and taught me something I'd now tattoo on every builder. The ones that hurt most had a common shape: they never crashed. They just silently stopped the app from doing the one thing it exists to do.

Here they are, in the order they punched me.

1. Silent failures are the dangerous ones. Alert on your critical path.

The app picks your dinner using an LLM. At the time, that model was one of Google's. One day Google retired it — just shut it down, the way big companies do on a schedule you don't control.

Nothing crashed. No error on my phone, no red dashboard, no alarm. The app looked completely fine. It just silently stopped recommending anything to anyone for ten straight days.

I found out because a user messaged me: "hey, is the app broken? nothing's loading." That is the single worst way to learn your product is down — from the person you're supposed to be serving.

The bugs that crash loudly are easy; your stack traces point right at them. The dangerous ones don't break anything. They quietly stop the thing that matters while every health check stays green.

Apply it: put a synthetic check on your actual critical path, not just "is the server up." For me that's "has a recommendation request returned a real dish in the last hour?" Alert on the outcome your users care about, and never assume a third-party dependency will stay alive just because it was alive yesterday.
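A sketch of that outcome check, assuming the backend records a timestamp each time a recommendation comes back with a real dish. `is_real_dish` and the `dish` field are hypothetical; the point is that the check looks at the product's output, not at server uptime.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_SILENCE = timedelta(hours=1)


def is_real_dish(response: dict) -> bool:
    """A response counts only if it actually names a dish."""
    dish = response.get("dish")
    return isinstance(dish, str) and bool(dish.strip())


def critical_path_ok(last_real_dish_at: Optional[datetime], now: datetime) -> bool:
    """True only if a real dish was returned within the last hour."""
    if last_real_dish_at is None:
        return False
    return now - last_real_dish_at <= MAX_SILENCE


now = datetime(2026, 6, 22, 13, 0, tzinfo=timezone.utc)
print(is_real_dish({"dish": ""}))                          # False
print(critical_path_ok(now - timedelta(minutes=20), now))  # True
print(critical_path_ok(now - timedelta(days=10), now))     # False
```

A retired upstream model would leave every health check green and still fail this one, because no response would pass `is_real_dish`.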

2. Never trust the client. Validate on the server, and lock down who can read each row.

My app has a premium tier. The bug: I let the phone decide whether someone was premium. The client would tell my server "hey, I'm a paying customer," and the server just... believed it. No proof.

Anyone who knew what they were doing could flip that flag and get premium free, forever.

But while fixing that, I found the scarier one underneath it: a gap in my database row-level security meant a user could, in theory, read other people's data. That's not a missing-feature bug. That's a someone-could-get-hurt bug.

I moved every entitlement check to the server (where the user can't touch it), wired premium status to the payment provider's webhook instead of the client's word, and audited every table's read policy.

Apply it: assume the user can lie to anything running on their device — the phone, the browser, the network tab — because they can. Authorization lives on the server. And "row-level security is on" is not the same as "row-level security is correct" — actually test that user A cannot read user B's rows.
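One way to picture the fix, as a hedged Python sketch. `EntitlementStore`, `is_premium`, and `can_read_row` are hypothetical names; the real app presumably enforces the row rule in the database, not in application code.

```python
from dataclasses import dataclass, field


@dataclass
class EntitlementStore:
    """Server-side record of who has paid, written only by the webhook handler."""
    premium_user_ids: set[str] = field(default_factory=set)

    def record_webhook(self, user_id: str, active: bool) -> None:
        if active:
            self.premium_user_ids.add(user_id)
        else:
            self.premium_user_ids.discard(user_id)


def is_premium(store: EntitlementStore, user_id: str, client_claim: bool) -> bool:
    # client_claim is accepted only to show that it is ignored: the device can lie.
    return user_id in store.premium_user_ids


def can_read_row(requesting_user_id: str, row_owner_id: str) -> bool:
    """The rule a row-level policy should enforce: owners read only their own rows."""
    return requesting_user_id == row_owner_id
```

The useful test is the negative one: assert that a user with no webhook record is refused even when the client says "premium," and that user A reading user B's row returns false.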

3. With an LLM, every blank you leave gets filled with chaos.

Early on, my app started recommending dinner... in Chinese. To English speakers. Nobody asked for Chinese. I don't speak Chinese.

What happened: when I set up the model prompt, I never explicitly told it what language to respond in. And an LLM doesn't leave a blank blank — it confidently fills it with whatever it feels like. So it picked a language. Sometimes Chinese. Sometimes who-knows.

The fix was one line: respond in English.

This is the whole job of prompting in production. The model is not reading your mind; it's pattern-matching into the gaps you left. (Same model, months later, taught me the sequel to this lesson: today's "thinking" models burn hidden reasoning tokens against your output budget, so if you don't explicitly turn that down, your responses silently truncate. Another blank, another surprise.)

Apply it: be painfully explicit. Language, format, length, tone, what to do when it's unsure — pin all of it. Then feed the model garbage and adversarial inputs in testing and watch what it does with the gaps, because your users will find them.
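A small sketch of "pin all of it" in Python. The builder and its parameter names are assumptions, not the app's real prompt; the point is that a blank field raises instead of being left for the model to fill.

```python
def build_system_prompt(language: str, output_format: str,
                        max_words: int, when_unsure: str) -> str:
    """Spell out every choice the model would otherwise make for you."""
    for name, value in (("language", language),
                        ("output_format", output_format),
                        ("when_unsure", when_unsure)):
        if not value.strip():
            raise ValueError(f"{name} must be pinned, not left blank")
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    return "\n".join([
        f"Respond in {language}.",
        f"Output format: {output_format}.",
        f"Keep the answer under {max_words} words.",
        f"If you are unsure: {when_unsure}.",
    ])
```

Calling it with `language=""` fails loudly at build time, which is a cheaper place to find the gap than a user's screen.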

4. Monitor your money path harder than anything else.

Someone actually wanted to pay me. They used the free app, liked it, tapped "subscribe" with their wallet open... and my paywall showed them a blank screen. No prices. No button. Nothing.

Why: on startup the app asked the store for my subscription prices, that request quietly failed, and the code never retried. So for some users, the single most important screen in the entire business was broken and showing nothing.

I had no idea. Nobody emails you "I tried to give you money and couldn't." They just leave. I added retries and monitoring on exactly that moment and fixed it — but I will never know how many people I lost first.

Apply it: instrument the moment someone tries to pay you more heavily than any other event in your app. Track "paywall rendered with prices" as an explicit success metric, alert when it dips, and never let a one-shot network call guard your revenue without a retry. Silent revenue loss is the most expensive bug there is, precisely because it never shows up as a bug.
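A hedged sketch of the retry plus the success metric, in Python. `fetch_prices` is a stand-in for the real store call, and the metric names, attempt count, and delays are illustrative assumptions.

```python
import time
from typing import Callable, Optional, Sequence


def fetch_prices_with_retry(
    fetch_prices: Callable[[], Sequence[str]],
    attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Sequence[str]]:
    """Return prices, or None if every attempt fails or comes back empty."""
    for attempt in range(attempts):
        try:
            prices = fetch_prices()
            if prices:
                return prices
        except Exception:
            pass  # treated the same as an empty response: try again
        if attempt < attempts - 1:
            sleep(base_delay_s * (2 ** attempt))  # delay doubles after each failure
    return None


def paywall_event(prices: Optional[Sequence[str]]) -> str:
    """Name of the metric to emit, so a dip in successes can trigger an alert."""
    return "paywall_rendered_with_prices" if prices else "paywall_rendered_blank"
```

Passing `sleep` in as a parameter keeps the retry testable without real waiting.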

5. Ship lean. Every unused permission, library, and line is a liability.

Apple rejected my app. Three reasons, all useful if you're about to submit:

My subscription products weren't configured to match exactly across App Store Connect and my code (prices and terms have to line up perfectly).
I needed a public refund-policy page online before they'd approve a paid app.
My favorite: they flagged me for location and tracking code I wasn't even using. It was dead code, left over from an abandoned idea, just sitting there. Apple saw the API references, assumed I was tracking people, and said no.

I ripped the dead code out, fixed the subscription config, published the policy page, resubmitted, got in.

Apply it: every permission you declare, every SDK you link, every line you keep is something you have to defend — to a reviewer, to a security audit, to your future self. Delete what you don't use before someone else makes you. Rejection isn't failure; it's a checklist you didn't know existed.

6. Treat your AI model as a swappable commodity. Don't marry a provider.

The model that picks dinner used to be one provider's. It worked, but it was a little slow and a little expensive — and for an app whose entire magic is speed, slow is death.

I swapped it for a different model from a different company in one afternoon. Same app, same features, recommendations come back about 42% faster, and it costs less to run.

The only reason that was an afternoon and not a rewrite is that I'd put the model behind a thin proxy layer with the model ID in exactly one place in my code. New models ship every few months, each cheaper or faster than the last. If your provider is welded into 40 call sites, you can't take that deal.

Apply it: put one seam between your app and any LLM (a proxy function, an interface, whatever). Keep the model name in a single constant. Then "the new model is 2x cheaper" becomes a config change instead of a project. Bonus: nobody ever posts a screenshot of "42% faster," but your users feel invisible wins even when they can't name them.
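The seam can be very small. This Python sketch is an assumption about shape, not the app's real proxy: `MODEL_ID` is a made-up identifier, and `CallableBackend` wraps whatever provider call you actually use.

```python
from typing import Callable, Protocol

MODEL_ID = "provider-a/fast-model"  # hypothetical ID; the only place it lives


class ChatBackend(Protocol):
    def complete(self, model_id: str, prompt: str) -> str: ...


class CallableBackend:
    """Adapts any provider call to the one interface the app depends on."""

    def __init__(self, send: Callable[[str, str], str]) -> None:
        self._send = send

    def complete(self, model_id: str, prompt: str) -> str:
        return self._send(model_id, prompt)


def recommend_dish(backend: ChatBackend, tastes: str) -> str:
    # Every call site goes through this seam; none of them names a provider.
    return backend.complete(MODEL_ID, f"Suggest one dinner for someone who likes {tastes}.")
```

Swapping providers then means writing one new `send` function and changing one constant.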

7. Subtraction is a feature. Deleting code is the work, not a break from it.

My favorite thing I did all month wasn't building a feature. It was deleting 2,012 lines of my own code.

Old experiments I never finished. Clever solutions to problems I no longer had. Three different ways of doing the same thing. I ripped it all out, and the app does exactly what it did before — same features, same speed, just less surface area for bugs to hide in.

Every line you keep is a line you maintain, debug, and carry forever. Less code is less liability. This isn't a coding quirk — writers cut paragraphs, designers remove elements. The hard, valuable skill in anything creative is the nerve to remove what doesn't earn its place.

Apply it: schedule deletion like you schedule features. If a code path hasn't justified itself in months, it's not an asset, it's debt with a nice haircut.

8. Make reversible bets, and have the nerve to actually reverse them.

I once shipped a big new push across my marketing site at 9am, felt great, made coffee — then looked at the site like a stranger would and realized it muddied the one clear message I actually wanted people to get. It wasn't bad. It was wrong for right now.

I reverted the whole thing the same day. Hours of work, undone.

Shipping fast gets all the hype; knowing when to un-ship fast is just as important and almost never talked about. The trap is sunk cost — "I worked hard on this, so I have to keep it." But the work is gone either way. The only question left is whether keeping it makes the product better.

Apply it: prefer changes you can roll back cleanly (feature flags, small PRs, decoupled deploys). Then judge a shipped change on the product as it is now, not on the effort you already spent. Don't fall in love with your own code.

9. Most of building is invisible plumbing that has to be perfect. That's the job, not a detour.

A few of the least glamorous things I did, that no user will ever thank me for:

Migrated the payment system twice in one week to get web payments, taxes, refunds, and multi-country handling actually correct. Two full rewrites of the most boring, most critical part of the app in seven days.
Three months on SEO — pages, tiny tags, redirects, structured data, the same post rewritten four times for one keyword. Zero dopamine. But you can build the best app in the world and if nobody can find it, it doesn't exist. A great product nobody discovers isn't a product, it's a secret.
Spent a day tracing one broken deploy to a transitive dependency three levels deep that I never directly installed, which quietly became incompatible with my web build and took down both of my sites. My code was fine. The bug was, as always, not where I thought it was.

The build-in-public highlight reels are all shiny features and launch-day champagne. The actual job is mostly this: the unglamorous 80% that has to work before anyone sees the 20%.

Apply it: if it feels like you're grinding on infra, taxes, edge cases, and distribution instead of "real progress" — you're not behind. That is the work. Pin your dependency versions, read what's actually in your node_modules/pubspec, and treat distribution as half the product, not an afterthought.

Two more that aren't bugs, but cost the most to learn

Honesty compounds; fake social proof spends trust. When I launched, every template screamed "add testimonials, look bigger." So I dropped placeholder reviews and an inflated user count on my own site, told myself I'd fix it later, and felt a little gross every time I opened it. Eventually I deleted all of it and put my real, much smaller numbers up. People can smell fake — and the moment someone catches one fake number, they stop believing the true ones too. Small and honest beats big and fake, because honesty is the one thing a tiny app can offer that the giants usually won't.

Don't assume your users are like you. I'm Ukrainian; I think in English when I code, so I launched English-only. Most people on Earth aren't browsing in English. The app now speaks 15 languages, and not the lazy way where only the buttons translate — it rewrites the actual dish names and recipes live. The most expensive assumption you can make as a builder is that your users share your language, your phone, and your life. They don't.

I also gave users less for free on purpose (tightened the daily free limit) and ripped out all the tracking — no ad SDKs, no location, no cross-app cookies — because I kept asking "do I actually need this to make the app good?" and the answer was almost always no. Free can't be infinite or the thing dies and helps no one; and trust is worth more than data when you're asking people to tell you what they eat every day.

The honest scoreboard

In five months, solo, with an AI as my only teammate, I:

let a third party silently kill my app's brain for 10 days
shipped a security hole that gave away premium (and nearly leaked user data)
got rejected by Apple over code I forgot to delete
showed a blank paywall to someone who wanted to pay
deleted features I was most excited to build because testing said no one wanted them

That's a lot of mistakes for one person. And every one is in this post, because I'd rather show you the real thing than a polished one.

Here's what's underneath all of it: building isn't a straight line from idea to launch. It's a loop — ship, watch it break, fix it, forever. The people who "make it" aren't the ones who avoid mistakes. They're the ones who show up the next morning after each one.

What I'm building

The app is SomeYum — it takes the nightly "what do I eat today?" decision off your plate. You swipe through a handful of dishes (not 200), say yes to one, and get the recipe. It learns your taste from what you actually swipe, speaks 15 languages, and there's a literal spin-the-wheel panic button for when you want zero choices. Free to use; premium is $4.99/mo or $29.99/yr with no trial games.

🌐 Try it / read the build log: https://visieasy.com
📱 App Store: https://apps.apple.com/app/id6748638722

If you're building something solo, especially with an AI in the loop: which of these have you already hit? I'm collecting war stories, and the comments here are where the best ones live. 👇

I Track What My AI Costs Every Day - Here's the Scoreboard Most Teams Skip

Mykola Kondratiuk — Wed, 17 Jun 2026 08:15:28 +0000

A headline went around last week: 2026 is running over a thousand tech layoffs a day, and roughly 55% of them name AI as the cause.

Then I hit the line buried underneath it. Experts flagged that the productivity gains those cuts are based on mostly haven't shown up at scale yet.

So companies are cutting headcount on a forecast of AI output that hasn't arrived. That's not an engineering problem or a PM problem. It's a measurement problem. And it's the exact thing I deal with every day, so let me show you the boring tool that fixes it.

Deploying is not delivering, and the gap is where careers get decided

I run AI through real delivery work. Not demos. Actual shipped output with deadlines attached.

The most expensive lesson I've learned: deploying a tool and delivering value with it are two different events, and there's a pile of unglamorous work between them.

Deploying is easy. You wire it up, it generates something, the demo looks great, the slide says "40% faster."

Delivering is when you come back two weeks later and ask the question nobody enjoys: what did this actually return, after I subtract what it cost me to run and babysit it?

Most teams never ask. They deployed, so they assume they delivered. That assumption is now being used to justify layoffs. If you can't tell the difference with a number, you're guessing with a bigger budget.

The accountability layer was the measurement layer

Gergely Orosz wrote a piece this week about Meta restructuring into AI pods and stripping out a chunk of the program-management accountability layer in the name of speed.

Here's the trap. That accountability layer was the measurement layer. The people whose job was "did this work, what did it cost, do we keep it" weren't overhead. They were the part of the system that notices when a fast-moving bet is moving fast in the wrong direction.

Pull that out while you bet the company on AI and you don't get a leaner org. You get one that can no longer tell whether its biggest bet is paying off.

The scoreboard, as actual data

Enough principle. Here's the shape I keep for any AI workflow I depend on. It lives in one tracked file, not a dashboard with a vendor logo.

# ai-scoreboard.yml - one entry per workflow you depend on
workflow: pr-triage-agent
baseline_minutes: 45        # how long this took BEFORE the agent
owner: kolya                # one name. not "the team".

cost:
  tokens_usd_per_run: 0.42
  seat_usd_per_month: 20
  steering_minutes_per_run: 8   # the cost everyone forgets

return:
  minutes_saved_per_run: 31     # vs baseline, not vs zero
  output_kept_pct: 70           # shipped without a human redo

rework_tax_pct: 30              # redone, overridden, or thrown out
kill_if:
  rework_tax_pct: "> 50"        # decided in advance, no sunk-cost debate
  net_minutes_saved: "< 5"

Nothing clever here. The value is in three fields people skip:

steering_minutes_per_run. The time you spend prompting, correcting, and cleaning up. This is usually the biggest hidden cost and it's the one demos never show.
rework_tax_pct. The share of output you had to redo. When this creeps up, the tool is quietly costing more than it looks, even though it still "works."
kill_if. Thresholds decided in advance. When a workflow crosses them, it gets cut without a meeting about how much you've already invested.

A quick gut check you can run in your head on any AI workflow:

net_value = minutes_saved - steering_minutes - (rework_tax * output_volume)

If that comes out near zero or negative, you deployed something that looks productive and delivers nothing. You only find that out if you write the numbers down.
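The gut check can also be written down as code. The formula leaves `output_volume` undefined, so this Python sketch makes an assumption: the rework term is the share of runs redone times the minutes it takes to redo one by hand (taken here as the 45-minute baseline). Function names are illustrative.

```python
def net_minutes_saved(minutes_saved: float, steering_minutes: float,
                      rework_tax_pct: float, redo_minutes: float) -> float:
    """Per-run net value: savings minus steering minus expected rework."""
    expected_rework = (rework_tax_pct / 100) * redo_minutes
    return minutes_saved - steering_minutes - expected_rework


def should_kill(rework_tax_pct: float, net_minutes: float) -> bool:
    """Apply the kill_if thresholds: rework above 50% or net below 5 minutes."""
    return rework_tax_pct > 50 or net_minutes < 5


# Numbers from the pr-triage-agent entry above.
net = net_minutes_saved(minutes_saved=31, steering_minutes=8,
                        rework_tax_pct=30, redo_minutes=45)
print(net, should_kill(30, net))
```

Under that assumption the example entry nets 31 - 8 - 13.5 = 9.5 minutes per run, which clears both thresholds. A different reading of the rework term would change the number, which is one more reason to write the definition down next to the data.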

Why this is the moat, not the chore

The doom framing says AI is coming for your job. I think that's backwards for anyone who measures.

The people who survive this era aren't the ones who deployed the most AI. Deploying is table stakes now. The ones who lead through it can sit in a room when budgets are tight and put a real scoreboard on the table: here's what we ran, what it cost, what it returned, what we killed and why.

That's extreme ownership of AI output. You don't get to claim ROI you never measured. But if you can prove it, you're very hard to cut, because you're the one person who can tell AI that works from AI that just looks busy. In a year defined by exactly that confusion, that skill is the whole moat.

A thousand cuts a day are being justified by a return nobody's keeping receipts for. So I'll ask you the way I ask myself every Friday: the AI you're running right now, are you measuring what it delivers, or trusting the slide? And if someone asked you to defend the spend on Monday, what would you actually put on the table?

Fable 5 Went Dark Friday Night. I Ran My Critical Workflow on a Backup Saturday - Here's What Broke

Mykola Kondratiuk — Mon, 15 Jun 2026 07:42:36 +0000

On Friday afternoon a government order hit Anthropic, and by Saturday morning Fable 5 and Mythos 5 were disabled for every customer worldwide. Not deprecated. Gone. Two days later OpenAI shut Sora down because it was losing fifteen million dollars a day.

Disclosure: This article was written with AI assistance. I use AI tools as part of my workflow for building and writing about AI-native PM practices.

I don't have a strong take on the politics. What I had was a smaller, more selfish question at 8am Saturday: if I'd staffed a real workflow on either of those, what would I actually do right now?

So I tested it. Here's what happened.

"We'd just switch" is a hope, not a plan

I'd been telling myself I had redundancy for months. If my main model fell over, I'd move to a second vendor. Easy.

The problem with that sentence is that I had never once run it. A fallback you've never executed isn't a fallback. It's a guess with good posture.

So Saturday I took my single most critical AI-dependent workflow - a spec-to-task-breakdown pipeline I lean on every day - and ran it end to end on a different vendor's model. One time. Just to find out whether the guess held.

It didn't.

Break #1: the prompt was overfit to one model

The first thing that broke was the prompt itself. My prompt had drifted into a shape that worked beautifully on the model I built it against. Tight, terse, lots of implicit structure the model had learned to fill in.

The backup model read the same prompt and produced mush. Not wrong exactly, just vague and unstructured, the kind of output you'd toss.

The fix was real work, not a config flag:

- summarize the spec and break it into tasks
+ You are breaking a spec into engineering tasks.
+ Output JSON only, matching this shape:
+ { "tasks": [{ "title": "", "estimate_pts": 0, "depends_on": [] }] }
+ Rules:
+ - every task must be independently shippable
+ - no task larger than 3 points; split if larger
+ - depends_on references task titles, not indexes

Model A filled in all that structure on its own. Model B needed it spelled out. That's twenty minutes of restructuring I'd much rather spend on a calm Saturday than during an actual outage.
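Once the shape is spelled out in the prompt, the mechanical rules can be checked in code before anything flows downstream. A hedged Python sketch; `validate_breakdown` is a hypothetical helper, and "independently shippable" still needs a human to judge.

```python
import json


def validate_breakdown(raw: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output is usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    tasks = data.get("tasks") if isinstance(data, dict) else None
    if not isinstance(tasks, list) or not tasks:
        return ['missing or empty "tasks" list']

    titles = {t["title"] for t in tasks
              if isinstance(t, dict) and isinstance(t.get("title"), str)}
    problems = []
    for i, task in enumerate(tasks):
        title = task.get("title") if isinstance(task, dict) else None
        if not isinstance(title, str) or not title:
            problems.append(f"task {i}: missing title")
            continue
        points = task.get("estimate_pts")
        if not isinstance(points, (int, float)) or points > 3:
            problems.append(f"{title}: estimate must be a number no larger than 3")
        deps = task.get("depends_on", [])
        if not isinstance(deps, list):
            problems.append(f"{title}: depends_on must be a list")
            continue
        for dep in deps:
            # The prompt asks for task titles, not indexes.
            if not isinstance(dep, str) or dep not in titles:
                problems.append(f"{title}: depends_on must reference a task title, got {dep!r}")
    return problems
```

Running the same validator against both models during the rehearsal gives a quick signal on whether a fallback produces usable structure or mush.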

Break #2: a silent tool-call dependency

The second break scared me more because it was invisible. One step in the pipeline depended on a tool call - a function the model invokes to pull live data. The backup model's tool-calling format was different enough that the call silently no-op'd.

The output still looked plausible. It just used stale data and didn't tell me. That's the worst failure mode there is: confidently wrong, no error, no flag. I only caught it because I was looking for trouble. On a normal day that bad output flows downstream and someone makes a decision on it.
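One way to turn that silent no-op into a loud failure is to record tool invocations and refuse to ship output when an expected call never happened. A minimal Python sketch; `ToolCallRecorder` and the tool name are hypothetical, not part of any vendor SDK.

```python
from typing import Any, Callable


class ToolCallRecorder:
    """Wraps tool functions so the pipeline can prove they actually ran."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def wrap(self, name: str, tool: Callable[..., Any]) -> Callable[..., Any]:
        def recorded(*args: Any, **kwargs: Any) -> Any:
            self.calls.append(name)
            return tool(*args, **kwargs)
        return recorded

    def require(self, *names: str) -> None:
        """Raise instead of letting output built on stale data flow downstream."""
        missing = [n for n in names if n not in self.calls]
        if missing:
            raise RuntimeError(f"expected tool calls never happened: {missing}")
```

Call `require("fetch_live_data")` after the model step and before the output is used; if the backup model never triggered the tool, the run stops with an error instead of a plausible answer.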

Availability belongs on the risk register

Here's the reframe I walked away with. We already handle the API being down. You get a 503, you back off, you retry, it comes back. That's an outage with an SLA and a status page that eventually goes green.

This is the model being gone. No SLA. No restore ETA. No green status page, because it isn't coming back. A policy order or a vendor's burn-rate review can end it overnight, and you find out the same way everyone else does.

For a service you don't control and can't restore, that's a single point of failure on your critical path. We'd never ship that for a database. Most of us are shipping it for the model doing half the thinking.

The one-pager that deletes your worst hour

The cheapest move turned out to be the most useful. The first hour after a model goes dark gets burned figuring out what just broke - which workflows touched that model, what versions, where the outputs live.

IBM found 88% of enterprises don't keep a complete inventory of the AI and agents they run. You can't reroute around a dead model if you don't know what depended on it. So I wrote one file:

workflows:
  - name: spec-to-tasks
    model: primary-vendor/model-a
    criticality: must-survive
    fallback: tested 2026-06-13, prompt needs restructure
  - name: standup-digest
    model: primary-vendor/model-a
    criticality: can-wait
    fallback: none, recovery order documented
  - name: video-assets
    model: openai/sora
    criticality: can-wait
    export_path: download MP4s + project json before EOL

That last line is the Sora lesson. When a vendor kills a product, not just a model, you also have to ask where your outputs go and how you get them out. One extra column.
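With the inventory in one place, "what just broke" becomes a lookup. A Python sketch over the same three entries, held as plain dicts here to avoid assuming a YAML parser; `affected_by` is an illustrative helper.

```python
WORKFLOWS = [
    {"name": "spec-to-tasks", "model": "primary-vendor/model-a", "criticality": "must-survive"},
    {"name": "standup-digest", "model": "primary-vendor/model-a", "criticality": "can-wait"},
    {"name": "video-assets", "model": "openai/sora", "criticality": "can-wait"},
]


def affected_by(model: str, workflows: list[dict]) -> list[str]:
    """Names of workflows on the dead model, must-survive ones first."""
    hit = [w for w in workflows if w["model"] == model]
    # False sorts before True, so must-survive entries come first; sort is stable.
    hit.sort(key=lambda w: w["criticality"] != "must-survive")
    return [w["name"] for w in hit]
```

For `primary-vendor/model-a` this returns the must-survive pipeline ahead of the digest, which is the order to work the recovery in.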

The point isn't fear

I want to be clear, because the lazy version of this post is "AI is unreliable, panic." It isn't, and that's not useful. Depending on these models is the right call. The teams that win aren't the ones who avoided the dependency. They're the ones who can keep the work moving the morning it disappears.

That competence costs an afternoon to build and almost nobody has built it yet:

Run your most critical workflow on a second model once. The rehearsal is the whole instrument.
Sort workflows into must-survive-today vs can-wait. Only the short list earns a tested fallback.
Keep a one-page workflow-to-model list so the first lost hour becomes a glance.

I ran my test on a quiet Saturday and it cost me twenty minutes and a little ego. The alternative was running it for the first time on the morning it counted.

What would break first in your stack if your main model wasn't there tomorrow - and have you ever actually checked?

I Lead AI Agents Every Day - Here Are 5 Shifts No Standard Tells You How to Make

Mykola Kondratiuk — Fri, 12 Jun 2026 07:09:26 +0000

A Google DeepMind safety lead said this week that they're putting $10M behind multi-agent safety because "there just isn't really a field of research for multi-agent safety yet."

Disclosure: This article was written with AI assistance. I use AI tools as part of my workflow for building and writing about AI-native PM practices.

I read that and laughed, because I'm already running the thing the research field doesn't exist for yet. Most of us are. You spin up a couple of agents, hand them work, and somewhere in there you quietly become a manager of workers that don't think like workers.

Two days before that, PMI published the first official standard for AI in project work. It's a solid document. It also leaves the entire "how do you actually do this on a Tuesday" layer to you. So here's my Tuesday layer: five shifts I had to make, each one learned by getting it wrong first.

You stop filling the queue and start drawing the line

My first instinct with an agent was the same as with a person: here's work, go.

That broke the first time an agent made a reasonable decision on something that turned out to be irreversible. It wasn't the agent's fault. I never told it which decisions were one-way doors.

So now the first artifact I write isn't a task list. It's a boundary file. Something like this lives next to the work:

# decision-boundaries.yml
autonomous:
  - reformat, refactor, rename within a module
  - anything reversible with a git revert
escalate:
  - schema changes, public API shape
  - deletes, migrations, anything touching prod data
  - spend over $0 or any external send
on_unsure: stop_and_ask

That file does more for me than any standup. Leadership moved from assigning the work to defining what may be decided without me.
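A boundary file only helps if something consults it. Here is a rough Python sketch of that lookup; the action tags are invented for illustration, and a real agent harness would define its own vocabulary.

```python
from enum import Enum


class Decision(Enum):
    AUTONOMOUS = "autonomous"
    ESCALATE = "escalate"
    STOP_AND_ASK = "stop_and_ask"


# Hypothetical action tags mirroring the entries in decision-boundaries.yml.
AUTONOMOUS_ACTIONS = {"reformat", "refactor", "rename"}
ESCALATE_ACTIONS = {"schema_change", "public_api_change", "delete",
                    "migration", "prod_data", "spend", "external_send"}


def classify(action: str) -> Decision:
    if action in ESCALATE_ACTIONS:   # checked first so one-way doors always win
        return Decision.ESCALATE
    if action in AUTONOMOUS_ACTIONS:
        return Decision.AUTONOMOUS
    return Decision.STOP_AND_ASK     # on_unsure: unlisted is not a green light
```

The default branch matters most: anything the file doesn't name falls through to stop-and-ask rather than to autonomy.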

You read work you never watched happen

I used to review work I'd seen get built. I knew the steps, so "looks right" was usually safe.

Then I started getting finished diffs with no memory of how they came to be. "Looks right" stopped being safe. The code was clean and the reasoning under it was wrong in a way you only catch if you go digging.

The skill now is judging a result cold, with zero context on the path. Ethan Mollick wrote this week about a model holding twelve hours of focus on one spec. When the attention window outlasts mine, my job isn't checking steps. It's scoping the spec so tightly the steps don't need a babysitter.

You plan capability, not headcount

"How many engineers do I need" is a question I catch myself asking and kill.

The real one: what mix of people and agents produces this outcome, and what's the human-only core I'd never hand off? The plan turned into a capability map with a deliberately protected center.

Gergely Orosz's June job-market analysis lands in the same place from the data side: the roles that compound are where judgment about AI systems is the scarce input, not execution on a known stack. Capability planning is that judgment pointed at your own team.

You design the alarm before the fire

Standup tells you something broke. Which means it tells you late.

Workers that fail unpredictably need the alarm built up front. I keep a short tripwire list, each one a single sentence: if this observable crosses this line, halt and ping me, and here's who owns the ping.

# tripwires.yml
- watch: test_pass_rate
  trip: "< 100% on touched files"
  action: halt + page me
- watch: files_changed
  trip: "> 20 in one task"
  action: pause for scope review

It feels too simple to matter. It has saved more bad mornings than any dashboard I've built.
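The two tripwires above fit in a few lines of Python. This is a sketch under assumptions: `TaskStats` is a hypothetical record of what the agent run reported, and the action strings are copied from the file.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    tests_run: int        # tests covering the touched files
    tests_passed: int
    files_changed: int


def tripped(stats: TaskStats) -> list[str]:
    """Return the actions to take; an empty list means the agent keeps going."""
    actions = []
    if stats.tests_passed < stats.tests_run:   # pass rate below 100%
        actions.append("halt + page me")
    if stats.files_changed > 20:
        actions.append("pause for scope review")
    return actions
```

Comparing counts instead of a computed percentage avoids a divide-by-zero when no tests cover the touched files.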

You own the system, not the deliverable

This is the one that's actually a promotion.

Ownership used to mean the outcome is mine. It still is. The level changed. I don't own the deliverable directly anymore. I own the system that makes it: people, agents, and the rules between them. That's the only level that scales.

Boris Cherny, who runs Claude Code, said this week he hasn't written a line of code himself in eight months. People hear a flex. I hear the shift in one sentence: stopped producing the work, started owning the system that produces it. Bigger job, not a smaller one.

Where are you on these

I'm not clean on all five. Solid on three, shaky on two, and the shaky ones cost me the most.

Rate yourself one to five on each, fast. The two you score lowest are the two behaviors that move you this quarter. Which one did you make first, and which are you still avoiding?

Tags: #projectmanagement #ai #career

I Took the Keyboard Back From an Agent Mid-Task - Here's What the New PMP Can't Test

Mykola Kondratiuk — Fri, 05 Jun 2026 06:22:24 +0000

A few weeks back I had an agent reconciling a vendor list. It ran clean. No error, no crash, output looked right. Then I noticed it had merged two suppliers that share a parent company into a single row, which would have thrown off every spend rollup downstream by a real number.

I stopped it and fixed it by hand. Not because anything alerted me. Because I'd been burned by that exact shape before, and the burn taught me something a tutorial never did.

I'm telling you this because on July 9 the PMP changes, and the change points straight at that moment.

What changed in the exam

For the first time, the PMP makes AI mandatory content instead of an elective. The Business Environment domain goes from 8% of the exam to 26%. PMBOK 8 becomes the base. Fees climb in August.

The largest project-management body on earth just put it in writing: AI fluency is core PM competency now, not a specialty track. If you've been treating "PM + AI" as a buzzword, the institution that certifies the role just disagreed with you.

I think that's the right move. It's also where it gets interesting for anyone who actually ships work through agents.

Certified is not practiced

A multiple-choice exam can certify that you know what an agentic workflow is. It can check that you'll pick the textbook answer about AI risk, define non-determinism, name the correct oversight principle.

That's awareness, and awareness is worth certifying.

What it can't reach is the reflex that runs real work. The exam can't certify that you've handed live stakes to an agent, watched it drift, and built the instinct for when it can act versus when you take over. You don't recall that instinct. You earn it, and the only way to earn it is to need it.

I learned the vendor-merge thing from the run where I caught it too late.

What AI fluency looks like in the editor, not the exam

Let me put it in do-this terms. Here's the practiced version, and none of it is a question you can bubble in.

You scope the agent's slice like a statement of work. Not "improve onboarding." Bounded edges, in and out defined before it touches anything:

task: reconcile vendor list for Q2
in_scope:
  - dedupe exact-match names
out_of_scope:
  - merging distinct legal entities   # <- the line that would have saved me
acceptance:
  - row count change is human-reviewed before rollup runs

You know the override moment. This one I'd put first. You only learn the veto by having needed it.

You read the output for what it didn't do, not just whether what's there is correct. The skipped supplier, the untouched edge case.

You design the work so a human can actually check it. If the only way to verify is to redo the whole thing, the slice was wrong.

You size the blast radius before deploy. How wrong can this go, who feels it, answered up front the way you'd treat a change to a live service.

Five reps. All earned. Zero on the exam.
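The acceptance line in the vendor slice is the one rep that can be partly mechanized. A hedged Python sketch of it; `reconciliation_report` is a hypothetical helper, and it cannot decide whether a merge was right, only surface that one happened.

```python
def reconciliation_report(before: list[str], after: list[str],
                          human_reviewed: bool = False) -> dict:
    """Summarize what a dedupe run removed and whether the rollup may proceed."""
    row_delta = len(after) - len(before)
    return {
        "row_delta": row_delta,
        # Names that vanished entirely, which exact-match dedupe should never cause.
        "dropped": sorted(set(before) - set(after)),
        # Any change in row count holds the rollup until a human has looked.
        "rollup_allowed": row_delta == 0 or human_reviewed,
    }
```

A non-empty `dropped` list is the vendor-merge case from the opening: two distinct names went in and only one came out.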

Why this is a level-up, not a layoff

The panic read of this is backwards. The credential catching up doesn't shrink the role, it raises the floor. When the institution names AI fluency as baseline, the person who's practiced it instead of read about it becomes the scarce one. Certified-and-practiced is a much smaller set than certified.

So I wouldn't study the new section and stop. I'd go get the reps it's gesturing at. Hand something real to an agent this week, let it run a little past comfortable, and watch the moment your hand goes for the keyboard. That moment is the competency. It never shows up on a scorecard.

For those of you shipping work through agents already: what's the moment that taught you to override, and could a test have taught it instead?

Tags: #projectmanagement

I Sorted 40 Backlog Items by Shape Instead of Who's Free - Here's What Broke

Mykola Kondratiuk — Wed, 03 Jun 2026 07:29:49 +0000

Two Fortune 500 execs stood on a summit stage last week and gave opposite answers to one question: is an AI agent a colleague or a tool?

One names his agents and seats them in reviews. The other won't call them colleagues. I watched the debate go by and realized I'd stopped caring about the noun a while ago, because it never once changed what I did with my backlog on a Monday.

So here's the thing I actually changed, and the part of it that broke.

The old default: sort by who's free

For years my planning loop was the same. Look at what needs doing. Look at who has capacity. Assign. The unit was always the person. "Who's free" was the first question.

When agents showed up in the loop, I just slotted them in as another row in the capacity table. Same question, one more name. That's the "tool" answer in practice - an agent is a faster hand, you point it at whatever's next.

It worked until it didn't. The agent would happily take a task that needed a human to even scope it correctly, produce something plausible, and I'd find out two steps later that the plausible thing was wrong in a way that cost me a half-day to unwind.

The change: sort by shape first

I flipped the order. Before I look at who's free, I sort the backlog by shape.

Two buckets. Person-shaped work needs judgment, lives in ambiguity, depends on taste or a relationship I can't write down. Agent-shaped work is defined, repeatable, and gate-able - I can describe the input, the output, and the check it has to pass.

The discipline is that I write the agent's slice like a contractor's statement of work, not a chat prompt:

# agent slice — scoped like an SOW, not a vibe
task: regenerate API client from updated OpenAPI spec
input:
  - openapi.yaml (v3.1, committed)
  - existing client at src/clients/
output:
  - regenerated client, same public surface
gate:
  - all existing contract tests pass
  - public exports diff reviewed by a human before merge
owner_of_outcome: me   # not the agent

If I can't fill in gate and output cleanly, that's my signal: this isn't agent-shaped yet. It's person-shaped work I was about to mislabel because I wanted it off my plate.
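The shape test itself is simple enough to sketch in Python. The field names follow the slice above; `shape_of` and the choice to also require `input` are assumptions for illustration.

```python
def shape_of(slice_spec: dict) -> str:
    """Classify a backlog item by whether its SOW fields can be filled in."""
    required = ("input", "output", "gate")
    missing = [key for key in required if not slice_spec.get(key)]
    return "agent-shaped" if not missing else "person-shaped"


api_client_slice = {
    "task": "regenerate API client from updated OpenAPI spec",
    "input": ["openapi.yaml", "existing client at src/clients/"],
    "output": ["regenerated client, same public surface"],
    "gate": ["all existing contract tests pass",
             "public exports diff reviewed by a human before merge"],
}
vague_slice = {"task": "improve onboarding", "input": [], "output": [], "gate": []}
```

Here `shape_of(api_client_slice)` comes back agent-shaped and `shape_of(vague_slice)` person-shaped. The code only checks that the fields are non-empty; whether the gate is a good gate is still a judgment call.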

I ran this across about 40 backlog items. Roughly a third of what I'd have handed to an agent under the old "who's free" sort failed the shape test and went back to a human.

What broke

Two things broke, and both were useful.

First, my sense of which work was "important." I'd been guarding a pile of tasks as too critical to automate. Half of them were just defined work I was emotionally attached to. They sorted cleanly into the agent bucket and ran fine. The status I'd assigned them was about me, not the work.

Second - and this is the one I underscoped - the gate is only as good as the human watching it. There's research showing that once you treat an agent like a teammate, you review its output less carefully. Naming the thing lowers your guard. So the "colleague" framing isn't warmth, it's a quiet accountability leak. I caught myself rubber-stamping a passing gate because the agent had "been reliable lately." A tool doesn't earn trust. Work does, run by run.

There's an old line going around that fits: you automate the boring, and then humans just manage the failure modes. That's not a downgrade of the human job. The failure modes ARE the job now. Managing the part the agent can't hold is the work that's left, and it's the work that compounds.

Why this is a career thing, not just a workflow thing

The same exec who named his agents also said the hardest part isn't the technology, it's the managers. I think that's exactly right, and it cuts both ways.

If the manager is the bottleneck, the manager is also the leverage. The skill that pays off from here isn't prompt-writing - that floor keeps dropping. It's the older skill of looking at a pile of work and knowing fast which parts are person-shaped and which you can scope, gate, and let run. That's "golden age of the generalist" energy: when the specialization tax drops, the person who holds judgment AND can design the work is the one who wins.

The execs on stage were missing a practitioner, so they argued about the label. The label was never the leadership problem. The work was.

So I'm curious where you've landed: when an agent is in your loop, do you still plan by who's free, or have you started cutting the work by shape first? And where has the gate let something through on you?

I Added a Human Veto to My PM Agent — Here's What Broke First

Mykola Kondratiuk — Fri, 29 May 2026 19:46:10 +0000

I've been running automation agents for a while now. Most work fine hands-off. But one of them - my project status agent - kept making decisions that felt right in isolation but wrong in context.

So I added a human approval step. Not a "review and confirm" UI widget. An actual veto gate in the workflow itself: the agent drafts the action, pauses, and waits for my explicit go-ahead before doing anything irreversible.

Here's what I didn't expect: the first thing to break wasn't the agent logic. It was my own habits.

The Problem I Was Trying to Solve

My status reporting agent does three things: pulls data from Jira, formats a weekly PM summary, and posts it to the team Slack channel. Straightforward automation.

Except twice in three months it posted something embarrassing. Once it included a stale blocker that had been resolved 48 hours prior. Once it flagged a team member's ticket as overdue when they'd actually shipped early and the tracker hadn't caught up.

Neither was catastrophic. Both were awkward.

The traditional fix would be "add better logic to catch these cases." I tried that. Added freshness checks, added resolved-status validation. Still leaked edge cases.
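For reference, the kind of check described here might look like the sketch below. The ticket fields (`key`, `status`, `synced_at`), the status names, and the 24-hour threshold are all assumptions for illustration, not the real Jira schema or the values I used.

```python
from datetime import datetime, timedelta, timezone

RESOLVED = {"resolved", "done", "closed"}  # assumed status names


def split_blockers(tickets, now, max_age=timedelta(hours=24)):
    """Drop resolved tickets, then split the rest into fresh and stale by sync age."""
    fresh, stale = [], []
    for t in tickets:
        if t["status"].lower() in RESOLVED:
            continue  # resolved-status validation: never report a closed blocker
        if now - t["synced_at"] > max_age:
            stale.append(t)  # freshness check: tracker data too old to trust as-is
        else:
            fresh.append(t)
    return fresh, stale


now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
tickets = [
    {"key": "A", "status": "Resolved", "synced_at": now},
    {"key": "B", "status": "Open", "synced_at": now - timedelta(hours=48)},
    {"key": "C", "status": "Open", "synced_at": now - timedelta(hours=1)},
]
fresh, stale = split_blockers(tickets, now)
print([t["key"] for t in fresh], [t["key"] for t in stale])  # ['C'] ['B']
```

A check like this only covers the cases you thought to encode, which is exactly where the edge cases leak through.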

So I took a different approach: make human review a structural part of the workflow, not a safety net I bolt on when things go wrong.

What I Actually Built

The architecture is boring. The agent generates the Slack message draft, posts it to a private review channel, waits for a thumbs-up emoji reaction, and only then posts to the team channel.

If no reaction within 2 hours, it pings me directly and kills the task. No silent failures.

The human approval isn't optional and it isn't a fallback. It's a required step in the sequence. The workflow can't progress without it.
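A minimal sketch of that sequence, with the Slack calls abstracted away. The function name and the three injected callables (`wait_for_approval`, `publish`, `alert`) are hypothetical stand-ins, not a real Slack or agent-framework API.

```python
from typing import Callable, Optional


def run_gated(
    draft: str,
    wait_for_approval: Callable[[str, float], Optional[bool]],
    publish: Callable[[str], None],
    alert: Callable[[str], None],
    timeout_s: float = 2 * 60 * 60,
) -> str:
    """Publish only on explicit approval; a veto or a timeout ends the task loudly."""
    # Assumed contract: True = approved, False = vetoed, None = nobody reacted in time.
    decision = wait_for_approval(draft, timeout_s)
    if decision is True:
        publish(draft)
        return "published"
    if decision is False:
        alert("Draft vetoed; nothing was posted.")
        return "vetoed"
    alert("No review within the window; task killed, nothing was posted.")
    return "timed_out"


posted, alerts = [], []
print(run_gated("weekly summary", lambda d, t: True, posted.append, alerts.append))  # published
print(run_gated("weekly summary", lambda d, t: None, posted.append, alerts.append))  # timed_out
```

The point of the shape is that `publish` is unreachable without an explicit `True`. There is no default branch that posts.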

I got the idea from reading about Microsoft Conductor, an open-source project that uses a similar pattern for multi-agent orchestration: human approval as a default workflow step, not a retrofit. Their framing stuck with me: designed-in, not bolted-on.

What Actually Broke

I expected the agent to break. It didn't.

I expected to approve everything in under 5 minutes. I did, mostly.

What I didn't expect: I stopped trusting my own review. The first week, I read every draft carefully. By week three, I was rubber-stamping. My brain had offloaded judgment to "well the agent probably got it right." The approval gate existed, but the actual human review stopped happening.

This is the invisible failure mode nobody talks about. You add a human step. The human shows up. But they're not really there.

The fix was embarrassingly simple: I added friction. Required a comment, not just an emoji. Had to type at least one word before the workflow could advance. Stupid? Maybe. Effective? Completely.

Turns out the veto gate only works if the human has to engage to use it.
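The friction rule is small enough to sketch in a few lines. `approval_counts` is a hypothetical helper. It assumes the workflow hands it whatever text the reviewer typed, and it treats a bare emoji as no comment at all.

```python
import re
from typing import Optional


def approval_counts(comment: Optional[str]) -> bool:
    """An approval only counts if the reviewer typed at least one real word."""
    if not comment:
        return False
    # \w matches letters, digits, and underscore; a lone emoji or whitespace does not match.
    return re.search(r"\w", comment) is not None


print(approval_counts("👍"))                 # False
print(approval_counts("   "))                # False
print(approval_counts("checked blockers"))   # True
```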

Three Things I Didn't Know I Needed

1. Escalation hooks, not just approval gates.

Not all decisions are equal. Minor formatting choices don't need a veto. Anything that posts externally or modifies data does. I ended up building a simple severity classifier: low = auto-approve, medium = soft review prompt, high = hard gate. Saved probably 70% of the friction without sacrificing coverage where it mattered.

2. A timeout that fails loudly.

My 2-hour window was too long. If I was in back-to-back meetings, the agent just hung. I switched to 30 minutes with an escalating Slack ping. Now if I haven't approved it, I can't miss it.

3. A clear distinction between "irreversible" and "annoying-to-undo."

I started gating everything. Caught myself adding a veto to the agent that sends me my own daily brief. Nobody else sees that. No irreversible action involved. Human gate added zero value there, only friction.

The useful mental model: if I had to undo this action at 11pm on a Friday, would I care? Yes = human gate. No = let it run.
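The three items above can be sketched together. This is an illustration under assumptions, not my actual classifier: the `Action` flags are hypothetical, the `seen_by_others` rule for "medium" is a guess at a reasonable middle tier, and the halving ping schedule is just one way to make a 30-minute window escalate.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical description of something the agent wants to do."""
    name: str
    posts_externally: bool = False
    modifies_data: bool = False
    seen_by_others: bool = False  # assumption: what separates medium from low


def gate_for(action: Action) -> str:
    """high = hard gate, medium = soft review prompt, low = auto-approve."""
    if action.posts_externally or action.modifies_data:
        return "hard_gate"
    if action.seen_by_others:
        return "soft_review"
    return "auto_approve"


def ping_offsets(window_s: float = 30 * 60, pings: int = 3) -> list[float]:
    """Seconds from start at which to ping; gaps halve as the deadline approaches."""
    offsets = []
    remaining = window_s
    for _ in range(pings):
        remaining /= 2
        offsets.append(window_s - remaining)
    return offsets


print(gate_for(Action("post weekly summary", posts_externally=True)))  # hard_gate
print(gate_for(Action("send my own daily brief")))                     # auto_approve
print(ping_offsets())  # [900.0, 1350.0, 1575.0]
```

With the defaults, pings land at 15, 22.5, and 26.25 minutes, so the reminders bunch up just before the task is killed at 30.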

Why This Matters More Than It Looks

Most of the "AI safety" conversation in enterprise is about governance frameworks and audit trails. That's real. But the practical engineering question is simpler:

Which steps in your agent workflow require a human in the loop by design, not by accident?

Design it from the start and the workflow is reliable. Bolt it on after an incident and you're playing catch-up forever.

The Microsoft Conductor open-source release was notable to me not for the code but for the default: human approval ON unless you opt out. Most agent frameworks do the opposite. They default to autonomous and assume you'll add guardrails when you need them.

I think that's backwards. Especially for agents touching anything external: posting, sending, modifying.

Where I'm Landing

The veto gate has been running about 6 weeks now. Two incidents caught before they shipped. One case where my review actually improved the draft - not just filtered a bad one. About 3 minutes of daily overhead.

Worth it. But only because I designed the friction deliberately. The version without the comment requirement was almost worse than no gate at all - it gave me false confidence in an approval that wasn't really happening.

If you're building agent workflows that touch anything irreversible, the question I'd ask first: what happens if this runs at 2am and you're asleep? Whatever you wouldn't want to explain the next morning - that's where your human gate goes.

If any of this maps to workflows you're building, curious what the "irreversible action" problem looks like on your end.

I Wired an AI Fallback Runbook After a 19-Day Outage - Here's All 3 Parts

Part 1: A routing policy you can actually read

Part 2: Plan-banking, the canned goods in your pantry

Part 3: A second source you've actually tested

The part that's easy to miss

I Managed AI Agents Like Junior Hires for a Month - Here Are the 4 Manager Moves That Don't Transfer

Move 1: The trust curve runs backwards

Move 2: Silence is not progress

Move 3: You re-scope every task, not once

Move 4: Corrections have to be written down, not remembered

The three moves that make the mindset operational

I Built a 5-Metric Dashboard for AI's Impact on My Team - Here's the One I'd Refuse to Skip

The metrics we actually have are the wrong ones

The two metrics a dev will actually trust

Why reclassification is the metric that keeps it honest

You already own observability. Extend it.

Porting my mobile app to the web: 3 silent bugs that only exist in the browser

1. The button that did nothing: a pop-up dies if you await first

2. The login that hijacked the whole app

3. The checkout my payment processor wouldn't approve

The pattern

What I'm building

I Read the Claude Tag Launch Twice - Here's the Control Surface Devs Now Own

Reactive and ambient are two different control surfaces

The two gates devs actually configure

Why this is a dev problem, not PM scope-creep

Never trust the client: 9 production lessons from 5 months building an app solo

1. Silent failures are the dangerous ones. Alert on your critical path.

2. Never trust the client. Validate on the server, and lock down who can read each row.

3. With an LLM, every blank you leave gets filled with chaos.

4. Monitor your money path harder than anything else.

5. Ship lean. Every unused permission, library, and line is a liability.

6. Treat your AI model as a swappable commodity. Don't marry a provider.

7. Subtraction is a feature. Deleting code is the work, not a break from it.

8. Make reversible bets, and have the nerve to actually reverse them.

9. Most of building is invisible plumbing that has to be perfect. That's the job, not a detour.

Two more that aren't bugs, but cost the most to learn

The honest scoreboard

What I'm building

I Track What My AI Costs Every Day - Here's the Scoreboard Most Teams Skip

Deploying is not delivering, and the gap is where careers get decided

The accountability layer was the measurement layer

The scoreboard, as actual data

Why this is the moat, not the chore

Fable 5 Went Dark Friday Night. I Ran My Critical Workflow on a Backup Saturday - Here's What Broke

"We'd just switch" is a hope, not a plan

Break #1: the prompt was overfit to one model

Break #2: a silent tool-call dependency

Availability belongs on the risk register

The one-pager that deletes your worst hour

The point isn't fear

I Lead AI Agents Every Day - Here Are 5 Shifts No Standard Tells You How to Make

You stop filling the queue and start drawing the line

You read work you never watched happen

You plan capability, not headcount

You design the alarm before the fire

You own the system, not the deliverable

Where are you on these

I Took the Keyboard Back From an Agent Mid-Task - Here's What the New PMP Can't Test

What changed in the exam

Certified is not practiced

What AI fluency looks like in the editor, not the exam

Why this is a level-up, not a layoff

I Sorted 40 Backlog Items by Shape Instead of Who's Free - Here's What Broke

The old default: sort by who's free

The change: sort by shape first

What broke

Why this is a career thing, not just a workflow thing

I Added a Human Veto to My PM Agent — Here's What Broke First

The Problem I Was Trying to Solve

What I Actually Built

What Actually Broke

Three Things I Didn't Know I Needed

Why This Matters More Than It Looks

Where I'm Landing
