Drafting OKRs With AI Without Writing Meaningless Goals

Ask a model to "write OKRs for my team" and you get back something that reads like a strategy deck and means nothing. "Objective: Become the market leader in developer tooling. Key result: Significantly improve user satisfaction." Every word is grammatical. None of it is measurable. You cannot tell on December 31st whether you hit it.

The problem is not the model. The problem is that drafting OKRs is mostly an exercise in resisting the easy phrasing your brain reaches for, and a model with no stake in the outcome reaches for that phrasing faster than you do. Used carelessly, an LLM is a vagueness amplifier. Used as an adversarial editor, it is genuinely useful — but only if you structure the work so the model is forced to commit to numbers and dates.

We drafted three quarters' worth of OKRs with Claude and GPT-4-class models to find where they help and where they quietly make things worse. Here is what actually works.

Make the model commit to a number or reject the line

The single highest-leverage move is to forbid unmeasurable key results at the prompt level. A key result that cannot be expressed as a metric with a starting value, a target value, and a deadline is not a key result — it is an aspiration. Most AI-drafted OKRs fail this test on the first pass.

Give the model the rule and make it self-check. A prompt that works:

"Draft 3 key results for this objective. Each key result MUST contain: a metric, a baseline (current value), a target value, and a date. If you cannot supply a real baseline, write [BASELINE UNKNOWN] instead of inventing one. Reject any key result that describes an activity ('launch X', 'ship Y') rather than a measurable outcome."

The [BASELINE UNKNOWN] instruction matters more than it looks. Left to its own devices, a model will fabricate a plausible-sounding baseline — "improve activation from 22% to 35%" — when it has no idea what your current activation rate is. That fabricated 22% then anchors the whole quarter. Forcing the model to flag missing data turns a confident hallucination into a visible to-do.
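You can enforce the same rule on your side of the conversation. The sketch below is one way to do it, assuming you have already parsed the model's draft into fields by hand or by script. The KeyResult class and the problems and render helpers are hypothetical names invented for this example, not part of any library.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

BASELINE_UNKNOWN = "[BASELINE UNKNOWN]"


@dataclass
class KeyResult:
    """Hypothetical container for one parsed key result."""
    metric: str
    baseline: Optional[float]  # None means nobody has supplied a real value yet
    target: Optional[float]
    deadline: Optional[date]


def problems(kr: KeyResult) -> List[str]:
    """List the reasons this key result is not yet measurable."""
    issues = []
    if not kr.metric.strip():
        issues.append("missing metric")
    if kr.baseline is None:
        issues.append("baseline unknown: pull it from analytics")
    if kr.target is None:
        issues.append("missing target")
    if kr.deadline is None:
        issues.append("missing deadline")
    return issues


def render(kr: KeyResult) -> str:
    """Print the key result, keeping the missing baseline visible."""
    baseline = BASELINE_UNKNOWN if kr.baseline is None else f"{kr.baseline:g}"
    target = "[TARGET MISSING]" if kr.target is None else f"{kr.target:g}"
    when = "[DATE MISSING]" if kr.deadline is None else kr.deadline.isoformat()
    return f"{kr.metric}: {baseline} -> {target} by {when}"
```

A key result with an empty problems list is ready for review. Anything else stays on the to-do list, and render keeps the gap visible instead of papering over it.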

Never let a model invent baselines. If it writes "increase retention from 40% to 55%" and you never gave it your retention number, that 40% is a guess dressed as a fact. Treat every number in an AI draft as [CITATION NEEDED] until you replace it with a value from your own analytics.
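A rough way to apply the [CITATION NEEDED] rule mechanically is to pull every percentage out of the draft and flag the ones you have not confirmed yourself. This is a minimal sketch that only looks at values written with a percent sign. The function name and the idea of a hand-maintained verified set are assumptions made for the example.

```python
import re
from typing import List, Set

# Matches a number immediately followed by a percent sign, e.g. "40%" or "22.5 %".
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def unverified_percentages(draft: str, verified: Set[float]) -> List[str]:
    """Return every percentage in the draft that is not in the verified set.

    `verified` holds the values you have checked yourself: baselines read
    from your analytics and targets your team has agreed on.
    """
    return [tok for tok in PERCENT.findall(draft) if float(tok) not in verified]
```

Run it on "increase retention from 40% to 55%" with only 55 in the verified set and it hands back the 40, which is the number the model made up.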

Catch the activity-disguised-as-outcome trap

The most common failure in real OKRs — human or AI — is the key result that measures effort instead of impact. "Ship the new onboarding flow" feels like a result. It is a task. You can ship it and have onboarding get worse.

Models are good at spotting this if you ask them to look for it specifically, and bad at avoiding it on their own. So run a second pass whose only job is the outcome/output distinction:

"For each key result below, classify it as OUTCOME (measures a change in user or business behavior) or OUTPUT (measures work completed). Rewrite every OUTPUT as an OUTCOME, or explain why no outcome metric exists yet."

This second pass routinely flips half a draft. "Ship onboarding flow" becomes "raise day-7 retention for new signups from baseline X to target Y." The shipping is now implied — you obviously have to build it — but the goal is the user behavior, not the commit.

The model is useful here precisely because it has no ego about the work. A human author wrote "ship the flow" because shipping the flow is the thing they control and the thing they will be busy doing. The model does not care, so it will happily reclassify it.
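Before spending a model call on the second pass, a crude pre-filter can catch the most obvious cases. The sketch below flags any key result that opens with an activity verb. The verb list is an assumption and far from complete, and a leading-verb check will miss outputs phrased any other way, so treat it as a first sieve rather than a replacement for the classification prompt.

```python
# Hypothetical starter list of verbs that usually signal work completed.
ACTIVITY_VERBS = {
    "launch", "ship", "build", "release",
    "deliver", "implement", "migrate", "complete",
}


def looks_like_output(key_result: str) -> bool:
    """Return True if the key result starts with an activity verb."""
    words = key_result.strip().lower().split()
    return bool(words) and words[0] in ACTIVITY_VERBS
```

"Ship the new onboarding flow" gets flagged. "Raise day-7 retention for new signups from baseline X to target Y" passes through.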

Use the model to generate the objections, not just the goals

Once you have a measurable draft, the highest-value use of the model is pre-mortem, not authoring. Feed it the finished OKR set and ask it to attack:

"You are a skeptical VP reviewing these OKRs. For each key result, name one way the team could technically hit the number while making the product worse, and one reason the target might be sandbagged or unrealistic."

This surfaces the gaming risk that every metric carries. Target "reduce support tickets by 30%" and the model will point out you can hit that by making it harder to find the support button. That is exactly the conversation you want to have before the quarter, not in the retro.
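The three prompts chain naturally: draft, reclassify, then attack. Here is one way to wire them together, assuming you supply your own ask function that sends a prompt to whichever model you use and returns its text reply. Both run_passes and ask are hypothetical names, and no particular vendor SDK is assumed.

```python
from typing import Callable, Dict

DRAFT_PROMPT = (
    "Draft 3 key results for this objective. Each key result MUST contain: "
    "a metric, a baseline (current value), a target value, and a date. "
    "If you cannot supply a real baseline, write [BASELINE UNKNOWN] instead "
    "of inventing one. Reject any key result that describes an activity "
    "('launch X', 'ship Y') rather than a measurable outcome."
)
CLASSIFY_PROMPT = (
    "For each key result below, classify it as OUTCOME (measures a change in "
    "user or business behavior) or OUTPUT (measures work completed). Rewrite "
    "every OUTPUT as an OUTCOME, or explain why no outcome metric exists yet."
)
PREMORTEM_PROMPT = (
    "You are a skeptical VP reviewing these OKRs. For each key result, name "
    "one way the team could technically hit the number while making the "
    "product worse, and one reason the target might be sandbagged or "
    "unrealistic."
)


def run_passes(objective: str, ask: Callable[[str], str]) -> Dict[str, str]:
    """Run the draft, outcome/output, and pre-mortem passes in order.

    `ask` is whatever function you already use to call a model; each pass
    receives the previous pass's reply as its input.
    """
    draft = ask(f"{DRAFT_PROMPT}\n\nObjective: {objective}")
    outcomes = ask(f"{CLASSIFY_PROMPT}\n\n{draft}")
    objections = ask(f"{PREMORTEM_PROMPT}\n\n{outcomes}")
    return {"draft": draft, "outcomes": outcomes, "objections": objections}
```

Keeping each pass as a separate call means every reply can be read and corrected by a human before it feeds the next one, which is the point of the exercise.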

Keep your real OKRs, their baselines, and their check-in history in one document the model can read back to you. The draft quality jumps when the model can see last quarter's actuals — it stops proposing targets that ignore your trend line and starts grounding them in where you already are.

This is where a structured workspace earns its keep. If your objectives, baselines, and weekly check-ins live in scattered docs and Slack threads, every AI session starts from zero and you re-explain context you already wrote down. Keeping them in one queryable place — a database with the objective, metric, baseline, target, owner, and confidence per row — means you can paste the whole picture into a prompt and get grounded edits instead of generic ones.
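The row-per-key-result layout can be sketched as a small record type plus a function that flattens the rows into text you can paste into a prompt. The OkrRow class, its field types, and the 0-to-1 confidence scale are assumptions made for illustration. Your own store will have its own schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OkrRow:
    """Hypothetical row: one key result with the fields named above."""
    objective: str
    metric: str
    baseline: Optional[float]  # None when no real value has been recorded
    target: float
    owner: str
    confidence: float  # assumed scale: 0.0 (no chance) to 1.0 (certain)


def to_prompt_context(rows: List[OkrRow]) -> str:
    """Render the rows as a pipe-separated table for pasting into a prompt."""
    lines = ["objective | metric | baseline | target | owner | confidence"]
    for r in rows:
        baseline = "[BASELINE UNKNOWN]" if r.baseline is None else f"{r.baseline:g}"
        lines.append(
            f"{r.objective} | {r.metric} | {baseline} | {r.target:g} "
            f"| {r.owner} | {r.confidence:.0%}"
        )
    return "\n".join(lines)
```

Prepending this table to any of the prompts gives the model your actual numbers to edit against, and a missing baseline shows up as a flagged cell rather than a gap the model is tempted to fill.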

What the model still cannot do for you

A model can enforce structure, kill vague phrasing, and stress-test targets. It cannot tell you whether the objective is the right one. "Should we be growing activation or revenue this quarter" is a judgment call about strategy and resourcing that depends on context the model does not have and should not fake. If you ask it to pick your objectives, it will produce four confident, plausible, generic ones — and confident generic strategy is worse than no strategy, because it looks done.

Use the model downstream of the decision. Decide what matters with your team. Then use AI to force that decision into measurable, un-gameable, baseline-grounded key results — and to argue with you about whether the numbers are honest.

Originally published at pickuma.com.