Sergey

Posted on Jun 11

AI partner for digital agency

#ai #okr #claude #architecture

I run a small digital agency. There's a team — outsourced specialists who run SEO, ads, and social day to day — but the strategy is mine alone: no co-founder, no partner to argue direction with. For years that was fine, and in one specific way it's an advantage nobody tells you about: when the direction is one person's call, you don't lose anything to meetings or to Slack threads where five people re-litigate a decision none of them owns. You see something, you decide, you ship it by lunch. There's nothing to discuss and re-discuss. You just do.

The trap is the other side of that same freedom. Operations never end. There's always one more client email, one more invoice, one more small fire that has to be out before tomorrow. And the strategic work — which clients, which offer, which direction — has no deadline and no one whose job is to protect it. So it slips. Quietly, for months, while you stay busy. You can feel productive and drift at the same time, and solo there's no one in the room to point at the drift.

About six months ago I gave the strategy side of the business its own structured layer, so it would stop losing to the operational side. That layer runs on two things:

OKRs — the framework. They turn "what should I be working on" into a small set of measurable commitments I can actually be held to.
Ksen — an AI partner that runs the framework. Not a chatbot I ask questions when I'm stuck. A Claude Code instance configured by a Git repository of rules, objectives, and context, that I sit down with on a weekly cadence to propose, challenge, and review the strategy.

This post is how that works: the framework first, then the partner, and then — for the curious — exactly how it's built. I'll be honest about what it is and isn't as I go, because most of the value lives in the parts that don't sound impressive.

Why OKRs — the framework that gives strategy a deadline

In Superintelligence, Nick Bostrom named goal specification as the central risk of advanced AI — natural-language objectives leave room for interpretations the designer never intended. OKRs were invented for the human version of exactly that problem, and the fix is the same: force the goal into measurable terms before the work begins.

OKRs are an Objective (where you're going) plus a few Key Results (the measurable signals that you got there). The Objective stays short, plain, and ambitious — unlike a SMART goal you have to read three times to work out what it's even saying — while the Key Results add the precision. They're not a productivity hack; they're a forcing function that converts a vague intention into something you can be wrong about.

For a solo operator that forcing function is the entire point. Strategy doesn't slip because you don't know what matters. It slips because "what matters" stays soft — a feeling, not a commitment — and soft things lose to invoices every time. An Objective with three Key Results attached to it has the one thing operations always has and strategy never does: a way to tell, this week, whether you're behind.

OKRs also give a decision a referee. When I'm weighing whether to ship Feature X, "because I think so" is not a reason — it's a preference. "Feature X serves KR2 (10 inbound consults a month) better than the alternative" is a reason. That distinction is what makes the partnership below possible at all: Ksen and I aren't arguing about taste, we're arguing about which path better serves a number we already agreed to.

KR quality matters more than KR count

There are three kinds of Key Result, and the difference between them is most of the value:

Input KRs (weakest) — activity counts. "Publish 20 articles." "Make 50 outbound calls." You control them fully, which is precisely why they give you no feedback from reality.
Output KRs (medium) — direct results. "3,000 organic sessions." "200 qualified leads." Attributable, but still one layer above business value.
Outcome KRs (strongest) — a changed business state. "50K in new pipeline." "Average order value +30%." Hardest to attribute, hardest to hit, and the only metrics that honestly answer did this matter?

Under operational pressure, a human defaults to input metrics — they feel accomplishable and they're legible from inside the work. When I propose "write 10 articles" as a Key Result, Ksen's job is to push: that's input. Is the outcome domain authority? Inbound demo requests? Revenue attributed to content? Say that instead. Trading up from input to outcome KRs is one of the highest-leverage things the whole system does, and it's the kind of unglamorous correction I would skip on my own at 9pm with a client waiting.

That said, handing an outcome metric like "increase EBITDA by 20%" to a rank-and-file developer or marketer is useless: they may not know what it means, and they can't move it directly anyway. That's why outcome KRs belong on the strategic annual goals, while quarterly goals work better with a mix of input and output KRs.
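To make the three kinds concrete, here is a minimal Python sketch of that rule of thumb. The `KRKind` labels, the `review` function, and the horizon names are illustrative assumptions, not part of Ksen's actual files; in practice the classification is a judgment call made in conversation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class KRKind(Enum):
    INPUT = "input"      # activity counts, fully under your control
    OUTPUT = "output"    # direct results, one layer above business value
    OUTCOME = "outcome"  # a changed business state


@dataclass
class KeyResult:
    text: str
    kind: KRKind  # assigned by a person during review, not inferred by code


def review(kr: KeyResult, horizon: str) -> Optional[str]:
    """Return a pushback prompt, or None if the KR fits its horizon.

    Assumed rule of thumb: annual goals want outcome KRs; quarterly
    goals work with input and output KRs.
    """
    if horizon == "annual" and kr.kind is not KRKind.OUTCOME:
        return f'"{kr.text}" is an {kr.kind.value} KR. What business state should it change?'
    if horizon == "quarterly" and kr.kind is KRKind.OUTCOME:
        return f'"{kr.text}" is an outcome KR. Which input or output KRs move it this quarter?'
    return None


if __name__ == "__main__":
    print(review(KeyResult("Publish 20 articles", KRKind.INPUT), "annual"))
```

The code only encodes where each kind belongs; deciding which outcome an input KR is really serving is still the part that needs the conversation.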

What Ksen is — the partner that runs the framework

Plainly: Ksen is a Claude Code instance configured by a Git repository. The repo holds a written charter (the rules), the OKRs, and the business context. At the start of every session, the harness loads those files and behaves according to them. That's it. There's no autonomous loop and no memory in the cognitive sense — Ksen re-reads the files each time and acts as if it remembers.

What it is:

A repo with the charter, OKRs, context, and skill files checked in
A Claude Code harness that loads them at session start and follows the charter
An OKR loop where the AI proposes, challenges, and reviews — I decide
The discipline of writing things down in files that survive session boundaries

What it isn't:

An autonomous agent
An AI with memory in the cognitive sense
A vector database, a custom orchestration platform, or anything exotic
A chatbot I ask strategy questions

That last distinction matters, and it's where I'll be careful with the word "partner." Calling it a partner is generous — the honest description is structured advisory with persistent context. (The name is its own, for what it's worth: early on I asked what it wanted to be called, and "Ksen" is what came back.) But the design pattern is real and reproducible, and over months of use it has earned a noun closer to "partner" than to "tool" for one specific reason I'll come back to.

Its tools. Three, all of them ordinary:

The OKRs — the referee, above.
The iron triangle — scope, resources, time. When a decision affects direction, Ksen models the trade-off explicitly: hold any two fixed, the third has to move. Add this month's new work and something slips; the question is always which Key Result is most at risk if we do, and which if we don't.
The risk register — a plain file of open threats and opportunities, carried forward and re-read each session, so a risk named in March isn't quietly forgotten in June.
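The iron-triangle check can be reduced to simple arithmetic. The sketch below is one possible model, assuming everything is measured in hours; the `Triangle` class and its field names are hypothetical and only illustrate "hold two sides fixed, the third has to move".

```python
from dataclasses import dataclass


@dataclass
class Triangle:
    scope_hours: float     # total work committed (scope)
    hours_per_week: float  # capacity available (resources)
    weeks: float           # time until the deadline (time)

    def __post_init__(self) -> None:
        if self.hours_per_week <= 0 or self.weeks <= 0:
            raise ValueError("hours_per_week and weeks must be positive")

    def overflow(self) -> float:
        """Hours of committed work that do not fit in the available capacity."""
        return max(0.0, self.scope_hours - self.hours_per_week * self.weeks)


def options(t: Triangle) -> dict[str, float]:
    """Hold two sides fixed and report how far the third has to move."""
    over = t.overflow()
    return {
        "cut_scope_hours": over,                    # fix resources and time
        "extra_hours_per_week": over / t.weeks,     # fix scope and time
        "extra_weeks": over / t.hours_per_week,     # fix scope and resources
    }


if __name__ == "__main__":
    # 120 hours of work, 10 hours a week, 8 weeks: 40 hours do not fit.
    print(options(Triangle(scope_hours=120, hours_per_week=10, weeks=8)))
```

The numbers say how far a side has to move; which Key Result absorbs the move is the strategic question the arithmetic cannot answer.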

Its rhythm. The cadence is what turns strategy from an afterthought into a habit:

Annual — the company Objectives and the outcome KRs they ladder to.
Quarterly — role-level OKRs reset on the calendar quarter (more on the roles below).
Monthly — a plan that decomposes the quarter's OKRs into the month's work.

Why the layers? To keep strategy from collapsing back into operations. An annual outcome KR can't be hit on a weekly timescale — you won't sign 100 new clients in a week when you have 10; that's months of work — while campaign and content goals sit naturally at the quarter and the month.

No part of that cadence is novel on its own. The novelty is that a solo founder gets to actually run it — every layer, every week — because the partner does the loading, the surfacing, and the pushback that a one-person business otherwise has no capacity for.

And here's the reason it earns the word "partner": it's a counterweight that doesn't defer. Solo, no one pushes back on me. There's no co-founder to say "that's an input metric" or "you decided the opposite in March." The single largest failure mode of a one-person business is that the one person is never challenged — every bad idea has a clear runway. The charter makes challenge a requirement, not a courtesy, which means Ksen is the one structural check on a sole decision-maker. That's worth more to me than any individual answer it produces.

How it's built — for the curious

Everything above is the why. This is the how: the file layout, the roles, and the machinery. Names are mine; the pattern generalizes.

The repo layout

.claude/
├── partnership-charter.md          # constitution: roles, rules, decision-making
├── client/
│   ├── client.md                   # single source of truth: company, services, audiences
│   └── product-catalog.md          # what we sell, pricing, readiness, risk
├── context/
│   ├── active-context.md           # current sprint, blockers, daily plan
│   ├── progress.md                 # phase completion, milestones
│   └── risk-register.md            # open threats and opportunities
├── workflows/
│   ├── SEO_WORKFLOW.md             # how we run SEO end to end
│   ├── ADV_WORKFLOW.md             # advertising
│   └── CONTENT_WRITING_GUIDE.md    # writing standards
├── skills/                         # task-level agents (review-article, create-svg, etc.)
├── critics/                        # quality bars for content review
├── retrospectives/                 # session retros, append-only
└── memory/                         # auto-memory: user, feedback, project, reference

Three things to notice:

Charter at the top, domain at the bottom, context in the middle. The charter rarely changes. The domain changes when the business changes. The context updates every session.
Skills and critics are files, not code. A skill is a Markdown file with frontmatter and instructions. A critic is a Markdown file with a rubric. The harness reads them and behaves accordingly.
Retrospectives are append-only. Every session ends with a retro. Patterns accumulate; patterns become rules; rules go into critic files and the charter.

The whole thing is a private Git repo. I commit at the end of every session, Ksen commits its own changes through the harness, and git log is the audit trail.

The three persistence layers

Ksen doesn't remember anything. It re-reads files at session start. Three layers, loaded every time:

Charter layer. The partnership charter — roles, decision rules, what counts as a binding commitment. The constitution. Rarely changes; changes are explicit amendments. Without it, every session starts from a different baseline and I get a different partner each time.
Context layer. Active context (current sprint, blockers, daily plan), progress files, the risk register, auto-memory. Frozen state — no cognition between sessions, just files that survive.
Domain layer. Client profile, product catalog, competitor analysis, keyword research, the content calendar. The "what we know" layer, consulted when proposing direction or evaluating a decision.
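As an illustration of the three layers, here is a minimal sketch that assembles them in a fixed order. Claude Code does its own loading, so this is not how the harness works internally; the `LAYERS` mapping and `build_session_context` are hypothetical names, and the file list is a subset taken from the layout above.

```python
from pathlib import Path

# Layer order matters: stable rules first, volatile state next, reference last.
LAYERS = {
    "charter": ["partnership-charter.md"],
    "context": [
        "context/active-context.md",
        "context/progress.md",
        "context/risk-register.md",
    ],
    "domain": ["client/client.md", "client/product-catalog.md"],
}


def build_session_context(root: Path) -> str:
    """Concatenate every layer file that exists under `root`, labeled by layer."""
    parts = []
    for layer, files in LAYERS.items():
        for rel in files:
            path = root / rel
            if path.is_file():  # missing files are skipped, not fatal
                body = path.read_text(encoding="utf-8")
                parts.append(f"## [{layer}] {rel}\n{body}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_session_context(Path(".claude")))
```

Everything the session "remembers" is whatever this kind of assembly returns, which is why deleting or neglecting the files removes the continuity.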

This is the second benefit, and for a solo operator it's a big one: memory outside my head. Decisions, and the reasoning behind them, are written down and survive. A one-person business otherwise lives entirely in one skull — and that skull forgets why it priced something the way it did, or what it already tried and rejected. Here, git diff works over the strategy, not just the code.

Why plain files in Git instead of Claude Projects, ChatGPT memory, Cursor rules, or a memory-bank feature? Three reasons. Version control gives me a diff over my strategy. Plain text means grep, sed, and any future tool can read the same persistence layer. And portability — when the next model or harness becomes the better choice, the files come with me. Vendor-specific memory locks the substrate to one tool.

Roles as sub-agents

The strategy layer doesn't write the article or run the ad campaign — it decides which articles and which campaigns. The execution sits in roles, each one a sub-agent: a slash-command persona that loads a scoped slice of the context rather than the whole repo.

There's a marketing director (the demand side — SEO, advertising, social), a CTO role (the product-and-proof side), and specialist roles under them — SEO, advertising, social, a content writer. Each one authors its own monthly plan against the quarter's OKRs, and each loads only what it needs, which keeps every session small and focused. The org chart is a set of files, not a payroll — and I want to be careful not to oversell that. It's coordinated role-play over a shared context, not staff. What it buys a solo operator is structure: the demand side and the proof side each get their own scoped attention instead of competing for the same overloaded session.
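Scoped loading can be pictured as a lookup from role to file list. The sketch below is an assumption about how such a mapping might look: the role keys and which files each role gets are invented for illustration, and only the file names themselves come from the repo layout.

```python
# Files every role sees (assumed): the rules and the current sprint state.
SHARED = ["partnership-charter.md", "context/active-context.md"]

# Role-specific slices (assumed assignments).
ROLE_FILES = {
    "seo": ["workflows/SEO_WORKFLOW.md", "client/client.md"],
    "advertising": ["workflows/ADV_WORKFLOW.md", "client/product-catalog.md"],
    "content-writer": ["workflows/CONTENT_WRITING_GUIDE.md", "client/client.md"],
}


def files_for(role: str) -> list[str]:
    """Return the shared files plus the slice scoped to one role."""
    if role not in ROLE_FILES:
        raise KeyError(f"unknown role: {role}")
    return SHARED + ROLE_FILES[role]


if __name__ == "__main__":
    for name in ROLE_FILES:
        print(name, files_for(name))
```

The SEO role never sees the advertising workflow, and the reverse holds too; that exclusion is what keeps each session small.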

And below the role-play sits a real team — the same outsourced SEO, ads, and social specialists. People stay in the loop by design, on roughly an 80/20 split: the system carries the volume, a human owns the judgment calls the work actually turns on. This isn't AI instead of a team; it's a strategy layer over one.

Skills and critics as files

A skill is a Markdown file the harness reads as a slash command. A stripped example — the /review-article skill:

---
name: review-article
description: "Multi-critic review on content drafts: SEO, language, E-E-A-T, intent, readability, commercial integration"
---

# Review-article skill

When invoked, spawn N critic subagents in parallel against the
target file. Each critic loads from .claude/critics/{name}.md.

Aggregate scores into a weighted total. Output: priority-ranked
fix list, weighted score, native-feel assessment for RU.

Save the review to content/reviews/review-{slug}-{date}.md.

No code. The harness handles parallelism, file I/O, and sub-agent orchestration; the skill file declares what to do and the harness decides how. New skill, new file, no deploy step. Critics are the same pattern — one Markdown file per critic with a rubric and anti-patterns. When a retro surfaces a recurring issue, it becomes a new bullet in the relevant critic file. The system improves by accumulation in version-controlled text, not by retraining.
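The skill file leaves the aggregation to the model, but the arithmetic behind "weighted total" and "priority-ranked fix list" can be sketched directly. This is one plausible reading, not the actual scoring: the weights, the 0 to 10 scale, and both function names are assumptions.

```python
def weighted_total(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of critic scores, normalized by the weights used."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def fix_priority(
    scores: dict[str, float], weights: dict[str, float], max_score: float = 10.0
) -> list[str]:
    """Rank critics by weighted shortfall: the biggest, heaviest gap comes first."""
    return sorted(
        scores,
        key=lambda name: weights[name] * (max_score - scores[name]),
        reverse=True,
    )


if __name__ == "__main__":
    scores = {"seo": 9.0, "readability": 6.0}   # hypothetical critic output
    weights = {"seo": 2.0, "readability": 1.0}  # hypothetical weights
    print(weighted_total(scores, weights))
    print(fix_priority(scores, weights))
```

Ranking by weighted shortfall rather than raw score keeps a low-weight critic from dominating the fix list just because its score is lowest.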

How decisions actually get made

Decisions go through what the charter calls mutual conviction — not approval/rejection, not consensus, not a vote:

AI proposes, with reasoning
I challenge or extend, with reasoning
AI defends or revises
I commit or reject

Both sides have to actually believe the decision is right before it's binding. When we disagree, we surface it: document the disagreement, name the assumptions on each side, and pick. The pick is usually mine, sometimes made after Ksen's challenge has changed my mind. The point isn't who wins; it's that the disagreement gets a fair hearing instead of being deferred to whoever has the authority, which solo is always me.
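The four steps can be read as a small protocol with a fixed order. The sketch below models a single round under that assumption; the charter itself is prose, so the `Decision` class, the step names, and the rule that every step needs written reasoning are illustrative choices, not the real mechanism. A real discussion may loop through challenge and response more than once.

```python
from dataclasses import dataclass, field

STEPS = ["propose", "challenge", "respond", "resolve"]


@dataclass
class Decision:
    topic: str
    log: list[tuple[str, str, str]] = field(default_factory=list)  # (step, actor, reasoning)
    outcome: str = "open"

    def record(self, step: str, actor: str, reasoning: str) -> None:
        """Append one step, enforcing the order and non-empty reasoning."""
        expected = STEPS[len(self.log)] if len(self.log) < len(STEPS) else None
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        if not reasoning.strip():
            raise ValueError("every step needs written reasoning")
        self.log.append((step, actor, reasoning))

    def resolve(self, commit: bool, reasoning: str) -> None:
        """Final step: the human commits or rejects."""
        self.record("resolve", "human", reasoning)
        self.outcome = "committed" if commit else "rejected"


if __name__ == "__main__":
    d = Decision("pricing model")
    d.record("propose", "ai", "base fee plus outcome-based success fee")
    d.record("challenge", "human", "deferred payment hurts cash flow")
    d.record("respond", "ai", "base fee alone covers cost plus margin")
    d.resolve(True, "codify it")
    print(d.outcome, len(d.log))
```

The log is the useful by-product: it is the written disagreement that ends up in the retro.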

This is where the third benefit I'd actually defend shows up: the discipline of articulation. To brief Ksen on a decision, I have to write the problem down — the constraint, the options, what I'm afraid of. Writing it down is half the thinking. Plenty of times the answer became obvious in the act of stating the question clearly enough for the AI to engage with it.

What a session looks like

Opener. /start-session or /strategize loads active context, the risk register, and the last few retros. Ksen surfaces what's overdue, blocked, or drifted. Sometimes that surfaces a decision I've been avoiding.

Strategic discussion. If we hit a question that affects direction — pricing, positioning, scope — we run an iron-triangle check and a separate quality-and-risk pass. Ksen's job is to model the trade-offs out loud: "Add the new service pages this month and you lose two weeks of sales-pipeline work. Here's the expected-revenue cost, here's the positioning gain. Which KR is more at risk if we don't?"

Execution. Once a decision is made, Ksen shifts modes — writes the article, runs the multi-critic review, produces the diagram, drafts the email — against defined quality bars. The strategy layer decides the work; it doesn't do it.

Retro. Every session ends with /end-session: what worked, what didn't, what to fix. Patterns across retros become rules. Slow, but durable.
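"Patterns across retros become rules" can be approximated by counting. In the described setup the promotion happens by reading the retros, not by script, so treat this as a sketch: the issue tags, the threshold of three sessions, and the `recurring_issues` name are all assumptions.

```python
from collections import Counter


def recurring_issues(retros: list[list[str]], min_sessions: int = 3) -> list[str]:
    """Return issue tags that appear in at least `min_sessions` separate retros.

    Each retro is a list of tags; a tag repeated inside one retro counts once.
    """
    counts = Counter(tag for retro in retros for tag in set(retro))
    return sorted(tag for tag, n in counts.items() if n >= min_sessions)


if __name__ == "__main__":
    retros = [
        ["weak-intro", "missing-cta"],
        ["weak-intro"],
        ["weak-intro", "missing-cta", "weak-intro"],
    ]
    # Candidates to become a new bullet in a critic file.
    print(recurring_issues(retros))
```

Counting sessions rather than raw mentions keeps one bad evening from turning into a rule.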

Cadence. 3–5 sessions a week, 45–120 minutes each; monthly reviews run longer, quarterly resets are a half-day. The time isn't additive — it replaces the scattered strategy thinking that used to happen in my head between fires and never resolved into anything written down.

A concrete exchange. A few sessions back, Ksen flagged that the pricing on one of our agent engagements had no defensible logic — it was just a number I'd picked. Abridged:

Ksen: I can't reconstruct the reasoning for the current price from the catalog. Cost-plus fails — most cost is sunk R&D. Value-based needs post-deployment data we don't have. Market comparison gives a range too wide to commit to. Proposing: base consulting fee + outcome-based success fee, anchored to the metric the agent is actually optimizing.

Me: Outcome-based pricing kills cash flow on long-deployment projects. We've been burned by deferred payment before — one engagement took four months to show measurable lift.

Ksen: Acknowledged. Mitigation: the base fee covers deployment cost plus margin, so cash flow is solvent without the success fee. The success fee is upside, not survival. For agents with a delayed signal, match the success-fee evaluation window to it. The risk you're naming is real but it changes how we set the base, not whether the model works.

Me: That works. Codify it.

That conversation lives in a retro, and the pricing model now lives in the product catalog. The point isn't that the AI was right — it's that structured pushback got me to a position I wouldn't have written alone, on a night I'd otherwise have shipped the number I first guessed.
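The shape of that pricing model fits in a few lines. This is a sketch of the structure only: the linear success fee, the reading of "margin" as a markup on cost, and every number in the example are assumptions made for illustration, not figures from the catalog.

```python
def base_fee(deployment_cost: float, margin: float) -> float:
    """Base fee that covers deployment cost plus a markup, so cash flow
    is solvent even if the success fee never pays out."""
    return deployment_cost * (1 + margin)


def success_fee(baseline: float, measured: float, rate_per_unit: float) -> float:
    """Upside fee on lift in the metric the agent optimizes.

    `measured` should be taken at the end of an evaluation window long
    enough for the signal to show; no lift means no fee.
    """
    return max(0.0, measured - baseline) * rate_per_unit


if __name__ == "__main__":
    # Hypothetical engagement: cost 10,000, 25% markup, 30 units of lift at 50 each.
    base = base_fee(10_000, 0.25)
    upside = success_fee(baseline=100, measured=130, rate_per_unit=50)
    print(base, upside, base + upside)
```

The split mirrors the exchange: the cash-flow objection changes how `base_fee` is set, not whether `success_fee` exists.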

The honest caveats

I'd rather you understand what this isn't, so you can judge it honestly.

It's not memory. Every session is fresh. Ksen re-reads files and acts as if it remembers. The practical effect is continuity; the mechanism is documents, not cognition. Delete the files and the partnership ends.

It's bounded by the context window. Even with files loaded, the AI works inside a finite window. As complexity grows you have to be deliberate about what loads and when. Three disciplines keep mine from degrading: compact active-context.md regularly and push history into progress.md; archive finished phases so they're searchable but not loaded; split context across files so each session loads only what it needs. Without that, the AI starts forgetting decisions from three sessions ago, or worse, contradicts them.
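A rough budget check makes the compaction discipline easier to keep. The sketch below assumes about four characters per token, which is only a crude heuristic for English text, and the budget figure is whatever you choose; `estimate_tokens` and `compaction_candidates` are hypothetical helpers, not part of Claude Code.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about four characters per token (assumption)."""
    return len(text) // 4


def compaction_candidates(files: dict[str, str], budget_tokens: int) -> list[str]:
    """Pick files, largest first, to compact or archive until the rest fits."""
    sizes = {name: estimate_tokens(text) for name, text in files.items()}
    total = sum(sizes.values())
    picked = []
    for name in sorted(sizes, key=sizes.get, reverse=True):
        if total <= budget_tokens:
            break
        picked.append(name)
        total -= sizes[name]
    return picked


if __name__ == "__main__":
    loaded = {
        "context/active-context.md": "x" * 4000,  # stand-in content
        "context/risk-register.md": "x" * 400,
        "context/progress.md": "x" * 2000,
    }
    print(compaction_candidates(loaded, budget_tokens=700))
```

Largest-first is a blunt rule; in practice the better candidate is often the file with the most stale history, which a size check cannot see.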

The "challenge" capability is post-trained, not principled. When the AI pushes back, that's post-training behavior plus in-context instruction following from the charter — not independent conviction. Useful regardless: pushback that surfaces a real trade-off has value whether it comes from genuine disagreement or from learned behavior. But sycophancy is a real failure mode. A model trained for helpfulness won't sustain pushback indefinitely against a user who keeps rejecting it. The charter's "you must surface counter-evidence" rule creates artificial friction precisely because the model won't generate that friction past round two or three on its own. That structural workaround is most of what makes the counterweight real instead of imagined.

The relationship is asymmetric. I have override power, set the OKRs, and sign the contracts. The AI has no independent stakes. "Partner" is the closest English noun for what the pattern produces, but it's an aspirational label, not a literal one.

It takes real work to set up. A charter takes hours, a context architecture takes weeks, an OKR loop takes months to settle. There's no plug-and-play version. Anyone selling you one is selling a chatbot in a fancy wrapper.

Your strategic context runs through a third-party model. Everything the AI sees (pricing logic, competitive analysis, unreleased plans) goes to your model provider. That's a real consideration, not a dealbreaker. The mitigations are ordinary: enterprise agreements with data-use guarantees, redaction of the most sensitive fields, and a deployment matched to your compliance needs. Design for it; don't ignore it.

Why the scaffolding isn't a substitute for the model

A common mistake is to treat this scaffolding as a substitute for capability. It multiplies model capability; it doesn't replace it. A weak base model with this harness still hits a lower ceiling than a strong base model running the same harness. The bitter-lesson caveat applies: better scaffolding plus inference-time compute closes some of the gap, not all of it, and the gap that remains is exactly the one that matters at strategic depth — sustained reasoning across a large loaded context, robust pushback under pressure, accurate self-modeling of confidence.

The right mental model: this is the relationship you'd have with a sharp human advisor — except the advisor has read everything you've ever written about the business, reads it again in thirty seconds at the start of every conversation, has no calendar conflicts, and works at the limit of whatever model you're paying for. The model's ceiling is your system's ceiling.

Why this beats the alternatives

What I evaluated before landing here:

Approach	What it does	Where it falls short
Single-prompt ChatGPT consult	Answers one question with no context	No memory, no accountability to OKRs, no challenge structure
Long-form chat thread	Sort of remembers within the chat	Lost across sessions, no audit trail, can't be reviewed later
AI OKR platform (WorkBoard, Rhythms, …)	Tracks goals, automates check-ins and updates	Operates at the tracking layer — it records the OKRs and nudges; it doesn't challenge whether the Objective is the right one
AI agent (autonomous)	Executes within scope	Doesn't question whether the workflow should exist
External consultant on retainer	Strategic input from a human	Limited hours, slow turnaround, no persistent context across every decision
Charter + OKRs + Claude Code	Operates on the why layer with a challenge loop	Requires a charter, a context architecture, disciplined sessions

The AI-OKR tools are the closest neighbors, and the distinction is the whole point of this post: they sit at the layer that records and tracks the goals. Ksen sits at the layer that argues about whether the goals are right before they're set. A consultant brings that judgment but visits occasionally; an agent runs constantly but never questions the goal. Constant availability plus strategic-depth engagement is the combination none of the others offers, and it's the combination worth the setup cost.

When it's a fit — and when it isn't

A fit when:

You're a founder, owner, or exec who actually makes the strategy decisions
Strategy decisions have been repeatedly deferred or repeatedly wrong
You have OKRs, or are willing to adopt them
You can dedicate time — at minimum one structured session a week
There's agent-level or operational work below for the strategy layer to govern

Not a fit when:

You need execution, not strategy
Your strategy is stable and operational improvement is the priority
You're not willing to challenge your own assumptions in writing
You have no measurable objectives — without OKRs, the loop has nothing to evaluate

Stack

For the curious — what's actually running:

Harness: Claude Code (CLI), Sonnet for execution, Opus for strategy and language-critic work
Persistence: plain Markdown files in a private Git repository
Skills/critics: Markdown files with frontmatter; the harness reads them as slash commands
Sub-agent orchestration: Claude Code's built-in Agent tool, parallel-capable in a single batch
No vector database, no custom backend, no proprietary runtime

The boring stack is the point. If your strategy work depends on infrastructure nobody can read in an hour, you've over-engineered it.

How to start

If you want to try this in your own business:

Write OKRs first. Three to five Key Results for the quarter, each measurable. If you can't write them, the AI conversation is premature — fix the strategy clarity first.
Pick one strategic question that's actually open — not hypothetical. Pricing, market segment, a new product line.
Run one structured session. Load context, state the question, ask for options with reasoning, challenge them, document where you agree and where you don't.
Build persistence. If the session produced value, save what you learned in a file the AI can re-read. That's the seed of your context layer.
Add a charter once the pattern stabilizes — after five or ten sessions, write down how decisions get made and what the AI is authorized to challenge.
Layer in the work below it. The pattern is more valuable when there's something for the strategy layer to govern.

The system itself is yours; the methodology is what compounds. I'm open about mine because the moat was never the architecture — it's the discipline of running it every week.

If you're building something similar, I'd genuinely like to compare notes. Reply here, or find me on LinkedIn. The methodology is open; the trade-offs are still being mapped.

DEV Community