DEV Community: Gregory Shevchenko

Marketing agents need workflow boundaries, not better prompts

Gregory Shevchenko — Mon, 08 Jun 2026 12:30:47 +0000

On June 6, 2026, I spent the day at Profound HQ in New York City as a selected participant and solo builder at the Marketing Engineering Hackathon.

I wrote the canonical field note here:

Profound Marketing Engineering Hackathon field note

This DEV.to version is the builder-facing essay. The Medium version is more reflective. This one is about the operating lesson I took from the room:

Marketing agents do not become useful because they produce better paragraphs.

They become useful when they run a bounded workflow with clear inputs, preserved evidence, explicit review gates, and a measurement loop.

That sounds less flashy than "autonomous marketing agent."

It is also the part that actually matters.

The hackathon prompt was the useful constraint

Profound framed the day around a sharp build prompt:

Find a marketing process that's inhuman in scope or scale, and ship a system or agent that runs it.

That is a better prompt than "build an AI marketing tool."

It forces you to stop thinking in artifacts and start thinking in systems.

A blog post is an artifact. A dashboard is an artifact. A generated brief is an artifact.

The real question is different.

What process produces that artifact?

What evidence does it depend on?

What state does it need to remember?

Where should a human approve the next step?

That distinction is where a lot of marketing-agent work breaks.

Many systems are built around a single impressive output. They can generate a landing page, a report, a campaign idea, or a polished draft.

But when you ask what happens before and after the output, the system gets blurry.

Where did the sources come from?

Which claims were allowed?

What changed since the previous run?

What happens when evidence is weak?

Who decides whether draft becomes publish?

If the agent cannot answer those questions, it is not a workflow yet.

It is a demo with a nice ending.

AEO/GEO is a workflow problem

My work is mostly around AEO/GEO, AI Search visibility, ContentOS, and marketing agents.

AEO and GEO are often described as tactics for getting cited by answer engines.

That includes ChatGPT Search, Perplexity, Google AI Overviews, Gemini, Claude, Copilot, and other AI search surfaces.

That description is useful, but incomplete.

If the job is only "write pages that AI systems might cite," then the solution sounds like a content checklist:

answer the question clearly
mention the right entities
add schema
publish on trusted surfaces
build topical authority

All of that helps.

But the real operating problem is larger.

An AEO/GEO workflow has to answer the same questions repeatedly:

Which buyer prompts matter?
Which engines and regions are being checked?
Which answers mention us?
Which answers cite competitors instead?
Which third-party sources are winning?
Which first-party pages are strong enough to deserve citation?
Which claims need more proof?
Which source gaps should become pages, docs, case studies, decks, or partner mentions?
What changed after the last publishing cycle?

That is not one content task.

It is an evidence pipeline.

And once you see it that way, the right architecture changes.

The useful unit is a packet, not a post

For AI Search visibility, I do not think the useful unit is "an AI-written article."

The useful unit is a packet.

A good packet contains:

the prompt set
the current answer snapshots
the cited-source analysis
the competitor/source gaps
the claim inventory
the approved source pack
the canonical page or distribution asset
the human review decision
the follow-up measurement window

The article is only one output inside that packet.
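One way to make the packet concrete is a small record type. The sketch below is illustrative only: the `Packet` class, its field names, and the `missing_parts` helper are assumptions introduced here, not part of any tool named in this post.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class Packet:
    """Hypothetical container for one AI Search visibility packet."""
    prompt_set: list[str]
    answer_snapshots: dict[str, str] = field(default_factory=dict)    # prompt -> saved answer text
    cited_sources: dict[str, list[str]] = field(default_factory=dict)  # prompt -> cited URLs
    source_gaps: list[str] = field(default_factory=list)
    claim_inventory: list[str] = field(default_factory=list)
    approved_sources: list[str] = field(default_factory=list)
    canonical_asset: Optional[str] = None       # URL or path of the page being shipped
    review_decision: Optional[str] = None       # e.g. "approved" or "rejected"
    remeasure_after_days: Optional[int] = None  # follow-up measurement window

    def missing_parts(self) -> list[str]:
        """Return the names of fields that are still empty or unset."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


packet = Packet(prompt_set=["best crm for startups"])
print(packet.missing_parts())
```

A reviewer can then refuse any packet whose `missing_parts()` list is not empty, which keeps the article from being treated as the whole deliverable.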

That framing makes the agent easier to design.

Instead of asking an agent to "do marketing," you can give it a bounded job:

inspect this prompt set
classify cited sources
find missing first-party answers
draft a source-backed brief
check unsupported claims
prepare a human approval packet
schedule remeasurement

Each step has inputs.

Each step has outputs.

Each step can fail.

Each step can be reviewed.

That is boring in the best possible way.

Draft is not publish

One rule I keep coming back to:

Draft is not publish.

This sounds obvious until you watch teams wire agents straight into external surfaces.

For marketing operations, the difference between a draft action and a publish action is not cosmetic. It is a trust boundary.

A draft can be wrong and still be useful.

A published page with weak sources can become a liability.

A suggested source gap can be helpful.

A fabricated claim attached to a brand can damage the source graph you are trying to build.

This is why marketing agents need permission boundaries:

read-only inspection
draft generation
internal review
human approval
external publish
post-publish measurement

Those should not be one permission level.

They should be separate states.

The agent should know which state it is in.

The human should know what is being approved.

The system should keep the evidence trail.
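The separate states can be sketched as a tiny state machine. This is a minimal illustration, assuming a strictly linear workflow. The `Stage`, `Workflow`, and `WorkflowError` names are hypothetical and do not come from any tool mentioned here.

```python
from enum import Enum


class Stage(Enum):
    INSPECT = "read-only inspection"
    DRAFT = "draft generation"
    INTERNAL_REVIEW = "internal review"
    HUMAN_APPROVAL = "human approval"
    PUBLISH = "external publish"
    MEASURE = "post-publish measurement"


STAGE_ORDER = list(Stage)  # Enum iteration follows definition order


class WorkflowError(Exception):
    """Raised when a transition would cross a boundary it may not cross."""


class Workflow:
    def __init__(self) -> None:
        self.stage = Stage.INSPECT
        self.approved_by = None
        self.trail = []  # evidence trail of transitions and approvals

    def approve(self, reviewer: str, note: str) -> None:
        """Record a human decision; only valid in the approval stage."""
        if self.stage is not Stage.HUMAN_APPROVAL:
            raise WorkflowError(f"nothing to approve in stage: {self.stage.value}")
        self.approved_by = reviewer
        self.trail.append(f"approved by {reviewer}: {note}")

    def advance(self) -> Stage:
        """Move to the next stage, refusing to publish without approval."""
        index = STAGE_ORDER.index(self.stage)
        if index + 1 == len(STAGE_ORDER):
            raise WorkflowError("workflow is already in its final stage")
        next_stage = STAGE_ORDER[index + 1]
        if next_stage is Stage.PUBLISH and self.approved_by is None:
            raise WorkflowError("external publish requires a recorded human approval")
        self.trail.append(f"{self.stage.value} -> {next_stage.value}")
        self.stage = next_stage
        return next_stage
```

The design choice that matters is that publishing is a transition the agent cannot take by itself: `advance()` raises unless `approve()` has been called by a named reviewer, and both events land in `trail`.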

The source graph is part of the product

One thing the hackathon clarified for me is that the event record itself should be source-backed.

The canonical note on my site links to:

the public Profound event page
the kickoff deck
the official event photo gallery
the LinkedIn discussion
the Medium reflection

That is not just tidy blogging.

It is a source graph.

If you want AI systems and humans to understand a person, company, product, event, or case study, the public source graph has to be legible.

It should say what happened, when it happened, where it happened, what role the person played, and which public sources support the claim.

That matters for portfolio pages.

It matters even more for AEO/GEO work.

If your own site does not make the source of record clear, distribution platforms will compete with your canonical page instead of reinforcing it.

For AI visibility, that is a structural mistake.

What I would measure first

If I were turning this into an implementation checklist for a marketing team, I would not start by publishing more pages.

I would start with a small measurement loop.

The first version can be simple:

Pick ten buyer prompts
Run them across one or two answer engines
Save the answer snapshots
Record brand mentions separately from citations
List the cited URLs
Classify the missing source gaps
Choose one canonical page to improve or create
Publish the smallest useful source-backed asset
Re-run the same prompts after the next crawl window

That loop changes the work.

The team is no longer asking, "What should we write next?"

The team is asking, "Which source gap is blocking a useful answer, and what evidence-backed asset would close it?"

That is where AEO/GEO starts to feel like engineering.

Full stop.

Where companies go wrong

The common failure mode is not lack of AI tooling.

It is unclear ownership of the workflow.

One team owns content. Another owns SEO. Someone else owns analytics. A founder owns positioning. A freelancer owns distribution. Then an AI tool is dropped into the middle and asked to "make it faster."

The result is usually more output with the same weak source graph.

The fix is not to add more prompts.

The fix is to define the workflow boundary:

what question the workflow answers
which sources it is allowed to use
which claims require approval
what output counts as a draft
what output is allowed to publish
when the same prompt set gets remeasured

Once that boundary exists, the agent can be useful.

Without it, the agent only accelerates ambiguity.

Marketing work needs engineering discipline here. Not because every marketer needs to become a software engineer, but because the work needs system boundaries, state, and proof.

The standard should be demo, not slideware

The Profound hackathon was judged by demo, not by deck.

That standard is healthy for marketing AI.

A slide can hide a missing source.

A polished paragraph can hide a weak claim.

A screenshot can hide the lack of retry logic.

A demo has to show the shape of the system:

what goes in
what comes out
where state lives
where evidence is preserved
where a human decides
what happens when the proof is weak

That is the bar I want for marketing agents.

Not "can it generate content?"

"Can it run a bounded part of a real marketing workflow without hiding the evidence?"

The takeaway

The best version of marketing engineering is not old growth work with a new title.

It is a change in the unit of work.

Instead of asking:

What content should we publish?

Ask:

What marketing process is too large, too evidence-heavy, or too fast-changing to run manually, and what system would make it repeatable without hiding the proof?

That question is where AEO/GEO gets interesting.

It is also where marketing agents become useful.

Not as chatbots.

Not as content mills.

As bounded systems that help teams inspect, decide, produce, publish, and measure with a trail a human can trust.

FAQ

Q: Was this a speaking event?

A: No. My role was selected participant and solo builder. The canonical field note keeps that role explicit because builder participation and speaking should not be collapsed into the same claim.

Q: Why publish a DEV.to version if the canonical note already exists?

A: DEV.to reaches a more technical reader. The canonical page records the event and source graph. This version adapts the lesson for people building agents, workflow systems, and AI-search pipelines.

Q: What is the main engineering lesson?

A: A marketing agent needs a boundary. It should know its input, state, allowed sources, review point, output type, and measurement loop before it is trusted with real marketing work.

Sources

Canonical field note: Profound Marketing Engineering Hackathon field note
Medium reflection: I spent a day at Profound's Marketing Engineering Hackathon
Profound event page: The Marketing Engineering Hackathon
Profound kickoff deck: Profound Hackathon Kickoff
Official event photo gallery: Hackathon NY by Profound
LinkedIn discussion: Gregory Shevchenko on LinkedIn

Originally published on gregshevchenko.com.

I open-sourced the core of how we get clients cited by AI

Gregory Shevchenko — Sat, 06 Jun 2026 00:46:44 +0000

Last year I watched an AI engine praise a client in one sentence and cite only their competitors in the next. The client was thrilled to be "mentioned." I was not, because a mention you cannot trace back to your own URL is applause with no door behind it.

That gap is the whole problem with AI search, and almost nobody is measuring it. A brand mention is when the model says your name. A citation is when it attributes the answer to a page on a domain you own. Mentions feel nice and disappear. Citations compound, because every one sends the engine and the reader back to a property you control.

We build AI-marketing agents at Humanswith.ai, and we run the same loop for clients every week: measure, produce, optimize, design. This year I open-sourced the core of it: four small MIT-licensed tools that need no account and no API keys. They are the honest skeletons of the agents inside our hosted Workspace. The Workspace runs the loop at scale. The free tools hand you the method.

TL;DR — one tool per step:

ai-visibility-probe-lite — measure mention-share versus citation-share across engines.
contentos-agent-lite — write from an eight-gate process with canonical-first distribution.
aeo-site-audit-lite — audit the three retrieval gates: fetchable, chosen, extractable.
brand-card-lite — generate on-brand social cards from a tokens file.

Here is the loop, one tool at a time.

How do you find the prompts where AI search ignores you?

You find them by running brand-free discovery prompts through the engines and recording who gets named and who gets cited. The first measurement is almost embarrassingly simple: ask the questions your buyers actually type, then read the answers.

The first tool, ai-visibility-probe-lite, is a kit for exactly that. You write brand-free discovery prompts, the questions a buyer asks before they have heard of you ("best X for Y," never "is Acme any good?"), run them yourself in ChatGPT, Perplexity, or Gemini, and paste the answers back. It then reports two numbers that most dashboards collapse into one: mention-share, how often the answer named you, and citation-share, how often it cited a URL you own. Brand-free prompts are the discipline here. Asking "is my brand good" measures how the model feels about you, not whether it treats you as a source. [1]
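To show why the two numbers diverge, here is a minimal sketch of the arithmetic. It is not the tool's actual code: the `visibility_shares` function, the answer record shape (`text`, `cited_urls`), and the example domains are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse


def visibility_shares(answers, brand, owned_domains):
    """Compute mention-share and citation-share over a list of saved answers.

    answers: list of dicts shaped like {"text": str, "cited_urls": [str, ...]}
    """
    if not answers:
        return {"mention_share": 0.0, "citation_share": 0.0}

    name = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
    owned = {d.lower() for d in owned_domains}

    def is_owned(url):
        # Match the domain itself or any subdomain, but not lookalike hosts.
        host = (urlparse(url).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in owned)

    mentions = sum(1 for a in answers if name.search(a["text"]))
    citations = sum(1 for a in answers if any(is_owned(u) for u in a["cited_urls"]))
    total = len(answers)
    return {"mention_share": mentions / total, "citation_share": citations / total}


answers = [
    {"text": "Acme and Globex are popular.", "cited_urls": ["https://globex.example/compare"]},
    {"text": "Many teams pick Acme.", "cited_urls": ["https://docs.acme.example/guide"]},
    {"text": "Globex leads here.", "cited_urls": []},
    {"text": "Consider Initech.", "cited_urls": ["https://notacme.example/x"]},
]
print(visibility_shares(answers, "Acme", ["acme.example"]))
# {'mention_share': 0.5, 'citation_share': 0.25}
```

In this made-up sample the brand is named in half the answers but cited in only a quarter of them, which is exactly the gap a single blended "visibility" number would hide.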

The first time you run this on your own company, it stings a little. That sting is the point.

How do you produce content an AI engine will actually cite?

You produce citable content by writing from a documented process with sources, not from a single prompt. The fastest way to write a page no engine will ever cite is the opposite: paste a topic into a chatbot and ship the first draft. The second tool exists to stop that.

contentos-agent-lite is a content agent you run inside your own coding assistant. It walks eight gates: business context, research, a source pack, a brief, a draft, an editorial pass, a publish-readiness check, and distribution. It will not invent a fact that has no source, which already puts it ahead of most AI writing.

Its eighth gate is the one teams skip and then pay for. Gate 08 is canonical-first distribution. Before anything ships, it lints the page for the signals that decide whether an engine can fetch and attribute a citation: one canonical URL, one H1, Open Graph tags, Article structured data. Then it drafts the LinkedIn, Medium, and dev.to versions, and every one of them links back to your canonical page. Publish on your own domain first. Make every rented copy point home, so the authority piles up where you own it instead of where you borrow it. [2]
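The "every copy links home" rule is easy to check mechanically. The sketch below is one possible check, not the lint that contentos-agent-lite actually runs. The `normalize` and `links_back` helpers and the example URLs are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Rough URL matcher: stops at whitespace and at common Markdown/HTML delimiters.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def normalize(url: str) -> tuple[str, str]:
    """Reduce a URL to (host, path) so tracking parameters, a leading
    "www.", and trailing slashes do not cause false mismatches."""
    parts = urlsplit(url.rstrip(".,;:'\""))
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path.rstrip("/")


def links_back(copy_text: str, canonical_url: str) -> bool:
    """True if the distributed copy contains a link to the canonical page."""
    target = normalize(canonical_url)
    return any(normalize(u) == target for u in URL_PATTERN.findall(copy_text))


copy = "Read the full note at https://www.example.com/notes/hackathon/?utm_source=devto."
print(links_back(copy, "https://example.com/notes/hackathon"))  # True
```

Running a check like this against each platform draft before it ships turns canonical-first from a habit into a gate.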

Why can a well-written page still go uncited?

A page can be beautifully written and still never get cited, because three gates sit in front of every citation, and most people only ever check the first.

The third tool, aeo-site-audit-lite, checks all three.

Fetchable: can a crawler reach and index the page? The audit looks at HTTP status, robots and noindex, canonical, and structured data.
Chosen: among the pages an engine can reach, why does it pick a competitor? You hand the tool the URLs an engine cites today, and it classifies the gap as authority, freshness, a competitor's data table, or missing proof.
Extractable: once the page is chosen, can the engine lift a clean answer out? The audit looks for a concise lead answer, headings, schema, and lists and tables the engine can quote in one block.

Point it at a URL or a local file and you get a scored report and a prioritized fix list. It runs offline, needs no keys, and uses only the Python standard library, so you can read every line before you trust it. [3]

How do you keep distributed content on brand?

You keep it on brand by generating every social card from one brand-tokens file, so colors, fonts, and logo stay consistent wherever a post lands. Citable content still has to look like you when it shows up in a feed. The fourth tool, brand-card-lite, turns a tiny brand-tokens file (your colors, fonts, and logo) into a self-contained, on-brand social card. It also lints those tokens for contrast and consistency, so the text meets WCAG contrast requirements and the fonts have real fallbacks. No image model, no cloud service, no keys. Templating and a little math you can audit. [4]
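For readers curious about the math, the WCAG 2.x contrast ratio is defined from the relative luminance of the two colors. The sketch below follows that published formula. The function names are mine, and whether brand-card-lite computes contrast exactly this way is an assumption.

```python
def _linear(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#rrggbb'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colors, from 1.0 (identical) up to 21.0."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """WCAG AA asks for at least 4.5:1, or 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)


print(round(contrast_ratio("#000000", "#ffffff"), 2))  # 21.0
print(meets_aa("#777777", "#ffffff"))                   # False
```

The second call is the useful one: a mid-gray that looks fine on a white card narrowly fails AA for body text, which is the kind of issue a token lint catches before the card ships.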

Where do companies go wrong with AI visibility?

Across the audits we run, the same four mistakes show up again and again, and each tool above targets exactly one of them.

| Loop step | Free tool | The mistake it fixes |
| --- | --- | --- |
| Measure | `ai-visibility-probe-lite` | Stopping at "are we mentioned?" instead of measuring the citations to URLs you own. |
| Produce + Publish | `contentos-agent-lite` | Shipping a one-prompt draft with no sources or structure: the page an engine skips. |
| Optimize | `aeo-site-audit-lite` | Never checking retrieval, so a page stays unreachable, un-chosen, or un-extractable. |
| Design | `brand-card-lite` | Letting the silo win: distributing with no canonical home, so a rented platform takes the authority. |

How do you run the AI-visibility loop this week?

You do not need the hosted product to start. Here is the minimum loop, the one I would run if I had an hour:

Measure. Run ten brand-free prompts through one engine with the probe. Write down mention-share and citation-share.
Pick one loss. Find a prompt where a competitor is cited and you are not.
Write the answer. Take that prompt through the content agent's gates, then publish it on your own domain first.
Audit the page. Run the site auditor and close the top Fetchable, Chosen, or Extractable gap.
Dress it. Generate an on-brand card so the post looks like you.
Re-measure next week. Run the same prompts again and watch the two numbers move.

Why open-source the core for free?

People ask why I open-sourced the part that took us years to figure out. The honest answer is that the method is not the moat. Brand-free measurement, the mention-versus-citation split, the eight content gates, the three retrieval gates, canonical-first distribution: that is the part that changes outcomes, and it is now fully in the open.

The hosted Humanswith.ai Workspace adds the part you cannot do by hand at any real scale: automated multi-engine scans on a weekly cadence, the publishing and re-measurement loop that proves whether a fix actually worked, and the team and hosting around it. The free tools tell you where you stand. The Workspace runs the loop for you. [5]

It is a deliberate bet. The companies that win AI search will treat it as an operating loop, not a one-time audit. So clone a tool, measure one prompt set, and fix one page this week. That is the entire ask.

Frequently asked questions

Are these really free and open-source? Yes. All four are MIT-licensed on GitHub, with zero runtime dependencies and no API keys. You can read every line before you run it.

What is mention-share versus citation-share? Mention-share is how often an answer names your brand. Citation-share is how often it cites a URL on a domain you own. An engine can name you while sourcing only competitors, so the tools report both. [1]

What does canonical-first distribution mean? Publish on your own domain first as the canonical version, then adapt for the platforms, with every copy linking back to that URL. The citation authority stays on the property you control instead of the silo you rent. [2]

Will this guarantee AI cites me? No, and anyone who promises that is selling something. The tools measure the gates that come before a citation and give you a way to close them in priority order. The result is earned.

When should I move to the hosted Workspace? When you want the loop run for you across a real content program, rather than running each tool by hand. [5]

References

[1] [ai-visibility-probe-lite](https://github.com/humanswith-ai/ai-visibility-probe-lite) (MIT)
[2] [contentos-agent-lite](https://github.com/humanswith-ai/contentos-agent-lite) (MIT, incl. gate 08 canonical-first distribution)
[3] [aeo-site-audit-lite](https://github.com/humanswith-ai/aeo-site-audit-lite) (MIT)
[4] [brand-card-lite](https://github.com/humanswith-ai/brand-card-lite) (MIT)
[5] Humanswith.ai Workspace

Originally published on gregshevchenko.com.

The open-source AI Search visibility audit stack I’m building

Gregory Shevchenko — Tue, 02 Jun 2026 19:43:37 +0000

Most AI Search visibility work becomes vague too early.

A team asks, “How do we show up in ChatGPT?”

Then the work jumps straight to prompts, dashboards, content ideas, competitor checks, and brand-mention screenshots.

Those things can matter.

But if the page layer is messy, the rest of the system becomes hard to interpret. A weak citation rate may be a content problem, a crawl problem, a schema problem, an entity problem, or a page that is not represented clearly enough for machines to reuse.

So I am building geo-audit as the open-source, inspectable layer of my AI Search visibility workflow. The first public slice is intentionally boring: crawl the site, inspect the head tags, parse JSON-LD, check canonical URLs, and turn that into a repeatable proof packet before asking an LLM to judge anything. [1]

Why start with deterministic gates?

Because deterministic checks remove preventable noise.

Before a team asks an LLM whether a page is “good for AI Search,” I want to know simpler things:

Can the route be fetched?
Is the final URL stable?
Is there one clear title?
Is there a meta description?
Does the canonical URL match the intended source of record?
Is there a visible H1?
Does JSON-LD exist?
Does JSON-LD parse?
Are obvious noindex or schema gaps present?

None of that requires a model.

That is the point.

AI Search visibility work needs LLM checks later, but the base layer should be code-level, repeatable, and boring enough to rerun after every change.
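Most of the checks in that list need nothing beyond an HTML parser. The following is a rough sketch using only the Python standard library. It is not the geo-audit implementation, and the `HeadAudit` class, the `audit_html` function, and the report keys are names invented for this example. Fetching the route and following redirects are left out so the sketch stays offline.

```python
import json
from html.parser import HTMLParser


class HeadAudit(HTMLParser):
    """Collects the page signals needed for a few deterministic checks."""

    def __init__(self):
        super().__init__()
        self.titles = 0          # note: counts every <title>, including any inside inline SVG
        self.h1s = 0
        self.canonicals = []
        self.has_description = False
        self.noindex = False
        self.jsonld_blocks = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self.titles += 1
        elif tag == "h1":
            self.h1s += 1
        elif tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.canonicals.append(a.get("href"))
        elif tag == "meta":
            name = (a.get("name") or "").lower()
            content = a.get("content") or ""
            if name == "description" and content.strip():
                self.has_description = True
            if name == "robots" and "noindex" in content.lower():
                self.noindex = True
        elif tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buffer))
            self._in_jsonld = False


def audit_html(html: str, expected_canonical: str) -> dict:
    """Run the checks on raw HTML and return a pass/fail report."""
    parser = HeadAudit()
    parser.feed(html)
    parser.close()
    parse_errors = 0
    for block in parser.jsonld_blocks:
        try:
            json.loads(block)
        except json.JSONDecodeError:
            parse_errors += 1
    return {
        "one_title": parser.titles == 1,
        "has_description": parser.has_description,
        "canonical_matches": parser.canonicals == [expected_canonical],
        "one_h1": parser.h1s == 1,
        "jsonld_present": bool(parser.jsonld_blocks),
        "jsonld_parses": bool(parser.jsonld_blocks) and parse_errors == 0,
        "indexable": not parser.noindex,
    }
```

Because the function takes a string and returns plain booleans, it can be rerun on saved HTML after every change and diffed like any other test output.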

What went public first?

The first public release adds two modules to geo-audit: site-crawl-lite and head-schema-gate. [2]

site-crawl-lite gives a small-site route inventory. It checks status, final URL, title, meta description, canonical, H1, word count, JSON-LD count and types, link counts, image alt counts, and noindex state.

head-schema-gate checks the homepage or target route more directly: title, description, canonical, H1, Open Graph, JSON-LD parse errors, Article author sameAs, BreadcrumbList, and FAQPage signals.

These modules are informational gates for now. They produce scores and action items, but they do not silently change the existing composite methodology.

That was deliberate.

A public audit tool should not move scoring goalposts in the same change that adds new checks.

How does the secrets boundary work?

The public repository must never contain real API keys, private credentials, internal hostnames, or personal secrets.

Users who clone the repo bring their own keys.

The deterministic modules run without paid APIs. If someone wants richer provider checks, they can copy .env.example into a local, gitignored .env file and configure their own credentials. [3] [4]

The boundary is simple:

Public repo = safe code, docs, placeholders, tests, and trust checks.

Private workspace = local keys, team credentials, configured providers, and deployment-specific proof.

That boundary matters more when agents are involved.

If an agent can safely improve public code without touching secrets, the tool can evolve in public. If the same stack runs in a private environment with configured keys, it can do richer audits without leaking the private layer.
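A bring-your-own-key boundary can be as simple as deciding which checks to run from the local environment. This is a sketch under stated assumptions: the environment variable names, the placeholder values, and the `enabled_checks` function are hypothetical and are not taken from the geo-audit repository.

```python
import os

# Hypothetical variable names for optional provider checks.
OPTIONAL_KEYS = {
    "pagespeed": "PAGESPEED_API_KEY",
    "brand_mentions": "BRAND_MENTIONS_API_KEY",
}

# Values that should be treated as "not configured", such as template placeholders.
PLACEHOLDERS = {"", "changeme", "your-key-here"}


def enabled_checks(environ=None):
    """Return the checks that can run with the keys present locally.

    Deterministic checks are always on; provider checks are added only
    when a real-looking key is configured in the environment.
    """
    env = os.environ if environ is None else environ
    checks = ["crawl", "head_schema"]
    for check, variable in OPTIONAL_KEYS.items():
        value = env.get(variable, "").strip()
        if value.lower() not in PLACEHOLDERS:
            checks.append(check)
    return checks


print(enabled_checks({}))  # ['crawl', 'head_schema']
```

Passing the environment in as a parameter keeps the function testable without touching real credentials, which matters when an agent is the one running the tests.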

Where does this fit in the broader stack?

I do not think one tool replaces everything.

The operating stack has layers.

First, crawl and head/schema gates answer the technical baseline question: can the site be fetched and represented cleanly enough for search engines, answer engines, and social surfaces?

Second, ContentOS readiness answers the publishing question: does the page have a source pack, claims, evidence, answer units, FAQs, and human review before publication? [6]

Third, distribution checks answer the authority question: do Medium, LinkedIn, Habr, VC.ru, X, Substack, GitHub, and profile pages route authority back to the canonical URL?

Fourth, measurement answers the business question: do prompt sets, citations, source context, competitors, and downstream traffic change after the work ships? [5]

geo-audit is strongest at the first layer today.

That is okay. A good open-source base should be small enough to inspect and useful enough to run.

What did the first proof find?

I ran the two new gates against my own site before publishing the canonical note.

The result was not a dramatic failure, which is exactly what a baseline gate should show after recent technical cleanup: site-crawl-lite returned 99/100 across 19 checked routes, and head-schema-gate returned 94/100 on the homepage. [2]

The remaining notes were small follow-ups: one route without JSON-LD and a BreadcrumbList recommendation where breadcrumb-like markup already exists.

That is useful signal.

It says the next improvement is a schema consistency pass, not a panic rewrite.

What does this not replace?

This does not replace Screaming Frog, Sitebulb, Oncrawl, log-file analysis, enterprise crawls, full keyword suites, or paid brand-monitoring products.

Those tools are still useful.

The point is different.

I want an install-first, agent-friendly, testable stack that can run inside a repo workflow, produce proof artifacts, keep secrets local, and explain exactly what it checked.

That makes it easier to improve the process in public and then write about the improvement with the source code attached.

What I want to build next

The next modules are not glamorous.

That is a feature.

I want an internal-link graph, route-readiness runner, image-alt gate, sitemap/feed/llms consistency checker, and a stronger bridge from ContentOS source packs into publish-readiness scoring.

The pattern I want to keep is simple:

Build a deterministic layer.

Prove it on my own site.

Publish the code.

Write the canonical note.

Then distribute the idea only after the first-party page is the source of record.

The canonical version of this article lives on my site, where I keep the related pages and distribution links updated. [7]

FAQ

Is geo-audit a replacement for Screaming Frog or Sitebulb?

No. It is an inspectable AI Search visibility audit layer. Enterprise crawlers still matter for large-scale crawling, log-file analysis, and advanced technical SEO workflows.

Does the public repo contain API keys?

No. Public users bring their own keys through local environment variables or a gitignored .env file. The public repository should contain placeholders and documentation, not real credentials. [3]

Can the tool run without paid APIs?

Yes. The deterministic modules run without paid API keys. Optional keys unlock richer brand-mention, PageSpeed, and provider-specific checks. [4]

Why start with crawl and head/schema gates?

Because LLM scoring is less useful when a page is missing canonical tags, titles, descriptions, JSON-LD, or crawlable routes. Deterministic checks remove preventable noise first.

Sources

[1] GitHub — geo-audit public repository

[2] GitHub PR — geo-audit pull request #10

[3] GitHub — geo-audit TRUST manifest

[4] GitHub — External services and BYOK configuration

[5] Gregory Shevchenko — How to measure AI Search visibility

[6] Gregory Shevchenko — What ContentOS is

[7] Canonical version on my site — gregshevchenko.com

How to roll out an Agentic Workspace inside a marketing team

Gregory Shevchenko — Tue, 02 Jun 2026 19:22:17 +0000

Most AI adoption plans start with the wrong unit.

They ask which role can be replaced.

A safer engineering question is narrower:

Which repeatable workflow can be governed?

That distinction matters because the strongest evidence around AI and work is task-shaped, not whole-role-shaped. The OpenAI/OpenResearch/UPenn paper on GPT exposure is often cited because it shows broad exposure across the labor market, but it does not say that entire jobs are already automated. [1]

Anthropic’s Economic Index points in the same direction: AI use is uneven, task-level, and split between augmentation and automation patterns. [2]

So the practical rollout unit is not a job title.

It is a workflow.

What is the right first unit for agent rollout?

A workflow has a clear start and a clear end.

It has a trigger, approved inputs, a transformation step, a quality gate, a human approval point, and a measurement loop.

A role is too broad. “Marketing manager” includes strategy, research, writing, review, publishing, reporting, coordination, taste, and accountability. If you try to automate the whole role, the system becomes vague before the first run.

A workflow is observable.

Good first workflows for a marketing team are smaller:

weekly AI Search visibility measurement
source-backed canonical page updates
content brief generation
internal link QA
distribution rewrites for Medium, LinkedIn, X, or DEV.to
schema, head tag, sitemap, and visible-link checks

Small enough to control.

Important enough to matter.

Why are raw AI tools not enough?

Developer tools already show the pattern.

GitHub describes Copilot coding agent as working in its own environment, running checks, and preparing pull requests for human review. [4]

Claude Code and Codex point in the same direction: agents that can read context, use tools, prepare changes, and return reviewable work instead of only answering in chat. [5] [6]

But this does not mean every office worker should be dropped into a blank agent terminal.

That is the adoption trap.

Raw agent tools assume a strong operator. Most marketers do not want to manage repository context, shell commands, permission boundaries, tool routing, proof loops, and rollback logic.

They need a prepared surface.

That is what I mean by an Agentic Workspace: a governed layer where prepared agents work with approved sources, narrow permissions, quality gates, review packets, and human approval.

What does the 30-day rollout look like?

The rollout has four phases.

1. Scope one workflow

Pick one recurring workflow.

Write down:

what starts it
which inputs are allowed
what output is expected
what the agent may touch
what the agent must never do
what “done” means

Example acceptance criteria:

“The packet is done when it includes one canonical URL, one source list, one changed page or draft, no orphan footnotes, passing visible-link checks, passing layout/style gates, and one next action.”
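Acceptance criteria like these can be turned into a checklist function so that "done" is computed rather than argued. The sketch below is one possible shape. The packet keys (`canonical_url`, `sources`, `draft`, `checks`, `next_action`) and both function names are assumptions made for the example.

```python
import re


def orphan_footnotes(draft: str, sources: dict) -> list:
    """Footnote numbers cited in the draft but not listed, or listed but never cited."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", draft)}
    return sorted(cited ^ set(sources))


def acceptance_failures(packet: dict) -> list:
    """Return the reasons a packet is not done; an empty list means done."""
    failures = []
    if not packet.get("canonical_url"):
        failures.append("missing canonical URL")
    if not packet.get("sources"):
        failures.append("missing source list")
    if not packet.get("draft"):
        failures.append("missing changed page or draft")
    orphans = orphan_footnotes(packet.get("draft") or "", packet.get("sources") or {})
    if orphans:
        failures.append(f"orphan footnotes: {orphans}")
    for gate in ("visible_links", "layout_style"):
        if not (packet.get("checks") or {}).get(gate, False):
            failures.append(f"gate not passing: {gate}")
    if not packet.get("next_action"):
        failures.append("missing next action")
    return failures


packet = {
    "canonical_url": "https://example.com/notes/rollout",
    "sources": {1: "Vendor documentation"},
    "draft": "Agents prepare pull requests [1]. Adoption is uneven [2].",
    "checks": {"visible_links": True, "layout_style": True},
    "next_action": "Add the missing source for footnote 2",
}
print(acceptance_failures(packet))  # ['orphan footnotes: [2]']
```

Returning reasons instead of a bare boolean keeps review cheap: the reviewer sees exactly which criterion failed and can send the packet back with that line.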

Boring acceptance criteria are useful.

They make review cheaper.

2. Build the source pack

Do this before adding more agents.

A marketing agent is only as good as the material it is allowed to use.

A source pack should include:

company facts
product pages
approved positioning
URLs
allowed claims
banned claims
examples of strong output
examples of weak output
language and style rules

This matters even more for AI Search, AEO, and GEO work.

If you want AI systems to cite your brand, the workflow needs source clarity, entity consistency, visible links, answer-ready blocks, and structured data discipline.

Those requirements should be inside the packet, not remembered manually after the draft is finished.

3. Use prepared agents with narrow permissions

Do not start with dozens of agents.

Start with a few bounded ones:

research agent: extracts facts, caveats, and unanswered questions
brief agent: turns the source pack into a specific task
canonical-page agent: proposes structure, FAQ, sources, and schema
QA agent: checks footnotes, links, head tags, JSON-LD, sitemap/feed/llms coverage, and layout
distribution agent: adapts canonical content for other platforms without breaking canonical-first logic
measurement agent: updates prompt coverage, citation status, crawl status, and next actions

Most early agents should be draft-only.

Let them prepare work.

Do not let them publish without a human gate.

4. Review packets and rejected examples

The review packet is the operating object.

It should show:

what changed
which sources were used
which checks passed
what failed
what was rejected
what the smallest next action is

Rejected examples are not waste.

They are memory.

If a draft was generic, unsupported, too promotional, visually broken, or wrong about canonical logic, save that rejection and use it to improve the next run.

That is how agentic work compounds: not from one magical prompt, but from a system that remembers what “not good enough” looks like.

What is the best first workflow for marketing teams?

I would start with weekly AI Search visibility.

The loop is concrete:

Capture current entity facts and canonical URLs.
Run a fixed prompt set across target answer engines or manual checks.
Record mentions, citations, missing sources, and wrong recommendations.
Choose one canonical page or source-surface improvement.
Prepare the page or distribution update.
Run footnote, visible-link, schema, sitemap, and layout gates.
Publish after human approval.
Repeat next week and compare the same prompts.
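The last step, comparing the same prompts week over week, is the part that is easiest to make deterministic. A minimal sketch, assuming each run is recorded as a hypothetical mapping from prompt text to whether the brand was cited:

```python
def compare_weeks(previous: dict, current: dict) -> dict:
    """Compare citation status for the same fixed prompt set across two runs."""
    if previous.keys() != current.keys():
        # Different prompts would make the comparison meaningless.
        raise ValueError("prompt set changed; weeks are not comparable")
    return {
        "gained": sorted(p for p in current if current[p] and not previous[p]),
        "lost": sorted(p for p in current if previous[p] and not current[p]),
        "still_missing": sorted(
            p for p in current if not current[p] and not previous[p]
        ),
    }


last_week = {"best crm for agencies": False, "what is aeo": True, "geo tools": False}
this_week = {"best crm for agencies": True, "what is aeo": False, "geo tools": False}
diff = compare_weeks(last_week, this_week)
print(diff)
```

Refusing to compare when the prompt set changes enforces the "fixed prompt set" step in code rather than by convention.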

This is narrow enough for adoption.

It is also strategically useful because AI Search visibility connects content, technical SEO, brand facts, external sources, and measurement.

Microsoft’s Frontier Firm framing is useful here because it describes humans and agents as part of a new operating model, not a simple one-step replacement story. [3]

What should stop the rollout?

Stop or slow down if the team has no source pack.

Stop if nobody can define “done” before reviewing the output.

Stop if the same weak draft or layout bug returns every week.

Stop if an agent can publish, delete, overwrite, or send externally without a human gate.

Stop if the team celebrates output volume without measuring citation, crawl, review quality, or business movement.

In most cases, the fix is not a better prompt.

The fix is a better workflow boundary.

What is the practical takeaway?

Do not roll out AI by trying to replace a marketing role.

Roll it out by governing one repeatable workflow.

Start with a source pack.

Use prepared agents.

Require review packets.

Capture rejected examples.

Measure the same loop next week.

That is how an Agentic Workspace becomes useful: not as a pile of prompts, but as a controlled operating layer for work that humans still own.

The canonical version of this article lives on my site, where I keep the distribution links and related research pages updated. [7]

Sources

[1] OpenAI / OpenResearch / University of Pennsylvania — GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models

[2] Anthropic — Economic Index: New building blocks for understanding AI use

[3] Microsoft WorkLab — 2025 Work Trend Index: The Year the Frontier Firm is Born

[4] GitHub Docs — About GitHub Copilot coding agent

[5] Anthropic Docs — Claude Code overview

[6] OpenAI — Introducing Codex

[7] Canonical version on my site — gregshevchenko.com

Token economics for AI agents: why workflow ownership matters more than task automation

Gregory Shevchenko — Tue, 26 May 2026 15:40:44 +0000

A weak version of AI labor economics sounds like this:

A $100,000 knowledge worker can be replaced by a $2,730 token bill.

That framing works as a shock, but it breaks as an operating model. Production work does not become cheap just because inference is cheap. You still need context, tools, prompts, permissions, retries, evaluation, approvals, security, and someone accountable for the result.

A better framing is this:

Your time now competes with token economics, but the real unit of competition is not the person. It is the repeatable workflow.

Why is a token bill not a workflow cost?

A recent token-economy model estimated that a comparable AI workflow could have a raw token bill around $2,730 per year. In the same analysis, a fully loaded AI-agent workflow is closer to $82,000 per year once the orchestration layer is included. That remains meaningfully below a fully loaded human benchmark around $135,000 per year for a $100,000 salary, but it is not “AI labor is basically free.” [1]

This distinction matters.

If you compare a human role to only API spend, you will make bad decisions. If you compare a human workflow to a complete AI workflow, you can start making practical ones.

Real cost includes model calls, retrieval, tool access, prompt and workflow design, deterministic checks, human review, failed attempts, monitoring, maintenance, and governance.
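The gap is easier to see as arithmetic. The three dollar figures below are the ones quoted above from [1]; the variable names and the split into "token" versus "everything else" are my own simplification.

```python
# Annual figures quoted in the text from source [1], in US dollars.
RAW_TOKEN_BILL = 2_730
LOADED_AGENT_WORKFLOW = 82_000
LOADED_HUMAN = 135_000

# Everything in the loaded agent workflow that is not raw inference spend.
non_token_cost = LOADED_AGENT_WORKFLOW - RAW_TOKEN_BILL
token_share = RAW_TOKEN_BILL / LOADED_AGENT_WORKFLOW

# Difference between the two fully loaded numbers.
savings = LOADED_HUMAN - LOADED_AGENT_WORKFLOW
savings_share = savings / LOADED_HUMAN

print(f"non-token cost: ${non_token_cost:,}")
print(f"token share of loaded workflow: {token_share:.1%}")
print(f"savings vs loaded human: ${savings:,} ({savings_share:.1%})")
```

On those inputs, tokens are about 3.3% of the loaded workflow cost, roughly $79,270 sits outside the token bill, and the loaded comparison saves about $53,000, or 39.3%.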

That missing middle is where many AI replacement narratives become sloppy.

Why does AI attack workflows before roles?

A more useful question is not “which job disappears?”

Ask this instead:

Which repeatable processes inside this job can now be done by a machine-assisted workflow?

A 2023 OpenAI, OpenResearch, and University of Pennsylvania paper on GPT exposure is often misquoted. A safer reading is that around 80% of workers could have at least 10% of their work tasks affected by GPTs, and around 19% could have at least 50% of tasks affected. [2]

That is not the same as saying 80% of work is already automated.

It means task exposure is broad, uneven, and workflow-specific.

Most roles combine judgment, communication, source gathering, drafting, checking, routing, publishing, reporting, and client/team coordination. AI is much better at absorbing some of those layers than others.

What is the practical unit for AI automation?

For operators, developers, marketers, consultants, and agency teams, a workflow with clear boundaries is the right unit.

A useful workflow has seven parts: input, approved source, transformation step, quality gate, approval point, output, and measurement loop.

Once work is described that way, you can decide which parts belong to an agent, which parts need deterministic code, and which parts must stay with a human.
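A sketch of that decision as data: each of the seven parts gets an explicit owner, and a validator rejects incomplete maps. The `Owner` enum, the `validate_workflow` helper, and the rule that the approval point must stay human are assumptions that follow the human-gate argument in this article, not a standard.

```python
from enum import Enum


class Owner(Enum):
    AGENT = "agent"
    CODE = "deterministic code"
    HUMAN = "human"


# The seven parts named in the text.
PARTS = (
    "input",
    "approved source",
    "transformation step",
    "quality gate",
    "approval point",
    "output",
    "measurement loop",
)


def validate_workflow(assignment: dict) -> list:
    """Return problems with an owner assignment; an empty list means it is usable."""
    problems = [f"missing part: {part}" for part in PARTS if part not in assignment]
    # Assumption: accountability is never delegated to an agent or to code.
    if assignment.get("approval point") not in (None, Owner.HUMAN):
        problems.append("approval point must stay with a human")
    return problems


assignment = {part: Owner.AGENT for part in PARTS}
assignment["quality gate"] = Owner.CODE
print(validate_workflow(assignment))
assignment["approval point"] = Owner.HUMAN
print(validate_workflow(assignment))
```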

This is where most “AI agent” projects either become useful or become theater.

If an agent only produces text, it is a drafting assistant. If it can read the right sources, take the right action, run checks, preserve evidence, and stop when quality is not good enough, it starts to become workflow infrastructure.

What changes for developers and operators?

The new literacy is not just prompting.

Prompting is the visible layer. Deeper leverage comes from workflow ownership.

A workflow owner decides what evidence is allowed, what “done” means, which failures are unacceptable, which checks are deterministic, when approval is required, how the system recovers after a bad output, and how improvement is measured.

That is why tools like Claude Code, Codex, Cursor, Windsurf, n8n, MCP servers, and repo-level proof loops matter. They are not just “AI chat with files.” They are early versions of a new operating layer for knowledge work.

People who can turn messy work into measured workflows become harder to replace.

People who only sell hours for repeatable cognitive tasks become easier to compare against token economics.

How should an agent workflow be mapped?

Here is a practical way to think about the system:

Not every layer becomes automated.

Every layer becomes explicit.

That is where cost, quality, and speed can improve together.

How does this apply to content and marketing workflows?

Take content production.

A weak AI version is: “write me an article about AI Search.”

A stronger workflow version starts with a canonical page and search intent, gathers internal evidence and external sources, scores pre-writing readiness, drafts with a clear answer structure, checks claims and footnotes, adapts the canonical article for Medium, LinkedIn, DEV.to, Habr, or X, verifies visible links and canonical references, and measures whether the page gets crawled, cited, shared, or reused.

This is not “AI writes content.”

It is a controlled content-production corridor where a human operator uses AI to increase throughput without giving up editorial control.

That difference matters for SEO, AEO, GEO, and AI Search visibility. A language model can produce words quickly. A workflow can produce reliable assets repeatedly.

How does this apply to coding-agent workflows?

A weak coding-agent loop is simple: ask for a fix, accept a patch, and hope the issue is gone.

A stronger loop defines the failing behavior, reproduces it, writes or runs a red check, makes the smallest safe change, runs the proof loop, documents the failure mode, and adds that bug class to the next gate.

This is why “agent persistence” can become a quality bug. If an agent keeps patching without a better gate, it may return the same class of defect again and again. More persistence is not the fix. A clearer workflow boundary and a stronger stop condition are the fix.

What should you build first?

If you are trying to apply AI agents inside a real business, start smaller than “replace a role.”

Start with one repeatable workflow.

Write down what starts it, which sources are allowed, what output is expected, what the agent may do, what the agent must not do, which checks must pass, where a human approves or rejects, and how success is measured after delivery.

Then automate the boring middle.

Do not automate the accountability.

What is the career implication?

If you are a developer, marketer, consultant, analyst, editor, or agency operator, the question is not whether AI replaces you tomorrow.

Ask whether your work is packaged as isolated tasks or owned as a workflow.

Task executors are compared against cheaper task execution.

Workflow owners are compared against the value of the system they can run.

That is the shift.

Your time now competes with tokens. But judgment, systems thinking, taste, source discipline, and workflow ownership still compound.

Move from task execution to workflow ownership before the market forces the transition.

FAQ

Does this mean AI replaces a $100,000 employee for $2,730?

No. The $2,730 figure is a raw token-bill comparison, not a fully loaded workflow cost. A useful comparison includes orchestration, QA, retries, tools, monitoring, and human approval. [1]

Does GPT exposure mean whole jobs are already automated?

No. The safer 2023 reading is task exposure: many workers may have some tasks affected by GPTs, but that is not the same as full role automation. [2]

What is the practical first step for a team?

Pick one repeatable workflow and define its input, allowed sources, output, quality gate, approval point, and measurement loop before adding an agent.

Sources

[1] MeaningfulTech — The Token Economy: What a $100,000 Employee Really Costs in the Age of AI

[2] OpenAI / OpenResearch / University of Pennsylvania — GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models

[3] Canonical version on my site gregshevchenko.com

Autocompaction Is Not Memory

Gregory Shevchenko — Tue, 26 May 2026 15:10:30 +0000

Long-context agents already summarize.

That is useful.

It is not memory.

Built-in autocompaction helps Claude Code, Codex, Cursor, Windsurf, or another coding-agent surface survive a long session. But a team workflow needs something stricter than "the chat got summarized."

It needs portable operational state.

That is the difference I keep coming back to while working on a local MCP token-economy stack: compression keeps one conversation alive; handoff lets a workspace continue.

What built-in autocompaction does well

Autocompaction is good at reducing the raw context of a product session when the window gets too full.

For a single agent in a single chat, that can be enough:

preserve the broad goal;
compress prior discussion;
keep the model moving;
avoid forcing the user to start from zero.

That is real value.

The problem appears when the work is no longer just one chat.

In a real repo, the same task may move between Claude Code, Codex, Cursor, Windsurf, a remote Mac Mini, MCP tools, CI gates, and human review. At that point, a narrative summary is not the same thing as an operational contract.

What autocompaction usually loses

The details that matter most are often the least summary-shaped:

which approvals were actually granted;
which files or services are off-limits;
which exact values must not drift;
which sources are trusted, semi-trusted, or untrusted;
which errors were already encountered and fixed;
which commands passed;
which checks are still pending;
what the next agent must not redo.

Those are not just "context."

They are control-plane state.

If that state disappears during compaction, the next agent can sound confident while silently re-opening risks the previous agent had already closed.

The missing layer: local handoff MCP

The mechanism I want is a local handoff MCP that writes a structured handoff before the window is full.

The point is not to make a prettier summary.

The point is to make a resume contract that another agent can use safely.

A minimal handoff should preserve:

objective and done condition;
loaded instructions and constraints;
approval state;
exact values that must not change;
risks and red flags;
actions already taken;
errors and fixes;
pending verification;
next recommended step;
what not to redo.

That contract should live in the workspace, not only inside the product's private chat memory.

Autocompaction vs local handoff

The timing difference matters.

Autocompaction often happens after context pressure is already high. A handoff protocol can pre-score the session earlier and decide whether the next transition needs a normal summary, a red-flag handoff, or a hard stop for human review.
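A pre-scoring step like that could look roughly like this. Everything here is hypothetical: the 0.6 threshold, the function names, and the rule ordering are placeholders chosen to show the three outcomes, not values taken from any product.

```python
from enum import Enum


class Transition(Enum):
    NORMAL_SUMMARY = "normal summary"
    RED_FLAG_HANDOFF = "red-flag handoff"
    HARD_STOP = "hard stop for human review"


def should_prescore(tokens_used: int, window_size: int, threshold: float = 0.6) -> bool:
    """Score the session early, before context pressure is already high."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return tokens_used / window_size >= threshold


def choose_transition(red_flags: list, needs_human_decision: bool) -> Transition:
    """Pick the strictest transition that the session state calls for."""
    if needs_human_decision:
        return Transition.HARD_STOP
    if red_flags:
        return Transition.RED_FLAG_HANDOFF
    return Transition.NORMAL_SUMMARY


if should_prescore(tokens_used=130_000, window_size=200_000):
    print(choose_transition(["touched live deploy"], needs_human_decision=False))
```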

Why a 1M context window does not remove the need

A larger context window is valuable.

I want it. I will use it.

It lets the agent keep more code, logs, source material, and prior reasoning available before compression becomes necessary.

But a larger window mostly delays the failure mode. It does not automatically make state portable, trusted, auditable, or shared across products.

One million tokens can still contain:

stale approvals;
buried secrets;
contradictory instructions;
obsolete diagnostics;
repeated failed attempts;
unlabeled source trust;
no clear next step.

More room is not the same thing as better state management.

A simple handoff contract

Here is the shape I want agents to produce at real transition points:

## Objective
What we are trying to finish.

## Done Condition
The exact observable state that means this task is complete.

## Constraints
Loaded repo rules, user constraints, risk boundaries, and trust labels.

## Approval State
What the user approved, what remains unapproved, and what requires a checkpoint.

## Actions Taken
Commands, edits, deploys, external publications, or tool calls already completed.

## Verification
Checks that passed, checks that failed, and checks still pending.

## Red Flags
Secrets, live ops, destructive commands, ambiguous ownership, or same-defect loops.

## Next Step
The recommended next action for a fresh agent.

## Do Not Redo
Work already completed or paths already ruled out.

This is intentionally boring.

Good handoff is not supposed to be clever. It is supposed to be hard to misunderstand.
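To keep the contract hard to misunderstand, the handoff can be rendered by code that refuses to write an incomplete one. A minimal sketch, assuming a hypothetical `render_handoff` helper; the section names are copied from the template above.

```python
SECTIONS = (
    "Objective",
    "Done Condition",
    "Constraints",
    "Approval State",
    "Actions Taken",
    "Verification",
    "Red Flags",
    "Next Step",
    "Do Not Redo",
)


def render_handoff(fields: dict) -> str:
    """Render the handoff as markdown, failing loudly if any section is empty."""
    missing = [name for name in SECTIONS if not fields.get(name, "").strip()]
    if missing:
        raise ValueError(f"handoff is missing sections: {', '.join(missing)}")
    return "\n\n".join(f"## {name}\n{fields[name].strip()}" for name in SECTIONS)


fields = {name: f"(fill in {name.lower()})" for name in SECTIONS}
fields["Red Flags"] = "None observed."
handoff = render_handoff(fields)
print(handoff)
```

Requiring an explicit "None observed." instead of an empty section means a missing red flag is a statement someone made, not something compaction dropped.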

What we can measure

The useful question is not "did the summary look nice?"

The useful question is whether the next agent can continue with less waste and fewer mistakes.

I would measure:

resume success: can a fresh agent take the next step from the handoff alone?
re-read rate: how often does it need to reopen old files or old chat context?
token estimate: how much context was avoided during resume?
leak rate: did secrets, private implementation details, or off-limits facts enter the handoff?
approval preservation: did the resumed agent retain the correct permission boundary?
redo rate: did the agent repeat completed work?

This is where the MCP token-economy angle becomes practical. The point is not just fewer tokens. The point is fewer unsafe or wasteful recovery loops.
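Most of those measures reduce to simple rates over a log of resume attempts. A sketch, assuming a hypothetical `ResumeRecord` with one boolean per measure; the token estimate is left out because it depends on how a given tool counts context.

```python
from dataclasses import dataclass


@dataclass
class ResumeRecord:
    resumed_from_handoff_alone: bool
    reopened_old_context: bool
    leaked_sensitive_data: bool
    kept_permission_boundary: bool
    repeated_completed_work: bool


def handoff_rates(records: list) -> dict:
    """Share of resume attempts for which each flag was true."""
    if not records:
        raise ValueError("no resume records to measure")
    total = len(records)

    def share(attr: str) -> float:
        return sum(getattr(r, attr) for r in records) / total

    return {
        "resume_success": share("resumed_from_handoff_alone"),
        "re_read_rate": share("reopened_old_context"),
        "leak_rate": share("leaked_sensitive_data"),
        "approval_preservation": share("kept_permission_boundary"),
        "redo_rate": share("repeated_completed_work"),
    }


records = [
    ResumeRecord(True, False, False, True, False),
    ResumeRecord(True, True, False, True, False),
    ResumeRecord(False, True, False, True, True),
    ResumeRecord(True, False, False, True, False),
]
rates = handoff_rates(records)
print(rates)
```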

Where this helps

This pattern is useful when:

a coding session is approaching context pressure;
a task is moving from Claude Code to Codex or Cursor;
a local agent hands work to a remote machine;
a background MCP workflow needs to resume later;
the work touched live deploys, credentials, publications, or approvals;
the same defect class has already appeared twice.

In those cases, "the chat will summarize itself" is not enough process.

Practical takeaway

Use autocompaction.

Use bigger context windows when they are available.

But do not confuse either one with memory.

Memory for agentic engineering is not just remembering what was said. It is preserving the operational state that lets the next actor continue safely.

Autocompaction helps a chat survive.

Handoff helps a workspace continue.

Sources

MCP stack token economy — the local measurement frame behind byte saving, cache-friendliness, and prompt-context economics.
Agentic engineering for marketing teams — the shared operator vocabulary for Claude Code, Codex, Cursor, Windsurf, n8n, MCP, proof loops, and quality gates.
AI agent failure loops — the QA and stop-rule note behind red-first gates, blind validation, rejected examples, and failure-loop control.

Full canonical note:

https://gregshevchenko.com/notes/autocompaction-is-not-memory/

AI Agent Failure Loops: When Persistence Becomes a Quality Bug

Gregory Shevchenko — Sun, 24 May 2026 16:16:51 +0000

In 2026, I want my AI coding agents to have one more rule: know when to stop.

AI agents do not always fail by stopping.

Sometimes they fail by continuing.

I ran into this while building a custom Cyrillic font extension for a real brand system. The task looked concrete: make Cyrillic letters, Latin letters, numerals, and special symbols feel like one editorial type family.

Claude Code and Codex kept working. They generated files, exported proofs, reported progress, and fixed the last visible complaint.

But the same defect class kept returning.

That is the dangerous version of an AI-agent failure loop: the workflow looks productive while the real quality problem survives.

What is a failure loop?

A failure loop is a repeated pattern where an agent keeps producing new candidate fixes while the same underlying defect remains unresolved.

It usually has five steps:

The user rejects the same kind of defect again.
The agent patches the latest symptom.
The proof gate is too weak to catch the issue.
The agent asks for another manual review.
Everyone spends another cycle on the same problem.

One mistake is normal.

The real process bug appears when the agent continues after its validation system has already failed.

Why normal proof loops can fail

Proof loops are useful. Tests, screenshots, build checks, linting, diffs, and generated reports all matter.

But proof loops can also become theater if they measure the wrong thing.

In my font project, the agent could prove that the font compiled, the PDF rendered, the screenshot existed, bounding boxes changed, and a numeric score improved.

That did not prove the letters looked right.

Users were rejecting the output for a different reason: visual inconsistency.

Some Cyrillic glyphs felt too short, too thick, too loosely spaced, or structurally wrong next to Latin letters.

If the gate cannot see the defect the human keeps seeing, the gate is not allowed to declare the task done.

The rule I now use

After the same visible defect class appears twice, stop normal implementation.

Do not make one more speculative patch.

Do not relax the threshold.

Do not ask the user to inspect another candidate artifact.

Switch into failure-loop breaker mode.
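The trigger itself is small enough to sketch. This is an illustration of the rule, not the code from my repo; `FailureLoopTracker` and the defect-class labels are hypothetical, and deciding which class a rejection belongs to is still a human or reviewer call.

```python
from collections import Counter


class FailureLoopTracker:
    """Counts rejections per defect class and says when normal patching must stop."""

    def __init__(self, limit: int = 2):
        self.limit = limit
        self.rejections = Counter()

    def record_rejection(self, defect_class: str) -> str:
        """Return 'breaker' once the same defect class has been rejected `limit` times."""
        self.rejections[defect_class] += 1
        if self.rejections[defect_class] >= self.limit:
            return "breaker"
        return "normal"


tracker = FailureLoopTracker()
modes = [
    tracker.record_rejection("cyrillic glyphs too short"),
    tracker.record_rejection("spacing too loose"),
    tracker.record_rejection("cyrillic glyphs too short"),
]
print(modes)
```

Counting per defect class rather than per task is the point: two unrelated rejections are normal iteration, while two rejections of the same class mean the gate is blind.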

What a failure-loop breaker does

A failure-loop breaker is a hard mode switch for AI-agent work.

A better next output is a diagnostic package, not another candidate fix.

It should include:

the repeated failure class;
a rejected corpus of known-bad examples;
a red-first gate that fails on those examples;
a fix that turns the gate green;
blind or independent validation when the author has seen the answer;
a clear continue, stop, or human-decision recommendation.

This is not only a retry limit.

A retry limit stops cost growth. A failure-loop breaker changes the work itself.

The red-first gate matters

A useful gate must fail before the fix, because otherwise it has not proven that it can see the old failure.

If the agent cannot make the new checker fail on previous bad artifacts, it has not built a checker for the real problem.

Many agent workflows skip this part.

They add a new metric, see the new candidate score higher, and call it progress. The metric was never forced to reject the old failure.

For subjective or visual tasks, this matters even more because the rejected corpus becomes the bridge between human taste and deterministic validation.
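The red-first requirement can be expressed as a check on the checker. In this sketch, `gate_is_red_first` and `accept` are hypothetical helpers, and the height-ratio metric is only a stand-in: it does not claim to capture visual consistency, it just shows a gate that can and a gate that cannot see the rejected examples.

```python
def gate_is_red_first(check, rejected_corpus) -> bool:
    """True only if the check fails on every known-bad example."""
    examples = list(rejected_corpus)
    if not examples:
        return False  # with no rejected examples the gate has proven nothing
    return all(not check(example) for example in examples)


def accept(check, rejected_corpus, candidate) -> bool:
    """Accept a candidate only through a gate that rejects the old failures."""
    return gate_is_red_first(check, rejected_corpus) and check(candidate)


# Hypothetical artifacts: build status plus a made-up Cyrillic/Latin height ratio.
rejected = [
    {"compiled": True, "height_ratio": 0.88},
    {"compiled": True, "height_ratio": 0.91},
]
candidate = {"compiled": True, "height_ratio": 0.99}


def weak_check(artifact) -> bool:
    return artifact["compiled"]


def strong_check(artifact) -> bool:
    return artifact["compiled"] and abs(artifact["height_ratio"] - 1.0) <= 0.05


print(accept(weak_check, rejected, candidate))
print(accept(strong_check, rejected, candidate))
```

The weak check passes the candidate, but it also passes everything in the rejected corpus, so `accept` refuses to trust it.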

When the agent is contaminated

Another trap is contaminated validation: the same agent writes the fix, knows the target, and grades the result.

That can be useful during iteration, but it is not independent validation.

If the agent has already seen the expected answer, the final check needs a deterministic gate with withheld examples, a blind reviewer, a separate model that does not receive the author reasoning, or a human decision when the requirement is taste rather than computation.

Same-author validation is often self-consistency, not proof.

I packaged this as a small public skill

I turned the rule into a small public repo:

https://github.com/g-shevchenko/agent-failure-loop-breaker

It installs a compact skill and repo-local rules for Claude Code, Codex, Cursor, and Windsurf.

Its installed rule is deliberately simple:

If the same defect class appears twice, the agent must stop normal patching and build a rejected corpus plus a red-first gate before continuing.

This package is not meant to make the model smarter.

It makes the workflow less willing to confuse motion with progress.

Where companies go wrong

Teams often treat agent persistence as an asset by default.

That is reasonable for well-scoped implementation tasks with strong tests. It is risky for work where the acceptance criterion is visual, editorial, architectural, or operational.

If Claude Code, Codex, Cursor, or Windsurf keeps failing the same class of review, the next investment should go into the validation contract.

The best prompt in the world will still loop when the gate rewards the wrong artifact.

Where this helps

This pattern is useful for UI polish loops, visual regression work, PDF and presentation generation, typography systems, content QA, and agentic coding tasks where the same bug returns.

Here is the signal:

If the user says “this is still the same problem” twice, the process should change.

Practical takeaway

Do not ask an AI agent to “keep trying” forever.

Ask it to prove that its checker can catch the last failed attempt.

If it cannot, the next task is not implementation.

At that point, the next task is building a better gate.

Full write-up:

https://gregshevchenko.com/notes/ai-agent-failure-loop-breakers/