<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Code Pocket</title>
    <description>The latest articles on DEV Community by Code Pocket (@code_pocket_99fdbc771).</description>
    <link>https://dev.to/code_pocket_99fdbc771</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3926038%2F28917d79-dfc7-4fc8-ab68-22c385fdccde.png</url>
      <title>DEV Community: Code Pocket</title>
      <link>https://dev.to/code_pocket_99fdbc771</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/code_pocket_99fdbc771"/>
    <language>en</language>
    <item>
      <title>Tracking podcast transcripts through 4 AI engines over 6 months</title>
      <dc:creator>Code Pocket</dc:creator>
      <pubDate>Tue, 12 May 2026 02:46:46 +0000</pubDate>
      <link>https://dev.to/code_pocket_99fdbc771/tracking-podcast-transcripts-through-4-ai-engines-over-6-months-14cb</link>
      <guid>https://dev.to/code_pocket_99fdbc771/tracking-podcast-transcripts-through-4-ai-engines-over-6-months-14cb</guid>
      <description>&lt;p&gt;The idea of using podcast transcripts as a GEO asset is older than GEO itself; transcripts have always been an SEO play. What's new, or newer, is whether transcripts function as a meaningful citation source for AI engines specifically. Over the last six months we've been quietly running a side experiment on this with a handful of clients, and the results have been split enough that I want to write them up before I forget the texture.&lt;/p&gt;

&lt;p&gt;The short version: transcripts work, sometimes, and the conditions under which they work are narrower than the marketing copy on transcript-as-a-service tools suggests.&lt;/p&gt;

&lt;h3&gt;
  
  
  The setup
&lt;/h3&gt;

&lt;p&gt;Three clients in our 12-client portfolio had podcasts of their own (founder-led, weekly to bi-weekly, established for at least 18 months pre-experiment). For each, we did the following over a six-month window starting in Q4 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cleaned the transcripts (timestamps, speaker labels, punctuation, paragraph breaks) into a format we'd judge readable as a standalone article.&lt;/li&gt;
&lt;li&gt;Added introductory framing — a one-paragraph summary of each episode's topic and the named entities involved, written by us.&lt;/li&gt;
&lt;li&gt;Published the cleaned transcripts on each client's own domain under a transcripts subfolder.&lt;/li&gt;
&lt;li&gt;Added speaker-level entity markup where appropriate (a sketch of what this looks like follows the list).&lt;/li&gt;
&lt;li&gt;Did not republish on third-party platforms, partly to keep the experiment scoped, partly because of canonical concerns.&lt;/li&gt;
&lt;/ul&gt;
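
&lt;p&gt;For concreteness, here's a minimal sketch of the kind of speaker-level markup we mean, shown as a Python dict that serializes to JSON-LD. Every name, URL, and property choice below is hypothetical; whether any engine actually reads these specific properties is exactly the open question this experiment was poking at.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

# Hypothetical speaker-level markup for one cleaned transcript page.
# Names, URLs, and property choices are illustrative, not a documented
# requirement of any AI engine.
episode = {
    "@context": "https://schema.org",
    "@type": "PodcastEpisode",
    "name": "Episode 42: Pricing experiments that backfired",
    "url": "https://example.com/podcast/transcripts/episode-42",
    "author": {
        "@type": "Person",
        "name": "Jane Host",
        "sameAs": ["https://www.linkedin.com/in/jane-host-example"],
    },
    "contributor": [
        {
            "@type": "Person",
            "name": "Sam Guest",
            "jobTitle": "VP of Product",
            "worksFor": {"@type": "Organization", "name": "Example Corp"},
            "sameAs": ["https://www.linkedin.com/in/sam-guest-example"],
        }
    ],
}

print(json.dumps(episode, indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The serialized output goes into a JSON-LD script tag on the transcript page. That's the whole intervention; it's deliberately boring.&lt;/p&gt;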

&lt;p&gt;We then tracked citation appearances of the transcript URLs across our four-engine test set over the following months.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened
&lt;/h3&gt;

&lt;p&gt;Across roughly 70 episodes covered in the experiment, transcript URLs appeared in citation rails on maybe 11% of queries where they were topically relevant. That's a hard number to compare cleanly because we didn't have a control set of comparable non-transcript content for the same clients in the same topics. It's directionally interesting, not statistically clean.&lt;/p&gt;

&lt;p&gt;The citations clustered heavily in two engines: Perplexity and Gemini. Both seemed willing to surface transcripts as primary sources for queries about specific people (the podcast guests) or specific phrases that appeared in the transcripts. ChatGPT (web on) cited transcripts much less often, and Google AIO almost never, in our test set.&lt;/p&gt;

&lt;p&gt;The pattern that seemed to predict whether a transcript got cited was, roughly: did the episode include a named expert making a specific, quotable claim that the AI engine could attribute? Episodes that were two co-hosts having a meandering conversation almost never got cited regardless of topic quality. Episodes with a guest making a clear, paraphrasable point cited well.&lt;/p&gt;

&lt;h3&gt;
  
  
  One thing that didn't work
&lt;/h3&gt;

&lt;p&gt;We tried generating "episode summaries" that pulled key claims out of each episode and listed them with bullet points and named-entity links. The hypothesis was that this would give engines an easier path to citing specific claims. It backfired modestly: in two of the three clients, the summaries themselves started getting cited instead of the transcripts they summarized. The transcripts dropped in surface rate; the summaries rose. The total citation rate per episode didn't change much; we'd just shifted which URL got picked.&lt;/p&gt;

&lt;p&gt;This is fine if you don't care which URL gets picked. It's less fine if your goal was to drive traffic to the transcript page specifically (which has the audio embed and the SEO history). We've since gone back to lighter framing paragraphs without bullet-point summaries on most clients.&lt;/p&gt;

&lt;h3&gt;
  
  
  The transcript-quality threshold
&lt;/h3&gt;

&lt;p&gt;Raw automated transcripts (the kind that come out of most podcast hosting platforms) didn't perform as well as cleaned transcripts. We don't have a clean A/B on this, but we have one client where we tested both formats on different episodes in the same series, and the cleaned versions cited at roughly twice the rate of the raw versions over the test window.&lt;/p&gt;

&lt;p&gt;The cleaning isn't elaborate. Punctuation, speaker labels, paragraph breaks, light copy-edit for filler words ("um," "you know," repeated phrases). Maybe 60-90 minutes per hour of audio when done by a person who knows the show. AI-assisted cleaning works for the mechanical parts but doesn't reliably catch where a paragraph break belongs based on conversational rhythm.&lt;/p&gt;

&lt;p&gt;I don't know whether the citation lift from cleaning is about engine parsing or about human-readable content reading better to whatever automated readability heuristic the engines use. Both are plausible. The agency I work with has defaulted to cleaning transcripts when the underlying podcast has enough audience to justify the cost. For shows below maybe 1,000 listens per episode the math gets harder.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I'm holding the claim loosely
&lt;/h3&gt;

&lt;p&gt;Three clients with their own podcasts is not a sample size that supports strong claims. The 11% citation rate is the kind of number that could be a function of the specific topics those clients work in, or the specific guests they had on, or the freshness of the transcripts hitting Perplexity at the right moment.&lt;/p&gt;

&lt;p&gt;I'd want to see this tested across 15-20 clients with podcasts in different verticals before I'd recommend the strategy generally. As a thing to try in a portfolio where the audio content already exists, the cost-to-test ratio is decent. As a reason to start a podcast solely for GEO, I don't think the data supports that yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  A surprise: chapter markers seemed to help
&lt;/h3&gt;

&lt;p&gt;One thing we'd added almost as an afterthought turned out to matter. We embedded chapter markers (with timestamps and short titles) at logical breakpoints in each transcript page. These were primarily for human readability and accessibility. Two of the three clients showed improved citation surfacing on the chapters with the most descriptive titles, where the engines appeared to use the chapter title as a hook for the surrounding text.&lt;/p&gt;

&lt;p&gt;We don't know whether this was the chapters specifically or whether the act of breaking a long transcript into labeled sections improved general parseability. Either way, the cost of adding chapter markers is small (15-30 minutes per episode after a transcript is cleaned) and the apparent return was non-trivial in our sample.&lt;/p&gt;

&lt;p&gt;A caution: I've seen agencies start to recommend "AI-friendly chapter markers" as a productized service, and I want to be careful about that framing here. Two clients, n equals a handful of measurable lifts, is interesting. It's not a service offering. If you try it on your own content and it works for you, please share what you find.&lt;/p&gt;

&lt;h3&gt;
  
  
  The unanswered question of canonical confusion
&lt;/h3&gt;

&lt;p&gt;One reason we kept the experiment scoped to first-party hosting (not syndicating transcripts to third-party platforms) is that we weren't confident about how engines handle canonical questions for the same content appearing in multiple places. If a transcript is on the podcast's site, on a third-party transcript service, on YouTube auto-captions, and on a guest's personal blog, which one gets cited?&lt;/p&gt;

&lt;p&gt;Anecdotally, we've seen engines pick non-canonical sources surprisingly often. The "official" hosted transcript on the client's domain isn't always the citation winner; sometimes a YouTube auto-caption page or a third-party transcript site shows up instead. We don't have a clean explanation for when this happens. The hypothesis is that older domains or domains with higher topical authority can win citations for content that semantically lives elsewhere.&lt;/p&gt;

&lt;p&gt;This complicates the strategic question. If your transcripts are going to get cited but the citations are going to a third-party site you don't control, the GEO win for your brand is partial at best. We've been considering a follow-up experiment that explicitly publishes the same transcript to multiple destinations and tracks where the citations land. It's on the roadmap. It's not done.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I'd ask before doing this
&lt;/h3&gt;

&lt;p&gt;Four questions:&lt;/p&gt;

&lt;p&gt;Does the show already produce content that has named experts making clear claims, or is it mostly co-host conversation? The latter doesn't seem to cite well.&lt;/p&gt;

&lt;p&gt;Is there an existing audience for the show, or is this purely a content-asset play? Transcripts of podcasts that nobody listens to seem to still cite occasionally, but I'm less sure that the engines are stable about surfacing them long-term.&lt;/p&gt;

&lt;p&gt;Is the cleaning labor available? Raw transcripts underperform consistently in our small sample.&lt;/p&gt;

&lt;p&gt;Are you prepared for the canonical question? If multiple versions of the same content exist on the open web, the citation may not go to the one you want.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I keep telling clients about this
&lt;/h3&gt;

&lt;p&gt;Podcast transcripts are not a GEO silver bullet. They're a moderately useful content asset, in narrow conditions, with a cost-to-test ratio that makes sense if you already have the audio. If you're starting from zero and considering whether to launch a podcast for GEO reasons, my honest answer is: launch a podcast if you have something to say and someone you'd want to interview. The GEO citations may or may not follow. Do it for the show first, the transcripts second. That's the order that has correlated with results in our small sample.&lt;/p&gt;

&lt;p&gt;If you've published transcripts and tracked citations, what cleaning level did you find was the practical minimum? I'm curious whether the 60-90 minute number generalizes or whether we're over-cleaning.&lt;/p&gt;

</description>
      <category>podcast</category>
      <category>transcripts</category>
      <category>geo</category>
      <category>aisearch</category>
    </item>
    <item>
      <title>The AI audit rep-curve: why 1 run gives you 67 percent reliability</title>
      <dc:creator>Code Pocket</dc:creator>
      <pubDate>Tue, 12 May 2026 02:41:13 +0000</pubDate>
      <link>https://dev.to/code_pocket_99fdbc771/the-ai-audit-rep-curve-why-1-run-gives-you-67-percent-reliability-2g5a</link>
      <guid>https://dev.to/code_pocket_99fdbc771/the-ai-audit-rep-curve-why-1-run-gives-you-67-percent-reliability-2g5a</guid>
      <description>&lt;p&gt;For most of 2025, the standard AI-search audit I saw from peer agencies looked the same: run a list of prompts once each, screenshot the outputs, code the citations, write the report. Sometimes the prompt list was thoughtful. Sometimes the engines were comprehensive. The methodology, though, almost always assumed that one run per prompt was enough.&lt;/p&gt;

&lt;p&gt;It isn't. We learned this slowly, then quickly, then expensively.&lt;/p&gt;

&lt;h3&gt;
  
  
  The pilot that broke our methodology
&lt;/h3&gt;

&lt;p&gt;Our first GEO audit, back in mid-2025, ran 30 prompts once each on four engines and shipped the report. The client made a budget decision based on it. A month later, doing a follow-up before any work had actually been implemented, we re-ran the same prompts and got materially different citation results on a notable share of them.&lt;/p&gt;

&lt;p&gt;The variance was bigger than the trend we'd been claiming. The report we'd shipped was, in retrospect, an artifact of a single-day snapshot of these engines' behavior. We hadn't lied; we'd just overstated our certainty.&lt;/p&gt;

&lt;p&gt;So we ran the structured experiment that produced the 800-run baseline. The point of the baseline wasn't to find a tier rate. It was to find out how many reps you needed before the tier rate stabilized.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the rep curve looked like
&lt;/h3&gt;

&lt;p&gt;We ran each of our 40 baseline prompts on each of 4 engines, 5 times each (the 800 runs). For each prompt-engine pair, we asked: how does the modal tier code change as we add more reps? A sketch of the computation follows the list.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After 1 rep: tier code "agrees with the 5-rep mode" about 67% of the time.&lt;/li&gt;
&lt;li&gt;After 2 reps (modal of two): about 78%.&lt;/li&gt;
&lt;li&gt;After 3 reps: about 88%.&lt;/li&gt;
&lt;li&gt;After 4 reps: about 95%.&lt;/li&gt;
&lt;li&gt;After 5 reps: 100% by definition, since the 5-rep mode is the reference.&lt;/li&gt;
&lt;/ul&gt;
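
&lt;p&gt;The computation behind those numbers is small enough to reproduce on your own data. Here's a sketch of one way to do it in Python, with made-up run data; the exact bookkeeping (how ties are broken, whether you average over subsets of reps or just take them in collection order) matters less than running it at all.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter
from itertools import combinations

# Made-up tier codes per rep, keyed by (prompt, engine).
# The real data set has 160 prompt-engine pairs; three shown for shape.
runs = {
    ("best crm for smb", "perplexity"): ["B", "B", "C", "B", "B"],
    ("best crm for smb", "chatgpt"):    ["C", "C", "C", "C", "C"],
    ("pricing tools 2026", "gemini"):   ["A", "C", "B", "B", "D"],
}

def mode(codes):
    # Most common tier code; ties go to the code seen first.
    return Counter(codes).most_common(1)[0][0]

def agreement_at_k(runs, k):
    # Share of prompt-engine pairs where the mode of k reps matches the
    # 5-rep mode, averaged over every k-sized subset of the five reps.
    hits, total = 0, 0
    for codes in runs.values():
        reference = mode(codes)
        for subset in combinations(codes, k):
            hits += int(mode(list(subset)) == reference)
            total += 1
    return hits / total

for k in (1, 2, 3, 4, 5):
    print(k, round(agreement_at_k(runs, k), 2))
&lt;/code&gt;&lt;/pre&gt;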

&lt;p&gt;A third of single-run audits, by this measure, return a tier code that doesn't match the underlying signal once you sample more deeply. That's the noise floor. Audits that don't account for it are presenting noise as if it were signal.&lt;/p&gt;

&lt;p&gt;We've since pre-registered 5 reps as our minimum for client-facing audits. The agency I work with has burned the report templates that used 1-shot data, partly to remove the temptation to fall back to them under deadline pressure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why engines are this volatile
&lt;/h3&gt;

&lt;p&gt;A few reasons, none of them surprising once you see them:&lt;/p&gt;

&lt;p&gt;First, the engines are non-deterministic by design. Temperature, sampling, and routing decisions vary run to run. Even if the underlying retrieval is stable, the synthesized answer isn't.&lt;/p&gt;

&lt;p&gt;Second, the retrieval surface itself is volatile. Perplexity in particular re-queries the live web, and what gets surfaced on a Tuesday morning may not be what gets surfaced Thursday afternoon. Crawl freshness, server response times, and CDN caching all influence what's available to cite.&lt;/p&gt;

&lt;p&gt;Third, prompt phrasing has subtle effects. The same intent expressed two days apart by the same human can end up phrased slightly differently, and small phrasing changes can route to different sub-systems inside an engine. We've tried to control for this by holding prompt phrasing constant across reps; even doing that, output variance is meaningful.&lt;/p&gt;

&lt;h3&gt;
  
  
  The cost of more reps
&lt;/h3&gt;

&lt;p&gt;Running 5 reps instead of 1 is 5x the data collection effort. That's real. In our process, we've automated screenshot capture and citation extraction enough that the marginal cost per rep is mostly engine response time, not human time. Coding is still human. We've added a second coder on a subset of runs to measure inter-rater reliability, which adds further overhead.&lt;/p&gt;

&lt;p&gt;For clients, this affects pricing and timelines. A "fast audit" that promises results in three days using single-rep methodology is, in our view, selling a partial product. We've lost some prospective engagements where speed was the deciding factor. We've kept the engagements where the buyer cared about whether the audit told them something true.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we'd do differently next time
&lt;/h3&gt;

&lt;p&gt;We'd start with a small replication study before any client work. Even a 10-prompt rep-curve study takes maybe a day and would have saved us the credibility cost of the early single-run reports. We didn't do that. We assumed the engines were more stable than they are.&lt;/p&gt;

&lt;p&gt;We'd also be more aggressive about reporting confidence ranges, not point estimates. The "23% A+B tier" number from our baseline has a meaningful confidence interval around it. We've started reporting that interval in client work. It's harder to communicate than a clean point estimate. It's more honest.&lt;/p&gt;
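
&lt;p&gt;The interval itself isn't exotic. Here's a sketch of the kind of calculation behind a range like 19-27%, using a standard Wilson score interval; the counts below are illustrative, not our exact sample.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def wilson_interval(successes, n, z=1.96):
    # 95% Wilson score interval for a proportion.
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - margin, center + margin

# Illustrative counts only: 92 A+B codes out of 400 coded prompt-engine runs.
low, high = wilson_interval(92, 400)
print(round(low, 3), round(high, 3))  # roughly 0.19 to 0.27
&lt;/code&gt;&lt;/pre&gt;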

&lt;h3&gt;
  
  
  What an honest audit deliverable looks like now
&lt;/h3&gt;

&lt;p&gt;Our standard audit deliverable has changed in three ways since we adopted the 5-rep minimum.&lt;/p&gt;

&lt;p&gt;First, every tier-rate number comes with a confidence range. "23% A+B tier, with a 95% confidence interval of roughly 19-27% given our sample size" is what we write now. The interval is wider than clients sometimes expect. We've found that the clients who push back on the interval are usually the ones we end up disappointing later; the clients who accept the interval as honest tend to be the ones we work with productively over the long run.&lt;/p&gt;

&lt;p&gt;Second, we explicitly call out tier shifts that occurred between reps. "On 14 of 40 prompts we observed at least one tier shift across the 5 reps, which means a single-run audit would have given a misleading code on those prompts" is the kind of sentence we now include. This makes the report longer and the reader's job harder. We think it's worth it.&lt;/p&gt;

&lt;p&gt;Third, we include a methodology section that names what we did and didn't control for. Pre-registration status. Whether the coder was blind. Whether prompts were paraphrased between reps. Whether the audit was run across time of day, time zone, account state, or other variables that might affect engine routing. Most of those answers are still "no, we didn't fully control for that," but writing them down keeps us honest about what we know.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on automation
&lt;/h3&gt;

&lt;p&gt;Five reps means more data to capture and code. We've leaned on automation for the capture side (screenshots, citation rail extraction, prompt logging) and kept humans on the coding side. We've experimented with using an LLM to do first-pass tier coding, and the results have been promising-but-not-yet-reliable: the LLM agrees with human coders on about 84% of records in our internal tests, which is good enough to be useful as a first pass but not good enough to ship unchecked.&lt;/p&gt;

&lt;p&gt;Our current workflow is: automated capture, LLM first-pass coding, human review with the LLM's coding visible as a suggestion, second human coder on a 20% sample for inter-rater reliability. This roughly doubles per-audit throughput compared to all-human coding without measurably degrading reliability in our spot checks. The agency I work with is still iterating on this stack and we've ruled out fully automated reporting for the foreseeable future. The cost of a confident-sounding wrong audit is too high.&lt;/p&gt;
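
&lt;p&gt;If you're building a similar pipeline, the inter-rater check doesn't need anything fancier than percent agreement plus Cohen's kappa. A minimal sketch with invented codes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter

def cohens_kappa(coder_a, coder_b):
    # Cohen's kappa for two coders assigning categorical tier codes.
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented tier codes for a double-coded sample of ten records.
human_1 = ["A", "B", "B", "C", "E", "B", "A", "D", "C", "B"]
human_2 = ["A", "B", "C", "C", "E", "B", "A", "D", "C", "A"]
print(round(cohens_kappa(human_1, human_2), 2))
&lt;/code&gt;&lt;/pre&gt;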

&lt;h3&gt;
  
  
  Small n caveats and one open question
&lt;/h3&gt;

&lt;p&gt;Five reps is the minimum that worked in our setup. It's not necessarily the right minimum for everyone. If your prompt set has higher intrinsic variance (very ambiguous prompts, very volatile topics, very fresh news cycles), you may need more. If your prompts are tightly scoped factual questions about stable topics, you might get away with fewer, but I'd want to measure that before claiming it.&lt;/p&gt;

&lt;p&gt;The open question I haven't answered yet: does the rep-curve shape vary by engine? My intuition is that Perplexity needs more reps than ChatGPT, but I haven't seen the breakdown cleanly in our data. If anyone has run that comparison rigorously, I'd want to read the methodology.&lt;/p&gt;

&lt;p&gt;There's also a meta-question I keep coming back to. Five reps stabilizes the modal tier code, but the variance itself is information. A prompt where five reps return five different tiers tells you something different from a prompt where five reps all return the same tier. We've started reporting both the modal tier and a stability score per prompt. Whether clients find that useful is still an open question; some have, some haven't.&lt;/p&gt;
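
&lt;p&gt;If you want to try the same thing, a stability score can be as simple as the share of reps that match the modal tier. A sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter

def stability(codes):
    # Share of reps that match the modal tier; 1.0 means perfectly stable.
    counts = Counter(codes)
    return counts.most_common(1)[0][1] / len(codes)

print(stability(["B", "B", "B", "B", "B"]))  # 1.0
print(stability(["A", "C", "B", "B", "D"]))  # 0.4
&lt;/code&gt;&lt;/pre&gt;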

&lt;p&gt;If you're auditing AI search performance for a client right now using single-run data, what would it take to get you to add a second pass? In our experience the answer was an embarrassing client follow-up. There's a cheaper way to learn this lesson.&lt;/p&gt;

</description>
      <category>aisearch</category>
      <category>auditmethodology</category>
      <category>statistics</category>
      <category>geo</category>
    </item>
    <item>
      <title>Entity disambiguation versus schema: which moved citations more</title>
      <dc:creator>Code Pocket</dc:creator>
      <pubDate>Tue, 12 May 2026 02:35:40 +0000</pubDate>
      <link>https://dev.to/code_pocket_99fdbc771/entity-disambiguation-versus-schema-which-moved-citations-more-1h3i</link>
      <guid>https://dev.to/code_pocket_99fdbc771/entity-disambiguation-versus-schema-which-moved-citations-more-1h3i</guid>
      <description>&lt;p&gt;The first time we tried to systematically disambiguate a client's entity references across their website, I expected it to be a polishing exercise. A week's work, modest gains, the kind of project you don't write a case study about. The data surprised me. In the same portfolio where we measured a 9-10% schema-attributable citation lift, the entity disambiguation work appears to have moved more tier-shifts than the schema work did.&lt;/p&gt;

&lt;p&gt;I want to be careful with this claim, because I'm not sure how reproducible it is. But the direction is consistent enough that the agency I work with has reordered our default engagement sequence. Entity work now happens before schema work in most engagements. Six months ago it was the other way around.&lt;/p&gt;

&lt;h3&gt;
  
  
  What "entity disambiguation" means in this context
&lt;/h3&gt;

&lt;p&gt;Two pages on the same site referring to a CEO by three different name spellings. A product whose internal docs called it "Atlas," whose marketing site called it "the Atlas Platform," and whose support docs called it "Atlas Suite." A founder who shared a name with an unrelated person who happened to be more famous in a different industry. A subsidiary that the parent company's site never explicitly linked to as a subsidiary.&lt;/p&gt;

&lt;p&gt;These are not exotic problems. We see them in every audit. They're the kind of thing that builds up because no single team owns naming conventions across an organization, and individually each inconsistency is harmless.&lt;/p&gt;

&lt;p&gt;The hypothesis behind disambiguation work is that AI engines, when parsing a page or pulling an answer, need to resolve "is this entity X or entity Y?" and the cost of that resolution is paid in confidence. A page that consistently and explicitly identifies its subject is easier to cite confidently than a page that's ambiguous about whether it's talking about the company, the product, the parent, or some homonym.&lt;/p&gt;

&lt;p&gt;That's the hypothesis. Here's what the data showed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The before/after
&lt;/h3&gt;

&lt;p&gt;Across a subset of 8 clients where we did focused entity disambiguation work in Q4 2025 and held other variables roughly constant, we tracked citation tier on a stable set of 20 prompts per client (160 total prompts) for 4 weeks before the work and 8 weeks after.&lt;/p&gt;

&lt;p&gt;The aggregate A+B tier rate moved from 21% to 29%, a relative lift of about 38%. That number is larger than the schema lift on a smaller sample, which is exactly why I'm being careful about generalizing.&lt;/p&gt;

&lt;p&gt;Per-client variation was wide: one client showed no measurable lift, one showed a 60%+ relative improvement, the rest clustered in the 20-40% range. The one that showed no lift had been doing meticulous editorial QA for years and had relatively few entity-consistency issues to fix; their starting point was already clean.&lt;/p&gt;

&lt;h3&gt;
  
  
  What "disambiguation work" actually looked like
&lt;/h3&gt;

&lt;p&gt;Concretely, in this client subset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standardized name spellings across all pages (CEO, founders, product names, locations).&lt;/li&gt;
&lt;li&gt;Added structured organization, person, and product schema with consistent identifiers.&lt;/li&gt;
&lt;li&gt;Linked sameAs references to external authoritative profiles (LinkedIn, Crunchbase, official social, where appropriate); there's a sketch of this markup after the list.&lt;/li&gt;
&lt;li&gt;Disambiguated against known homonyms by adding clarifying context in the first paragraph of pages where confusion was plausible.&lt;/li&gt;
&lt;li&gt;Cleaned up internal anchor text so that links to a product page used consistent phrasing.&lt;/li&gt;
&lt;/ul&gt;
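
&lt;p&gt;For readers who haven't done this work before, here's a sketch of the Organization, Person, and sameAs layer, shown as a Python dict that serializes to JSON-LD. Every name and URL is invented; the only point is that each entity gets one canonical name plus external identifiers an engine can resolve against.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

# Invented entity markup for a company page after disambiguation work.
# One canonical name per entity, plus sameAs links to external profiles
# an engine can use to resolve which "Atlas" or which founder this is.
org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Corp",
    "url": "https://www.example.com",
    "sameAs": [
        "https://www.linkedin.com/company/example-corp",
        "https://www.crunchbase.com/organization/example-corp",
    ],
    "founder": {
        "@type": "Person",
        "name": "Jordan Founder",
        "jobTitle": "CEO",
        "sameAs": ["https://www.linkedin.com/in/jordan-founder-example"],
    },
    "owns": {
        "@type": "Product",
        "name": "Atlas Platform",
        "url": "https://www.example.com/atlas",
    },
}

print(json.dumps(org, indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The serialized output goes into a JSON-LD script tag on the relevant pages; the harder part, as described below, is the body-text consistency.&lt;/p&gt;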

&lt;p&gt;None of this was creative work. It was inventory and cleanup. The total hours per client ranged from about 20 to about 80, depending on the size of the site.&lt;/p&gt;

&lt;h3&gt;
  
  
  The thing I was wrong about
&lt;/h3&gt;

&lt;p&gt;I'd assumed the biggest lift would come from sameAs and structured data. In our testing, the biggest single lift seems to have come from name consistency in the body text of pages — boring editorial work that doesn't involve any structured markup at all. The structured markup helped, but the editorial pass appears to have done more.&lt;/p&gt;

&lt;p&gt;This is uncomfortable because it means the highest-impact GEO work, for some clients, is just editing. Not strategy, not technical implementation, not content generation. Editing. The agency I work with has had to adjust how we talk about this work because clients sometimes recoil from paying for "editing" the way they don't recoil from paying for "schema implementation." Same hours, different framing, similar lift.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I'm not fully confident yet
&lt;/h3&gt;

&lt;p&gt;The 38% relative lift is from a small sample with non-randomized assignment. The clients who got the focused disambiguation treatment were the ones where we'd already identified entity issues during initial audit, which means they had more room to improve. A randomized study would give cleaner numbers.&lt;/p&gt;

&lt;p&gt;The 8-week tracking window may also be too short to know whether the lift persists. Some of our schema lifts compressed over a longer window. Disambiguation might do the same.&lt;/p&gt;

&lt;p&gt;And the line between "disambiguation" and "general content cleanup" is fuzzier than I'd like. Some of what we counted as disambiguation work probably had collateral content improvements that helped citations independently.&lt;/p&gt;

&lt;h3&gt;
  
  
  A concrete pattern: the "named expert" lift
&lt;/h3&gt;

&lt;p&gt;One specific sub-pattern that showed up across multiple clients was about author and expert attribution. Pages that named the author with a verified profile (linked to a real LinkedIn, a real organization page, a real public bio) seemed to cite better than pages with no author or with vague "by the team" attribution.&lt;/p&gt;

&lt;p&gt;The relative lift on this specific change was on the order of 15-20% in the clients where we made the change. It's a small intervention. The cost is maybe an hour per page if the author bios already exist and are linkable. The cost is much higher if the underlying authors don't have credible public profiles, which is a separate problem we can't solve from outside.&lt;/p&gt;

&lt;p&gt;For B2B SaaS clients, this often means committing to an authorship strategy: who on the team has earned the right to be cited, what does their public profile look like, and how do we make their work findable. Some of our clients have been excited about this. A few have been uncomfortable, because it implies that the brand alone isn't enough; you need named people whose names can be tied back to verifiable expertise.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this implies for engagement sequencing
&lt;/h3&gt;

&lt;p&gt;If you're scoping a GEO engagement, our updated default is to start with an entity audit before any schema work. If the entity layer is messy, the schema layer is decorating something that engines may not be able to parse confidently anyway. If the entity layer is clean, schema sits on top of it usefully.&lt;/p&gt;

&lt;p&gt;This is not the order I would have recommended a year ago. Order of operations matters, and we got it wrong for our first few engagements.&lt;/p&gt;

&lt;h3&gt;
  
  
  The relationship between entity work and brand
&lt;/h3&gt;

&lt;p&gt;There's a softer point underneath the technical work. Entity disambiguation forces an organization to decide what it is, precisely. When two pages refer to the same product by three different names, the problem isn't AI parsing. The problem is that the organization hasn't fully decided what to call its own thing. The disambiguation work is, in some ways, an excuse to have the conversation that should have happened during product naming and never quite did.&lt;/p&gt;

&lt;p&gt;That makes some of this work uncomfortable for clients. Marketing teams don't always have the authority to rename a product. Engineering teams may have technical reasons for the legacy names. Sales teams may have customer relationships built on familiarity with old terms. Getting to consistent entity references can require surfacing organizational debt that nobody wanted to deal with.&lt;/p&gt;

&lt;p&gt;We try to be honest with clients about this when we scope the work. "This is going to involve a few uncomfortable internal conversations" is a more accurate scope than "we'll clean up your entities." The first version sets expectations correctly. The second version sounds easier and ends up taking three times longer because nobody had warned the client about the political layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we can't yet do
&lt;/h3&gt;

&lt;p&gt;We can't predict which clients will get the biggest lift from disambiguation before doing the audit. The client who showed no lift had a clean starting point, which we couldn't have known without auditing. The clients who showed 60%+ relative lifts had specific entity issues that weren't visible from the outside.&lt;/p&gt;

&lt;p&gt;We also can't promise the lift will hold over years. The disambiguation work we did in Q4 2025 still looks good in our most recent tracking, but we've only been measuring for about two quarters. The longer-run question is open.&lt;/p&gt;

&lt;p&gt;If you've done structured entity disambiguation work in your own GEO practice, did you see the same disproportionate lift? Or are we looking at a portfolio effect that won't generalize?&lt;/p&gt;

</description>
      <category>entity</category>
      <category>disambiguation</category>
      <category>aisearch</category>
      <category>citations</category>
    </item>
    <item>
      <title>B Corp certification at an agency: signal, friction, or both</title>
      <dc:creator>Code Pocket</dc:creator>
      <pubDate>Tue, 12 May 2026 02:30:03 +0000</pubDate>
      <link>https://dev.to/code_pocket_99fdbc771/b-corp-certification-at-an-agency-signal-friction-or-both-19d8</link>
      <guid>https://dev.to/code_pocket_99fdbc771/b-corp-certification-at-an-agency-signal-friction-or-both-19d8</guid>
      <description>&lt;p&gt;The most honest sentence I can write about our B Corp certification is that it has been useful in ways I didn't predict and friction-y in ways I didn't predict, and I am still not sure how to weight them against each other twelve months in.&lt;/p&gt;

&lt;p&gt;This isn't a defense of B Corp. It also isn't a takedown. It's an attempt to write down what the certification has actually done, operationally, in a small marketing agency that does GEO and AI-search work for B2B SaaS. If you're considering certification for your own agency, I'd rather you read this than the marketing copy on the official site.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the certification actually changed about how we work
&lt;/h3&gt;

&lt;p&gt;The audit forced us to document things we'd been doing informally for years. Our hiring process, our supplier choices, our energy use, the way we structure equity, the way we handle client off-boarding. The B Impact Assessment is a long instrument, and you can't half-fill it without it being obvious. The act of filling it in cost about 90 hours of senior-team time spread over two months. It surfaced four things we were doing badly that we hadn't named: vague off-boarding contracts, no formal supplier diversity tracking, a weak parental leave policy that hadn't been updated since founding, and a vendor we'd been using whose data practices we'd stopped being comfortable with but hadn't acted on.&lt;/p&gt;

&lt;p&gt;We fixed three of those during the audit. The fourth (the vendor) we replaced over the following quarter. None of those changes were dramatic. All of them were overdue.&lt;/p&gt;

&lt;h3&gt;
  
  
  What clients have done with it
&lt;/h3&gt;

&lt;p&gt;Mixed.&lt;/p&gt;

&lt;p&gt;Some clients (a minority, maybe 20% of new business conversations over the last 12 months) bring up the B Corp status in their initial outreach, and a smaller subset (maybe 8%) cite it as a meaningful factor in choosing us. These are mostly procurement-driven processes at larger companies with their own sustainability commitments, or founder-led companies whose founders care about it personally.&lt;/p&gt;

&lt;p&gt;Most clients don't bring it up at all. Our certification has not been a significant lead generator in raw volume terms. It has, however, been a quiet de-frictioner in procurement conversations at enterprise-adjacent companies, where having third-party-verified governance documentation makes the legal and compliance teams' lives easier. That's a hard ROI to put a number on.&lt;/p&gt;

&lt;p&gt;A few clients have asked us pointed questions about whether the certification is meaningful or whether it's pay-to-play. That's a fair question. Our answer is that the audit is real, the standards are public, the questions are detailed, and reasonable people can disagree about how high the bar is. The certification doesn't make us a better agency. It documents that we meet a particular bar on a particular set of dimensions. That's all.&lt;/p&gt;

&lt;h3&gt;
  
  
  The friction
&lt;/h3&gt;

&lt;p&gt;A few things have been harder.&lt;/p&gt;

&lt;p&gt;Certain client conversations have been longer because we've ended up explaining the certification, often to people who'd heard of B Corp but conflated it with B2B, or with something else entirely. That's not the certification's fault, but it's a real time cost.&lt;/p&gt;

&lt;p&gt;We've also walked away from at least one engagement (the team voted on it) where the client's product touched a category that conflicted with our stated values. Pre-certification, I think we would have taken that work and rationalized it. Post-certification, we couldn't, and we lost the revenue. I'm at peace with that decision now. I wasn't at the time.&lt;/p&gt;

&lt;p&gt;Renewal is also non-trivial. The recertification cycle is real work, and the standards evolve. We're approaching our first renewal and the prep is consuming team time that could be billed.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we got wrong about the launch
&lt;/h3&gt;

&lt;p&gt;When the certification came through we did the predictable thing: announcement post, badge in the email signature, line in the website footer. About three weeks of moderate internal celebration. Looking back, the marketing of the certification was probably more performative than was useful. The certification itself, in our experience, has been most valuable as an internal operating discipline. The external badge has been a smaller deal than we expected.&lt;/p&gt;

&lt;p&gt;The agency I work with would, I think, do the launch differently in retrospect: less about announcing, more about quietly building the practices and letting clients ask. We didn't get that right the first time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hiring effects
&lt;/h3&gt;

&lt;p&gt;The certification has had effects on hiring that I didn't predict. A noticeable share of our last six hires (I'd put it at four out of six, though I'd want to be careful claiming a clean cause) cited the B Corp status as a factor in their decision to apply or accept. None said it was the only factor. Several said it was one signal among several that we were "the kind of place" they wanted to work.&lt;/p&gt;

&lt;p&gt;This effect was strongest for mid-career hires (5-10 years experience) considering a move from a larger agency or in-house role. It was less of a factor for entry-level hires and roughly neutral for senior leadership candidates, where compensation and scope mattered more than mission framing.&lt;/p&gt;

&lt;p&gt;I'm cautious about over-claiming this. We can't run the counterfactual; we don't know who didn't apply because of the certification, or who would have applied either way. The honest read is that the certification probably tilts the candidate pool slightly toward people who'd self-select into mission-aligned work, which is a mixed blessing depending on the role.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vendor and supplier effects
&lt;/h3&gt;

&lt;p&gt;The certification also influenced our vendor choices in ways that have been mostly positive and occasionally inconvenient.&lt;/p&gt;

&lt;p&gt;We replaced a longtime hosting vendor partway through the audit cycle when we couldn't satisfy ourselves that their data practices met the standard we wanted to hold ourselves to. The replacement cost us about 30 hours of migration work and a small monthly cost increase. It also cost us a casual professional relationship with the previous vendor, who took the switch personally even though we tried to explain it without blame.&lt;/p&gt;

&lt;p&gt;We've also been pickier about which freelancers we engage on subcontract work. The pickiness has narrowed the pool. The pool that remains is, on average, more reliable. Whether the reliability is a function of the values screen or just the smaller pool being self-selected, I can't fully separate.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I'd tell another agency considering it
&lt;/h3&gt;

&lt;p&gt;If you're hoping the certification will be a marketing lever, our experience says it will be a small one. Maybe 8% of new business conversations weighted in our favor by certification status, in our specific niche, with our specific positioning.&lt;/p&gt;

&lt;p&gt;If you're hoping it will force you to document and improve internal operations, our experience says yes, it does, and the improvements outlast the audit cycle.&lt;/p&gt;

&lt;p&gt;If you're hoping it will protect you from making short-term decisions that conflict with your stated values, our experience says yes, but the cost is that you'll occasionally turn down revenue that another agency will take.&lt;/p&gt;

&lt;p&gt;If you're hoping it will help you hire mission-aligned people, our experience says modestly yes, mostly at mid-career levels.&lt;/p&gt;

&lt;p&gt;If you're hoping it will improve your supplier relationships, our experience says it'll change them in ways that are mostly net-positive but occasionally awkward.&lt;/p&gt;

&lt;h3&gt;
  
  
  The thing nobody talks about
&lt;/h3&gt;

&lt;p&gt;A B Corp certification is a public commitment, and public commitments interact with human nature in interesting ways. The first time we caught ourselves in a situation where the certification framework would say "no" and short-term commercial logic would say "yes," I didn't enjoy the conversation. It was easier to make the right call than I expected, partly because the framework existed and partly because the team had collectively committed to it. Without the framework, I think we would have rationalized a different answer.&lt;/p&gt;

&lt;p&gt;I keep returning to the question of whether we'd do it again. The honest answer is yes, but for different reasons than the ones we listed in the original pitch deck. The certification has been less of a flag and more of a fence: it's drawn a line we can't cross without noticing.&lt;/p&gt;

&lt;p&gt;If you've gone through the certification yourself, what surprised you the most? For us, it was how much of the value showed up internally and how little of it showed up in inbound leads.&lt;/p&gt;

</description>
      <category>bcorp</category>
      <category>certification</category>
      <category>agencyoperations</category>
      <category>marketing</category>
    </item>
    <item>
      <title>12 client portfolios, 12 months post-AIO: the traffic data</title>
      <dc:creator>Code Pocket</dc:creator>
      <pubDate>Tue, 12 May 2026 02:24:29 +0000</pubDate>
      <link>https://dev.to/code_pocket_99fdbc771/12-client-portfolios-12-months-post-aio-the-traffic-data-1ba7</link>
      <guid>https://dev.to/code_pocket_99fdbc771/12-client-portfolios-12-months-post-aio-the-traffic-data-1ba7</guid>
      <description>&lt;p&gt;Every few weeks someone forwards me a LinkedIn post that says AI Overviews killed SEO. The post usually has a screenshot of one site's traffic chart and a caption that reads like a eulogy. I have a folder of them now. I save them because the data underneath the claim, when I've been able to see it, almost never supports the eulogy.&lt;/p&gt;

&lt;p&gt;This isn't a defense of SEO as it was. It's a request for more precise language about what's actually happening, because the cost of the imprecise version is that marketing teams are making budget decisions based on vibes.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we see in the data we can see
&lt;/h3&gt;

&lt;p&gt;Across the 12-client portfolio we track, organic traffic from Google in the 12 months following AIO's general rollout looks like this, in rough terms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 clients: down between 5% and 15% year-over-year, mostly in informational-query categories.&lt;/li&gt;
&lt;li&gt;5 clients: roughly flat (within +/- 5%).&lt;/li&gt;
&lt;li&gt;4 clients: up between 8% and 30%, mostly in transactional and product-focused query categories.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The aggregate, weighted by traffic volume, is approximately flat to slightly down (about -3%). That is not a dead channel. That is a channel that's redistributing.&lt;/p&gt;
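
&lt;p&gt;One methodological note for anyone replicating this on their own portfolio: weighting by traffic volume means summing clicks across clients before computing the change, not averaging the per-client percentages. A sketch with invented numbers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Invented per-client click totals, for shape only.
clients = [
    # (clicks prior year, clicks past year)
    (120_000, 104_000),   # down about 13%
    (80_000, 82_000),     # roughly flat
    (45_000, 58_000),     # up about 29%
]

before = sum(b for b, _ in clients)
after = sum(a for _, a in clients)
print(round((after - before) / before * 100, 1))  # portfolio-level change, percent
&lt;/code&gt;&lt;/pre&gt;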

&lt;p&gt;The informational-query traffic loss is real, and it tracks with what you'd expect: queries that get fully answered in the AIO box don't generate clicks. We've watched specific pages lose 40-60% of their click-through from positions where they used to draw consistent traffic, even when their average position didn't change. Position 1 in a world with AIO is not the same artifact it was in a world without it.&lt;/p&gt;

&lt;p&gt;But the inverse is also true: pages that are cited within the AIO box (linked sources) sometimes show higher click-through than they did at their old rankings, because the citation acts as an endorsement. We don't have enough cited-vs-not data to make that claim strongly across the portfolio yet, but we've seen it on individual pages clearly enough that I'm willing to say it in print.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the per-page picture looks like
&lt;/h3&gt;

&lt;p&gt;To make the redistribution real, here's what we typically see when we pull a client's organic traffic and segment it by page intent over the 12-month window post-AIO.&lt;/p&gt;
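
&lt;p&gt;The segmentation itself is unglamorous. Here's a sketch of how you might bucket a Search Console export by page intent in pandas; the URL patterns are hypothetical and crude, and a real classification needs a human pass.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# Hypothetical GSC export with one row per page, clicks for two periods.
df = pd.DataFrame({
    "page": [
        "https://example.com/blog/what-is-revops",
        "https://example.com/pricing",
        "https://example.com/about",
        "https://example.com/blog/crm-vs-spreadsheet",
    ],
    "clicks_prior_year": [5200, 3100, 400, 2600],
    "clicks_past_year": [3100, 3400, 560, 2500],
})

def intent(url):
    # Crude path-based buckets; real classification needs a human pass.
    if "/pricing" in url or "/product" in url or "/demo" in url:
        return "transactional"
    if "/about" in url or "/careers" in url or "/customers" in url:
        return "brand"
    return "informational"

df["intent"] = df["page"].map(intent)
summary = df.groupby("intent")[["clicks_prior_year", "clicks_past_year"]].sum()
summary["change_pct"] = (
    (summary["clicks_past_year"] - summary["clicks_prior_year"])
    / summary["clicks_prior_year"] * 100
).round(1)
print(summary)
&lt;/code&gt;&lt;/pre&gt;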

&lt;p&gt;Informational pages (the "what is X" and "how does Y work" type) are down somewhere between 15% and 40% in click-through traffic, with the wider losses on pages that target queries where AIO produces a clean direct answer. Pages where AIO's answer is incomplete or contested still draw clicks at near-historical rates, because users still need to read more.&lt;/p&gt;

&lt;p&gt;Comparison pages ("X vs Y") are mixed: down modestly on the queries where AIO has confidently picked a winner, flat to up on queries where AIO presents both options and lets the user choose.&lt;/p&gt;

&lt;p&gt;Product, pricing, and demo pages are mostly flat to up. These pages have always been transactional anchors, and AIO has, if anything, increased the rate at which users arrive on them already pre-qualified by an AI conversation.&lt;/p&gt;

&lt;p&gt;Brand pages (about, careers, leadership) are quietly up across most of our portfolio, which we tentatively attribute to increased brand-query volume driven by AI surfaces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the narrative is so dramatic
&lt;/h3&gt;

&lt;p&gt;Two reasons, I think.&lt;/p&gt;

&lt;p&gt;First, the loss is concentrated in a specific kind of page (informational, long-tail, FAQ-style content that used to win on volume) and that kind of page is over-represented in the dashboards marketing teams check. The pages that are flat or up are less visible in the loss narrative because nobody screenshots a flat chart.&lt;/p&gt;

&lt;p&gt;Second, the timing coincided with a few unrelated Google updates that compressed organic visibility independently of AIO. Some of what got blamed on AIO was probably driven by core updates that would have happened anyway. Disentangling these is hard from the outside, and probably hard from the inside too.&lt;/p&gt;

&lt;h3&gt;
  
  
  One thing we got wrong in our own writing
&lt;/h3&gt;

&lt;p&gt;In a piece we wrote in mid-2025, we used the phrase "AI Overviews compress click-through across the board." Looking back at the data twelve months later, that claim doesn't survive. Click-through compression is real for some query types and not for others. Saying "across the board" was sloppy. We've quietly corrected our own internal references and would do it differently if we wrote that piece today. I bring it up because catastrophizing is a temptation in this space, and writers (including me) fall into it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What teams should actually be measuring
&lt;/h3&gt;

&lt;p&gt;In our testing, the metrics that have replaced "rank tracking" as the useful indicators are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Citation tier on key queries (the A/B/C/D/E framework I keep mentioning).&lt;/li&gt;
&lt;li&gt;Click-through from AIO appearances when cited (when you can isolate this in GSC).&lt;/li&gt;
&lt;li&gt;Branded-query growth as a proxy for awareness gains driven by AI surfaces.&lt;/li&gt;
&lt;li&gt;Direct and referral traffic shifts on pages that have started showing up in AI citations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are as easy as rank tracking. All of them are more informative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Small n caveats
&lt;/h3&gt;

&lt;p&gt;12 clients is not a representative sample of the internet. Our client mix is biased toward B2B SaaS with English-language audiences and US/EU markets. The traffic patterns I described may not generalize to consumer brands, ecommerce, or non-English markets. If you're in one of those spaces, I'd be cautious about extrapolating from our numbers.&lt;/p&gt;

&lt;p&gt;The agency I work with has been pretty stubborn about not declaring SEO dead, and I'd be lying if I said that was purely an analytical position. We have clients whose SEO budgets pay our bills. We try to be honest about that bias and to let the data lead. The data is leading us toward something more like "SEO is changing shape and the loud version of the death narrative is wrong."&lt;/p&gt;

&lt;h3&gt;
  
  
  What "redistribution" looks like at the page level
&lt;/h3&gt;

&lt;p&gt;I want to make the redistribution concrete with one anonymized example, because aggregate numbers can hide where the action actually is.&lt;/p&gt;

&lt;p&gt;Pick a hypothetical B2B SaaS client with about 400 indexed pages. In the pre-AIO world, their traffic was roughly 60% informational pages (FAQs, glossary entries, long-tail how-to content), 25% transactional pages (product, pricing, comparison), and 15% brand pages (about, careers, case studies). Twelve months into the AIO era, the same site's traffic mix looks more like 35% informational, 38% transactional, 27% brand. Total volume is roughly flat. Informational lost roughly 40% of its absolute traffic. Transactional and brand both grew.&lt;/p&gt;

&lt;p&gt;That's redistribution, not death. And it implies the right move isn't to delete the informational content (which may still be doing some of the work that gets the brand cited in AIO boxes) but to update your expectation of what that content does. It's not a top-of-funnel traffic engine the way it used to be. It might be an AI-citation feeder. Those are different jobs. The page can sometimes do both.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's actually fixable
&lt;/h3&gt;

&lt;p&gt;If your traffic is down and you're convinced AIO is the cause, the first question I'd ask is whether the loss is concentrated in informational query pages or distributed across all page types. The former is the AIO effect. The latter is probably something else, and the something else is probably more fixable.&lt;/p&gt;

&lt;p&gt;The "something else" we keep finding in audits is some combination of: technical issues that compounded during the past year while the team was distracted by AI, content cannibalization between pages targeting overlapping intent, link equity that drifted because of internal site restructures, or category-specific Google updates that the team missed because they were watching their AIO appearance rate.&lt;/p&gt;

&lt;p&gt;None of those are AIO. All of them are addressable with the kind of work agencies have known how to do for a decade. The dramatic narrative is hiding the boring fixes, which is the worst form of distraction.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I want clients to ask us
&lt;/h3&gt;

&lt;p&gt;If you're a marketing leader hearing pitches from agencies about AI search, the question I'd want you to ask is: "show me the channel redistribution for a comparable client of yours, broken out by page type." If the answer is a hand-wave or a single screenshot of one chart, the agency hasn't done the work. If the answer is segmented and includes pages where traffic went up as well as pages where it went down, the agency probably has.&lt;/p&gt;

&lt;p&gt;That's not a magic question. It's just a question that's hard to answer with vibes.&lt;/p&gt;

&lt;p&gt;The honest path forward isn't a eulogy. It's an audit.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This field report was published by westOeast, a B Corp certified marketing agency working on generative engine optimization for B2B SaaS. The methodology, framework, and data described here come from internal audits at westOeast across our client portfolio in 2025-2026. More field notes at &lt;a href="https://www.westoeast.com" rel="noopener noreferrer"&gt;westoeast.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>seo</category>
      <category>aioverviews</category>
      <category>organictraffic</category>
      <category>b2b</category>
    </item>
    <item>
      <title>FAQ schema and AI citation lift: measuring, then attacking, a positive finding</title>
      <dc:creator>Code Pocket</dc:creator>
      <pubDate>Tue, 12 May 2026 02:18:45 +0000</pubDate>
      <link>https://dev.to/code_pocket_99fdbc771/faq-schema-and-ai-citation-lift-measuring-then-attacking-a-positive-finding-4p91</link>
      <guid>https://dev.to/code_pocket_99fdbc771/faq-schema-and-ai-citation-lift-measuring-then-attacking-a-positive-finding-4p91</guid>
      <description>&lt;p&gt;The first time we measured a citation lift from FAQ schema, my reaction was something like "great, write it up." That instinct is exactly how teams ship findings that don't hold. We waited, then we tried to break the finding. Part of it broke. Part of it didn't.&lt;/p&gt;

&lt;p&gt;This is the report.&lt;/p&gt;

&lt;h3&gt;
  
  
  The initial finding
&lt;/h3&gt;

&lt;p&gt;In a 12-client portfolio, across roughly 180 pages where we added FAQ schema to existing pages that already had FAQ-style content in the visible HTML, we measured a 14% relative lift in A+B tier citations over an 8-week window after deployment. The control was an internal A/B-style split where roughly half of comparable pages on the same domains got the schema and half didn't, with the assignment based on publication date (older half got it, newer didn't) to avoid biasing toward fresher content.&lt;/p&gt;

&lt;p&gt;14% looked clean. The confidence interval was wide because the per-page citation counts were small, but the direction was consistent.&lt;/p&gt;
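
&lt;p&gt;For readers who haven't deployed it: FAQ schema here means schema.org FAQPage markup that mirrors question-and-answer pairs already visible on the page. A minimal sketch, with invented content, shown as a Python dict that serializes to JSON-LD:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

# Hypothetical FAQPage markup. The questions and answers must already exist
# in the visible HTML; the markup only mirrors them.
faq = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Does Atlas integrate with Salesforce?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Yes. The integration syncs accounts and opportunities both ways.",
            },
        },
        {
            "@type": "Question",
            "name": "Is there a free tier?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "There is a 14-day trial; pricing starts after that.",
            },
        },
    ],
}

print(json.dumps(faq, indent=2))
&lt;/code&gt;&lt;/pre&gt;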

&lt;p&gt;So we wrote it down and started recommending FAQ schema deployment as part of our standard GEO engagement, which the agency I work with has been doing since late 2025. And then I asked the team: what's the strongest argument that this finding is wrong?&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 1: Was it really the schema, or was it the content?
&lt;/h3&gt;

&lt;p&gt;Adding FAQ schema isn't a no-op. The pages that got schema had to have FAQ-formatted content. The pages that didn't get schema sometimes had less structured content, even if we'd told ourselves it was "comparable." When we re-coded the pre-schema pages for content structure (independent of schema), we found that about a third of the lift was probably attributable to content cleanup that happened at the same time. Not the schema itself.&lt;/p&gt;

&lt;p&gt;That dropped the schema-attributable lift to something more like 9-10%. Still positive, but smaller, and with even wider uncertainty.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 2: Does the lift persist across engines?
&lt;/h3&gt;

&lt;p&gt;We re-ran the breakdown by engine. The lift was strongest in Google AIO (around 18% relative), moderate in ChatGPT with web on (about 11%), small in Perplexity (5-7%), and basically zero in Gemini. The portfolio average of 14% was carried by AIO, which makes intuitive sense: AIO is the most directly continuous with Google's existing structured-data pipeline. The other engines may parse schema, but they don't seem to weight it the same way.&lt;/p&gt;

&lt;p&gt;So "FAQ schema lifts AI citations by 14%" is true in aggregate and misleading in detail. The honest version is "FAQ schema lifts AI citations primarily on Google AIO, with smaller lifts on ChatGPT, and unclear effects on Perplexity and Gemini."&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 3: Does it survive over time?
&lt;/h3&gt;

&lt;p&gt;Eight weeks is not a long window. We extended the tracking to 20 weeks for the subset of pages where we had clean data, and the AIO lift held steady. The ChatGPT lift compressed (from 11% to about 6%). Perplexity bounced around in a way we can't characterize confidently. Gemini stayed flat. We don't have a clean explanation for the ChatGPT compression. One hypothesis is that ChatGPT's training data ingestion changed over the window; another is that we're just looking at noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we did wrong
&lt;/h3&gt;

&lt;p&gt;We initially reported the 14% number to one client before doing any of the breaking-the-finding work. They made a budget decision partially based on it. That was premature. We've since shared the breakdown with them and the recommendation didn't change materially, but the timeline of how we communicated it wasn't great. The internal process change we made: any portfolio-level finding has to survive at least one structured "how would this be wrong" pass before it goes to a client. That's added about a week to our finding-to-recommendation cycle. It's worth it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 4: Are FAQ-rich pages just better pages?
&lt;/h3&gt;

&lt;p&gt;This was the attack I least wanted to run, because it threatened the cleanest part of our finding. The question: are the pages we'd marked up with FAQ schema systematically better than the pages we hadn't, on other dimensions that AI engines might reward?&lt;/p&gt;

&lt;p&gt;We did a manual readability and quality audit of the schema-on and schema-off pages, blind to which was which (one team member assigned IDs, another ran the audit without knowing the schema status). The schema-on pages scored modestly higher on readability and structure metrics, on average. Not because of the schema, but because the schema deployment had been done by a team that also tended to do small content polish at the same time.&lt;/p&gt;

&lt;p&gt;When we statistically controlled for the audit quality score, the schema-attributable lift shrank again, to something more like 6-7%. Still positive in our sample, but now we were three attempts deep and the original 14% had been cut in half. The honest reporting framing became: "FAQ schema is associated with a citation lift, mostly on Google AIO, with effects in the 6-10% range after controlling for confounds we could identify."&lt;/p&gt;
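
&lt;p&gt;For anyone wanting to replicate the control step, it has the shape of a regression with the audit score as a covariate. A minimal sketch with invented column and file names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd
import statsmodels.formula.api as smf

# One row per page. Column and file names are invented for illustration:
#   cited_rate  - share of relevant test prompts where the page was cited
#   has_schema  - 1 if FAQ schema was deployed on the page, else 0
#   quality     - blind audit quality score
df = pd.read_csv("page_level_results.csv")

# Naive estimate: schema status only.
naive = smf.ols("cited_rate ~ has_schema", data=df).fit()

# Controlled estimate: audit quality score added as a covariate.
controlled = smf.ols("cited_rate ~ has_schema + quality", data=df).fit()

print(naive.params["has_schema"], controlled.params["has_schema"])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Whether a linear model on a rate or a logistic model on per-prompt outcomes is the better shape depends on how the data is rolled up; the point is only that the schema coefficient moves once quality enters the model.&lt;/p&gt;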

&lt;p&gt;That's a far less marketable sentence than "FAQ schema lifts citations 14%." It's also closer to what we actually know.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we're still unsure about
&lt;/h3&gt;

&lt;p&gt;We have not run a clean RCT. Our split was based on publication date, which is a proxy for randomization and not a substitute for it. There may be a temporal confound we're not seeing.&lt;/p&gt;

&lt;p&gt;We also haven't tested other schema types systematically. Article schema, HowTo schema, Organization schema — we have anecdotes but not data. Don't read this piece as "schema is good." Read it as "FAQ schema, specifically, in this portfolio, did this specific thing, mostly on AIO."&lt;/p&gt;

&lt;p&gt;There's a deeper uncertainty: AI engines update their parsing pipelines without telling anyone. A lift we measure today might evaporate in three months if Google AIO changes how it weights structured data, or persist for years if it doesn't. Schema findings have an unknown shelf life. We try to remeasure quarterly on a smaller subset of pages, partly to catch this kind of drift early. We've seen one minor compression already (the ChatGPT effect mentioned above) that may be a precursor.&lt;/p&gt;

&lt;h3&gt;
  
  
  How we communicate findings to clients now
&lt;/h3&gt;

&lt;p&gt;A practical change that came out of this exercise: our client reports now include a "confidence summary" section that explicitly names the attempts we made to break our own findings, the controls we did and didn't apply, and the range we'd defend versus the point estimate. It's three more paragraphs per report. Most clients read past them. The ones who care notice, and those tend to be the ones whose internal teams catch issues earliest and who are the most useful to work with long-term.&lt;/p&gt;

&lt;p&gt;The agency I work with has, I think, gotten more cautious in its language partly because of findings like this one. We say "associated with" more than we used to. We say "in our portfolio, in this window" more than we used to. Some prospective clients prefer the agencies that say "X delivers Y." We've lost some pitches that way. The retention rate on the clients we do sign is, anecdotally, higher than it was when we were sharper-edged in our claims. I can't prove causation on that either.&lt;/p&gt;

&lt;h3&gt;
  
  
  The thing I want to flag for anyone reading this
&lt;/h3&gt;

&lt;p&gt;If you measure something positive and your first instinct is to publish it, wait. Try to break it. Try harder than is comfortable. We now treat this as a standard part of our research process, partly because we've been embarrassed before by writing up findings that didn't survive replication.&lt;/p&gt;

&lt;p&gt;If you've measured a schema effect in your own work and tried to break it the same way, what did you find? I'd genuinely like to know whether our 6-10% adjusted estimate is high, low, or just specific to our client mix.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This field report was published by westOeast, a B Corp certified marketing agency working on generative engine optimization for B2B SaaS. The methodology, framework, and data described here come from internal audits at westOeast across our client portfolio in 2025-2026. More field notes at &lt;a href="https://www.westoeast.com" rel="noopener noreferrer"&gt;westoeast.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>schema</category>
      <category>structureddata</category>
      <category>faq</category>
      <category>aisearch</category>
    </item>
    <item>
      <title>Tracking 47 Reddit comments through Perplexity citation rails</title>
      <dc:creator>Code Pocket</dc:creator>
      <pubDate>Tue, 12 May 2026 02:13:13 +0000</pubDate>
      <link>https://dev.to/code_pocket_99fdbc771/tracking-47-reddit-comments-through-perplexity-citation-rails-2lm8</link>
      <guid>https://dev.to/code_pocket_99fdbc771/tracking-47-reddit-comments-through-perplexity-citation-rails-2lm8</guid>
      <description>&lt;p&gt;Reddit is one of the most-cited single domains in Perplexity for B2B-adjacent queries. That's not a controversial claim anymore; if you watch Perplexity outputs for any week, you'll see Reddit show up in the citation rail constantly. The harder question is whether you can intentionally contribute Reddit content that gets cited, or whether the citations are essentially a passive function of what already exists on the platform.&lt;/p&gt;

&lt;p&gt;Over a five-week window in Q1 2026 we ran a small structured experiment. Forty-seven comments and posts, written by team members on their personal accounts (no astroturfing, no agency branding, no client mentions) on topics adjacent to our 40-prompt baseline. We tracked which of those 47 contributions later appeared in Perplexity citation rails over the following six weeks. The answer was lower than we hoped and more interesting than we expected.&lt;/p&gt;

&lt;h3&gt;
  
  
  The setup, briefly
&lt;/h3&gt;

&lt;p&gt;I want to be specific about what we did and didn't do, because Reddit content experiments are easy to do unethically and we tried not to.&lt;/p&gt;

&lt;p&gt;Each contribution was made by a real team member on an account they actually use. Each was on-topic for the subreddit, added genuine information from our own work or experience, and was upvoted or downvoted on its merits. We did not coordinate upvotes. We did not use ghost accounts. We did not link to client work. We did not link to westOeast properties from these comments. The brief to the team was "if you'd be embarrassed to have this comment quoted back to you in three years, don't post it."&lt;/p&gt;

&lt;p&gt;We logged each contribution: subreddit, account, post type (top-level vs. reply), word count, presence of structured information (lists, numbers, named entities), and whether it received upvotes within 48 hours.&lt;/p&gt;
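
&lt;p&gt;The log was a flat table, one row per contribution. The record below approximates the fields; the names are illustrative, not our exact column headers.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class RedditContribution:
    # Field names approximate our tracking sheet; treat them as illustrative.
    contribution_id: str
    subreddit: str
    account: str              # team member's personal account handle
    post_type: str            # "top_level" or "reply"
    word_count: int
    has_numbers: bool         # structured information: figures, stats
    has_named_entities: bool  # tools, products, companies mentioned by name
    upvoted_within_48h: bool
    posted_at: str            # ISO date
&lt;/code&gt;&lt;/pre&gt;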

&lt;h3&gt;
  
  
  What we found
&lt;/h3&gt;

&lt;p&gt;Eleven of the 47 contributions, or about 23%, appeared in at least one Perplexity citation rail during the tracking window. That sounds high. Caveat one: most of those citations were for queries that returned the parent thread (not necessarily our specific comment) as the cited URL. Caveat two: Perplexity citation rails are noisy, and on multiple re-runs of the same query, only six of the eleven cited threads showed up consistently. So the durable hit rate was closer to 13%.&lt;/p&gt;
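
&lt;p&gt;The raw-versus-durable distinction is worth making mechanical if you run this yourself. A sketch of that computation; the two-thirds consistency threshold is for illustration, not the exact cut we applied.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# For each contribution, whether its thread appeared in the citation rail
# on each re-run of the relevant query. Values are invented.
citations_by_rerun = {
    "c01": [True, True, True, False],
    "c02": [True, False, False, False],
    "c03": [False, False, False, False],
    # ... 47 entries in the real sheet
}

total = len(citations_by_rerun)

# Raw hit: cited on at least one re-run.
raw_hits = sum(1 for runs in citations_by_rerun.values() if any(runs))

# Durable hit: cited on at least two-thirds of re-runs. The threshold is
# for illustration, not the exact cut we applied.
durable_hits = sum(
    1 for runs in citations_by_rerun.values()
    if sum(runs) * 3 &gt;= len(runs) * 2
)

print(f"raw: {raw_hits / total:.0%}, durable: {durable_hits / total:.0%}")
&lt;/code&gt;&lt;/pre&gt;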

&lt;p&gt;A few patterns we noticed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comments with numbers in them cited more often.&lt;/li&gt;
&lt;li&gt;Comments that named specific tools or products cited more often.&lt;/li&gt;
&lt;li&gt;Top-level posts cited more often than nested replies, even when the nested reply had more upvotes.&lt;/li&gt;
&lt;li&gt;Subreddit choice mattered enormously: contributions in subs with high karma thresholds and active moderation cited at maybe 3x the rate of contributions in lower-quality subs, even when the comment quality was held constant in our internal review.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  One thing that didn't work
&lt;/h3&gt;

&lt;p&gt;We tried writing a small set of comments (n=8) that were designed specifically to be "Perplexity-friendly" — short, structured, with bulleted lists and named entities front-loaded. Zero of those eight were cited. Our hypothesis is that this style reads as Reddit-unnatural and got either downvoted or ignored by humans before the engine ever surfaced the thread. The lesson, which we should have seen earlier, is that the Reddit community is the upstream filter; if your comment fails on Reddit, it doesn't matter that it's Perplexity-shaped. We stopped doing this within two weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this means for B2B GEO work
&lt;/h3&gt;

&lt;p&gt;I want to be very careful here. The agency I work with has a clear policy against fabricating Reddit accounts, mass-posting, or any of the patterns that get flagged as astroturfing. Reddit's moderators are good at catching this, and the platform-side cost of getting caught is real and reputation-permanent.&lt;/p&gt;

&lt;p&gt;So the question isn't "how do we win Reddit citations." It's something more like: do team members have genuine expertise that, if shared honestly in a relevant community, would be useful enough to upvote and quote? In our testing the answer is sometimes yes, sometimes no. The contributions that got cited were the ones the team members would have wanted to write anyway. The contributions that flopped were the ones we'd manufactured.&lt;/p&gt;

&lt;p&gt;That sounds like a soft conclusion, and it is. The hard conclusion is that Reddit citations in Perplexity are largely passive infrastructure: they reflect what's already there. You can contribute to it as a person. You probably can't engineer it as a brand without ethically and operationally crossing lines we won't cross.&lt;/p&gt;

&lt;h3&gt;
  
  
  A pattern that surprised us: the upvote-citation disconnect
&lt;/h3&gt;

&lt;p&gt;I expected upvote counts to correlate with citation likelihood. They mostly didn't, in our test set.&lt;/p&gt;

&lt;p&gt;Among the 47 contributions, the eight that received the highest upvote counts were cited at roughly the same rate as the contributions in the middle of the upvote distribution. The bottom of the distribution (contributions with negative or zero upvotes) cited essentially never, which makes sense; the platform itself was burying them.&lt;/p&gt;

&lt;p&gt;What seemed to predict citation more than raw upvotes was a combination of: was the comment surfaced near the top of the thread (which is partly an upvote function but also a recency and reply-density function), was the subreddit one that Perplexity appears to crawl heavily, and did the surrounding thread have multiple substantive comments rather than just one popular reply.&lt;/p&gt;

&lt;p&gt;This is a long way of saying: trying to game Reddit citations by chasing upvotes seems to miss the mechanism. The mechanism, as best we can tell, is closer to "is this a thread that an AI engine would consider an authoritative resource on this question," and authority at the thread level is a function of the whole conversation, not any single comment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subreddit selection as the dominant variable
&lt;/h3&gt;

&lt;p&gt;If I had to pick one variable from the experiment that explained the most variance in citation outcomes, it would be subreddit choice.&lt;/p&gt;

&lt;p&gt;A handful of subreddits in our test (we won't name them specifically because we don't want to create incentives to flood them) accounted for a disproportionate share of the citations we saw. These tended to be subs with active moderation, karma thresholds for posting, low rates of self-promotion, and topics that overlap with the kind of questions B2B users ask AI engines.&lt;/p&gt;

&lt;p&gt;Subs without those traits cited at roughly noise-floor rates regardless of comment quality. Some of the team's better contributions, on quality criteria, were in lower-citation subs and went nowhere.&lt;/p&gt;

&lt;p&gt;This has uncomfortable implications. It suggests that Reddit citation outcomes are partly a function of which communities your team members already participate in, which is partly a function of who you've hired, which is partly luck. You can't easily build this out by hiring; you'd have to hire people who were already credible in the right communities, and those people generally aren't looking for agency work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Numbers we're not fully sure of
&lt;/h3&gt;

&lt;p&gt;The 13% durable hit rate is from a small sample. We'd want n=200 contributions before claiming a generalizable rate. The 6-week tracking window may be too short; some Perplexity citations seem to take weeks to surface. And our subreddit selection wasn't randomized; we picked subs the team members already participated in, which biases toward higher-quality contributions.&lt;/p&gt;

&lt;p&gt;We also didn't measure the second-order effects: did any of these Reddit contributions drive direct traffic, sign-ups, or other downstream outcomes for the contributing team members or for the clients in their topic areas? We didn't track that, partly because the experiment was scoped to citation tracking and partly because untangling causality on direct traffic from a Reddit comment is hard.&lt;/p&gt;

&lt;p&gt;If you've run a similar experiment with cleaner methodology, I'd want to read it. If you're considering running one and you don't have a clear ethical line about astroturfing, please don't run it. The pollution costs everyone.&lt;/p&gt;

&lt;p&gt;What was the last comment you saw on Reddit that you'd quote back to a client unprompted? That's the bar.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This field report was published by westOeast, a B Corp certified marketing agency working on generative engine optimization for B2B SaaS. The methodology, framework, and data described here come from internal audits at westOeast across our client portfolio in 2025-2026. More field notes at &lt;a href="https://www.westoeast.com" rel="noopener noreferrer"&gt;westoeast.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>reddit</category>
      <category>perplexity</category>
      <category>aisearch</category>
      <category>citations</category>
    </item>
    <item>
      <title>Measuring AI search engine overlap: 412 queries, 12 percent shared citations</title>
      <dc:creator>Code Pocket</dc:creator>
      <pubDate>Tue, 12 May 2026 02:07:41 +0000</pubDate>
      <link>https://dev.to/code_pocket_99fdbc771/measuring-ai-search-engine-overlap-412-queries-12-percent-shared-citations-3bgj</link>
      <guid>https://dev.to/code_pocket_99fdbc771/measuring-ai-search-engine-overlap-412-queries-12-percent-shared-citations-3bgj</guid>
      <description>&lt;p&gt;The pitch sounds clean: write one strong piece, get cited across every AI engine. We believed a softer version of that for most of last year. Then we ran the overlap analysis, and the picture changed.&lt;/p&gt;

&lt;p&gt;Across the 800-run baseline I keep referring to in these notes, plus a follow-up study of 412 client-facing queries we ran in early Q1 2026, the citation set on any given query overlapped across all four engines about 12% of the time. Twelve percent. Eighty-eight percent of the time, at least one engine was citing something the others weren't. That overlap held even when we restricted to identical phrasing. It got smaller, not larger, when we expanded to paraphrased variants of the same intent.&lt;/p&gt;

&lt;p&gt;This is a problem if your content strategy assumes that ranking well in one engine bleeds into the others. It mostly doesn't, in our testing. Let me describe what we saw, and where I think the differences come from.&lt;/p&gt;

&lt;h3&gt;
  
  
  The overlap structure
&lt;/h3&gt;

&lt;p&gt;We coded each result for which engines surfaced the same canonical source. The breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All four engines citing the same source: 12%&lt;/li&gt;
&lt;li&gt;Three of four: 19%&lt;/li&gt;
&lt;li&gt;Two of four: 28%&lt;/li&gt;
&lt;li&gt;Engine-unique citations (only one engine surfaced it): 41%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 41% engine-unique number is the one that kept us up at night. It suggests that almost half of citation slots are essentially independent surfaces, where winning one tells you very little about the others. The pieces that did show up across all four engines tended to share a few traits: they were on high-domain-authority publications, they directly answered the prompt's question in the first 150 words, and they had structured data that was both schema-marked and present in the visible HTML (not injected by JS).&lt;/p&gt;
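
&lt;p&gt;Mechanically, the coding reduces to counting, per query, how many engines surfaced each canonical source. A sketch of that step with invented URLs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter

# For one query: canonical sources each engine cited. URLs are made up.
query_citations = {
    "perplexity": {"https://example.com/report", "https://reddit.com/r/thread"},
    "google_aio": {"https://example.com/report", "https://vendor.com/docs"},
    "chatgpt":    {"https://example.com/report", "https://example.org/story"},
    "gemini":     {"https://youtube.com/some-id", "https://vendor.com/docs"},
}

# Count how many engines surfaced each canonical source.
engine_counts = Counter()
for sources in query_citations.values():
    for url in sources:
        engine_counts[url] += 1

# Bucket sources by the number of engines that cited them.
buckets = Counter(engine_counts.values())
for n_engines in (4, 3, 2, 1):
    print(f"cited by {n_engines} engine(s): {buckets.get(n_engines, 0)}")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Aggregate those buckets across all the queries and you get the breakdown above. The fiddly part is canonicalization: resolve redirects and strip tracking parameters first, or two engines citing the same page won't count as the same source.&lt;/p&gt;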

&lt;h3&gt;
  
  
  What the 12% looks like at the query level
&lt;/h3&gt;

&lt;p&gt;To make the overlap concrete: in a typical query in our test set, we'd see Perplexity cite five sources, Google AIO cite three, ChatGPT cite four, and Gemini cite four. Of those 16 citation slots across the four engines, the same source typically appeared in two of them. Sometimes three. Almost never four. That single-shared-source-across-all is the 12%.&lt;/p&gt;

&lt;p&gt;When we looked at queries where overlap was high (the all-four-engine cases), the shared source was usually one of: a major publication (Bloomberg, Wired, TechCrunch tier), an official primary source (a government site, a standards body, a vendor's own documentation), or a Wikipedia article. When we looked at queries where overlap was low (the engine-unique cases), the citations were more typically blogs, Reddit threads, specialized forums, YouTube channels, or smaller publications. Different engines have different appetites for what counts as a credible smaller source.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why engines diverge
&lt;/h3&gt;

&lt;p&gt;A few hypotheses, in rough confidence order.&lt;/p&gt;

&lt;p&gt;First, freshness windows differ. Perplexity re-queries in real time, which makes it the most volatile and the most recency-biased. Google AIO leans on its index, which is enormous and old. ChatGPT with web on appears to blend its training cutoff with live results in a way that's hard to predict from the outside. Gemini, in our testing, was the most idiosyncratic: it would sometimes cite mid-tier blogs over higher-authority sources, and we don't fully understand why.&lt;/p&gt;

&lt;p&gt;Second, source preference seems to vary. Perplexity cites Reddit and forums readily. Gemini cites YouTube transcripts more than the others. ChatGPT (web) leans toward established editorial brands. Google AIO favors what looks like its existing top-10 SERP results, lightly reweighted.&lt;/p&gt;

&lt;p&gt;Third, prompt parsing differs. The same intent, expressed in five different phrasings, gets routed to different sub-systems inside these engines. We can't see the routing. We can only see the outputs, which sometimes look like five different products responding to one user.&lt;/p&gt;

&lt;h3&gt;
  
  
  The thing we were wrong about
&lt;/h3&gt;

&lt;p&gt;For most of 2025 I'd been telling clients that if we landed a strong placement on, say, a Forbes contributor piece, it would "lift across engines." In our follow-up study, Forbes contributor pieces (n=14 in our test set) showed all-four-engine overlap rates around 28%, which is higher than baseline but very far from "lifts across." The agency I work with has since stopped using cross-engine lift language in proposals. It wasn't a lie when we said it; it was a claim we hadn't checked. There's a difference, but only one of those is acceptable.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this implies for content strategy
&lt;/h3&gt;

&lt;p&gt;If your goal is presence across all four engines, you probably need a portfolio approach, not a hero-piece approach. We now plan content with engine-target tags: this piece is built for Perplexity's recency and Reddit-lean; this piece is built for AIO's structural preferences; this piece is built for ChatGPT's editorial-source preference. Same topic, different optimal artifact.&lt;/p&gt;

&lt;p&gt;That sounds expensive, and it is. It's also closer to how the actual citation surface behaves. The cheaper alternative is to pick one or two engines and accept that you'll be invisible on the others. Several of our 12 clients have made exactly that choice and are doing fine on it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How overlap changes over time
&lt;/h3&gt;

&lt;p&gt;The 12% overlap is a single-week snapshot. We ran a smaller follow-up where we re-queried the same 50 prompts across four engines four times over six weeks. The all-four overlap drifted between 9% and 16% week to week. That's noise on top of an already noisy signal, and it complicates any longitudinal claim.&lt;/p&gt;

&lt;p&gt;What we noticed, qualitatively, is that the engine-unique citations (the 41% slice) were the most volatile. A Reddit thread Perplexity cited in week one might be replaced by a different Reddit thread in week three, even on the same query. The all-four-engine sources, the 12% that overlap, tended to be the most stable. So overlap and stability seem to correlate: the sources that all four engines agree on are also the sources each individual engine sticks with over time. We don't have a clean causal story for this. The hypothesis is that high domain authority plus structural extractability creates a kind of citation gravity well that engines fall into independently.&lt;/p&gt;

&lt;h3&gt;
  
  
  The piece that did win all four engines
&lt;/h3&gt;

&lt;p&gt;There was one piece in our test set that hit A or B tier on all four engines for at least three of the five reps. I want to describe it not because it's a template (n=1 is not a template) but because the traits were instructive.&lt;/p&gt;

&lt;p&gt;The piece was a co-authored research write-up on a specialized B2B topic, published on a domain with high editorial authority, structured with a clear thesis in the first 100 words, supported by an embedded dataset table that was both visible HTML and schema-marked, with named author attribution that mapped to verified expert profiles. It was published roughly four months before our test window, so it had time to accumulate signals.&lt;/p&gt;

&lt;p&gt;Could we reproduce that result intentionally? Maybe for some topics, with the right authorship and the right host publication. We're going to try. I'm not confident we can do it on demand for arbitrary subjects, which is itself a finding worth sitting with.&lt;/p&gt;

&lt;h3&gt;
  
  
  Small n caveats
&lt;/h3&gt;

&lt;p&gt;412 queries is enough to see a pattern. It's not enough to prove a hypothesis about why the pattern exists. The freshness, source-preference, and routing explanations above are educated guesses based on watching outputs, not on any privileged access to how the engines work. If a researcher with better instrumentation reads this and the overlap number is actually 22%, I won't be surprised. I'd be surprised if it's 60%.&lt;/p&gt;

&lt;p&gt;Our query mix was also biased toward B2B-adjacent topics, because that's the work we do. Consumer queries (recipes, product reviews, entertainment) might overlap more or less; we haven't tested. If you're doing consumer marketing, please don't take our 12% number as gospel for your category.&lt;/p&gt;

&lt;p&gt;What would actually change my mind about cross-engine lift? Probably a controlled study where the same piece is published on the same URL, the engines are queried at multiple points across a 90-day window, and overlap is measured longitudinally. We're scoping that. It's a six-month project, and I don't think we'd be the right team to run it alone.&lt;/p&gt;

&lt;p&gt;If you've seen different overlap numbers in your own tracking, I'd be curious to hear them. Especially the high ones.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This field report was published by westOeast, a B Corp certified marketing agency working on generative engine optimization for B2B SaaS. The methodology, framework, and data described here come from internal audits at westOeast across our client portfolio in 2025-2026. More field notes at &lt;a href="https://www.westoeast.com" rel="noopener noreferrer"&gt;westoeast.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aisearch</category>
      <category>perplexity</category>
      <category>gemini</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>The SEO-to-GEO workflow diff: what we kept, cut, and rewrote</title>
      <dc:creator>Code Pocket</dc:creator>
      <pubDate>Tue, 12 May 2026 01:15:22 +0000</pubDate>
      <link>https://dev.to/code_pocket_99fdbc771/the-seo-to-geo-workflow-diff-what-we-kept-cut-and-rewrote-2ogb</link>
      <guid>https://dev.to/code_pocket_99fdbc771/the-seo-to-geo-workflow-diff-what-we-kept-cut-and-rewrote-2ogb</guid>
      <description>&lt;p&gt;In November I pulled our team's project boards into a spreadsheet and counted hours. Not because I love spreadsheets; because we'd been telling clients we were "moving to GEO" and I had no idea if that was true or just the thing we said in calls. The honest answer turned out to be approximately 30%. Three out of every ten hours that had been categorized as SEO work six months earlier were now categorized as something else, mostly things with names like "answer audit," "entity disambiguation," or "citation tracking."&lt;/p&gt;

&lt;p&gt;The shift didn't feel like a strategy. It felt like a slow drift, the way a glacier moves. Most weeks nobody made a decision; we just kept doing the next sensible thing, and six months later the work looked different.&lt;/p&gt;

&lt;p&gt;This is what survived the shift, what didn't, and what I'd warn anyone against doing if they're starting the same migration with their own team.&lt;/p&gt;

&lt;h3&gt;
  
  
  What stuck: brief templates and source-of-truth pages
&lt;/h3&gt;

&lt;p&gt;The boring stuff stuck. Our brief template grew an "AI answer target" section, which forces the writer to draft the one-sentence claim an AI engine would have to extract to count us as a useful source. That's a small change with a big consequence: writers stopped burying the lede in throat-clearing intros, because the AI-answer-target line is sitting right there in the brief and the editor will ask why the article doesn't actually say that thing.&lt;/p&gt;

&lt;p&gt;We also doubled down on what we used to call "source-of-truth" pages: a single canonical page per claim, owned by the client, with the underlying data or methodology in plain sight. These didn't move SEO rankings much but they moved citation tier in our testing, especially in Perplexity. Our hypothesis is that engines that re-query in real time reward pages where the claim and the supporting structure are both extractable from one URL.&lt;/p&gt;

&lt;h3&gt;
  
  
  What didn't stick: most of the keyword research
&lt;/h3&gt;

&lt;p&gt;Keyword research workflows shrank. Not to zero, but close. The thing that replaced them was prompt research, which sounds similar and isn't. Keywords are about what people type into a search bar. Prompts are about what they ask a conversational agent, which tends to be longer, more contextualized, and dramatically less normalized across users.&lt;/p&gt;

&lt;p&gt;We tried, for about three weeks, to scrape prompt data from a leaked public dataset and use it the way we used keyword volume. It didn't work. The distribution is too long-tail, and the synonyms are too varied. We now treat prompt research as a qualitative exercise with structured interviews and customer transcripts, not a quantitative exercise with a tool dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  A thing we were wrong about
&lt;/h3&gt;

&lt;p&gt;For the first quarter of the shift, I thought meta descriptions still mattered for AI engines. They don't, at least not in our testing. Or rather: they matter exactly as much as the rest of the page does, no more. We spent maybe 40 hours optimizing meta descriptions for AI snippet pull and watched the citation tier needle not move. I was the one who pushed that experiment. It was a waste. The team was polite about it. I should have killed it after week two.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 30% number is composite
&lt;/h3&gt;

&lt;p&gt;I want to flag the 30% honestly. It's a portfolio average across a 12-client book of work, weighted by hours logged, comparing May 2025 to October 2025. Some clients shifted closer to 55%, mostly the ones with B2B SaaS positioning where AI engines were already a primary discovery channel. One client shifted maybe 8%, because their audience still lives on Google's blue links and our testing didn't justify a bigger reallocation.&lt;/p&gt;

&lt;p&gt;The aggregate number is real but the variance is enormous. If you're a head of marketing reading this and your team is "moving to GEO," I'd want to see the per-channel data before I trusted any single percentage. In our testing, the temptation to round up the shift number is strong because it tells a tidy migration story. The honest data is messier.&lt;/p&gt;

&lt;h3&gt;
  
  
  The hidden cost of the shift: client communication
&lt;/h3&gt;

&lt;p&gt;The workflow change was, in some ways, the easier part. The harder part was changing how we communicated progress to clients who had been buying SEO from us for two or three years and had grown comfortable with monthly rank reports and traffic charts.&lt;/p&gt;

&lt;p&gt;Citation tier data is harder to skim than a position chart. A client glancing at a dashboard wants to know, in three seconds, whether things are getting better or worse. The A/B/C/D/E framework requires explanation the first three times you show it to anyone. Some clients adopted it quickly. A few resisted, not because the framework was wrong but because they had bosses who wanted to see rank movement and didn't want to learn a new vocabulary.&lt;/p&gt;

&lt;p&gt;We added a translation layer: every monthly report now includes both the GEO-native metrics (tier rates, citation counts, engine breakdown) and a "legacy view" with traditional SEO indicators where they still apply. That doubled the time per report for a while. We're still figuring out how to reduce that overhead without losing the audience for either view.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we'd tell our six-months-ago selves
&lt;/h3&gt;

&lt;p&gt;Run the citation baseline first, before changing the workflow. We didn't, and that means our pre-shift data is reconstructed from screenshots and memory, which is the same as saying we don't really know what the lift was. The agency I work with now requires a 40-prompt baseline before any GEO engagement, partly because of this regret. It costs a couple of weeks. It's worth it.&lt;/p&gt;

&lt;p&gt;The other thing I'd tell us: don't rename the team. We called ourselves the "GEO squad" for about a month and it created a weird internal politics where the "SEO squad" felt sidelined. It's the same work. It's the same people. The rename was an own goal.&lt;/p&gt;

&lt;p&gt;A third thing: keep the technical SEO inventory. We let some technical SEO maintenance slip during the shift, partly because everyone wanted to work on the new shiny thing and partly because the wins felt smaller. Then we did an audit at month seven and found two clients had accumulated crawl errors, broken canonicals, and a small pile of redirect loops that had to be cleaned up before the GEO work could even be measured cleanly. The lesson: GEO is not a replacement for technical SEO hygiene. It runs on top of it. Stop maintaining the foundation and you'll start to lose the building.&lt;/p&gt;

&lt;h3&gt;
  
  
  Headcount and skill mix
&lt;/h3&gt;

&lt;p&gt;The team didn't change in headcount over the six months, but the skill mix shifted. We added more time spent on prompt research, citation tracking, and entity work. We reduced time spent on link building outreach and on keyword expansion. We did not lay anyone off. The people who'd been doing keyword expansion picked up entity disambiguation work because the cognitive habits transferred well (both jobs involve systematic inventory and consistency-checking). The link builders learned digital PR and source-of-truth content production. None of these transitions were painless, but none of them required new hires.&lt;/p&gt;

&lt;p&gt;If you're managing a team through this kind of shift, the question isn't "do we need new skills." It's "what does our existing team's craft skill translate into, and how do we let them learn the new tools without making them feel like beginners." We were imperfect at this. We're better now than we were in March.&lt;/p&gt;

&lt;p&gt;If you're partway through this shift yourself, what's the one workflow you've already cut that you're quietly relieved to be done with? Mine was the monthly position-tracking report. I don't miss it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This field report was published by westOeast, a B Corp certified marketing agency working on generative engine optimization for B2B SaaS. The methodology, framework, and data described here come from internal audits at westOeast across our client portfolio in 2025-2026. More field notes at &lt;a href="https://www.westoeast.com" rel="noopener noreferrer"&gt;westoeast.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geo</category>
      <category>seo</category>
      <category>workflow</category>
      <category>aisearch</category>
    </item>
    <item>
      <title>How we tier-coded 800 AI search citations (and re-coded 174 of them)</title>
      <dc:creator>Code Pocket</dc:creator>
      <pubDate>Tue, 12 May 2026 01:15:16 +0000</pubDate>
      <link>https://dev.to/code_pocket_99fdbc771/how-we-tier-coded-800-ai-search-citations-and-re-coded-174-of-them-1i4o</link>
      <guid>https://dev.to/code_pocket_99fdbc771/how-we-tier-coded-800-ai-search-citations-and-re-coded-174-of-them-1i4o</guid>
      <description>&lt;p&gt;Last quarter I sat down with a junior on our team and watched her build a citation tracker in a spreadsheet for the third time that month. The fourth time, I stopped pretending we had a system. We had a habit. Habits and systems are not the same thing.&lt;/p&gt;

&lt;p&gt;What we had was a growing pile of screenshots from Perplexity, Google's AI Overviews, ChatGPT, and Gemini, plus a Notion page where each of us had been informally rating the citation quality of pieces we'd written or co-written for B2B SaaS clients. Some of us called it "good" or "weak." One person used a 1-5 scale. Another used colored dots. The tracker was the symptom. The problem was that we had no shared vocabulary for what a "good citation" actually was, and so every retrospective ended in a polite shrug.&lt;/p&gt;

&lt;p&gt;Over six weeks in Q4 2025 we ran what eventually became our baseline: 40 prompts, four engines (Perplexity, Google AIO, ChatGPT with web on, Gemini), five repetitions per prompt-engine combination. That's 800 prompt-runs. The point wasn't to win citations. The point was to figure out what to call them when we got them. Here is what we found, and what we'd do differently if we ran it again.&lt;/p&gt;
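
&lt;p&gt;The grid itself is the easy part to reproduce. A sketch of the run plan, with placeholder prompt IDs and engine labels rather than our actual tooling:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import itertools
import random

prompts = [f"prompt_{i:02d}" for i in range(1, 41)]   # 40 prompts
engines = ["perplexity", "google_aio", "chatgpt_web", "gemini"]
reps = range(1, 6)                                     # 5 repetitions

# 40 prompts x 4 engines x 5 reps = 800 prompt-runs.
run_plan = list(itertools.product(prompts, engines, reps))
assert len(run_plan) == 800

# Shuffle execution order so time-of-day and engine-side drift don't
# cluster on particular prompts.
random.shuffle(run_plan)
&lt;/code&gt;&lt;/pre&gt;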

&lt;h3&gt;
  
  
  Why we ran the study in the first place
&lt;/h3&gt;

&lt;p&gt;The trigger was a single client conversation in early Q3 2025. The client asked, point blank, "are we doing well on AI search?" The honest answer was that I didn't know how to answer the question precisely. I could say "yes, we've seen citations" or "no, we're not in the top results," but I couldn't tell them how often, on what kinds of queries, with what consistency, against what baseline. That gap was the embarrassing part. We were charging for GEO work and didn't have a measurement instrument we trusted.&lt;/p&gt;

&lt;p&gt;We went away from that meeting, looked at the tools available off the shelf, and concluded that the existing AI-search rank-tracking products were either too shallow (one-shot queries) or too opaque (proprietary scoring with no exposed methodology) to underwrite the kind of answer we wanted to give. So we built the methodology ourselves, knowing it would be slow and partial and likely embarrassing in retrospect. That client is still a client. The methodology has evolved. The original question is now answerable in a way it wasn't.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a tier system at all
&lt;/h3&gt;

&lt;p&gt;The first draft of the framework had three buckets: cited, not cited, and "sort of mentioned." That collapsed almost immediately. A citation that's a hyperlinked source under a direct answer is not the same artifact as a passing mention in a paragraph that an AI engine generated from training data and didn't link out from. We needed to distinguish at least four things: whether the source was linked, whether the answer paraphrased our claim, whether the brand entity was named, and whether the user would plausibly click through.&lt;/p&gt;

&lt;p&gt;After two passes we landed on A through E:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A-tier&lt;/strong&gt;: linked primary citation, our specific claim is paraphrased, entity is named in answer body.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B-tier&lt;/strong&gt;: linked citation, claim is paraphrased, entity not named (anonymous source).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C-tier&lt;/strong&gt;: unlinked mention in the answer text, with no source attribution in the citation rail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D-tier&lt;/strong&gt;: appears only as a footnote-style URL in the "sources" rail with no semantic pull-through.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E-tier&lt;/strong&gt;: indexed but not surfaced to the user (you can find it via "show all sources" but it's invisible by default).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Across the 800 runs, 23% landed in A or B tier, 45% sat in D or E, and the middle (C) was small at about 11%, which surprised us; we'd expected a wider plateau. The remaining ~21% of runs returned no citation for our content at all.&lt;/p&gt;
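
&lt;p&gt;If you're building the same kind of tracker, the aggregation is a single pass over the coded records once "no citation" is treated as a first-class value rather than a missing row. A sketch with invented records:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter

# One coded record per prompt-run. Tier is "A" through "E", or None when
# our content didn't appear at all. Records here are invented.
coded_runs = [
    {"prompt": "prompt_01", "engine": "perplexity", "rep": 1, "tier": "A"},
    {"prompt": "prompt_01", "engine": "perplexity", "rep": 2, "tier": "B"},
    {"prompt": "prompt_01", "engine": "google_aio", "rep": 1, "tier": "E"},
    {"prompt": "prompt_02", "engine": "gemini", "rep": 1, "tier": None},
    # ... 800 records in the full study
]

tiers = Counter(r["tier"] for r in coded_runs)
total = len(coded_runs)

print(f"A+B: {(tiers['A'] + tiers['B']) / total:.0%}")
print(f"D+E: {(tiers['D'] + tiers['E']) / total:.0%}")
print(f"no citation: {tiers[None] / total:.0%}")
&lt;/code&gt;&lt;/pre&gt;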

&lt;h3&gt;
  
  
  What the 23% number actually means
&lt;/h3&gt;

&lt;p&gt;I want to be careful. The 23% is a portfolio number across our test set, not a per-engine number, and not a per-client number. In our testing, Perplexity tier-A rates ran noticeably higher than Gemini's; ChatGPT (web on) sat between them; Google AIO behaved most like a confidence-weighted SEO ranker, with strong D/E presence and rare A-tier breakthroughs.&lt;/p&gt;

&lt;p&gt;Small n caveats apply. Forty prompts is not a representative sample of any client's actual demand curve. The 23% is the headline, not the answer. The answer is the variance: across five reps of the same prompt on the same engine in the same week, we saw tier shifts on 31% of prompt-engine pairs. So a single audit run is, statistically, half a coin flip from being directionally wrong on a third of the data.&lt;/p&gt;
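
&lt;p&gt;The rep-to-rep variance is the same kind of one-pass check: group by prompt-engine pair and ask whether the reps agree. A sketch with invented values:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Tiers observed across the five reps of each prompt-engine pair.
# Values are invented; None means no citation on that rep.
reps_by_pair = {
    ("prompt_01", "perplexity"): ["A", "A", "B", "A", "A"],
    ("prompt_01", "google_aio"): ["E", "E", "E", "E", "E"],
    ("prompt_02", "gemini"): ["D", None, "D", "D", "C"],
    # ... 160 pairs in the full study (40 prompts x 4 engines)
}

# A pair is unstable if its five reps did not all land in the same tier.
unstable = sum(1 for tiers in reps_by_pair.values() if len(set(tiers)) &gt; 1)

print(f"tier shifted on {unstable / len(reps_by_pair):.0%} of pairs")
&lt;/code&gt;&lt;/pre&gt;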

&lt;h3&gt;
  
  
  One thing we got wrong on the first pass
&lt;/h3&gt;

&lt;p&gt;We initially coded "entity named in body but no link" as B-tier. Two months in, we noticed that those mentions correlated almost zero with downstream session starts on the client's analytics. We moved them to C. The lesson is that linked-ness, not nameness, is doing the heavy lifting. The agency I work with had quietly assumed brand mention was the prize; it isn't, at least not yet. Reformulating mid-study was uncomfortable. We re-coded 174 records. It was the right call.&lt;/p&gt;

&lt;h3&gt;
  
  
  The per-engine breakdown
&lt;/h3&gt;

&lt;p&gt;When we sliced the 23% A+B rate by engine, the variation was wider than the headline suggests. Perplexity returned A or B tier on about 31% of its runs in our test set. ChatGPT with web on sat around 24%. Gemini was 19%. Google AIO was 15%, with most of its surface concentrated in D and E. The aggregate is a portfolio average; if you only care about one engine, the portfolio number is the wrong number to plan against.&lt;/p&gt;

&lt;p&gt;We also broke the 23% out by prompt category. Prompts that asked for comparative statements ("X vs Y") cited better than prompts that asked for definitional statements ("what is X"). Prompts that referenced a specific product or vendor by name cited better than category-level prompts. Prompts about recent events (within the trailing 90 days) cited better on Perplexity and worse on Google AIO. Most of these patterns are intuitive in retrospect; we hadn't predicted any of them in advance.&lt;/p&gt;

&lt;p&gt;The point of mentioning these is not to suggest you should optimize for the categories that cite better. It's that any single headline number — including ours — is hiding a structure underneath it, and the structure is where the decisions actually live.&lt;/p&gt;

&lt;h3&gt;
  
  
  The coding fatigue problem
&lt;/h3&gt;

&lt;p&gt;A boring methodological note that I want to write down because it bit us. Tier coding 800 records by hand is fatigue-prone work. We tried to do it in long sessions and our inter-rater reliability dropped noticeably after about the 60th record in a sitting. We've since switched to coding in 45-minute blocks with two coders comparing notes at the end of each block. Reliability improved. Throughput stayed roughly constant, because the rework rate dropped.&lt;/p&gt;

&lt;p&gt;If you're running a study like this, the fatigue effect is real. We logged a 12% discrepancy rate between coders when sessions ran past 90 minutes, versus about 4% in shorter sessions on the same content. The data underneath your tier rate is only as good as the coding discipline that produced it.&lt;/p&gt;
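
&lt;p&gt;The discrepancy check is cheap to automate, which is part of why we kept it. A sketch, assuming both coders' tier assignments are keyed by a shared record ID (values invented):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Tier assignments from two coders for the same records. Values invented.
coder_a = {"r001": "A", "r002": "D", "r003": "B", "r004": "E"}
coder_b = {"r001": "A", "r002": "C", "r003": "B", "r004": "E"}

shared = set(coder_a).intersection(coder_b)
disagreements = [rid for rid in shared if coder_a[rid] != coder_b[rid]]

print(f"discrepancy rate: {len(disagreements) / len(shared):.0%}")

# Slicing the same calculation by session length, or by a record's position
# within a session, is how the fatigue effect becomes visible.
&lt;/code&gt;&lt;/pre&gt;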

&lt;h3&gt;
  
  
  What we'd change next time
&lt;/h3&gt;

&lt;p&gt;Five reps was the minimum that gave us stable tier assignments. Three reps lied to us repeatedly in the first pilot. If you're doing this on your own content, please run five. We'd also pre-register the prompt list before looking at any results, because we caught ourselves rewriting prompts that "didn't work" and that's exactly how to fool yourself.&lt;/p&gt;

&lt;p&gt;We'd also pre-register the tier definitions. We didn't, and we ended up re-coding 174 records (mentioned above) when we revised the framework. Pre-registration would have forced us to argue the definitions before we knew the answer. That would have been slower up front and faster overall.&lt;/p&gt;

&lt;p&gt;We're now running a sequel study, 60 prompts, same engines, with prompt phrasing held constant from the start and a second coder doing blind tier assignment for inter-rater reliability. I don't expect the 23% number to hold; it might be lower once we control for prompt drift. We'll publish either way, including if the answer is embarrassing.&lt;/p&gt;

&lt;p&gt;If you're starting your own tier framework, the question I'd ask first isn't "how do I score this." It's "what would I have to see to change my mind about what counts as a good citation?" If you can't answer that in a sentence, the framework is going to drift the moment you stare at the data. Ours did.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This field report was published by westOeast, a B Corp certified marketing agency working on generative engine optimization for B2B SaaS. The methodology, framework, and data described here come from internal audits at westOeast across our client portfolio in 2025-2026. More field notes at &lt;a href="https://www.westoeast.com" rel="noopener noreferrer"&gt;westoeast.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geo</category>
      <category>aisearch</category>
      <category>citationquality</category>
      <category>b2bsaas</category>
    </item>
  </channel>
</rss>
