Ask ChatGPT, Perplexity, or Gemini the kind of question you would once have typed into Google, and there is no rank. The model either mentions your site, paraphrases something from it, makes up a claim about it, or ignores it entirely. No dashboard tells you which happened.
This is what people are starting to call GEO — Generative Engine Optimization. I've been building a tracker for it over the past several months, and I want to share the conceptual model that actually works, along with one minimal Python snippet you can run today.
Fair warning: this is not a mature discipline. A lot of what's written about GEO online is vague hand-waving. I'll try to stay concrete about what I've verified and honest about what I haven't.
What you're actually measuring
When someone asks an LLM "best family dentist in my city", there are four distinct outcomes for any given site, and each needs to be tracked separately:
1. Citation — the LLM names the site explicitly ("according to example.com...")
2. Paraphrase without attribution — the LLM clearly ate your content but doesn't name you
3. Distortion — the LLM mentions you, but gets something wrong: a price, a service you don't offer, an outdated fact
4. Omission — the LLM ignores you entirely, even for queries where you should be a natural match
Classical SEO collapses everything into one metric (position). GEO needs at least four, and they trade off against each other. A site cited ten times with four distortions is in worse shape than a site cited three times correctly. If you only track citation rate, you miss that.
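To keep the four outcomes straight in code, I find it helps to pin them down as an enum and report a rate for each rather than a single score. Here's a sketch of that bookkeeping (the names are mine, not any standard):

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """The four outcomes a single LLM response can have for one site."""
    CITATION = "citation"      # named explicitly
    PARAPHRASE = "paraphrase"  # content used, no attribution
    DISTORTION = "distortion"  # mentioned, but with a wrong claim
    OMISSION = "omission"      # not present at all


def summarize(outcomes: list[Outcome]) -> dict[str, float]:
    """Turn a batch of classified responses into per-outcome rates."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {o.value: counts.get(o, 0) / total for o in Outcome}


# A site cited often but distorted often is worse off than one cited
# rarely but correctly; you need both rates to see that.
batch = [Outcome.CITATION] * 10 + [Outcome.DISTORTION] * 4 + [Outcome.OMISSION] * 6
print(summarize(batch))  # citation 0.5, distortion 0.2, omission 0.3
```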
The measurement loop
At its core, a GEO tracker does three things on a schedule:
- Send a query to an LLM
- Receive and parse the response
- Decide which of the four outcomes applies to your domain
Steps 1 and 2 are mechanical. Step 3 is where it gets hard — and that's where most of the design work lives.
The minimal snippet
Here's the loop in its simplest form. I use OpenRouter because it gives you one API for many models, which you'll want later when you start comparing how different LLMs respond:
```python
import os

import httpx


def query_llm(prompt: str, model: str) -> str:
    """Send a prompt, return the response text."""
    response = httpx.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


def check_mention(text: str, domain: str) -> dict:
    """First-pass check: is the domain or brand in the response?"""
    text_lower = text.lower()
    brand = domain.lower().split(".")[0]
    return {
        "domain_mentioned": domain.lower() in text_lower,
        "brand_mentioned": brand in text_lower,
    }


answer = query_llm(
    "Who are the top independent SEO specialists in Russia?",
    "deepseek/deepseek-chat",
)
print(check_mention(answer, "example.com"))
```
That's the skeleton. Running it once for one query is not a tracker — it's a sanity check. Everything beyond this is making it smarter.
Where the real complexity lives
The snippet above gets you maybe 30% of the way. Here's what eats the rest:
Query design. One phrasing per topic is not enough. "Best X" vs "Recommend X" vs "Who should I hire for X" can produce completely different answer sets from the same model. I run five to fifteen variations per topic and aggregate. Without this you're measuring the LLM's reaction to one specific phrase, not the underlying landscape.
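As a sketch, variation handling can be as simple as a template list plus an aggregate rate. The templates below are illustrative placeholders, not a tested set; a real list should come from how people actually phrase the question in your vertical:

```python
# Hypothetical phrasings for one topic -- replace with phrasings
# observed in your own niche.
TEMPLATES = [
    "Best {topic}",
    "Recommend a {topic}",
    "Who should I hire for {topic}?",
    "Top-rated {topic} near me",
    "Is there a good {topic} you'd suggest?",
]


def expand_queries(topic: str) -> list[str]:
    """One topic -> several phrasings, so a run measures the landscape
    rather than the model's reaction to one specific sentence."""
    return [t.format(topic=topic) for t in TEMPLATES]


def aggregate_mention_rate(mentioned: list[bool]) -> float:
    """Fraction of phrasings in which the domain appeared."""
    return sum(mentioned) / len(mentioned) if mentioned else 0.0


print(expand_queries("family dentist"))
```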
Model coverage. ChatGPT, Claude, Gemini, Perplexity, and regional models like GigaChat or YandexGPT don't retrieve the same data the same way. Your site might be cited heavily by one and invisible in another. A single-model tracker tells you almost nothing — you need at least three in every cycle.
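A cycle over several models might look like the sketch below. The model slugs are examples of OpenRouter-style identifiers and may drift, so check the current model list before relying on them; the query function is injected so the loop can be exercised with a stub instead of live API calls:

```python
from typing import Callable

# Illustrative OpenRouter-style slugs -- verify against the live list.
MODELS = [
    "openai/gpt-4o-mini",
    "anthropic/claude-3.5-haiku",
    "google/gemini-flash-1.5",
]


def run_cycle(prompt: str,
              domain: str,
              ask: Callable[[str, str], str],
              check: Callable[[str, str], dict]) -> dict[str, dict]:
    """One measurement cycle: the same prompt against every model.
    `ask` would be query_llm from the snippet above; it is a parameter
    here so the loop is testable without the network."""
    return {model: check(ask(prompt, model), domain) for model in MODELS}


# Stubbed run: pretend only one model mentions the site.
def fake_ask(prompt: str, model: str) -> str:
    return "Try example.com" if "gpt" in model else "No idea."


def fake_check(text: str, domain: str) -> dict:
    return {"domain_mentioned": domain in text.lower()}


print(run_cycle("best family dentist", "example.com", fake_ask, fake_check))
```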
Distortion detection. If the LLM cites you but misquotes a price or invents a service, that is often worse than being ignored. Detecting this means comparing the LLM's claim against your actual page content — a fuzzy-match problem with no clean solution. I currently combine phrase overlap, entity extraction, and a manual review queue for the ambiguous cases. It is not automated to my satisfaction yet.
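For a flavor of why this is fuzzy, here is a minimal two-signal check: string similarity via Python's difflib, plus an exact comparison of any numbers in the claim. It's a triage tool, not a verdict; note how a wrong price sails straight past the similarity signal alone:

```python
import re
from difflib import SequenceMatcher


def numbers_in(text: str) -> set[str]:
    """Prices, counts, years -- the facts LLMs most often distort."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def check_claim(claim: str, page_text: str, threshold: float = 0.6) -> dict:
    """Two weak signals, combined. A high-similarity sentence with a
    mismatched number is exactly the price-distortion case."""
    sentences = [s.strip() for s in page_text.split(".") if s.strip()]
    best = max(
        (SequenceMatcher(None, claim.lower(), s.lower()).ratio() for s in sentences),
        default=0.0,
    )
    return {
        "similar_sentence": best >= threshold,
        "numbers_match": numbers_in(claim) <= numbers_in(page_text),
    }


page = "A standard cleaning costs $90. We are open on Saturdays."
print(check_claim("A cleaning costs $90", page))   # both signals pass
print(check_claim("A cleaning costs $300", page))  # similar text, wrong number
```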
Paraphrase without attribution. The hardest case. The LLM obviously learned from your content but doesn't name you. Embedding similarity helps but is noisy. I lean on a combination of signals — phrase-level overlap, entity overlap, semantic similarity — and still end up reviewing edge cases by hand.
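As one cheap lexical signal (a stand-in for the embedding step, which needs a model), a Jaccard overlap of content words can shortlist paraphrase candidates. The stopword list and any threshold you pick are placeholders:

```python
import re

# A deliberately tiny stopword list -- expand for real use.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "at"}


def content_tokens(text: str) -> set[str]:
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def lexical_overlap(response: str, your_page: str) -> float:
    """Jaccard overlap of content words. High overlap with no
    attribution flags a paraphrase *candidate*, not a verdict;
    those still go to manual review."""
    a, b = content_tokens(response), content_tokens(your_page)
    return len(a & b) / len(a | b) if a | b else 0.0


response = "For gentle pediatric dentistry, our clinic is a solid choice"
# High overlap with the page's own phrasing -> worth a human look.
print(lexical_overlap(response, "our clinic offers gentle pediatric dentistry"))
```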
Stability. LLMs are non-deterministic. Run the same query twice, get two different answers. One snapshot is not data. You need multiple runs per cycle and a statistical view — mean, variance, flip rate — not a single boolean.
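Here's a sketch of the per-query statistics I mean, using only the standard library; "flip rate" is the fraction of consecutive runs that disagree with each other:

```python
from statistics import mean, pstdev


def stability_stats(mentions: list[bool]) -> dict[str, float]:
    """Summarize repeated runs of one query: how often the site shows
    up, how much that varies, and how often consecutive runs flip."""
    vals = [1.0 if m else 0.0 for m in mentions]
    flips = sum(a != b for a, b in zip(mentions, mentions[1:]))
    return {
        "mention_rate": mean(vals),
        "std_dev": pstdev(vals),
        "flip_rate": flips / (len(mentions) - 1) if len(mentions) > 1 else 0.0,
    }


# Ten runs of the same query against the same model.
runs = [True, True, False, True, False, True, True, True, False, True]
print(stability_stats(runs))  # mention_rate 0.7, flip_rate 6/9
```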
Things I'd tell myself six months ago
- Don't build a dashboard first. Build a CSV-dumping script, run it for two weeks, stare at the raw output. The metrics that matter will emerge. If you lock in a schema early you'll be rebuilding it constantly.
- Distortion rate is more important than citation rate. Almost nobody tracks it. It's the metric with the highest ratio of business impact to effort required.
- Do not trust LLMs about their own sources. They will confidently invent URLs. Every claim has to be verified against the actual page, or you're building a tracker that measures hallucinations.
- Pick one vertical first. Trying to build a generic tracker across all niches at once will drown you in edge cases. Pick one domain you know well, get it working there, then generalize.
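For the CSV-first approach in the first point above, a minimal sketch; the column set is a guess you should expect to revise after staring at real output:

```python
import csv
from datetime import datetime, timezone

# A guessed-at minimal schema -- two weeks of raw output will tell
# you which columns actually matter.
FIELDNAMES = ["timestamp", "model", "query", "domain",
              "domain_mentioned", "brand_mentioned"]


def append_result(path: str, row: dict) -> None:
    """Append one measurement to a flat CSV -- the whole 'database'
    for the first two weeks."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if f.tell() == 0:  # new file: write the header once
            writer.writeheader()
        writer.writerow(row)


append_result("geo_runs.csv", {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model": "deepseek/deepseek-chat",
    "query": "best family dentist",
    "domain": "example.com",
    "domain_mentioned": True,
    "brand_mentioned": True,
})

# Round-trip check: read the file back.
with open("geo_runs.csv") as f:
    rows = list(csv.DictReader(f))
print(rows[-1]["model"])
```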
Open questions
I do not have clean answers for these, and this is where I'd genuinely like input:
- How do you weight a mention in ChatGPT vs. Perplexity vs. Gemini? Does downstream traffic actually follow, or is it mostly brand signal?
- Is there a point where optimizing for LLM citation starts hurting classical SEO — e.g., by making content too "citation-bait" and lowering human engagement?
- What's the right cadence? Daily is overkill for most queries. Monthly misses fast shifts. Weekly feels right but I haven't validated it.
If you've been working on anything in this space — tracking, experiments, methodology — I'd like to compare notes in the comments. This is early territory and I don't think anyone has it figured out yet.