DEV Community: carlosortet

How LLMs search, retrieve and answer with real-time web data

carlosortet — Fri, 10 Jul 2026 07:07:48 +0000

When a user asks something that requires recent information, an LLM does not answer only from its internal memory. It can activate search systems, retrieve web fragments, compare them semantically, validate sources and build an answer grounded in up-to-date information.

The technical question is: how does an LLM move from a user's prompt to an answer based on documents, search results and retrieved fragments in real time?

This article explains that flow. It helps explain why ChatGPT, Gemini, Claude, Copilot and Perplexity may answer the same question differently, why they do not value the same sources in the same way, and why optimization for LLMs is not about publishing more. It is about understanding which fragments the models retrieve, for what intent, and inside which context.

High-resolution diagram: prompt, query planning, retrieval, vectorization, semantic matching, source validation and answer construction.

Core idea

An LLM rarely answers a complex question in a single jump. First it does something close to clearing the table: it interprets what is being asked, detects entities, looks at conversational context and decides whether it can answer from what it already knows or whether it needs to search.

When it needs current data, a second layer of machinery starts working: search, crawling, fragment retrieval, semantic comparison, source validation and, finally, response composition.

This is the important part for GEO. Many answers do not come only from the model's pre-trained knowledge. They come from a mix of reasoning, retrieval, discovered sources and internal composition criteria. OpenAI documents, for example, that ChatGPT Search can decide to search the web depending on the user's question, or allow the user to trigger search manually.

In any category where freshness, comparison or decision-making matters, this changes the rules. An answer about "best banks for young people", "phones with the best after-sales service" or "reliable fashion brands in Spain" may be shaped by fragments retrieved from media, official websites, comparisons, forums, technical documentation, marketplaces or user reviews.

What happens when the user asks

If someone writes "best mobile phones 2026", the model is not reading only four words. It is reading a purchase intent. There is comparison, freshness, risk of obsolescence, candidate brands, possible price ranges and a fairly clear expectation: the user wants a recommendation, not a definition of what a phone is.

The system analyzes:

response objective;
mentioned or implicit entities;
conversational context;
market, language and date;
explicit constraints;
implicit value criteria;
purchase factors;
relevant brand attributes;
required freshness level.

That nuance changes everything that follows.

Search decision

The model, or the orchestration layer around it, decides whether it can answer from internal knowledge or whether it needs external information. For current, comparative, transactional, reputational or market-specific questions, some form of retrieval is usually activated: web search, index lookup, connectors or structured sources.

The decision may depend on:

freshness of the question;
presence of dates such as 2026;
need for prices, availability or new information;
risk of outdated information;
need to compare brands;
need to cite or validate sources;
likelihood that the user expects recent data.
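As a rough illustration of the signals listed above, the decision can be sketched as a rule-based check. This is only a toy heuristic: the `needs_search` function, the `FRESHNESS_TERMS` word list and the "recent year" rule are assumptions made for this example, and production systems most likely rely on learned classifiers or on the model's own tool-use decision rather than keyword rules.

```python
import re
from datetime import date

# Hypothetical trigger words for fresh, comparative or transactional intent.
FRESHNESS_TERMS = {
    "best", "latest", "new", "price", "prices", "availability",
    "today", "compare", "comparison", "vs", "reviews",
}


def needs_search(prompt: str, today: date) -> bool:
    """Return True when the prompt looks time-sensitive or comparative."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    years = [int(y) for y in re.findall(r"\b(20\d{2})\b", prompt)]
    # A year equal to or later than last year suggests the user expects recent data.
    mentions_recent_year = any(y >= today.year - 1 for y in years)
    return mentions_recent_year or bool(words & FRESHNESS_TERMS)
```

With this sketch, a prompt such as "best mobile phones 2026" triggers search, while a timeless question such as "what is a prime number" does not.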

If search is activated, the original prompt becomes a set of subquestions.

Example:

best smartphones 2026;
phones with the best camera;
iPhone Samsung Xiaomi comparison;
user opinions about battery life;
best phones to buy in Spain;
frequent problems by model;
most recommended brands by price segment.

In a reputational or commercial query, that decomposition may include searches such as:

user ratings for a brand;
frequent service complaints;
average rating on review platforms;
user experience in forums;
comparisons with competitors;
verified reviews on specialized platforms.
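The decomposition described above can be sketched as a small query planner that expands one prompt into several subqueries. The `QueryPlan` type, the `FACETS` templates and the `plan_queries` function are hypothetical names for this illustration; real systems probably generate subqueries with the model itself rather than from fixed templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPlan:
    prompt: str
    subqueries: tuple[str, ...]


# Hypothetical facet templates; each one targets a different part of the problem.
FACETS = (
    "{topic} comparison",
    "{topic} user reviews",
    "{topic} frequent problems",
    "best {topic} in {market}",
)


def plan_queries(prompt: str, topic: str, market: str) -> QueryPlan:
    """Expand a prompt into the original query plus one subquery per facet."""
    subqueries = [prompt]
    for template in FACETS:
        query = template.format(topic=topic, market=market)
        if query not in subqueries:  # avoid sending the same query twice
            subqueries.append(query)
    return QueryPlan(prompt, tuple(subqueries))
```

Calling `plan_queries("best smartphones 2026", "smartphones 2026", "Spain")` yields the original prompt plus four facet queries, each of which can be sent to retrieval independently.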

Search engines and grounding by model

There is a common misunderstanding: assuming that if an LLM uses Bing, Google Search or another retrieval system, then it behaves like that search engine. It does not.

The search engine may provide URLs, snippets, documents or passages. Then the model does something else: it reinterprets, compresses, compares, discards and turns all of that into a conversational answer.

That is why two systems can consult similar sources and still generate different answers.

ChatGPT

OpenAI documents that ChatGPT Search can use external search providers, including Bing, together with other partners and retrieved content. But ChatGPT is not Bing with a conversational interface on top.

Search finds candidates. ChatGPT decides what to consult, how to reformulate the question, which fragments to use, how to synthesize them and how to respond according to its instructions, conversational context and policies.

In real use, ChatGPT may:

decide whether to search;
generate one or more queries;
retrieve web results;
open or synthesize sources;
combine passages;
cite results when the product supports it;
compose a conversational answer.

The result is not a Bing ranking pasted onto the screen. It is a generative answer conditioned by search.

Gemini

Google documents Grounding with Google Search for Gemini. This means Gemini can ground an answer with Google Search results. But it does not turn Gemini into classic Google Search either.

Google Search may act as retrieval infrastructure. Gemini still applies its own prompt interpretation, context system, grounding mechanisms, filters and final generation.

There is a trap here for SEO teams: a page can rank well in Google Search and still not be the fragment Gemini chooses for a generative answer. To enter the answer, the page must be retrievable, clear, structured and useful for that specific conversational objective.

Microsoft Copilot

Microsoft Copilot naturally relies on the Bing and Microsoft ecosystem. In enterprise environments it may combine the web, Microsoft Graph, corporate documents, connectors and the user's permissions.

Microsoft documents agents and grounding experiences with Bing Search, but the final result is not a SERP either. Copilot transforms the prompt into queries, retrieves information, applies corporate context when available and generates an answer.

For GEO this matters a lot. The same content can behave differently in Copilot if it is evaluated in open web mode, inside a corporate tenant, with Microsoft Graph available or with internal connectors active.

Claude

Anthropic documents web search capabilities for Claude, with current information retrieval and citations. Unlike ChatGPT/Bing or Gemini/Google Search, public sources do not let us reduce Claude to a single external search engine.

What we do know is enough to measure it separately: Claude has its own system for search, source selection, citations, document reasoning and response policies.

It is often especially sensitive to context quality, documentary clarity, source consistency and the way evidence is presented. In a GEO study, inferring Claude from ChatGPT or Gemini is a mistake.

Perplexity

Perplexity presents itself as an answer-first and search-native system. Its Sonar and Search API products work with web search, retrieval and citations. The experience is designed to answer with visible sources, although that does not mean it selects or weights the same sources as Google, Bing, ChatGPT or Gemini.

In Perplexity, a source gains value when it is direct, current, verifiable and citable. For GEO, that gives weight to clean, updated, well-titled pages with explicit information and little ambiguity.

What matters for GEO

Each model has a different architecture:

different search engine or retrieval provider;
different query generation;
different tokenization system;
different embedding model;
different chunking policy;
different context window;
different reranking;
different tolerance for contradictory sources;
different citation behavior;
different synthesis style;
different safety and brand policies.

The operational takeaway is simple: asking one LLM once is not enough. A serious GEO audit measures each model, market and customer persona separately.

Search, crawling and retrieval

The system does not read the internet as a person would. Nor does it read it the way Google Search does when it returns a SERP. It queries search engines, proprietary indexes, crawlers, document stores, structured sources or specific connectors.

And each LLM solves this layer differently. ChatGPT, Gemini, Claude, Copilot and Perplexity do not index, retrieve or weight results in the same way.

The process usually has four parts:

Query planning. The prompt becomes several queries designed to solve different parts of the problem.
Crawling or index lookup. Crawlers and engines retrieve candidate pages, documents, tables, reviews, FAQs, product sheets, media, forums, official documentation or structured data.
Fragment extraction. The system does not mainly work with whole domains. It extracts pieces of content: titles, paragraphs, lists, tables, product blocks, review fragments, FAQ answers, schema markup, dates, authors, entities, language and localization signals.
Chunking and normalization. Content is cleaned and split into chunks: manageable fragments that preserve meaning. A chunk can be a review paragraph, a pricing table, a technical specification, a comparison or a block of user opinions.
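A minimal sketch of the chunking step, assuming fixed-size word windows with overlap. The `chunk_words` function and its default sizes are assumptions for illustration; real pipelines often split on headings, paragraphs or table boundaries instead, precisely so that each chunk preserves meaning.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()  # also normalizes runs of whitespace
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap is a design choice: it reduces the chance that a sentence answering the prompt is cut in half at a chunk boundary, at the cost of indexing some words twice.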

Tokenization: why each LLM represents information differently

Before vectorizing or generating an answer, the model splits text into internal units: tokens. Sometimes a token is a whole word. Sometimes it is part of a word, a punctuation mark, a frequent character sequence or another subword piece.

Here a less visible but very important difference appears: each model family tokenizes in its own way. OpenAI publishes its tokenizers through the open-source tiktoken library. Google provides token-counting tools for Gemini. Anthropic documents its own token counting for Claude. Each system has its own vocabulary, context limits and text-splitting behavior.

So the model does not exactly see "words". It sees token sequences.

Conceptual example:

best mobile phones 2026

One model may split it roughly as:

best
mobile
phones
2026

Another may split it more granularly:

best
mobile
phone
s
2026

Another may treat accents, capitalization, brands, hyphens or symbols differently:

iPhone
15
Pro
Max
256
GB

or:

i
Phone
15
Pro
Max
256GB

It looks minor. It is not. It can affect:

how the context window is consumed;
how entities are detected;
how brands and product names are preserved;
how variants with or without accents are compared;
how acronyms, models, technical references or prices are represented;
how long documents are cut into chunks;
which fragments stay together or get separated;
which terms carry more weight in the semantic representation.
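The splitting differences illustrated above can be reproduced with two toy tokenizers. Neither function is a real model tokenizer: `coarse_tokens` and `fine_tokens` are hypothetical stand-ins written for this example, and real systems use learned subword vocabularies. The sketch only shows that the same string can cost a different number of tokens depending on the splitting rules.

```python
import re


def coarse_tokens(text: str) -> list[str]:
    """Toy tokenizer A: split on whitespace only."""
    return text.split()


def fine_tokens(text: str) -> list[str]:
    """Toy tokenizer B: also split at case changes, digit runs and symbols."""
    pattern = r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+|[^\sA-Za-z\d]"
    return re.findall(pattern, text)


product = "iPhone 15 Pro Max 256GB"
print(len(coarse_tokens(product)), coarse_tokens(product))
print(len(fine_tokens(product)), fine_tokens(product))
```

For this product name the first tokenizer produces 5 tokens and the second produces 7, because it separates "i" from "Phone" and "256" from "GB". A brand name that stays whole under one scheme can be broken into several pieces under another.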

For GEO this matters a great deal. A brand, product or attribute may tokenize differently depending on the model. GEOdoctor, S.A.M., iPhone 16 Pro, B2C, CX, ISO 27001, A++, Madrid, Mexico or llms.txt may occupy different numbers of tokens and generate non-identical internal representations.

The practical consequence: the same content does not always "fit" equally well in every LLM. A text may be transparent to one model and less legible to another if entities, attributes, structured data or business phrases are not explicit enough. That is why a GEO strategy should not assume that one optimization works identically for ChatGPT, Gemini, Claude, Copilot and Perplexity.

Vectorizing the question and the documents

Next comes the step that makes it possible to compare questions and documents. The user's question and the retrieved fragments are transformed into the same kind of mathematical representation.

The question becomes an embedding:

user question -> vector q

Each retrieved fragment also becomes an embedding:

document fragment -> vector d

Both vectors live in a shared semantic space. Thanks to that, the system can compare meaning, not only literal word overlap.

The comparison can be expressed like this:

score = cos(q,d)
semantic distance = 1 - score

A fragment that does not literally repeat "best phones 2026" can still be highly useful if it discusses performance, camera, battery, price, local availability, user satisfaction or common problems. Matching does not depend only on repeating the keyword. It depends on semantic proximity between the user's need and the retrieved content.
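The two formulas above translate directly into code. This is a minimal sketch using plain Python lists; the function names `cosine` and `semantic_distance` are chosen for this example, and real embeddings have hundreds or thousands of dimensions rather than the two used in the test vectors.

```python
import math


def cosine(q: list[float], d: list[float]) -> float:
    """score = cos(q, d): dot product divided by the product of the norms."""
    if len(q) != len(d):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    if norm == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / norm


def semantic_distance(q: list[float], d: list[float]) -> float:
    """semantic distance = 1 - score."""
    return 1.0 - cosine(q, d)
```

Vectors that point in the same direction score 1 (distance 0) regardless of their length, and orthogonal vectors score 0 (distance 1).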

Semantic matching and chunk selection

Once chunks are vectorized, the system searches for the fragments closest to the query using techniques such as exact k-nearest-neighbor (kNN) search or approximate nearest-neighbor (ANN) search. Then it may apply:

reranking;
threshold filters;
deduplication;
source diversity;
date filters;
language filters;
consistency validation;
selection of top-k chunks.

The system is trying to give the model compact, useful and sufficiently diverse context to build an answer.
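A small sketch of that selection step, under simplifying assumptions: exhaustive (exact kNN) scoring instead of an ANN index, a fixed similarity threshold, deduplication by normalized text and a cap per source as a crude diversity rule. The `Chunk` type, `select_chunks` function and default values are hypothetical and do not describe any specific product.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str
    vector: list[float]


def cosine(q: list[float], d: list[float]) -> float:
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def select_chunks(query_vec, chunks, k=3, min_score=0.2, max_per_source=1):
    """Score every chunk, then apply threshold, dedup, diversity and top-k."""
    scored = sorted(((cosine(query_vec, c.vector), c) for c in chunks),
                    key=lambda pair: pair[0], reverse=True)
    selected, seen_texts, per_source = [], set(), {}
    for score, chunk in scored:
        if score < min_score:
            break  # scores are sorted, so nothing below can pass
        fingerprint = " ".join(chunk.text.lower().split())
        if fingerprint in seen_texts:
            continue  # near-verbatim duplicate of a chunk already kept
        if per_source.get(chunk.source, 0) >= max_per_source:
            continue  # keep the context diverse across sources
        seen_texts.add(fingerprint)
        per_source[chunk.source] = per_source.get(chunk.source, 0) + 1
        selected.append((score, chunk))
        if len(selected) == k:
            break
    return selected
```

Note how a highly similar chunk can still be dropped: it may duplicate a fragment already chosen, or come from a source that has used up its quota.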

The LLM does not receive "the whole internet". It receives a limited selection of fragments. That selection shapes the final answer far more than it usually appears from the outside.

Measuring whether a chunk is a good match for a prompt

This is where a more technical evaluation layer is useful. A crawler may find a page, but that does not mean the model will use it. The sharper question is: once that page is split into fragments, which chunks actually match the prompt well enough to become useful context?

The metrics used to evaluate embedding models offer a practical clue. They do not reveal the internal algorithm of ChatGPT, Gemini, Claude, Copilot or Perplexity. Those systems are proprietary and combine many signals. But they do give us reproducible ways to test whether an embedding model works well for retrieval, semantic similarity, thematic classification and candidate ranking.

KPI	What it measures	GEO reading
`XQuAD nDCG@10`	Question-context retrieval: a question is vectorized, contexts are ranked by similarity and the metric checks whether relevant contexts appear in the top 10.	Measures whether the system retrieves the chunks that could actually answer the prompt near the top. It is the closest metric to the retrieval problem.
`STS-ca Sp`	Semantic textual similarity: model similarity between sentence pairs is compared with human scores using Spearman correlation.	Measures whether the semantic map understands that two texts can answer the same intent without sharing keywords.
`TeCla F1`	Catalan thematic classification: embeddings are used to separate topic classes and evaluated with F1.	Indicates whether embeddings preserve signals that help classify intent, sector, attribute, purchase factor or product line.
`Dim`	Embedding vector dimensionality.	More dimensions may capture more nuance, but increase memory, index cost and latency. The best model is not always the largest one.
Composite score	Average or combination across tasks.	Useful for comparing models, but never a substitute for testing with real prompts, markets, sources and customer personas.

nDCG@10 matters because it evaluates ranking, not abstract similarity alone. In generative retrieval, appearing in position 1, 2 or 3 is not the same as appearing in position 37. The LLM works with a limited context window; if the right chunk does not make it into the top-k passed to the model, it effectively does not exist for that answer.
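To make the ranking sensitivity concrete, here is a minimal nDCG@k sketch over graded relevance labels. It assumes the exponential gain variant (2^rel - 1) and computes the ideal ordering only from the labels in the ranked list, which is a common simplification; a stricter evaluation would build the ideal ranking from every judged chunk for the prompt.

```python
import math


def dcg(labels: list[int], k: int) -> float:
    """Discounted cumulative gain over the first k ranked labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(labels[:k], start=1))


def ndcg(labels: list[int], k: int = 10) -> float:
    """DCG normalized by the DCG of the best possible ordering."""
    ideal = dcg(sorted(labels, reverse=True), k)
    return dcg(labels, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0. Moving the only relevant chunk from position 1 to position 2 already lowers the score, and a relevant chunk at position 37 contributes nothing to nDCG@10.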

STS-ca Sp covers another angle: the quality of the semantic map. If the user asks about "reliable brands for buying phones in 2026", a good system should rank fragments about warranty, after-sales service, user satisfaction, recurring problems, local availability or value for money close to the query even when they do not literally repeat the prompt.

TeCla F1 helps us understand whether embeddings separate themes and categories well. In GEO, the same logic can be adapted to a business taxonomy: informational, comparative or transactional intent; market; language; product line; purchase factor; brand attribute; competitor; funnel stage.

In practice, a serious GEO evaluation cannot stop at cos(q,d). It needs a composite score closer to real usage:

prompt-source fit =
  semantic similarity
+ intent coverage
+ entity coverage
+ market and customer-persona fit
+ freshness
+ chunk specificity
+ source citability
+ factual evidence
+ diversity against other chunks
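
One way to make the sum above computable is a weighted average of signals scaled to the range 0 to 1. The weights below are illustrative assumptions, not values taken from any model or study, and the `prompt_source_fit` function is a hypothetical helper; in a real audit the weights would need to be calibrated against labelled prompts.

```python
# Hypothetical weights, one per component of the composite score.
WEIGHTS = {
    "semantic_similarity": 0.25,
    "intent_coverage": 0.15,
    "entity_coverage": 0.10,
    "market_persona_fit": 0.10,
    "freshness": 0.10,
    "chunk_specificity": 0.10,
    "source_citability": 0.05,
    "factual_evidence": 0.10,
    "diversity": 0.05,
}


def prompt_source_fit(signals: dict[str, float],
                      weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-signal scores, each expected in [0, 1]."""
    missing = weights.keys() - signals.keys()
    if missing:
        raise KeyError(f"missing signals: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= signals[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    total_weight = sum(weights.values())
    return sum(weights[n] * signals[n] for n in weights) / total_weight
```

Dividing by the total weight keeps the result between 0 and 1 even if the weights are later changed and no longer sum to one.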

This is where the strategic reading changes. A high-authority page can fail if it does not contain the fragment the prompt needs. A smaller source can become decisive if it contains the exact passage that validates an attribute, resolves a doubt or answers a comparison. GEO should therefore look past domains and measure retrievable fragments: their position in semantic ranking and their ability to support an answer.

So, when we study how LLMs search, we should store more than the cited URL: the retrieved chunk, the prompt that triggered it, semantic distance, top-k position, covered intent, recognized entities and the role the fragment plays inside the answer.

A reproducible protocol for measuring prompt-chunk fit

For a GEO study to survive serious review, saying "this source appeared" or "this model cited this URL" is too thin. The observation has to become a repeatable experiment. The minimum unit is no longer the domain. It is the relationship between prompt, chunk, model and answer.

What we want to measure is very specific: given a set of prompts, markets, customer personas and competitors, which fragments from which sources are retrievable, at what rank, with what degree of relevance and with what effect on the final answer?

The minimum dataset should store these fields:

Field	Why it matters
`prompt_id`, `prompt_text`, `language`, `market`	Make it possible to repeat the same question and segment it by language and country.
`customer_persona`, `intent`, `product_line`, `purchase_factor`	Connect the prompt to a real decision situation: comparing, buying, validating trust, resolving an objection or looking for alternatives.
`brand`, `competitors`, `expected_attributes`	Show which brand, rivals and attributes should appear if the answer is well oriented.
`source_url`, `source_type`, `publication_date`, `crawl_date`	Document where the content comes from, when it was captured and what type of source it represents.
`chunk_id`, `chunk_text`, `chunk_hash`	Make the exact fragment auditable, not just the page. The hash helps detect later changes.
`model`, `search_surface`, `embedding_model`	Separate the LLM being evaluated, the search surface and the embedding model used to measure similarity.
`similarity_score`, `rank_position`, `relevance_label`	Record semantic proximity, top-k position and relevance judgment.
`support_label`, `used_in_answer`, `citation_position`	Distinguish whether the chunk supports, contradicts or is insufficient for the answer, and whether it appeared in the synthesis.
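
A record type along these lines could hold one observation. This is a sketch that covers only a subset of the fields in the table; the `RetrievalObservation` class and the `hash_chunk` helper are names invented for this example, and SHA-256 over whitespace-normalized text is just one reasonable way to compute `chunk_hash`.

```python
import hashlib
from dataclasses import dataclass, field


def hash_chunk(text: str) -> str:
    """Stable fingerprint of a chunk, insensitive to whitespace changes."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass
class RetrievalObservation:
    prompt_id: str
    prompt_text: str
    language: str
    market: str
    model: str
    search_surface: str
    embedding_model: str
    source_url: str
    chunk_id: str
    chunk_text: str
    similarity_score: float
    rank_position: int
    relevance_label: int  # rubric: 0 not relevant, 1 tangential, 2 useful, 3 central
    used_in_answer: bool
    chunk_hash: str = field(init=False)

    def __post_init__(self) -> None:
        if self.relevance_label not in (0, 1, 2, 3):
            raise ValueError("relevance_label must be 0, 1, 2 or 3")
        # Derived from the text so later edits to the page can be detected.
        self.chunk_hash = hash_chunk(self.chunk_text)
```

Comparing `chunk_hash` values between two crawl dates shows whether the exact fragment changed, even when the URL stayed the same.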

The methodology can be run in seven steps:

Freeze the prompt dataset. Before measuring, fix the prompt set by market, language, customer persona, intent, product line, purchase factor and competitors. If the dataset changes during the study, the comparison loses value.
Capture and chunk sources. Store HTML, clean text, canonical URL, publication date, crawl date, language, country and source type. Then split documents into chunks with consistent size and overlap.
Index and compare. Create embeddings for prompts and chunks. It is useful to compare at least a lexical baseline such as BM25, a vector index and a hybrid strategy. This shows what appears through literal overlap and what appears through semantic proximity.
Rerank and label. A second step reorders candidates. They are then labelled with a rubric: 0 not relevant, 1 tangential, 2 useful, 3 central. The label should also state whether the fragment supports, contradicts or does not contain enough evidence.
Compute metrics. For each prompt and segment, compute precision@k, recall@k, MRR and nDCG@10. Splitting by model, market, persona and intent shows where the brand is retrievable and where it disappears.
Connect retrieval to the answer. Not every retrieved chunk is used. Map which fragments support specific claims, which only act as background and which are ignored.
Measure drift. Re-running the same dataset over time shows whether a source gains weight, whether a brand loses conversational presence or whether a model changes how it validates attributes.
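
The rank metrics named in the metrics step can be sketched in a few lines. The functions below assume that each ranked list is a sequence of rubric labels (0 to 3) in retrieval order and that a label of 2 or higher counts as relevant; the function names and the `threshold` parameter are choices made for this example.

```python
def precision_at_k(labels: list[int], k: int, threshold: int = 2) -> float:
    """Share of the top-k positions filled by relevant chunks."""
    return sum(label >= threshold for label in labels[:k]) / k


def recall_at_k(labels: list[int], k: int, total_relevant: int,
                threshold: int = 2) -> float:
    """Share of all known relevant chunks that appear in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(label >= threshold for label in labels[:k]) / total_relevant


def reciprocal_rank(labels: list[int], threshold: int = 2) -> float:
    """1 / rank of the first relevant chunk, or 0 if none is retrieved."""
    for rank, label in enumerate(labels, start=1):
        if label >= threshold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[list[int]], threshold: int = 2) -> float:
    """MRR: average reciprocal rank across prompts."""
    return sum(reciprocal_rank(run, threshold) for run in runs) / len(runs)
```

Computing these per model, market, persona and intent, rather than as one global average, is what exposes the segments where a brand stops being retrievable.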

In pseudocode, the flow looks like this:

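for prompt in prompt_dataset:
    # Embed the frozen prompt and pull a wide candidate set from the chunk index.
    q = embed(prompt.text)
    candidates = retrieve_top_k(q, chunk_index, k=50)
    # Reorder the candidates, then label each one with the 0-3 rubric.
    reranked = rerank(prompt, candidates)
    labels = judge_relevance(prompt, reranked)
    # Compute the rank metrics at several cutoffs.
    metrics = evaluate(labels, k=[3, 5, 10])
    # Run the LLM under test and map its claims back to the retrieved chunks.
    answer = run_llm_with_retrieval(prompt)
    claims = map_answer_claims_to_chunks(answer, reranked)
    # Persist everything so the run can be repeated and compared over time.
    store(prompt, candidates, labels, metrics, answer, claims)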

A chunk is a good match when it appears in the relevant top-k, receives a label of 2 or higher, covers the prompt intent, contains explicit entities, fits the market and customer persona, provides verifiable evidence and does not contradict stronger sources. A source is strategically valuable when it appears repeatedly across prompt variants, personas, models or markets and improves the final answer.

Disagreement between models is useful. If one embedding model, LLM or search surface retrieves a fragment and another does not, that difference reveals how each system represents information, tokenizes text, weighs signals and packs context. That is a core part of GEO work: knowing what an AI says, yes, but also why it was able to say it and which content made that answer possible.

How different LLMs do semantic matching and selection

There is no single way to do "LLM semantic matching". Each system mixes different components: lexical search, vector search, embeddings, freshness signals, neural rerankers, intent classifiers, safety filters, product policies and citation preferences.

A typical architecture may include:

Lexical retrieval. Matching by terms, entities, brand names, dates, URLs, titles or exact expressions. It is useful for queries with proper nouns: Dexeus, iPhone 16 Pro, GEOdoctor, CaixaBank, Madrid.
Semantic retrieval. Comparison between query embeddings and chunk embeddings. It finds content that answers the meaning even when it uses different words.
Hybrid retrieval. A mix of lexical and vector search. It is more resilient to two common failures: losing exact documents because they do not look "semantic" enough, or retrieving semantically similar documents that are factually wrong.
Reranking. A second model reorders candidates according to expected usefulness for the answer. It may evaluate whether the fragment answers the question, is specific, recent, reliable, evidential or duplicate.
Context packing. The system decides which chunks fit inside the context window. Two equally relevant fragments may compete when space is limited.
Answer planning. The LLM decides how to use those fragments: list, comparison, summary, warning, recommendation, table or direct answer.
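
As an illustration of the hybrid step, one widely used way to merge a lexical ranking and a vector ranking is reciprocal rank fusion (RRF). The article does not state which fusion method any product uses, so RRF is an assumption here, as are the function name and the conventional constant `k = 60`.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids: each list adds 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for stability.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


lexical = ["a", "b", "c"]  # e.g. ids ranked by a BM25-style baseline
vector = ["c", "a", "d"]   # e.g. ids ranked by embedding similarity
print(reciprocal_rank_fusion([lexical, vector]))
```

Chunks that appear in both lists rise to the top, while chunks found by only one retriever still survive, which is the resilience property described for hybrid retrieval.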

ChatGPT

ChatGPT can combine external search, web results, fragment retrieval and generative reasoning. When it searches, it does not simply return the top-ranked result. It can reformulate the query, read several results, extract passages and build a synthesis.

Matching may favor:

recent pages when the question is time-sensitive;
clear and self-contained fragments;
sources that solve a specific subquestion;
content with explicit entities;
documents that allow several options to be compared;
sources that reduce uncertainty.

For ChatGPT, GEO-optimized content should be easy to fragment, understand and cite. Tables, FAQs, explicit comparisons, evidence-backed claims and pages with clear dates often help more than a long, polished but ambiguous essay.

Gemini

Gemini can use Google Search for grounding, but it does not replicate the SERP. Its matching may combine signals from the Google ecosystem with generative prompt interpretation. A strong SEO page can become a candidate. The real question is different: do its fragments fit the conversational objective?

Gemini may be especially sensitive to:

freshness;
entity clarity;
relationship with search intent;
localization signals;
structured data;
consensus across sources;
ability to resolve a comparison.

Simple example: if the user asks "best phones for traveling through Europe in 2026", talking about "best phones" is not enough. A fragment that mentions battery life, roaming, eSIM, camera, durability, European price and local availability may match better than a page with general authority.

Claude

Claude must be measured separately. Its behavior cannot be inferred from Google or Bing. In searches with sources, it may select and cite documents, and its synthesis depends heavily on context coherence and the argumentative quality of the sources.

In Claude, content may work better when it:

avoids ambiguity;
separates facts, opinions and claims;
explains reasoning;
provides enough context;
keeps terminology consistent;
presents evidence clearly;
reduces internal contradictions.

As a result, Claude tends to be more cautious in its recommendations when sources are weak, contradictory or too promotional.

Copilot

Copilot can combine the web, Bing, Microsoft context and, in corporate environments, Microsoft Graph. Its matching changes substantially depending on whether the user is in a public or enterprise environment.

In an enterprise environment, the following may matter:

internal documents;
user permissions;
emails, files or corporate knowledge;
public web sources;
productivity context;
the user's identity and role.

So GEO visibility in Copilot is not only a public-web problem. For B2B brands and institutions, internal documents, knowledge bases, PDFs, presentations, intranets, policies, product sheets and sales materials can also enter the picture.

Perplexity

Perplexity is more oriented toward answering with visible sources. Its matching usually rewards direct, updated, citable content designed to answer without detours. It may select several sources and build an answer with references.

To perform well in Perplexity, a source should be:

clear in the title;
explicit in the answer;
updated;
easy to cite;
not excessively blocked to crawlers;
useful as evidence;
specific enough for the query.

In GEO, Perplexity is especially useful for observing which sources the system considers citable in a specific category.

Why domains are not selected only by authority

In classic SEO we often look at domain authority, keywords, backlinks and general ranking. In generative retrieval, that is not enough.

A very strong domain may fail to provide the right fragment for a specific question. A smaller source may become decisive if it contains exactly the passage that matches the user's objective.

The system may evaluate:

Semantic match: whether the chunk answers the prompt objective.
Freshness: whether the information is up to date.
Geographic proximity: whether the source is relevant for Spain, Europe, LATAM or another market.
Language and localization: whether the content matches the user's real language.
Customer-persona fit: technical buyer, family user, CMO, patient, student, traveler.
Purchase factors: price, trust, warranty, availability, reputation.
Brand attributes: innovation, sustainability, reliability, luxury, security, service.
Source type: media, review platform, official website, forum, marketplace, technical documentation.
Consistency: whether independent sources confirm the same point.
Sentiment: whether the source improves or weakens brand perception.
Crawlability: whether the content is easy to read, extract, index and cite.
Structured data: schema, tables, FAQs, metadata, dates, authorship and clear entities.

The consequence is uncomfortable for many inherited strategies: in GEO, the biggest domain does not necessarily win. The winning content is the content the system can find, understand, compare and use to solve a specific question.

UGC sources inside the GEO flow

UGC layers have a particular role inside this system. They are rarely the only source of an answer. Still, they can act as validation signals when the user asks about trust, service quality, satisfaction, frequent problems or comparison between brands.

For an LLM, a real-user experience source can provide:

evidence of real user experience;
natural language about problems and benefits;
aggregated sentiment;
recurrence signals;
customer-persona vocabulary;
brand attributes that do not appear on the official website;
contrast against the brand's commercial claim.

This explains why reviews matter in GEO even when they are not always the most cited source. Sometimes their value sits elsewhere: confirming or challenging what other sources say about the brand.

A brand with good corporate content but weak or contradictory reviews may receive more cautious answers. A brand with limited media presence but clear satisfaction signals in review sources may reinforce attributes such as trust, service, reliability or value for money.

Technical validation of sources in the main LLMs

Source validation is not uniform either. Each system decides in its own way what is sufficient, reliable, fresh, citable or useful.

A validation layer may evaluate:

Recency: publication date, update date, topic volatility.
Entity: whether the source correctly identifies the brand, product, person, institution or location.
Authorship: visible author, responsible organization, sector reputation.
Evidence: data, tables, methodology, examples, cases, citations, documentation.
Consistency: whether the fragment matches other retrieved sources.
Conflict: whether there are contradictions, complaints, controversies or recent changes.
Sentiment: positive, neutral, negative, defensive or critical tone.
Citability: whether the source can be cited clearly and stably.
Crawler accessibility: whether the content is available to bots without relying on scripts, images or unnecessary walls.
Specificity: whether it answers the concrete question or only provides generic information.
Contextual fit: whether it fits market, language, persona, product line and intent.

ChatGPT

ChatGPT may evaluate retrieved sources according to usefulness for the answer, freshness, coherence and ability to solve the question. In products with citations, it also needs to select sources it can show the user as support.

A source may be excluded even if it is well known when:

the fragment does not answer the question;
it is outdated;
it is too promotional;
it provides no evidence;
it contradicts more specific sources;
it does not fit the user's market;
it is not useful for the generated subquestion.

Gemini

Gemini with Grounding can rely on Google Search, but generative validation prioritizes fragments that fit intent, freshness and consistency. SEO strength helps a page enter the candidate set. It does not guarantee inclusion in the answer.

Gemini may behave differently for:

local searches;
YMYL topics;
commercial queries;
comparative prompts;
date-sensitive questions;
brand entities with conflicting information.

Claude

Claude tends to be more cautious when sources are unclear or when retrieved context does not support a strong conclusion. In validation, that leads to more nuanced, less categorical answers that depend more on contrasts between sources.

For GEO, the practical reading is direct: a brand must provide clear evidence, not only claims. Claude may penalize ambiguous, overly commercial or hard-to-verify content.

Copilot

Copilot can validate web sources alongside internal sources when it operates inside a Microsoft environment. In that case, relevance does not depend only on the public web. Permissions, corporate context, internal documents, calendar, email, files and knowledge bases also count.

A brand may be well represented on the open web but appear differently in Copilot if the user works inside an enterprise ecosystem with outdated, contradictory or incomplete internal documentation.

Perplexity

Perplexity tends to show sources visibly. Citability matters a lot. A clear, current and specific source may have an advantage over a broader but less direct one.

For GEO, Perplexity makes it easier to observe which URLs it considers useful support for an answer. That makes it a good system for analyzing citable source maps by category.

Source validation

Before generating the final answer, the system may evaluate contextual authority, freshness, consistency, conflicts between sources, sentiment, affinity with the question and real usefulness for the user.

It may also detect contradictory sources. For example:

an official website claims a benefit;
a review platform shows recurring complaints;
an industry outlet validates an advantage;
a forum introduces negative sentiment;
a local comparison favors a competitor.

The final answer is built from that tension between sources. That is why a brand should not fixate only on "appearing". The better question is more precise: which fragments appear, how each LLM represents them, with which sentiment, for which customer persona they are relevant, and in which context they are available to generative systems.

Answer construction

Once the relevant fragments have been selected, the LLM builds the answer by combining:

the user's original prompt;
the semantic interpretation of the task;
the retrieved chunks;
system instructions;
safety policies;
generation parameters;
expected style;
token limit;
temperature;
top-p.

The answer is not a copy of the sources. It is a generative synthesis. The model summarizes, compares, orders, softens uncertainty and decides what to omit and what to emphasize. That is why the final result depends both on the retrieved documents and on how the model interprets their relevance.
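To make the last two items in that list concrete, here is a minimal sketch of temperature scaling followed by top-p (nucleus) filtering over a toy next-token distribution. The candidate words and logits are invented for illustration, and the order in which a given provider applies these steps is an assumption.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalise the survivors

def sample_next(tokens, logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token after temperature scaling and nucleus filtering."""
    probs = softmax(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    choice = rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]
    return tokens[choice]

# Hypothetical candidates for the next word of an answer about a brand.
TOKENS = ["reliable", "popular", "cheap", "risky"]
LOGITS = [2.0, 1.5, 0.5, -1.0]

print(sample_next(TOKENS, LOGITS, temperature=0.7, top_p=0.9))
```

With a low top-p the tail candidates never get sampled, which is one reason the same retrieved fragments can produce differently worded answers across runs and products.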

Where GeoRadar, S.A.M. and LEO enter

This is where GeoRadar stops being a reporting tool and becomes a research tool.

In that methodological layer, the stack has a specific role. GeoRadar observes model behavior: prompts, answers, sources, attributes and competitors. S.A.M. helps check whether content is semantically aligned with the prompts and attributes we need to cover. LEO turns that reading into a multichannel activation logic. The goal is practical: to make traceable which question activates which source, which fragment supports which answer and which content gap still needs to be covered.

If we launch thousands of personalized prompts crossed by customer persona, market, product line, intent, purchase factors and competitors, and we also use a system that checks whether those questions cover the real business space, we can exercise the models in a way very similar to how real users would during ordinary sessions.

Then we store and analyze those conversations. This reveals which sources LLMs retrieve again and again, how visible and recommended we are, which attributes they associate with our brand, what sentiment they use, how they compare us with competitors, in which markets we appear better or worse, and which content, reputation or authority gaps are shaping the answers.

The result is no longer a single screenshot. It is a traceable map of how generative models understand, validate and recommend a brand inside its category.

Why volume matters

One isolated conversation does not diagnose a market. A prompt may be biased by wording, session, model, timing or search configuration.

To read the market reliably, you need volume and semantic coverage. That means building a broad dataset that crosses:

customer personas;
markets;
languages;
product lines;
funnel stages;
intents;
purchase factors;
competitors;
brand attributes;
informational, comparative, transactional and skeptical prompts.

When that volume is combined with a completeness algorithm, the system can detect whether the question map sufficiently covers the real decision space. That is the difference between running a curious test and building a GEO audit with business value.
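A rough sketch of what crossing those dimensions can look like, assuming three hypothetical dimensions and one prompt template per intent. The `coverage` function is only a stand-in for a completeness check: it reports the share of grid cells that already have a stored conversation, not the completeness algorithm itself.

```python
from itertools import product

# Hypothetical dimensions; a real audit would use the brand's own lists.
DIMENSIONS = {
    "persona": ["student", "freelancer"],
    "market": ["ES", "MX"],
    "intent": ["informational", "comparative", "skeptical"],
}

# Hypothetical templates, one per intent.
TEMPLATES = {
    "informational": "What should a {persona} in {market} know about online banks?",
    "comparative": "Which online bank is best for a {persona} in {market}?",
    "skeptical": "Are online banks actually safe for a {persona} in {market}?",
}

def build_prompt_grid(dimensions, templates):
    """Cross every dimension value and render one prompt per combination."""
    keys = list(dimensions)
    grid = []
    for values in product(*(dimensions[k] for k in keys)):
        cell = dict(zip(keys, values))
        cell["prompt"] = templates[cell["intent"]].format(**cell)
        grid.append(cell)
    return grid

def coverage(grid, dimensions, observed):
    """Share of grid cells with at least one stored conversation.

    `observed` is a set of (persona, market, intent) tuples already run.
    """
    keys = list(dimensions)
    cells = {tuple(c[k] for k in keys) for c in grid}
    return len(cells & observed) / len(cells)

grid = build_prompt_grid(DIMENSIONS, TEMPLATES)
print(len(grid), "prompts;", grid[0]["prompt"])
```

Adding languages, product lines, funnel stages and competitors multiplies the grid quickly, which is why the dataset reaches thousands of prompts before it covers a real decision space.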

The critical point

In GEO, data without precision and volume has little value. Publishing a lot without knowing what, how and for whom is shooting blind.

The advantage appears when a brand knows which questions activate its category, which sources feed the answers, which attributes are assigned to it, which competitors occupy its space, and which content must be created, corrected or activated so the brand can be understood, validated and recommended by generative models.

From diagnosis to brand and business impact

To create real brand and business impact, an excellent study is not enough. Diagnosis is the first layer. Then you need a well-prioritized action plan and tools to execute it with the right scale, precision and speed.

GEO work must convert market reading into operational decisions:

which sources to activate first;
which content to create;
which content to correct;
which knowledge gaps to close;
which attributes to reinforce;
which markets to prioritize;
which customer personas to target;
which competitors to displace;
which prompts should start returning a different answer.

Prioritization cannot be generic. Sources must be ranked by expected return against effort and investment. One source may have strong semantic influence while requiring too much cost. Another may have less global weight and still offer immediate return if it helps validate a business attribute, correct negative sentiment or gain presence in a specific market.

You also need realistic technical GEO with measurable return. Not everything is fixed by publishing content. Sometimes the problem is that the website is not easy for crawlers to read, does not expose entities well, structures data poorly, uses the wrong schema, hides critical information in inaccessible assets, or does not provide clean routes for bots to understand products, services, experts, locations, cases and claims.

The content strategy must be prioritized according to brand objectives. Two areas deserve particular attention:

Knowledge gaps: what LLMs do not know, cannot find or misinterpret about the brand.
Competitive benchmark: what competitors already occupy across prompts, sources, attributes, recommendations and markets.

There is another point: content should not be created only to please humans or cover keywords. It must also sit close, in embedding space, to the questions AI systems compare against the available fragments. That requires content aligned with specific prompts, specific customer personas and specific business attributes.

That is why tools are needed to evaluate content before publication, help create it quickly and show which prompts it is designed to serve. Writing an article about a topic is not enough. You need to know whether that article has a real probability of being retrieved, understood and used by the model to build an answer.

Conversational behavior also has to be analyzed. A brand may appear well in the first prompt and disappear when the second, third or fourth prompt arrives in the same conversation. That matters because many users do not decide with a single question. They compare, ask follow-ups, request alternatives, filter by price, location, trust, specific use case or reputational doubt.

The opportunity is large: increasing conversion in a scenario where organic web traffic is consistently falling and where more decisions start, advance or close inside generative interfaces. GEO opens a new discipline. It is not only about attracting visits. It is about being recommended when the user is already asking for help deciding.

Being recommended by generative models means competing for the attention and trust of more than one billion users. Naturally, this requires new tools, new methods and a different way to connect data, content, sources, technology and business.

Technical sources consulted

OpenAI Help Center: ChatGPT Search, external search providers and query handling.
OpenAI: Introducing ChatGPT Search, description of ChatGPT Search and external providers.
OpenAI Help Center: What are tokens and how to count them?, tokenization and variation by model/encoding.
Google AI for Developers: Grounding with Google Search, Gemini connection with real-time web content.
Google AI for Developers: Understand and count tokens, tokenization and token counting in Gemini.
Microsoft Learn: Grounding with Bing Search, generated query flow, Bing results and final model answer.
Microsoft Learn: Data, privacy, and security for web search in Microsoft 365 Copilot, queries generated by Copilot and sent to Bing.
Anthropic Docs: Web search tool, web search, results and citations in Claude.
Perplexity Docs: Search API, structured results and differences with Sonar.
Perplexity Docs: Sonar API, web-grounded answers with search_results and citations.
Lewis et al.: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, foundational paper on RAG.
Karpukhin et al.: Dense Passage Retrieval for Open-Domain Question Answering, dense question-passage retrieval.
Thakur et al.: BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models, benchmark for information retrieval.
Reimers and Gurevych: Sentence-BERT, sentence embeddings through siamese networks.
Stanford NLP: Introduction to Information Retrieval, fundamentals of ranking, lexical retrieval and classical IR models.
Johnson, Douze and Jégou: Billion-scale similarity search with GPUs, FAISS architecture for large-scale vector search.
Malkov and Yashunin: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs, basis of HNSW for ANN.
Khattab and Zaharia: ColBERT, late interaction for efficient and accurate ranking.
MTEB: Massive Text Embedding Benchmark, benchmark for embeddings across tasks and languages.
Google DeepMind: XQuAD, multilingual question-answering dataset with 240 paragraphs and 1190 question-answer pairs.
scikit-learn: nDCG score, ranking metric with logarithmic discount by position.
SciPy: Spearman rank correlation, nonparametric correlation between rankings.
scikit-learn: F1 score, harmonic mean of precision and recall.
Projecte AINA / CLUB: Catalan CLUB benchmark, Catalan classification, STS and QA tasks.
Projecte AINA: TeCla, Catalan thematic classification corpus.

It's January 32: how to know if an AI trained on your content, and prove it

carlosortet — Fri, 10 Jul 2026 05:26:37 +0000

I had a hunch, and it turned out to be published. The idea was simple. Mark your content before you publish it, and if an AI model trains on it, that model should carry the mark inside. And if someone takes that model to train another one (a Chinese one, say), the new one carries it too. A trail that spreads down the chain of models like radioactive ink.

The second part of the hunch was the carrier. The stubbornest way to plant that trail? Not cryptography. Not metadata. We tried rare token combinations, but which combinations count as rare differs by model, and there is nothing an LLM likes more than killing a low-probability chain of tokens, so the LLM's immune system is likely to delete the mark along the way. Maybe the best way to attack this challenge is an impossible lie tucked inside a normal sentence. "Today, January 32, 2026, a new fungus was described." To a human, it's obvious: January 32 doesn't exist. Tokenizers and cleaning pipelines don't check whether a date can exist, so a date like that slips right through. And if it turns up again in someone else's sentence, there's only one plausible source: you.

I set out to knock down my own ideas, which is really the only way to find out if they hold up. Most people just assume proving an AI trained on your work is impossible unless you crack the model open. I was convinced of the opposite. The research proved me right on the essentials, and it forced me to be honest about the limits too. Both matter.

An LLM watermark exists, and you detect it by counting

I start with the mechanism, because without it nothing else stands. Marking the text a model generates has been a solved problem since 2023. The seminal work is Kirchenbauer et al. (ICML 2023). Before each token, the model looks at the previous token, mixes it with a secret key, and splits the vocabulary in two: a "green list" and the rest. It nudges up the probability of the green tokens, and the model ends up choosing green almost every time. Reading it, you can't tell. The distribution shifts just enough.

Detecting is even cheaper. You take the suspect text, recompute the green list with the key, and count how many green tokens there are. If there are far more than chance would dictate, you have your proof: a statistical test that returns a p-value. Kirchenbauer reports z above 4 and p close to 10⁻⁶. You don't need the model. Just the text and the key.
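The counting step can be sketched in a few lines. This is a simplified stand-in, not the paper's implementation: the green list is decided by hashing the key, the previous token and the candidate token, where the original seeds a random partition of the vocabulary. The toy generator, the vocabulary and the key are hypothetical, and whole words stand in for tokens.

```python
import hashlib
import math

GAMMA = 0.5  # fraction of the vocabulary placed on the green list

def is_green(prev_token, token, key, gamma=GAMMA):
    """Pseudo-randomly assign `token` to the green list, seeded by key and previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < gamma

def green_z_score(tokens, key, gamma=GAMMA):
    """One-proportion z-test: are there more green tokens than chance predicts?"""
    scored = len(tokens) - 1  # the first token has no predecessor
    if scored <= 0:
        raise ValueError("need at least two tokens")
    greens = sum(is_green(p, t, key, gamma) for p, t in zip(tokens, tokens[1:]))
    z = (greens - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided normal tail
    return z, p_value

def toy_watermarked_tokens(vocab, length, key, start="<s>"):
    """Toy generator that always picks a green candidate when one exists."""
    tokens = [start]
    for step in range(length):
        shift = step % len(vocab)  # rotate so the output is not one repeated word
        candidates = vocab[shift:] + vocab[:shift]
        green = [t for t in candidates if is_green(tokens[-1], t, key)]
        tokens.append(green[0] if green else candidates[0])
    return tokens

VOCAB = [f"w{i}" for i in range(20)]  # hypothetical 20-word vocabulary
marked = toy_watermarked_tokens(VOCAB, 50, key="secret")
print(green_z_score(marked, key="secret"))  # scored with the right key
print(green_z_score(marked, key="other"))   # scored with the wrong key
```

The detector never calls a model: it needs only the text and the key, which is the property the rest of this article builds on.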

Marking is cheap and the proof is purely statistical. The same mechanical backbone holds up the whole field. Source: author's recreation of Kirchenbauer et al., ICML 2023.

This is not blackboard theory. Google deployed SynthID-Text inside Gemini and validated it over about 20 million real responses with no perceptible quality drop, published in Nature in 2024. There are whole families of variants: some that don't touch the output distribution at all (the "distortion-free" schemes of Aaronson and of Kuditipudi et al.), others robust to paraphrasing through semantics (SIR), others that embed a whole 32-bit message and not just one bit (MPAC). The field is mature on the provider side.

But that marks the output of one model. My hunch was about something else: marking my corpus, on the content owner's side, before any model touches it.

Radioactivity: the trail that passes from one model to another

My theory has an academic name, and it comes from Meta. It's called radioactivity. Sander et al., presented as a spotlight at NeurIPS 2024, prove exactly what I suspected: if a model A emits marked text and a model B trains on it, the mark leaves a detectable residue in B.

The margins are huge. With open access to the weights and no supervision, the trail is detected with p below 10⁻⁵ when only 5% of the training data is marked. In the supervised case it drops to p below 10⁻³⁰ with barely 1%. The term was coined by Sablayrolles et al. in 2020 for images, where 1% of marked data was enough for p below 10⁻⁴. And it survives pretraining from scratch, not just fine-tuning: Sander confirmed it training models with 1 billion parameters on 10 billion tokens.

One hop is proven. The whole chain is an open frontier. Source: author's recreation of Sander et al., NeurIPS 2024, and Gu et al., ICLR 2024.

This is where you have to get honest, which is what separates a serious article from a brochure. The single A→B hop is proven. The A→B→C chain has not been measured by anyone. Every radioactivity paper measures a single hop. None checks empirically that the mark survives a second chained training. And the adjacent evidence warns of decay, not intact propagation:

A second cleaning pass over the same model already halves the significance (Sander, purification section). It's not a third generation, it's cleaning the same B.
Gu et al. (ICLR 2024) show that a mark distilled into the weights does not survive a second fine-tuning with clean text. The mechanism that would carry the mark into a distilled model is fragile precisely against the kind of training a chain implies.
Five chained rewrites at inference drop detection to around 4.9% (Chainwash, 2026).

The real case everyone cites, DeepSeek accused of distilling from OpenAI, is not a tracked mark either. It rests on style similarity, around 74%, and on traffic patterns. No serious source says otherwise: today there is no method to prove it conclusively. And that is exactly why marking proactively matters. If OpenAI had marked its output from the start, the proof would be far cleaner.

So the honest thesis is this: the claim that your content trained model A, and that the trail shows up in A, is solid and sellable. The whole chain is a fascinating open frontier. Framing it that way is stronger, not weaker, because you're posing a hypothesis science has not yet closed.

Your best invisible ink is a lie: January 32

Anyone who has played with steganography thinks first of invisible characters: zero-width spaces, Cyrillic letters that look Latin, Unicode marks the eye can't see. It's elegant. For this problem it's a mistake.

The pipelines that prepare data before training are built precisely to erase that layer. Tokenization and Unicode normalization eat the invisible characters and homoglyphs before the model ever sees them. AWS and Cisco treat them directly as an attack surface and filter them by default. Your mark disappears at the first filter.

The impossible date goes the opposite way. It doesn't hide in the bytes, it hides in the meaning. "January 32" are perfectly ordinary tokens. The "32" is a single stable token in the GPT-4 tokenizers (cl100k and o200k use the regex \p{N}{1,3}), and "2026" always splits the same way into [202, 6] (tiktoken; Singh and Strouse, 2024). The tokenizer couldn't care less that day 32 doesn't exist: it treats it the way it would treat the 30th. The oddity is in the meaning, not in the bytes.
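The digit rule is easy to see in isolation. The sketch below uses Python's `\d{1,3}` as a simplified stand-in for `\p{N}{1,3}` (the standard `re` module has no `\p{N}` class) and shows only the pre-tokenization split, not the BPE merges that follow it.

```python
import re

# Runs of digits are cut into chunks of at most three, left to right,
# before any BPE merge is applied. \d is a stand-in for \p{N} here.
DIGIT_CHUNKS = re.compile(r"\d{1,3}")

def digit_pieces(text: str) -> list[str]:
    """Return the digit chunks a \\p{N}{1,3}-style rule would produce."""
    return DIGIT_CHUNKS.findall(text)

print(digit_pieces("Today, January 32, 2026"))  # ['32', '202', '6']
print(digit_pieces("Today, January 30, 2026"))  # ['30', '202', '6']
```

The impossible day and the valid one produce the same shape of pieces. To check real token IDs, the same strings can be passed to `tiktoken.get_encoding("cl100k_base").encode(...)`.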

The pipeline erases bytes and metadata, not meaning. An impossible value rides on ordinary tokens.

And this isn't new, which is the best thing about it. Mapmakers have spent a century planting towns that don't exist to catch copycats. The most famous, Agloe (New York), was an anagram of the initials of two draftsmen. Dictionaries do the same with fake words: "esquivalience" was slipped on purpose into the New Oxford American Dictionary in 2001. Academia already brought it to language models with a name of its own, "copyright traps" (Meeus et al., ICML 2024). There's even case law: in Feist v. Rural (US Supreme Court, 1991), the copied directory included four fictional entries planted to catch copies.

The memorization wall

Here comes the nuance that decides whether this is a toy or a tool. Marking is trivial. Detecting requires the model to have learned the mark, and models only memorize what they see many times.

Meeus's numbers are harsh. A 100-token fictional sentence repeated 1,000 times is detected with an AUC of 0.748. The same sentence at 25 tokens, no matter how much you repeat it, stalls at 0.557, near chance. A single appearance is practically undetectable. Wei, Wang and Jia (Findings of ACL 2024) set the bar on a giant model: a statistical mark is detectable in BLOOM-176B if it appears at least 90 times.

Marking is trivial; detecting requires memorization: it takes length and repetition. Source: author's recreation of Meeus et al., ICML 2024.

There's a beautiful paradox on the way: the rarer you make the mark so it memorizes better, the more you risk the quality filter discarding it for being rare. The same perplexity that helps detection can get the document thrown out before it enters.

The ceiling: no strong mark is unerasable

Zhang et al., "Watermarks in the Sand" (ICML 2024), prove that no strong watermarking scheme is unbreakable against an adversary with a quality oracle and a perturbation oracle, both realizable with another LLM. Separately, Jovanović et al. (ICML 2024) show that an attacker can reverse-engineer the green-list rules by querying the public API for under 50 dollars, and then forge or erase the mark with over 80% success.

That's why industry and regulation combine layers instead of trusting everything to the watermark, and why technical honesty is not an ornament here, it's the product.

The other door: not whether they remember you, whether they cite you

More and more models don't just train and store, they go out to search at the moment of answering. ChatGPT with search, Google's AI Overviews, Perplexity. There the question changes: not "do they remember me?", but "am I the source they're citing, and can I prove it?".

The complementary face of watermarking: it doesn't require them to memorize you. Source: author's recreation of Seiden et al., 2026.

Here the watermark changes shape: canary tokens. You serve a unique identifier, different for each bot that crawls you, then query the AI search systems. If one spits out your token, you've proven the whole chain (Seiden et al.). It needs no memorization, and it gets detected almost instantly.
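One way to picture the per-bot identifier, as a minimal sketch: derive a short token from a private key, the crawler's name and the page URL, embed it in the page served to that crawler, then look for it in AI answers. The secret, the token format and the helper names are hypothetical, and this is not the method of Seiden et al.

```python
import hashlib
import hmac

SECRET = b"replace-with-a-private-key"  # hypothetical; keep it out of the repo

def canary_for(bot_name: str, url: str, secret: bytes = SECRET) -> str:
    """Derive a stable, unguessable identifier for one crawler on one page."""
    message = f"{bot_name}|{url}".encode()
    return "cx" + hmac.new(secret, message, hashlib.sha256).hexdigest()[:12]

def attribute(answer_text: str, bots, url: str, secret: bytes = SECRET):
    """Return the crawlers whose canary shows up in an AI answer."""
    return [b for b in bots if canary_for(b, url, secret) in answer_text]

BOTS = ["GPTBot", "PerplexityBot", "ClaudeBot"]  # user-agent names used as labels
URL = "https://example.com/post"                 # placeholder page

for bot in BOTS:
    print(bot, canary_for(bot, URL))
```

Because the token is derived rather than stored, you can recompute it later for any bot and page without keeping a lookup table, and only the holder of the secret can produce a matching value.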

What we're building on top of this

At 498A, Zoopa's R&D lab, we tested this with real clients. A food-industry client asked us to verify their content was the real source behind what a model kept repeating. Out of 20 marked pieces of content, we recovered the key in 14. Seventy percent, not a hundred, and that was already solid proof.

With a public institution we went further: we found the same trace in a second model that, everything suggests, trained on the first model's outputs. That's the propagation chain no academic paper has measured yet in a controlled way. A field observation, not a closed scientific proof, but a real one.

It's not a silver bullet. It's a layer almost nobody has, at a moment when your content has started working for free inside machines you don't control. An invisible character gets erased by the system without a thought. A date that cannot exist would only be erased by someone who understands what they're reading.

Technical sources

Kirchenbauer, Geiping, Wen, Katz, Miers, Goldstein: A Watermark for Large Language Models, ICML 2023.
Sander, Fernandez, Durmus, Douze, Furon (Meta FAIR): Watermarking Makes Language Models Radioactive, NeurIPS 2024.
Gu, Li, Liang, Hashimoto (Stanford): On the Learnability of Watermarks for Language Models, ICLR 2024.
Meeus, Shilov, Faysse, de Montjoye: Copyright Traps for Large Language Models, ICML 2024.
Wei, Wang, Jia: Proving Membership in LLM Pretraining Data via Data Watermarks, Findings of ACL 2024.
Zhang, Edelman, Francati, Venturi, Ateniese, Barak: Watermarks in the Sand, ICML 2024.
Jovanović, Staab, Vechev (ETH): Watermark Stealing in Large Language Models, ICML 2024.
Seiden, Ren, Zhang, Kim, Liu, Wenger: Identifying AI Web Scrapers Using Canary Tokens, 2026.
Similarweb: The Downstream Impact of AI Visibility, data Jul-Dec 2025.

Full source list (30+ papers) in the canonical article.

48,000 characters in 2,700 tokens: let's discuss how LLMs read text as images

carlosortet — Sun, 05 Jul 2026 10:29:05 +0000

Last December I came across pxpipe, a small open-source project that was trending on GitHub with a claim that sounds like a billing error: the same Claude Code session that cost $42.21 as plain text cost $4.51 when the bulky parts of the request were converted into PNG images before leaving the machine.

Same model, same task, same answers. The only difference was that the model read most of its context with its eyes instead of its tokenizer. It sounded silly, but it worked: take the text, render it as an image, and hand it to the LLM, which reads it back as text.

The A/B that made the rounds: same session, $42.21 as text versus $4.51 with imaged context. Source: pxpipe repo (MIT).

That number is worth taking apart, because the mechanism behind it is bigger than one tool. It explains a real cost lever for anyone running agents or long-context pipelines. And it connects to a second question that most people haven't linked to it yet: when an AI system fetches your website in real time, what does it actually see? The answer, increasingly, involves the same vision channel.

The mechanic: images are billed by pixels, not by characters

Every multimodal model prices an image by its dimensions. A 1000x1000 pixel image costs the same number of tokens whether it contains a white void or 5,000 words of dense documentation. The token meter counts pixels. It doesn't care what's painted on them.

Text, on the other hand, is billed by the token, and the token count grows with the number of characters. On real Claude Code traffic, dense content like code, JSON and logs runs at roughly one character per text token. The same content rendered into a tightly packed image runs at about 3.1 characters per image token.

That gap is the whole trick. The original example from the pxpipe documentation: a block of roughly 48,000 characters (system prompt plus tool documentation) costs about 25,000 tokens as text. Rendered as a dense PNG, it costs about 2,700 image tokens. Same information, 89% fewer tokens for that block. Simple.

What the model actually receives: 48,000 characters whitespace-minified, reflowed into full rows, with ↵ marks where the original line breaks were. Source: pxpipe repo (MIT).

To make that page readable for a vision encoder, the text goes through three preparation steps. Redundant whitespace gets stripped. The lines get reflowed to fill complete rows of the image, wrapping at a render width of about 1,928 pixels. And small ↵ marks get inserted where the original newlines were, so the structure survives the compaction. A dense page packs around 92,000 characters.

Why pixels beat words at carrying words

The counterintuitive part isn't the billing. It's that the model can actually read this very well.

Sean Goedecke wrote a good analysis of why the math works. A text token is a discrete choice: one option out of a vocabulary of roughly 50,000 entries, which then gets mapped to an embedding vector. You probably know all this. But you probably did not notice that an image token has no such constraint. It can occupy any point in the embedding space, which makes it far more expressive per unit. DeepSeek's research puts a number on it: about 10 text tokens can be recovered from a single image token with near perfect accuracy. This is what shocked me big time: the accuracy!

The idea isn't new, and that's part of what makes the current moment interesting. I saw it in a post, tried it, and have used it since. People have been pasting screenshots into multimodal models for years, and the standing objection has always been the same: vision encoders weren't reliable enough at reading small, dense type, so nobody sane would route production context through them. What changed between 2024 and now is the encoders.

When DeepSeek published its optical compression results, and pxpipe shipped its evals, the conversation moved from "would this even work" to "here's the exact ratio where it stops working". That's the difference between a party trick and an engineering option.

Here's how the main providers actually count image tokens today:

Provider	How an image is priced	A 1000x1000 px image
Anthropic (Claude)	28x28 pixel patches, one visual token each: ⌈w/28⌉ x ⌈h/28⌉	1,296 tokens
OpenAI (high detail)	85 base tokens + 170 per 512x512 tile after rescaling	765 tokens
Google (Gemini)	258 tokens flat if ≤384 px, otherwise 258 per 768x768 tile	1,032 tokens

On current Claude models (Fable 5, Opus 4.7 and 4.8), an image is capped at 4,784 visual tokens with a long edge of up to 2,576 pixels. The pxpipe render width of 1,928 pixels sits comfortably under that ceiling. None of this is an accident.

The fine print per provider

The details matter more than the headline formula, because each provider hides different tradeoffs in them. This is what makes life interesting, isn't it?

Anthropic's older models capped images at 1,568 pixels on the long edge and 1,568 visual tokens. Anything larger got downscaled before processing, which silently capped your cost and your resolution at once. You'll still find the legacy approximation "tokens equals width times height divided by 750" in older writeups. The patch math replaced it: 28 by 28 pixels per visual token, ceiling-rounded on each axis. For a 1000x1000 image both formulas land within 3% of each other, which is why the old rule survived so long.

OpenAI works in two steps: the image gets scaled to fit inside a 2048 pixel square, then its shortest side gets brought to 768 pixels, and only then does the 512 pixel tiling apply. A "low detail" flag skips all of it for a flat 85 tokens, which is useless for reading text but fine for "is this a cat".

Google's approach is the bluntest: anything up to 384 pixels on both sides is 258 tokens, period. Beyond that, 768 pixel tiles at 258 each.
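The three pricing rules from the table and notes above fit in a few functions. Treat this as a sketch of the formulas as this article states them, not as any provider's official calculator. Two assumptions are mine: the OpenAI steps only ever downscale, and the Gemini tile count is a plain ceiling on each axis.

```python
import math

def claude_image_tokens(width: int, height: int) -> int:
    """28x28 pixel patches, rounded up on each axis."""
    return math.ceil(width / 28) * math.ceil(height / 28)

def claude_legacy_tokens(width: int, height: int) -> int:
    """Older approximation: width times height divided by 750."""
    return round(width * height / 750)

def openai_high_detail_tokens(width: int, height: int) -> int:
    """Fit in 2048x2048, bring the shortest side to 768, then count 512 px tiles."""
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    scale = min(1.0, 768 / min(w, h))  # assumption: never upscale
    w, h = round(w * scale), round(h * scale)
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

def gemini_image_tokens(width: int, height: int) -> int:
    """258 tokens flat for small images, otherwise 258 per 768x768 tile."""
    if width <= 384 and height <= 384:
        return 258
    return 258 * math.ceil(width / 768) * math.ceil(height / 768)

for name, fn in [
    ("Claude", claude_image_tokens),
    ("Claude (legacy rule)", claude_legacy_tokens),
    ("OpenAI high detail", openai_high_detail_tokens),
    ("Gemini", gemini_image_tokens),
]:
    print(f"{name}: {fn(1000, 1000)} tokens for 1000x1000 px")
```

Running the same dimensions through all three before choosing a render size shows why a page tuned for one provider can be a poor fit for another.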

Three consequences follow. First, the optimal render size is provider-specific, so a pipeline tuned for Claude wastes money on Gemini. Second, resizing before upload is your decision to make, and making it consciously beats letting the API downscale for you. Third, per-image caps mean very long documents want several medium-density pages rather than one gigantic one. pxpipe packs about 92,000 characters per page and simply emits an array of PNGs when the content overflows.

The idea in one picture: identical context, 90 tokens as text, 50 as an image. Source: "Text or Pixels? It Takes Half" (arXiv 2510.18279), CC BY 4.0.

What the research says

Two papers from October 2025 put this on solid ground.

"Text or Pixels? It Takes Half" tested the idea systematically on GPT-4.1-mini and Qwen2.5-VL-72B. The setup was careful: text typeset with LaTeX at 300 DPI, adaptive font sizing targeting a 0.8 fill ratio, resolutions from 600x800 to 750x1000 pixels. Rendering long text as images achieved roughly 2:1 compression (38% to 58% fewer decoder tokens) while keeping accuracy within 3 points of the text-only baseline. On needle-in-a-haystack retrieval the imaged arm scored 97% to 99%; on CNN/DailyMail summarization it beat dedicated compression baselines like LLMLingua-2 at matched ratios. And there was a bonus nobody expected: on Qwen2.5-VL-72B, the shorter decoder sequence made end-to-end inference 25% to 45% faster. Cheaper and quicker, from the same trick.

DeepSeek-OCR, published the same month, went further and mapped the limits. Their "contexts optical compression" work shows that while the ratio of text tokens to vision tokens stays under 10x, decoding precision holds at 97%. Push to 20x compression and precision drops to around 60%. The curve between those two points is the price list of this whole technique.

The boundary of the technique: precision holds near 97% below 10x compression, then degrades. On OmniDocBench, 100 vision tokens per page beat systems spending 256, and under 800 beat systems spending 6,000+. Source: DeepSeek-OCR (arXiv 2510.18234), CC BY 4.0.

The efficiency numbers are hard to ignore. DeepSeek-OCR reads a document page with 100 vision tokens and beats GOT-OCR2.0, which spends 256 per page. With fewer than 800 it outperforms MinerU2.0, which averages more than 6,000. In production, DeepSeek uses this to generate LLM training data at over 200,000 pages per day on a single A100-40G.

The paper's discussion section contains the most science-fiction idea in this whole space: optical memory. Store a conversation's older history as images at progressively lower resolution, so distant context literally fades the way human memories do. More on that at the end.

Inside a production pipeline

pxpipe is the most instructive implementation to read because it publishes its evals and its failure cases. It runs as a local proxy between your client and the Claude API. Requests pass through byte-identical except for three categories of content, each behind a profitability gate: tool results above roughly 6,000 characters of dense content, older conversation turns behind the live tail, and the static system prompt plus tool documentation slab.

Two design rules matter more than the rest. Recent turns always stay as text. And the static prefix is preserved untouched, because breaking prompt caching would eat the savings.
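A gate of that kind might look like the sketch below. It is not pxpipe's code: the `Block` type, the `min_saving` threshold and the size of the live tail are hypothetical, and only the 6,000-character threshold and the rule about recent turns come from the description above.

```python
from dataclasses import dataclass

MIN_DENSE_CHARS = 6_000  # threshold quoted above for tool results
LIVE_TAIL_TURNS = 2      # hypothetical: how many recent turns stay as text

@dataclass
class Block:
    kind: str             # "tool_result" or "turn"
    chars: int
    text_tokens: int      # measured or estimated cost as text
    image_tokens: int     # cost of the rendered PNG pages
    turns_from_end: int = 0

def should_image(block: Block, min_saving: float = 0.3) -> bool:
    """Return True only when imaging is allowed and clearly cheaper."""
    if block.kind == "turn" and block.turns_from_end < LIVE_TAIL_TURNS:
        return False  # recent turns always stay as text
    if block.kind == "tool_result" and block.chars < MIN_DENSE_CHARS:
        return False  # too small to be worth a PNG
    if block.text_tokens <= 0:
        return False
    saving = 1 - block.image_tokens / block.text_tokens
    return saving >= min_saving

# The 48,000-character example from earlier in the article.
slab = Block("tool_result", chars=48_000, text_tokens=25_000, image_tokens=2_700)
print(should_image(slab))
```

Expressing the gate in token savings rather than raw size keeps it honest for sparse content, where a PNG can cost more than the text it replaces.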

The measured results, from the project's own benchmark tables:

Test	Text baseline	With imaged context
Novel arithmetic, Claude Fable 5 (n=100)	100/100	100/100, with 38% fewer tokens
Gist recall A/B (n=98)	98/98	98/98
State tracking (n=18)	18/18	18/18
SWE-bench Lite (n=10)	10/10	10/10, requests 65% smaller
12-character hex recall, Opus (n=15)	15/15	0/15

Look at that last row again. We'll get back to it.

On SWE-bench Pro, the harder agentic benchmark, the imaged arm resolved 14 of 19 tasks against 15 of 19 for plain text, with requests 60% smaller and 18 of 19 runs agreeing on the final verdict. That's the honest shape of the tradeoff on difficult work: a small accuracy tax on the hardest problems, in exchange for requests that cost less than half as much.

End to end, over a snapshot of 13,709 real requests, the bill dropped 59%, from $100 to about $41. Traces where compression applied heavily ran 70% to 74% cheaper.

Two operational costs come with it. Encoding PNGs adds latency on large requests, noticeable when your pipeline is latency-sensitive rather than cost-sensitive. And the technique is battle-tested on ASCII and Latin-1 content; CJK text works but the project handles it conservatively, with less aggressive packing. If your context is mostly Chinese or Japanese, your compression ratio will be worse than the headline numbers.

When imaging your context helps, and when it makes it worse

Every writeup of this technique that only quotes the savings is doing you a disservice. The method is lossy, and it fails in a specific, nasty way.

Exact recall of short strings breaks first. In pxpipe's test, Opus recalled 12-character hex strings from imaged content 0 times out of 15. The newer Fable 5 managed 13 of 15. And the failures aren't errors you can catch: the model confabulates a plausible value with full confidence instead of saying it can't read it. The project documents a real case of a name being confidently misremembered from imaged history.

It's also model-dependent. The technique works because current frontier vision encoders (Claude Fable 5, GPT 5.6) read dense text almost perfectly. Opus 4.7 and 4.8 misread around 7% of the time, which is why pxpipe keeps them opt-in. A model without a strong text-reading encoder turns your compressed context into noise.

So imaging helps when the content is bulk context for reasoning (code, docs, logs, old conversation turns). It hurts when any byte matters exactly. Hashes, API keys, IDs, phone numbers and anything you'll need to quote back verbatim should stay as text, always.

One more caveat, and it's the strategic one: the arbitrage moves without warning. When Anthropic shipped Opus 4.7, the maximum native image resolution jumped from 1,568 to 2,576 pixels on the long edge, and typical screenshots went from around 1,558 tokens to 4,739. The same image, three times the cost, in one release. The lesson isn't "use the trick".

The lesson is: understand your provider's per-pixel economics and measure them, because they change without asking you.
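To make that repricing concrete, here is a minimal sketch of the arithmetic. The per-screenshot token counts are the ones quoted above. The 4,000-token text chunk that the screenshot replaces is a hypothetical figure chosen for illustration, and `imaging_savings` is an illustrative helper, not part of pxpipe.

```python
def imaging_savings(text_tokens: int, image_tokens: int) -> float:
    """Fraction of tokens saved by sending a chunk as an image instead of text.

    A negative value means the image now costs more than the text it replaces.
    """
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    return 1 - image_tokens / text_tokens


# Hypothetical chunk: one screenshot standing in for 4,000 text tokens.
TEXT_TOKENS = 4_000

before = imaging_savings(TEXT_TOKENS, 1_558)  # screenshot cost before the release
after = imaging_savings(TEXT_TOKENS, 4_739)   # screenshot cost after the release

print(f"before: {before:+.0%}, after: {after:+.0%}")
# prints: before: +61%, after: -18%
```

Under these assumed numbers, the same chunk flips from a clear saving to a net loss without a single line of your own code changing, which is why the measurement has to be repeated rather than trusted.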

Try it in ten minutes

If you run Claude Code, the experiment costs two commands:

npx pxpipe-proxy
ANTHROPIC_BASE_URL=http://127.0.0.1:47821 claude

A dashboard at 127.0.0.1:47821 shows every conversion and the tokens saved. The proxy only touches allowlisted models; everything else passes through byte-identical, and you can kill it entirely with PXPIPE_MODELS=off.

For your own pipelines, the same machinery is available as a library:

import { renderTextToPngs, transformAnthropicMessages } from "pxpipe";

const imgs = await renderTextToPngs(bigToolResultText);
const { body, applied } = await transformAnthropicMessages({
  body: requestBytes,
  model: "claude-fable-5",
});

Before trusting any number, measure your own baseline. Anthropic's count_tokens endpoint is free, and comparing imaged versus plain requests on your real traffic takes an afternoon. My checklist for anyone adopting this:

Gate by density. Prose compresses poorly; code, JSON and logs compress well.
Keep the live tail of the conversation as text, always.
Keep anything byte-exact (hashes, keys, IDs, amounts) out of the imaged path.
Re-measure after every model release. The Opus 4.7 repricing tripled image costs overnight.
Log what was compressed. When a confabulation shows up, you'll want to know what the model was reading.
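The checklist can be sketched as a single gate function. This is a rough illustration under stated assumptions, not pxpipe's actual gate: the regular expressions, the `MIN_DENSITY` threshold and the `density` heuristic are all hypothetical, and only the 6,000-character size comes from the description above.

```python
import re

# Assumed patterns for values that must stay byte-exact; extend for your own data.
EXACT_PATTERNS = [
    re.compile(r"\b[0-9a-fA-F]{12,}\b"),                      # long hex strings (hashes, IDs)
    re.compile(r"\b(?:sk|pk|key)[-_][A-Za-z0-9_-]{16,}\b"),   # API-key-like tokens
]

MIN_CHARS = 6_000    # rough size gate quoted above for tool results
MIN_DENSITY = 0.15   # hypothetical threshold separating prose from code/JSON/logs


def density(text: str) -> float:
    """Share of characters that are neither letters nor whitespace."""
    if not text:
        return 0.0
    dense = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    return dense / len(text)


def should_image(text: str, is_live_tail: bool) -> tuple[bool, str]:
    """Return (decision, reason) so every decision can be logged."""
    if is_live_tail:
        return False, "live tail stays as text"
    if len(text) < MIN_CHARS:
        return False, "too short to pay for itself"
    if any(pattern.search(text) for pattern in EXACT_PATTERNS):
        return False, "contains byte-exact values"
    if density(text) < MIN_DENSITY:
        return False, "low density (prose)"
    return True, "dense bulk context"


json_log = '{"level": "info", "msg": "ok", "n": 42}\n' * 200
print(should_image(json_log, is_live_tail=False))
print(should_image(json_log + "commit 3f9a1c0b7e2d", is_live_tail=False))
```

Returning the reason alongside the decision covers the last checklist item: when a confabulation shows up, the log tells you whether the model was reading text or pixels.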

The part almost nobody connects: this is also how AIs read your website

Everything above treats text-as-image as a cost trick you apply to your own requests. Now flip the direction. When ChatGPT, Claude, Perplexity or an AI browser agent fetches a page from your site in real time, what do they see?

It turns out your site doesn't have one AI reader. It has three, and they disagree.

Your website has three AI readers now: the crawler, Google, and the agent. None of them see the same page.

The same page, three readings: raw HTML for crawlers, a rendered page for Google, pixels for agents.

Reader one: the crawlers. Each AI company runs three families of bots against your site. A training crawler (GPTBot, ClaudeBot) collects content for future models. A search-index bot (OAI-SearchBot, Claude-SearchBot) feeds the assistant's search layer. And a user-triggered fetcher (ChatGPT-User, Perplexity-User) hits your page live, in the seconds between a user's question and the answer. That last one is the "real time" in real-time AI answers.

As of mid 2026, none of them execute JavaScript. Not one, in any of the three families. Vercel's analysis of production traffic found GPTBot downloads JavaScript files about 11.5% of the time (ClaudeBot 23.8%) and never runs them. They fetch the raw HTML, extract what's there, and move on. No second pass, no rendering queue. If your content only exists in the DOM after client-side rendering, it doesn't exist for these systems.
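If you want to see which of the three families is hitting your own server, a small classifier over the User-Agent header is enough to start. A minimal sketch: the bot names are the ones listed above, while the family labels, the function name and the sample User-Agent string are illustrative assumptions.

```python
from collections import Counter
from typing import Optional

# Bot names as listed above; the family labels are just descriptive strings.
BOT_FAMILIES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "search-index",
    "Claude-SearchBot": "search-index",
    "ChatGPT-User": "user-triggered",
    "Perplexity-User": "user-triggered",
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the bot family for a User-Agent header, or None if unrecognized."""
    ua = user_agent.lower()
    for token, family in BOT_FAMILIES.items():
        if token.lower() in ua:
            return family
    return None


# Illustrative User-Agent values, as they might appear in an access log.
sample_agents = [
    "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/128.0",
]
counts = Counter(classify_user_agent(ua) or "other" for ua in sample_agents)
print(counts)
```

User-Agent strings are self-declared, so treat the result as a first pass over your logs rather than proof of who is calling.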

Reader two: Google, the exception. Gemini and AI Overviews inherit Googlebot's infrastructure, and Googlebot renders JavaScript. Google's Martin Splitt has confirmed it. The practical consequence is strange and very real: the same page can be fully visible to Google's AI surfaces and invisible to ChatGPT, Claude and Perplexity at the same time. If your brand shows up in AI Overviews but never in ChatGPT answers, check how much of your content depends on JavaScript before blaming the model.

Reader three: the agents. Agentic browsers like OpenAI's Atlas (powered by their Computer-Using Agent) and Perplexity's Comet read the fully rendered page, through a mix of accessibility-tree snapshots, DOM parsing and screenshots sent to a vision model. This is where the first half of this article loops back: agents are reading web pages through the same vision channel, with the same economics and the same degradation curve. DeepSeek's numbers apply here too. A clean, well-structured page reads at 97%. A dense, cluttered one slides toward the 60% end, and the agent fills the gaps with confident guesses.

There's a fourth pattern worth knowing about on the retrieval side. ColPali (ICLR 2025) showed that for document search, skipping OCR entirely and embedding page images directly beats the classic extract-and-chunk pipeline: 0.81 versus 0.66 NDCG@5, while indexing pages in 0.39 seconds instead of 7.22. The layout, tables and figures that text extraction destroys turn out to carry retrievable meaning. Visual structure is information.

What this means if you publish content

If you own a website that you'd like AI systems to read correctly, the three-reader split translates into checkable work:

Serve the substance in raw HTML. Server-side render or prerender anything you want crawlers and live fetchers to see: prices, specs, FAQs, comparisons. The test takes one command: curl your page and search the response for your key facts. If they're not in that output, two of your three readers can't see them.
Don't let AI Overviews lull you. Google renders your JavaScript, so decent AIO visibility proves nothing about ChatGPT, Claude or Perplexity. Audit them separately.
Treat your accessibility tree as an API. Agents from OpenAI and Perplexity prioritize ARIA roles and labels before falling back to vision. Semantic HTML and honest labels are no longer just an accessibility checkbox; they're how machines operate your interface.
Design for the 97% end of the curve. Dense, cluttered, low-contrast pages push a vision-reading agent toward the degradation zone where it starts guessing. Clear hierarchy and legible text sizes now serve two audiences.
Keep critical facts unambiguous. An agent misreading a price from a screenshot fails exactly like the hex test: silently and confidently.
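The raw-HTML check from the first item can be scripted so it runs on every deploy. A minimal sketch using only the standard library: the function names, the audit User-Agent string and the sample facts are assumptions for illustration.

```python
import urllib.request


def fetch_raw_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page the way a non-rendering crawler would: one GET, no JavaScript."""
    request = urllib.request.Request(url, headers={"User-Agent": "raw-html-audit/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def missing_facts(html: str, facts: list[str]) -> list[str]:
    """Return the facts that do not appear verbatim in the raw HTML."""
    haystack = html.lower()
    return [fact for fact in facts if fact.lower() not in haystack]


# A client-rendered shell: the price only exists after app.js runs in a browser.
spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
print(missing_facts(spa_shell, ["49.90 EUR", "2-year warranty"]))
# prints: ['49.90 EUR', '2-year warranty']
```

In real use you would call `missing_facts(fetch_raw_html(url), facts)` for each key page. Anything the function returns is invisible to the crawlers and live fetchers, whatever a browser shows.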

What we're building on top of this

pxpipe proved the mechanism on one developer's traffic. The question that interests us at 498A is what it does at audit scale, and it's turning into two lines of work.

The first is about simulation economics. A GEORadar brand study runs between 3,000 and 30,000 personalized prompts against five LLMs plus Google's AI Overviews. Every one of those calls carries a slab of static context: persona definitions, market instructions, product catalogs, scoring rubrics. It repeats thousands of times per study, it's token-dense, and none of it needs byte-exact recall. That's exactly the profile the compression math favors. So we're testing optical compression of that static slab on our own simulation engine. The public ratios say dense content compresses about 3x; what we're measuring is what that buys back on a full study: more prompts per budget, more engines per prompt, or the same audit at a fraction of the cost. Until our numbers are in, it stays labeled what it is, an experiment.

The second line points the other way. Our audits interrogate models about brands through the text channel, and the technical side already checks what crawlers can fetch. Nobody is auditing the third reader yet. We're prototyping exactly that: feeding brand pages to vision models the way agentic browsers see them, screenshot plus accessibility tree, and scoring what survives the trip. Which facts does a vision reader actually extract? At what layout density does the page slide toward the confabulation zone DeepSeek measured? For S.A.M., our semantic alignment tool, this extends validation to a second channel: a claim now has to survive as text for the crawler and as pixels for the agent. Semantic alignment used to be a property of your copy. It's becoming a property of your layout too.

I wrote about the text side of that machinery in how LLMs search, retrieve and answer with real-time web data. The vision side is newer, and most sites haven't been checked against it even once. That gap is the opportunity.

Text that fades like memory

Back to DeepSeek's closing idea, because it reframes everything. If text rendered as an image is cheap and readable, then old context doesn't have to be summarized or deleted. It can be stored as pictures that lose resolution over time. Yesterday's conversation at high DPI. Last month's at half. Last year's as a blurry thumbnail that still preserves the gist.

That's not a billing hack anymore. That's an architecture for artificial memory that behaves like the biological kind, built out of the same mechanism that turns $42.21 into $4.51. And for anyone running simulations at audit scale, it's also the difference between sampling a market and saturating it.

The pipeline between language models and text keeps getting stranger. We spent years teaching machines to read. Now the cheapest way to hand text to a machine built on language is, more and more often, to show it a picture.

Technical sources consulted

teamchong: pxpipe, local proxy rendering request context as PNGs, benchmark tables and production cost traces (MIT).
Wei et al., DeepSeek-AI: DeepSeek-OCR: Contexts Optical Compression, compression-precision curves, DeepEncoder architecture and the optical memory proposal (CC BY 4.0).
Li et al.: Text or Pixels? It Takes Half, token efficiency of visual text inputs in multimodal LLMs (CC BY 4.0).
Faysse et al.: ColPali: Efficient Document Retrieval with Vision Language Models, ICLR 2025, visual document retrieval and the ViDoRe benchmark.
Anthropic: Vision documentation, 28x28 pixel patches, per-tier resolution caps and token cost tables.
Google AI for Developers: Understand and count tokens, Gemini image tokenization by 768x768 tiles.
OpenAI: Vision guide, base plus per-tile image token pricing.
Vercel: The rise of the AI crawler, production measurements of GPTBot, ClaudeBot and PerplexityBot JavaScript behavior.
Stan Ventures: Gemini AI renders JavaScript like Googlebot, Martin Splitt's confirmation of Gemini's rendering infrastructure.
Sean Goedecke: Should LLMs just treat text content as an image?, discrete versus continuous token expressiveness.
Sinaptik AI: PixelPrompt, alternative open source implementation of text-to-image context compression.

I keep wondering: if AI already writes 22-46% of new code, what is MAI-Code-1 really training on?

carlosortet — Sun, 07 Jun 2026 09:27:46 +0000

A budgeting note that bites everyone the first time: with any reasoning model you pay for the 'thinking tokens' even though they never show up in the response (they can run up to ~6x the cost of input tokens), and you eat 40 to 90 seconds of latency on tasks a classic LLM answers instantly.

Reserve reasoning for deliberate work (architecture reviews, migrations, incident post-mortems). Use a fast model for autocomplete. More thinking is not always better.

For price context while you wait for MAI numbers: OpenAI o3 sits at $10/$40 per 1M in/out tokens, Gemini 2.5 Pro at $1.25/$10 with a 200K thinking budget.
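To see how thinking tokens move the bill, here is a minimal cost sketch. It assumes thinking tokens are billed at the output rate, uses the $10/$40 per-million prices quoted above, and the token counts of the request are hypothetical.

```python
def request_cost(input_tokens: int, visible_output_tokens: int, thinking_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars, assuming thinking tokens are billed at the output rate."""
    billed_output = visible_output_tokens + thinking_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


# Hypothetical request: 2,000 tokens in, 500 visible tokens out, 8,000 thinking tokens.
with_thinking = request_cost(2_000, 500, 8_000, 10.0, 40.0)
without_thinking = request_cost(2_000, 500, 0, 10.0, 40.0)

print(f"${with_thinking:.2f} vs ${without_thinking:.2f}")
# prints: $0.36 vs $0.04
```

In this made-up case the visible answer is identical and the bill is nine times higher, which is the whole argument for reserving reasoning for deliberate work.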

Reasoning model vs classic LLM, the 30-second version

If you've shipped with both you can skip this. If not: a classic LLM answers on instinct, predicting the next token in one shot. Fast, cheap, and it stumbles on multi-step problems. A reasoning model thinks first: it generates an internal chain (the thinking tokens), tries paths, checks itself, then answers. It learns this through reinforcement learning, rewarded when the chain lands on the right answer.

You know that friend who answers before you finish the question? Sometimes right, often not. Kahneman's "Thinking, Fast and Slow" nailed it years ago: System 1 is fast, System 2 is deliberate. An RLM is System 2, billed by the token.

Copilot was never a model

This is the part most people get wrong, and it changes how you evaluate this launch. Copilot is a layer, not a model. Different engines run underneath, selectable from a picker, same as Perplexity. Until now that engine was OpenAI's GPT.

We learned this the hard way doing brand-visibility audits for clients. Auditing "Copilot" was really auditing ChatGPT, because that was the model under the hood. Same engine, different shell. With MAI, Copilot gets reasoning and a voice of its own, so that equivalence breaks.

The honest part, from the lab

I'll be straight, because that's the only part of these posts worth reading.

Copilot "out of the box" frustrated us for two years. The Office integrations underdelivered. You'd open Copilot in Excel and it would tell you it couldn't touch the data, while a Claude with a Playwright plugin handled the whole thing without breaking a sweat. At some point our internal read was blunt: we'd be better off dropping Office entirely and letting the models generate the deck in HTML.

What we actually want is a system where the AI orchestrates the workflow and the human brings experience and judgment. We still don't have a clean market solution for that, the kind that does collaborative project knowledge plus full AI integration. So, like a lot of teams, we run a funky homemade stack: Obsidian + SharePoint + GitHub wired to Codex, agents and Claude Code. It works. It is not a long-term answer, honestly.

So a model trained on GitHub data and built for GitHub lands right on a pain we know. But it also raises a question I genuinely can't answer, and I'd love your take in the comments.

Here's the uncomfortable math. By 2025 GitHub said Copilot was generating up to 46% of code in files where it's enabled (61% for Java), at an acceptance rate around 27-30%. DX's Q4 2025 report put ~22% of merged code as AI-authored. Google has said ~25-30% of its new code is AI-assisted. Nobody measures the total cleanly (there's no reliable detector for AI-written code after the fact), but the flow of new code already sits somewhere in the 20-46% range, and climbing.

So when Microsoft says MAI-Code-1 is "trained on GitHub code", a meaningful slice of that code was almost certainly written by Copilot, Codex or Claude in the first place. Which sharpens the question: is "trained on GitHub" really that different from "trained on model outputs"? That's model collapse in practice. A CMU study and an analysis of 800+ popular GitHub projects already flag code quality degrading after AI adoption. If new models learn from increasingly AI-written code, they risk amplifying their own mistakes.

I don't have the number for MAI-Code-1 specifically (Microsoft hasn't disclosed the AI-generated share of its training set). But "clean, licensed data" and "free of AI contamination" are not the same claim. What's your read?

Why this matters beyond the benchmark

The strategy is the actual story. Microsoft put 13 billion dollars into OpenAI and up to 5 billion into Anthropic, and it sells both on Azure. On April 27, 2026 the OpenAI exclusivity fell (the trigger was OpenAI's up-to-50-billion deal with Amazon). So Microsoft is now investor, distributor, infra provider and direct competitor, all at once, and it would rather own the engine than rent it.

For us, the practical reads:

Adoption won't be decided by leaderboards. It'll be decided by identity, logging, data boundaries, retention, SLAs, predictable cost and IP indemnity. Microsoft is very good at selling that management plane.
Clean IP lineage is a real feature, not marketing. "No distillation" is a procurement checkbox for regulated teams (the DeepSeek "trained on whose outputs?" mess in 2025 made that concrete).
Lock-in gets subtler. A model tuned on your flows, wired to your identity, billed through Azure is powerful precisely because it's hard to move. Test it, meter it, demand portability.

The skeptic's asterisk

Stay level-headed. The benchmarks are Microsoft's own and no external lab has reproduced them. The comparison is selective (beats Sonnet 4.6 broadly, ties Opus 4.6 only on one code metric). Public figures wobble, including that 256K vs 128K context. New, real, in-house models? Yes. A capability revolution? Probably not. Nor do I care much. The value is independence, cost and control.

That, and one quieter outcome: Microsoft's AI future starts to look a lot less borrowed and far more promising. I know, we are always afraid of Microsoft having too much control over our workflow, but if we are not concerned about Google taking control of the web or TikTok taking control of our kids' minds, is this so worrying? After we accepted Microsoft taking ownership of GitHub, we knew something like this was coming.

Written by Carlos Ortet for 498A, the AI R&D division behind Zoopa. Originally published, with glossary and FAQ, on the Zoopa blog.

Sources: Introducing MAI-Thinking-1 (Microsoft AI) · Introducing MAI-Code-1-Flash (Microsoft AI) · MAI-Code-1-Flash on the GitHub Changelog

On AI-written code share: GitHub Copilot statistics · "AI is writing 46% of all code" · Model collapse explained (IT Pro) · CMU: AI is still making code worse
www.carlosortet.com

From expensive tokens to intelligent compression: how we optimize LLM costs in production

carlosortet — Thu, 26 Mar 2026 09:24:16 +0000

We spend absurd amounts on AI tokens. And that number is only going up.

At 498Advance we run multiple LLMs in production — Claude for development, Gemini for multimodal, DeepSeek and OpenAI models locally for routine tasks. Every model does something well and fails at something else. That is why they coexist.

But this creates a problem: dependency and cost. What happens when a provider goes down? What happens when pricing changes overnight?

Here is how we deal with it, and why a new Google Research paper caught our attention this week.

Layer 1: Fallback policies

If a model fails, the system automatically redirects to the next available model. No human intervention, no perceptible downtime.

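# Simplified fallback logic
import logging

log = logging.getLogger(__name__)
# Candidate pool; get_ranked_models() orders it per task type
models = ["claude-opus", "gpt-4o", "gemini-pro", "deepseek-local"]

class ModelUnavailable(Exception):
    """Raised by call_model() when a provider is down."""

class AllModelsUnavailable(Exception):
    """Raised when every ranked model has failed."""

def inference(prompt, task_type):
    # get_ranked_models() and call_model() are application-specific helpers
    for model in get_ranked_models(task_type):
        try:
            return call_model(model, prompt)
        except ModelUnavailable:
            log.warning("%s unavailable, falling back", model)
    raise AllModelsUnavailable(f"no model answered for task type {task_type!r}")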

Simple but effective. The key is having your models ranked per task type, not globally.

Layer 2: Router shadow

Not every task needs a frontier model. A two-line summary does not need Claude Opus. A 50-page legal analysis does.

Router shadow evaluates each incoming task and routes it to the optimal model based on complexity and cost. We categorize tasks into tiers:

Tier	Task type	Model class	Cost
1	Simple extraction, formatting	Local (DeepSeek 7B)	~$0
2	Summarization, translation	Mid-tier API (Haiku, Flash)	Low
3	Complex analysis, code generation	Frontier (Opus, GPT-4o)	High

The result: cost optimization per project without sacrificing quality where it matters.
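The tier table can be sketched as a lookup plus one escalation rule. This is an illustration, not our production router: the task-type keys, the `LONG_INPUT_CHARS` cut-off and the escalation rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    level: int
    model_class: str
    cost: str


TIERS = {
    1: Tier(1, "local", "~$0"),
    2: Tier(2, "mid-tier API", "low"),
    3: Tier(3, "frontier", "high"),
}

# Task types from the table above, mapped to their tier.
TASK_TIERS = {
    "extraction": 1, "formatting": 1,
    "summarization": 2, "translation": 2,
    "analysis": 3, "code_generation": 3,
}

LONG_INPUT_CHARS = 50_000  # hypothetical cut-off for "too long for a local model"


def route(task_type: str, prompt: str) -> Tier:
    """Pick a tier from the task type, escalating unknown or very long tasks."""
    level = TASK_TIERS.get(task_type, 3)  # unknown task types go to the safest tier
    if level == 1 and len(prompt) > LONG_INPUT_CHARS:
        level = 2  # long inputs get bumped one tier up
    return TIERS[level]


print(route("formatting", "Normalize these dates: 3/4/26, 2026-04-03"))
print(route("legal_review", "Review the attached contract."))
```

Sending unknown task types to the frontier tier is a deliberate choice in this sketch: misrouting a hard task to a small model costs quality, while misrouting an easy one only costs money.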

Layer 3: Local models

At 498Advance we have been running DeepSeek and OpenAI models locally for three months. They handle a significant portion of production tasks.

The benefits go beyond cost:

Security: data never leaves your infrastructure
Compliance: concrete guarantees about where data is processed
Latency: no network round-trip for simple tasks
Availability: no dependency on external uptime

The trade-off: local models are not frontier models. You lose capability on complex tasks. The strategy is selective migration — identify what can run locally, move it, keep frontier for what needs it.

The compression landscape

At some point, better hardware is not enough. You need efficiency.

LLMs keep growing — tens or hundreds of billions of parameters. The compression techniques that make them deployable:

Quantization reduces weight precision. A quantized Llama 70B fits on 1 NVIDIA A100. Unquantized, it needs 4.

Pruning removes low-relevance weights. 2:4 Sparse Llama achieved 98.4% accuracy recovery on the Open LLM Leaderboard V1, with +30% throughput and -20% latency from sparsity alone.

Knowledge distillation trains a small student model to replicate a large teacher's behavior.

These are not mutually exclusive. Sparsity + quantization yields improvements greater than either alone.
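The quantization claim is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch, assuming a 16-bit baseline, 4-bit quantized weights and the 40 GB A100 variant. It counts weights only and ignores KV cache and activations, so real deployments need extra headroom.

```python
import math


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def gpus_needed(memory_gb: float, gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(memory_gb / gpu_memory_gb)


A100_GB = 40  # assumed GPU variant

fp16 = weight_memory_gb(70, 16)  # 140.0 GB
int4 = weight_memory_gb(70, 4)   # 35.0 GB

print(fp16, gpus_needed(fp16, A100_GB))  # 140.0 4
print(int4, gpus_needed(int4, A100_GB))  # 35.0 1
```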

Real-world examples

LinkedIn built domain-adapted EON models on open source LLMs with proprietary data, reducing prompt size by 30%.

Roblox scaled from <50 to ~250 concurrent ML inference pipelines using Ray and vLLM.

Red Hat maintains pre-optimized models on Hugging Face (Llama, Qwen, DeepSeek, Granite) — quantized and ready for inference with vLLM.

TurboQuant: the paper that caught our attention

On March 24, 2026, Google Research published TurboQuant (ICLR 2026). Authors: Amir Zandieh and Vahab Mirrokni.

The headline numbers:

6x minimum KV cache memory reduction
8x speedup with 4-bit quantization on H100 GPUs
3-bit KV cache quantization with zero accuracy loss
No fine-tuning or retraining required

Why it matters technically

Traditional quantization has a memory overhead problem. Most methods need to store quantization constants for each data block, adding 1-2 extra bits per number. TurboQuant eliminates this.
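A quick worked example shows where those extra bits come from. It assumes a common block-wise scheme in which every block stores one 16-bit scale and one 16-bit zero point. The block sizes are illustrative, not taken from the paper.

```python
def effective_bits(value_bits: int, block_size: int,
                   scale_bits: int = 16, zero_point_bits: int = 16) -> float:
    """Bits stored per number once per-block constants are spread over the block."""
    return value_bits + (scale_bits + zero_point_bits) / block_size


print(effective_bits(3, block_size=32))  # 4.0
print(effective_bits(3, block_size=16))  # 5.0
```

So a nominal 3-bit scheme really stores 4 to 5 bits per number under these assumptions. That amortized overhead is the cost TurboQuant is described as removing.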

It combines two algorithms:

PolarQuant converts vectors from Cartesian to polar coordinates. Instead of normalizing data on a shifting square grid, it maps to a fixed circular grid where boundaries are known. This eliminates the normalization overhead.

QJL (Quantized Johnson-Lindenstrauss) compresses the residual error from PolarQuant to a single sign bit (+1 or -1) using the JL Transform. Zero memory overhead.

The pipeline:

PolarQuant compresses with most of the bits
QJL uses 1 bit to correct residual bias
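The sign-bit step is the least intuitive part, so here is a toy version in pure Python. It is a sketch of the general 1-bit Johnson-Lindenstrauss estimator as I understand it, not the paper's code: it skips PolarQuant entirely, applies the sign trick to a whole key vector rather than to a residual, and every name and dimension in it is an assumption.

```python
import math
import random


def make_projection(m: int, d: int, seed: int = 0) -> list[list[float]]:
    """m x d matrix of i.i.d. standard Gaussian entries (the JL projection)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def project(S: list[list[float]], x: list[float]) -> list[float]:
    """Matrix-vector product S @ x."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]


def quantize_key(S: list[list[float]], k: list[float]) -> tuple[list[int], float]:
    """Keep one sign bit per projected coordinate, plus the key's norm."""
    signs = [1 if v >= 0 else -1 for v in project(S, k)]
    norm = math.sqrt(sum(v * v for v in k))
    return signs, norm


def estimate_dot(S: list[list[float]], q: list[float], signs: list[int], norm: float) -> float:
    """Estimate <q, k> from the full-precision query and the 1-bit key sketch."""
    acc = sum(p * s for p, s in zip(project(S, q), signs))
    return math.sqrt(math.pi / 2) / len(S) * norm * acc


q = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, 0.25]
k = [1.0, 0.5, 1.5, -1.0, 0.5, 0.0, 2.0, -0.75]
S = make_projection(m=4000, d=len(k))

signs, norm = quantize_key(S, k)
exact = sum(a * b for a, b in zip(q, k))
approx = estimate_dot(S, q, signs, norm)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The `sqrt(pi / 2)` factor is what makes the estimate unbiased for Gaussian projections: the expected value of a projected query coordinate times the sign of the matching key coordinate equals `sqrt(2 / pi)` times the dot product divided by the key's norm. The toy uses far more sign bits than the vector has dimensions, purely to make the convergence visible.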

Benchmark results

Tested on LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval with Gemma and Mistral:

TurboQuant (KV: 3.5 bits) scores 50.06 on LongBench — identical to Full Cache (KV: 16 bits)
KIVI needs 5 bits for 50.16, drops to 48.50 at 3 bits
Perfect needle-in-haystack results with 6x memory reduction

For vector search, TurboQuant outperforms PQ and RaBitQ in recall ratio even when those baselines use large codebooks and dataset-specific tuning.

What this means in practice

TurboQuant is a research paper, not a product. But the direction is clear:

Same hardware, bigger models: 6x KV cache compression means the GPU running an 8B model could handle something significantly larger
Lower inference costs: 8x attention speedup = fewer GPUs for the same workload
Edge deployment: compression is what separates "interesting demo" from "deployable product"
Simpler compliance: smaller models running locally = less data traveling externally

The race is not for bigger models. It is for models that are smarter about how they use their resources.

OpenAI's Agentic Commerce Protocol: a technical look at how ChatGPT becomes a shopping agent

carlosortet — Wed, 25 Mar 2026 08:51:38 +0000

Last week, OpenAI launched a redesigned shopping experience in ChatGPT. 900 million weekly users can now browse products visually, compare options side-by-side, and get real-time pricing — all inside the conversation.

The protocol powering it is called ACP (Agentic Commerce Protocol), an open standard co-developed by OpenAI and Stripe under Apache 2.0. And the technical implementation is worth a closer look.

What ACP actually is

ACP defines a common language for AI agents and merchants to coordinate transactions. Think of it as an API contract between ChatGPT (the buyer's agent) and a merchant's product catalog.

Key architectural principle: OpenAI is not the merchant of record. Merchants retain full control over products, pricing, payments and fulfillment. ChatGPT is just the conversational intermediary.

The protocol lives on GitHub with 1,300+ stars and 192 forks:
github.com/agentic-commerce-protocol/agentic-commerce-protocol

The pivot nobody expected

OpenAI originally launched ACP in September 2025 with "Instant Checkout" — buy directly inside the chat. Etsy was the first integration.

It didn't work. Out of Shopify's millions of merchants, only about 12 activated checkout. Users browsed and compared, but went to the retailer's site to pay.

OpenAI acknowledged it in March 2026: "the initial version of Instant Checkout did not offer the level of flexibility that we aspire to provide."

The new focus is product discovery: visual browsing, image-based search, comparison tables, budget filtering — then redirect to the merchant's site for checkout.

Technical integration: how it works

Feed specification

Merchants push product data to an OpenAI endpoint via encrypted HTTPS. The feed spec:

Parameter	Value
Formats	CSV, TSV (recommended), XML, JSON
Encoding	UTF-8
Max size	10 GB
Compression	gzip
Update frequency	Every 15 minutes

Required fields per product:

id — unique identifier
title — max 150 chars
description — max 5,000 chars
price — with ISO 4217 currency code
availability — in stock / out of stock / preorder
image_url — at least one high-res image

Recommended: GTIN/UPC/MPN, reviews, rich media, shipping options, performance signals.
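Before pushing a feed, it is worth validating each record against the required fields above. A minimal sketch: the field names and limits come from the list, but the exact value formats (availability strings, a price written as amount plus currency code) are assumptions to check against the official feed spec.

```python
import re

REQUIRED_FIELDS = ("id", "title", "description", "price", "availability", "image_url")
AVAILABILITY_VALUES = {"in stock", "out of stock", "preorder"}
# Assumed price layout: amount, space, three-letter currency code, e.g. "89.90 USD".
PRICE_PATTERN = re.compile(r"^\d+(\.\d{1,3})? [A-Z]{3}$")


def validate_product(product: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if not product.get(name)]

    title = str(product.get("title") or "")
    description = str(product.get("description") or "")
    price = str(product.get("price") or "")
    availability = str(product.get("availability") or "")

    if len(title) > 150:
        errors.append("title longer than 150 characters")
    if len(description) > 5000:
        errors.append("description longer than 5,000 characters")
    if price and not PRICE_PATTERN.match(price):
        errors.append("price is not '<amount> <currency code>'")
    if availability and availability not in AVAILABILITY_VALUES:
        errors.append("availability is not one of: " + ", ".join(sorted(AVAILABILITY_VALUES)))
    return errors


record = {
    "id": "SKU-1042",
    "title": "Waterproof trail running shoe",
    "description": "Waterproof membrane, arch support for flat feet.",
    "price": "89.90",  # currency code missing on purpose
    "availability": "in stock",
}
print(validate_product(record))
```

The pattern only checks the shape of the currency code, not whether it is a real ISO 4217 entry. A production validator would look the code up in a currency table.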

Three APIs

Feeds API      → Upload/manage full product catalogs
Products API   → Individual product upserts
Promotions API → Manage promotions (API-only)

For existing Stripe merchants, integration can take as little as one line of code.

Delegated payments

The payment flow uses single-use, time-bound, amount-restricted tokens. OpenAI prepares a delegated payment request; the payment service provider (Stripe, PayPal, Checkout.com) handles tokenization. PCI compliant by design.

Versioning

ACP uses date-based versioning (YYYY-MM-DD). The latest stable spec is 2026-01-30, which added extensions, discounts, and payment handlers.

spec/2025-09-29/  → Initial release
spec/2025-12-12/  → Fulfillment enhancements
spec/2026-01-16/  → Capability negotiation
spec/2026-01-30/  → Extensions, discounts, payment handlers (latest)
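Date-based versions are convenient to handle in code because they sort like dates. A small sketch using the four versions listed above. The helper names are illustrative, and the negotiation shown is a generic "latest common version" rule, not ACP's actual capability negotiation.

```python
from datetime import date
from typing import Iterable, Optional

SPEC_VERSIONS = ["2025-09-29", "2025-12-12", "2026-01-16", "2026-01-30"]


def latest_version(versions: Iterable[str]) -> str:
    """Return the most recent YYYY-MM-DD version; raises ValueError on malformed input."""
    return max(versions, key=date.fromisoformat)


def latest_common(client: Iterable[str], server: Iterable[str]) -> Optional[str]:
    """Most recent version both sides support, or None if they share none."""
    shared = set(client) & set(server)
    return latest_version(shared) if shared else None


print(latest_version(SPEC_VERSIONS))                           # 2026-01-30
print(latest_common(SPEC_VERSIONS[:3], ["2025-12-12", "2026-01-30"]))  # 2025-12-12
```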

ACP vs. Google UCP vs. Amazon

Three ecosystems are forming:

ACP (OpenAI + Stripe): Open standard, Apache 2.0. Feeds update every 15 minutes. 900M weekly users. Focused on conversational product discovery.

UCP (Google): Open standard, launched January 2026. Backed by Shopify, Walmart, Visa, Mastercard. 50 billion indexed products. Covers the full purchase journey.

Amazon (Rufus/Alexa+): Closed ecosystem. 300M Rufus users, 60% higher conversion than standard Amazon flow. Blocks OpenAI crawlers. Removed 600M products from ChatGPT results.

The key difference between ACP and Google Shopping feeds: ACP feeds are designed for AI reasoning, not indexing. Every field can become an argument the agent uses to explain why a product is relevant — not just that it exists.

The data that matters

Agentic traffic converts at 15-30% (Q1 2026 data) — 5-10x over traditional e-commerce
AI-generated recommendations convert 4.4x better than traditional search (McKinsey)
ChatGPT drives 20%+ of referral traffic to Walmart
But: less than 0.2% of total e-commerce sessions come from ChatGPT

The volume is small. The conversion is extraordinary. The year-over-year growth is +805% (Adobe, Black Friday 2025).

What this means if you build e-commerce

If you're building or maintaining e-commerce systems, ACP changes the optimization target.

1. Feed quality > ad spend

Products surface based on what the AI can parse and verify. Incomplete feeds = invisible products. Every missing attribute is a lost recommendation opportunity. A product without a GTIN, without proper category depth (up to 5 levels), or with a generic "Great product, buy now!" description will never be recommended when a user asks for something specific.

2. Descriptions for NLP, not SEO

Attribute-rich descriptions that an AI can reason about beat keyword-stuffed copy. When someone asks "waterproof running shoes for flat feet under $120," the AI needs to find "waterproof," "arch support for flat feet," and the price in your product data. If those attributes are buried in marketing copy instead of structured fields, you lose.

3. Schema markup matters more than ever

Product, Offer, AggregateRating with real data. The AI cross-references your structured data against user queries. Complete schema is no longer optional — it's the foundation of agentic visibility.
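As a reference point, this is roughly what complete markup looks like when generated from product data. A minimal sketch: the schema.org types and property names are real, while the function name and the sample product are invented for illustration.

```python
import json


def product_jsonld(name: str, description: str, gtin: str, price: float, currency: str,
                   in_stock: bool, rating_value: float, review_count: int) -> str:
    """Build a schema.org Product with a nested Offer and AggregateRating as JSON-LD."""
    availability = "https://schema.org/InStock" if in_stock else "https://schema.org/OutOfStock"
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "description": description,
        "gtin": gtin,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": availability,
        },
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating_value,
            "reviewCount": review_count,
        },
    }
    return json.dumps(data, indent=2)


markup = product_jsonld(
    name="Waterproof trail running shoe",
    description="Waterproof membrane, arch support for flat feet.",
    gtin="00012345678905",
    price=89.9,
    currency="USD",
    in_stock=True,
    rating_value=4.6,
    review_count=128,
)
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

Generating the markup from the same record that feeds your product catalog keeps the two from drifting apart, which matters if the structured data is cross-referenced as described above.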

4. Multi-protocol support

Prepare to maintain feeds for both ACP (OpenAI) and UCP (Google). Shopify already offers dual support. The formats are similar enough that a single product data pipeline can serve both, but the update frequencies differ (15 minutes for ACP vs. 24 hours default for UCP).

5. Attribution is fundamentally broken

The customer behavior that drives the sale is invisible to existing analytics. A user asks ChatGPT for a recommendation, gets three product suggestions, visits one site, and buys. Your analytics sees a direct visit or an unknown referrer. No impression. No click. No session.

Server-side webhook tracking from ACP/UCP is the path forward. Start building this infrastructure now — you'll need 18-24 months of data before it becomes reliable.

Getting started

Merchant portal: chatgpt.com/merchants
Developer docs: developers.openai.com/commerce
Feed spec: developers.openai.com/commerce/specs/feed
GitHub: github.com/agentic-commerce-protocol
Stripe integration: docs.stripe.com/agentic-commerce

Currently US-only for merchant onboarding. Product discovery is available globally for all ChatGPT users (Free, Go, Plus, Pro).

We wrote a detailed analysis with market data, competitive comparisons, and practical recommendations:
zoopa.es/en/digital-marketing-en/agentic-commerce-protocol-acp-chatgpt-shopping/

What's your take on agentic commerce? Are you already preparing your product feeds for AI agents, or waiting to see how ACP vs. UCP plays out?

robots.txt is a sign, not a fence: 8 technical vectors through which AI still reads your website

carlosortet — Mon, 23 Mar 2026 07:54:14 +0000

You configure robots.txt like this:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /

You enable Cloudflare Bot Management. You set up Akamai. Maybe even a server-side paywall.

And then you query ChatGPT about your product and it cites your website as a source.

How?

I work on GEO (Generative Engine Optimization) projects where we audit how LLMs represent brands. We routinely analyze thousands of prompt-response pairs. Across multiple projects, we consistently find that 10–20% of LLM responses cite the brand's own website as a source — even when every known bot is blocked.

Here are the 8 technical vectors we documented, with academic sources and industry data.

1. Historical crawl data (Common Crawl)

This is the biggest one and the least understood.

Common Crawl is a nonprofit that has been archiving the web since 2007. The numbers:

9.5+ petabytes, 300+ billion documents
~2/3 of the 47 LLMs published between 2019–2023 use it as training data
GPT-3, LLaMA and T5 were trained on data derived from it, and the RedPajama dataset is built from it
Google's C4 dataset: 750 GB filtered from Common Crawl

Blocking crawlers today does not retroactively remove content already captured. Those snapshots are permanent, public resources.

Source: ACM FAccT 2024 — "A Critical Analysis of Common Crawl"

2. Client-side paywall bypass

Common Crawl does not execute JavaScript. If your paywall depends on client-side JS:

<!-- Your paywall loads after DOM ready -->
<script>
  document.addEventListener('DOMContentLoaded', () => {
    showPaywall();
  });
</script>

<!-- But the crawler already captured the full HTML -->

The crawler gets the complete article before JS even runs.
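You can reproduce this without any browser. The sketch below parses a made-up page with Python's standard `html.parser`, which never runs scripts, and pulls out the visible text; the `TextExtractor` class and the sample HTML are illustrative only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


RAW_HTML = """<html><body>
<article><h1>Headline</h1><p>Full article text served in the initial response.</p></article>
<script>document.addEventListener('DOMContentLoaded', () => { showPaywall(); });</script>
</body></html>"""

parser = TextExtractor()
parser.feed(RAW_HTML)
text = " ".join(parser.chunks)
print(text)
```

The article body is in the response; the paywall is only an instruction the parser never executes. A server-side paywall avoids this by not sending the text at all.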

Alex Reisner documented this for The Atlantic (Nov 2025): Common Crawl was capturing full articles from NYT, WSJ, The Economist and The Atlantic itself.

3. User-agent spoofing

Some AI bots change their identity when blocked.

Cloudflare documented (Aug 2025) that Perplexity was using:

# Declared user-agent
PerplexityBot/1.0

# What they actually sent
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0

Plus ASN rotation to evade IP-based blocking. The evasion ecosystem includes FlareSolverr (Selenium + undetected-chromedriver), Scrapfly (94–98% bypass rates), and residential proxy rotation.
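To see why user-agent rules alone fail here, consider a naive server-side filter. The `blocked_by_user_agent` function is a hypothetical stand-in for a WAF or server rule; the two strings are the ones shown above.

```python
# Tokens from the robots.txt example at the top of the article.
BLOCKED_TOKENS = ("GPTBot", "CCBot", "anthropic-ai", "PerplexityBot")


def blocked_by_user_agent(user_agent: str) -> bool:
    """Naive rule: block when the header contains a known bot token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_TOKENS)


declared = "PerplexityBot/1.0"
spoofed = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0"

print(blocked_by_user_agent(declared))
print(blocked_by_user_agent(spoofed))
```

The declared bot is blocked and the browser-like string passes, because the header is whatever the client chooses to send. That is why detection has to lean on behavior and network signals rather than the header.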

4. Syndication redistribution

Once your content leaves your domain through any syndication channel, your robots.txt is irrelevant:

Original domain (robots.txt: Disallow)
  → RSS feed (no robots.txt)
  → Apple News (different domain)
  → Email newsletter (archived on web)
  → Cross-posted to social (scraped by bots)
  → API aggregators (reformatted downstream)

Each channel creates a copy outside your control.

5. Web archives (Wayback Machine)

Internet Archive: 1+ trillion pages, 99+ petabytes. web.archive.org is domain #187 in Google's C4 dataset.

Harvard's WARC-GPT lets you ingest WARC archives directly into RAG pipelines. By Feb 2026, publishers such as The Guardian and NYT had started blocking the Wayback Machine over AI concerns.

6. Real-time RAG access

Modern LLMs don't just rely on training data. They fetch content in real time:

Bot	Growth 2024–2025	Mechanism
ChatGPT-User	+2,825%	Fetch on user "search the web"
PerplexityBot	+157,490%	Fetch on every query
Meta-ExternalFetcher	New in 2024	Meta AI features

Their operators describe these fetches as "user-initiated" rather than autonomous crawling, and on that basis argue that robots.txt does not apply to them.

Cloudflare reported Anthropic's bots have crawl-to-refer ratios of 38,000:1 to 70,000:1. For every time they send traffic back, they crawl tens of thousands of times.

Sources: Cloudflare Blog 2025, OpenAI Crawlers Overview
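If you want to estimate a crawl-to-refer ratio from your own logs, the arithmetic is simple. This is a sketch under assumptions: the log records are already parsed into dicts, and the crawler tokens and the referrer host are placeholders you would replace with the ones you care about.

```python
# Hypothetical crawler tokens to count as "crawl" requests.
CRAWLER_TOKENS = ("ClaudeBot", "GPTBot", "PerplexityBot")


def crawl_to_refer_ratio(requests, referrer_host):
    """Crawler fetches divided by visits referred from the given host."""
    crawls = sum(1 for r in requests
                 if any(t in r["user_agent"] for t in CRAWLER_TOKENS))
    referrals = sum(1 for r in requests if referrer_host in r.get("referrer", ""))
    return crawls / referrals if referrals else float("inf")


# Toy log: six crawler fetches, two human visits referred by the assistant.
log = (
    [{"user_agent": "ClaudeBot/1.0", "referrer": ""}] * 6
    + [{"user_agent": "Mozilla/5.0", "referrer": "https://claude.ai/"}] * 2
)
print(crawl_to_refer_ratio(log, "claude.ai"))
```

On real traffic the denominator is usually tiny, which is how ratios in the tens of thousands appear.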

7. Content farms

Content farms — human or AI-operated — rewrite your articles on unrestricted domains:

1. Scrape/copy original article
2. Rewrite to avoid plagiarism detection
3. Publish on domain with no robots.txt restrictions
4. AI crawler indexes the rewrite
5. LLM absorbs the rewritten version

In Bartz v. Anthropic PBC (2025), the court held that training on lawfully acquired books was fair use, but that keeping copies downloaded from "pirate sites" was not. The ruling did not address rewritten content, so this route remains legally untested.

8. Direct non-compliance

The simplest vector: bots just ignore robots.txt.

12.9% of bots ignore it entirely (was 3.3%) — Paul Calvano, Aug 2025
Duke University (2025): "several categories of AI-related crawlers never request robots.txt"
Kim & Bock (ACM IMC 2025): scrapers are less likely to comply with more restrictive directives

The legal status is clear: in Ziff Davis v. OpenAI (2025), the judge described robots.txt as "more like a sign than a fence" — not a technological measure that "effectively controls access" under the DMCA.
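The "sign, not a fence" point is visible in how compliance is implemented: the check runs in the crawler's own code. A small sketch with Python's standard `urllib.robotparser`, using a shortened robots.txt and `example.com` as a placeholder URL:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/article"
gptbot_allowed = rp.can_fetch("GPTBot", url)
browser_allowed = rp.can_fetch("Mozilla/5.0", url)
print(gptbot_allowed, browser_allowed)

# A compliant crawler stops when can_fetch() returns False.
# Nothing on the server enforces that: a client that skips this check,
# or sends a different user-agent, simply makes the request.
```

The file only answers a question the client chooses to ask.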

The compliance stats

Metric	Value	Source
Bots ignoring robots.txt	12.9%	Paul Calvano, 2025
Top 10K sites with AI bot rules	Only 14%	Market analysis 2025
Sites with any robots.txt	94% (12.2M sites)	Global study 2025

So what do you do?

Blocking alone doesn't work. Defensive measures reduce direct crawling by 40–60% for compliant bots, but they can't touch historical data, syndicated copies, or content farm rewrites.

The alternative is offensive: control the narrative instead of trying to hide from it.

At 498 Advance we built tools for this: GEOdoctor for technical auditing of brand visibility in LLMs, and S.A.M. (Semantic Alignment Machine) for content alignment across owned media, UGC platforms (social GEO) and authority domains.

Full analysis with all academic sources: zoopa.es/en/blog

Have you run into this paradox? Blocking everything but still appearing in LLM outputs? I'd love to hear what you've observed in your own infrastructure.

How I built a proactive personal AI assistant based on Moltbot for 10€ a month

carlosortet — Fri, 30 Jan 2026 07:18:51 +0000

An open source agent with persistent memory, command execution, real proactivity, and WhatsApp as its interface. Full architecture, real costs, and lessons learned.

Why Build Your Own Assistant

The idea of having an AI assistant that actually knows you is nothing new. Anyone who uses ChatGPT, Claude, or Gemini daily has felt the same frustration: every conversation starts from scratch, you have to re-explain your project context, and the assistant cannot do anything beyond the chat window.

At 498AS, we have been experimenting for months with language models applied to real workflows. Not just text generation, but agents that execute tasks, access tools, and maintain long-term context.

When we discovered Clawdbot (Moltbot), we saw an opportunity to close that gap. An open source project that lets you run your own AI agent with persistent memory, command execution, and the ability to message you before you message it.

We set it up. It works. And it costs 10 euros a month.

What Is Clawdbot

Clawdbot (Moltbot) is an open source project that provides the infrastructure for running your own AI agent. It is not a closed product or a subscription app. It is the technical skeleton on which you build your personalized assistant.

Its core capabilities:

Multichannel communication. It talks to you via WhatsApp, Telegram, Discord, or whatever platform you prefer.
Persistent memory. It remembers previous conversations, your projects, your preferences. It does not start from zero.
Command execution. It can run scripts, open applications, control your browser. Real actions on your computer.
Tool and API access. You configure which tools it can use, and the agent integrates them into its responses and actions.
Full control. It runs on your server, under your infrastructure. You decide what data it holds and who can access it.

The key difference from ChatGPT or Claude is that Clawdbot does not live in a third party's cloud. It lives wherever you decide.

Full System Architecture

The Central Server

The heart of the system is a VPS on Hetzner. Minimum specs: 2 vCPUs, 4GB of RAM. Model CX22. Cost: approximately 4 euros per month.

This server runs the Clawdbot gateway, which handles:

Receiving and sending messages
Communication with the AI model
Persistent memory management
Tool and node coordination

The AI Model

Subscription option. If you already pay for Claude Pro or Max, ChatGPT Plus/Pro, or similar, Clawdbot can use that subscription directly. No separate API costs.

API option. If you prefer pay-per-use, you can connect the APIs from Claude, OpenAI, or any compatible model. For personal use it typically ranges between 10 and 30 euros per month.

WhatsApp as the Interface

The setup requires a dedicated prepaid SIM. Cost: 6 euros per month with minimal data. The agent has its own phone number. You message it like any other contact and it replies.

Security with Tailscale

The Clawdbot gateway is not exposed to the internet. It only listens on the Tailscale network.

Tailscale is a mesh VPN that creates a private network between your devices. End-to-end encryption, regardless of physical location. Free for personal use.

Our specific configuration:

The Hetzner VPS is on the Tailscale network
Our phones and computers are too
The gateway only accepts connections from Tailscale IP ranges
The firewall is explicitly configured: only Tailscale can reach the gateway

If someone scans the server from the internet, the gateway port does not even show up.
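The "only Tailscale IP ranges" rule can be expressed as a simple membership test. This is an illustration of the logic, not the gateway's or the firewall's actual configuration: `from_tailnet` is a hypothetical helper, and the two ranges are the address blocks Tailscale assigns from (100.64.0.0/10 for IPv4 and fd7a:115c:a1e0::/48 for IPv6).

```python
import ipaddress

TAILSCALE_NETS = [
    ipaddress.ip_network("100.64.0.0/10"),        # Tailscale IPv4 range (CGNAT space)
    ipaddress.ip_network("fd7a:115c:a1e0::/48"),  # Tailscale IPv6 ULA prefix
]


def from_tailnet(source_ip: str) -> bool:
    """True when the source address falls inside a Tailscale range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in TAILSCALE_NETS)


for ip in ("100.101.102.103", "203.0.113.7", "fd7a:115c:a1e0::1"):
    print(ip, from_tailnet(ip))
```

Because 100.64.0.0/10 is shared address space, an address check alone is not enough on a public interface. Binding the gateway to the Tailscale interface and dropping everything else at the firewall is what makes the port disappear from outside scans.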

Distributed Nodes

Clawdbot lets you connect "nodes": computers that the agent can control remotely. My MacBook is connected as a node. I can be out on the street, message the agent via WhatsApp saying "open Chrome and look up X" and it does it on my Mac.

All traffic between nodes and the gateway goes through Tailscale.

Access Control

The agent does not talk to just anyone. We have an allowlist of authorized phone numbers. Permissions are granular: who can interact, which commands it can execute, which tools are available, whether execution requires prior approval.
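A sketch of how such a policy could be modeled. The `Policy` class, the phone number, the tool names and the four outcomes are all invented for the example; Clawdbot's real configuration format may look nothing like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    allowed_tools: frozenset = frozenset()
    needs_approval: frozenset = frozenset()


# Hypothetical allowlist: one authorized number with two tools, one of them gated.
ALLOWLIST = {
    "+34600000001": Policy(
        allowed_tools=frozenset({"calendar", "shell"}),
        needs_approval=frozenset({"shell"}),
    ),
}


def decide(sender: str, tool: str) -> str:
    """Return 'ignore', 'deny', 'ask_approval' or 'allow' for a tool request."""
    policy = ALLOWLIST.get(sender)
    if policy is None:
        return "ignore"          # unknown numbers get no reply at all
    if tool not in policy.allowed_tools:
        return "deny"
    return "ask_approval" if tool in policy.needs_approval else "allow"


print(decide("+34600000001", "calendar"))
print(decide("+34600000001", "shell"))
print(decide("+34699999999", "calendar"))
```

The order of the checks matters: identity first, then capability, then approval, so an unknown sender never learns which tools exist.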

The Proactive Agent: The Real Game Changer

This is the feature that turns a chatbot into an actual assistant. The agent does not wait for you to write. It messages you first.

Email alerts. The agent monitors your email and notifies you via WhatsApp when something relevant arrives.
Calendar reminders. It reads your calendar and alerts you ahead of meetings.
Custom alerts. Weather updates, price monitoring, GitHub repository activity.
Scheduled tasks. "Every Monday at 9, review my pending tasks and tell me what I need to do this week."

Proactivity completely transforms the relationship with the assistant.
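Under the hood, a scheduled task like "every Monday at 9" comes down to computing the next trigger time. A minimal sketch, assuming naive local datetimes; `next_weekly_run` is a hypothetical helper, not Clawdbot's scheduler.

```python
from datetime import datetime, timedelta


def next_weekly_run(now: datetime, weekday: int = 0, hour: int = 9) -> datetime:
    """Next occurrence of the given weekday (Monday=0) at hour:00, strictly after now."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)  # this week's slot already passed
    return candidate


now = datetime(2026, 1, 30, 7, 18)  # a Friday
print(next_weekly_run(now))
```

The agent sleeps until that timestamp, runs the prompt, sends the result over WhatsApp and computes the next one.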

Real Cost Breakdown

Item	Monthly Cost
Hetzner CX22 VPS	~4 EUR
Prepaid SIM with data	~6 EUR
Tailscale	0 EUR (free tier)
Total infrastructure	~10 EUR/month

The AI model is separate. If you already have a subscription to Claude Pro, ChatGPT Plus, or similar, you pay nothing extra.

Lessons Learned After Weeks of Use

More useful than expected. What started as a technical experiment became a daily tool.
Proactivity is the differentiator. Having the agent message you when something important happens is a paradigm shift.
Security is non-negotiable. Tailscale, firewall, allowlist, granular permissions. All of it.
WhatsApp as an interface reduces friction. It is where you already are.

How to Get Started

Clawdbot is open source and available on GitHub with comprehensive documentation and an active Discord community.

Get a cheap VPS (Hetzner, DigitalOcean, or similar)
Install Clawdbot following the documentation
Connect WhatsApp with a dedicated SIM
Set up Tailscale on the server and your devices
Define an allowlist of authorized numbers
Configure tools and permissions

Security recommendations:

Do not expose the gateway to the internet. Use Tailscale.
Do not run as root.
Configure user allowlists.
Enable approval for command execution.

We wrote a detailed blog post with the full architecture and step-by-step setup: Read the full article on our blog

Article published by the 498AS team. Questions about the setup? Drop a comment or get in touch.

Clawdbot: The AI Assistant Siri Promised But Never Delivered — Complete 2026 Guide

carlosortet — Sun, 25 Jan 2026 18:32:18 +0000

Clawdbot is an open-source personal AI assistant that radically transforms how we interact with artificial intelligence. Unlike ChatGPT or Claude web, Clawdbot lives inside your everyday messaging apps — WhatsApp, Telegram, Slack, Discord — and can proactively message you, remember conversations from weeks ago, and execute tasks on your computer.

The Problem Clawdbot Solves

For over a decade, big tech companies promised us intelligent assistants that would transform our productivity. Siri arrived in 2011. Google Assistant in 2016. Alexa conquered millions of homes. And yet, in 2026, most users remain frustrated with these tools.

The fundamental problem is that these traditional assistants:

Wait passively for you to speak to them
Forget everything when the session closes
Can't execute complex tasks
Live in closed ecosystems that limit their utility
Prioritize data collection over real functionality

Clawdbot represents a paradigm shift. It's what Siri should have been and never was.

Clawdbot vs Siri vs Google Assistant vs Alexa: 2026 Comparison

Feature	Clawdbot	Siri	Google Assistant	Alexa
Proactivity	Writes to you when there's news	Only responds	Limited to routines	Only basic notifications
Persistent memory	Remembers weeks/months	Forgets on close	Very limited	Session only
Executes PC tasks	Terminal, browser, files	No	No	No
Open source	MIT License, auditable	Closed	Closed	Closed
Your data on your machine	Local-first	Apple servers	Google servers	Amazon servers
WhatsApp integration	Native	No	No	No
Monthly cost	~$27 (VPS+Claude)	Free (with iPhone)	Free	Free (with Echo)

Verdict: Clawdbot is the only assistant in this comparison that combines real proactivity, persistent memory, and total user control. The trade-off is that it requires initial technical setup.

What is Clawdbot and Why Does It Matter

Clawdbot is an open-source personal AI assistant that runs locally on your machine and responds through the messaging apps you already use: WhatsApp, Telegram, Slack, Discord, Signal, iMessage, and 50+ additional integrations.

The fundamental difference with ChatGPT or Claude web is that Clawdbot lives inside your messaging apps. You don't go to a website. You message it like any other contact. And it can:

Proactively send you messages
Remember past conversations
Execute tasks on your computer
Run 24/7

The Local-First Philosophy

One of the most relevant aspects of Clawdbot is its local-first architecture:

Your data stays on your machine by default
No dependency on external servers (except the AI model you choose)
You can audit all the code since it's open-source
You have total control over what it can and cannot do

The Creator: Peter Steinberger

Peter Steinberger (@steipete) is the creator of Clawdbot. He's not an anonymous developer, and Clawdbot is not a VC-funded weekend project promising to revolutionize the world.

His profile:

Ex-founder of PSPDFKit, a PDF software company with global presence
13+ years of experience in native iOS development
Based in Vienna and London
Builds in public, sharing both successes and failures

Technical Architecture

Clawdbot uses a hub-and-spoke architecture with a centralized Gateway that acts as the control plane. The Gateway manages:

Session routing and isolation: Each conversation stays separate
Channel connections and events: WhatsApp, Telegram, Slack, etc.
Tool execution and streaming: Real-time responses
Canvas/A2UI hosting: Visual interface when needed
Device node orchestration: Multiple synchronized devices
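To make "session routing and isolation" concrete, here is a toy hub that keys each conversation by channel and sender. The `Gateway` class below is purely illustrative and does not mirror Clawdbot's internals.

```python
from collections import defaultdict


class Gateway:
    """Toy hub: every (channel, sender) pair gets its own message history."""

    def __init__(self):
        self.sessions = defaultdict(list)

    def route(self, channel: str, sender: str, text: str):
        key = (channel, sender)
        self.sessions[key].append(text)
        return key

    def history(self, channel: str, sender: str):
        # Return a copy so callers cannot mutate another session's state.
        return list(self.sessions.get((channel, sender), []))


gw = Gateway()
gw.route("whatsapp", "alice", "remind me at 9")
gw.route("telegram", "alice", "summarize my inbox")
gw.route("whatsapp", "bob", "hello")

print(gw.history("whatsapp", "alice"))
print(gw.history("telegram", "alice"))
```

The same person on two channels ends up with two separate contexts, which is the isolation property the list describes.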

Supported AI Models

Clawdbot isn't tied to a single provider:

Provider	Models	Notes
Anthropic	Claude Pro/Max	Recommended: Claude Opus 4.5
OpenAI	ChatGPT/Codex	GPT-4 available
Local	Ollama, LM Studio	For maximum privacy

Installation and Requirements

System Requirements

Component	Requirement
Node.js	>= 22
Package Manager	npm, pnpm (preferred), or bun
Operating System	macOS, Linux, Windows (via WSL2)
Minimum RAM	2GB (4GB for web automation)
Optional	Docker (for sandboxing)

Quick Installation

npm install -g clawdbot@latest
clawdbot onboard --install-daemon
clawdbot gateway --port 18789 --verbose

Or the one-liner:

curl -fsSL https://clawd.bot/install.sh | bash

Estimated setup time: 30 minutes for basic configuration. 2-4 hours for complete customization with skills.

Differentiating Features

1. Persistent Memory

Unlike traditional chatbots that forget everything when the session closes, Clawdbot maintains continuous 24/7 context, learned user preferences, history of past conversations, and structured memory files. The workspace can be a Git repository — if the bot "learns" something incorrect, you can git revert and return to a previous state.

2. Real Proactivity

This is perhaps the most differentiating feature. Clawdbot doesn't wait passively:

Morning briefings: Summary of emails, calendar, relevant news
Contextual alerts: "Your meeting starts in 20 minutes"
Topic tracking: "That project you mentioned has updates"
Smart reminders: Based on usage patterns

3. System Control

Clawdbot can interact with your computer:

Browser: Automation with Chrome/Chromium (CDP)
Files: Reading and writing
Terminal: Shell command execution
Scripts: On-the-fly creation and execution

4. Extensibility

The skills system allows extending capabilities. Most interesting: Clawdbot can write its own plugins to gain new capabilities.

Real-World Use Cases

Personal Productivity

Automatic Inbox Zero — Email cleanup, mass unsubscription, organization
Smart Calendar — Contextual reminders based on location/time
Flight Check-in — Complete automation without intervention
Morning Briefings — Personalized daily summary

Advanced Automation

WhatsApp Memory Vault — A user automatically transcribed over 1000 voice messages with integrated semantic search.

Grocery Autopilot — From recipe photo to completed shopping cart in under 5 minutes.

Complete Website Migration — @thekitze: "Rebuilt my entire site via Telegram while watching Netflix in bed. Notion to Astro, 18 posts migrated, DNS moved to Cloudflare. Never opened my laptop."

Community Testimonials

"I got up and running today with @clawdbot and it's been nothing short of an iPhone moment for me." — @dajaset

"Using @clawdbot for a week and it genuinely feels like early AGI." — @davekiss

"Not enterprise. Not hosted. Infrastructure you control. This is what personal AI should feel like." — @karpathy

Real Costs

Component	Cost
Basic VPS	~5 EUR/month (Hetzner, 2GB RAM)
Recommended VPS	~10 EUR/month (4GB RAM)
Claude Pro	$20/month
Claude Max	$100-200/month

Minimum functional configuration: ~$27/month (VPS + Claude Pro)

Clawdbot software is 100% free and open-source under MIT license.

Security and Privacy

Clawdbot follows a local-first architecture: your data stays on your machine by default. Being open-source, you can audit all the code. However, the agent has broad access to your system (files, terminal, browser), so you should carefully configure permissions and use Docker containers to limit access in group contexts.

Security Recommendations

Use restrictive pairing
SSH tunneling for remote access
Dedicated number for WhatsApp (ban risk)
Carefully evaluate skill permissions
Run clawdbot doctor regularly

Who is Clawdbot For

Ideal Profiles

Developers — Infinite extensibility, hackable
Technical Power Users — Total control, privacy
Founders/CEOs — 24/7 automation
Content Creators — Multi-platform, community management

Who Should NOT Use It (Yet)

Non-technical users — Requires terminal, VPS knowledge
Companies with strict compliance — Unaudited security
Error intolerant users — Still early stage software

Conclusions

Clawdbot represents a milestone in the evolution of personal AI assistants. It's not perfect — onboarding is still rough, there are occasional bugs, and it requires technical knowledge — but it's genuinely useful in ways Siri never achieved.

For developers and technical power users, it's a transformative tool. For the average user, it's probably better to wait for more polished versions.

What's clear is that the model of "assistant that lives in your infrastructure and contacts you proactively" is the future. Big Tech will eventually arrive. The open-source community got there first.

GitHub: github.com/clawdbot/clawdbot

Analysis by ZOOPA Research | 498AS Innovation Lab | January 2026
