DEV Community: Jan Michalík

Can an LLM actually read your website? I built a scanner to stop guessing

Jan Michalík — Tue, 14 Jul 2026 13:34:35 +0000

> I'm the developer behind a privacy-first RAG chatbot for WordPress. This is a writeup from pagecoder.ai about a small tool I ended up building one weekend.

This one didn't start with a product idea. It started with an ad, on a Saturday morning.

Weekends are when I get to be a nerd for fun. No client work, just coffee, a croissant, and whatever rabbit hole the internet drops me into. This particular Saturday my feed was full of the same pitch, over and over: we'll get you into ChatGPT. Get cited by Perplexity. Show up in Google's AI answers. The future of search is here and you, my friend, are invisible in it.

I didn't buy anything. But I went looking for the price. And then the croissant went down the wrong pipe.

The business model was "hope you don't understand this"

Almost nobody sold it as "here's what's broken, here's the fix." They sold it as a relationship. Monthly retainer, content on a schedule, a dashboard that mostly reminds you to keep paying. The pitch leaned on one quiet assumption: this is AI, it's far too complicated for you, so hand it over forever and don't ask questions.

That's a business model built on the customer not understanding the thing. And the annoying part, as a developer looking at it, is that the expensive mysterious part is mostly just a measurement.

Most of "AI visibility" is measurable

I wrote down what it actually breaks into. It's not magic. It's about five questions, and most of them are things you'd check in a terminal:

Can AI even reach your site? The new crawlers (GPTBot, ClaudeBot, PerplexityBot) obey robots.txt like everyone else. Plenty of sites quietly block the exact bots they're paying someone to get them into. One-line check.
Can it read what it finds? Models read your page in chunks. If it's one undifferentiated wall of markup, it chunks badly, and badly-chunked content is hard to retrieve and cite.
Does it know what you are? Structured data (JSON-LD), clear entities, consistent naming. This is how a model anchors "this page is about this entity that does this thing" instead of guessing.
Can it actually quote you? You can literally test this. Hand a page to a model and ask what it understood and would cite. If the extraction is thin, so is your citability.
Do you actually show up? Ask a search-capable model a real question in your niche and see whether you're in the answer, only in the sources, or nowhere.
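The first question really is a terminal-sized check. Here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler names come from the list above; the sample robots.txt and the `blocked_crawlers` helper are made up for illustration, and a real check would fetch the live file instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Crawler names from the list above.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def blocked_crawlers(robots_txt: str, url: str) -> list[str]:
    """Return the AI crawlers that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, url)]


# Hypothetical robots.txt that blocks one AI crawler and allows everyone else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_crawlers(SAMPLE, "https://example.com/pricing"))
```

Parsing a string keeps the sketch offline and testable. `RobotFileParser` can also load a live file with `set_url()` followed by `read()`.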

None of that needs a subscription. It needs someone to run the checks.

So I built the audit, not the retainer

That's the tool. You give it a URL, it runs those checks, and it hands back a readiness score, a breakdown by category, and a list of specific findings, each with the evidence and the fix. Then you go fix them. That's the whole idea. If some fixes are worth outsourcing, outsource them to whoever you like. The diagnosis shouldn't cost you a salary.

The first URL I pointed it at was my own, because you don't ship a mirror without looking in it first. It found things. I fixed them.

One rule I built it around: no fake confidence

Most of these dashboards feel slippery because they show a guess as a fact. One confident number you can't interrogate.

So the scan has one rule: every signal says how sure it is. Measured, estimated, or not measured. If I actually fetched your robots.txt and read the rule, that's measured. If I'm inferring authority from public signals, that's estimated and labelled. If something genuinely can't be known from the outside, it says so instead of inventing a number to fill the box. Less impressive, more honest. That's the trade I want.
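One way to make that rule hard to break is to put the label in the data type itself. This is a sketch of the idea, not the scanner's actual code: the `Signal` and `Confidence` names and the example signal names are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Confidence(Enum):
    MEASURED = "measured"
    ESTIMATED = "estimated"
    NOT_MEASURED = "not measured"


@dataclass(frozen=True)
class Signal:
    name: str
    confidence: Confidence
    value: Optional[float] = None  # None when there is no number to report

    def __post_init__(self) -> None:
        # Refuse to carry a number for something that was never measured.
        if self.confidence is Confidence.NOT_MEASURED and self.value is not None:
            raise ValueError(f"{self.name}: not measured, so it cannot have a value")

    def render(self) -> str:
        """Every rendered signal carries its confidence label."""
        if self.value is None:
            return f"{self.name}: {self.confidence.value}"
        return f"{self.name}: {self.value:g} ({self.confidence.value})"


print(Signal("robots.txt allows GPTBot", Confidence.MEASURED, 1.0).render())
print(Signal("off-site reputation", Confidence.NOT_MEASURED).render())
```

With the check in the constructor, "inventing a number to fill the box" becomes an error instead of a design temptation.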

It's a loop, and it's honest about timing

A scan is only useful if you can tell whether your changes worked, so it's built as a loop, not a certificate. Scan for a baseline, make your fixes, scan again, diff.

For on-page readiness that loop is fast. Fix your robots.txt or your structure, re-scan minutes later, watch the score move. Visibility is slower and it says so. When you change your site, the AI answers don't update the same afternoon, because the models and search engines have to recrawl and reindex you first. Give it a couple of weeks before you expect the visibility probe to move.

The part no scan can fix for you

Here's the bit the ads skip. Most of your on-page readiness you can fix in a weekend. But whether a model actually trusts and quotes you is mostly not on your page at all. It's your reputation across the rest of the web: who links to you, who mentions you, whether you're a known entity or a stranger to the model.

No scan conjures that overnight, and neither does a monthly retainer, whatever the pitch says. What a scan can do is separate the easy on-page problems you fix today from the slow reputation ones that only real work over time moves. Knowing which is which, before you pay anyone, is half the battle.

Take it as a nudge, not gospel

The reason I built the measurement instead of reading opinions is that this stuff gets asserted a lot and checked rarely, and it's very checkable. If you're wondering whether an LLM can see your site, don't take a vendor's word for it, and don't take mine. Point something at a URL and read what the model actually makes of the page.

I build privacy-first tools for the web. This one is AI Visibility Scan - point it at a URL and see where you stand, free to try. My main product is RAG Chat, a privacy-first AI chatbot + search for WordPress. Need something custom built? Tell me what you need. No tracking pixels were used in this post.

Search with no AI in the answer, and why I chose plain chunks over tree-RAG

Jan Michalík — Sun, 07 Jun 2026 20:28:39 +0000

> I develop a privacy-first RAG chatbot for WordPress. This combines two writeups from pagecoder.ai - on search and on chunking - and continues my earlier post on what AI chat plugins leak per question.

Last cycle I added two things to my WordPress RAG plugin that people kept asking for: visitor-facing search, and indexing big PDFs. Each one came down to a retrieval decision worth writing down.

1. Search: retrieve, but don't generate

My chatbot answers in two stages - retrieve the most relevant content, then hand it to a model that writes a reply. Search is just the first stage, stopped before the second.

When you type into the search box, I run the same retrieval the chatbot uses and then stop. No model is asked to compose an answer. You get the raw matches back: ranked results with your terms highlighted.

Stopping early is the point. It buys three things:

No hallucination. Nothing is generated, so nothing can be invented. Every result links to a real page.
You land on the source. A ranked link, not a paraphrase that may or may not match the page.
It's lighter. The slow, expensive part of a chat reply is the model writing a few hundred words. Skip it and the same lookup gets cheaper and faster.

One honest caveat, because the whole product is about not overclaiming: search is not zero AI. To match on meaning and not just keywords, the query is still turned into an embedding by the same backend the chat uses, and that embedding is discarded afterwards. What it does not do is the thing people actually worry about: no model writes text about your content.
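To make "retrieve, then stop" concrete, here is a toy sketch. Everything in it is an assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, and the function names and the `**` highlight markers are invented. The shape is what matters: rank, highlight, return links, and never call a generator.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def highlight(text: str, query: str) -> str:
    """Wrap whole-word query terms in ** markers, ignoring case."""
    terms = sorted(set(re.findall(r"[a-z0-9]+", query.lower())), key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"**{m.group(0)}**", text)


def search(query: str, pages: dict[str, str], top_k: int = 3) -> list[tuple[str, str]]:
    """Stage one only: return (url, highlighted text), best match first.

    There is no stage two here, so nothing is generated.
    """
    q = embed(query)
    scored = [(cosine(q, embed(text)), url) for url, text in pages.items()]
    ranked = sorted((s for s in scored if s[0] > 0), reverse=True)[:top_k]
    return [(url, highlight(pages[url], query)) for _, url in ranked]
```

Every returned item is keyed by a real URL from `pages`, which is the "you land on the source" property in code form.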

2. Chunking: the boring method beat the clever one

To index a big PDF you first have to cut it into pieces. There's a boring way and a clever way.

Boring: fixed-size chunks. Walk the document, cut it into roughly equal pieces of a few paragraphs each, with a little overlap so a sentence on a boundary isn't lost. No idea what a heading is. Just consistent slices.
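A minimal sketch of the boring method, assuming the text is already extracted. It slices by word count rather than paragraphs to keep the example short; the `size` and `overlap` defaults are placeholders, not the values the plugin uses.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 30) -> list[str]:
    """Cut text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the document
    return chunks


print(chunk_words("a b c d e f g h i j", size=4, overlap=1))
# → ['a b c d', 'd e f g', 'g h i j']
```

The repeated word at each boundary is the overlap doing its job: a sentence that straddles a cut still appears whole in at least one chunk, as long as it is shorter than the overlap.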

Clever: tree-RAG (e.g. PageIndex). Build a tree from the table of contents, sections become nodes, and at query time you walk down to the most relevant branch and pull that whole section. On paper it's obviously better for long structured documents.

I wanted the clever one - it's the more impressive thing to say you built. So I tested it properly instead of guessing: a graded eval (every answer scored correct / partial / wrong), run on ordinary pages and on the long, table-of-contents-heavy PDFs that are the tree's home turf.

It didn't win. Directional results - I'd rather you run your own eval than trust my numbers:

What I measured	Fixed-size chunks	Tree-RAG
Accuracy on ordinary pages	Higher	Lower
Accuracy on long structured PDFs	More reliable	Mixed
Cost per question	Cheaper	Noticeably more
Tokens fed to the model	Lean	Much heavier
The biggest PDF I threw at it	Indexed fine	Failed to build its tree at all
Where it actually won	Most question types	Broad "summarize the whole thing" questions

Why the boring method won: tokens and cost

It comes down to how much you hand the model. When the tree pulls "the most relevant section," that section can be pages long and all of it goes into the prompt. Fixed-size chunks hand over a few tight pieces and nothing else. That shows up on the bill (you pay for what the model reads) and in quality (burying the answer in surrounding text gives the model room to wander). Lean retrieval is often more accurate, not just cheaper.

The honest caveat

The tree wasn't useless. For "what is this entire document about?" it was genuinely better - it can feed the model a whole structured section at once. If your use case is almost entirely whole-document summarization, it might be worth the cost. For the specific "where does it say X?" questions real visitors ask, the boring slices won. So I shipped chunks and parked the tree.

What this means for PDFs (and your data)

I index big PDFs with the method that proved itself. Text is extracted, sliced, and searched - and on the privacy side it works like the rest of the product: extraction happens in memory, the file itself is never stored on my side, and the searchable pieces live in your own database.

Take it as a nudge, not gospel

The reason I tested instead of reading opinions is that this stuff is easy to assert and easy to check. If you're choosing a retrieval or chunking strategy, build a small graded eval on your documents before adopting the fancy thing. It's the most useful afternoon you'll spend on RAG.
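The scoring side of a graded eval is small. Here is a sketch, assuming you have already graded each answer by hand as correct, partial, or wrong. Giving half credit for "partial" is my assumption, not something the eval above specifies.

```python
# Credit per grade; the half credit for "partial" is an assumption.
CREDIT = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}


def tally(grades: list[str]) -> dict[str, int]:
    """Count how many answers received each grade."""
    return {grade: grades.count(grade) for grade in CREDIT}


def score(grades: list[str]) -> float:
    """Average credit over all graded answers (0.0 for an empty run).

    Raises KeyError if a grade is not one of the three known labels.
    """
    if not grades:
        return 0.0
    return sum(CREDIT[g] for g in grades) / len(grades)


# Placeholder grades to show the shape of the output, not results from this post.
run = ["correct", "partial", "wrong", "correct"]
print(tally(run), score(run))
```

Run both strategies over the same questions and compare the two scores. Keeping the per-grade tally next to the average shows whether a strategy is mostly right or just rarely completely wrong.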

I build RAG Chat - a privacy-first AI chatbot + search for WordPress. 7-day free trial, no card. Need something custom built? Tell me what you need. No tracking pixels were used in this post.

WordPress AI chat plugins make 6–11 outbound requests per visitor question. Architecture writeup of an alternative.

Jan Michalík — Thu, 07 May 2026 21:35:54 +0000

Originally published at pagecoder.ai/blog/why-your-ai-chatbot-is-a-tracker. Cross-posted here for the dev.to community.

Last week we sat in a café in Vienna with a friend, a freelance dev who's been shipping WordPress sites for a decade. He'd just installed an AI chat plugin for a client, a small cosmetics brand. It looked nice. Brand colors, custom name.

Then he opened the network tab.

"Three. Four. Five." He scrolled.
"Eight. Wait — eleven? Where is this one going?"

Eleven outbound requests. Per visitor question. Before the bot's reply finishes rendering.

If you ship WordPress sites for clients and you're considering an AI chatbot, this post is the architecture audit you probably haven't done yourself yet.

The architecture of the leak

visitor question
  ├─→ third-party AI provider (LLM call)
  │     - prompt logged for "abuse monitoring"
  │     - retention window = vendor-defined
  ├─→ chatbot vendor's own backend (the "data product")
  │     - browser fingerprint + IP + geolocation
  │     - full conversation + page URL
  │     - this IS the actual product the vendor sells
  └─→ widget's embedded CDNs / analytics endpoints (~3-9 of these)
        - CDN providers owe no privacy policy
        - implicit consent via embed

A single visitor question on a small WordPress site running a popular AI chatbot can leak to 6–11 different companies before the page renders the reply. None of them are in the site's cookie banner. None appear in any data-subject access request. None know who the brand is — all of them know who its visitors are.
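You can reproduce the network-tab count on any site. Here is a sketch, assuming you have copied the request URLs out of the browser (for example from an exported HAR file). The `third_party_hosts` helper and all hostnames are made up, and distinct hostnames are only a proxy for distinct companies: one vendor can own several hosts.

```python
from urllib.parse import urlsplit


def third_party_hosts(site_url: str, request_urls: list[str]) -> set[str]:
    """Return hostnames that are neither the site itself nor one of its subdomains."""
    site_host = urlsplit(site_url).hostname or ""
    hosts = set()
    for url in request_urls:
        host = urlsplit(url).hostname
        if host and host != site_host and not host.endswith("." + site_host):
            hosts.add(host)
    return hosts


# Hypothetical requests captured while asking the chatbot one question.
requests_seen = [
    "https://shop.example/wp-json/chat/ask",
    "https://cdn.shop.example/widget.js",
    "https://api.vendor-a.example/chat",
    "https://fonts.vendor-b.example/f.woff2",
    "https://api.vendor-a.example/log",
]
print(sorted(third_party_hosts("https://shop.example", requests_seen)))
```

Every host in that set is a party that learned something about the visitor, which makes it a short list to compare against the cookie banner.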

EU regulators are catching up. Default architecture isn't.

Why we started over

We'd been installing those plugins for clients for years. Ticked the privacy-policy boxes. Stopped noticing. Then a client asked, casually, "where exactly does that go when someone types it in?"

We didn't have a clean answer.

So we started over. Three architectural rules:

1. Vectors live on the user's server

Standard SaaS playbook says "use Pinecone / Weaviate / our managed vector DB." We store embeddings in the customer's WordPress database (custom post type with vector + chunk_id + source_post_ref). Lookup is a single SQL query with cosine similarity computed in PHP — yes, not as fast as a dedicated vector DB, but fast enough for the typical 10K-20K-chunk corpus a small WP site has.
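The lookup itself is only a few lines of arithmetic. The plugin does this in PHP; the sketch below uses Python purely as illustration and assumes the rows have already been loaded by that single SQL query. The field names follow the ones above; the function names are mine.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], rows: list[dict], k: int = 5) -> list[tuple[float, int]]:
    """Return the k best (similarity, chunk_id) pairs, best first."""
    scored = ((cosine_similarity(query_vec, row["vector"]), row["chunk_id"]) for row in rows)
    return heapq.nlargest(k, scored)


# Tiny made-up corpus; real vectors would have hundreds of dimensions.
rows = [
    {"chunk_id": 1, "vector": [1.0, 0.0], "source_post_ref": 10},
    {"chunk_id": 2, "vector": [0.0, 1.0], "source_post_ref": 10},
    {"chunk_id": 3, "vector": [1.0, 1.0], "source_post_ref": 11},
]
print(top_k([1.0, 0.0], rows, k=2))
```

This is a full scan, linear in the number of chunks, which is why it works at tens of thousands of chunks and stops being reasonable at millions.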

Tradeoff: scaling. Customers with 1M+ chunks would need a real vector DB. We bet 99% of WP customers won't hit that wall.

2. The math is stateless

The flow:

embedding (request) → similarity_search() → top_k chunks → response → done

No log. No "queries collection" admin tab. The backend literally forgets the request happened.

This costs us product features:

Can't show "your top 10 most-asked questions" dashboard
Can't do per-user conversation history
Can't optimize answers based on aggregate signals

The only honest way to promise we won't lose your visitors' data is to never have it.

3. Zero third-party calls from the widget

Fonts subset and self-hosted (no Google Fonts CDN)
JS bundle has no external script tags
No analytics pixel (we built our own backend on the same server, daily-rotating salts, no IP storage)
No social tracking pixels for share buttons (intent URLs only)
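"Daily-rotating salts, no IP storage" describes a common pattern for cookieless counting. The sketch below shows one way to read that phrase, not the plugin's actual code: the class, the in-memory storage, and the choice to key on IP plus user agent are all assumptions.

```python
import hashlib
import hmac
import secrets
from datetime import date
from typing import Optional


class DailyVisitorCounter:
    """Counts distinct visitors per day without ever storing an IP address."""

    def __init__(self) -> None:
        self._day: Optional[date] = None
        self._salt = b""
        self._seen: set[bytes] = set()

    def _rotate(self, today: date) -> None:
        if today != self._day:
            # New day: fresh random salt, and the old salt and digests are dropped,
            # so yesterday's visitors cannot be linked to today's.
            self._day = today
            self._salt = secrets.token_bytes(32)
            self._seen.clear()

    def record(self, ip: str, user_agent: str, today: date) -> None:
        self._rotate(today)
        message = f"{ip}|{user_agent}".encode()
        digest = hmac.new(self._salt, message, hashlib.sha256).digest()
        self._seen.add(digest)  # only the keyed digest is kept, never the IP

    def unique_today(self) -> int:
        return len(self._seen)
```

Because the salt is random and thrown away on rotation, the stored digests cannot be reversed by hashing the IPv4 address space once the day is over.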

Open the network tab on a site running our plugin and you'll count to two: WP itself, and one stateless math endpoint at our backend.

The Loop

Standard chatbots forget the conversation. Visitor asks, bot answers, session closes, the site never learns. The chatbot vendor learns — they aggregate questions across all customers — but the site owner sees nothing.

We made closing that loop the actual product. The plugin clusters incoming question variations, drafts a clean FAQ entry, and shows the admin two buttons: publish or discard. AI proposes; human curates. The output is a real indexed page at /faq/your-question/ with FAQPage schema markup.

Visitors who type the question into the bot get the answer instantly. Visitors who Google the question land on the same page. One piece of content, two jobs.
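For reference, FAQPage markup is a small JSON-LD object using the schema.org types `FAQPage`, `Question`, and `Answer`. Here is a sketch of generating it for one published entry; the `faq_jsonld` helper and the sample question are illustrative, not the plugin's output.

```python
import json


def faq_jsonld(question: str, answer: str) -> str:
    """Build schema.org FAQPage JSON-LD for a single question and answer."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


# Made-up entry; in a page this goes inside <script type="application/ld+json">.
print(faq_jsonld("Do you ship to Austria?", "Yes, within three working days."))
```

Building the object with `json.dumps` instead of string templates keeps quotes and non-ASCII characters in the answer from breaking the markup.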

Three questions before installing any AI chatbot

Where do my visitors' questions go? If the answer involves any company name other than yours, the answer is "to that company too".
Where do you store my content? "Our cloud" = your content is part of their dataset. "Your database" — ask to see the table.
What happens if I uninstall? "Your data stays with us forever" vs "the data is gone, because it was always yours".

We pass all three. We're one of the few that do.

Full manifesto with the cosmetics-brand callback and the closing scene: pagecoder.ai/blog/why-your-ai-chatbot-is-a-tracker

Plugin: pagecoder.ai/products/rag-chat

(Disclosure: I'm a co-founder. The audit pattern in this post applies regardless of which plugin you pick — even if you never use ours.)