Matteo Perino

Four small CLIs to make your site visible to AI engines

Most of the GEO/SEO tooling on the market right now reads like it was written to sell a course, not to solve a problem.

So I wrote four tools instead.

Four Node CLIs, zero runtime dependencies, MIT, each one does one thing. They all live under the @geosuite scope on npm, and the source is at github.com/TryGeoSuite.

Here's what they do, and the design call behind each one.


1. @geosuite/ai-crawler-bots

What it does: tells you whether GPTBot, ClaudeBot, PerplexityBot, and ~20 other AI crawlers can actually reach your site, and where the block is coming from when they can't.

npx @geosuite/ai-crawler-bots robots https://your-site.com

The non-obvious part: when a request comes back 403, the result distinguishes between an edge block (Cloudflare / CloudFront / Vercel / Akamai / Fastly / Netlify fingerprint in the response) and an origin block (no such fingerprint — your application or web server). The remediation is different in each case: edge means flip a toggle in your CDN dashboard, origin means update a config.
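
If you want to eyeball the distinction yourself, the fingerprint usually shows up in the response headers. A quick manual check with curl (illustrative; the headers shown are typical of a Cloudflare edge block, and your output will differ):

  curl -sI -A "GPTBot" https://your-site.com/
  # HTTP/2 403
  # server: cloudflare
  # cf-ray: ...

An origin block returns the 403 without any CDN headers like these, which points you at your own application or server config instead.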

It also parses robots.txt with line-level provenance, so when a bot is Disallowed it tells you which line in which group did it. And it detects the # BEGIN Cloudflare Managed content / # END Cloudflare Managed content markers Cloudflare injects when "Block AI Bots" is enabled: if your own rules would have allowed the bot but the managed block disallows it, the report says so.
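
To make the managed-block case concrete, a robots.txt in that state looks roughly like this (rules illustrative; the exact bot list Cloudflare injects varies):

  # your rules
  User-agent: *
  Allow: /

  # BEGIN Cloudflare Managed content
  User-agent: GPTBot
  Disallow: /
  # END Cloudflare Managed content

Your own group allows everything, but GPTBot matches the injected group and is disallowed, so the report points at the managed block as the cause.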

UA strings come from operator docs, not third-party SEO blogs that copy each other. We don't accept entries without a docs link.



2. @geosuite/schema-templates

What it does: ships 23 copy-paste-ready schema.org JSON-LD templates plus an offline structural validator.

npx @geosuite/schema-templates list
npx @geosuite/schema-templates show Product

JSON-LD is the cheapest, least ambiguous signal you can give an AI assistant about what your page is. It will not on its own make ChatGPT cite you — authority and freshness still matter — but it removes a class of avoidable failures. The AI no longer has to guess your prices, your author, or whether a number on the page is a benchmark or a typo.

I deliberately excluded fields that aren't truly recommended for each type. Padding templates with every optional schema.org property dilutes the signal. If you need a field that's not there, schema.org is the source of truth — add it yourself.
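
For a sense of what that looks like, here's a minimal Product object in that spirit (illustrative values, not the exact template the package ships):

  {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Acme Widget",
    "description": "A widget that does one thing.",
    "offers": {
      "@type": "Offer",
      "price": "19.00",
      "priceCurrency": "USD",
      "availability": "https://schema.org/InStock"
    }
  }

Drop it into a <script type="application/ld+json"> tag and the price stops being something a model has to infer from page layout.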

There's also geosuite-schema fill <Type> --url <url> --ai if you want the LLM to populate placeholders from a real page, but the deterministic side (templates + validator) does not need a network or an API key.



3. @geosuite/llms-txt-generator

What it does: turns a sitemap.xml into an llms.txt file per the proposed standard at llmstxt.org.

npx @geosuite/llms-txt-generator https://your-site.com/sitemap.xml \
  --name="Your Site" --enrich --out=public/llms.txt

llms.txt is intended to be the LLM-shaped equivalent of a sitemap: a curated, sectioned, markdown index of your most important pages. The format is small enough to be parsed by classical tooling (regex) and also legible to a model — that's the point.
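
For reference, a generated file has roughly this shape (contents illustrative):

  # Your Site
  > One-sentence summary of what the site is about.

  ## Docs
  - [Getting started](https://your-site.com/docs/start): Install and first run
  - [API reference](https://your-site.com/docs/api): Endpoints and auth

  ## Blog
  - [Launch post](https://your-site.com/blog/launch): Why the project exists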

The generator is deterministic. With --enrich it fetches each URL once and pulls <title> + <meta name="description"> via regex. No headless browser, no LLM dependency in the default path. (--ai is opt-in if you want the LLM to rewrite descriptions; we send only URL + title + meta, never the page body.)

Sitemap-index files are flattened automatically. Pass them like a flat sitemap.



4. @geosuite/sitemap-builder

What it does: crawls a site and emits a valid sitemap.xml. For sites that ship without one (more common than you'd think on custom builds).

npx @geosuite/sitemap-builder https://your-site.com --output sitemap.xml

BFS, same-origin only, with three stacked caps: page count, depth, and a wall-clock budget. Whichever fires first wins. It drops obvious non-HTML extensions and fragment-only links. Output is sitemaps.org-compliant: <loc> plus optional <lastmod>, no <changefreq> or <priority> (effectively deprecated and ignored by every major engine).
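
The cap logic is simple enough to sketch. This is not the actual sitemap-builder source, just a minimal illustration of BFS with the three caps (it uses Node 18's global fetch instead of node:http/node:https for brevity):

  const MAX_PAGES = 500;       // page-count cap (illustrative values)
  const MAX_DEPTH = 5;         // depth cap
  const BUDGET_MS = 60_000;    // wall-clock budget

  async function fetchHtml(url) {
    try {
      const res = await fetch(url);
      const type = res.headers.get('content-type') || '';
      return res.ok && type.includes('text/html') ? await res.text() : null;
    } catch {
      return null;             // network error: treat as unfetchable
    }
  }

  async function crawl(startUrl) {
    const origin = new URL(startUrl).origin;
    const seen = new Set([startUrl]);
    const queue = [{ url: startUrl, depth: 0 }]; // FIFO queue => breadth-first
    const found = [];
    const deadline = Date.now() + BUDGET_MS;

    while (queue.length > 0) {
      // Whichever cap fires first wins.
      if (found.length >= MAX_PAGES || Date.now() > deadline) break;

      const { url, depth } = queue.shift();
      const html = await fetchHtml(url);
      if (html === null) continue;               // non-HTML or fetch failure
      found.push(url);
      if (depth >= MAX_DEPTH) continue;          // depth cap: keep the page, skip its children

      for (const m of html.matchAll(/href="([^"]+)"/g)) {
        if (m[1].startsWith('#')) continue;      // fragment-only link
        let next;
        try { next = new URL(m[1], url); } catch { continue; }
        next.hash = '';
        if (next.origin !== origin) continue;    // same-origin only
        if (/\.(png|jpe?g|gif|svg|css|js|pdf|zip)$/i.test(next.pathname)) continue;
        if (!seen.has(next.href)) {
          seen.add(next.href);
          queue.push({ url: next.href, depth: depth + 1 });
        }
      }
    }
    return found; // each URL becomes a <url><loc>…</loc></url> entry
  }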

Whole tool is around 250 lines of vanilla Node. No puppeteer, no cheerio, no axios. Just node:http, node:https, and a few regexes.



The design choices, all in one place

  • Zero runtime dependencies. The four packages combined add ~0 install footprint to your project. The only exception is llms-txt-generator, which depends on fast-xml-parser for the sitemap-index path because writing your own XML parser is a footgun.
  • AI mode is opt-in. Every CLI has a --ai flag. Without it, behaviour is fully deterministic. With it, payloads are minimal and structured (verdicts, titles, depths) — never raw HTML or page bodies.
  • One tool, one job. Composable via stdout/JSON. If you want to chain sitemap-builder into llms-txt-generator, that's a single pipe.
  • Boring code. No clever metaprogramming. The whole stack is meant to be readable in an afternoon. If it isn't, that's a bug, not a feature.

Why open source the building blocks

The same checks power GeoSuite, the hosted product I'm building (history, alerts, dashboards, integrations into your content pipeline). But the building blocks belong in the open: I find it dishonest to sell a black box that does things any developer could verify.

If you find a bot UA missing — or worse, a wrong one — the place to send it is bots.json in ai-crawler-bots, with a link to the operator's docs. UA strings drift a couple of times per year per operator, and that file ages faster than anything else in the suite.

PRs and issues welcome. Especially the ones that prove me wrong.

github.com/TryGeoSuite
