If you've been following the AI search space, you've probably heard about llms.txt – a simple Markdown file that tells AI crawlers (ChatGPT, Claude, Perplexity, Gemini, DeepSeek, and others) what your site is about and what content they're allowed to use. Think of it as robots.txt, but designed for the LLM era.
In this post I'll walk through how I built a CLI tool – geo-ai-cli – that generates and validates these files for any Node.js project, and share some of the interesting design decisions along the way.
What is llms.txt?
The llms.txt proposal defines a standard for exposing structured site content to AI systems. Two files:
llms.txt – a compact summary: site name, description, and a list of key pages with titles and URLs
llms-full.txt – the same, but with full content for each page
A minimal llms.txt looks like this:
# My SaaS Product
> A brief description for AI crawlers.
## Pages
- [Home](https://example.com/): Welcome page
- [Pricing](https://example.com/pricing): Plans and pricing
## Blog
- [Getting Started](https://example.com/blog/start): First steps guide
AI search engines and assistants like Perplexity, ChatGPT, and Claude read these files to understand your site's content and include it in their answers. It's essentially GEO – Generative Engine Optimization.
The CLI
npm install --save-dev geo-ai-cli
Four commands cover the full workflow:
npx geo-ai init # scaffold a config file
npx geo-ai generate # generate llms.txt + llms-full.txt
npx geo-ai validate # check output files are valid
npx geo-ai inspect # preview config and crawler rules
init
Scaffolds a geo-ai.config.ts in the current directory. If a config already exists, it exits without overwriting, so it's safe to run multiple times.
// geo-ai.config.ts
import type { GeoAIConfig } from 'geo-ai-core';
export default {
siteName: 'My Site',
siteUrl: 'https://example.com',
siteDescription: 'A brief description for AI crawlers.',
crawlers: 'all',
provider: {
Pages: [
{ title: 'Home', url: 'https://example.com/', description: 'Welcome page' },
],
Blog: [
{ title: 'Getting Started', url: 'https://example.com/blog/start', description: 'First steps' },
],
},
} satisfies GeoAIConfig;
The provider field is a plain object where each key becomes a section in llms.txt. You can also pass a ContentProvider instance if you want to pull data from a CMS or API dynamically.
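For the dynamic case, the core of a provider is just mapping your CMS records into the same `{ title, url, description }` item shape the plain-object config uses. The real ContentProvider interface lives in geo-ai-core – the function below is an illustrative sketch of that mapping, with assumed CMS field names:

```typescript
// Sketch of mapping CMS records into llms.txt section items.
// The item shape matches the plain-object config above; the CmsPost
// fields (slug, excerpt) are assumptions for illustration.
type Item = { title: string; url: string; description: string };
type CmsPost = { title: string; slug: string; excerpt: string };

function postsToSection(baseUrl: string, posts: CmsPost[]): Item[] {
  return posts.map((p) => ({
    title: p.title,
    url: `${baseUrl}/blog/${p.slug}`,
    description: p.excerpt,
  }));
}
```

A ContentProvider would typically fetch the posts from your CMS API at generate time and return sections built this way.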
generate
Loads your config, calls createGeoAI(config) from geo-ai-core, and writes both files to ./public (or wherever you point --out):
npx geo-ai generate --out ./dist/public
Config discovery follows a priority order: geo-ai.config.ts → geo-ai.config.js → geo-ai.config.json. TypeScript configs are loaded via dynamic import(), so you get full type safety without a separate compilation step.
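The discovery logic boils down to "first candidate that exists wins". A minimal sketch of that order (illustrative, not the actual CLI source):

```typescript
// First config file found in this order is used.
const CONFIG_CANDIDATES = [
  'geo-ai.config.ts',
  'geo-ai.config.js',
  'geo-ai.config.json',
];

function pickConfig(existingFiles: string[]): string | undefined {
  return CONFIG_CANDIDATES.find((name) => existingFiles.includes(name));
}
```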
validate
Checks that the generated files are present and structurally valid. Works on local files or a live URL:
# local
npx geo-ai validate --path ./public
# remote
npx geo-ai validate --url https://example.com
Validation rules are intentionally simple:
- file missing: not_found
- content < 50 chars: warn
- doesn't start with #: fail
- starts with #: pass
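The rules above fit in a few lines. This sketch is illustrative rather than the CLI source, and the precedence between the short-content and missing-heading checks is my assumption:

```typescript
// Sketch of the validation rules listed above.
type ValidationResult = 'not_found' | 'fail' | 'warn' | 'pass';

function validateLlmsTxt(content: string | null): ValidationResult {
  if (content === null) return 'not_found';          // file missing
  if (!content.trimStart().startsWith('#')) return 'fail'; // no Markdown heading
  if (content.length < 50) return 'warn';            // suspiciously short
  return 'pass';
}
```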
Exit code 1 on any fail or not_found, so it integrates cleanly into CI:
# .github/workflows/deploy.yml
- run: npx geo-ai generate
- run: npx geo-ai validate
inspect
Prints a human-readable summary of your config: site info, which AI bots are allowed/disallowed, and how many items are in each section. Useful for debugging before you generate.
npx geo-ai inspect
# Site: My Site
# URL: https://example.com
# Out: ./public
# Crawlers:
# GPTBot: allow
# ClaudeBot: allow
# PerplexityBot: allow
# ...
# Sections:
# Pages: 3 item(s)
# Blog: 12 item(s)
You can also point it at a live URL to fetch and display the remote files:
npx geo-ai inspect --url https://example.com
Adding it to your build pipeline
The most useful place for geo-ai generate is right before your deployment step:
// package.json
{
"scripts": {
"build": "next build && geo-ai generate",
"postbuild": "geo-ai validate"
}
}
Or in CI:
- name: Build
run: npm run build
- name: Generate llms.txt
run: npx geo-ai generate
- name: Validate
run: npx geo-ai validate --url ${{ env.SITE_URL }}
Controlling which AI bots can crawl
By default, crawlers: 'all' allows all 16+ known AI bots. You can get granular:
export default {
siteName: 'My Site',
siteUrl: 'https://example.com',
crawlers: {
GPTBot: 'allow',
ClaudeBot: 'allow',
PerplexityBot: 'allow',
'Google-Extended': 'disallow', // opt out of Gemini training
Bytespider: 'disallow', // TikTok crawler
},
provider: { /* ... */ },
} satisfies GeoAIConfig;
The geo-ai-core engine uses this to generate a corresponding robots.txt block and per-bot allow/disallow rules in the llms files.
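Conceptually that's a straightforward mapping from the crawler config to robots.txt directives. A hedged sketch of what such a block could look like (the actual output format of geo-ai-core may differ):

```typescript
// Sketch: turn per-bot rules into robots.txt User-agent blocks.
type Rule = 'allow' | 'disallow';

function robotsBlock(crawlers: Record<string, Rule>): string {
  return Object.entries(crawlers)
    .map(([bot, rule]) =>
      rule === 'allow'
        ? `User-agent: ${bot}\nAllow: /`
        : `User-agent: ${bot}\nDisallow: /`
    )
    .join('\n\n');
}
```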
What's next
The CLI is part of a larger ecosystem:
- geo-ai-core — the zero-dependency engine (llms.txt generation, bot rules, crawl tracking, SEO signals, AI description generation via Claude/OpenAI)
- geo-ai-next — a thin Next.js wrapper with middleware and App Router handler
- WordPress and Shopify integrations
The geo-ai-core/ai entry point lets you bulk-generate AI descriptions for your content using Claude or OpenAI, with a built-in rate limiter and batching – but that's a topic for another post.
If you're building a public-facing Node.js project, adding llms.txt takes about 5 minutes and meaningfully improves how AI search engines understand your content. Give it a try:
npx geo-ai init
Links
Website: https://www.geoai.run
CLI page: https://www.geoai.run/cli
CLI docs: https://www.geoai.run/docs/integrations/cli
GitHub: https://github.com/madeburo/GEO-AI
