How to Build a Custom GPT Knowledge Base in Under 5 Minutes (No Local Setup)

I have built knowledge bases for 14 custom GPTs in the last 3 months. The first one took me 4 hours. The last one took me 90 seconds. The difference is not skill. The difference is that I stopped fighting my laptop and started using a managed crawler that ships a clean JSON file straight to my ChatGPT upload box.

If you have ever opened a terminal at 11pm, run npm install gpt-crawler, watched ESM errors scroll for 6 minutes, then realized Playwright wants a separate Chromium download, you already know the problem. The crawl logic is fine. The local setup is the wound. This article walks through the workflow I use now: a one-click cloud crawler that produces a ChatGPT-ready knowledge file in under 5 minutes, no node_modules, no Python venv, no Chromium-Helper (Renderer) eating 8GB of RAM.

Everything below is what I run in production for paying clients. The Actor I use is GPT Crawler MCP on Apify, a hosted wrapper around the legendary BuilderIO/gpt-crawler (19k+ stars on GitHub) with an extra MCP standby mode that I will get to in the MCP section below.

Key takeaway: the bottleneck on a custom GPT is not the prompt. It is the quality and freshness of the knowledge file. Fix that pipeline first and the rest gets cheap.

๐ŸŽฏ What a "knowledge file" actually is (and why ChatGPT cares)

A knowledge file is a single JSON, Markdown, or plain-text document containing the cleaned content of every page on a target site, deduplicated, stripped of nav and footer noise, with each page tagged by URL and title. ChatGPT custom GPTs accept up to 20 such files in their "Knowledge" slot. Claude Projects accept similar uploads in "Project knowledge". RAG pipelines (LangChain, LlamaIndex, n8n agents) embed the same file into Pinecone, pgvector, or Weaviate.

The reason ChatGPT cares is retrieval. When a user asks your custom GPT a question, OpenAI runs a vector search over the chunks of your knowledge file and stuffs the top hits into the system context. Garbage in, garbage out. If your file is full of cookie banners, JS hydration placeholders, or duplicate sidebar text, the retrieval picks those up and your GPT hallucinates with a straight face.

So the goal of a good crawl is not "scrape every byte". The goal is clean text per page, one entry per URL, and JSON shaped so an LLM can chew it.

๐Ÿงฑ The shape ChatGPT wants

Here is one entry from the output of a production crawl I ran yesterday on a docs site (30 pages, returned in 47 seconds):

{
  "url": "https://docs.your-product.com/getting-started",
  "title": "Getting started - YourProduct docs",
  "html": "Welcome to YourProduct...",
  "text": "Welcome to YourProduct. This guide walks you through the first 5 minutes...",
  "tokens": 412,
  "crawledAt": "2026-04-27T09:14:22.181Z"
}

Each page is one object. The combined file (an array of these objects) is what you upload. ChatGPT reads the text field, indexes on title, and uses url as the citation source. That is the entire contract.
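
For clarity, the file you upload is just an array of those per-page objects. A minimal sketch of the shape, trimmed to the fields ChatGPT actually reads (the URLs and content here are placeholders, not real crawl output):

[
  {
    "url": "https://docs.your-product.com/getting-started",
    "title": "Getting started - YourProduct docs",
    "text": "Welcome to YourProduct. This guide walks you through the first 5 minutes...",
    "tokens": 412
  },
  {
    "url": "https://docs.your-product.com/installation",
    "title": "Installation - YourProduct docs",
    "text": "Install YourProduct with a single command...",
    "tokens": 295
  }
]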

โšก The 5-minute workflow (zero local setup)

Here is the exact sequence I run for every new custom GPT. Total wall-clock: 4 to 6 minutes including the upload to ChatGPT.

1๏ธโƒฃ Open the Actor page

Go to apify.com/kazkn/gpt-crawler-mcp and click Try for free. If you do not have an Apify account, the signup is 30 seconds, no credit card. The free tier covers about 100 pages per month, plenty to validate before you scale.

2๏ธโƒฃ Paste your start URL and match pattern

This is the only step that requires thought. The urls field takes the entry point (https://docs.your-product.com). The match field is a glob that controls which links get followed (https://docs.your-product.com/**). If you forget the match pattern, the crawler will follow external links and your knowledge file will end up containing half of Stack Overflow.

Set maxPagesToCrawl to 30 for the first run. This is your cost ceiling and your sanity check. If 30 pages look clean, scale to 200 or 500.
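
Put together, a first-run input looks roughly like this. The field names follow the description above; whether urls takes a single string or a list is an assumption on my part, so double-check against the Actor's input form before running:

{
  "urls": ["https://docs.your-product.com"],
  "match": "https://docs.your-product.com/**",
  "maxPagesToCrawl": 30
}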

3๏ธโƒฃ Pick the output format

The Actor supports three formats:

  • JSON for ChatGPT custom GPTs and RAG pipelines (default, recommended).
  • Markdown for Claude Projects and human review.
  • TXT for legacy embedding pipelines that choke on JSON.

90% of the time I pick JSON. It carries the URL and tokens metadata that retrieval tools care about.

4๏ธโƒฃ Click Start, wait, download

The run takes 30 seconds for 10 pages, 90 seconds for 30 pages, 3 minutes for 100 pages. When it finishes, the Storage > Key-value store tab has your output.json ready to download. One click, file saved.
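
If you would rather script the run than click through the Console, a minimal sketch with the apify-client Python package looks like this. The input field names and the output.json record key are taken from the steps above, not from an official schema, so treat them as assumptions and adjust to whatever your run actually shows:

# pip install apify-client
import json
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and wait for it to finish (input field names assumed from step 2)
run = client.actor("kazkn/gpt-crawler-mcp").call(run_input={
    "urls": ["https://docs.your-product.com"],
    "match": "https://docs.your-product.com/**",
    "maxPagesToCrawl": 30,
})

# Download the knowledge file from the run's default key-value store;
# the record key ("output.json") should match what the Console shows
store = client.key_value_store(run["defaultKeyValueStoreId"])
record = store.get_record("output.json")

# Assumes the record comes back as parsed JSON (an array of page objects)
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(record["value"], f, ensure_ascii=False, indent=2)

print(f"Saved {len(record['value'])} pages to output.json")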

5๏ธโƒฃ Upload to ChatGPT

In ChatGPT, go to Create a GPT > Configure > Knowledge > Upload files, drop the JSON in. ChatGPT chunks and indexes it server-side in about 60 seconds. Done. Your custom GPT now answers from your docs, cites the URLs, and stops hallucinating about features that do not exist.

๐Ÿ”ง The MCP standby mode (live crawls from inside Claude Desktop)

This is the part most tutorials skip because most crawlers do not support it. The Actor exposes an MCP server in standby mode, which means an AI agent (Claude Desktop, Cursor, Windsurf, Continue.dev, any MCP-compatible client) can call the crawler live, mid-conversation, with no pre-indexing.

The use case: you are debugging in Cursor, you need the latest Stripe API docs, you do not want a stale knowledge file from last month. You type "crawl docs.stripe.com/api/customers, max 5 pages, return as JSON" in chat, the agent calls the Actor, you get the freshest content in 25 seconds.

Setup in Claude Desktop is one JSON block in ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "gpt-crawler": {
      "type": "url",
      "url": "https://kazkn--gpt-crawler-mcp.apify.actor/mcp?token=YOUR_APIFY_TOKEN",
      "timeout": 180000
    }
  }
}

The timeout: 180000 (180 seconds) is critical. Default MCP client timeouts are 30 seconds, which is shorter than most crawls. If you skip the timeout config you will see "interrupted connection" errors and waste an afternoon. The full client compatibility table lives in the Actor README and covers Cursor, Windsurf, Continue.dev, langchain-mcp, and the official npm SDK.

For a broader catalog of MCP-compatible Actors, browse the Apify MCP Servers category. The ecosystem is small but growing fast.

๐Ÿ’ฐ Cost reality (and why this beats subscriptions)

This is where pay-per-event eats subscription pricing alive. The Actor uses Apify's PPE model:

  • Batch (Console run): billed per page crawled at $0.001. Real-world cost: 30 pages = $0.03, 200 pages = $0.20.
  • MCP (standby): billed per tool request at $0.05. Real-world cost: one call returning 30 or 200 pages = $0.05 flat.
  • Cold start: billed per Actor start at $0.00005. Negligible.
A typical custom GPT knowledge file is 30 to 200 pages, costing 3 to 20 cents in batch mode. Compare that to Firecrawl's $39/month flat fee: you would need to build roughly 200 of those 200-page knowledge files every month before the subscription pays off. I have never met anyone who needs that.

The math is brutal: if you build fewer than 5 knowledge files a month, PPE is 95% cheaper than any subscription crawler. If you build 50, it is still cheaper.

๐Ÿ“ The 512KB ceiling nobody warns you about

This is the silent failure mode that costs people 30 minutes the first time. ChatGPT custom GPTs accept a maximum of 20 files at 512KB each. If you crawl a 300-page docs site and dump everything into one JSON file, you will land at 6 to 12MB and ChatGPT will silently refuse the upload. No useful error message, just a vague "could not process" that has burned at least one Saturday afternoon for every custom GPT builder I know.

The Actor has a maxTokens parameter that auto-truncates the crawl to fit. Set it to 100,000 and you will land safely under 512KB. Set it to 250,000 if you upload to Claude Projects, which is more generous on file size. If your site genuinely needs more context, split it into 3 to 5 thematic crawls (one per docs section) and upload them as separate files. ChatGPT lets you stack 20 of them.

Rule of thumb: 100k tokens per JSON file = 1 ChatGPT-compatible knowledge slot. 5 slots covers most B2B SaaS documentation.
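
A quick pre-upload check catches the silent failure before it costs you that Saturday afternoon. A small sketch, assuming the output.json path and the 512KB / 100k-token rules of thumb from above:

# Sanity-check the knowledge file size before uploading to ChatGPT
import json, os

path = "output.json"
size_kb = os.path.getsize(path) / 1024
pages = json.load(open(path, encoding="utf-8"))
total_tokens = sum(p.get("tokens", 0) for p in pages)

print(f"{len(pages)} pages, {size_kb:.0f} KB, {total_tokens:,} tokens")
if size_kb > 512 or total_tokens > 100_000:
    print("Over the ceiling: lower maxTokens or split into thematic crawls")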

๐Ÿงช What to verify after your first crawl

Before you upload to ChatGPT, open the JSON and grep for three things:

  1. No duplicate URLs. If the same page appears twice, your match pattern is too loose.
  2. No empty text fields. If a page has 0 tokens, JS hydration was probably blocked. Increase waitForSelectorTimeout to 3000ms.
  3. No nav/footer noise leaking into text. If every page has the same 200-word footer, set selector to main or article instead of the default body.

Fix these three and your custom GPT's retrieval quality jumps by maybe 40%. I do not have a formal benchmark, just qualitative feedback from 14 client GPTs.
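
Here is the spot-check as a rough sketch, assuming the output shape shown earlier (a JSON array of objects with url and text fields):

# Spot-check a crawl output for the three failure modes above
import json
from collections import Counter

pages = json.load(open("output.json", encoding="utf-8"))

# 1. Duplicate URLs -> the match pattern is too loose
dupes = [url for url, n in Counter(p["url"] for p in pages).items() if n > 1]

# 2. Empty text fields -> JS hydration was probably cut off
empty = [p["url"] for p in pages if not p.get("text", "").strip()]

# 3. Pages sharing the same opening text -> nav/footer noise; tighten the selector
openings = Counter(p.get("text", "")[:200] for p in pages)
repeated = [snippet for snippet, n in openings.items() if n > 1 and snippet.strip()]

print(f"duplicate urls: {len(dupes)}")
print(f"empty pages: {len(empty)}")
print(f"pages sharing the same opening text: {len(repeated)}")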

๐Ÿš€ Beyond the first crawl

Once the basic workflow clicks, the obvious next moves are scheduled re-crawls (Apify cron, $0 extra), n8n integration (the Apify connector is one drag-and-drop), and chaining the Actor into a RAG pipeline that pushes embeddings to Pinecone on every refresh. I cover those flows in other posts on my Apify portfolio page. For the official Standby mode docs, read the Apify Standby reference.

โœ… Conclusion

The fastest custom GPT I ever shipped went from "client sent docs URL" to "GPT live in ChatGPT" in under 5 minutes, billed at $0.04 in crawl cost. The slowest took 4 hours and ended with me reinstalling Node. The difference is not the model, the prompt, or the docs. It is the crawler.

Stop running scrapers on your laptop. Start treating the knowledge file as a managed-cloud problem. The pricing is honest, the MCP mode is genuinely new, and the BuilderIO core under the hood is the same one 19k GitHub stargazers already trusted.

If you want the live version with batch + MCP both ready, the Actor is at apify.com/kazkn/gpt-crawler-mcp. First 100 pages are free. Try it on your own docs site this afternoon and judge from the JSON.

โ“ FAQ

๐ŸŸข How long does it take to build a knowledge file for a 50-page docs site?

In batch mode through the Apify Console, a 50-page crawl typically completes in 75 to 120 seconds wall-clock, depending on the target site's response time and JavaScript rendering needs. Including upload to ChatGPT, your custom GPT goes live in roughly 4 minutes total, no local installation required.

๐ŸŸข What format works best for ChatGPT custom GPT knowledge?

JSON is the recommended format because it preserves URL metadata, page titles, and token counts that ChatGPT's retrieval system uses for citations and ranking. Markdown is preferable for Claude Projects where humans also read the knowledge. Plain TXT only fits legacy embedding pipelines that cannot parse structured input.

๐ŸŸข Can I crawl JavaScript-heavy sites built with React or Next.js?

Yes. The Actor uses Playwright with headless Chromium, identical to running BuilderIO/gpt-crawler locally, so client-rendered React, Vue, and Next.js sites are fully supported. Use the selector input to target the post-hydration content container, and bump waitForSelectorTimeout to 3000ms if the site hydrates slowly.
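
Added to the input from step 2, that combination looks something like this (field names as used in this article; verify them against the Actor's input form):

{
  "selector": "main",
  "waitForSelectorTimeout": 3000
}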

๐ŸŸข Do I need an Apify subscription to use this Actor?

No. Apify's free tier includes monthly platform credits that cover roughly 100 crawled pages per month at the $0.001 batch rate, enough to build and validate 2 or 3 small knowledge files. Beyond that you pay only for what you crawl, with no monthly fee, no commitment, and automatic Bronze/Silver/Gold subscription discounts if you scale up later.
