How I Built a Hosted Web-Crawler-to-Knowledge-File Pipeline (and Shipped It as an MCP Server in 6 Weeks)

I am a Switzerland-based indie developer. In the last six weeks I forked a 19,000-star open-source crawler, wrapped it for Apify's serverless infrastructure, added an MCP server mode, shipped 10 builds, and pushed the actor through Apify's automated daily test until the "Fix Actor health issues" warning cleared. The actor is kazkn/gpt-crawler-mcp. This post is the build log.

I am writing it because three people I respect asked me independently in the same week how I would put a "knowledge crawler" into the hands of non-developers without making them install Playwright. Whoever you are, if you have ever thought "this would be a great Apify actor, but I do not want to fight the runtime", this is for you.

Key takeaway up front: the value is not the crawler. There are dozens of crawlers. The value is removing every reason a user would not click Run — pinned runtime, prefilled inputs, MCP standby, transparent pricing, screenshots that work. The build log below is mostly about that one job.


🧭 The starting point — why fork instead of build

I had two real, paying clients last quarter who needed a custom GPT pre-loaded with a knowledge file built from a docs site. For the first one I reached for the canonical BuilderIO/gpt-crawler repo (19k+ stars, ISC license). It is a beautiful piece of code. Clone, npm install, edit config.ts, run npm start, get a JSON.

It took me 90 minutes. Not because the crawler is hard — the crawler is easy. It took 90 minutes because:

  • Playwright wanted a fresh Chromium download
  • Node 18 vs Node 20 ESM differences
  • A Docker-on-Mac quirk that ate 8 GB of RAM during a 30-page run
  • Two re-runs because I forgot to widen the match glob

A senior engineer eats this in 90 minutes once. A non-engineer client never gets past minute 12. And I am charging the client $5,000 for a custom GPT — they should not be running Playwright at home to refresh their knowledge.

So the decision was structural, not technical: take BuilderIO's crawl logic (it works, it is battle-tested, the maintainers ship every few weeks) and put it behind a managed runtime where the user clicks once and gets a JSON. That is what an Apify actor is, and that is what I shipped.

I did not rewrite the crawler. I do not have anything to add to BuilderIO's core. The repo is still credited as the source — the wrapper is a thin Apify adapter around their core.ts.


🏗️ Architecture in one diagram (and one paragraph)

```
                       ┌─────────────────────────────┐
                       │   Apify Console / API       │
                       │   (one-click Run)           │
                       └──────────────┬──────────────┘
                                      │
                                      ▼
        ┌──────────────────────────────────────────────────┐
        │                   main.ts                        │
        │   ┌──────────────┐         ┌─────────────────┐   │
        │   │ batch_runner │  OR     │   mcp_server    │   │
        │   │  (one-shot)  │         │  (Standby HTTP) │   │
        │   └──────┬───────┘         └────────┬────────┘   │
        │          │                          │            │
        │          ▼                          ▼            │
        │   ┌────────────────────────────────────┐         │
        │   │           crawler_core.ts          │         │
        │   │  (BuilderIO/gpt-crawler core 1:1)  │         │
        │   │  Playwright + Crawlee + match glob │         │
        │   └─────────────────┬──────────────────┘         │
        └─────────────────────┼────────────────────────────┘
                              ▼
              ┌───────────────────────────────┐
              │  Apify Default Dataset        │
              │  + Key-value store (output)   │
              └───────────────────────────────┘
```

The whole actor is three TypeScript files plus the Apify config. main.ts is a 30-line dispatcher that picks batch_runner or mcp_server based on the mcpMode input flag (or the Apify Standby env var). batch_runner.ts is a one-shot crawl that pushes each page to the default dataset and writes a combined knowledge file to the key-value store. mcp_server.ts is an Express handler that exposes crawl_to_knowledge as an MCP tool over /mcp.

The crawl is BuilderIO's untouched logic. My contribution is the plumbing.
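
For orientation, here is roughly what that dispatcher looks like. A sketch, not the shipped file: the APIFY_META_ORIGIN check is my shorthand for how Standby runs get detected, and the two imports stand in for the real modules.

```typescript
// main.ts, roughly. The real dispatcher lives in the repo; the env-var
// check is shorthand for Standby detection, not the verbatim wiring.
import { Actor } from "apify";
import { runBatch } from "./batch_runner.js";
import { startMcpServer } from "./mcp_server.js";

await Actor.init();

const input = (await Actor.getInput<{ mcpMode?: boolean }>()) ?? {};
const isStandby = process.env.APIFY_META_ORIGIN === "STANDBY";

if (input.mcpMode || isStandby) {
  await startMcpServer(); // long-lived HTTP server, never exits on its own
} else {
  await runBatch(input);  // one-shot crawl: dataset rows + knowledge file
  await Actor.exit();
}
```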


🧰 The MCP server mode — why I added it and why I almost did not

When I started, I thought MCP was a hype detour. Now I think it might be the most undersold feature of Apify Standby. Here is why.

A custom GPT loaded with a static knowledge file is good. But a custom GPT (or a Claude Project, or a Cursor agent) that can call a crawl tool live, mid-conversation, with whatever URL the user just mentioned, is better. The user types "check the latest LangChain LCEL docs and answer" and the agent fires crawl_to_knowledge(url=...) and gets fresh content in the response. No re-uploading a JSON, no stale embeddings, no overnight re-crawl pipeline.

Apify Standby makes this almost free. You set actorStandby.isEnabled: true in the actor manifest, expose an HTTP server on the port Apify gives you, and Apify keeps a warm instance alive between requests. The Actor wakes on first call, stays warm for ~60 seconds, charges per tool call.

The MCP wrapper itself is ~200 lines of Express. It implements the initialize, tools/list, tools/call handshake (per the MCP spec), exposes the single crawl_to_knowledge tool, and returns the dataset rows as the response. The hardest part was not the protocol — it was deciding what not to expose. I picked one tool. Not five. One tool that does the obvious thing.
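
Condensed, the handler looks something like this. A sketch, not the shipped ~200-line file: the JSON-RPC envelopes and method names follow the MCP spec, ACTOR_STANDBY_PORT is the env var Apify documents for Standby, and crawl() is a stand-in for the call into crawler_core.ts.

```typescript
// mcp_server.ts, condensed sketch. crawl() stands in for crawler_core.ts.
import express from "express";
import { crawl } from "./crawler_core.js"; // stand-in import

const app = express();
app.use(express.json());

app.post("/mcp", async (req, res) => {
  const { method, id, params } = req.body;
  switch (method) {
    case "initialize":
      return res.json({ jsonrpc: "2.0", id, result: {
        protocolVersion: "2024-11-05",
        capabilities: { tools: {} },
        serverInfo: { name: "gpt-crawler-mcp", version: "0.1.10" },
      }});
    case "tools/list":
      return res.json({ jsonrpc: "2.0", id, result: { tools: [{
        name: "crawl_to_knowledge",
        description: "Crawl a docs site and return one knowledge file",
        inputSchema: {
          type: "object",
          properties: { url: { type: "string" }, maxPagesToCrawl: { type: "integer" } },
          required: ["url"],
        },
      }]}});
    case "tools/call": {
      const pages = await crawl(params.arguments);
      return res.json({ jsonrpc: "2.0", id, result: {
        content: [{ type: "text", text: JSON.stringify(pages) }],
      }});
    }
    default:
      return res.json({ jsonrpc: "2.0", id,
        error: { code: -32601, message: "Method not found" } });
  }
});

// Apify Standby injects the port the server must listen on
app.listen(Number(process.env.ACTOR_STANDBY_PORT ?? 4321));
```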

Lesson: for an MCP server, ship the smallest set of tools that are immediately useful. A single, well-named, well-documented tool beats five half-baked ones every time.


💰 Pricing — the part nobody talks about

Apify supports per-event billing (Pay-Per-Event, PPE). I get to charge for whatever events I want, at whatever price. After three iterations I landed on:

| Event | Price | When |
| --- | --- | --- |
| apify-actor-start | $0.00005 | Each cold start |
| apify-default-dataset-item | $0.001 | Each page crawled in batch mode |
| tool-request (MCP) | $0.05 | Each crawl_to_knowledge MCP call (covers any number of pages in that one call) |

The MCP price is intentionally flat. A tool call returns the entire knowledge file in one response, so the user pays once per crawl request regardless of page count. The page cap is enforced by their maxPagesToCrawl input — that is also their cost cap.

I tested three pricing models before this:

  1. Per-page in MCP mode too. Killed it because users could not predict their bill.
  2. Subscription ($10/mo). Killed it because Apify Store users hate subscriptions on small-utility actors.
  3. Free + tip jar. Killed it because nobody tips a JSON file.

Flat per-call MCP + per-page batch is the model that maps to user mental models: "I am asking for one knowledge file, that costs $0.05 to compute and serve".
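
In code, the flat MCP charge is one line. A minimal sketch, assuming the Apify SDK's standard pay-per-event call; the event name matches the table above:

```typescript
import { Actor } from "apify";

// Inside the tools/call handler, after the crawl has succeeded:
// one flat charge per MCP call, regardless of how many pages it covered.
await Actor.charge({ eventName: "tool-request", count: 1 });
```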


🐛 The day the daily test failed for three days in a row

Apify runs an automated test on every Apify Store actor, every day, and labels actors that fail three consecutive days as "under maintenance". I did not know this until day 4, when the warning appeared on my actor page.

The test: run the actor with the prefilled input from the schema, expect a SUCCEEDED status with a non-empty default dataset, in under five minutes.
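
You can approximate that contract yourself before the platform does. A sketch against Apify's public run-sync API; the input mirrors my current prefill, and the pass/fail logic is my approximation of the test, not the actual harness:

```typescript
// Run the actor synchronously with the prefill values and assert a
// non-empty dataset: the same shape of check the daily test applies.
const res = await fetch(
  "https://api.apify.com/v2/acts/kazkn~gpt-crawler-mcp/run-sync-get-dataset-items" +
    `?token=${process.env.APIFY_TOKEN}&timeout=300`,
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      urls: ["https://docs.apify.com/platform/actors"],
      match: "https://docs.apify.com/platform/actors/**",
      maxPagesToCrawl: 10,
    }),
  },
);
const items: unknown[] = await res.json();
if (!res.ok || items.length === 0) throw new Error(`health check failed: ${res.status}`);
console.log(`ok: ${items.length} pages`);
```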

My prefill at the time was https://www.builder.io/c/docs/developers with a match glob that covered the same path. It worked perfectly when I ran it manually. But the daily test fired at a UTC hour when builder.io's docs site occasionally returned an empty body to scraping IPs (different anti-bot behavior at different hours, presumably).

I spent half a Saturday debugging. I rebuilt the actor twice, switched runtime memory from 1 GB to 2 GB to 4 GB, added retries. None of that mattered. The crawler was fine. The target site was the problem.

Fix: I changed the prefilled URL to https://docs.apify.com/platform/actors. Apify's own docs site is, unsurprisingly, never going to anti-bot-block Apify's own test infrastructure. It is rock-solid. And it is on-brand — users testing my actor for the first time get a knowledge file built from Apify's docs, which is a useful demo in itself.

The test cleared the next day. The warning is gone.

Lesson: an Apify actor's prefilled input is a test contract with the platform. Treat it as more than a demo — it is the input that decides whether your actor gets flagged as under maintenance.


🎨 The input schema redesign — making non-engineers feel safe

The first version of the input form was engineering vomit: 11 fields, no grouping, defaults that made sense to me and confused everyone else. I shipped it to a friend who runs an AI agency. He spent four minutes staring at it before he ran the actor. Four minutes of not running the actor is fatal in a free-tier funnel.

I rewrote the schema four times. The version live today has:

  1. Numbered first-person headings. "1. URL to start crawling from". "2. Which links should I follow?". "3. How many pages? (also your cost cap)".
  2. A welcome banner at the top of the form that literally says "Just click Save & Start below — the prefilled values crawl the Apify docs as a demo. Then change the URL to your own docs site and re-run."
  3. Three collapsible sections for everything else: 💰 Scope & cost control, ⚙️ Advanced (most users can skip these), 🤖 MCP server mode (advanced — for AI agents).
  4. Cost calculator inline. Each maxPagesToCrawl value comes with an expected duration and dollar cost: 10 = ~30 s · ~$0.01. 100 = ~3 min · ~$0.10. The page cap is also the spend cap.
  5. isSecret: true on the cookie field, so users behind a login do not panic about Apify storing their session.
  6. Selector defaulted to body with a docstring that says "works for 95% of sites — only override if you want to drop nav/sidebar/footer".

The friend tried it again two weeks later. He ran the actor in 22 seconds.

"properties": {
  "urls": {
    "title": "1. URL to start crawling from",
    "type": "array",
    "description": "Paste the address of the docs site...",
    "editor": "stringList",
    "prefill": ["https://docs.apify.com/platform/actors"]
  },
  "match": {
    "title": "2. Which links should I follow?",
    "type": "string",
    "description": "Use the same prefix as your start URL with `/**` at the end...",
    "default": "https://docs.apify.com/platform/actors/**"
  },
  "maxPagesToCrawl": {
    "title": "3. How many pages? (also your cost cap)",
    "type": "integer",
    "sectionCaption": "💰 Scope & cost control",
    "description": "10 = ~30 s · ~$0.01 — quick test ✅ default..."
  }
}
Enter fullscreen mode Exit fullscreen mode

The full schema is in .actor/input_schema.json on GitHub.

Lesson: the input schema is the product. Spend more time on it than on the README.


🔍 SEO research that actually shipped — alphabet soup with denylist

I do this for every actor now. It is the cheapest, highest-leverage hour in the build log.

I run a Google Autocomplete alphabet soup against six product-specific seeds plus four French seeds, with forward, reverse, and question prefixes. That is around 700 API calls (free, no auth required), a denylist regex to filter crypto/trading/Discord noise, and an S/A/B intent tier for the surviving keywords.

The killer result for this actor: the long-tail query "claude project knowledge exceeds maximum remove some to continue" came back six times. That is a real Claude error message that real users are typing into Google. It is not a vanity keyword — it is a pain point. And my actor is the answer (consolidate 20 raw HTML pages into one Markdown file, fit under the project knowledge cap, problem solved).

I added a dedicated H2 in the README ("🚨 Claude Project knowledge exceeds maximum, remove some to continue?") plus a use-case entry plus six FAQ questions sourced from Google's "People also ask" box on the same query. The README went from 313 lines to 420 lines, all driven by the soup data, none of it filler.

```python
import re

SEEDS_EN = [
    "knowledge file chatgpt",
    "claude project knowledge",
    "chatgpt custom gpt",
    "rag from website",
    "scrape docs for ai",
    "firecrawl alternative",
]

DENY_RX = re.compile(
    r"\b(trading|bitcoin|crypto|discord|telegram|forex|nft|"
    r"chess|game|minecraft|roblox|fortnite)\b", re.I)
```
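
The same loop in TypeScript, if you want it next to the actor code. A sketch: the suggest endpoint is Google's public but unofficial autocomplete API (client=firefox makes it return plain JSON), and the prefix patterns mirror the Python above.

```typescript
// Alphabet soup: seed × prefix pattern × letter, filtered by the denylist.
const DENY_RX =
  /\b(trading|bitcoin|crypto|discord|telegram|forex|nft|chess|game|minecraft|roblox|fortnite)\b/i;

async function soup(seed: string): Promise<string[]> {
  const out = new Set<string>();
  for (const c of "abcdefghijklmnopqrstuvwxyz") {
    // forward, reverse, and question-prefix variants
    for (const q of [`${seed} ${c}`, `${c} ${seed}`, `how ${seed} ${c}`]) {
      const res = await fetch(
        `https://suggestqueries.google.com/complete/search?client=firefox&q=${encodeURIComponent(q)}`,
      );
      const [, suggestions] = (await res.json()) as [string, string[]];
      suggestions.filter((s) => !DENY_RX.test(s)).forEach((s) => out.add(s));
    }
  }
  return [...out];
}
```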

Lesson: your seed quality is the upper bound of your keyword research. Spend ten minutes picking six seeds that are very specific to your product and the soup will surface real user pain. Generic seeds surface bot noise.


📸 Screenshots — the part I underestimated

The Apify Store gallery auto-extracts images from the README. So the screenshots in the README are the gallery. Spending an evening on them is high-leverage.

I captured five viewport-only PNGs at 1270×760 (Apify's gallery spec) using screencapture -x plus a Pillow crop pass that strips the macOS menu bar, the Chrome chrome (tab bar, URL bar, bookmarks bar), and the Chrome MCP debug banner that sits below the URL bar when DevTools is attached. The cleanest crop offset turned out to be TOP=440 at retina (2× DPR).

Each screenshot has a one-sentence caption that ties to a value prop:

  • Input form → "Guided 2-step input — paste URL, set link pattern, hit Run."
  • Recent runs → "100% success rate, ~50s average wallclock."
  • Dataset table → "Each crawled page → one clean structured row."
  • Apify Store public → "Live on Apify Store. Pricing, README, FAQ — all on one page."

```python
from PIL import Image

img = Image.open("shot_raw.png")   # raw retina capture from screencapture -x
w, h = img.size
TOP = 440           # strip menu bar + tab bar + URL bar + bookmarks + MCP debug
BOTTOM_TRIM = 200   # strip dock
img.crop((0, TOP, w, h - BOTTOM_TRIM)).save("shot_clean.png")
```

The screenshots are hosted on Imgur (anonymous upload via Client-ID, free, no expiry). The README embeds them with ![alt](https://i.imgur.com/<id>.png). Apify's Store renderer handles the rest.
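
The upload is one authenticated POST. A sketch with Node 18+ fetch; you register an Imgur application once to get the Client-ID, and the field names follow Imgur's v3 image endpoint:

```typescript
import { readFile } from "node:fs/promises";

// Anonymous Imgur upload: no OAuth, just the registered app's Client-ID.
async function uploadToImgur(path: string, clientId: string): Promise<string> {
  const image = (await readFile(path)).toString("base64");
  const res = await fetch("https://api.imgur.com/3/image", {
    method: "POST",
    headers: { Authorization: `Client-ID ${clientId}`, "Content-Type": "application/json" },
    body: JSON.stringify({ image, type: "base64" }),
  });
  const json = await res.json();
  if (!json.success) throw new Error(`Imgur upload failed: ${res.status}`);
  return json.data.link; // e.g. https://i.imgur.com/<id>.png
}
```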


📈 What is live as of today

  • 10 builds shipped, 0.1.1 → 0.1.10.
  • 7 successful runs on record, 100% success rate, average 50 seconds wallclock.
  • README ≈ 420 lines, with one screenshot section, one Claude-pain-point H2, two FAQ blocks (mine + PAA-derived), one comparison table vs Firecrawl and vs running BuilderIO locally.
  • Apify Store page live: apify.com/kazkn/gpt-crawler-mcp.
  • GitHub repo open source: github.com/DataKazKN/gpt-crawler-mcp, ISC.
  • MCP server endpoint live: https://kazkn--gpt-crawler-mcp.apify.actor/mcp?token=YOUR_APIFY_TOKEN — drop into Claude Desktop, Cursor, Windsurf, or any MCP client.
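
Wiring that endpoint into a client is a few lines of config. A Cursor-style example (the field names follow Cursor's mcp.json for remote servers; other clients differ slightly, so check your client's docs):

```json
{
  "mcpServers": {
    "gpt-crawler": {
      "url": "https://kazkn--gpt-crawler-mcp.apify.actor/mcp?token=YOUR_APIFY_TOKEN"
    }
  }
}
```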

🎯 What I would do differently

  1. Ship the MCP mode in v0.1.0. I shipped batch-only first and added MCP later. The MCP mode is what gets the actor mentioned in AI-agent posts. Ship it day one.
  2. Pick the prefilled URL on day one too. I burned three days on the daily-test warning before I switched away from a third-party domain. The prefill is part of the contract with the Apify Store. Pin it to a domain you trust on day one.
  3. Write the input schema before writing the crawler. The schema is the product. The crawler is just the implementation. I wrote the crawler first and the schema last and rewrote it four times. Backwards.

🧠 What is next

I am working on a v0.2.0 with three additions:

  • Sitemap-first mode. When the start URL has a sitemap.xml, prefer it over link-following. Faster and more deterministic.
  • Token-aware chunking. Output in chunks of N tokens to skip the post-crawl chunking step entirely.
  • A second MCP tool: summarize_knowledge_file — runs an LLM pass on the raw crawl output, returns a concentrated 2,000-token brief. For users who want the model's takeaway, not the raw pages.

If any of those would change your workflow, ping me on the GitHub repo or open an issue on the Apify actor page. I read every one.


🔗 Links

  • Apify Store page: apify.com/kazkn/gpt-crawler-mcp
  • GitHub repo (fork it): github.com/DataKazKN/gpt-crawler-mcp
  • Upstream crawler: github.com/BuilderIO/gpt-crawler

If you build on top of any of this, I would love to read your version. The whole point of releasing the wrapper is that someone else can fork it and ship a Notion-to-GPT or YouTube-to-GPT spinoff in a weekend. Go.

— Yorick (KazKN), Switzerland
