Introducing GPT Crawler - Turn Any Site Into a Custom GPT With Just a URL

✒️ Written by Steve Sewell

Let's create a custom GPT in just two minutes using a new open-source project called GPT Crawler. This project allows us to provide a site URL, which it will crawl and use as the knowledge base for the GPT.

You can either share this GPT or integrate it as a custom assistant into your sites and apps.

Why create a custom GPT from a site

I created my first custom GPT based on the Builder.io docs site, forum, and example projects on GitHub, and it can now answer detailed questions, with code snippets, about integrating Builder.io into your site or app. You can try it here (currently requires a paid ChatGPT plan).

Our hope is that by making our docs site interactive, people can more easily find the answers they are looking for using a chat interface.

And this helps not just with discoverability, saving people the time of digging through to find the specific docs they need, but also with personalization, so even the most esoteric questions can be answered.

This method can be applied to virtually anything to create custom bots with up-to-date information from any resource on the web.

Get started with GPT Crawler

First, we'll use this new GPT crawler project that I've just open-sourced.

Clone the repo

To get started, all we need to do is clone the repository, which we can do via a simple git clone command.

git clone https://github.com/builderio/gpt-crawler

Install dependencies

After cloning, I'll cd into the repository and then install the dependencies with npm install.

cd gpt-crawler
npm install

Configure the crawler

Next, we open the config.ts file in the code and supply our configuration. Here, we can provide a base URL where the crawl will start, and the crawler will follow links from there to subsequent pages. We can also provide a matching pattern. For instance, I might want to crawl only docs and nothing else.

export const config: Config = {
  // Start the crawl at this URL
  url: "https://www.builder.io/c/docs/developers",
  // Only crawl URLs matching this pattern
  match: "https://www.builder.io/c/docs/**",
  // Only grab the text from within this selector
  selector: `.docs-builder-container`,
  // Don't crawl more than 1000 pages
  maxPagesToCrawl: 1000,
  // The file name that our results will output to
  outputFileName: "output.json",
};

I recommend providing a selector as well. For the Builder docs, for example, I set it to scrape only a specific area and not the sidebar, navigation, or other elements.

Run the crawler

Now, we can run npm start in our terminal and watch in real time as the crawler processes our pages.

npm start

This crawler uses a headless browser, so it can capture any markup, even content that is rendered purely client-side. You can also customize the crawler to log into a site and crawl non-public information.
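
For the logged-in case, the project's config supports an optional cookie entry that is sent with each request (check the repo's config.ts and README for the exact shape). Here's a minimal sketch, assuming the site keeps its session in a cookie named session_token; both the cookie name and value are placeholders for your site's real session cookie.

export const config: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 1000,
  outputFileName: "output.json",
  // Sent with every request so the headless browser is treated as logged in.
  // "session_token" and its value are placeholders for your real session cookie.
  cookie: {
    name: "session_token",
    value: "paste-your-session-cookie-value-here",
  },
};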

Upload your knowledge file

After the crawl is complete, we'll have a new output.json file, which includes the title, URL, and extracted text from all the crawled pages.

[
  {
    "title": "Creating a Private Model - Builder.io",
    "url": "https://www.builder.io/c/docs/private-models",
    "html": "..."
  },
  {
    "title": "Integrating Sections - Builder.io",
    "url": "https://www.builder.io/c/docs/integrate-section-building",
    "html": "..."
  },
  ...
]

Create a custom GPT (UI access)

We can now upload this directly to ChatGPT by creating a new GPT, configuring it, and then uploading the file we just generated as its knowledge file. Once uploaded, this GPT assistant will have all the information from those docs and be able to answer unlimited questions about them.

Create a custom assistant (API access)

Alternatively, if you want to integrate this into your own products, you can go to the OpenAI API dashboard, create a new assistant, and upload the generated file in a similar manner.

This lets you access the assistant over an API, providing custom-tailored assistance within your products, with specific knowledge about your product drawn from your docs or any other website, simply by providing a URL and crawling it.
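
For reference, here is a minimal sketch of that flow using the OpenAI Node SDK, as the Assistants API looked while in beta around the time of writing; the assistant name, instructions, and model are placeholders, and the exact fields (the retrieval tool, file_ids) may have changed since.

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function createDocsAssistant() {
  // Upload the crawler output as a knowledge file
  const file = await openai.files.create({
    file: fs.createReadStream("output.json"),
    purpose: "assistants",
  });

  // Create an assistant that can pull answers from that file
  const assistant = await openai.beta.assistants.create({
    name: "Docs Assistant", // placeholder name
    instructions: "Answer questions using the attached docs crawl.",
    model: "gpt-4-1106-preview",
    tools: [{ type: "retrieval" }],
    file_ids: [file.id],
  });

  console.log("Assistant ready:", assistant.id);
}

createDocsAssistant().catch(console.error);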

Conclusion

If you have a use case where you or others would value a custom GPT focused on a given topic or set of information that can be scanned via a website, give this a try. I can’t wait to see what you build!

And if you see ways to make this project better, send a PR!


Read the full post on the Builder.io blog

Top comments (3)

R-Lek

This is really neat! I'm positive there's a huge amount of online docs that could really benefit from this tool.

Question though, I was attempting to run the crawler on this site: help.anva.nl
I configured it like this:

export const defaultConfig: Config = {
  url: "https://help.anva.nl/topic",
  match: "https://help.anva.nl/topic/**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  maxTokens: 2000000,
};

figuring it would extract all possible pages after /topic/, such as:
help.anva.nl/topic/6910/ADN-dataco...
help.anva.nl/topic/6910/ADN-dataco...
help.anva.nl/topic/49476/Afdruk/
(menu)
etc.

However, the output I get is:
INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: Crawling: Page 1 / 50 - URL: https://help.anva.nl/topic/57578/Welkom_in_de_ANVA_Help...
INFO PlaywrightCrawler: Crawling: Page 2 / 50 - URL: https://help.anva.nl/topic/57578/Welkom_in_de_ANVA_Help#...
INFO PlaywrightCrawler: Crawling: Page 3 / 50 - URL: https://help.anva.nl/topic/57578/53212.htm...
INFO PlaywrightCrawler: Crawling: Page 4 / 50 - URL: https://help.anva.nl/topic/57578/indexpage.htm...
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Final request statistics:

I've tried some variations but I can't seem to extract anything more, could you tell me what I'm doing wrong?

Balogh Botond

Awesome idea! Cannot wait to try it but the Holidays are coming 😀

I do a similar thing locally on a project codebase, but I often run into the token limit... I do like the idea of providing the title and URLs like a key-value pair so that not all the context is loaded at once. But I'm not sure about the html part in output.json; doesn't it lead to exceeding the token limit?

Oleg Proskurin

That is great! Will play with it on Holidays 🎄

🤔 I'm thinking: would it work if I pointed it to a GitHub repo? In my experience, GPT-4 gives much better results with big projects than Copilot in VS Code.