Brian Douglas


Build your own ChatGPT starter kit

ChatGPT is an excellent general-purpose example of how AI can answer casual questions, but it falls short when a question requires domain-specific knowledge. With this ChatGPT starter kit, you can point the model at websites you define and have it answer from their content.

Header image generated using Midjourney.

gannonh / chatgpt-pgvector

ChatGPT (gpt3.5-turbo) starter app

What is gannonh/gpt3.5-turbo-pgvector?

This starter app was put together by @gannonh and makes great use of Supabase's pgvector support and OpenAI's embeddings API. The app uses Next.js to stand up a simple prompt interface.

Live demo: https://astro-labs.app/docs


How does it work?

This starter app uses embeddings to generate a vector representation of each document, then uses vector search to find the documents most similar to the query. The search results are used to construct a prompt for the model (gpt-3.5-turbo), and the generated response is streamed to the user.
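
As a rough illustration of the query side described above, here is a minimal TypeScript sketch that ranks stored documents by cosine similarity to a query embedding and assembles a prompt from the best matches. It is not the starter app's code: the `StoredDocument` type, `topMatches`, `buildPrompt`, the example.com URLs, and the toy three-dimensional vectors are all assumptions made for illustration. In the real app the similarity search runs inside Postgres via pgvector, and the embeddings come from OpenAI's API.

```typescript
// Hypothetical shape of a stored row; names are illustrative only.
interface StoredDocument {
  url: string;
  content: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Higher means more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embeddings must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Return the k documents most similar to the query embedding.
function topMatches(
  queryEmbedding: number[],
  documents: StoredDocument[],
  k: number
): StoredDocument[] {
  return documents
    .map((doc) => ({ doc, score: cosineSimilarity(queryEmbedding, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(({ doc }) => doc);
}

// Build a prompt that asks the model to answer from the retrieved context.
function buildPrompt(question: string, matches: StoredDocument[]): string {
  const context = matches
    .map((doc) => `Source: ${doc.url}\n${doc.content}`)
    .join("\n---\n");
  return `Answer the question using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`;
}

// Toy data: 3-dimensional vectors stand in for real embeddings.
const storedDocuments: StoredDocument[] = [
  { url: "https://example.com/install", content: "Run the installer.", embedding: [1, 0, 0] },
  { url: "https://example.com/auth", content: "Use an API key.", embedding: [0, 1, 0] },
  { url: "https://example.com/billing", content: "Invoices are monthly.", embedding: [0, 0, 1] },
];

const bestMatches = topMatches([0.1, 0.9, 0.2], storedDocuments, 2);
const finalPrompt = buildPrompt("How do I authenticate?", bestMatches);
console.log(finalPrompt);
```

With this toy data, the query vector sits closest to the "auth" document, so that chunk leads the context and the unrelated "install" chunk is left out of the prompt entirely.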

Web pages are scraped, stripped to plain text, and split into 1000-character documents.

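// Strip text from HTML
// pages/api/generate-embeddings.ts

import * as cheerio from "cheerio";

// Size of each document chunk, in characters
const docSize = 1000;

async function getDocuments(urls: string[]) {
  const documents = [];
  for (const url of urls) {
    const response = await fetch(url);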
    const html = await response.text();
    const $ = cheerio.load(html);
    // tag based e.g. <main>
    const articleText = $("body").text();
    // class based e.g. <div class="docs-content">
    // const articleText = $(".docs-content").text();

    let start = 0;
    while (start < articleText.length) {
      const end = start + docSize;
      const chunk = articleText.slice(start, end);
      documents.push({ url, body: chunk });
      start = end;
    }
  }
  return documents;
}
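
To make the chunking step concrete, here is a small, self-contained TypeScript sketch of the same slicing loop pulled out into a pure function. The `chunkText` name and the sample input are assumptions for illustration; the starter app does this inline with a 1000-character chunk size.

```typescript
// Split text into consecutive chunks of at most `size` characters.
function chunkText(text: string, size: number): string[] {
  if (size <= 0) {
    throw new Error("size must be positive");
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}

// A 2500-character page yields chunks of 1000, 1000, and 500 characters.
const page = "a".repeat(2500);
const chunkLengths = chunkText(page, 1000).map((chunk) => chunk.length);
console.log(chunkLengths);
```

Because the split is purely by character count, a chunk can end mid-sentence or mid-word. Splitting on paragraph or sentence boundaries is a common refinement if answer quality suffers.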

Once the pages are stripped down to plain text, an embedding is created for each chunk with the text-embedding-ada-002 model, and the chunk and its embedding are stored in Supabase.

The OpenAI docs recommend using text-embedding-ada-002 for nearly all use cases. Fun fact: this is the same embedding model Notion's AI tool uses under the hood. It's better, cheaper, and simpler to use than OpenAI's earlier embedding models.


// Create embeddings from URLs 
// pages/api/generate-embeddings.ts

const documents = await getDocuments(urls);

for (const { url, body } of documents) {
  // Replace newlines with spaces before embedding
  const input = body.replace(/\n/g, " ");

  console.log("\nDocument length: \n", body.length);
  console.log("\nURL: \n", url);

  const embeddingResponse = await openAi.createEmbedding({
    model: "text-embedding-ada-002",
    input
  });

  console.log("\nembeddingResponse: \n", embeddingResponse);

  // One input was sent, so take the first (and only) embedding returned
  const [{ embedding }] = embeddingResponse.data.data;

  // In production we should handle possible errors
  await supabaseClient.from("documents").insert({
    content: input,
    embedding,
    url
  });
}
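
The comment about handling errors can be sketched as follows. This is a minimal, hedged example rather than the starter app's code: `ingestDocuments`, `embed`, and `store` are hypothetical names, and the last two are stand-ins for the OpenAI and Supabase calls, injected as plain functions so the loop can run without network access. It records which documents failed instead of aborting the whole run.

```typescript
interface PageDocument {
  url: string;
  body: string;
}

interface DocumentRow {
  content: string;
  embedding: number[];
  url: string;
}

interface IngestResult {
  stored: number;
  failures: { url: string; reason: string }[];
}

// `embed` and `store` are hypothetical stand-ins for the real API calls.
async function ingestDocuments(
  documents: PageDocument[],
  embed: (input: string) => Promise<number[]>,
  store: (row: DocumentRow) => Promise<void>
): Promise<IngestResult> {
  const result: IngestResult = { stored: 0, failures: [] };
  for (const { url, body } of documents) {
    const input = body.replace(/\n/g, " ");
    try {
      const embedding = await embed(input);
      await store({ content: input, embedding, url });
      result.stored += 1;
    } catch (error) {
      // Record the failure and continue with the remaining documents.
      const reason = error instanceof Error ? error.message : String(error);
      result.failures.push({ url, reason });
    }
  }
  return result;
}

// Fakes for demonstration: the embedder rejects empty input.
const fakeEmbed = async (input: string): Promise<number[]> => {
  if (input.trim() === "") throw new Error("empty input");
  return [input.length];
};
const savedRows: DocumentRow[] = [];
const fakeStore = async (row: DocumentRow): Promise<void> => {
  savedRows.push(row);
};

const ingestion = ingestDocuments(
  [
    { url: "https://example.com/a", body: "hello\nworld" },
    { url: "https://example.com/b", body: "\n" },
  ],
  fakeEmbed,
  fakeStore
);
ingestion.then((result) => console.log(result));
```

One thing to keep in mind when wiring in the real clients: the Supabase JavaScript client generally reports a failed `insert` through an `error` field on its return value rather than by throwing, so a real `store` wrapper would need to check that field and throw for the `catch` branch to see it.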

gpt3.5-turbo-pgvector is an excellent starter for folks looking to try out OpenAI on their own data or sites. I see this being extremely useful for documentation, and I now understand why OpenAI doesn't have search in their docs (this is a joke, they should add search). Search in docs could be replaced by projects setting up their own embeddings.

Share in the comments if you have a use case for this.

Also, if you have a project leveraging OpenAI or similar, leave a link in the comments. I'd love to take a look and include it in my 9 days of OpenAI series.

Find more AI projects using OpenSauced

Stay saucy.

Top comments (4)

emil marian

Very interesting. What I am curious about is how long you have been studying this. It is a good article, but for me to reproduce your result and understand what I am doing (I have a background in CS) may take more than a day, and preparing a good article with instructions like this one would take me even longer.

itsbrex

Great write-up. I've been playing with embeddings for the last couple of weeks and really like them. Vector search is so powerful. Exciting times we're in. 🤙

Lance Anderson

This is very cool. Thank you for posting. I'm going to play with it...

Arslan Mumtaz

Great work, thanks for posting.