ChatGPT is an excellent general-purpose tool for answering casual questions, but it falls short when the questions require domain-specific knowledge. Thanks to this ChatGPT starter kit, you can ground the model's answers in websites you define.
Header image generated using Midjourney.
gannonh / chatgpt-pgvector
ChatGPT (gpt-3.5-turbo) starter app
What is gannonh/gpt3.5-turbo-pgvector?
This starter app was put together by @gannonh and makes great use of Supabase's pgvector extension and OpenAI embeddings. The app leverages Next.js to stand up a simple prompt interface.
Live demo: https://astro-labs.app/docs
How does it work?
This starter app uses embeddings to generate a vector representation of each document, then uses vector search to find the documents most similar to the user's query. The results of the vector search are used to construct a prompt for GPT-3.5, and the generated response is streamed back to the user (a sketch of this query-time flow appears after the code snippets below).
Web pages are scraped, stripped to plain text, and split into 1000-character documents.
// Strip text from HTML
// pages/api/generate-embeddings.ts
import * as cheerio from "cheerio";

const docSize = 1000; // 1000-character chunks, as described above

async function getDocuments(urls: string[]) {
  const documents: { url: string; body: string }[] = [];
  for (const url of urls) {
    const response = await fetch(url);
    const html = await response.text();
    const $ = cheerio.load(html);
    // tag based e.g. <main>
    const articleText = $("body").text();
    // class based e.g. <div class="docs-content">
    // const articleText = $(".docs-content").text();

    // Split the page text into fixed-size chunks
    let start = 0;
    while (start < articleText.length) {
      const end = start + docSize;
      const chunk = articleText.slice(start, end);
      documents.push({ url, body: chunk });
      start = end;
    }
  }
  return documents;
}
Once the URLs have been stripped down to plain text, embeddings are created for each chunk with the text-embedding-ada-002 model and stored in Supabase.
The OpenAI docs recommend text-embedding-ada-002 for nearly all use cases. Fun fact: this is the same embedding model Notion's AI tool uses under the hood. It's better, cheaper, and simpler to use than OpenAI's older embedding models.
text-embedding-ada-002 announcement
// Create embeddings from URLs
// pages/api/generate-embeddings.ts
// openAi (OpenAIApi) and supabaseClient are initialized elsewhere in the file
const documents = await getDocuments(urls);

for (const { url, body } of documents) {
  // Replace newlines with spaces before creating the embedding
  const input = body.replace(/\n/g, " ");
  console.log("\nDocument length: \n", body.length);
  console.log("\nURL: \n", url);

  const embeddingResponse = await openAi.createEmbedding({
    model: "text-embedding-ada-002",
    input
  });
  console.log("\nembeddingResponse: \n", embeddingResponse);
  const [{ embedding }] = embeddingResponse.data.data;

  // In production we should handle possible errors
  await supabaseClient.from("documents").insert({
    content: input,
    embedding,
    url
  });
}
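The remaining piece is the query-time flow described earlier: embed the user's question, find the closest document chunks with pgvector, and feed them to the chat model. The sketch below is a minimal illustration of that flow rather than the starter's exact code; it assumes a Postgres function named match_documents has been created for the similarity search (its name and parameters are assumptions here), and that openAi and supabaseClient are the same clients used above.

// Hypothetical sketch: answer a question using the stored embeddings
// Assumes a Postgres function `match_documents` exists for pgvector similarity search
async function answerQuestion(question: string) {
  // 1. Embed the question with the same model used for the documents
  const embeddingResponse = await openAi.createEmbedding({
    model: "text-embedding-ada-002",
    input: question.replace(/\n/g, " ")
  });
  const [{ embedding: queryEmbedding }] = embeddingResponse.data.data;

  // 2. Vector search: let pgvector return the most similar document chunks
  const { data: documents, error } = await supabaseClient.rpc("match_documents", {
    query_embedding: queryEmbedding,
    similarity_threshold: 0.8, // illustrative values
    match_count: 10
  });
  if (error) throw error;

  // 3. Build a prompt that pairs the retrieved context with the question
  const context = (documents ?? [])
    .map((doc: { content: string }) => doc.content)
    .join("\n---\n");
  const prompt = `Answer the question using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}\nAnswer:`;

  // 4. Generate the answer (the starter streams this back to the browser)
  const completion = await openAi.createChatCompletion({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: prompt }],
    temperature: 0
  });
  return completion.data.choices[0].message?.content;
}

The starter streams the response token by token rather than waiting for the full completion; the non-streaming call above just keeps the sketch short.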
gpt3.5-turbo-pgvector is an excellent starter for folks looking to try OpenAI on their own data or sites. I see this being extremely useful for documentation, and I now understand why OpenAI doesn't need search in their docs (this is a joke; they should add search). Built-in docs search could be replaced by projects setting up their own embeddings.
Share in the comments if you have a use case for this.
Also, if you have a project leveraging OpenAI or similar, leave a link in the comments. I'd love to take a look and include it in my 9 days of OpenAI series.
Find more AI projects using OpenSauced
Stay saucy.