A no-frills guide to building a search engine for a small collection of text documents in 10 minutes.
Overview
I have 130 text documents that I want to be able to search. Here are the steps I will take:
- Compute some numbers for each document
- Compute some numbers for the search query
- See which document's numbers are closest to the query
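To make the last step concrete, here is a toy sketch with made-up three-number vectors standing in for real embeddings (Ada-002 vectors have 1,536 numbers each). The filenames, the values, and the `cosine_distance` helper are all invented for illustration; the real code further down uses scipy's version instead.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: smaller means the vectors point in more similar directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical "numbers" for three documents.
documents = {
    "cooking.txt": [0.9, 0.1, 0.0],
    "sailing.txt": [0.1, 0.9, 0.1],
    "taxes.txt": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]  # hypothetical numbers for the search query

# The best match is the document with the smallest distance to the query.
best = min(documents, key=lambda name: cosine_distance(documents[name], query))
print(best)  # cooking.txt
```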
Embeddings
The numbers are called "embeddings". I compute them via OpenAI's embedding service, using the Ada-002 model. Here's how:
import openai
openai.api_key = "your key here"
def embed(s):
    response = openai.Embedding.create(
        input=s,
        model="text-embedding-ada-002"
    )
    return response['data'][0]['embedding']
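This code was pasted verbatim from the OpenAI documentation. Note that openai.Embedding.create belongs to versions of the openai library before 1.0; newer versions use a different interface.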
To do it for every document, let's have GitHub Copilot write it for us. I type the signature and a docstring:
def save_embeddings(folder):
    """Iterates over the txt files in the folder, embeds them, and saves the result"""
Copilot gives us the completion:
def save_embeddings(folder):
    """Iterates over the txt files in the folder, embeds them, and saves the result"""
    from os import listdir
    import pickle
    import time
    embeddings = {}
    for filename in listdir(folder):
        time.sleep(.1) # I added this, otherwise the OpenAI API will block requests
        if filename.endswith(".txt"):
            with open(folder + "/" + filename, 'r') as f:
                text = f.read()
            embedding = embed(text)
            embeddings[filename] = embedding
    with open('embeddings.pickle', 'wb') as f:
        pickle.dump(embeddings, f)
Search
We need to do three things: get the embeddings we saved, embed our search text, and find the closest document. Let's see if Copilot can do it.
def load_embeddings
Copilot gives us:
def load_embeddings():
    import pickle
    with open('embeddings.pickle', 'rb') as f:
        embeddings = pickle.load(f)
    return embeddings
Next, let's find the match.
def closest_to
Copilot completes with:
def closest_to(embeddings, embedding):
    import numpy as np
    from scipy.spatial.distance import cosine
    # cosine() is a distance (1 - cosine similarity), so the smallest value is the best match
    return min(embeddings.keys(), key=lambda k: cosine(embeddings[k], embedding))
Finally we need to be able to search with whatever text we want.
def closest_to_text(text):
Copilot once again fills in the blanks:
def closest_to_text(text):
    embeddings = load_embeddings()
    embedding = embed(text)
    return closest_to(embeddings, embedding)
Conclusion
That's all there is to it. First I run save_embeddings on the folder that contains the text documents. Then I call closest_to_text with the search query, and it returns the filename of the closest document.
Try it out on https://www.splitbound.com/search/
Notes
Copilot brought in the numpy and scipy libraries and I used the openai library, all of which need to be installed with pip.
It's also possible to find the worst match; the search on the website above offers this as an option. Alternatively, you can modify the code to return the top 5 matches.
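A minimal sketch of the top-5 variant, assuming the same dictionary of filename to embedding that save_embeddings pickles. `top_matches` is a hypothetical name, and the cosine distance is written out in plain Python so the snippet runs without scipy.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 - cosine similarity, the same quantity scipy.spatial.distance.cosine computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_matches(embeddings, embedding, k=5):
    """Return up to k filenames ordered from closest to farthest."""
    return heapq.nsmallest(
        k,
        embeddings,  # iterating a dict yields its keys (the filenames)
        key=lambda name: cosine_distance(embeddings[name], embedding),
    )

# Toy two-number "embeddings", invented for the example.
toy = {
    "a.txt": [1.0, 0.0],
    "b.txt": [0.8, 0.6],
    "c.txt": [0.0, 1.0],
    "d.txt": [-1.0, 0.0],
}
print(top_matches(toy, [1.0, 0.0], k=2))  # ['a.txt', 'b.txt']
```

If `k` is larger than the number of documents, `heapq.nsmallest` simply returns all of them in order.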
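Copilot used the cosine distance, which isn't strictly necessary: the OpenAI embeddings are all normalized (vectors with magnitude 1), so a plain dot product gives the same ranking. Note that with a dot product the best match is the largest value, not the smallest.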
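Here is a rough sketch of the dot-product version in plain Python. It is only valid under the assumption that every vector has magnitude 1; `closest_by_dot` and `normalize` are hypothetical helpers, and the toy vectors are normalized by hand to mimic what the API returns.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to magnitude 1 (only needed here to build the toy data)."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def closest_by_dot(embeddings, embedding):
    """For unit vectors, cosine distance is 1 - dot, so the largest dot product wins."""
    return max(embeddings, key=lambda name: dot(embeddings[name], embedding))

# Invented unit vectors standing in for real embeddings.
toy = {
    "x.txt": [1.0, 0.0],
    "y.txt": [0.0, 1.0],
    "mostly_y.txt": normalize([3.0, 4.0]),
}
query = normalize([1.0, 2.0])
print(closest_by_dot(toy, query))  # mostly_y.txt
```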
It's faster to keep the embeddings in memory rather than loading them from disk every time closest_to_text is called.
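One way to do that, sketched here with the standard library's `functools.lru_cache`: the pickle is read on the first call and the same dictionary is handed back afterwards. The `path` parameter is an addition for illustration; the original load_embeddings takes no arguments.

```python
import pickle
from functools import lru_cache

@lru_cache(maxsize=None)
def load_embeddings(path="embeddings.pickle"):
    """Read the pickle once; later calls with the same path return the cached dict."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Because the cached dictionary is shared between calls, it should be treated as read-only. After re-running save_embeddings, calling `load_embeddings.cache_clear()` forces a fresh read.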
The web version additionally requires wrapping the search in a little Flask app that returns the matching file's text contents as the response. The frontend is pretty short:
<style>
  #essay p {
    margin: 20px 0;
  }
</style>
<script>
  function search(inverse) {
    const essay = document.getElementById("essay");
    essay.textContent = "";
    const query = document.getElementById("query").value;
    console.log("Searching", query);
    fetch("your url here", {
      method: "POST",
      mode: "cors",
      cache: "no-cache",
      credentials: "same-origin", // include, *same-origin, omit
      headers: {
        "Content-Type": "application/json",
      },
      redirect: "follow",
      referrerPolicy: "no-referrer",
      body: JSON.stringify({text: query, inverse: inverse}),
    }).then(r => r.text()).then(d => {
      d.split('\n').filter(s => !!s).forEach((s,i) => {
        const c = document.createElement(i===0 ? "h2" : "p");
        const node = document.createTextNode(s);
        c.appendChild(node);
        essay.appendChild(c);
      });
    });
  }
</script>
<textarea id="query"></textarea>
<div style="display:flex;flex-direction:row; margin-top:30px;">
  <button onclick="search(false);" class="button">Most Similar</button>
  <button onclick="search(true);" class="button">Least Similar</button>
</div>
<div id="essay"></div>
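The server side is not shown in this post, so the following is only a guess at what the logic behind that URL might reduce to. It assumes the JSON body the frontend sends (`text` and `inverse`) and a plain-text response whose first line ends up as the heading. `handle_search` and its `embed` and `read_text` parameters are hypothetical names, and the Flask wiring itself is left out so the snippet runs with the standard library alone.

```python
import json
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def handle_search(body, embeddings, embed, read_text):
    """Turn the frontend's JSON body into the plain-text response it expects.

    body       -- raw JSON string such as '{"text": "boats", "inverse": false}'
    embeddings -- dict of filename -> embedding, as loaded from the pickle
    embed      -- function mapping a string to its embedding
    read_text  -- function mapping a filename to that file's contents
    """
    payload = json.loads(body)
    query_embedding = embed(payload["text"])
    # "Least Similar" sends inverse=true, so pick the largest distance instead.
    pick = max if payload.get("inverse") else min
    filename = pick(
        embeddings,
        key=lambda name: cosine_distance(embeddings[name], query_embedding),
    )
    return read_text(filename)
```

A Flask route would pass the request body to this function and return the resulting string. Passing `embed` and `read_text` in as arguments also makes it possible to try the logic with stand-ins, without calling the OpenAI API.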