<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: George Hoqqanen</title>
    <description>The latest articles on DEV Community by George Hoqqanen (@hoqqanen).</description>
    <link>https://dev.to/hoqqanen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1064672%2F0cb82751-4b14-4e34-8732-c9eca8899689.png</url>
      <title>DEV Community: George Hoqqanen</title>
      <link>https://dev.to/hoqqanen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hoqqanen"/>
    <language>en</language>
    <item>
      <title>10 minute search engine</title>
      <dc:creator>George Hoqqanen</dc:creator>
      <pubDate>Thu, 13 Apr 2023 20:54:39 +0000</pubDate>
      <link>https://dev.to/hoqqanen/10-minute-search-engine-5cjp</link>
      <guid>https://dev.to/hoqqanen/10-minute-search-engine-5cjp</guid>
      <description>&lt;p&gt;The no-frills guide for how to make a search engine for a small collection of text documents in 10 minutes.&lt;/p&gt;

&lt;h2&gt;Overview&lt;/h2&gt;

&lt;p&gt;I have 130 text documents of writing which I want to be able to search. Here are the steps I will take:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compute some numbers for each document&lt;/li&gt;
&lt;li&gt;Compute some numbers for the search query&lt;/li&gt;
&lt;li&gt;See which document's numbers are closest to the query&lt;/li&gt;
&lt;/ol&gt;
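&lt;p&gt;The three steps can be sketched end to end. The toy embed below is a stand-in (a normalized character-frequency vector, not a real embedding) just to show the shape of the pipeline:&lt;/p&gt;

```python
import math

def embed(s):
    # Toy stand-in for a real embedding model: a normalized
    # character-frequency vector. Any text-to-vector function works here.
    v = [s.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

docs = {"cats.txt": "a cat sat", "dogs.txt": "the dog ran"}

doc_vecs = {name: embed(text) for name, text in docs.items()}  # step 1
query_vec = embed("a cat")                                     # step 2
best = max(doc_vecs,                                           # step 3
           key=lambda k: sum(a * b for a, b in zip(doc_vecs[k], query_vec)))
print(best)  # cats.txt
```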

&lt;h2&gt;Embeddings&lt;/h2&gt;

&lt;p&gt;The numbers are called "embeddings": vectors that place similar texts near each other. I compute them via OpenAI's embedding API, using the text-embedding-ada-002 model. Here's how:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import openai
openai.api_key = "your key here"

def embed(s):
  response = openai.Embedding.create(
      input=s,
      model="text-embedding-ada-002"
  )
  return response['data'][0]['embedding']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code was pasted verbatim from the OpenAI documentation.&lt;/p&gt;

&lt;p&gt;To do this for every document, let's have GitHub Copilot write it for us:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def save_embeddings(folder):
  """Iterates over the txt files in the folder, embeds them, and saves the result"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copilot gives us the completion&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def save_embeddings(folder):
  """Iterates over the txt files in the folder, embeds them, and saves the result"""
  from os import listdir
  import pickle
  import time
  embeddings = {}
  for filename in listdir(folder):
    time.sleep(.1) # I added this, otherwise the OpenAI API will block requests
    if filename.endswith(".txt"):
      with open(folder + "/" + filename, 'r') as f:
        text = f.read()
        embedding = embed(text)
        embeddings[filename] = embedding
  with open('embeddings.pickle', 'wb') as f:
    pickle.dump(embeddings, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Search&lt;/h2&gt;

&lt;p&gt;We need to do three things: get the embeddings we saved, embed our search text, and find the closest document. Let's see if Copilot can do it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;def load_embeddings&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Copilot gives us&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def load_embeddings():
  import pickle
  with open('embeddings.pickle', 'rb') as f:
    embeddings = pickle.load(f)
  return embeddings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's find the match.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;def closest_to&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Copilot completes with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def closest_to(embeddings, embedding):
  import numpy as np
  from scipy.spatial.distance import cosine
  return min(embeddings.keys(), key=lambda k: cosine(embeddings[k], embedding))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally we need to be able to search with whatever text we want.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;def closest_to_text(text):&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Copilot once again fills in the blanks&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def closest_to_text(text):
  embeddings = load_embeddings()
  embedding = embed(text)
  return closest_to(embeddings, embedding)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;That's all there is to it. First I run &lt;code&gt;save_embeddings&lt;/code&gt; on the folder containing my text documents. Then I call &lt;code&gt;closest_to_text&lt;/code&gt; with the search query.&lt;/p&gt;

&lt;p&gt;Try it out on &lt;a href="https://www.splitbound.com/search/"&gt;https://www.splitbound.com/search/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Notes&lt;/h2&gt;

&lt;p&gt;Copilot brought in the numpy and scipy libraries, and I used the openai library; all of them need to be installed with pip (&lt;code&gt;pip install openai numpy scipy&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;It's also possible to find the worst match by taking the maximum distance instead of the minimum; the search on the website above exposes this as "Least Similar". Alternatively, you can modify the code to return the top 5 matches.&lt;/p&gt;
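&lt;p&gt;A top-n variant is a small change to &lt;code&gt;closest_to&lt;/code&gt;. This sketch (the name &lt;code&gt;top_matches&lt;/code&gt; is mine, not from the post) sorts by distance and slices:&lt;/p&gt;

```python
from scipy.spatial.distance import cosine

def top_matches(embeddings, embedding, n=5):
    # Rank every document by cosine distance to the query, keep the n closest.
    ranked = sorted(embeddings.keys(), key=lambda k: cosine(embeddings[k], embedding))
    return ranked[:n]
```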

&lt;p&gt;Copilot used the cosine distance, which isn't strictly necessary: OpenAI's embeddings are normalized (vectors with magnitude 1), so a plain dot product ranks documents identically.&lt;/p&gt;
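&lt;p&gt;For unit-length vectors, cosine distance equals 1 minus the dot product, so minimizing one is the same as maximizing the other. A dot-product version of &lt;code&gt;closest_to&lt;/code&gt; might look like:&lt;/p&gt;

```python
import numpy as np

def closest_to_dot(embeddings, embedding):
    # With normalized vectors, the largest dot product corresponds to the
    # smallest cosine distance, so this ranks documents the same as closest_to.
    return max(embeddings.keys(), key=lambda k: np.dot(embeddings[k], embedding))
```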

&lt;p&gt;It's faster to keep the embeddings in memory rather than loading them every time the function is called.&lt;/p&gt;
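&lt;p&gt;One way to do that is to memoize &lt;code&gt;load_embeddings&lt;/code&gt; so the pickle is read from disk only once per process; &lt;code&gt;functools.lru_cache&lt;/code&gt; makes it a one-line change:&lt;/p&gt;

```python
import pickle
from functools import lru_cache

@lru_cache(maxsize=1)
def load_embeddings():
    # Cached: the pickle is deserialized on the first call only;
    # later calls return the same in-memory dict.
    with open('embeddings.pickle', 'rb') as f:
        return pickle.load(f)
```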

&lt;p&gt;The web version above additionally requires wrapping the search in a little Flask app that returns the matched file's text contents as the response. The frontend is pretty short:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;style&amp;gt;
#essay p {
    margin: 20px 0;
}
&amp;lt;/style&amp;gt;
&amp;lt;script&amp;gt;
function search(inverse) {
    const essay = document.getElementById("essay");
    essay.textContent = "";
    const query = document.getElementById("query").value;
    console.log("Searching", query);
    fetch("your url here", {
        method: "POST",
        mode: "cors",
        cache: "no-cache",
        credentials: "same-origin",
        headers: {
            "Content-Type": "application/json",
        },
        redirect: "follow",
        referrerPolicy: "no-referrer",
        body: JSON.stringify({text: query, inverse: inverse}),
    }).then(r =&amp;gt; r.text()).then(d =&amp;gt; {
        d.split('\n').filter(s =&amp;gt; !!s).forEach((s, i) =&amp;gt; {
            const c = document.createElement(i === 0 ? "h2" : "p");
            const node = document.createTextNode(s);
            c.appendChild(node);
            essay.appendChild(c);
        });
    });
}
&amp;lt;/script&amp;gt;
&amp;lt;textarea id="query"&amp;gt;&amp;lt;/textarea&amp;gt;
&amp;lt;div style="display:flex;flex-direction:row; margin-top:30px;"&amp;gt;
&amp;lt;button onclick="search(false);" class="button"&amp;gt;Most Similar&amp;lt;/button&amp;gt;
&amp;lt;button onclick="search(true);" class="button"&amp;gt;Least Similar&amp;lt;/button&amp;gt;
&amp;lt;/div&amp;gt;
&amp;lt;div id="essay"&amp;gt;&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
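&lt;p&gt;The Flask side isn't shown in the post; a minimal sketch might look like the following (the route name is an assumption, and a stand-in &lt;code&gt;closest_to_text&lt;/code&gt; keeps it self-contained):&lt;/p&gt;

```python
from flask import Flask, request

def closest_to_text(text):
    # Stand-in for the post's function, so this sketch runs on its own.
    return "example.txt"

app = Flask(__name__)

@app.route("/search", methods=["POST"])
def search():
    payload = request.get_json()
    filename = closest_to_text(payload["text"])
    # The real app would open the matched file and return its text contents;
    # the "inverse" flag from the frontend would pick max instead of min.
    return filename
```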



</description>
    </item>
  </channel>
</rss>
