<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: George Hoqqanen</title>
    <description>The latest articles on DEV Community by George Hoqqanen (@hoqqanen).</description>
    <link>https://dev.to/hoqqanen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1064672%2F0cb82751-4b14-4e34-8732-c9eca8899689.png</url>
      <title>DEV Community: George Hoqqanen</title>
      <link>https://dev.to/hoqqanen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hoqqanen"/>
    <language>en</language>
    <item>
      <title>10 minute search engine</title>
      <dc:creator>George Hoqqanen</dc:creator>
      <pubDate>Thu, 13 Apr 2023 20:54:39 +0000</pubDate>
      <link>https://dev.to/hoqqanen/10-minute-search-engine-5cjp</link>
      <guid>https://dev.to/hoqqanen/10-minute-search-engine-5cjp</guid>
      <description>&lt;p&gt;The no-frills guide for how to make a search engine for a small collection of text documents in 10 minutes.&lt;/p&gt;

&lt;h2&gt;Overview&lt;/h2&gt;

&lt;p&gt;I have 130 text documents of writing which I want to be able to search. Here are the steps I will take:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compute some numbers for each document&lt;/li&gt;
&lt;li&gt;Compute some numbers for the search query&lt;/li&gt;
&lt;li&gt;See which document's numbers are closest to the query&lt;/li&gt;
&lt;/ol&gt;
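&lt;p&gt;The three steps can be sketched end to end. The toy embed below is a stand-in (a normalized character-frequency vector, not a real embedding) just to show the shape of the pipeline:&lt;/p&gt;

```python
import math

def embed(s):
    # Toy stand-in for a real embedding model: a normalized
    # character-frequency vector. Any text-to-vector function works here.
    v = [s.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

docs = {"cats.txt": "a cat sat", "dogs.txt": "the dog ran"}

doc_vecs = {name: embed(text) for name, text in docs.items()}  # step 1
query_vec = embed("a cat")                                     # step 2
best = max(doc_vecs,                                           # step 3
           key=lambda k: sum(a * b for a, b in zip(doc_vecs[k], query_vec)))
print(best)  # cats.txt
```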

&lt;h2&gt;Embeddings&lt;/h2&gt;

&lt;p&gt;The numbers are called "embeddings": vectors that place similar texts near each other. I compute them via OpenAI's embedding API, using the text-embedding-ada-002 model. Here's how:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import openai
openai.api_key = "your key here"

def embed(s):
  response = openai.Embedding.create(
      input=s,
      model="text-embedding-ada-002"
  )
  return response['data'][0]['embedding']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code was pasted verbatim from the OpenAI documentation.&lt;/p&gt;

&lt;p&gt;To do this for every document, let's have GitHub Copilot write it for us:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def save_embeddings(folder):
  """Iterates over the txt files in the folder, embeds them, and saves the result"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copilot gives us the completion&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def save_embeddings(folder):
  """Iterates over the txt files in the folder, embeds them, and saves the result"""
  from os import listdir
  import pickle
  import time
  embeddings = {}
  for filename in listdir(folder):
    time.sleep(.1) # I added this, otherwise the OpenAI API will block requests
    if filename.endswith(".txt"):
      with open(folder + "/" + filename, 'r') as f:
        text = f.read()
        embedding = embed(text)
        embeddings[filename] = embedding
  with open('embeddings.pickle', 'wb') as f:
    pickle.dump(embeddings, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Search&lt;/h2&gt;

&lt;p&gt;We need to do three things: get the embeddings we saved, embed our search text, and find the closest document. Let's see if Copilot can do it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;def load_embeddings&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Copilot gives us&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def load_embeddings():
  import pickle
  with open('embeddings.pickle', 'rb') as f:
    embeddings = pickle.load(f)
  return embeddings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's find the match.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;def closest_to&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Copilot completes with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def closest_to(embeddings, embedding):
  import numpy as np
  from scipy.spatial.distance import cosine
  return min(embeddings.keys(), key=lambda k: cosine(embeddings[k], embedding))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally we need to be able to search with whatever text we want.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;def closest_to_text(text):&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Copilot once again fills in the blanks&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def closest_to_text(text):
  embeddings = load_embeddings()
  embedding = embed(text)
  return closest_to(embeddings, embedding)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;That's all there is to it. First I run &lt;code&gt;save_embeddings&lt;/code&gt; on the folder containing my text documents. Then I call &lt;code&gt;closest_to_text&lt;/code&gt; with the search query.&lt;/p&gt;

&lt;p&gt;Try it out on &lt;a href="https://www.splitbound.com/search/"&gt;https://www.splitbound.com/search/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Notes&lt;/h2&gt;

&lt;p&gt;Copilot brought in the numpy and scipy libraries, and I used the openai library; all of them need to be installed with pip (&lt;code&gt;pip install openai numpy scipy&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;It's also possible to find the worst match by taking the maximum distance instead of the minimum; the search on the website above exposes this as "Least Similar". Alternatively, you can modify the code to return the top 5 matches.&lt;/p&gt;
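&lt;p&gt;A top-n variant is a small change to &lt;code&gt;closest_to&lt;/code&gt;. This sketch (the name &lt;code&gt;top_matches&lt;/code&gt; is mine, not from the post) sorts by distance and slices:&lt;/p&gt;

```python
from scipy.spatial.distance import cosine

def top_matches(embeddings, embedding, n=5):
    # Rank every document by cosine distance to the query, keep the n closest.
    ranked = sorted(embeddings.keys(), key=lambda k: cosine(embeddings[k], embedding))
    return ranked[:n]
```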

&lt;p&gt;Copilot used the cosine distance, which isn't strictly necessary: OpenAI's embeddings are normalized (vectors with magnitude 1), so a plain dot product ranks documents identically.&lt;/p&gt;
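&lt;p&gt;For unit-length vectors, cosine distance equals 1 minus the dot product, so minimizing one is the same as maximizing the other. A dot-product version of &lt;code&gt;closest_to&lt;/code&gt; might look like:&lt;/p&gt;

```python
import numpy as np

def closest_to_dot(embeddings, embedding):
    # With normalized vectors, the largest dot product corresponds to the
    # smallest cosine distance, so this ranks documents the same as closest_to.
    return max(embeddings.keys(), key=lambda k: np.dot(embeddings[k], embedding))
```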

&lt;p&gt;It's faster to keep the embeddings in memory rather than loading them every time the function is called.&lt;/p&gt;
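&lt;p&gt;One way to do that is to memoize &lt;code&gt;load_embeddings&lt;/code&gt; so the pickle is read from disk only once per process; &lt;code&gt;functools.lru_cache&lt;/code&gt; makes it a one-line change:&lt;/p&gt;

```python
import pickle
from functools import lru_cache

@lru_cache(maxsize=1)
def load_embeddings():
    # Cached: the pickle is deserialized on the first call only;
    # later calls return the same in-memory dict.
    with open('embeddings.pickle', 'rb') as f:
        return pickle.load(f)
```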

&lt;p&gt;The web version above additionally requires wrapping the search in a little Flask app that returns the matched file's text contents as the response. The frontend is pretty short:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;style&amp;gt;
#essay p {
    margin: 20px 0;
}
&amp;lt;/style&amp;gt;
&amp;lt;script&amp;gt;
function search(inverse) {
    const essay = document.getElementById("essay");
    essay.textContent = "";
    const query = document.getElementById("query").value;
    console.log("Searching", query);
    fetch("your url here", {
        method: "POST",
        mode: "cors",
        cache: "no-cache",
        credentials: "same-origin",
        headers: {
            "Content-Type": "application/json",
        },
        redirect: "follow",
        referrerPolicy: "no-referrer",
        body: JSON.stringify({text: query, inverse: inverse}),
    }).then(r =&amp;gt; r.text()).then(d =&amp;gt; {
        d.split('\n').filter(s =&amp;gt; !!s).forEach((s, i) =&amp;gt; {
            const c = document.createElement(i === 0 ? "h2" : "p");
            const node = document.createTextNode(s);
            c.appendChild(node);
            essay.appendChild(c);
        });
    });
}
&amp;lt;/script&amp;gt;
&amp;lt;textarea id="query"&amp;gt;&amp;lt;/textarea&amp;gt;
&amp;lt;div style="display:flex;flex-direction:row; margin-top:30px;"&amp;gt;
&amp;lt;button onclick="search(false);" class="button"&amp;gt;Most Similar&amp;lt;/button&amp;gt;
&amp;lt;button onclick="search(true);" class="button"&amp;gt;Least Similar&amp;lt;/button&amp;gt;
&amp;lt;/div&amp;gt;
&amp;lt;div id="essay"&amp;gt;&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
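&lt;p&gt;The Flask side isn't shown in the post; a minimal sketch might look like the following (the route name is an assumption, and a stand-in &lt;code&gt;closest_to_text&lt;/code&gt; keeps it self-contained):&lt;/p&gt;

```python
from flask import Flask, request

def closest_to_text(text):
    # Stand-in for the post's function, so this sketch runs on its own.
    return "example.txt"

app = Flask(__name__)

@app.route("/search", methods=["POST"])
def search():
    payload = request.get_json()
    filename = closest_to_text(payload["text"])
    # The real app would open the matched file and return its text contents;
    # the "inverse" flag from the frontend would pick max instead of min.
    return filename
```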



</description>
    </item>
  </channel>
</rss>
