
Omar Shehata

Open source semantic embedding, search & clustering in NodeJS

This tutorial walks you through how to do semantic embedding completely offline with open source models in NodeJS. No knowledge of AI/ML is required. Source code: https://github.com/OmarShehata/minimal-embedding-template

An "embedding" is a high-dimensional vector (x, y, z, ...) that represents a concept. Think of it as the internal representation of words in an LLM. You can compute distances between these vectors.

Example: "Man" is much closer to "boy" and "woman" than to "chicken". "Coffee" and "wifi" are somewhat close, and both are close to "coffee shop".
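The "distance" here is usually cosine similarity. Here's a minimal sketch in plain JavaScript, using made-up 3D vectors (real embeddings have hundreds of dimensions):

```javascript
// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. Ranges from -1 to 1.
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Hypothetical toy vectors, just to show the math
const man = [0.9, 0.1, 0.2];
const boy = [0.8, 0.3, 0.1];
const chicken = [0.1, 0.2, 0.9];

console.log(cosineSimilarity(man, boy));     // high (close concepts)
console.log(cosineSimilarity(man, chicken)); // low (distant concepts)
```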

illustration of what these vectors might look like projected down to 2D

I think this is an extremely underutilized feature of modern LLMs, and it's much cheaper compute-wise than text generation. Most of the time I don't really want the LLM to generate text so much as I want to see & manipulate semantic concepts like this.

Setup

Clone the repo: https://github.com/OmarShehata/semantic-embedding-template. This contains a minimal NodeJS template that you can copy/paste and build on.

It uses two libraries:

  1. gpt4all as the LLM engine. This is where the open source model comes from, and it's what converts a word/string/document into a vector.
  2. Vectra as a local, single-file vector database, which lets us index & search vectors.

Run pnpm install, then run the first example in example-simple-embedding/index.js:

pnpm simple-embedding

This takes an array of strings and converts them to vectors:

await embeddings.insertText(['coffee shop', 'wifi', ...])

You can print the vectors with embeddings.getTextMap(). You can run a search as shown below. This returns a sorted list of the closest vectors in the DB, along with the cosine similarity.

const results = await embeddings.search('coffee')
// returns:
// [
//   [ 'coffee shop', 0.8214959697396015 ],
//   [ 'wifi', 0.711907901740376 ],
//   [ 'hard work', 0.6709908415581982 ],
//   [ 'love peace & joy, relaxation', 0.6495931802131457 ]
// ]

(1 means it's exactly the same vector, -1 means it points in exactly the opposite direction, and 0 means the two are unrelated)
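To build intuition for what a search like this does under the hood, here's a hypothetical brute-force version in plain JavaScript: compare the query vector against every stored vector and sort by similarity. (Vectra does this for you, with proper indexing; the vectors below are made up.)

```javascript
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Score every entry against the query, highest similarity first
function search(queryVector, db) {
  return Object.entries(db)
    .map(([text, vector]) => [text, cosineSimilarity(queryVector, vector)])
    .sort((a, b) => b[1] - a[1]);
}

// Toy 2D "database" standing in for real embeddings
const db = {
  'coffee shop': [0.9, 0.2],
  'wifi': [0.6, 0.5],
  'hard work': [0.2, 0.8],
};

const results = search([0.95, 0.1], db);
console.log(results); // 'coffee shop' comes back first
```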

embeddings is a thin wrapper around gpt4all

lib/embeddings.js implements insertText which:

  1. checks to see if these words are already in the DB
  2. inserts them if they are not, with a batch update
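The steps above can be sketched roughly like this (a simplified, synchronous stand-in; the real insertText in lib/embeddings.js is async and calls the gpt4all model instead of the stub embedder here):

```javascript
// Hypothetical sketch of the insertText logic: skip strings already
// in the DB, embed the rest, and insert them as one batch.
function insertText(texts, db, embed) {
  // 1. check which strings are not already in the DB
  const newTexts = texts.filter((t) => !db.has(t));
  // 2. embed just those and insert them as a batch
  const vectors = newTexts.map((t) => embed(t));
  newTexts.forEach((t, i) => db.set(t, vectors[i]));
  return newTexts.length; // how many were actually inserted
}

const db = new Map();
const stubEmbed = (t) => [t.length, 0]; // stand-in for the real model

insertText(['coffee shop', 'wifi'], db, stubEmbed);
const inserted = insertText(['wifi', 'hard work'], db, stubEmbed);
console.log(inserted, db.size); // 'wifi' is skipped the second time
```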

The search function takes the query string and converts it to a vector, then runs a query against the Vectra DB.

The specific model I'm using here is nomic-embed-text-v1.5, an open source, free-to-use model that runs locally on your machine.

OpenAI embeddings

lib/embeddings-openai.js is a version of this file that has exactly the same API but sends the text to OpenAI. See OpenAI's embedding docs.

The OpenAI model captures more nuance in my experience (for example, it captures the semantic meaning of emojis whereas the open source one doesn't seem to).

Set the OPEN_API_KEY environment variable to use this. To run the example in example-openai-embedding/index.js:

pnpm openai-embedding

Clustering

To run example-clustering/index.js:

pnpm clustering

This clusters the vectors in the DB using k-means. You tell it the number of clusters (k) you want, and it iteratively assigns each point to its nearest cluster centroid, then recomputes the centroids, until the clusters stabilize.

Normally, you don't know how many clusters are in the data. There are various techniques to estimate this. One is the "elbow method": you cluster the dataset with increasingly higher values of k and compute a "score" for each. The score measures how far the items in each cluster are from their centroid, so the lower the score, the tighter the clusters. You pick the k where the score stops improving sharply (the "elbow"), which tends to give you clusters of semantically related things.
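For illustration, here's a minimal k-means with that kind of score, in plain JavaScript on toy 2D points (the repo uses a library for the real thing; this is just to show the idea):

```javascript
// Naive k-means: assign points to the nearest centroid, recompute
// centroids as the mean of their members, repeat.
function kMeans(points, k, iterations = 20) {
  let centroids = points.slice(0, k).map((p) => [...p]); // naive init
  let labels = new Array(points.length).fill(0);
  for (let iter = 0; iter < iterations; iter++) {
    // assignment step: nearest centroid for each point
    labels = points.map((p) => {
      let best = 0, bestDist = Infinity;
      centroids.forEach((c, i) => {
        const d = (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2;
        if (d < bestDist) { bestDist = d; best = i; }
      });
      return best;
    });
    // update step: centroid = mean of its assigned points
    centroids = centroids.map((c, i) => {
      const members = points.filter((_, j) => labels[j] === i);
      if (members.length === 0) return c;
      return [
        members.reduce((s, p) => s + p[0], 0) / members.length,
        members.reduce((s, p) => s + p[1], 0) / members.length,
      ];
    });
  }
  // "score" for the elbow method: total distance to assigned centroid
  const score = points.reduce((s, p, j) => {
    const c = centroids[labels[j]];
    return s + Math.hypot(p[0] - c[0], p[1] - c[1]);
  }, 0);
  return { labels, centroids, score };
}

// Two obvious blobs; they should end up in separate clusters
const points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]];
const { labels, score } = kMeans(points, 2);
console.log(labels, score.toFixed(2));
```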


I hope you found this useful! You can use the base code here to recreate Neal's Infinite Craft game: put all the words in the dictionary into the vector database, then to combine two words, add their vectors (or take the average?) and search for the closest thing to that combined vector.
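As a sketch of that idea, with made-up 2D vectors standing in for real embeddings and a hypothetical combine helper:

```javascript
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Average two word vectors, then return the closest other word
function combine(wordA, wordB, dictionary) {
  const avg = dictionary[wordA].map((v, i) => (v + dictionary[wordB][i]) / 2);
  let best = null, bestSim = -Infinity;
  for (const [word, vec] of Object.entries(dictionary)) {
    if (word === wordA || word === wordB) continue; // skip the inputs
    const sim = cosineSimilarity(avg, vec);
    if (sim > bestSim) { bestSim = sim; best = word; }
  }
  return best;
}

// Toy "dictionary" of embeddings
const dictionary = {
  water: [1, 0],
  fire: [0, 1],
  steam: [0.7, 0.7],
  rock: [-1, 0],
};

const result = combine('water', 'fire', dictionary);
console.log(result); // 'steam' is closest to the average of water + fire
```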

This is my personal sandbox that I hope to add more stuff to. For example, there are models that can convert an image to a text description. You can then get a vector embedding for that text, and with that you can build an app where you can "CTRL+F" for your images (again, all offline, and free!).

Top comments (3)

Martin Baun

Great guideline! Thanks for sharing

Omar Shehata

thanks Martin!! I wrote this partially because I kept finding dozens of tutorials on this but they're all "ads" (like telling me to use this or that service). And I just wanted to know how to do it in a super simple nodeJS script!!

Mitch Chimwemwe Chanza

Awesome post , keep it up.