
Omar Shehata

Open source semantic embedding, search & clustering in NodeJS

This tutorial walks you through how to do semantic embedding completely offline with open source models in NodeJS. No knowledge of AI/ML is required. Source code: https://github.com/OmarShehata/minimal-embedding-template

An "embedding" is a high-dimensional vector (x, y, z, ...) that represents a concept. Think of it as the internal representation of words in an LLM. You can compute distances between these vectors.

Example: "Man" is much closer to "boy" and "woman" than to "chicken". "Coffee" and "wifi" are somewhat close, and both are close to "coffee shop".
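The "distance" here is usually cosine similarity. Here's a minimal sketch in plain JavaScript, using made-up 3D vectors (real embeddings have hundreds of dimensions):

```javascript
// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. Ranges from -1 to 1.
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Hypothetical toy vectors, just to show the math
const man = [0.9, 0.1, 0.2];
const boy = [0.8, 0.3, 0.1];
const chicken = [0.1, 0.2, 0.9];

console.log(cosineSimilarity(man, boy));     // high (close concepts)
console.log(cosineSimilarity(man, chicken)); // low (distant concepts)
```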

illustration of what these vectors might look like projected down to 2D

I think this is an extremely underutilized feature of modern LLMs, and it's much cheaper compute-wise than text generation. Most of the time I don't really want the LLM to generate text so much as I want to see & manipulate semantic concepts like this.

Setup

Clone the repo: https://github.com/OmarShehata/semantic-embedding-template. This contains a minimal NodeJS template that you can copy/paste and build on.

It uses two libraries:

  1. gpt4all as the LLM engine. This is where the open source model comes from, and it's what converts a word/string/document into a vector.
  2. Vectra as a local, single-file vector database, which lets us index & search vectors.

Run pnpm install, then run the first example in example-simple-embedding/index.js:

pnpm simple-embedding

This takes an array of strings and converts them to vectors:

await embeddings.insertText(['coffee shop', 'wifi', ...])

You can print the vectors with embeddings.getTextMap(). You can run a search as shown below. This returns a sorted list of the closest vectors in the DB, along with the cosine similarity.

const results = await embeddings.search('coffee')
// returns:
// [
//   [ 'coffee shop', 0.8214959697396015 ],
//   [ 'wifi', 0.711907901740376 ],
//   [ 'hard work', 0.6709908415581982 ],
//   [ 'love peace & joy, relaxation', 0.6495931802131457 ]
// ]

(1 means it's exactly the same vector, -1 means it points in exactly the opposite direction, and 0 means the two are unrelated)
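To build intuition for what a search like this does under the hood, here's a hypothetical brute-force version in plain JavaScript: compare the query vector against every stored vector and sort by similarity. (Vectra does this for you, with proper indexing; the vectors below are made up.)

```javascript
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Score every entry against the query, highest similarity first
function search(queryVector, db) {
  return Object.entries(db)
    .map(([text, vector]) => [text, cosineSimilarity(queryVector, vector)])
    .sort((a, b) => b[1] - a[1]);
}

// Toy 2D "database" standing in for real embeddings
const db = {
  'coffee shop': [0.9, 0.2],
  'wifi': [0.6, 0.5],
  'hard work': [0.2, 0.8],
};

const results = search([0.95, 0.1], db);
console.log(results); // 'coffee shop' comes back first
```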

embeddings is a thin wrapper around gpt4all

lib/embeddings.js implements insertText which:

  1. checks to see if these words are already in the DB
  2. inserts them if they are not, with a batch update
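The steps above can be sketched roughly like this (a simplified, synchronous stand-in; the real insertText in lib/embeddings.js is async and calls the gpt4all model instead of the stub embedder here):

```javascript
// Hypothetical sketch of the insertText logic: skip strings already
// in the DB, embed the rest, and insert them as one batch.
function insertText(texts, db, embed) {
  // 1. check which strings are not already in the DB
  const newTexts = texts.filter((t) => !db.has(t));
  // 2. embed just those and insert them as a batch
  const vectors = newTexts.map((t) => embed(t));
  newTexts.forEach((t, i) => db.set(t, vectors[i]));
  return newTexts.length; // how many were actually inserted
}

const db = new Map();
const stubEmbed = (t) => [t.length, 0]; // stand-in for the real model

insertText(['coffee shop', 'wifi'], db, stubEmbed);
const inserted = insertText(['wifi', 'hard work'], db, stubEmbed);
console.log(inserted, db.size); // 'wifi' is skipped the second time
```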

The search function takes the query string and converts it to a vector, then runs a query against the Vectra DB.

The specific model I'm using here is nomic-embed-text-v1.5, an open source, free-to-use model that runs locally on your machine.

OpenAI embeddings

lib/embeddings-openai.js is a version of this file that has exactly the same API but sends the text to OpenAI. See OpenAI's embedding docs.

The OpenAI model captures more nuance in my experience (for example, it captures the semantic meaning of emojis whereas the open source one doesn't seem to).

Set the OPEN_API_KEY environment variable to use this. To run the example in example-openai-embedding/index.js:

pnpm openai-embedding

Clustering

To run example-clustering/index.js:

pnpm clustering

This clusters the vectors in the DB using k-means. You tell it the number of clusters (k) you want, and it iteratively assigns each point to its nearest cluster centroid, then recomputes the centroids, until the clusters stabilize.

Normally, you don't know how many clusters are in the data. There are various techniques to estimate this. One is the "elbow method": you cluster the dataset with increasingly higher values of k and compute a "score" for each. The score measures how far the items in each cluster are from their centroid, so the lower the score, the tighter the clusters. You pick the k where the score stops improving sharply (the "elbow"), which tends to give you clusters of semantically related things.
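For illustration, here's a minimal k-means with that kind of score, in plain JavaScript on toy 2D points (the repo uses a library for the real thing; this is just to show the idea):

```javascript
// Naive k-means: assign points to the nearest centroid, recompute
// centroids as the mean of their members, repeat.
function kMeans(points, k, iterations = 20) {
  let centroids = points.slice(0, k).map((p) => [...p]); // naive init
  let labels = new Array(points.length).fill(0);
  for (let iter = 0; iter < iterations; iter++) {
    // assignment step: nearest centroid for each point
    labels = points.map((p) => {
      let best = 0, bestDist = Infinity;
      centroids.forEach((c, i) => {
        const d = (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2;
        if (d < bestDist) { bestDist = d; best = i; }
      });
      return best;
    });
    // update step: centroid = mean of its assigned points
    centroids = centroids.map((c, i) => {
      const members = points.filter((_, j) => labels[j] === i);
      if (members.length === 0) return c;
      return [
        members.reduce((s, p) => s + p[0], 0) / members.length,
        members.reduce((s, p) => s + p[1], 0) / members.length,
      ];
    });
  }
  // "score" for the elbow method: total distance to assigned centroid
  const score = points.reduce((s, p, j) => {
    const c = centroids[labels[j]];
    return s + Math.hypot(p[0] - c[0], p[1] - c[1]);
  }, 0);
  return { labels, centroids, score };
}

// Two obvious blobs; they should end up in separate clusters
const points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]];
const { labels, score } = kMeans(points, 2);
console.log(labels, score.toFixed(2));
```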


I hope you found this useful! You can use the base code here to recreate Neal's Infinite Craft game: put all the words in the dictionary into the vector database, then to combine two words, add their vectors (or take the average?) and search for the closest thing to that combined vector.
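As a sketch of that idea, with made-up 2D vectors standing in for real embeddings and a hypothetical combine helper:

```javascript
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Average two word vectors, then return the closest other word
function combine(wordA, wordB, dictionary) {
  const avg = dictionary[wordA].map((v, i) => (v + dictionary[wordB][i]) / 2);
  let best = null, bestSim = -Infinity;
  for (const [word, vec] of Object.entries(dictionary)) {
    if (word === wordA || word === wordB) continue; // skip the inputs
    const sim = cosineSimilarity(avg, vec);
    if (sim > bestSim) { bestSim = sim; best = word; }
  }
  return best;
}

// Toy "dictionary" of embeddings
const dictionary = {
  water: [1, 0],
  fire: [0, 1],
  steam: [0.7, 0.7],
  rock: [-1, 0],
};

const result = combine('water', 'fire', dictionary);
console.log(result); // 'steam' is closest to the average of water + fire
```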

This is my personal sandbox that I hope to add more stuff to. For example, there are models that can convert an image to a text description. You can then get a vector embedding for that text, and with that you can build an app where you can "CTRL+F" for your images (again, all offline, and free!).

Top comments (3)

Martin Baun

Great guideline! Thanks for sharing

Omar Shehata

thanks Martin!! I wrote this partially because I kept finding dozens of tutorials on this but they're all "ads" (like telling me to use this or that service). And I just wanted to know how to do it in a super simple nodeJS script!!

Mitch Chimwemwe Chanza

Awesome post , keep it up.