DEV Community

hoaitx

Posted on • Originally published at 2coffee.dev

Semantic Search Feature

Hi there, have you noticed that autumn has become more noticeable in Hanoi recently? The mornings are cool, and the evenings bring strong winds. But behind that was a busy week for me. During the day I was racing to meet a deadline for my company's project, and in the evenings I tried to finish the search function for my blog. This deadline was different from usual because it was for the product's main feature of the year. As for the blog, the search function had to be completed sooner or later, and this was the perfect time to do it.

Before switching to Fresh, my blog already had a search function. Back then, I used Postgres's full-text search. For those who don't know, before using Postgres, I also used RediSearch for search. Generally speaking, Postgres gave better results, while RediSearch was more complex. In reality, my blog's data wasn't that large, so RediSearch never had a chance to shine.

When I switched to Fresh, AI was booming. Many people were talking about AI and what it could do. After completing the basic features and getting ready to work on the search function, I thought, "Why not try using AI?" So, I decided to "release" the new blog version without the search function.

To build a search function with AI, I had to spend a lot of time researching and experimenting. I learned how to implement it, how to use LLMs and embedding models, what vector data types are, how to convert data to vectors, and how to query them.

To put it simply, a vector is an ordered list of numbers, like in mathematics. The number of elements in the list is the size (dimension) of the vector. The larger the dimension, the more detail the vector can capture about the data it represents. There are many ways to convert regular data (text, speech, images, etc.) to vectors, but thanks to the popularity of LLMs today, you can just feed the data into an embedding model, and it will return a vector.

Semantic search is different from traditional full-text keyword search. Full-text search matches the words entered in the query against the words in the text and returns the closest matches. Semantic search, meanwhile, is based on meaning. Suppose your article explains how node.js works. When someone searches for the phrase "how node.js works", semantic search can find the article from its content. Full-text search, on the other hand, will try to find articles containing the words "node.js", "works", ...

To query vector data, you need at least two steps. First, convert the query into a vector, then compare it with the stored vectors using a query function. For example, pgvector - a Postgres extension that supports vectors - provides query functions like:

[Image: pg-vector search]

You can see L2 distance, Cosine distance, L1 distance... as vector comparison methods. Depending on the use case, you choose the query type accordingly. For the search problem, I chose cosine distance, which compares the direction of two vectors rather than their length: the closer their directions, the smaller the distance.
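To make the comparison methods above concrete, here is a minimal pure-Python sketch of the three distances. It is only an illustration of the math, not how pgvector implements it; in pgvector itself these correspond to the `<->` (L2), `<=>` (cosine) and `<+>` (L1) operators. The sample vectors are made up.

```python
import math

def l2_distance(a, b):
    # Straight-line (Euclidean) distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1_distance(a, b):
    # Sum of absolute differences per dimension.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity: depends only on the angle between the vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up sample vectors for illustration.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, -1.0, 0.5]  # points in a different direction

print(l2_distance(a, b), l1_distance(a, b), cosine_distance(a, b))
print(l2_distance(a, c), l1_distance(a, c), cosine_distance(a, c))
```

Note that `a` and `b` are far apart by L2 and L1 but have a cosine distance of (almost) zero, because they point the same way. That is why cosine distance is a common choice for comparing text embeddings.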

How to do it

[Image: Flow]

First, choose a suitable database. I'm using Turso as my main database. However, Turso is based on SQLite, which isn't optimized for vector data. Although they introduced an extension to support vectors, it's a bit complicated.

pgvector is the opposite. It's a widely used Postgres extension. When it comes to Postgres, I think of Supabase, which offers a free tier. Supabase comes with pgvector built in, and enabling it takes just a click, making it a great choice.

Next is choosing models. To save costs, I looked for free models from the start. I have to mention Groq and its Completions API. However, Groq doesn't offer embedding models, so I had to find another one.

nomic-embed-text is an embedding model I found in Ollama's library. It can vectorize text. Additionally, Nomic provides a free embeddings API with limitations. However, I should point out that nomic-embed-text isn't a multilingual model. It supports Vietnamese only to a limited extent, so the generated vectors might not capture Vietnamese semantics well.

After preparing everything, it's time to write code to add vector data and search logic.

First, convert the article content into a vector and store it in Supabase. Instead of converting the entire article content, I summarize the main content of the article before feeding it into nomic-embed-text. This helps remove unnecessary information and reduce the input token count for the model to process.
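The "summarize first, then embed" step described above could be sketched like this. This is not the blog's actual code: `build_article_row`, the row layout, and the two stand-in functions are hypothetical, and in practice `summarize` would call an LLM and `embed` would call nomic-embed-text.

```python
from typing import Callable, Dict, List

def build_article_row(
    article_id: int,
    content: str,
    summarize: Callable[[str], str],
    embed: Callable[[str], List[float]],
) -> Dict[str, object]:
    # 1. Reduce the article to its main ideas (fewer tokens, less noise).
    summary = summarize(content)
    # 2. Embed the summary instead of the full text.
    vector = embed(summary)
    # 3. Return a row ready to be stored next to the article (hypothetical layout).
    return {"id": article_id, "summary": summary, "embedding": list(vector)}

# Stand-ins so the sketch runs without any model; replace with real calls.
def fake_summarize(text: str) -> str:
    return text[:200]

def fake_embed(text: str) -> List[float]:
    return [float(len(text)), float(text.count(" "))]

row = build_article_row(1, "How node.js works under the hood ...", fake_summarize, fake_embed)
print(row["summary"], len(row["embedding"]))
```

Passing the summarizer and the embedder in as functions keeps the pipeline independent of which model or API ends up doing the work.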

Another note is that although these models have free APIs, they always come with limits. Processing all the data for the first time is very expensive in API usage, as I have over 400 articles in both Vietnamese and English. A better approach is to run the Llama 3.2 3B and nomic-embed-text models locally. I use LM Studio for this.
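For reference, here is a rough sketch of requesting an embedding from a local LM Studio server using only the standard library. It assumes LM Studio's local server is running with its OpenAI-compatible API on the default port 1234. The model identifier `nomic-embed-text` is an assumption: use whatever name LM Studio shows for the model you loaded.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # assumed default LM Studio server address
EMBED_MODEL = "nomic-embed-text"       # assumed identifier; check the name in LM Studio

def build_embedding_request(text, model=EMBED_MODEL):
    # OpenAI-compatible embeddings request: POST /v1/embeddings
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embedding_response(body):
    # The response carries one entry per input under "data".
    return json.loads(body)["data"][0]["embedding"]

def embed(text):
    # Needs the local server to be running; raises URLError otherwise.
    with urllib.request.urlopen(build_embedding_request(text)) as resp:
        return parse_embedding_response(resp.read())
```

With the server running, `embed("how node.js works")` would return the vector as a list of floats. The same function can then be used both for article summaries and for user queries.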

The search logic is simple. Take the user's query -> pass it through nomic-embed-text to convert it to a vector -> compute the cosine distance to each article vector and sort by the smallest distance.
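The flow above can be sketched in two parts: the SQL that pgvector would run, and an in-memory version of the same ranking that runs without a database. The table and column names (`articles`, `title`, `embedding`) and the `%(name)s` parameter style are assumptions for illustration, not the blog's real schema.

```python
import math

# What the query might look like with pgvector (hypothetical schema).
# <=> is pgvector's cosine distance operator.
SEARCH_SQL = """
SELECT id, title, embedding <=> %(query)s::vector AS distance
FROM articles
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s
"""

def to_pgvector_literal(vector):
    # pgvector accepts vectors written as text, e.g. '[0.1,0.2,0.3]'.
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def search(query_vector, articles, limit=5):
    # In-memory equivalent of the SQL: score every article, smallest distance first.
    scored = [(cosine_distance(query_vector, art["embedding"]), art["title"]) for art in articles]
    scored.sort(key=lambda pair: pair[0])
    return scored[:limit]

# Tiny made-up vectors; real embeddings have hundreds of dimensions.
demo_articles = [
    {"title": "How node.js works", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Autumn in Hanoi", "embedding": [0.0, 0.2, 0.9]},
]
print(search([1.0, 0.0, 0.0], demo_articles, limit=1))
```

The query vector comes from the same embedding model as the article vectors. Mixing models would make the distances meaningless.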

However, if the user searches for short keywords like node.js, javascript, etc., semantic search likely won't return good results: the query is too short, so the generated vector doesn't carry enough meaning and the cosine distance to every article ends up too large. To handle this case, I need to keep a full-text search mechanism. Fortunately, Supabase supports this type of search.
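One possible way to combine the two mechanisms is sketched below. The routing rule and both thresholds (`min_words`, `max_distance`) are assumptions chosen for illustration, not values from the blog; the two search functions are passed in so the sketch stays independent of any database.

```python
def hybrid_search(query, semantic_search, fulltext_search, max_distance=0.5, min_words=3):
    """Pick between semantic and full-text search.

    semantic_search(query) -> list of (distance, title), closest first
    fulltext_search(query) -> list of titles
    """
    # Very short queries (single keywords) go straight to full-text search.
    if len(query.split()) >= min_words:
        close = [title for distance, title in semantic_search(query) if distance <= max_distance]
        if close:
            return close
    # Nothing close enough semantically, or the query was too short.
    return fulltext_search(query)

# Stand-in search functions so the sketch runs on its own.
def fake_semantic(query):
    return [(0.2, "How node.js works"), (0.8, "Autumn in Hanoi")]

def fake_fulltext(query):
    return ["Getting started with node.js"]

print(hybrid_search("how does node.js work", fake_semantic, fake_fulltext))
print(hybrid_search("node.js", fake_semantic, fake_fulltext))
```

The thresholds would need tuning against real queries, since a sensible distance cutoff depends on the embedding model and on the content.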

Challenges

Looking back, it seems simple, but the most challenging part for me was the data preprocessing steps.

An article usually conveys multiple ideas, including main and secondary content. Typically, searchers are only interested in the main content of the article, and they tend to search for things related to it. If I convert the entire article into a vector, the result will be "diluted" or "noisy" because the vector size is limited. I think that if I can remove the secondary information and emphasize the main idea, the search will be more accurate. Imagine a 1500-word article converted into a 1024-dimensional vector, compared to only its 500 words of main content converted into a vector of the same size. Which one represents the data more "clearly"?

Users' search patterns are also hard to predict because everyone searches differently. Some people like to keep it short, while others like to write longer or provide context for their questions... Therefore, processing user input data is also a challenge. How can I convert it into a concise and relevant query that matches the search content on the blog?

The quality of the AI models is also an issue. Generally, the more extensively a model has been trained, the better it is, and commercial models come with quality assurance. However, to minimize costs, I'm currently using free LLMs with limitations. Hopefully, one day I'll be able to integrate more powerful models to improve the search quality for my blog.
