José Marrugo

Posted on Dec 18, 2023

Vector Search pt. 1 - Features 🪵

#machinelearning

I'd never asked myself how Google, YouTube, Pinterest, or Netflix perform their searches. I just used those services to look up information, shows, or images without ever considering the underlying technology, and I really didn't care.

That was until I encountered the concept of Vector Search, a concept so deeply ingrained in us that once you get to know it, you start looking at things in a very different way.

This concept can explain, at a high level, how retrieval systems work, how Facebook matches friends together, and even how we understand the world around us.

This article is part of a series that aims to shed light on a concept that we all know and use, but very few of us really understand, and to show how companies or individuals could use it to offer more value in their services.

Let's get started!

What is a Feature?

Straight to the point: a feature is a characteristic that we use to describe something. Typical examples would be that apples (🍎) are red, a King (🤴) is a person, and so on. But if we really want to go deeper, we need longer descriptions:

An Apple (🍎) is a fruit that can be red, green, or yellow, is a bit sweet, and is a great source of fiber and vitamins.
A King (🤴) is a person in a position of power, who is usually the husband of a Queen. Kings live in castles and wear a crown.

If we wanted to store the first, simple descriptions in a table, two columns would be enough: one for the name and one for the description. To find something, we would just filter the description column for words like person or red.

The problem with the simple descriptions is that they only work in a world with few objects. In real life we need descriptions like the second ones to really tell things apart: there is a lot of red stuff, and there are hundreds of types of people.

But we can't put those long texts in a table and expect to filter them like that. We need to organize the information into keywords:

Apple: ["fruit","red","green","yellow", "sweet", "fiber", "vitamins"]
King: ["person", "power", "husband", "castle","crown"]

If we add more objects, like Banana and Queen, we could do something similar:

Banana: ["fruit","yellow", "sweet", "fiber", "vitamins"]
Queen: ["person", "power", "wife", "castle","crown"]
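As a minimal sketch of how those keyword lists could be searched, the snippet below stores them in a plain Python dictionary. The function name `find_by_keywords` is made up for this illustration and is not part of any library.

```python
# Keyword lists copied from the four examples above.
keywords = {
    "Apple": ["fruit", "red", "green", "yellow", "sweet", "fiber", "vitamins"],
    "King": ["person", "power", "husband", "castle", "crown"],
    "Banana": ["fruit", "yellow", "sweet", "fiber", "vitamins"],
    "Queen": ["person", "power", "wife", "castle", "crown"],
}


def find_by_keywords(index, *wanted):
    """Return the names whose keyword list contains every wanted keyword."""
    return [name for name, words in index.items() if set(wanted) <= set(words)]


print(find_by_keywords(keywords, "fruit"))          # ['Apple', 'Banana']
print(find_by_keywords(keywords, "fruit", "red"))   # ['Apple']
print(find_by_keywords(keywords, "crown", "wife"))  # ['Queen']
```

Searching for a single keyword returns a whole group, and every extra keyword narrows the result.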

As humans, we understand the concepts behind those words without even thinking about it, because we have encountered them in real life so many times that describing them becomes easy.

We know that bananas, apples, watermelons, and strawberries are more related to each other than to a concept like King, which in turn is much more related to Queen than to apples.

That knowledge is encoded in our brains. This understanding is known as Semantics: the relation between a concept and its meaning.
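A computer has none of that intuition, but it can get a rough hint of relatedness by counting shared keywords. The sketch below uses the Jaccard index (shared keywords divided by all distinct keywords). That measure is an assumption chosen for illustration, not the method this series builds toward, and `keyword_overlap` is a hypothetical helper name.

```python
# Keyword lists for the four items described earlier.
keywords = {
    "Apple": ["fruit", "red", "green", "yellow", "sweet", "fiber", "vitamins"],
    "King": ["person", "power", "husband", "castle", "crown"],
    "Banana": ["fruit", "yellow", "sweet", "fiber", "vitamins"],
    "Queen": ["person", "power", "wife", "castle", "crown"],
}


def keyword_overlap(a, b):
    """Jaccard index: shared keywords divided by all distinct keywords."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


print(round(keyword_overlap(keywords["Apple"], keywords["Banana"]), 2))  # 0.71
print(round(keyword_overlap(keywords["King"], keywords["Queen"]), 2))    # 0.67
print(keyword_overlap(keywords["Apple"], keywords["King"]))              # 0.0
```

The fruits score high with each other, the royals score high with each other, and Apple and King share nothing, which matches what we would say without thinking.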

Going back to the imaginary table, if you try to figure out how to save and filter features to finally obtain one single thing, you could get to a very important question:

If we want to add a lot of objects, people, landscapes, animals, and fruits, it won't be practical to add that many descriptions as columns, so how could we do it? 🤔

That's a real problem in Computer Science, and I probably won't answer it in the best possible way, but one solution could be: Simplifying.

Simplifying our understanding of the world using Features

If we compare the two kinds of descriptions given at the beginning of this article, we notice that the first was oversimplified and the second was not.

In real life scenarios, we should come up with multiple options to summarize characteristics, for example:

name, color, gender.
cell_type, is_human, lives_in_castle.

Looking for some mid-point, we could use both for our data, organize the information in a table, and get something like:

word	color	gender	cell_type	is_human	lives_in_castle
Apple	red		plant	False	False
Banana	yellow		plant	False	False
King		male	animal	True	True
Queen		female	animal	True	True
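The table above can be sketched in plain Python as a list of dictionaries, one per row, with `None` standing in for the empty cells. The helper name `where` is hypothetical and only serves this example.

```python
from collections import defaultdict

# One dictionary per table row; None marks a cell that was left empty.
rows = [
    {"word": "Apple", "color": "red", "gender": None,
     "cell_type": "plant", "is_human": False, "lives_in_castle": False},
    {"word": "Banana", "color": "yellow", "gender": None,
     "cell_type": "plant", "is_human": False, "lives_in_castle": False},
    {"word": "King", "color": None, "gender": "male",
     "cell_type": "animal", "is_human": True, "lives_in_castle": True},
    {"word": "Queen", "color": None, "gender": "female",
     "cell_type": "animal", "is_human": True, "lives_in_castle": True},
]


def where(table, **conditions):
    """Return the words of the rows that match every column=value condition."""
    return [
        row["word"]
        for row in table
        if all(row[column] == value for column, value in conditions.items())
    ]


print(where(rows, is_human=True))                      # ['King', 'Queen']
print(where(rows, cell_type="plant", color="yellow"))  # ['Banana']

# Grouping works the same way: collect the words under a shared feature value.
groups = defaultdict(list)
for row in rows:
    groups[row["cell_type"]].append(row["word"])
print(dict(groups))  # {'plant': ['Apple', 'Banana'], 'animal': ['King', 'Queen']}
```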

This way, if we want to perform a search, create groups, or keep adding more diverse things, the summary of features will scale up, but... how much?

Summarizing the features even more

First of all, if we really want to scale things up, we need a little help describing things. Choosing a set of characteristics that is enough to summarize any object would be a very demanding task, and depending on the items, that set would probably change over time.

One interesting solution to this problem is to select pivot items, and score the similarity of any new item to those pivot items.

Note: A pivot here is like a reference point (⛳️); it lets us orient ourselves in a space that is not well defined.

Let's choose Apple and King as pivot items. We can then add more objects and create a similarity table:

word	Apple	King
King	0.0	1.0
Queen	0.0	0.8
Apple	1.0	0.0
Banana	0.8	0.0
Dog	0.2	0.4
Woman	0.1	0.6
Watermelon	0.8	0.0
Strawberry	0.7	0.0

Note that the similarity scores here are subjective.

We summarized 5 feature columns in just 2, and we gained something that will benefit us later: a numerical measure of similarity to each of the pivots.

This way we can sort the items according to their similarity to each pivot. For example, let's sort them by their similarity to the concept of King:

word	Apple	King
King	0.0	1.0
Queen	0.0	0.8
Woman	0.1	0.6
Dog	0.2	0.4
Apple	1.0	0.0
Banana	0.8	0.0
Watermelon	0.8	0.0
Strawberry	0.7	0.0
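The sorted table above can be reproduced with Python's built-in `sorted`. This is a small sketch that uses the subjective scores from the first similarity table; the tuple layout `(word, apple, king)` is just a choice made for this example.

```python
# (word, similarity to Apple, similarity to King), in the original table order.
rows = [
    ("King", 0.0, 1.0),
    ("Queen", 0.0, 0.8),
    ("Apple", 1.0, 0.0),
    ("Banana", 0.8, 0.0),
    ("Dog", 0.2, 0.4),
    ("Woman", 0.1, 0.6),
    ("Watermelon", 0.8, 0.0),
    ("Strawberry", 0.7, 0.0),
]

# Sort by the King column, highest similarity first.
# sorted() is stable, so rows tied at 0.0 keep their original relative order.
by_king = sorted(rows, key=lambda row: row[2], reverse=True)

for word, apple, king in by_king:
    print(f"{word}\t{apple}\t{king}")
```

Changing the key to `row[1]` sorts the same rows by their similarity to Apple instead.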

You can start to see how easy it is to keep adding items and sorting them by their subjective similarity to the pivots. Those pivots can be interpreted as the axes of a 2-dimensional plane, which means that each word has a magnitude (length) and a direction in that plane! Something with those characteristics is known as a Vector. For now, we'll keep plotting them as points.
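To make the magnitude and direction concrete, here is a minimal sketch that treats each word as a point (similarity to Apple, similarity to King). Putting Apple on the x axis and King on the y axis is an arbitrary choice for this example, and the helper names are hypothetical.

```python
import math

# (x, y) = (similarity to Apple, similarity to King), taken from the table.
vectors = {
    "King": (0.0, 1.0),
    "Apple": (1.0, 0.0),
    "Dog": (0.2, 0.4),
    "Woman": (0.1, 0.6),
}


def magnitude(v):
    """Length of the arrow from the origin (0, 0) to the point v."""
    return math.hypot(v[0], v[1])


def direction_degrees(v):
    """Angle of the arrow, measured counterclockwise from the Apple axis."""
    return math.degrees(math.atan2(v[1], v[0]))


for word, v in vectors.items():
    print(f"{word}: length={magnitude(v):.2f}, direction={direction_degrees(v):.1f} degrees")
```

King points straight along its own axis (90 degrees) and Apple along the other (0 degrees), while Dog and Woman land somewhere in between, leaning toward King.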

Now we can use sorting instead of just filtering. We can find things in the data more easily and define a similarity threshold to exclude the things that don't interest us.

But if we want to add thousands of items with their respective vectors to our table, do we have to think about and enter these similarities one by one? 😓

This question will be answered in the next part!

We'll see how we can create these tables for any object, and how a computer could enhance our experience while doing it!
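A similarity threshold could look like the sketch below, which keeps only the words whose score against a chosen pivot reaches a minimum value and returns them best match first. The scores come from the first similarity table, and `similar_to` is a hypothetical helper name.

```python
# word -> (similarity to Apple, similarity to King)
similarity = {
    "King": (0.0, 1.0),
    "Queen": (0.0, 0.8),
    "Apple": (1.0, 0.0),
    "Banana": (0.8, 0.0),
    "Dog": (0.2, 0.4),
    "Woman": (0.1, 0.6),
    "Watermelon": (0.8, 0.0),
    "Strawberry": (0.7, 0.0),
}


def similar_to(table, pivot, threshold):
    """Return (word, score) pairs scoring at least `threshold`, best first."""
    column = {"Apple": 0, "King": 1}[pivot]
    hits = [
        (word, scores[column])
        for word, scores in table.items()
        if scores[column] >= threshold
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


print(similar_to(similarity, "King", 0.5))
# [('King', 1.0), ('Queen', 0.8), ('Woman', 0.6)]
print(similar_to(similarity, "Apple", 0.75))
# [('Apple', 1.0), ('Banana', 0.8), ('Watermelon', 0.8)]
```

Raising the threshold gives fewer, closer matches; lowering it lets in looser ones such as Dog.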

See you soon! 😎
