Vector Search pt. 2 - Vision Algorithms 🎑

If we want to add thousands of items with their respective vectors to a table, do we have to think up and input those similarities one by one? 😓

There are algorithms that convert things from the real world into vectors in a multidimensional space. For example, if I pass in the image of a King, the algorithm will convert that image into a point in a space with numeric coordinates. These types of algorithms are called Feature Extractors, and they can extract vectors from Images, Texts, Voices, and more.
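As a toy illustration of the "image in, coordinates out" idea (not a real feature extractor), one could reduce an image to a tiny vector of summary statistics; real extractors learn much richer representations, but the interface is the same:

```python
import numpy as np

def toy_feature_extractor(image: np.ndarray) -> np.ndarray:
    """Reduce an RGB image of shape (H, W, 3) to a tiny feature vector.

    This is only a toy: the 'features' are the mean and standard deviation
    of each color channel. Real extractors learn their features from data
    instead of using hand-picked statistics.
    """
    pixels = image.reshape(-1, 3)
    means = pixels.mean(axis=0)
    stds = pixels.std(axis=0)
    return np.concatenate([means, stds])  # a point in a 6-dimensional space

# A fake 64x64 "image of a King", just to show the shapes involved
king_image = np.random.rand(64, 64, 3)
print(toy_feature_extractor(king_image))  # e.g. [0.51 0.49 0.50 0.29 0.29 0.28]
```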

General Process

The algorithms used to understand the world can be grouped into Vision Algorithms, Text Algorithms, Sound Algorithms, and Multimodal Algorithms.

Here we'll review the concept of Vision Algorithms, and hopefully in other articles we'll review the other types. There will also be some references to algorithms that you can explore and use in your own projects.

To make things a bit more intuitive, we'll start by looking at how we humans understand the world in terms of features.

How do we get the features of the world? 🌍

The world is a place full of sounds, colors, textures, flavors, and other sensations, and we have multiple ways of sensing it: we use our eyes, ears, tongue, skin, nose, and brain.

Our senses collect information about the world, and our brain interprets that information by building a mental map and matching it against previously learned maps. These maps are wired into us, in our neurons.

For example, each time we hear a dog barking, we match that sound to the concept of a dog, and we can visualize the dog, or even remember how a dog smells.

But the most interesting part is that those mental maps can be shared among things we don't fully know, giving us the ability to generalize to a lot of things.

For example, if we see a picture of a Wolf or hear it howling, we can understand that those signals come from a wolf, even if we have only ever seen one on TV.

Since Wolves and Dogs look similar to us but are not the same, we can infer that the neural circuits of both concepts are very similar, but due to a small difference in the neuron activations, our brains know that they are not quite the same thing.

This special feature of our brains allows us to infer that wolves can bite, run very fast, and like meat, just like dogs.

Neural mapping

So given this ability of us, one could ask the question:
Could this process be replicated by a computer?

If it is possible, we could make a computer understand that a dog and a wolf are similar concepts, just by using words, sounds, or images!

Vision Algorithms 📷

To make computers understand image data, we came up with a fascinating set of algorithms called Vision Algorithms. Classically they were used to detect edges, corners, letters, color histograms, etc., but these days they are used in a broad range of applications thanks to Deep Learning.

The many types of vision algorithms usually work with a Camera as the sensor, and the algorithm will look for shapes, colors, textures, or patterns in general to understand the captured image.

The camera would be like the eyes of the system, and the algorithm would be like the brain: it matches the sensed signal with a previously learned pattern or condition and returns something.

Modern image algorithms for feature extraction are usually based on deep learning. They use something called convolutions to sequentially extract important landmarks from the images and finally compress the most important ones into a feature vector, but overall they can be oversimplified as something like:

Convolve learned patterns with an Image
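As a rough sketch of that idea, here is what convolving a small pattern (a kernel) over an image looks like in PyTorch. The kernel values below are hand-written just for illustration; in a trained convolutional network they are learned from data:

```python
import torch
import torch.nn.functional as F

# A fake grayscale image: batch of 1, 1 channel, 8x8 pixels
image = torch.rand(1, 1, 8, 8)

# A 3x3 "pattern" (kernel). In a trained CNN these weights are learned;
# this hand-written one happens to respond to vertical edges.
kernel = torch.tensor([[[[-1., 0., 1.],
                         [-1., 0., 1.],
                         [-1., 0., 1.]]]])

# Sliding the pattern over the image produces a feature map:
# high values where the image locally matches the pattern.
feature_map = F.conv2d(image, kernel, padding=1)
print(feature_map.shape)  # torch.Size([1, 1, 8, 8])
```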

These modern algorithms could be trained in two common ways:

  • Supervised.
  • Semi-supervised/Self-supervised.

Supervised

The supervised method to train feature extractors consists of training a convolutional model to classify an image into one of several labels.

One common set of labels is the one from the ImageNet dataset, which contains 1000 object classes. After training the algorithm on the ImageNet data, we keep only the extractor part of the full model.

Classification model with the Feature Extractor Part highlighted

Then the Feature Extractor part can be used to get the embeddings in the latent space.

This is the method used for the MobileNet, EfficientNet, and AlexNet models.
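As a sketch of "keeping only the extractor part", here is one way to do it with a torchvision MobileNetV2 pretrained on ImageNet: drop the 1000-class classifier head and keep the convolutional backbone (the exact attribute names depend on the model family):

```python
import torch
from torchvision import models

# Load a classifier pretrained on ImageNet (1000 classes)
full_model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
full_model.eval()

# Keep only the convolutional backbone; discard the classification head
feature_extractor = full_model.features

with torch.no_grad():
    image = torch.rand(1, 3, 224, 224)            # a fake preprocessed image
    feature_map = feature_extractor(image)        # shape (1, 1280, 7, 7)
    embedding = feature_map.mean(dim=[2, 3])      # global average pool -> (1, 1280)

print(embedding.shape)  # the image is now a point in a 1280-dimensional space
```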

Semi-supervised/Self-supervised

These methods consist of using an untrained model with an error function (a measurement of correctness) based on a Contrastive Loss.

The Contrastive Loss measures how similar two vectors are in the latent space (vector space). It is useful for checking the similarity between any kind of vectors, even if they were extracted using different approaches.
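At its core, the "how similar are two vectors" part is usually measured with cosine similarity. A tiny illustration with made-up vectors:

```python
import torch
import torch.nn.functional as F

# Made-up 3-dimensional embeddings, just to show the idea
dog_embedding = torch.tensor([0.9, 0.1, 0.4])
wolf_embedding = torch.tensor([0.8, 0.2, 0.5])
banana_embedding = torch.tensor([0.1, 0.9, 0.0])

print(F.cosine_similarity(dog_embedding, wolf_embedding, dim=0))    # close to 1.0
print(F.cosine_similarity(dog_embedding, banana_embedding, dim=0))  # much lower
```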

Here we'll explain how to use this method using only images.

The approach that uses only images is called SimCLR. It consists of applying some kind of perturbation to a given image, then computing its embedding, and finally checking with the contrastive loss how similar the embedding of the perturbed image is to the embedding of the original, unperturbed image.

SimCLR Method explanation extracted from the DeepLearning with Pytorch Course

The magnitude of the difference is used to update the feature-extractor model. Over time the model gets trained using just a set of images and their perturbed twins, which is very useful when we don't have a lot of labeled data.
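Here is a very condensed sketch of one SimCLR-style training step, using a random crop and color jitter as the perturbation and the cosine similarity between the two embeddings as the training signal. The real method works on batches of images and uses the NT-Xent loss with negative pairs (plus a projection head), which is omitted here to keep the idea readable:

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms

encoder = models.resnet18(weights=None)   # an untrained feature extractor
encoder.fc = torch.nn.Identity()          # drop the classification head
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

perturb = transforms.Compose([
    transforms.RandomResizedCrop(224),            # the "perturbation"
    transforms.ColorJitter(0.4, 0.4, 0.4),
])

image = torch.rand(1, 3, 224, 224)        # stands in for a real unlabeled image

z_original = encoder(image)               # embedding of the original image
z_perturbed = encoder(perturb(image))     # embedding of its perturbed twin

# Push the two embeddings together: low similarity -> high loss -> big update
loss = 1 - F.cosine_similarity(z_original, z_perturbed).mean()
loss.backward()
optimizer.step()
```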

Other contrastive learning techniques can be used with embeddings from images, text, audio, etc. It is the key component behind multimodal models, and we'll probably review it in a future article.

Image-based Latent Space

The latent space is the space where images are represented by their vectors; after extracting the features of a given set of images, we get a latent space.

Remember how in the previous article there was a space of fruits and a royal family?

Latent space of fruits and royal people

That space can be obtained by taking a lot of images of those concepts and extracting their Feature Vectors.

If we have a well-trained algorithm that has also seen some animals, and has a deep understanding of them, we could use that Vision Algorithm to get the vector of a little Monkey in the latent space, and we'd obtain something like:

Obtaining the feature vector of a little monkey and putting it on the latent space of the previous article

That way the Monkey concept obtained from passing a monkey image into a Vision Model would be somewhat close to a banana, and also close to the King and Queen concepts, since we are all mammals.

Isn't it awesome?

Businesses use these algorithms all the time to extract product embeddings for things like clocks, toys, shoes, etc. Used in conjunction with retrieval systems, one can find things that are similar in the feature space using just images!
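A tiny sketch of that kind of image-to-image retrieval, assuming we already have an embedding matrix for a product catalog and an embedding for the query image (both could come from an extractor like the one sketched earlier; the random vectors here are just placeholders):

```python
import torch
import torch.nn.functional as F

# Pretend catalog: 1000 products, each already embedded into 1280 dimensions
catalog_embeddings = F.normalize(torch.rand(1000, 1280), dim=1)
query_embedding = F.normalize(torch.rand(1, 1280), dim=1)

# Cosine similarity between the query and every product, then the top-5 matches
similarities = query_embedding @ catalog_embeddings.T   # shape (1, 1000)
top_scores, top_indices = similarities.topk(5)
print(top_indices)  # indices of the 5 most visually similar products
```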

Closing

We've seen that we can represent concepts in an N-dimensional space using only images, and that these algorithms mimic, in some way, what the brain does.

I hope you've learned something new from this article, and I'll be glad if you send me your questions, correct me, or simply comment what you think of it! 💯

PS
The next post could be an overview of multimodal Feature Extractors, or of the main concepts behind vector Retrieval Systems. I haven't decided yet. 😅

And since I'm kind of slow at writing these kinds of articles, I may post some non-technical ones in the meantime.

So, take care, and keep learning!
