<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: skeltsyboiii</title>
    <description>The latest articles on DEV Community by skeltsyboiii (@skelts_tensor_searcher).</description>
    <link>https://dev.to/skelts_tensor_searcher</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F950985%2F8bf412ac-d9b0-4cf5-8973-2c72fc6663e0.png</url>
      <title>DEV Community: skeltsyboiii</title>
      <link>https://dev.to/skelts_tensor_searcher</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/skelts_tensor_searcher"/>
    <language>en</language>
    <item>
      <title>Image search with localization and open-vocabulary reranking using Marqo, yolox, CLIP and OWL-ViT</title>
      <dc:creator>skeltsyboiii</dc:creator>
      <pubDate>Fri, 16 Dec 2022 05:53:01 +0000</pubDate>
      <link>https://dev.to/skelts_tensor_searcher/image-search-with-localization-and-open-vocabulary-reranking-using-marqo-yolox-clip-and-owl-vit-154h</link>
      <guid>https://dev.to/skelts_tensor_searcher/image-search-with-localization-and-open-vocabulary-reranking-using-marqo-yolox-clip-and-owl-vit-154h</guid>
      <description>&lt;p&gt;TL;DR: Here we show how image search can be evolved to add localization and re-ranking by leveraging Marqo, yolox, CLIP and OWL-ViT. Adding the extra dimension of localization can improve retrieval performance and enable new use cases for image search while also helping with explainability. Re-ranking with an open vocabulary detection model allows for even finer-grained localisation. The first part of the article covers background information while the second part contains working code (also found &lt;a href="https://github.com/marqo-ai/marqo/blob/mainline/examples/ImageSearchLocalization/index_all_data.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmu3ycvaadib9w5dcrow.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmu3ycvaadib9w5dcrow.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Image search has come a long way. Originally, if you wanted to search a collection of images, you would use keyword-based search over manually curated metadata. Vector representations followed and provided a more direct way to query the image content. This developed further with the advent of cross-modal models like CLIP that allow searching images with natural language. Here we show how image search can be evolved further using a modern search stack that adds localization and the ability to re-rank.&lt;/p&gt;

&lt;h2&gt;
  
  
  Image Search
&lt;/h2&gt;

&lt;p&gt;Popular modern image search (or image retrieval) is often based on embedding images into a latent space (e.g. by transforming images into vectors or tensors). A query is embedded into the same space, and search results are found by locating the closest matching embeddings and returning their corresponding images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7dzmbidw4t19co1b6sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7dzmbidw4t19co1b6sj.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
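
&lt;p&gt;As a minimal sketch of this idea (independent of Marqo), the sentence-transformers wrapper around CLIP can embed images and text into the same space; the image filenames here are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Embed a small collection of images and a text query with CLIP.
model = SentenceTransformer('clip-ViT-B-32')
image_embeddings = model.encode([Image.open('cat.jpg'), Image.open('dog.jpg')])
query_embedding = model.encode('a photo of a cat')

# Cosine similarity between the query and every image embedding;
# the best hit is the image whose embedding is closest to the query.
scores = util.cos_sim(query_embedding, image_embeddings)
best = scores.argmax().item()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;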

&lt;p&gt;This single-stage retrieval based on embeddings is the same approach that has become popular in many natural language processing applications like information retrieval, question answering and chatbots. In many of these applications the matching documents are not just presented as part of the results; the part of the text that best matches the query is also highlighted. This highlighting is what we can bring to image search via localization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fq0p43he5mrmougsk15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fq0p43he5mrmougsk15.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Image Search + Localisation
&lt;/h2&gt;

&lt;p&gt;There are a number of ways to get localization in image search, and there is a strong latency-relevancy trade-off, as more sophisticated methods take longer to process. Broadly, there are two categories of localization: (1) heuristic, where a rule is used to obtain the localization, and (2) model-based, where another ML model provides the localization. The localization can also happen at the time of indexing ("index-time partitioning") or after an initial set of search results has been returned ("search-time localization"). The latter is akin to the second-stage re-ranker in traditional two-stage retrieval systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index-time Partitioning
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqt4s4bdewifsoczgrux.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqt4s4bdewifsoczgrux.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we explain index-time partitioning for localization. In the indexing step, the image is partitioned into patches. Each of these patches, along with the original image, is embedded and stored in an index. This has the advantage that the time penalty is (mostly) paid once at indexing time rather than at search time.&lt;br&gt;
In the retrieval step the query is embedded and compared not just to the original image but to all the patches as well. This allows the location of the matching sub-image to also be returned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnd3f0layk0q72so1fua8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnd3f0layk0q72so1fua8.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Variations of this approach can also be used for 'augment-time indexing'. Instead of being broken into sub-patches, the image is augmented any number of times using any number of operations. Each of these augmented images is then stored in the same way the sub-patches were, as sketched below.&lt;/p&gt;
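
&lt;p&gt;A sketch of what the augmentation step could look like with torchvision; the particular transforms chosen here are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from PIL import Image
import torchvision.transforms as T

# Build several augmented views of one image; each view would be
# embedded and stored alongside the original, just like the patches.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.2),
])
image = Image.open('image.jpg')  # placeholder file
augmented_views = [augment(image) for _ in range(4)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
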
&lt;h3&gt;
  
  
  Heuristic partitioning methods
&lt;/h3&gt;

&lt;p&gt;As explained above, one of the simplest ways to get localisation in image search is to partition the image into patches. A rule or heuristic is used to crop the image into sub-images, and the embeddings for each of these patches are stored. The simplest scheme is to split the image into an N x M grid of equally sized patches and embed those. More sophisticated partitioning can be performed by other machine learning models like object detectors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksgdrvbbidkpqw81h0b2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksgdrvbbidkpqw81h0b2.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;
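
&lt;p&gt;A minimal sketch of the N x M grid heuristic with PIL (not Marqo's internal implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from PIL import Image

def grid_patches(image_path, n=3, m=3):
    """Split an image into an n x m grid of equally sized patches.

    Returns (patch, box) pairs, where box is (left, upper, right, lower)
    in pixel coordinates; each patch would be embedded and indexed.
    """
    image = Image.open(image_path)
    width, height = image.size
    patches = []
    for row in range(n):
        for col in range(m):
            box = (col * width // m, row * height // n,
                   (col + 1) * width // m, (row + 1) * height // n)
            patches.append((image.crop(box), box))
    return patches
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
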
&lt;h3&gt;
  
  
  Model-based partitioning methods
&lt;/h3&gt;

&lt;p&gt;For the model-based approaches, we ideally want "important" or relevant parts of the image to be detected by a model and proposed as the sub-images. Different use cases have different requirements, but surprisingly good results can be achieved with fairly generic models. To achieve this we can exploit some properties of object detectors and attention-based (i.e. transformer) models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34qwe7b1spy1zil8lff8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34qwe7b1spy1zil8lff8.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, in two-stage detectors like Faster-RCNN the first stage consists of a region proposal network (RPN). The RPN proposes regions of the image that contain objects of interest, before any fine-grained classification occurs. The RPN is trainable and can be used to propose potentially interesting parts of an image. Alternatives are to use a fast, lighter-weight detector like yolo and make the output boxes the proposed regions (ignoring class), or to use "objectness" scores to rank the proposed (now class-agnostic) boxes. Finally, other alternatives exist, like models that output "saliency" maps, which can be obtained through supervised learning or through self-supervised methods like DINO. DINO has the added benefit that, since it is self-supervised, fine-tuning on custom datasets is simple, providing a way to get domain-specific localisation.&lt;/p&gt;
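
&lt;p&gt;As a sketch, once a detector has produced boxes and class-agnostic objectness scores, the proposals can be filtered with NMS and capped at a maximum count; the boxes and scores below are made-up stand-ins for real detector output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
from torchvision.ops import nms

# Hypothetical detector output: boxes as (x1, y1, x2, y2) plus
# class-agnostic "objectness" scores.
boxes = torch.tensor([[10., 10., 120., 120.],
                      [15., 12., 125., 118.],
                      [200., 40., 300., 160.]])
objectness = torch.tensor([0.9, 0.8, 0.7])

# Class-agnostic NMS, then cap the number of proposals per image.
keep = nms(boxes, objectness, iou_threshold=0.5)
proposals = boxes[keep][:10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
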
&lt;h2&gt;
  
  
  Re-ranking
&lt;/h2&gt;

&lt;p&gt;An alternative to single-stage retrieval is two-stage retrieval. In the first stage, an initial set of candidate documents is retrieved; in the second stage, those results are re-ranked (i.e. re-ordered) by another model or heuristic, which can also supply localization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbodg2e0vosh2jrn5iyz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbodg2e0vosh2jrn5iyz.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the reasons for this architecture is to strike a balance between speed and relevancy. The first stage can trade some relevancy for speed to provide a fast initial selection, while the re-ranker can then produce a better final ordering using a different model. The re-ranker can be used to add diversity or context (e.g. personalisation) to the result ranking, or to add other things like localization (for images or videos). The diagram above illustrates this: the first-stage retrieval of images comes from dense embeddings (e.g. from CLIP), while the second-stage re-ranker re-orders them based on a second (different) model.&lt;/p&gt;
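
&lt;p&gt;The pattern itself is simple. A sketch, where first_stage and reranker are placeholder callables standing in for the two models:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def two_stage_search(query, first_stage, reranker, k=100, top_n=10):
    """First stage: cheap retrieval of k candidates with fast scores.
    Second stage: re-order those candidates with a slower, better model.
    """
    candidates = first_stage(query, k)  # [(doc, fast_score), ...]
    rescored = [(doc, reranker(query, doc)) for doc, _ in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)[:top_n]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
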
&lt;h3&gt;
  
  
  Search-time localization as re-ranking
&lt;/h3&gt;

&lt;p&gt;As we saw earlier, localization can be introduced by dividing the images into patches at index time and then searching across each image and its child patches. An alternative approach is to defer the localization to the second stage via a re-ranker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbfy7zfguue6mnnnni9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbfy7zfguue6mnnnni9d.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are multiple ways to do this. For example, you could do what was done at indexing time and divide each retrieved image into patches. However, doing that on its own ignores the crucial thing we now have: the query. If we blindly divide the images and then try to match the query against the patches, the additional information from the query is not used as effectively as it could be. Instead, the proposal mechanism can be conditioned on the query. The results can then be re-ordered, for example by using the scores of the proposed regions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uaihtpr2ff70r9b4vh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uaihtpr2ff70r9b4vh7.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Conditioning the proposals on the query has its roots in tasks like visual question answering. This differs from other object detection problems in that the output is no longer restricted to a fixed vocabulary of objects but can take free-form queries ('open vocabulary'). One good candidate model for this is OWL-ViT (Vision Transformer for Open-World Localization). OWL-ViT is a zero-shot, text-conditioned object detection model. It uses CLIP as its backbone, with a vision transformer for the visual features and a causal language model for the text features. Open-vocabulary classification is enabled by replacing the fixed classification head with class-name embeddings obtained from the text model.&lt;/p&gt;
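
&lt;p&gt;For reference, OWL-ViT can also be run directly through Hugging Face transformers. This sketch is independent of Marqo, and the post-processing method names can differ slightly between transformers versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained('google/owlvit-base-patch32')
model = OwlViTForObjectDetection.from_pretrained('google/owlvit-base-patch32')

image = Image.open('image.jpg')  # placeholder file
inputs = processor(text=[['a photo of broccoli']], images=image,
                   return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits into boxes in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes)
print(results[0]['boxes'], results[0]['scores'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
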
&lt;h1&gt;
  
  
  Putting it all together
&lt;/h1&gt;

&lt;p&gt;The previous section explained how image search works in general and how localization can be incorporated at both index and search time. This section walks through a full example with working code that demonstrates each of these in practice.&lt;/p&gt;
&lt;h2&gt;
  
  
  Image dataset
&lt;/h2&gt;

&lt;p&gt;For this example we are using about 10,000 images of various everyday objects. Here are some example images:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8wn2whe5kywpvcmdsr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8wn2whe5kywpvcmdsr5.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are going to index this dataset using a couple of different methods and then search with and without the localization based reranker.&lt;/p&gt;
&lt;h2&gt;
  
  
  Starting Marqo
&lt;/h2&gt;

&lt;p&gt;We will be using Marqo to do the image search with localization that was explained previously (full code is also here). To start Marqo, run the following from your terminal (assuming a CUDA-compatible GPU):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --name marqo -it --privileged -p 8882:8882 --gpus all --add-host host.docker.internal:host-gateway -e MARQO_MODELS_TO_PRELOAD='[]' marqoai/marqo:0.0.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If no GPU is available, remove the --gpus all flag from the above command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing the documents
&lt;/h2&gt;

&lt;p&gt;We can either use the S3 URLs directly or download the images and use them locally (see here for details). For now we will use the URLs directly and create the documents for indexing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;files.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;s3_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_uri&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s3_uri&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3_uri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Indexing with localization
&lt;/h2&gt;

&lt;p&gt;Now that we have the documents, we are going to index them three ways: with no index-time localization, with DINO, and with yolox. We set up the client and the base settings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;marqo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# setup the settings so we can comapre the different methods
&lt;/span&gt;&lt;span class="n"&gt;patch_methods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dino-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yolox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index_defaults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;treat_urls_and_pointers_as_images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_preprocessing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch_method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ViT-B/32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normalize_embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use the different methods we simply change the method name. We will iterate through each method and index the images into a separate index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;patch_method&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;patch_methods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;index_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;visual_search-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch_method&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     

    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;index_defaults&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_preprocessing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;patch_method&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;patch_method&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;settings_dict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# index the documents on the GPU using multiple processes
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                &lt;span class="n"&gt;server_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If no GPU is available, set device='cpu'.&lt;/p&gt;

&lt;h2&gt;
  
  
  Searching with localization
&lt;/h2&gt;

&lt;p&gt;Now we will demonstrate how to use the two different methods to get localization in image search.&lt;/p&gt;

&lt;h3&gt;
  
  
  Search using index time localization 
&lt;/h3&gt;

&lt;p&gt;We can now perform some searches against our indexed data and see the localization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brocolli&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hits&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see in the highlights field the coordinates of the bounding box that best matched the query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;bbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hits&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_highlights&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bbox&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
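
&lt;p&gt;To visualise a highlight, the box can be drawn onto the image, for example with PIL. This sketch assumes the highlight is a pixel-coordinate box of the form (x1, y1, x2, y2); check the exact format your Marqo version returns.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from PIL import Image, ImageDraw

# 'image.jpg' stands in for the hit's image; bbox is the highlight
# printed above (assumed to be (x1, y1, x2, y2) pixel coordinates).
image = Image.open('image.jpg')
draw = ImageDraw.Draw(image)
draw.rectangle(bbox, outline='red', width=3)
image.save('image_with_highlight.jpg')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;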



&lt;p&gt;The top six results are shown below with their corresponding top bounding box highlight.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbc1a7w169tsggux7q4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbc1a7w169tsggux7q4p.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The method here uses a pre-trained yolox model to propose the bounding boxes at indexing time, and each of the sub-images is indexed alongside the original. Some filtering and non-max suppression (NMS) is applied, and the maximum number of proposals per image is capped at ten. The class-agnostic scores are used for the NMS. We can also look at the results from another method, named dino-v2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0upewu0x6rh3puif2ay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0upewu0x6rh3puif2ay.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dino-v2 uses base transformer models from DINO, a self-supervised representation learning method. Apart from being useful as a pre-training step, the attention maps from these models tend to focus on the objects within the images. These attention maps can be used to determine the salient or important parts of the images. The nice thing about this method is that it is self-supervised and does not require labels or bounding boxes. It is also amenable to fine-tuning on domain-specific data to provide better localization for specific tasks. The difference between dino-v1 and dino-v2 is that the proposals for v2 are generated per attention map, while v1 uses a summed attention map. This means v1 generates fewer proposals than v2 (and requires less storage).&lt;/p&gt;
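
&lt;p&gt;As a sketch of where those attention maps come from, a DINO backbone can be loaded from torch.hub and queried for the attention of the CLS token over the image patches; the file name and input size here are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torchvision.transforms as T
from PIL import Image

# Load a self-supervised DINO ViT-S/16 from torch.hub.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
img = transform(Image.open('image.jpg').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    attn = model.get_last_selfattention(img)  # (1, heads, tokens, tokens)

heads = attn.shape[1]
# Attention from the CLS token to each of the 14x14 image patches:
# one saliency map per head (dino-v2 style proposals), or summed over
# heads to give a single map (dino-v1 style).
cls_attn = attn[0, :, 0, 1:].reshape(heads, 14, 14)
summed = cls_attn.sum(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;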

&lt;h2&gt;
  
  
  Search using search time localization
&lt;/h2&gt;

&lt;p&gt;As described earlier, the alternative way to get localization is to have an object detector act as a re-ranker and localizer. In Marqo we can specify the model to use for re-ranking; here it is OWL-ViT. OWL-ViT is an open-vocabulary object detector that generates proposals after being conditioned on a text prompt (or query). This conditional localisation is ideal for a re-ranker, since we have the query to condition the model with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brocolli&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;searchable_attributes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owl/ViT-B/32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hits&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The localisation provided by the reranker does not require any index-time localisation. It can even be used with lexical search, which does not use any embeddings for the first-stage retrieval.&lt;br&gt;
We can see in the highlights field the coordinates of the bounding box that best matched the query after reranking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;bbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hits&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_highlights&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bbox&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can plot these results as well. The localisation is better here, as the proposals are generated in conjunction with the query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlg6ah14oy4oxuik33kr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlg6ah14oy4oxuik33kr.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We have shown how a two-stage retrieval system enables multiple avenues for adding localisation to image search. We showed how yolox and DINO can be leveraged to provide index-time localisation, and OWL-ViT was shown as a second-stage reranker that also provides localisation. The methods discussed allow for a variety of trade-offs, including speed and relevancy. To see how many more applications like this can be built, check out Marqo!&lt;/p&gt;

</description>
      <category>computervision</category>
      <category>machinelearning</category>
      <category>localization</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I used Marqo to create a multilingual legal database in 5 key lines of code</title>
      <dc:creator>skeltsyboiii</dc:creator>
      <pubDate>Fri, 21 Oct 2022 09:00:08 +0000</pubDate>
      <link>https://dev.to/skelts_tensor_searcher/how-i-used-marqo-to-create-a-multilingual-legal-database-in-5-key-lines-of-code-2m32</link>
      <guid>https://dev.to/skelts_tensor_searcher/how-i-used-marqo-to-create-a-multilingual-legal-database-in-5-key-lines-of-code-2m32</guid>
      <description>&lt;p&gt;The European Union has to deal with a peculiar problem — it has 24 official languages across 27 countries and these countries must abide by EU law. Experts in EU law have the complex task of navigating legal material in multiple languages.&lt;/p&gt;

&lt;p&gt;What if there was a system where a user (like a lawyer) could search through a database of documents in their preferred language, and get the closest matching document in another? What if this user wanted to give access to this database to a colleague who uses a different language?&lt;br&gt;
In this article, we present a solution that can search across multiple languages using a multilingual legal database built using &lt;a href="https://github.com/marqo-ai/marqo"&gt;&lt;strong&gt;Marqo&lt;/strong&gt;&lt;/a&gt;, an open source tensor search engine, in just 5 key lines of code.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The dataset
&lt;/h3&gt;

&lt;p&gt;The MultiEURLEX dataset is a collection of 65 thousand laws in 23 EU languages. EU laws are published in all member languages. This means that we may come across the same law in multiple languages.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scope for this proof of concept
&lt;/h3&gt;

&lt;p&gt;In the interest of time and for ease of replication, this proof of concept will be a database storing documents from two languages: German and English. We will also only use the dataset’s validation splits, with 5000 documents from each language. Note that the machine learning model that Marqo will be using, stsb-xlm-r-multilingual (more about this model can be found &lt;a href="https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models"&gt;here&lt;/a&gt; and &lt;a href="https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models"&gt;here&lt;/a&gt;) can handle many more languages than just &lt;a href="https://metatext.io/models/sentence-transformers-stsb-xlm-r-multilingual"&gt;these two&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The solution was run on an &lt;em&gt;ml.g4dn.2xlarge&lt;/em&gt; AWS machine. This comes with an NVIDIA T4 GPU. The GPU speeds up the Marqo machine learning model, which processes our documents as we insert them. These AWS machines are very easy to set up as &lt;a href="https://aws.amazon.com/pm/sagemaker/"&gt;SageMaker Jupyter Notebook instances&lt;/a&gt;.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;If we were to develop this on a traditional SQL database or search engine, we’d have to manually create a translation layer to process the queries, and link each document with handcrafted or machine-generated translations.&lt;/p&gt;

&lt;p&gt;An example of this would be to translate all the documents into English as they are stored. The search query would also be translated into English, and a keyword search would be performed using a technology like Elasticsearch. However, this is problematic: a translated sentence is a lossy approximation of the source language, and real-time translation adds a significant component to the system. The result is poorer search relevancy, worse latency, and additional system complexity.&lt;/p&gt;

&lt;p&gt;Tensor search, the technology that powers Marqo, can outperform traditional keyword search methods on tasks like this.&lt;/p&gt;

&lt;p&gt;First, we set up a Marqo instance on the machine, which has Docker installed. Notice the &lt;code&gt;--gpus all&lt;/code&gt; option. This allows Marqo to use any GPUs it finds on the machine. If the machine you are using doesn’t have GPUs, remove this option from the command.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker rm -f marqo;&lt;br&gt;
docker run --name marqo -it --privileged -p 8882:8882 --gpus all \&lt;br&gt;
  --add-host host.docker.internal:host-gateway marqoai/marqo:0.0.3&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We use pip to install the Marqo client (&lt;code&gt;pip install marqo&lt;/code&gt;) and the datasets python package (&lt;code&gt;pip install datasets&lt;/code&gt;). We will use the &lt;code&gt;datasets&lt;/code&gt; package from &lt;a href="https://huggingface.co/docs/datasets/index"&gt;Hugging Face&lt;/a&gt; to import the MultiEURLEX dataset.&lt;/p&gt;

&lt;p&gt;Then, we start work on our Python script. We start by loading the validation splits for the English and German datasets:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from datasets import load_dataset&lt;br&gt;
dataset_en = load_dataset('multi_eurlex', 'en', split="validation")&lt;br&gt;
dataset_de = load_dataset('multi_eurlex', 'de', split="validation")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We then import Marqo and set up the client. We tell the Marqo client to connect with the Marqo Docker container that we ran earlier.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from marqo import Client&lt;br&gt;
mq = Client("http://localhost:8882")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then, add a line telling Marqo to create the multilingual index:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mq.create_index(index_name='my-multilingual-index', model='stsb-xlm-r-multilingual')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Notice that this is where we tell Marqo which model to use. After this, we’ll iterate through each dataset, indexing each document as we go.&lt;br&gt;
One small adjustment we’ll make is to split up the text of very long documents (over 100k characters) to make them easier to index and search.&lt;br&gt;
At the end of each loop, we call the &lt;code&gt;add_documents()&lt;/code&gt; function to insert the document:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mq.index(index_name="my-multilingual-index").add_documents(&lt;br&gt;
    device="cuda", auto_refresh=False,&lt;br&gt;
    documents=[{&lt;br&gt;
        "_id": doc_id,&lt;br&gt;
        "language": lang,&lt;br&gt;
        "text": sub_doc,&lt;br&gt;
        "celex_id": doc["celex_id"],&lt;br&gt;
        "labels": str(doc["labels"])&lt;br&gt;
    }]&lt;br&gt;
)&lt;/code&gt;&lt;/p&gt;
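
&lt;p&gt;For context, here is a sketch of the surrounding loop that produces &lt;code&gt;doc_id&lt;/code&gt;, &lt;code&gt;lang&lt;/code&gt; and &lt;code&gt;sub_doc&lt;/code&gt;; the chunking and ID scheme here are illustrative rather than the exact code from the repo:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;max_len = 100000  # split documents longer than 100k characters&lt;br&gt;
for lang, dataset in [("en", dataset_en), ("de", dataset_de)]:&lt;br&gt;
    for i, doc in enumerate(dataset):&lt;br&gt;
        text = doc["text"]&lt;br&gt;
        sub_docs = [text[j:j + max_len] for j in range(0, len(text), max_len)]&lt;br&gt;
        for j, sub_doc in enumerate(sub_docs):&lt;br&gt;
            doc_id = f"{lang}-{i}-{j}"&lt;br&gt;
            # the add_documents() call above goes here&lt;/code&gt;&lt;/p&gt;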

&lt;p&gt;Here we set the device argument as &lt;code&gt;"cuda"&lt;/code&gt;. This tells Marqo to use the GPU it finds on the machine to index the document. If you don’t have a GPU, remove this argument or set it to &lt;code&gt;"cpu"&lt;/code&gt;. We encourage using a GPU as it will make the &lt;code&gt;add_documents&lt;/code&gt; process significantly faster (our testing showed a 6–12x speed up).&lt;/p&gt;

&lt;p&gt;We also set the &lt;code&gt;auto_refresh&lt;/code&gt; argument to &lt;code&gt;False&lt;/code&gt;. When indexing large volumes of data we encourage you to set this to &lt;code&gt;False&lt;/code&gt;, as it optimises the indexing process.&lt;/p&gt;

&lt;p&gt;And that’s the indexing process! Run the script to fill up the Marqo index with documents. It took us around 45 minutes with an AWS &lt;em&gt;ml.g4dn.2xlarge&lt;/em&gt; machine.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Searching the index
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mfSappxv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/720/1%2AHdLMiGn6wP-oN0nSQkxY3g.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mfSappxv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/720/1%2AHdLMiGn6wP-oN0nSQkxY3g.gif" alt="Robot_EU" width="720" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ll define the following search function that sets some parameters for the call to Marqo:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# pprint is a built-in Python module that prints data in a readable way&lt;br&gt;
import pprint&lt;br&gt;
def search(query: str):&lt;br&gt;
    result = mq.index('my-multilingual-index').search(&lt;br&gt;
            q=query, searchable_attributes=["text"]&lt;br&gt;
    )&lt;br&gt;
    for res in result["hits"]:&lt;br&gt;
        pprint.pprint(res["_highlights"])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The first thing to notice is the call to the Marqo &lt;code&gt;search()&lt;/code&gt; function. We set &lt;code&gt;searchable_attributes&lt;/code&gt; to the &lt;code&gt;"text"&lt;/code&gt; field, because this is the field that holds the content relevant for searching.&lt;/p&gt;

&lt;p&gt;We could print out the result straight away, but it contains the full original documents. These can be huge. Instead, we’ll just print out the highlights from each document. These highlights also show us what part of the document Marqo found most relevant to the search query. We do this by printing the &lt;code&gt;_highlights&lt;/code&gt; attribute from each hit.&lt;/p&gt;

&lt;p&gt;We search by passing a string query to the search function. For the search with query string:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Laws about the fishing industry”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We get the following results as the top 2 highlights:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{'text': 'Consequently, catch limits and fishing effort limits for the cod stocks in the Baltic Sea should be established in accordance with the rules laid down in Council Regulation (EC) No 1098/2007 of 18 '…&lt;br&gt;
{'text': '(18)'&lt;br&gt;
 'Bei der Nutzung der Fangmöglichkeiten ist geltendes Unionsrecht uneingeschränkt zu befolgen -'&lt;br&gt;
 'HAT FOLGENDE VERORDNUNG ERLASSEN:'&lt;br&gt;
 'TITEL I'&lt;br&gt;
 'GELTUNGSBEREICH UND BEGRIFFSBESTIMMUNGEN'&lt;br&gt;
 'Artikel 1'…&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The second result is from a German document. Using Google Translate, the German document’s first line translates to&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“When using the fishing opportunities, applicable Union law to be strictly followed”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Using Google Translate to translate the original fishing law query string into German gives us:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Gesetze über die Fischereiindustrie”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Searching with this string gives us similar results to the English version of the query. The first result is an English document, with the same highlight as the English query. Marqo identifies both query strings as having similar meaning.&lt;/p&gt;

&lt;p&gt;Because we added the language code as a property of each document, we can filter for certain languages. We add a filter string to the search query:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mq.index(index_name='my-multilingual-index').search(&lt;br&gt;
    q=query,&lt;br&gt;
    searchable_attributes=['text'],&lt;br&gt;
    filter_string='language:en'&lt;br&gt;
)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Searching with this filter for &lt;em&gt;“Gesetze über saubere Energie”&lt;/em&gt; (Google translation of &lt;em&gt;“Laws about clean energy”&lt;/em&gt;) yields only English language results. The top 3 results are:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The electricity and water consumptions of products subject to this Regulation should be made more efficient by applying existing…&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Products subject to this Regulation should be made more energy efficient by applying existing non-proprietary cost-effective…&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The electricity consumption of products subject to this Regulation should be made more efficient by applying existing non-proprietary cost-effective technologies that can reduce the combined costs of purchasing and operating these products…&lt;/em&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Marqo is a tensor search engine that can be deployed in just 3 lines of code and can solve search problems using the latest ML models from Hugging Face and OpenAI. In this article I showed how I used Marqo to quickly set up a multilingual legal database.&lt;/p&gt;

&lt;p&gt;Marqo makes tensor search easy. Without needing to be a machine learning expert, you can use cutting-edge machine learning models to create an unrivalled search experience with minimal code. Check out the full code for the demo &lt;a href="https://github.com/marqo-ai/marqo/tree/mainline/examples/MultiLingual"&gt;here&lt;/a&gt;. Check out (and contribute to, if you can!) our open source codebase &lt;a href="https://github.com/marqo-ai/marqo"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
