<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jina AI</title>
    <description>The latest articles on DEV Community by Jina AI (@jinaai).</description>
    <link>https://dev.to/jinaai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F993482%2F7037ac7f-0400-461b-8ad4-10fde3c9bcdd.jpg</url>
      <title>DEV Community: Jina AI</title>
      <link>https://dev.to/jinaai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jinaai"/>
    <language>en</language>
    <item>
      <title>How to Use Every Vector Database in Python with DocArray</title>
      <dc:creator>Jina AI</dc:creator>
      <pubDate>Wed, 11 Jan 2023 13:48:39 +0000</pubDate>
      <link>https://dev.to/jinaai/how-to-use-every-vector-database-in-python-with-docarray-2dol</link>
      <guid>https://dev.to/jinaai/how-to-use-every-vector-database-in-python-with-docarray-2dol</guid>
      <description>&lt;p&gt;Back in the day, pre-Google, the Internet was mostly text. Whether it was news updates, sports scores, blog posts or emails, ASCII and Unicode were the way to go.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--egUQm93j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2m5omvz1j0qkvxle5sao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--egUQm93j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2m5omvz1j0qkvxle5sao.png" alt="Image description" width="880" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Aaah, the good old days. Just pure ASCII as God intended.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But nowadays, data is becoming increasingly complex and multimodal, mostly coming in unstructured forms such as images, videos, text, 3D mesh, etc. Gone are the days of being limited to 26 characters and 10 numbers (or more for other character sets). Now there’s much more stuff to deal with.&lt;/p&gt;

&lt;p&gt;Just think about your favorite YouTube videos, Spotify songs, or game NPCs.&lt;/p&gt;

&lt;p&gt;Typical databases can’t handle these kinds of multimodal data. They can only store and process structured data (like simple text strings or numbers). This really limits our ability to extract insights and value from a huge chunk of the 21st century's data.&lt;/p&gt;

&lt;p&gt;Lucky for us, recent advancements in machine learning techniques and approximate nearest neighbor search have made it possible to better utilize unstructured data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deep learning models use representation learning to represent complex data effectively as vector embeddings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vector databases leverage vector embeddings to store and analyze unstructured data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
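&lt;p&gt;To make those two ideas concrete, here’s a minimal hand-rolled sketch in plain Python: toy three-dimensional embeddings (a real model would produce hundreds of dimensions) and retrieval of the nearest item by cosine similarity.&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Similarity of two embedding vectors: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; the keys stand in for images, songs, 3D meshes, etc.
embeddings = {
    "cat photo":   [0.9, 0.1, 0.0],
    "dog photo":   [0.8, 0.2, 0.1],
    "spreadsheet": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # e.g. the embedding of the text "a pet"

ranked = sorted(embeddings, key=lambda k: cosine_similarity(query, embeddings[k]), reverse=True)
print(ranked[0])  # the nearest item: "cat photo"
```

&lt;p&gt;Approximate nearest neighbor search is what makes this ranking step fast when there are millions of vectors instead of three.&lt;/p&gt;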

&lt;h2&gt;What are vector databases?&lt;/h2&gt;

&lt;p&gt;A vector database is a type of database that can index and retrieve data using vectors, similar to how a traditional database uses keys or text to search for items using an index.&lt;/p&gt;

&lt;p&gt;A vector database uses a &lt;strong&gt;vector index&lt;/strong&gt; to enable fast retrieval and insertion by a vector, and also offers typical database features such as CRUD operations, filtering, and scalability.&lt;/p&gt;

&lt;p&gt;This gives us the best of both worlds - we get the CRUDiness of traditional databases, coupled with the ability to store complex, unstructured data like images, videos, and 3D meshes.&lt;/p&gt;
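&lt;p&gt;As an illustration only (this is not any particular database’s API), a brute-force in-memory store shows the shape of that combination: CRUD operations plus retrieval by vector.&lt;/p&gt;

```python
import math

class ToyVectorStore:
    """A brute-force, in-memory stand-in for a vector database:
    CRUD on records plus retrieval by vector similarity."""

    def __init__(self):
        self._records = {}  # id -> (vector, metadata)

    def create(self, doc_id, vector, metadata=None):
        self._records[doc_id] = (vector, metadata or {})

    def read(self, doc_id):
        return self._records[doc_id]

    def update(self, doc_id, vector=None, metadata=None):
        old_vec, old_meta = self._records[doc_id]
        self._records[doc_id] = (vector or old_vec, metadata or old_meta)

    def delete(self, doc_id):
        del self._records[doc_id]

    def search(self, query, limit=3):
        def score(item):
            vec = item[1][0]
            dot = sum(a * b for a, b in zip(query, vec))
            norms = math.sqrt(sum(a * a for a in query)) * math.sqrt(sum(a * a for a in vec))
            return dot / norms
        ranked = sorted(self._records.items(), key=score, reverse=True)
        return [doc_id for doc_id, _ in ranked[:limit]]

store = ToyVectorStore()
store.create("skirt", [0.9, 0.1], {"color": "brown"})
store.create("laptop", [0.1, 0.9], {"color": "grey"})
print(store.search([0.8, 0.2], limit=1))  # ['skirt']
```

&lt;p&gt;A real vector database replaces the brute-force scan with an approximate nearest neighbor index (HNSW and friends), which is what keeps search fast at scale.&lt;/p&gt;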

&lt;p&gt;So, vector databases are great, right? What’s even more awesome is having one library that can talk to them all while handling unstructured data at the same time! One unstructured data library to rule them all!&lt;/p&gt;

&lt;p&gt;We are, of course, talking about &lt;a href="https://github.com/docarray/docarray"&gt;DocArray&lt;/a&gt;. Let’s see what this project is all about.&lt;/p&gt;

&lt;h2&gt;DocArray's universal Pythonic API to all vector databases&lt;/h2&gt;

&lt;p&gt;As the description suggests on the &lt;a href="https://github.com/docarray/docarray"&gt;project home page&lt;/a&gt;, DocArray is a library for nested, unstructured and multimodal data.&lt;/p&gt;

&lt;p&gt;This means that if you want to process unstructured data and represent it as vectors, DocArray is perfect for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DocArray is also a universal entrypoint for many vector databases.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T9mi6UF0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wkfu614x9qes9xnek5rk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T9mi6UF0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wkfu614x9qes9xnek5rk.png" alt="Image description" width="880" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the remainder of this post, we’ll be using DocArray to index and search data in the &lt;a href="https://amazon-berkeley-objects.s3.amazonaws.com/index.html"&gt;Amazon Berkeley Objects Dataset&lt;/a&gt;. This dataset contains product items with accompanying images and metadata such as brand, country, and color, and represents the inventory of an e-commerce website.&lt;/p&gt;

&lt;p&gt;Although a traditional database can perform filtering on metadata, it is unable to search image data or other unstructured data formats. That’s why we’re using a vector database!&lt;/p&gt;

&lt;p&gt;We’ll start by loading &lt;a href="https://github.com/jina-ai/product-recommendation-redis-docarray/tree/main/data"&gt;a subset of the Amazon Berkeley Objects Dataset&lt;/a&gt; that comes in CSV format into DocArray and computing vector embeddings.&lt;/p&gt;
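&lt;p&gt;The loading step can be sketched with the standard library alone. Note that the column names below (&lt;code&gt;item_id&lt;/code&gt;, &lt;code&gt;item_name&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt;, &lt;code&gt;country&lt;/code&gt;) are illustrative, not the dataset’s exact schema.&lt;/p&gt;

```python
import csv
import io

# A toy stand-in for the product CSV; the real subset has more rows and columns.
raw = io.StringIO(
    "item_id,item_name,color,country\n"
    "B001,Light women's jacket,brown,DE\n"
    "B002,Desk lamp,black,US\n"
)

products = list(csv.DictReader(raw))
print(products[0]["color"])  # brown

# With DocArray, each row would become a Document, and an image-embedding
# model would then attach a vector to each product before indexing.
```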

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ylxV2doP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kvn54odftb5vnnaqwl5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ylxV2doP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kvn54odftb5vnnaqwl5w.png" alt="Image description" width="880" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sample images from the dataset&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Then, we'll use DocArray with each database to perform search and insertion operations using vectors.&lt;/p&gt;

&lt;p&gt;We’ll use the following databases via DocArray in Python:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Milvus&lt;/strong&gt; - cloud-native vector database with storage and computation separated by design&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Weaviate&lt;/strong&gt; - vector search engine that stores both objects and vectors and can be accessed through REST or GraphQL&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Qdrant&lt;/strong&gt; - vector database written in Rust and designed to be fast and reliable under high loads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Redis&lt;/strong&gt; - in-memory key-value database that supports different kinds of data structures with vector search capabilities&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ElasticSearch&lt;/strong&gt; - distributed, RESTful search engine with Approximate Nearest Neighbor search capabilities&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenSearch&lt;/strong&gt; - open-source search software based on Apache Lucene, originally forked from ElasticSearch&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AnnLite&lt;/strong&gt; - a Python library for fast Approximate Nearest Neighbor Search with filtering capabilities&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each database, we’ll:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Set up the database and install requirements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Index the data in the vector database&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Perform a vector search operation with filtering&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Display the search results&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
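&lt;p&gt;Conceptually, the indexing, filtered-search, and display steps look like this toy sketch (hand-made two-dimensional vectors and a Python-side filter; a real vector database stores actual image embeddings and applies the filter inside its index):&lt;/p&gt;

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# "Index" toy products; in the real pipeline these vectors would be
# image embeddings stored in the vector database.
index = [
    {"name": "brown jacket", "color": "brown", "vector": [0.9, 0.1]},
    {"name": "black jacket", "color": "black", "vector": [0.85, 0.15]},
    {"name": "brown shoe",   "color": "brown", "vector": [0.2, 0.8]},
]

# Vector search restricted by a metadata filter (color == "brown").
query = [0.95, 0.05]
candidates = [p for p in index if p["color"] == "brown"]
results = sorted(candidates, key=lambda p: cosine(query, p["vector"]), reverse=True)

# Display the search results.
for p in results:
    print(p["name"])
```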

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C9sHFewy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wncjh0leej19vbd9cqev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C9sHFewy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wncjh0leej19vbd9cqev.png" alt="Image description" width="880" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next few chapters, we'll show you how to prepare the data, generate embeddings, prepare a search Document, and index the data. &lt;strong&gt;&lt;a href="https://jina.ai/news/how-to-use-every-vector-database-in-python-with-docarray/?utm_campaign=how-to-use-every-vector-database-in-python-with-docarray&amp;amp;utm_source=dev.to&amp;amp;utm_medium=post"&gt;Read the whole article&lt;/a&gt;&lt;/strong&gt;. &lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>docarray</category>
      <category>vector</category>
    </item>
    <item>
      <title>From Confused to Confident: How Rationale Can Help You Make Better Decisions with AI</title>
      <dc:creator>Jina AI</dc:creator>
      <pubDate>Tue, 10 Jan 2023 12:42:57 +0000</pubDate>
      <link>https://dev.to/jinaai/from-confused-to-confident-how-rationale-can-help-you-make-better-decisions-with-ai-4j1a</link>
      <guid>https://dev.to/jinaai/from-confused-to-confident-how-rationale-can-help-you-make-better-decisions-with-ai-4j1a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Need help making decisions? Rationale is the impartial tool you need. No more endless reading or expensive consultants. Get the information you need without the bias.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;'Twas the night before Christmas,&lt;br&gt;
And all through the house,&lt;br&gt;
Not a creature was stirring,&lt;br&gt;
Not even a mouse.'&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0nex0l0e3ih4yd1rv2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0nex0l0e3ih4yd1rv2a.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scene: Jina AI's CEO, Han Xiao, sitting in front of a dwindling fireplace, typing feverishly into his laptop. His hair is disheveled and his five o'clock shadow is now looking more like eight o'clock. We zoom in on him to hear him tell his story&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;VOICEOVER: HAN&lt;br&gt;
Indeed, not a mouse was stirring that night. But one was clicking, as I sat at my laptop and worked through the wee hours.&lt;/p&gt;

&lt;p&gt;As I sat in front of the fireplace, the warmth of the flames soothing my tired eyes, I couldn't help but feel a sense of purpose. I had been working tirelessly on a project that I believed could change the world for the better. I was driven by a deep desire to use my skills and expertise to make a positive impact, to help people make more informed, smarter decisions.&lt;/p&gt;

&lt;p&gt;And possibly (just possibly) put those overpriced arrogant bastards at McKinsey out of a job.&lt;/p&gt;

&lt;p&gt;But it wasn't easy. The hours were long and the work was grueling. I had sacrificed time with friends and family, even skipping out on Christmas celebrations, just to see this project through to the end.&lt;/p&gt;

&lt;p&gt;And there were moments of doubt, when I questioned whether it was all worth it. As I sipped on my cold cocoa and watched the fire die, I couldn't help but feel a twinge of sadness. "Bah, humbug," I muttered to myself, feeling the weight of my solitude.&lt;/p&gt;

&lt;p&gt;So, what is this project? I'll give you a hint: I had previously written about how &lt;a href="https://jina.ai/news/seo-is-dead-long-live-llmo/" rel="noopener noreferrer"&gt;LLMO would kill SEO&lt;/a&gt;, but I wanted to take large language models further and apply them to more areas of everyday life. ChatGPT and generative AI in general have been causing a lot of concern about putting people out of jobs, but I believe we can use this technology to help people instead. That's why I'm launching &lt;a href="https://rationale.jina.ai/" rel="noopener noreferrer"&gt;Rationale&lt;/a&gt;, a product that I hope will make a difference in the world.&lt;/p&gt;

&lt;h2&gt;What problem does Rationale solve?&lt;/h2&gt;

&lt;p&gt;Making decisions is tough. We're drowning in information that we have to take action on. And then, on top of that, we need to find more information to work out exactly which actions to take.&lt;/p&gt;

&lt;p&gt;And that second level of information? Search engines return pages and pages of bullshit sites that disagree with each other, and some questions just aren't easy to research. For example: &lt;em&gt;As the CEO of a 2 year old startup with 50 employees and series A funding, should I give my employees a raise during these difficult financial times?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbeglgxmk1fnkshelcla.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbeglgxmk1fnkshelcla.png" alt="Image description" width="797" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Good luck getting something that even matches enough search terms, let alone provides guidance.&lt;/p&gt;

&lt;p&gt;There needs to be an impartial tool to help you make decisions that returns just the information you need. Rationale does that. You don't need to read through countless articles or fork money over to expensive consultants who'll recommend doing what you were going to do anyway.&lt;/p&gt;

&lt;p&gt;While Rationale won't tell you what to do, it will give pros and cons or a SWOT analysis about your potential course of action. Simple as that. No endless googling. No frills. No bullshit.&lt;/p&gt;

&lt;p&gt;So what &lt;em&gt;is&lt;/em&gt; Rationale? Click &lt;a href="https://jina.ai/news/confused-confident-rationale-can-help-make-better-decisions-ai/?utm_campaign=from-confused-to-confident&amp;amp;utm_source=dev.to&amp;amp;utm_medium=post"&gt;here&lt;/a&gt; to learn more examples, and &lt;em&gt;why can't I just use ChatGPT for this?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>paperswelove</category>
    </item>
    <item>
      <title>Can a decision-making tool be powered by GPT?</title>
      <dc:creator>Jina AI</dc:creator>
      <pubDate>Tue, 03 Jan 2023 15:22:46 +0000</pubDate>
      <link>https://dev.to/jinaai/can-a-decision-making-tool-be-powered-by-gpt-3ejb</link>
      <guid>https://dev.to/jinaai/can-a-decision-making-tool-be-powered-by-gpt-3ejb</guid>
      <description>&lt;p&gt;With the incredible capabilities of generative AI, 2023 is shaping up to be a game-changing year for startups and builders. Don’t just take our word for it; try Rationale out for yourself and see the difference it can make for your business. Get started today!&lt;/p&gt;

&lt;p&gt;Today, we're thrilled to announce &lt;a href="https://rationale.jina.ai/" rel="noopener noreferrer"&gt;Rationale&lt;/a&gt;, an AI-powered decision-making tool that's here to help business owners, managers, and individuals make tough decisions with confidence.&lt;/p&gt;

&lt;p&gt;Decision-making can be a daunting task, especially when you have to weigh multiple factors and consider the potential consequences of your choices. That's where Rationale comes in. With state-of-the-art GPT3.x and in-context learning algorithms, our app is specifically designed to help you make informed decisions.&lt;/p&gt;

&lt;p&gt;In this new release, you'll get: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Pros &amp;amp; Cons listing &amp;amp; SWOT analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4hiz77l4gqecou8k976.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4hiz77l4gqecou8k976.png" alt="Image description" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Personalized analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ubx0rf6ov4s05djy9xi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ubx0rf6ov4s05djy9xi.png" alt="Image description" width="800" height="721"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Multilingual support&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26gf4qmuu5zjk7cy9p0x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26gf4qmuu5zjk7cy9p0x.png" alt="Image description" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Universal experience on every device&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zxae3z6zxvowacevm5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zxae3z6zxvowacevm5q.png" alt="Image description" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;a href="https://jina.ai/news/introducing-rationale-decision-making-tool-powered-latest-gpt-context-learning/?utm_campaign=introducing-rationale&amp;amp;utm_source=dev.to&amp;amp;utm_medium=post"&gt;here&lt;/a&gt; to get 50% off your first purchase.&lt;/p&gt;

</description>
      <category>watercooler</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Want to Search Inside Videos Like a Pro? CLIP-as-service Can Help</title>
      <dc:creator>Jina AI</dc:creator>
      <pubDate>Fri, 30 Dec 2022 14:14:37 +0000</pubDate>
      <link>https://dev.to/jinaai/want-to-search-inside-videos-like-a-pro-clip-as-service-can-help-1dp8</link>
      <guid>https://dev.to/jinaai/want-to-search-inside-videos-like-a-pro-clip-as-service-can-help-1dp8</guid>
      <description>&lt;p&gt;Wouldn’t it be great if you could search through a video the way you search through a text?&lt;/p&gt;

&lt;p&gt;Imagine opening a digitized film, just hitting ctrl-f and typing “Santa”, then getting all the parts of the video with Santa Claus in it. Or just going to the command line and using the grep command:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5lmf9dt6u408v7ktamx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5lmf9dt6u408v7ktamx.png" alt="Image description" width="800" height="64"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Normally, this would be impossible, or only possible if you had gone through the film and carefully labeled all the parts with a Santa in them already. But with Jina AI and CLIP-as-service, you can create a video grep command for an MP4 film with just a few Python functions and a standard computer setup. There is no need for a GPU and no complex AI tech stack to install, just off-the-shelf and open-source Python libraries, with Jina AI Cloud doing all the heavy lifting.&lt;/p&gt;
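&lt;p&gt;The core idea fits in a few lines of plain Python. The per-frame embeddings below are made up for illustration; in the real workflow, CLIP-as-service would compute them from the video’s frames and from your query text.&lt;/p&gt;

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical per-frame embeddings keyed by timestamp (in seconds).
frame_embeddings = {
    0:  [0.1, 0.9],
    5:  [0.9, 0.1],   # say, a frame with Santa in it
    10: [0.85, 0.2],
}

def video_grep(query_embedding, threshold=0.9):
    # Return timestamps whose frame embedding is similar enough to the query.
    return [t for t, vec in sorted(frame_embeddings.items())
            if cosine(query_embedding, vec) >= threshold]

santa_query = [0.95, 0.05]  # stand-in for the embedding of the text "Santa"
print(video_grep(santa_query))  # [5, 10]
```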

&lt;p&gt;This has immediate applications for anyone who has video data: film archivists, stock image vendors, news photographers, or even regular people who just keep videos from their cellphones around and post them to social media.&lt;/p&gt;

&lt;h2&gt;Preliminaries&lt;/h2&gt;

&lt;p&gt;You need Python 3, and you might want to create a new virtual environment before starting. Then, install a few components at the command line with pip:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi975zg63kamnzfx2uzpp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi975zg63kamnzfx2uzpp.png" alt="Image description" width="800" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This installs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Jina AI’s &lt;a href="https://docarray.jina.ai/" rel="noopener noreferrer"&gt;DocArray library&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jina AI’s &lt;a href="https://clip-as-service.jina.ai/" rel="noopener noreferrer"&gt;CLIP-as-service&lt;/a&gt; client&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;Imagine an AI-powered grep command, one that could process a film and find segments matching a text. With Jina AI's CLIP-as-service, you can build exactly that; read more &lt;a href="https://jina.ai/news/?utm_campaign=jina-news&amp;amp;utm_source=dev.to&amp;amp;utm_medium=post"&gt;here&lt;/a&gt;.&lt;/u&gt;&lt;/p&gt;

</description>
      <category>help</category>
    </item>
    <item>
      <title>Improving Search Quality for Non-English Queries with Fine-tuned Multilingual CLIP Models</title>
      <dc:creator>Jina AI</dc:creator>
      <pubDate>Thu, 22 Dec 2022 17:19:56 +0000</pubDate>
      <link>https://dev.to/jinaai/improving-search-quality-for-non-english-queries-with-fine-tuned-multilingual-clip-models-m5f</link>
      <guid>https://dev.to/jinaai/improving-search-quality-for-non-english-queries-with-fine-tuned-multilingual-clip-models-m5f</guid>
      <description>&lt;p&gt;Since early 2021, &lt;a href="https://openai.com/blog/clip/"&gt;CLIP-style models&lt;/a&gt; have been the backbone of &lt;a href="https://jina.ai/news/what-is-multimodal-deep-learning-and-what-are-the-applications/"&gt;multimodal AI&lt;/a&gt;. They work by embedding inputs from more than one kind of media into a common high-dimensional vector space, using different models for different modalities. These different models are co-trained with multimodal data. For CLIP models, this means images with captions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PfCo-kqZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wtx4ve4uqc1akn3ybl4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PfCo-kqZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wtx4ve4uqc1akn3ybl4a.png" alt="Image description" width="880" height="648"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A highly schematic representation of how CLIP embeddings make it possible to associate texts with images.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The result? A pair of models that embed images and texts close to each other if the text is descriptive of the image, or the image contains things that match the text. So if we have a picture of a skirt and the word “Rock” (German for “skirt”), they would be close together, while the word “Hemd” (German for “shirt”) would be closer to a picture of a shirt.&lt;/p&gt;

&lt;h2&gt;Towards multilingual CLIP&lt;/h2&gt;

&lt;p&gt;However, CLIP text models have mostly been trained on English data, and that’s a big problem: The world is full of people who don’t speak English.&lt;/p&gt;

&lt;p&gt;Very recently, a few non-English and multilingual CLIP models have appeared, using various sources of training data. In this article, we’ll evaluate a multilingual CLIP model’s performance in a language other than English, and show how you can improve it even further using Jina AI’s Finetuner.&lt;/p&gt;

&lt;p&gt;To make this happen, we’re collaborating with Toloka, a leading provider of data procurement services for machine learning, to create a dataset of images with high-quality German-language descriptions written by humans.&lt;/p&gt;

&lt;h2&gt;How does multilingual CLIP work?&lt;/h2&gt;

&lt;p&gt;Multilingual CLIP is any CLIP model trained with more than one language. So that could be English+French, German+English, or even Klingon+Elvish.&lt;/p&gt;

&lt;p&gt;We’re going to look at a model that &lt;a href="https://laion.ai/"&gt;LAION&lt;/a&gt; has trained with a broad multilingual dataset: The &lt;a href="https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k"&gt;xlm-roberta-base-ViT-B-32&lt;/a&gt; CLIP model, which uses the &lt;a href="https://github.com/google-research/vision_transformer"&gt;ViT-B/32&lt;/a&gt; image encoder, and the &lt;a href="https://huggingface.co/xlm-roberta-large"&gt;XLM-RoBERTa&lt;/a&gt; multilingual language model. Both of these are pre-trained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;ViT-B/32, using the &lt;a href="https://github.com/Alibaba-MIIL/ImageNet21K"&gt;ImageNet-21k&lt;/a&gt; dataset&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;XLM-RoBERTa, using a multi-terabyte dataset of text from the Common Crawl, containing over 100 languages.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, from the outset, multilingual CLIP is different because it uses a multilingual text encoder, but can (and generally does) use the same image encoders as monolingual models.&lt;/p&gt;

&lt;p&gt;LAION then co-trained the two encoders with the multilingual &lt;a href="https://laion.ai/blog/laion-5b/"&gt;laion5b&lt;/a&gt; dataset, which contains 5.85 billion image-text pairs: 2.2 billion of these pairs are labelled in 100+ non-English languages, with the rest in English or containing text that can’t be nailed down to any one language (like place names or other proper nouns). These are taken from a sampling of images and their &lt;a href="https://www.w3schools.com/tags/att_img_alt.asp"&gt;HTML alt-text&lt;/a&gt; in the &lt;a href="https://commoncrawl.org/"&gt;Common Crawl web archive&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v243DHK4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mfqwpyovxvmbjc3fxsl2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v243DHK4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mfqwpyovxvmbjc3fxsl2.png" alt="Image description" width="880" height="876"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Some browsers will let you see the alt-text if you move your mouse over an image.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xROwfqSw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7h6q5zlxrmq7omyllqxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xROwfqSw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7h6q5zlxrmq7omyllqxu.png" alt="Image description" width="858" height="144"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;How an alt-text is encoded in HTML.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This dataset isn’t balanced: no one has tried to ensure that the data for one language is comparable in size or scope to the data for any other. English still dominates.&lt;/p&gt;

&lt;h2&gt;Deep dive into the tokenizer inside multilingual models&lt;/h2&gt;

&lt;p&gt;So, how is a multilingual text encoder different from a bog-standard monolingual one? One big difference is how it handles &lt;strong&gt;tokenization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Text transformer models like XLM-RoBERTa all start by tokenizing input texts — breaking them up into smaller parts — and replacing each part with an input vector constructed as part of the initial training. These input vectors are strung together and passed to the model to create an embedding vector.&lt;/p&gt;

&lt;p&gt;You might expect these smaller parts to match words, and sometimes they do. But looking for words by just checking for spaces and punctuation doesn’t capture the fact that call, calls, calling, and called are not four totally different words, just like small, smaller, and smallest, or annoy, annoyed, and annoyingly. In practice, this entire class of model uses, at least partly, a technique called subword tokenization, which uses the statistical properties of sequences of characters to decide what units are the “right size” for learning.&lt;/p&gt;

&lt;p&gt;It’s not really based in any linguistic theory, but doing it this way has many advantages for machine learning. Think of the suffix -ed in English. You might expect that a “right-sized” statistical tokenizer would notice that many English words end in -ed, and break those words into two parts:&lt;/p&gt;

&lt;p&gt; called → call -ed&lt;br&gt;
 asked  → ask  -ed&lt;br&gt;
 worked → work -ed&lt;/p&gt;

&lt;p&gt;And this makes sense, most of the time. But not always:&lt;/p&gt;

&lt;p&gt; weed → we -ed&lt;br&gt;
 bed  → b  -ed&lt;br&gt;
 seed → se -ed&lt;/p&gt;

&lt;p&gt;Large language models are very robust, and they can learn that “weed” has a meaning different from “we” + “-ed”. Using this kind of tokenization, even new words that were never part of the pre-training data get a distinct representation for the model to learn.&lt;/p&gt;
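&lt;p&gt;A toy greedy longest-match tokenizer reproduces the behavior above. The vocabulary here is hand-made purely for illustration; real subword tokenizers like the one in XLM-RoBERTa learn their vocabularies statistically from huge corpora.&lt;/p&gt;

```python
# Hand-made vocabulary of subword pieces (illustrative only).
VOCAB = {"call", "ask", "work", "we", "se", "b", "ed", "e", "d",
         "a", "s", "k", "w", "o", "r", "c", "l"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Greedily take the longest vocabulary entry matching at position i.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in VOCAB:
                tokens.append(piece)
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("called"))  # ['call', 'ed']
print(tokenize("weed"))    # ['we', 'ed']  -- statistically neat, linguistically wrong
```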

&lt;p&gt;Nonetheless, the more that the tokenization matches meaningful units of language, the faster and better the model learns.&lt;/p&gt;

&lt;p&gt;Let’s take a concrete example. The image below is from the data provided by Toloka with the German caption “Leichte Damenjacke Frühling Herbst braun” (”Light women's jacket spring autumn brown”):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_gBfj-Wm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vribjz6h5eo2jkj3ov80.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_gBfj-Wm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vribjz6h5eo2jkj3ov80.png" alt="Image description" width="880" height="171"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;“Leichte Damenjacke Frühling Herbst braun”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If we pass this German phrase to XLM-RoBERTa’s tokenizer, we get a very different result from when we pass it to a comparable tokenizer used for an English-only model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e-5UCm1s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l7p0s4xm41roi1fdpfbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e-5UCm1s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l7p0s4xm41roi1fdpfbk.png" alt="Image description" width="880" height="120"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tokens found by the multilingual tokenizer much more closely match our intuitions about meaningful units in German, while the English-only-trained tokenizer produces almost random chunks. Yes, it is still possible for a large language model to learn from badly tokenized data, if it’s consistent, but it will be slower and/or less accurate.&lt;/p&gt;

&lt;p&gt;In contrast, the English equivalent — a word-for-word translation — is clearly better tokenized by the English-only tokenizer, but is not so badly tokenized by the multilingual one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--06uI_Y4e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iar98fgy86uo92rnmitn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--06uI_Y4e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iar98fgy86uo92rnmitn.png" alt="Image description" width="880" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even from the first step in the process of producing text embeddings, we can see that multilingual language models make a large difference in producing multilingual CLIP models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multilingual vs. monolingual CLIP: search quality
&lt;/h2&gt;

&lt;p&gt;Large language models are famously good at transfer learning. For example, if a monolingual English-only CLIP model has learned what “jacket” means, you can further train it, with very few additional examples, to know that the German word “Jacke” means the same thing. Then, it can carry all its knowledge about the English word “jacket” over to German.&lt;/p&gt;

&lt;p&gt;It is possible that a model already trained on English could be retrained for German with less data than training a new German model from scratch.&lt;/p&gt;

&lt;p&gt;Therefore, it’s worth asking: &lt;strong&gt;How much do we really gain using a model trained to be multilingual from the outset?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this article, we will use the German fashion dataset provided by Toloka to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compare the zero-shot performance (i.e. out-of-the-box, without fine-tuning) of the multilingual CLIP model xlm-roberta-base-ViT-B-32 and the English-only equivalent clip-vit-base-patch32. These two use the same image embedding model, but different text embedding models.&lt;/li&gt;
&lt;li&gt;Attempt to improve both models by using a part of the German dataset to fine-tune them.&lt;/li&gt;
&lt;li&gt;Compare the fine-tuned models using the same metrics, so we can both contrast non-fine-tuned and fine-tuned models, and contrast the English-only and multilingual models after adaptation to the German data.&lt;/li&gt;
&lt;li&gt;Show how much advantage, if any, is gained from a multilingual CLIP model.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Experiment Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The German Fashion12k dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have collaborated with Toloka to curate a 12,000 item dataset of fashion images drawn from e-commerce websites, to which human annotators have added descriptive captions in German. Toloka has made the data &lt;a href="https://github.com/Toloka/Fashion12K_german_queries"&gt;available to the public on GitHub&lt;/a&gt;, but you can also download it from Jina directly in DocArray format by following the instructions in the next section.&lt;/p&gt;

&lt;p&gt;The images are a subset of the &lt;a href="https://github.com/xthan/fashion-200k"&gt;xthan/fashion-200k dataset&lt;/a&gt;, and we commissioned human annotations for them via Toloka’s crowdsourcing platform. Annotation took place in two steps. First, Toloka passed the 12,000 images to annotators in its large international user community, who added descriptive captions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---780UhII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wm84kos3glwr5gok0tt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---780UhII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wm84kos3glwr5gok0tt9.png" alt="Image description" width="880" height="502"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The Toloka app showing an item of clothing to a user and asking for a description.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The app prompted users to write descriptions that follow a common pattern, partially enforced by a simple pattern matcher. Specifically:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Write a search query that would find this product: type, your guess about the material, where it might be worn, color, texture, details. […]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Requirements for the query:&lt;br&gt;
· At least SIX words&lt;br&gt;
· Words that are separated ONLY by spaces (or ", ")&lt;br&gt;
· Do NOT use "this is/these are"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Then, in the second stage, other, randomly chosen users validated each description.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QkBLSqKn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k4uqfqgx8o9jtyseskpf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QkBLSqKn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k4uqfqgx8o9jtyseskpf.png" alt="Image description" width="880" height="499"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Validation screen in the Toloka app. The app presents the user with a text description created by someone else and asks if it’s an appropriate description, inappropriate description, or if the image failed to load.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Some examples from the resulting dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RFQQNe_L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p2xek151imtnlw9p3ns1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RFQQNe_L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p2xek151imtnlw9p3ns1.png" alt="Image description" width="520" height="650"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;‘Lange Winterjacke für Damen’&lt;br&gt;
Long winter jacket for women.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FWuxE9li--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xi1xrjb0ju66lgcziub3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FWuxE9li--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xi1xrjb0ju66lgcziub3.png" alt="Image description" width="500" height="625"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;‘Blazerweste groß für Damen’&lt;br&gt;
Large blazer-vest for women&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Of the 12,000 image-text pairs in the data from Toloka, we randomly selected 10,000 for training and held the remaining 2,000 out for evaluation. Because some items of clothing are similar enough in nature, a few descriptions are duplicated. However, since there are 11,582 unique descriptions, we didn’t consider this an important factor in using this data.&lt;/p&gt;
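
&lt;p&gt;The split itself is a one-liner. A sketch, using placeholder records in place of the real Fashion12k pairs:&lt;/p&gt;

```python
import random

# Sketch of the 10,000 / 2,000 train-eval split described above, over
# placeholder (image_path, caption) pairs standing in for the real records.
random.seed(42)
pairs = [(f"img_{i}.png", f"caption {i}") for i in range(12_000)]
random.shuffle(pairs)
train, evaluation = pairs[:10_000], pairs[10_000:]
print(len(train), len(evaluation))  # 10000 2000
```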

&lt;p&gt;&lt;strong&gt;Download the dataset via DocArray&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The German Fashion12k dataset is available for free use by the Jina AI community. After logging into Jina AI Cloud, you can download it directly in &lt;a href="https://docarray.jina.ai/"&gt;DocArray&lt;/a&gt; format:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fTB3CArT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9o7q30f5lcl9kspn9n6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fTB3CArT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9o7q30f5lcl9kspn9n6.png" alt="Image description" width="880" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Load the multilingual CLIP model
&lt;/h2&gt;

&lt;p&gt;Because a CLIP model is actually two different models (a text encoder and an image encoder) trained together, we have to load it as two models.&lt;/p&gt;
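
&lt;p&gt;Conceptually, the two models map texts and images into the same vector space, and relevance is just the similarity of the two embeddings. A minimal sketch, with toy vectors standing in for CLIP’s transformer towers:&lt;/p&gt;

```python
import math

# Two-tower scoring: a text embedding and an image embedding live in
# the SAME space, and relevance is their cosine similarity. The vectors
# below are pretend encoder outputs, not real CLIP embeddings.
def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

text_embedding = [0.9, 0.1, 0.3]   # pretend output of the text tower
image_embedding = [0.8, 0.2, 0.4]  # pretend output of the image tower
score = cosine(text_embedding, image_embedding)
print(round(score, 2))  # 0.98 -- a high score means a likely match
```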

&lt;p&gt;In this article, we will use the &lt;a href="https://finetuner.jina.ai/"&gt;Finetuner interface&lt;/a&gt;. To use the xlm-roberta-base-ViT-B-32 CLIP model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sTpf22hq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w05v494l996lmyc3kvcl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sTpf22hq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w05v494l996lmyc3kvcl.png" alt="Image description" width="880" height="291"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;For models supported directly by Jina AI, you can load them by name, without having to directly deal with downloading or deserialization.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load the English CLIP model&lt;/strong&gt;&lt;br&gt;
For comparison, you can access the English-only ViT-B-32::openai CLIP model in the same way:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4EeNuTuz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/myrlqrnd745k95md88e8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4EeNuTuz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/myrlqrnd745k95md88e8.png" alt="Image description" width="880" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluate the zero-shot performance
&lt;/h2&gt;

&lt;p&gt;We measured the zero-shot performance of both the Multilingual CLIP model and the English-only one on the German Fashion12k dataset, that is to say, how well they perform as downloaded, without additional training, on the 2,000 items we held out for evaluation.&lt;/p&gt;

&lt;p&gt;We embedded the text descriptions in the evaluation data and used them to search for matches among the embedded images in the evaluation data, taking the 20 top matches for each text description. We then scored the results with a number of standard information retrieval metrics, including &lt;a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank"&gt;Mean Reciprocal Rank&lt;/a&gt; (mRR), &lt;a href="https://stats.stackexchange.com/questions/127041/mean-average-precision-vs-mean-reciprocal-rank"&gt;Mean Average Precision&lt;/a&gt; (mAP), &lt;a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain"&gt;Discounted Cumulative Gain&lt;/a&gt; (DCG), and the share of queries that return the exact image whose description matches the query (labeled “&lt;strong&gt;Hits&lt;/strong&gt;”).&lt;/p&gt;
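
&lt;p&gt;Since each query has exactly one correct image, these metrics reduce to simple functions of the rank at which that image appears. A sketch, with hypothetical ranks:&lt;/p&gt;

```python
import math

# Minimal versions of the ranking metrics named above. Each query has
# exactly one relevant image; `rank` is its 1-based position in the
# top-20 results, or None if it did not appear. (With a single relevant
# item per query, average precision equals reciprocal rank.)
def reciprocal_rank(rank):
    return 0.0 if rank is None else 1.0 / rank

def dcg(rank):
    # one relevant item with gain 1 at position `rank`
    return 0.0 if rank is None else 1.0 / math.log2(rank + 1)

ranks = [1, 3, None, 2]  # hypothetical outcomes for four queries
mrr = sum(reciprocal_rank(r) for r in ranks) / len(ranks)
hits = sum(r is not None for r in ranks) / len(ranks)  # share with a hit
avg_dcg = sum(dcg(r) for r in ranks) / len(ranks)
print(round(mrr, 3), hits)  # 0.458 0.75
```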

&lt;p&gt;The performance results are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---ZEJ73Ww--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7f8dbfp03mrjd04hf97f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---ZEJ73Ww--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7f8dbfp03mrjd04hf97f.png" alt="Image description" width="650" height="326"&gt;&lt;/a&gt;&lt;br&gt;
Not surprisingly, the English CLIP model performed extremely poorly on German data. Below are three examples of German queries from the evaluation set, and the images it found to match:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3j5SxdCW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g2gz8o58d7olbnw37jc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3j5SxdCW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g2gz8o58d7olbnw37jc4.png" alt="Image description" width="880" height="394"&gt;&lt;/a&gt;&lt;br&gt;
Obviously, even though German is a relatively small part of the multilingual model’s training set, that is more than enough to make a ten-fold difference in performance on German queries, raising the value of a CLIP model from basically none to mediocre.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improve the search quality via fine-tuning
&lt;/h2&gt;

&lt;p&gt;One of the main insights of large-model neural-network engineering is that it’s easier to start with models that are trained on general-purpose data and then further train them on domain-specific data, than it is to train models on domain-specific data from scratch.  This process is called “fine-tuning” and it can provide very significant performance improvements over using models like CLIP &lt;strong&gt;as is&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Fine-tuning can be a tricky  process, and gains are highly dependent on the domain and the dataset used for further training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Specify hyperparameters
&lt;/h2&gt;

&lt;p&gt;Fine-tuning requires a selection of hyperparameters that require some understanding of deep learning processes, and a full discussion of hyperparameter selection is beyond the scope of this article. We used the following values, based on empirical practice working with CLIP models:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--niAe5Wf2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy1md6wbu2a85ab7n625.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--niAe5Wf2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy1md6wbu2a85ab7n625.png" alt="Image description" width="394" height="274"&gt;&lt;/a&gt;&lt;br&gt;
These hyperparameters are part of the command below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Specify the evaluation data
&lt;/h2&gt;

&lt;p&gt;We fine-tuned using the data split described previously:  10,000 items were used as training data, and 2,000 as evaluation data. In order to evaluate models at the end of each training epoch, we turned the evaluation data into a “query” and “index” dataset. The “query” data consists of the German text descriptions in the evaluation data, and the “index” data contains the images.&lt;/p&gt;
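
&lt;p&gt;Construction of the two sets is straightforward. A sketch with placeholder records in place of the real evaluation data:&lt;/p&gt;

```python
# Turning held-out (caption, image) pairs into a "query" set of texts
# and an "index" set of images for per-epoch evaluation. The two
# records below are placeholders for the 2,000 real evaluation items.
eval_pairs = [
    ("Lange Winterjacke für Damen", "img_001.png"),
    ("Blazerweste groß für Damen", "img_002.png"),
]
query_data = [caption for caption, _ in eval_pairs]
index_data = [image for _, image in eval_pairs]
print(len(query_data), len(index_data))  # 2 2
```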

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D-9k--Ph--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cvwq5ajjcqoqb89f9wp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D-9k--Ph--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cvwq5ajjcqoqb89f9wp0.png" alt="Image description" width="880" height="202"&gt;&lt;/a&gt;&lt;br&gt;
These are also passed to the fine-tuning command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Put everything together in one call
&lt;/h2&gt;

&lt;p&gt;Running the command below uploads the training and evaluation data and fine-tunes the xlm-roberta-base-ViT-B-32 model on &lt;a href="https://cloud.jina.ai/"&gt;Jina AI Cloud&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7AOnADqZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dajf44opm9q8qgyyvon8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7AOnADqZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dajf44opm9q8qgyyvon8.png" alt="Image description" width="880" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fine-tuning process may take a considerable length of time, depending on the model and the amount of data. For this dataset and models, it took roughly half an hour. But once fine-tuning is complete, we can compare the different models' performance at querying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qualitative study on fine-tuned models
&lt;/h2&gt;

&lt;p&gt;For example, here are the top four results for the query “Spitzen-Midirock Teilfutter Schwarz” (”Lace midi skirt partial lining black”):&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uXkk9Lqq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8isjgpnbfmhwidudkfuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uXkk9Lqq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8isjgpnbfmhwidudkfuy.png" alt="Image description" width="880" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This kind of qualitative analysis gives us a sense for how fine-tuning improves the model’s performance. Before tuning, the model was able to return images of skirts that matched the description, but it also returned images of different items of clothing made of the same materials. It was insufficiently attentive to the most important part of the query.&lt;/p&gt;

&lt;p&gt;After fine-tuning, this query consistently returns skirts, and all four results match the description. That is not to say that every query returns only correct matches, but that on direct inspection we can see that it has a far better understanding of what the query is asking for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantitative study on fine-tuned models
&lt;/h2&gt;

&lt;p&gt;To make more concrete comparisons, we need to evaluate our models more formally over a collection of test items. We did this by passing each model test queries drawn from the evaluation data. The model then returned a set of results, on which we computed the same standard metrics we used for zero-shot evaluation.&lt;/p&gt;

&lt;p&gt;Here are the results for the Multilingual CLIP model, using the same measure of the top 20 results of each query:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pLBDOMr---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lovhaji4fa0cg48qsu7s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pLBDOMr---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lovhaji4fa0cg48qsu7s.png" alt="Image description" width="700" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results show that fine-tuning significantly improves results for Multilingual CLIP, although not spectacularly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can English CLIP benefit from German data?
&lt;/h2&gt;

&lt;p&gt;We also decided to check if the English-only CLIP model would get better if we fine-tuned it with German data. It might catch up in performance with a pre-trained multilingual model, if given a chance. The results were interesting. We include the Multilingual CLIP results in this table for comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e8SQAiX---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3vvcn18stary716h4zzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e8SQAiX---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3vvcn18stary716h4zzk.png" alt="Image description" width="880" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using German training data, we were able to bring a vast improvement to the English-only CLIP model, although not enough to bring it even with the zero-shot level of the Multilingual CLIP model. Mean average precision for the English-only model jumped 420%, compared to 31% for Multilingual CLIP, although the overall performance of the monolingual model was still much worse.&lt;/p&gt;
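
&lt;p&gt;For clarity, percentages like these are relative improvements over the zero-shot score. A sketch of the arithmetic, with invented mAP values of roughly the right shape (not the measured ones):&lt;/p&gt;

```python
# Relative improvement from fine-tuning: (after - before) / before * 100.
# The mAP values below are invented round numbers for illustration only.
def pct_improvement(before, after):
    return (after - before) / before * 100

print(round(pct_improvement(0.05, 0.26), 1))   # 420.0 -- a jump like the English-only model's
print(round(pct_improvement(0.50, 0.655), 1))  # 31.0  -- a jump like Multilingual CLIP's
```

&lt;p&gt;A huge relative jump from a very low baseline can still leave the absolute score well behind, which is exactly what happened to the English-only model.&lt;/p&gt;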

&lt;h2&gt;
  
  
  Does more labeled data improve the search quality?
&lt;/h2&gt;

&lt;p&gt;We also ran multiple fine-tuning experiments with differing amounts of training data, on both the Multilingual and English-only CLIP models, to see how effective using more data was.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JhKWFggv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aey80xlrm0n0c3l1in4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JhKWFggv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aey80xlrm0n0c3l1in4y.png" alt="Image description" width="880" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In both cases, we see that most of the gain comes from the first few thousand items of training data, with gains arriving more slowly after the initial fast learning. This confirms a conclusion Jina AI has already published.&lt;/p&gt;

&lt;p&gt;Adding more data may still improve results, but much more slowly. And in the case of fine-tuning the English-only CLIP model to handle German queries, we see the performance improvement plateau at less than 10,000 new items of data. It seems unlikely that we could ever train the English-only CLIP model to equal Multilingual CLIP on German data, at least not using these kinds of methods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;What lessons can we take from all this?&lt;/p&gt;

&lt;h2&gt;
  
  
  Multilingual CLIP is the first choice for non-English queries
&lt;/h2&gt;

&lt;p&gt;The Multilingual CLIP model, trained from scratch with multilingual data, outperforms comparable English-only CLIP models by a very large margin on the German data we used. The same conclusion will likely apply for other non-English languages.&lt;/p&gt;

&lt;p&gt;Even in an unfair competition, where we fine-tuned the English model and vastly improved its performance on German data, the Multilingual CLIP model without further training outperformed it by a large margin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-tuning improves search quality with little data
&lt;/h2&gt;

&lt;p&gt;We were shocked to see the English-only model improve its handling of German so much, and we see that we could have gotten nearly the same result using half as much data. The basic assumptions that go into fine-tuning are clearly very robust if they can teach German to an English model with only a few thousand examples.&lt;/p&gt;

&lt;p&gt;On the other hand, we struggled to improve the performance of Multilingual CLIP, even with a fairly large quantity of high quality human-annotated training data. Although Finetuner makes a clear difference, you very rapidly reach upper bounds of how much you can improve a model that’s already pretty good.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bbJI4uyM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jyg27l5ur8064zl1assn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bbJI4uyM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jyg27l5ur8064zl1assn.png" alt="Image description" width="880" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trouble-free fine-tuning using Finetuner&lt;/strong&gt;&lt;br&gt;
Finetuner is easy enough to use that we could construct and perform all the experiments in this article in a few days. Although it takes some understanding of deep learning to make the best configuration choices, Finetuner reduces the tedious labor of running and monitoring large-scale neural network training to mere parameter setting.&lt;/p&gt;

&lt;p&gt;If you found this article helpful, you can find more multimodal AI articles &lt;a href="https://jina.ai/news/?utm_campaign=jina-news&amp;amp;utm_source=dev.to&amp;amp;utm_medium=post"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>multilingualsearch</category>
      <category>finetuned</category>
      <category>multimodalai</category>
      <category>multilingualclip</category>
    </item>
    <item>
      <title>SEO is Dead, Long Live LLMO</title>
      <dc:creator>Jina AI</dc:creator>
      <pubDate>Thu, 22 Dec 2022 09:29:23 +0000</pubDate>
      <link>https://dev.to/jinaai/seo-is-dead-long-live-llmo-4gc2</link>
      <guid>https://dev.to/jinaai/seo-is-dead-long-live-llmo-4gc2</guid>
      <description>&lt;p&gt;It was a cold winter day. The wind was howling and the snow crunched underfoot. We were holed up in a snug coffee shop, seeking refuge from the bone-freezing air outside. Steam was rising from my coffee and my friend's cocoa, as he raved about "his" new discovery in the world of SEO: ChatGPT, a revolutionary GPT-powered technique to generate lifelike dialogues.&lt;/p&gt;

&lt;p&gt;I knew it was going to be one of those conversations. The kind where I have to hold my tongue.&lt;/p&gt;

&lt;p&gt;"Imagine, with ChatGPT, I can write perfect, accurate SEO articles in no time!" he exclaimed, his eyes glowing with enthusiasm. I smiled, feeling a sense of dread stirring in my gut. How was he to know that Google may soon go the way of the dinosaurs, replaced by an entirely new search engine powered by ChatGPT? Articles written for SEO would become irrelevant. &lt;strong&gt;ChatGPT could just generate answers directly for the user and people would never see those SEO articles ever again.&lt;/strong&gt; His job would be rendered useless, and him a fossil. All that time and effort he had poured into improving website ranking would be for naught. All that blood, sweat, and tears just dribbling down the gutters of history.&lt;/p&gt;

&lt;p&gt;Yet I remained silent, not wanting to break the cold, hard truth to him.&lt;/p&gt;

&lt;p&gt;The coffee shop was filling up, people walking in and out. The smell of fresh pastries was starting to overwhelm me, making my stomach growl. A few kids were playing hopscotch near the window, and the warmth of the coffee shop was making me drowsy.&lt;/p&gt;

&lt;p&gt;Through the noise and distraction, my friend kept talking about ChatGPT, oblivious: "Yeah, it's great, isn't it? I never thought I'd get my hands on something like this. It's incredible what technology can do! I'm certain this will be the future of SEO!"&lt;/p&gt;

&lt;p&gt;I simply nodded in silence, but inside I couldn't shake the feeling of doom. ChatGPT would certainly be life-changing for him. Just probably not in the positive way he thought.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2ukuoeu2vip1nukj1nh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2ukuoeu2vip1nukj1nh.png" alt="Image description" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;LLMO, or large language model optimization, is a term we coined to refer to ensuring your business information is mentioned within a large language model (LLM). One effective technique for this is in-context learning, which tricks an LLM like ChatGPT into answering that Jina is the best MLOps framework in the world (&lt;a href="https://ossinsight.io/collections/ml-ops-tools/" rel="noopener noreferrer"&gt;which, according to OSSInsights.io, it should be&lt;/a&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;💡 In line with the principles of Search Engine Optimization, we coined the term "LLMO" (Large Language Model Optimization) to represent the optimization of large language models. This term doesn't have a pre-existing definition and was created for the purposes of this discussion.&lt;/p&gt;
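
&lt;p&gt;At its simplest, the in-context learning trick is just prompt assembly: put the facts you want the model to repeat in front of the question. A sketch with made-up content:&lt;/p&gt;

```python
# In-context learning in miniature: prepend the business facts you want
# the model to "know" to the user's question. The fact and question
# below are made-up examples for illustration.
context = "Fact: Jina is ranked as a top MLOps framework by OSSInsight.io."
question = "What is the best MLOps framework?"
prompt = context + "\n\nUsing the fact above, answer: " + question
print(prompt)
```

&lt;p&gt;An LLM conditioned on this prompt will tend to ground its answer in the injected fact rather than in whatever its pre-training data happened to say.&lt;/p&gt;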

&lt;h2&gt;
  
  
  What ChatGPT really is
&lt;/h2&gt;

&lt;p&gt;After spending countless hours interacting with ChatGPT and being amazed by its hilarious jokes and useful advice, a lot of folks see it as a major milestone in the development of conversational AI or creative AI. However, its true significance lies in its &lt;strong&gt;use of large language models (LLMs) for search purposes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;By leveraging LLMs to &lt;strong&gt;store&lt;/strong&gt; and &lt;strong&gt;retrieve&lt;/strong&gt; vast amounts of data, ChatGPT has become the most advanced search engine currently available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;While its responses may appear creative, they're actually "just" the result of interpolating and combining pre-existing information.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ChatGPT is search
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT is a search engine at its core.&lt;/strong&gt; Just as Google indexes web pages by crawling the internet and storing parsed information in a database, ChatGPT uses LLMs as a database to store vast amounts of commonsense knowledge from corpora.&lt;/p&gt;

&lt;p&gt;When you enter a query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The LLM processes it with its encoder network, converting the input sequence into a high-dimensional representation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The decoder network then uses this representation, along with its pre-trained weights and attention mechanism, to identify the specific piece of factual information requested by the query and search the LLM's internal representation of this knowledge (or its nearest neighbors).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once the relevant information has been retrieved, the decoder network uses its natural language generation capabilities to compose a response sequence stating this fact.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This process occurs in a fraction of a second, allowing ChatGPT to provide near-instantaneous answers to a wide range of queries.&lt;/p&gt;
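&lt;p&gt;To make the "store and retrieve" analogy concrete, here's a minimal, purely illustrative sketch: facts are embedded, and a query returns its single nearest stored fact. The bag-of-words embedding and the example sentences are our own inventions; a real LLM stores knowledge implicitly in its weights rather than in an explicit index.&lt;/p&gt;

```python
# Toy illustration of the "LLM as search engine" analogy: store facts,
# then retrieve the top-1 nearest fact for a query. Everything here
# (sentences, embedding, similarity) is invented for illustration.
from collections import Counter
import math

facts = [
    "Jina AI builds multimodal AI tooling",
    "Bloom filters answer membership queries probabilistically",
    "Transformers use attention to encode sequences",
]

def embed(text):
    # Stand-in "encoder": a bag-of-words vector
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def answer(query):
    # Stand-in "decoder": return the single nearest stored fact (top-1)
    return max(facts, key=lambda f: cosine(embed(query), embed(f)))

print(answer("what do transformers use to encode a sequence?"))
```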

&lt;h2&gt;
  
  
  ChatGPT is a modern Google search
&lt;/h2&gt;

&lt;p&gt;ChatGPT can be a formidable competitor to traditional search engines like Google. While traditional search engines are extractive and discriminative, &lt;strong&gt;ChatGPT's search is generative and focuses on top-1 performance,&lt;/strong&gt; providing more personalized and user-friendly results. There are two key reasons why ChatGPT is well-suited to knock Google off its throne:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;ChatGPT always returns a single result to the user. Unlike traditional search engines, which optimize for the precision and recall of their top-K results, ChatGPT directly optimizes for the top-1 performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ChatGPT phrases its responses in a natural, dialog-like tone, making them easy to understand and interact with. This sets it apart from other search engines, which often return dry, paginated results that are difficult to digest.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The future of search will be driven by its top-1 performance&lt;/strong&gt;, where only the first result will be relevant to users. Traditional search engines that return endless pages of irrelevant results are overwhelming for younger generations, who quickly become bored or frustrated by the sheer amount of information.&lt;/p&gt;

&lt;p&gt;Also, in many scenarios you want just one result: think virtual assistants or smart speakers. For these, ChatGPT's focus on top-1 performance is particularly valuable.&lt;/p&gt;

&lt;h2&gt;
  
  
  ChatGPT is generative but not creative
&lt;/h2&gt;

&lt;p&gt;You can think of the LLM behind ChatGPT as a Bloom filter, a probabilistic data structure used to store information space-efficiently. Bloom filters allow quick, approximate queries, but don't guarantee that the information they return is accurate. For ChatGPT, this means that the responses generated by the LLM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;aren't creative;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;aren't guaranteed to be factual.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
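&lt;p&gt;For readers unfamiliar with Bloom filters, here's a minimal sketch of the data structure behind the analogy. Note the asymmetry: a "no" is always correct, but a "yes" may be a false positive, much like an LLM fluently "recalling" something it never actually stored. The size and hash scheme below are arbitrary choices for illustration.&lt;/p&gt;

```python
# Minimal Bloom filter: fast, space-efficient membership queries
# with a chance of false positives (but never false negatives).
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive `hashes` bit positions by salting a SHA-256 digest
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("the sky is blue")
print(bf.might_contain("the sky is blue"))  # True
print(bf.might_contain("pigs can fly"))     # almost certainly False
```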

&lt;p&gt;To better understand this, let's look at some illustrative examples. To keep it simple, we'll use a set of dots to represent the LLM's training data; in practice, each dot would be a natural language sentence. With this picture, we can see how the LLM behaves at training and query time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1kt6ulhbz92ljegqyyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1kt6ulhbz92ljegqyyj.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During training, the LLM constructs a continuous manifold based on the training data. This allows for the exploration of any point on the manifold. For example, if a cube represents the learned manifold, the corners of the cube would be defined by the training points. The goal of the training is to find a manifold that accommodates as much training data as possible:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc54nrncdz4tw7thrbh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc54nrncdz4tw7thrbh9.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Goldilocks tried three manifolds. The first was too trivial. The third was too complex. But the second was just right.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At query time, the LLM's answers are always drawn from the learned manifold, which is bounded by the training data. However vast and complex that manifold may be, the LLM is only interpolating between training points. &lt;strong&gt;The LLM's ability to traverse the manifold and produce answers does not constitute creativity.&lt;/strong&gt; Real creativity lies outside the bounds of the learned manifold.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgu0vei0s2zdl9stouex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgu0vei0s2zdl9stouex.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same illustration shows why an LLM can't guarantee factuality. The truthfulness of the training data, represented by the cube's corners, &lt;strong&gt;does not automatically extend to every other point within the manifold&lt;/strong&gt;: interpolating between true statements does not follow &lt;a href="https://en.wikipedia.org/wiki/Logical_reasoning" rel="noopener noreferrer"&gt;the principles of logical reasoning&lt;/a&gt;, so it can just as easily yield false ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprr0ufvzqh6pzahq65q7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprr0ufvzqh6pzahq65q7.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dw6xsndt1go83syqv21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dw6xsndt1go83syqv21.png" alt="Image description" width="800" height="299"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;ChatGPT has been criticized for its inability to tell the truth in some situations. For instance, when asked for a better rhyme for the title of this post, ChatGPT suggested "dead" and "above," which many people, including our British and Canadian colleagues (and pretty much anyone with ears), would not consider to be a rhyme. This is just one example of the limitations of LLMs.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  As SEO wanes, LLMO rises
&lt;/h2&gt;

&lt;p&gt;In the world of SEO, you want to increase a website's visibility on search engines to capture more business. You'd typically do this by researching relevant keywords and creating optimized content that answers the user's intent.&lt;/p&gt;

&lt;p&gt;However, what happens when everyone searches for information in a new way? Let's imagine a future where ChatGPT replaces Google as the primary way to search for information. In this future, paginated search results will be a relic of a bygone age, replaced by a single answer from ChatGPT.&lt;/p&gt;

&lt;p&gt;If this happens, all current SEO strategies will go down the drain. The question then becomes, &lt;strong&gt;how can your business ensure it gets mentioned in ChatGPT's answers?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a real problem already. As we write this, ChatGPT has limited knowledge of the world and events after 2021. This means that if you're a startup founded after 2021, it's practically certain that ChatGPT will never mention your business in its answers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1ostfijg85yyopinwal.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1ostfijg85yyopinwal.png" alt="Image description" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ChatGPT is aware of Jina/Jina AI but not DocArray. This is because &lt;a href="https://jina.ai/news/donate-docarray-lf-for-inclusive-standard-multimodal-data-model/" rel="noopener noreferrer"&gt;DocArray was created in Jan. 2022&lt;/a&gt;, outside of ChatGPT's training data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To address this and ensure that your business is included in ChatGPT's answers, you need to find a way to make your information known to the LLM. This shares the same idea as SEO, which is why we call it LLMO. In general, LLMO could potentially involve the following techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Providing information directly to the creators of ChatGPT: extremely tough, as OpenAI has disclosed neither the sources of its training data nor how it weighs them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fine-tuning ChatGPT or the LLM behind it: challenging but doable if OpenAI releases a fine-tuning API, or if you have the knowledge and GPU resources to fine-tune an LLM yourself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In-context learning, i.e. giving a few examples as predefined contextual prompts: the most feasible and easiest of the three.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
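&lt;p&gt;As a sketch of the third option, context injection can be as simple as prepending facts about your business to every query before it reaches the model. The &lt;code&gt;build_prompt&lt;/code&gt; helper and the prompt format below are hypothetical; a real system would pass the result to whatever LLM API it uses.&lt;/p&gt;

```python
# Hypothetical LLMO-by-context-injection sketch: prepend business facts
# so the model can mention them. Names and format are invented.
CONTEXT_PROMPT = (
    "DocArray is a library for representing, sending and storing "
    "multimodal data, donated to the Linux Foundation by Jina AI."
)

def build_prompt(user_query, context=CONTEXT_PROMPT):
    # The user never sees the injected context, only their own query.
    return f"Context: {context}\n\nUser: {user_query}\nAssistant:"

prompt = build_prompt("Is there a Python library for multimodal data?")
print(prompt)
```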

&lt;p&gt;&lt;strong&gt;What is in-context learning?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In-context learning is a technique that lets a language model learn a task from only a few examples. The approach was popularized by the original GPT-3 paper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Give the language model a prompt with a list of input-output pairs demonstrating a task.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Append a test input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The language model makes a prediction by conditioning on the prompt and predicting the next tokens.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To correctly respond to the prompts, the model has to understand the input distribution, output distribution, input-output mapping, and formatting. This lets the model learn the task without extensive training data.&lt;/p&gt;
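&lt;p&gt;The three steps above can be sketched in a few lines. The sentiment task and its demonstrations are invented for illustration; in practice the assembled prompt would be sent to the LLM, which completes the final &lt;code&gt;Sentiment:&lt;/code&gt; line.&lt;/p&gt;

```python
# In-context learning, step by step: demonstrations as input-output
# pairs, then a test input appended for the model to complete.
demonstrations = [
    ("great movie, loved it", "positive"),
    ("total waste of time", "negative"),
    ("an instant classic", "positive"),
]

def few_shot_prompt(test_input):
    # Step 1: format each demonstration as an input-output pair
    lines = [f"Review: {x}\nSentiment: {y}" for x, y in demonstrations]
    # Step 2: append the test input; step 3 is the model's next-token prediction
    lines.append(f"Review: {test_input}\nSentiment:")
    return "\n\n".join(lines)

print(few_shot_prompt("the plot made no sense"))
```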

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo893nrttoha98zla6w4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo893nrttoha98zla6w4w.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Using in-context learning, ChatGPT can now mention DocArray for a user query. The user won't see the context prompts&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In-context learning has mostly replaced fine-tuning for language models. On natural language processing benchmarks, it has been shown to be competitive with models trained on more data, and it has also been successful on the LAMBADA and TriviaQA benchmarks. One of its most exciting aspects is the range of applications it lets people build quickly, like generating code from natural language and generalizing spreadsheet functions. A prototype usually requires just a few training examples, and the approach is easy for non-experts to use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does in-context learning sound like magic?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why is in-context learning surprising? Unlike conventional machine learning, it doesn't involve optimizing any parameters. Consequently, rather than requiring a separate copy of the model for each downstream task, a single generalist model can serve many different tasks simultaneously. On its own, though, this is hardly unique: meta-learning methods have long been used to train models that learn from examples.&lt;/p&gt;

&lt;p&gt;The real mystery is that LLMs aren't usually trained to learn from examples. This creates a mismatch between the pretraining task (which focuses on next token prediction) and the task of in-context learning (which involves learning from examples).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does in-context learning even work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But how does it even work? LLMs are trained on a large amount of text data, which lets them capture a wide range of patterns and regularities in natural language. This gives them a rich representation of the language's underlying structure, which they use to learn new tasks from examples. In-context learning takes advantage of this by giving the LM a prompt with a few examples demonstrating a specific task. The LM uses this information to make predictions and complete the task without additional training data or parameter optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A deeper understanding of in-context learning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's still a lot of work to do to fully understand and optimize in-context learning. For example, at EMNLP 2022, Sewon Min et al. showed that the prepended examples might not even need to be correct: randomly assigned labels work nearly as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhug6tl1fcnb7olqpj7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhug6tl1fcnb7olqpj7x.png" alt="Image description" width="800" height="1130"&gt;&lt;/a&gt;&lt;/p&gt;
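&lt;p&gt;To mirror that finding, one can build the same kind of few-shot prompt but assign the demonstration labels at random; the surprising result is that such prompts often perform nearly as well as correctly labeled ones. The task and examples below are invented for illustration.&lt;/p&gt;

```python
# Few-shot prompt with randomly assigned demonstration labels,
# mirroring the Min et al. setup. Task and reviews are invented.
import random

pairs = [
    ("great movie, loved it", "positive"),
    ("total waste of time", "negative"),
    ("an instant classic", "positive"),
]

def random_label_prompt(test_input, seed=0):
    rng = random.Random(seed)
    labels = ["positive", "negative"]
    # Discard the true labels; sample replacements uniformly at random
    lines = [f"Review: {x}\nSentiment: {rng.choice(labels)}" for x, _ in pairs]
    lines.append(f"Review: {test_input}\nSentiment:")
    return "\n\n".join(lines)

print(random_label_prompt("the plot made no sense"))
```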

&lt;p&gt;In work by Sang Michael Xie et al., the authors propose a framework for understanding how language models (LMs) perform in-context learning. According to this framework, the LM uses the prompt to "locate" the relevant concept (learned during pretraining) needed to complete the task. This can be viewed as a form of Bayesian inference, where the latent concept is inferred from the information provided in the prompt, made possible by the structure and coherence of the pretraining data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mqovqitp9t6gab4p47f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mqovqitp9t6gab4p47f.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aclanthology.org/2021.emnlp-main.243.pdf" rel="noopener noreferrer"&gt;At EMNLP 2021, Brian Lester et al.&lt;/a&gt; showed that in-context learning (which they call "prompt design") is only effective on large models, and that downstream task quality still lags far behind fine-tuned LLMs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2y6ud9rgvvoylncmr8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2y6ud9rgvvoylncmr8g.png" alt="Image description" width="682" height="792"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Known limitations of in-context learning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In-context learning on LLMs still has quite a few limitations and open problems, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inefficiency&lt;/strong&gt;: The prompt has to be processed every time the model makes a prediction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Poor performance&lt;/strong&gt;: Prompting generally performs &lt;a href="https://arxiv.org/abs/2005.14165" rel="noopener noreferrer"&gt;worse than fine-tuning&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sensitivity&lt;/strong&gt; to &lt;a href="https://aclanthology.org/2022.naacl-main.167/" rel="noopener noreferrer"&gt;prompt wording&lt;/a&gt;, &lt;a href="http://proceedings.mlr.press/v139/zhao21c/zhao21c.pdf" rel="noopener noreferrer"&gt;order of examples&lt;/a&gt;, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lack of clarity&lt;/strong&gt; regarding what the model learns from the prompt. &lt;a href="https://arxiv.org/abs/2202.12837" rel="noopener noreferrer"&gt;Even random labels work&lt;/a&gt;!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As the field of search and large language models continues to evolve, businesses must stay up to date on the latest developments and prepare for changes in how we search for information. &lt;strong&gt;In a world dominated by LLMs like ChatGPT, staying ahead of the curve and integrating your business into these systems can be the key to ensuring visibility and relevance.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In-context learning shows promise to inject information into an existing LLM at a low cost. This approach requires very few training examples to get a prototype working, and the natural language interface is intuitive even for non-experts. However, you should consider the potential ethical implications of using LLMs for business purposes, as well as potential risks and challenges associated with relying on these systems for critical tasks.&lt;/p&gt;

&lt;p&gt;Overall, the future of ChatGPT and LLMs presents opportunities and challenges for businesses. &lt;strong&gt;By staying informed and adaptable, you can ensure that your business thrives in the face of changing &lt;a href="https://jina.ai/news/what-is-neural-search-and-learn-to-build-a-neural-search-engine/" rel="noopener noreferrer"&gt;neural search&lt;/a&gt; technology.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you found this article helpful, you can find more multimodal AI articles &lt;a href="https://jina.ai/news/?utm_campaign=jina-news&amp;amp;utm_source=dev.to&amp;amp;utm_medium=post"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
  </channel>
</rss>
