Marko Vidrih
Build Your Own Embedding Models Using LLMs

In our ongoing exploration of the latest AI advancements, this article focuses on the vital role of embeddings in deep learning, particularly when working with large language models (LLMs). The quality of its embeddings directly affects how well a model performs across applications.

Ideally, every application would use a bespoke embedding model tailored to its data. Building such models is challenging, however, so developers often fall back on pre-existing, general-purpose embedding models.

A novel approach from Microsoft researchers offers a promising solution: it simplifies and reduces the cost of developing customized embedding models. The method replaces traditional BERT-like encoders with open-source LLMs, streamlining retraining, and uses proprietary LLMs such as GPT-4 to autonomously produce labeled training data. This paves the way for innovative LLM applications and lets organizations build embedding models tailored to their specific needs.

The Complexities of Embedding Model Development 

Embedding models are crucial for translating input data into numerical representations that capture its key attributes. Word embeddings, for instance, encode the semantic essence of words, while sentence embeddings capture the interplay of words within a sentence. Similarly, image embeddings reflect the visual attributes of their subjects. These embeddings are instrumental in tasks such as measuring the similarity of words, sentences, or documents.
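To make the idea concrete, here is a minimal sketch of comparing embeddings with cosine similarity. The four-dimensional vectors are made-up illustrative values, not output from a real model:

```python
import math

def cosine_similarity(a, b):
    # Compare two embedding vectors by the angle between them:
    # 1.0 means identical direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "word embeddings" (illustrative values only).
embeddings = {
    "king":  [0.80, 0.65, 0.10, 0.05],
    "queen": [0.75, 0.70, 0.15, 0.10],
    "apple": [0.05, 0.10, 0.90, 0.80],
}

# Semantically related words score higher than unrelated ones.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```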

One significant application of embeddings is in retrieval augmented generation (RAG) with LLMs. Here, embeddings assist in identifying and retrieving documents relevant to a given prompt. The LLM then integrates the content of these documents into its response, enhancing accuracy and reducing reliance on information outside its training dataset.
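A bare-bones sketch of the retrieval step in RAG might look like the following. The `embed` function here is a deterministic toy (hashing character bigrams into a small vector); a real system would call an actual embedding model instead:

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model: hash character bigrams
    # into a 16-dimensional vector and L2-normalize it.
    vec = [0.0] * 16
    for i in range(len(text) - 1):
        idx = (ord(text[i]) * 31 + ord(text[i + 1])) % 16
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query, documents, k=2):
    # Return the k documents whose embeddings best match the query.
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(d))), d) for d in documents]
    scored.sort(reverse=True)
    return [d for _, d in scored[:k]]

docs = ["cats purr when content",
        "dogs bark at strangers",
        "interest rates rose again"]
context = "\n".join(retrieve("why do cats purr", docs, k=1))
prompt = f"Answer using this context:\n{context}\n\nQuestion: why do cats purr"
```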

The efficacy of RAG hinges heavily on the embedding model's quality. Ineffective embeddings may not accurately match documents to user prompts, hindering the retrieval of pertinent documents.

Customizing embedding models with specific data is one approach to enhance their relevance for particular applications. However, the prevalent method involves a complex, multi-stage training process, initially using large-scale, weakly-supervised text pairs for contrastive learning, followed by fine-tuning with a smaller, high-quality, and meticulously labeled dataset.
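The contrastive-learning stage typically optimizes an InfoNCE-style objective: pull a query toward its positive document and push it away from negatives. A plain-Python sketch of that loss (assuming L2-normalized vectors) could look like this:

```python
import math

def info_nce_loss(query_vec, positive_vec, negative_vecs, temperature=0.05):
    # Contrastive (InfoNCE) loss: cross-entropy where the positive pair
    # is the "correct class" among positive + negative candidates.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(query_vec, positive_vec) / temperature]
    logits += [dot(query_vec, n) / temperature for n in negative_vecs]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]
```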

This method demands significant effort to curate relevant text pairs and often relies on manually compiled datasets that are limited in scope and linguistic variety. Hence, many developers stick with generic embedding models, which may not fully meet their application needs.

Revolutionizing Embedding Models with LLMs 

Microsoft's innovative technique diverges from the standard two-stage process, instead proposing a single-stage training approach using proprietary LLMs like GPT-4. This method starts with GPT-4 generating a range of potential embedding tasks. These tasks are then used to prompt the model to create training examples.

For instance, the first stage produces a list of abstract task descriptions, such as locating case law relevant to a specific legal argument or finding recipes based on a given list of ingredients.

Prompt for generating high-level retrieval tasks (source: arxiv)
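The exact wording is shown in the paper's figure; as a rough, hypothetical reconstruction, the first-stage prompt could be assembled like this (the wording and the `n_tasks` parameter are illustrative, not the paper's actual text):

```python
def build_task_brainstorm_prompt(n_tasks=20):
    # Hypothetical reconstruction of the first-stage prompt;
    # the paper's actual wording differs.
    return (
        f"Brainstorm a list of {n_tasks} potentially useful text retrieval tasks.\n"
        "Each task should briefly describe what the user is searching for and\n"
        "what a relevant document looks like, e.g. 'Given a legal argument,\n"
        "find case law that supports it.'\n"
        "Respond as a JSON list of strings."
    )

print(build_task_brainstorm_prompt())
```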

The next step involved submitting one of these tasks to GPT-4, which then generated a JSON structure containing a specific user prompt and corresponding positive and negative examples, each about 150 words. The results were impressively accurate, save for a minor discrepancy in the hard negative example, which could potentially skew the embeddings.

Prompt for generating examples for a retrieval task (source: arxiv)
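The generated JSON can then be parsed into (query, positive, hard negative) triplets for contrastive fine-tuning. The field names below are illustrative guesses, not the paper's actual schema:

```python
import json

# Hypothetical shape of one generated example; field names are
# illustrative, not the paper's actual schema.
raw = """{
  "user_query": "best way to store fresh basil",
  "positive_document": "Fresh basil keeps longest with its stems in a glass of water on the counter...",
  "hard_negative_document": "Dried basil retains flavor best in an airtight jar away from light..."
}"""

def to_training_triplet(json_str):
    # Parse one generated example into a (query, positive, hard negative)
    # triplet for contrastive fine-tuning.
    ex = json.loads(json_str)
    return (ex["user_query"],
            ex["positive_document"],
            ex["hard_negative_document"])
```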

Although the researchers have not released their source code or data, this Python notebook offers a glimpse into the streamlined process, highlighting its adaptability and potential for customization.

To broaden the dataset's diversity, the team designed multiple prompt templates and combined them, generating over 500,000 examples with 150,000 unique instructions using GPT-3.5 and GPT-4 through Azure OpenAI Service. The total token usage was around 180 million, at a cost of approximately $5,000.
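Those figures imply a blended price of roughly $28 per million tokens across the GPT-3.5/GPT-4 mix, which a quick back-of-envelope check confirms:

```python
total_tokens = 180_000_000         # ~180M tokens reported
total_cost_usd = 5_000             # ~$5,000 reported
cost_per_million = total_cost_usd / (total_tokens / 1_000_000)
print(round(cost_per_million, 2))  # ≈ 27.78 USD per million tokens (blended)
```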

Interestingly, the training employed an open-source auto-regressive model rather than the typical choice of a bidirectional encoder such as BERT. The rationale is that these models, already pre-trained on vast datasets, can be fine-tuned for embedding tasks at minimal cost.
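Decoder-only models produce one hidden state per token, so a pooling rule is needed to get a single vector per text; the paper appends an end-of-sequence token and takes that last token's hidden state. A simplified sketch of last-token pooling, using plain Python lists in place of tensors and assuming right-padding:

```python
def last_token_pool(hidden_states, attention_mask):
    # Pool a decoder-only LM's per-token hidden states into one embedding
    # per sequence by taking the last non-padding token's hidden state.
    # hidden_states: list of sequences, each a list of per-token vectors.
    # attention_mask: matching list of 0/1 lists marking real tokens.
    pooled = []
    for seq, mask in zip(hidden_states, attention_mask):
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(seq[last])
    return pooled
```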

They validated their method on Mistral-7B using the synthetic data plus 13 public datasets. With techniques like LoRA, they reduced training costs and achieved state-of-the-art results on well-known benchmarks, surpassing OpenAI's Ada-002 and Cohere's models in RAG and embedding-quality evaluations.
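LoRA cuts fine-tuning cost by freezing the base weight matrix W and training only two small low-rank matrices A and B, so the effective weight becomes W + (alpha/r)·BA. A toy plain-list illustration of that forward pass:

```python
def lora_forward(x, W, A, B, alpha=16, r=2):
    # LoRA forward pass: the frozen base projection W @ x plus a trainable
    # low-rank correction (alpha/r) * B @ (A @ x).
    # W: d_out x d_in, A: r x d_in, B: d_out x r, all as lists of rows.
    def matvec(M, v):
        return [sum(m * x_ for m, x_ in zip(row, v)) for row in M]
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]
```

With A and B initialized so that BA = 0, the adapted model starts out identical to the frozen base model, which is exactly why LoRA training is stable and cheap.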

LLMs and the Future of Embeddings

The study underscores that extensive auto-regressive pre-training allows LLMs to develop robust text representations, making only minor fine-tuning necessary to convert them into efficient embedding models.

The findings also indicate the feasibility of using LLMs to generate apt training data for fine-tuning embedding models cost-effectively. This has significant implications for future LLM applications, enabling organizations to develop custom embeddings for their specific needs.

The researchers suggest that generative language modeling and text embeddings are intrinsically linked, both requiring deep language comprehension by the model. They propose that a robust LLM should be capable of autonomously generating training data for an embedding task and then be fine-tuned with minimal effort. While their experiments offer promising insights, further research is needed to fully exploit this potential.


Follow me on social media:
https://twitter.com/nifty0x
https://www.linkedin.com/in/marko-vidrih/
