Julia Zhou for LLMWare

How I Learned Generative AI in Two Weeks (and You Can Too): Part 2 - Embeddings

Introduction

A few weeks ago, I shared my experience learning about Generative AI Libraries through LLMWare's Fast Start to RAG example 1. Today, I will continue this series by taking you through example 2. This is personally one of my favorite "lessons" in this LLMWare series, so I hope you will find it thought-provoking as well! This example will focus on embeddings and vectors. Let us start by exploring what exactly these terms mean!

How do embedding models work?

Embedding models are trained on large amounts of text, typically by learning to predict the next token or to fill in missing tokens. In either case, these models learn how to represent language! They take chunks of text as input and process them through tokenization (breaking the text into smaller pieces), conversion of those pieces into numbers, and several layers of transformations. These steps build a representation of the input text that becomes the output: vectors.
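To make the tokenization step concrete, here is a tiny sketch using a Hugging Face tokenizer. This is purely illustrative and not part of the LLMWare example; I am using the open MiniLM tokenizer since it belongs to the same small sentence-transformer family we will use later.

# Illustrative only: peek at how text gets split into tokens and mapped to ids
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

text = "Executive incentive compensation is reviewed annually."
tokens = tokenizer.tokenize(text)     # break the text into sub-word pieces
token_ids = tokenizer.encode(text)    # map each piece to an integer id

print("tokens: ", tokens)
print("token ids: ", token_ids)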

Vectors are created when the input text is translated into the language through which the model sees the world. Geometrically speaking, each vector is a point in an n-dimensional space, where "n" is the number of embedding dimensions (common values are 384, 768, and 1536). A vector is simply a list of n floats, usually ranging between 0 and 1 or between -1 and 1. Converting text to numbers like this makes it much easier to compare how similar two pieces of text are.
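Here is a minimal sketch of what that looks like in practice, using the sentence-transformers package directly. My assumption is that "all-MiniLM-L6-v2" is the open model family behind LLMWare's "mini-lm-sbert"; either way, it produces 384-dimensional vectors and is only meant to illustrate the idea.

# Sketch: turn a piece of text into an embedding vector and inspect it
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("incentive compensation")

print("embedding dimensions: ", len(vector))   # 384 for this model
print("first few values: ", vector[:5])        # floats roughly between -1 and 1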

Try thinking back to high school geometry! Two points that are close to each other are more similar than two points that are far apart. This is exactly how texts are compared, and the process is known as semantic search: once a query is converted to a vector, that vector is compared against all the other vectors in the database, and the texts whose vectors are closest (most similar) are returned.
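Here is a small, self-contained sketch of that idea: embed a query and a few candidate sentences, then rank the candidates by cosine distance. The model name and the sample sentences are mine, chosen only for illustration.

# Sketch of a semantic search done by hand: smaller distance = more similar
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

candidates = [
    "The executive is eligible for an annual incentive bonus.",
    "The office is located in downtown Chicago.",
    "Base salary is paid in bi-weekly installments.",
]

query_vec = model.encode("incentive compensation")
candidate_vecs = model.encode(candidates)

def cosine_distance(a, b):
    # 1 - cosine similarity, so closer texts get smaller values
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for text, vec in zip(candidates, candidate_vecs):
    print(round(cosine_distance(query_vec, vec), 3), "-", text)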

Now, we are ready to take a look at the example's code! This LLMWare Fast Start example can be run in the same way as example 1; instructions can be found in our README file if needed. Example 2 is directly copy-paste ready!

Example 2: Embeddings

Extra resources

In case you missed it, here is the previous article in this series; this example continues building on the foundation we laid in example 1. Example 2 uses the same process for creating libraries, so I will skip over it here.

Article - Example 1: Libraries

For visual learners, here is a video that works through example 2. Feel free to watch the video before following the steps in this article. Also, here is a Python Notebook that breaks down this example's code alongside the output.

Example 2 Notebook

Part 1 - Creating embeddings & storing vectors

As mentioned above, we will not cover the library building process in this article and will move directly into embedding models. For this demo, we will use the "mini-lm-sbert" model, which is efficient and is included in the default LLMWare package. Feel free to experiment with different models, including OpenAI's text-embedding-ada-002!

Recall that in example 1, we not only created our library but also added our documents into a database. This database will make it extremely convenient to access text chunks that we can give to the embedding model.
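If you are picking up the library you built in example 1, the setup before the embedding step looks roughly like this. This is a minimal sketch: the library name matches the one that appears in the output later in this article, and reopening it with Library().load_library is my assumption about the simplest way to reuse an existing library.

# Sketch: reopen the library from example 1 and choose the model and vector db
from llmware.library import Library

library_name = "example2_library"
library = Library().load_library(library_name)

embedding_model = "mini-lm-sbert"   # small, fast model bundled with LLMWare
vector_db = "faiss"                 # local vector database used in this demo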

Once the library has been created, let us focus our attention on the most important line of code:

library.install_new_embedding(embedding_model_name=embedding_model, vector_db=vector_db, batch_size=100)

This line calls the install_new_embedding function and passes in the embedding model name and vector database name as parameters. The final parameter, batch_size, determines how many text chunks will be processed at a time. Considerations like efficiency, memory, model capability, and database size all factor into choosing the most appropriate batch size.

We can then confirm that the embedding creation and vector storage were successful:

# check the embedding status record for this library and model
update = Status().get_embedding_status(library_name, embedding_model)
print("update: Embeddings Complete - Status() check at end of embedding - ", update)

Part 2 - Queries

Now that we have the vector database, we can begin running queries on it! We will start with a very simple query, pass it to the library, and run a semantic query against the embeddings.

sample_query = "incentive compensation"
query_results = Query(library).semantic_query(sample_query, result_count=20)

We will use the following code to iterate through the query results and view them, paying special attention to the distance value.

for i, entries in enumerate(query_results):

  # each result carries the matched text plus metadata about where it came from
  text = entries["text"]
  document_source = entries["file_source"]
  page_num = entries["page_num"]
  vector_distance = entries["distance"]

  # truncate long text chunks so the output stays readable
  if len(text) > 125: text = text[0:125] + " ... "

  print("\nupdate: query results - {} - document - {} - page num - {} distance - {} ".format(i, document_source, page_num, vector_distance))

  print("update: text sample - ", text)

Let us run the example to see the results in action!

Part 3 - The results

Through the output, we can see that at first, we have no embeddings.

embedding record - before embedding  [{'embedding_status': 'no', 'embedding_model': 'none', 'embedding_db': 'none', 'embedded_blocks': 0, 'embedding_dims': 0, 'time_stamp': 'NA'}]

Then, a series of outputs shows the embeddings being created in batches of 100, as expected. By the end, all 2,211 text chunks have been converted to vectors.

update: Embeddings Complete - Status() check at end of embedding -  [{'_id': 2, 'key': 'example2_library_embedding_mini-lm-sbert', 'summary': '2211 of 2211 blocks', 'start_time': '1717690179.087806', 'end_time': '1717690199.5373614', 'total': 2211, 'current': 2211, 'units': 'blocks'}]


Now we arrive at the output of the query-results for-loop shown above. Looking at the first result, we can see that one of the many metadata fields returned is distance. This value is the distance between the vector for our query ("incentive compensation") and the vector for this text block.

update: query results - 0 - document - Artemis Poseidon EXECUTIVE EMPLOYMENT AGREEMENT.pdf - page num - 4 distance - 0.24837934970855713 

The query results are sorted from lowest to highest distance - that is, from most to least similar. For comparison, we can see that the result at index 10 has a higher distance than the first one (index 0)!

update: query results - 10 - document - Eileithyia EXECUTIVE EMPLOYMENT AGREEMENT.pdf - page num - 3 distance - 0.27305811643600464 
update: text sample -  in Employer's annual cash incentive   bonus plan (the “Plan”), based on the same terms and conditions as in existence for oth ... 
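Because the results come back sorted by distance, keeping only the strongest matches is a one-liner. Here is a small sketch (the 0.3 cutoff is an arbitrary threshold I picked for illustration; the right value depends on your documents and model):

# keep only the results whose vectors sit close to the query vector
close_matches = [r for r in query_results if r["distance"] < 0.3]
print("update: close matches - ", len(close_matches))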

Part 4 - Further exploration

For this example, we used the "faiss" vector database, but I encourage you to experiment with others as well.
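Switching databases is mostly a matter of changing the vector_db parameter, provided the database you name is installed and running. Here is a sketch using "chromadb" as an example (check the LLMWare docs for the exact names of the supported databases):

# same embedding model, different vector database (the database must be set up first)
library.install_new_embedding(embedding_model_name="mini-lm-sbert",
                              vector_db="chromadb",
                              batch_size=100)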

Similarly, try using different embedding models to see how each one is better suited to certain types of inputs! A series of examples involving embeddings can be found on the LLMWare GitHub page.

Embeddings Examples
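If you want to try a second model on the same library, my understanding is that a library can hold more than one embedding at a time, so you can compare models side by side. Here is a sketch using OpenAI's model (this assumes an OpenAI API key is configured and that "text-embedding-ada-002" is the registered model name in the LLMWare catalog):

# add a second embedding to the same library with a different model
library.install_new_embedding(embedding_model_name="text-embedding-ada-002",
                              vector_db=vector_db,
                              batch_size=100)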

I hope you enjoyed this example about embeddings and vectors! The next example will be about prompts and models, so stay tuned for that article.

Happy coding!

To see more ...

Please join our LLMWare community on Discord to learn more about RAG and LLMs! https://discord.gg/5mx42AGbHm

Visit LLMWare's Website

Explore LLMWare on GitHub

