<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Akriti Upadhyay</title>
    <description>The latest articles on DEV Community by Akriti Upadhyay (@akritiu).</description>
    <link>https://dev.to/akritiu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1233352%2Ffe43a551-65fb-4f2b-bde2-83efd2634b6f.png</url>
      <title>DEV Community: Akriti Upadhyay</title>
      <link>https://dev.to/akritiu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akritiu"/>
    <language>en</language>
    <item>
      <title>Perform Image-Driven Reverse Image Search on E-Commerce Sites with ImageBind and Qdrant</title>
      <dc:creator>Akriti Upadhyay</dc:creator>
      <pubDate>Wed, 28 Feb 2024 13:55:27 +0000</pubDate>
      <link>https://dev.to/akritiu/perform-image-driven-reverse-image-search-on-e-commerce-sites-with-imagebind-and-qdrant-1m6b</link>
      <guid>https://dev.to/akritiu/perform-image-driven-reverse-image-search-on-e-commerce-sites-with-imagebind-and-qdrant-1m6b</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In 1950, when Alan Turing introduced the term "Machine Intelligence" in his paper "Computing Machinery and Intelligence," no one imagined that it would one day lead to innovations using artificial intelligence across many domains. One domain that is especially popular and important among users is online shopping. With the surge in e-commerce, users increasingly rely on visual cues to guide their purchasing decisions. In response to this shift in consumer behavior, image-driven product search has emerged as a powerful tool for enhancing the shopping experience, and e-commerce platforms like Amazon, Myntra, Ajio, and Meesho use it widely.&lt;/p&gt;

&lt;p&gt;You must be familiar with image-driven searches on shopping websites. This innovative approach uses the visual content in images to let users explore products more intuitively and efficiently. By simply uploading or capturing an image, shoppers can quickly find similar or related items within a vast catalog. Whether seeking fashion inspiration, home decor ideas, or specific product recommendations, image-driven search offers a dynamic and personalized shopping journey tailored to individual preferences and tastes. &lt;/p&gt;

&lt;p&gt;We can make the results more accurate by using ImageBind, the recently developed all-in-one embedding model from Meta. But before using the embedding model, we need a vector database to store the embeddings it produces.&lt;/p&gt;

&lt;p&gt;When it comes to image search, vector databases have been particularly transformative. Traditional image search methods often rely on metadata tags or textual descriptions, which capture only a fraction of an image's rich visual content. With vector databases, images are transformed into high-dimensional vectors that encapsulate their visual features, allowing for more accurate and nuanced similarity comparisons. Users can therefore search for images based on visual similarity, enabling e-commerce product search from images with remarkable precision. However, when it comes to vector databases, we face a dilemma: which database is best for our application? &lt;/p&gt;
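&lt;p&gt;To make the similarity comparison concrete, here is a toy sketch in plain Python (the three-dimensional vectors and product names are invented for illustration; real image embeddings are much higher-dimensional, e.g. 1024 dimensions for ImageBind):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical catalog of product-image embeddings
catalog = {
    "red_dress":  [0.90, 0.10, 0.00],
    "blue_jeans": [0.10, 0.80, 0.20],
    "red_skirt":  [0.85, 0.15, 0.05],
}

# Embedding of the image the shopper uploaded
query = [0.88, 0.12, 0.02]

# The closest product is the one with the highest cosine similarity
best_match = max(catalog, key=lambda name: cosine_similarity(catalog[name], query))
print(best_match)  # red_dress
```

&lt;p&gt;A vector database runs this same comparison over millions of vectors, replacing the exhaustive scan with an approximate index such as HNSW.&lt;/p&gt;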

&lt;p&gt;Here, I have chosen the Qdrant vector database, which uses the HNSW algorithm for fast approximate nearest neighbor search in AI applications. Meesho already uses Qdrant, but the results are still not very accurate. We can improve them by integrating Qdrant with the ImageBind embedding model. Before diving into the content, let’s look at the steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loading the Dataset.&lt;/li&gt;
&lt;li&gt;Initializing the Qdrant Vector DB.&lt;/li&gt;
&lt;li&gt;Image Embeddings with ImageBind.&lt;/li&gt;
&lt;li&gt;Deploying with Gradio.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Reverse Product Image Search with Qdrant
&lt;/h2&gt;

&lt;p&gt;Let's install the dependencies first to get started with the reverse product image search.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

%pip install opendatasets gradio qdrant-client transformers sentence_transformers sentencepiece tqdm


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Loading the Dataset
&lt;/h3&gt;

&lt;p&gt;Using the &lt;code&gt;opendatasets&lt;/code&gt; library, download the &lt;a href="https://www.kaggle.com/datasets/vikashrajluhaniwal/fashion-images" rel="noopener noreferrer"&gt;Kaggle dataset&lt;/a&gt; using your username and key. You can obtain them by visiting the Settings page on Kaggle. Click on "Access API Keys," and a &lt;code&gt;kaggle.json&lt;/code&gt; file will be downloaded. This file will contain your username and API key.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import opendatasets as od
od.download("https://www.kaggle.com/datasets/vikashrajluhaniwal/fashion-images")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now, let’s store the images in a list so that we can easily access the images.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import random
import gradio as gr
from PIL import Image
from qdrant_client import QdrantClient
from qdrant_client.http import models
import tempfile
import os
from tqdm import tqdm


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import os

def get_image_paths(directory):
    # Initialize an empty list to store the image paths
    image_paths = []

    # Iterate through all files and directories within the given directory
    for root, dirs, files in os.walk(directory):
        for file in files:
            # Check if the file has an image extension (e.g., .jpg, .png, .jpeg, etc.)
            if file.lower().endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp')):
                # Construct the full path to the image file
                image_path = os.path.join(root, file)
                # Append the image path to the list
                image_paths.append(image_path)

    return image_paths

# Directory paths
women_directory = './fashion-images/data/Footwear_Women/Images/images_with_product_ids/'
men_directory = './fashion-images/data/Footwear_Men/Images/images_with_product_ids/'
girls_directory = './fashion-images/data/Apparel_Girls/Images/images_with_product_ids/'
boys_directory = './fashion-images/data/Apparel_Boys/Images/images_with_product_ids/'

# Get image paths for different categories
image_paths_Women = get_image_paths(women_directory)
image_paths_Men = get_image_paths(men_directory)
image_paths_Girls = get_image_paths(girls_directory)
image_paths_Boys = get_image_paths(boys_directory)

all_image_paths = []
all_image_paths.append(image_paths_Boys)
all_image_paths.append(image_paths_Girls)
all_image_paths.append(image_paths_Men)
all_image_paths.append(image_paths_Women)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Initializing the Qdrant Vector DB
&lt;/h3&gt;

&lt;p&gt;Initialize the &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; Client with in-memory storage. The collection name will be “imagebind_data” and we will be using cosine distance.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Initialize Qdrant client and load collection
client = QdrantClient(":memory:")
client.recreate_collection(
    collection_name="imagebind_data",
    vectors_config={"image": models.VectorParams(size=1024, distance=models.Distance.COSINE)},
)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Image Embeddings with ImageBind
&lt;/h3&gt;

&lt;p&gt;ImageBind is an innovative model developed by Meta AI’s FAIR Lab. This model is designed to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. One of the key features of ImageBind is its ability to learn this joint embedding without requiring all combinations of paired data. It has been discovered that only image-paired data is necessary to bind the modalities together effectively. This unique capability allows ImageBind to leverage recent large-scale vision-language models and extend their zero-shot capabilities to new modalities simply by utilizing their natural pairing with images. &lt;/p&gt;

&lt;p&gt;We’ll use ImageBind for creating embeddings, but before diving deep, first, let’s follow some steps required for installing ImageBind.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clone the git repository of Imagebind:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

git clone https://github.com/facebookresearch/ImageBind.git


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Change the directory: &lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

cd ImageBind


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Edit the requirements.txt file: remove the &lt;code&gt;mayavi&lt;/code&gt; and &lt;code&gt;cartopy&lt;/code&gt; entries if they cause installation errors.&lt;/li&gt;
&lt;li&gt;Install the requirements:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

pip install -r requirements.txt


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Change back to your project’s root directory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, load the model.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import sys
sys.path.append("./ImageBind/")

import torch
import imagebind
from imagebind.models import imagebind_model

# Fall back to CPU if no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After initializing the model, we will now create embeddings.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from imagebind.models.imagebind_model import ModalityType
from imagebind import data
import torch

embeddings_list = []

for image_paths in [image_paths_Boys, image_paths_Girls, image_paths_Men, image_paths_Women]:
    inputs = {ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device)}
    with torch.no_grad():
        embeddings = model(inputs)
    embeddings_list.append(embeddings)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then we’ll update the Qdrant Vector DB with the generated embeddings.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import uuid

points = []

# Iterate over each embeddings object and its corresponding image paths
for embedding, image_paths in zip(embeddings_list, all_image_paths):
    for sub_idx, sample in enumerate(image_paths):
        # Store the image path as the point's payload
        payload = {"path": sample}
        # Generate a unique UUID for each point
        point_id = str(uuid.uuid4())
        # Convert the tensor to a plain list before storing it in Qdrant
        points.append(models.PointStruct(id=point_id,
                                         vector={"image": embedding['vision'][sub_idx].tolist()},
                                         payload=payload)
                      )

client.upsert(collection_name="imagebind_data", points=points)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We’ll prepare a processing function in which we will take the image as an input and perform a reverse image search with the help of the embeddings.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

def process_text(image_query):

    user_query = [image_query]
    dtype = ModalityType.VISION
    user_input = {dtype: data.load_and_transform_vision_data(user_query, device)}

    with torch.no_grad():
        user_embeddings = model(user_input)

    image_hits = client.search(
        collection_name='imagebind_data',
        query_vector=models.NamedVector(
            name="image",
            vector=user_embeddings[dtype][0].tolist()
            )
    )
    # Check if 'path' is in the payload of the first hit
    if image_hits and 'path' in image_hits[0].payload:
        return (image_hits[0].payload['path'])
    else:
        return None


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Deploying with Gradio
&lt;/h3&gt;

&lt;p&gt;Now that we have prepared the image processing function, we’ll deploy it with Gradio by defining an interface. &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import tempfile
tempfile.tempdir = "./fashion-images/data"

# Gradio Interface
iface = gr.Interface(
    title="Reverse Image Search with Imagebind",
    description="Leveraging Imagebind to perform reverse image search for ecommerce products",
    fn=process_text,
    inputs=[
        gr.Image(label="image_query", type="filepath")
        ],
    outputs=[
        gr.Image(label="Image")],  
)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Image Search Using Product Category
&lt;/h2&gt;

&lt;p&gt;If you want to search images by product category, you need to define a few helper functions. First, we’ll define a function that fetches the images belonging to a category.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Define function to get images of selected category
def get_images_from_category(category):
    # Convert category to string
    category_str = str(category)
    # Directory path for selected category
    category_dir = f"./fashion-images/data/{category_str.replace(' ', '_')}/Images/images_with_product_ids/"
    # List of image paths
    image_paths = os.listdir(category_dir)
    # Open and return images
    images = [Image.open(os.path.join(category_dir, img_path)) for img_path in image_paths]
    return images



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then list the product categories.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Define your product categories
product_categories = ["Apparel Boys", "Apparel Girls", "Footwear Men", "Footwear Women"]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After that, we’ll define a function for category selection.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Define function to handle category selection
def select_category(category):
    # Get images corresponding to the selected category
    images = get_images_from_category(category)
    # Return a random image from the list
    return random.choice(images)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Deploying with Gradio
&lt;/h3&gt;

&lt;p&gt;Now, we’ll create Gradio interface components for the category selection, such as category dropdown and submit button.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Create interface components for the category selection
category_dropdown = gr.Dropdown(product_categories, label="Select a product category")
submit_button = gr.Button()
images_output = gr.Image(label="Images of Selected Category")



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After that, we’ll create a Gradio interface and pass the functions and components.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

category_search_interface = gr.Interface(
    fn=select_category,
    inputs=category_dropdown,
    outputs=images_output,
    title="Category-driven Product Search for Ecommerce",
    description="Select a product category to view a random image from the corresponding directory.",
)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Merging Two Gradio Interfaces
&lt;/h3&gt;

&lt;p&gt;We have deployed two Gradio interfaces: one for reverse image search and another for image search by product category. What if we could see both in one application? We can, using TabbedInterface.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Combine both interfaces into the same API
combined_interface = gr.TabbedInterface([iface, category_search_interface])

# Launch the combined interface
combined_interface.launch(share=True)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now, we’ll get a local URL and a public URL. The application is ready with two tabs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tab 0: Reverse Image Search with ImageBind for E-Commerce&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtah3cg6zn2cmcneefhi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtah3cg6zn2cmcneefhi.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tab 1: Category-Driven Product Search for E-Commerce&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y8i9zs8c09z2qu6wc5u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y8i9zs8c09z2qu6wc5u.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;p&gt;Let’s see how our application performs. &lt;/p&gt;

&lt;p&gt;I passed an image of a girl wearing a sleeveless dress. Let’s see if our application can find a similar image among the products.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhum1jymqpygjdb8c1z44.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhum1jymqpygjdb8c1z44.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The application found a dress which is quite similar. Impressive!&lt;/p&gt;

&lt;p&gt;Let’s try passing shoes as the query image. I uploaded a picture of three people’s feet in sneakers. Let’s see the result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b1mqa5fjbbyqt7tqg48.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b1mqa5fjbbyqt7tqg48.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
The result is impressive. The output is sneakers of the same style.&lt;/p&gt;

&lt;p&gt;Let’s try a category search. For example, I want to see the products in the Boys Apparel category. Gradio outputs only one image at a time, but this application will not return the same image for every search.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zbv7vi1o0ytbplvqp9v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zbv7vi1o0ytbplvqp9v.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
The first search gave an image of a white t-shirt. Let’s search again to see what other t-shirts there are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh543ao9jkuvijs9m8sgz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh543ao9jkuvijs9m8sgz.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result is a blue-gray t-shirt. So, the results are not repeated. Great!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With Qdrant Vector DB, reverse image search is possible for e-commerce products. We saw from our results how to perform a reverse image search by uploading an image and getting images of similar products from select product categories. The results were accurate with the help of the ImageBind embedding model. &lt;/p&gt;

&lt;p&gt;Hope you enjoyed reading this blog. Now it is your turn to try this integration. &lt;/p&gt;

&lt;h2&gt;
  
  
  CodeSpace
&lt;/h2&gt;

&lt;p&gt;You can find the code on &lt;a href="https://github.com/akritiupadhyay-au/How-to-Perform-Image-Driven-Reverse-Image-Search-on-E-Commerce-Sites/tree/main" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

&lt;p&gt;This blog was originally published here: &lt;a href="https://medium.com/@akriti.upadhyay/perform-image-driven-reverse-image-search-on-e-commerce-sites-with-imagebind-and-qdrant-0a62f0169e19" rel="noopener noreferrer"&gt;https://medium.com/@akriti.upadhyay/perform-image-driven-reverse-image-search-on-e-commerce-sites-with-imagebind-and-qdrant-0a62f0169e19&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>imagebind</category>
      <category>qdrant</category>
    </item>
    <item>
      <title>Steps to Build RAG Application with Gemma 7B LLM</title>
      <dc:creator>Akriti Upadhyay</dc:creator>
      <pubDate>Fri, 23 Feb 2024 06:07:17 +0000</pubDate>
      <link>https://dev.to/akritiu/steps-to-build-rag-application-with-gemma-7b-llm-b9a</link>
      <guid>https://dev.to/akritiu/steps-to-build-rag-application-with-gemma-7b-llm-b9a</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As large language models advance, interest in building RAG (Retrieval Augmented Generation) applications keeps growing, and Google has just launched an open-source model: Gemma. RAG fuses two fundamental methodologies: retrieval-based techniques and generative models. Retrieval-based techniques source pertinent information from expansive knowledge repositories or corpora in response to specific queries, while generative models excel at crafting original text by leveraging insights from their training data. With this launch, why not build a RAG pipeline with the new open-source model and see how it performs?&lt;/p&gt;
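&lt;p&gt;The retrieve-then-generate flow can be sketched in a few lines of plain Python. The corpus, word-overlap scoring, and answer template below are hypothetical stand-ins: a real pipeline replaces the scoring with embedding similarity and the template with an LLM call.&lt;/p&gt;

```python
# A tiny hypothetical knowledge corpus
corpus = [
    "Gemma is an open-source LLM family released by Google.",
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Cosmopedia is a large synthetic dataset of textbooks and stories.",
]

def retrieve(query, documents, k=1):
    # Score each document by word overlap with the query
    # (a real RAG pipeline compares embedding vectors instead)
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(query, context):
    # Stand-in for the LLM call: prepend the retrieved context to the prompt
    return f"Context: {' '.join(context)}\nAnswer to '{query}', grounded in the context."

context = retrieve("what is FAISS", corpus)
print(generate("what is FAISS", context))
```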

&lt;p&gt;Let’s get started and break the process into these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loading the Dataset: Cosmopedia&lt;/li&gt;
&lt;li&gt;Embedding Generation with Hugging Face&lt;/li&gt;
&lt;li&gt;Storing in the FAISS DB&lt;/li&gt;
&lt;li&gt;Gemma: Introducing the SOTA model&lt;/li&gt;
&lt;li&gt;Querying the RAG Pipeline &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Building RAG Application on Gemma 7B
&lt;/h2&gt;

&lt;p&gt;Before rolling up our sleeves, let’s install and import the required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install -q -U langchain torch transformers sentence-transformers datasets faiss-cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
from datasets import load_dataset
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import AutoTokenizer
from transformers import AutoTokenizer, pipeline
from langchain import HuggingFacePipeline
from langchain.chains import RetrievalQA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Loading the Dataset: Cosmopedia
&lt;/h3&gt;

&lt;p&gt;To make a RAG application, we have selected a Hugging Face dataset, Cosmopedia. This dataset consists of synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. The dataset contains over 30 million files and 25 billion tokens, which makes it the largest open synthetic dataset to date.&lt;/p&gt;

&lt;p&gt;This dataset contains 8 subsets. We’ll work with the ‘stories’ subset and load it using the datasets library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we will convert it to a Pandas dataframe, and save it to a CSV file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = data.to_pandas()
data.to_csv("dataset.csv")
data.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that the dataset is saved on our system, we will use LangChain to load the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loader = CSVLoader(file_path='./dataset.csv')
data = loader.load()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that the data is loaded, we need to split the documents. Here, we split them into chunks of 1,000 characters with an overlap of 150, which keeps each chunk small enough for the model to work quickly and efficiently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Embedding Generation with Hugging Face
&lt;/h3&gt;

&lt;p&gt;After that, we will generate embeddings using Hugging Face Embeddings and with the help of the Sentence Transformers model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;modelPath = "sentence-transformers/all-MiniLM-l6-v2"
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,     
    model_kwargs=model_kwargs, 
    encode_kwargs=encode_kwargs 
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Storing in the FAISS DB
&lt;/h3&gt;

&lt;p&gt;The embeddings are generated, but we need them to be stored in a vector database. We’ll be saving those embeddings in the FAISS vector store, which is a library for efficient similarity search and clustering dense vectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db = FAISS.from_documents(docs, embeddings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
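&lt;p&gt;Under the hood, a flat FAISS index answers queries with an exhaustive nearest-neighbor scan over the stored vectors. Here is a minimal sketch of that idea in plain Python (the toy 3-dimensional vectors and chunk texts are invented; the real MiniLM embeddings are 384-dimensional):&lt;/p&gt;

```python
def l2_distance(a, b):
    # Euclidean distance between two vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Toy stand-ins for (chunk text, embedding) pairs stored in the index
index = [
    ("a story about a curious kitten", [0.10, 0.90, 0.00]),
    ("a post about village markets",   [0.80, 0.10, 0.10]),
]

query_vec = [0.15, 0.85, 0.05]  # embedding of the user's question

# Exhaustive scan, as in a FAISS IndexFlatL2
nearest_chunk, _ = min(index, key=lambda item: l2_distance(item[1], query_vec))
print(nearest_chunk)  # a story about a curious kitten
```

&lt;p&gt;With the LangChain wrapper, the equivalent call is &lt;code&gt;db.similarity_search(query, k=4)&lt;/code&gt;, which embeds the query and returns the closest document chunks.&lt;/p&gt;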



&lt;h3&gt;
  
  
  Gemma: Introducing the SOTA model
&lt;/h3&gt;

&lt;p&gt;Gemma offers two model sizes, with 2 billion and 7 billion parameters respectively, catering to different computational constraints and application scenarios. Both pre-trained and fine-tuned checkpoints are provided, along with an open-source codebase for inference and serving. It is trained on up to 6 trillion tokens of text data and leverages similar architectures, datasets, and training methodologies as the Gemini models. Both exhibit strong generalist capabilities across text domains and excel in understanding and reasoning tasks on a large scale.&lt;/p&gt;

&lt;p&gt;The release includes raw, pre-trained checkpoints as well as fine-tuned checkpoints optimized for specific tasks such as dialogue, instruction-following, helpfulness, and safety. Comprehensive evaluations have been conducted to assess the models' performance and address shortcomings, enabling thorough investigation of model tuning regimes and the development of safer, more responsible models. Gemma's performance surpasses that of comparable-scale open models across various domains, including question-answering, commonsense reasoning, mathematics and science, and coding, as demonstrated through both automated benchmarks and human evaluations. To learn more, see Google's Gemma technical report.&lt;/p&gt;

&lt;p&gt;To get started with the Gemma model, you must first accept Google's terms on Hugging Face. Then log in with your Hugging Face token.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from huggingface_hub import notebook_login
notebook_login()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Load the model and initialize the tokenizer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b", padding=True, truncation=True, max_length=512)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a text generation pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer,
    return_tensors='pt',
    max_length=512,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initialize the LLM with pipeline and model kwargs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llm = HuggingFacePipeline(
    pipeline=pipe,
    model_kwargs={"temperature": 0.7, "max_length": 512},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it is time to use the vector store and the LLM for question-answering retrieval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Querying the RAG Pipeline
&lt;/h3&gt;

&lt;p&gt;The RAG pipeline is ready; let’s pass the queries and see how it performs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;qa.invoke("Write an educational story for young children.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Once upon a time, in a cozy little village nestled between rolling hills and green meadows, there lived a curious kitten named Whiskers. Whiskers loved to explore every nook and cranny of the village, from the bustling marketplace to the quiet corners where flowers bloomed. One sunny morning, as Whiskers trotted down the cobblestone path, he spotted something shimmering in the distance. With his whiskers twitching in excitement, he scampered towards it, his little paws pitter-pattering on the ground. To his delight, he found a shiny object peeking out from beneath a bush--a beautiful, colorful kite! With a twinkle in his eye, Whiskers decided to take the kite on an adventure. He tugged at the string, and the kite soared into the sky, dancing gracefully with the gentle breeze. Whiskers giggled with joy as he watched the kite soar higher and higher, painting the sky with its vibrant colors.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final Words
&lt;/h2&gt;

&lt;p&gt;The Gemma 7B model performed very well. We got to read a beautiful story about a kitten. The new SOTA model was interesting and exciting to use. With the help of the FAISS vector store, we were able to build a RAG pipeline. Thanks for reading!&lt;/p&gt;

&lt;p&gt;This article was originally published here: &lt;a href="https://blog.superteams.ai/steps-to-build-rag-application-with-gemma-7b-llm-43f7251a36a1"&gt;https://blog.superteams.ai/steps-to-build-rag-application-with-gemma-7b-llm-43f7251a36a1&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to Build an Advanced AI-Powered Enterprise Content Pipeline Using Mixtral 8x7B and Qdrant</title>
      <dc:creator>Akriti Upadhyay</dc:creator>
      <pubDate>Wed, 21 Feb 2024 06:13:43 +0000</pubDate>
      <link>https://dev.to/akritiu/how-to-build-an-advanced-ai-powered-enterprise-content-pipeline-using-mixtral-8x7b-and-qdrant-1dbg</link>
      <guid>https://dev.to/akritiu/how-to-build-an-advanced-ai-powered-enterprise-content-pipeline-using-mixtral-8x7b-and-qdrant-1dbg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As the digital landscape rapidly evolves, enterprises face the challenge of managing and harnessing exponential data growth to drive business success. As content expands in volume and complexity, traditional content management approaches fail to provide the agility and intelligence required to scale or to extract valuable insights.&lt;/p&gt;

&lt;p&gt;The integration of vector databases and mixture-of-experts models such as Mixtral 8x7B offers a transformative solution for enterprises seeking to unlock the full potential of their content pipelines. In this blog post, we will explore the essential components and strategies for building an advanced AI-powered enterprise content pipeline using Mixtral 8x7B and Qdrant, an advanced vector database that uses the HNSW algorithm for approximate nearest neighbor search. &lt;/p&gt;

&lt;p&gt;To build the advanced AI-powered pipeline, we’ll construct a Retrieval-Augmented Generation (RAG) pipeline by following these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loading the Dataset using LlamaIndex&lt;/li&gt;
&lt;li&gt;Embedding Generation using Hugging Face&lt;/li&gt;
&lt;li&gt;Building the Model using Mixtral 8x7B&lt;/li&gt;
&lt;li&gt;Storing the Embedding in the Vector Store&lt;/li&gt;
&lt;li&gt;Building a Retrieval pipeline&lt;/li&gt;
&lt;li&gt;Querying the Retriever Query Engine&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Enterprise Content Generation with Mixtral 8x7B
&lt;/h2&gt;

&lt;p&gt;To build a RAG pipeline with Mixtral 8x7B, we’ll install the following dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install -q llama-index==0.9.3 qdrant-client transformers[torch]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Loading the Dataset Using LlamaIndex
&lt;/h3&gt;

&lt;p&gt;For the dataset, we used the Diffbot Knowledge Graph API. Diffbot is a sophisticated web scraping and data extraction tool that uses artificial intelligence to automatically retrieve and structure data from web pages. Unlike traditional web scraping methods, which rely on manual programming to extract specific data elements, Diffbot uses machine learning algorithms to comprehend and interpret web content much like a human would. This allows it to accurately identify and extract various types of data, including articles, product details, and contact information, from a wide range of websites.&lt;/p&gt;

&lt;p&gt;One of the standout features of Diffbot is its Knowledge Graph Search, which organizes the extracted data into a structured database known as a knowledge graph. A knowledge graph is a powerful representation of interconnected data that enables efficient searching, querying, and analysis. Diffbot's Knowledge Graph Search not only extracts individual data points from web pages but also establishes relationships between them by creating a comprehensive network of information.&lt;/p&gt;

&lt;p&gt;To get the URL, create an account on Diffbot, go to Knowledge Graph, and open Search. Here, we selected Organization in the visual query builder and filtered by Industries -&amp;gt; Pharmaceutical Companies. &lt;/p&gt;

&lt;p&gt;Then, we chose GSK, which is a renowned pharmaceutical company, and clicked Articles. &lt;/p&gt;

&lt;p&gt;After clicking Articles, we got an option to export it as CSV or make an API call.&lt;/p&gt;

&lt;p&gt;We made an API call and used that URL to access the data in Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import json

# The Diffbot API URL
url = "https://kg.diffbot.com/&amp;lt;your-url&amp;gt;"

# Make a GET request to the API
response = requests.get(url)

# Parse the response text as JSON
data = response.json()

# Open a file in write mode
with open('json/response_text.json', 'w') as file:
    # Write the data to the file as JSON
    json.dump(data, file)

print("Response text has been saved to 'json/response_text.json'.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is now saved in a JSON file. Using LlamaIndex’s SimpleDirectoryReader, we will load it from the “json” directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("/home/akriti/Notebooks/json").load_data()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, it’s time to split the documents into chunks using the SentenceSplitter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.node_parser.text import SentenceSplitter

# Create a SentenceSplitter object with a specified chunk size
text_parser = SentenceSplitter(chunk_size=1024)

# Initialize empty lists to store text chunks and corresponding document indexes
text_chunks = []
doc_idxs = []

# Iterate over each document in the 'documents' list along with its index
for doc_idx, doc in enumerate(documents):
    # Split the text of the current document into smaller chunks using the SentenceSplitter
    cur_text_chunks = text_parser.split_text(doc.text)

    # Extend the list of text chunks with the chunks from the current document
    text_chunks.extend(cur_text_chunks)

    # Extend the list of document indexes with the index of the current document,
    # repeated for each corresponding text chunk
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we will create TextNode objects, assigning each node the metadata of its source document so that the relationships between nodes and documents can be managed easily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import the TextNode class from the llama_index schema module
from llama_index.schema import TextNode

# Initialize an empty list to store nodes
nodes = []

# Iterate over each index and text chunk in the text_chunks list
for idx, text_chunk in enumerate(text_chunks):
    # Create a new TextNode object with the current text chunk
    node = TextNode(text=text_chunk)

    # Retrieve the corresponding source document using the document index from doc_idxs
    src_doc = documents[doc_idxs[idx]]

    # Assign the metadata of the source document to the metadata attribute of the node
    node.metadata = src_doc.metadata

    # Append the node to the list of nodes
    nodes.append(node)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Embedding Generation Using Hugging Face
&lt;/h3&gt;

&lt;p&gt;LlamaIndex integrates with many embedding tools; here, we’ll use the Hugging Face embedding integration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import the HuggingFaceEmbedding class from the llama_index embeddings module
from llama_index.embeddings import HuggingFaceEmbedding

# Initialize a HuggingFaceEmbedding object with the specified model name
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

# Iterate over each node in the nodes list
for node in nodes:
    # Get the content of the node along with its metadata
    content_with_metadata = node.get_content(metadata_mode="all")

    # Use the embedding model to get the text embedding for the node's content
    node_embedding = embed_model.get_text_embedding(content_with_metadata)

    # Assign the computed embedding to the embedding attribute of the node
    node.embedding = node_embedding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Building the Model Using Mixtral 8x7B
&lt;/h3&gt;

&lt;p&gt;Mixtral 8x7B is a cutting-edge language model developed by Mistral AI. It is a sparse mixture of experts (MoE) model with open weights, designed to deliver powerful AI capabilities at a fraction of the inference cost of a comparably sized dense model. This model represents a significant advancement in natural language processing by providing a practical and accessible solution for various applications.&lt;/p&gt;

&lt;p&gt;Mixtral 8x7B employs a Mixture of Experts (MoE) architecture and is a decoder-only model. In this architecture, each layer consists of 8 feedforward blocks, referred to as experts. During processing, a router network dynamically selects two experts for each token at every layer, which enables effective information processing and aggregation.&lt;/p&gt;

&lt;p&gt;One of Mixtral 8x7B's standout features is its exceptional performance, characterized by high-quality outputs across diverse tasks. The model is pre-trained on multilingual data with a context size of 32k tokens. It outperforms Llama 2 70B on most benchmarks and matches or exceeds GPT-3.5 on many of them.&lt;/p&gt;

&lt;p&gt;Mixtral 8x7B is also available in an Instruct form, which is supervised fine-tuned on an instruction-following dataset and further optimized through Direct Preference Optimization (DPO) training.&lt;/p&gt;
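
&lt;p&gt;The routing described above can be sketched in a few lines. This is a toy, single-token illustration of top-2 expert selection with scalar expert outputs, not Mistral's actual implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

# Toy sketch of Mixture-of-Experts top-2 routing for one token.
# Real Mixtral layers do this with learned router weights over 8 expert FFNs.
def top2_moe(router_logits, expert_outputs):
    # Pick the two experts with the highest router scores
    top2 = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over the two selected logits gives the mixing weights
    exps = [math.exp(router_logits[i]) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # The token's output is the weighted sum of the two selected experts' outputs
    return sum(w * expert_outputs[i] for w, i in zip(weights, top2))

router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]  # one score per expert
expert_outputs = [float(i) for i in range(8)]                # stand-in scalar outputs
print(top2_moe(router_logits, expert_outputs))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;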

&lt;p&gt;Using Hugging Face and LlamaIndex, we will load the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
from llama_index.llms import HuggingFaceLLM

# Instantiate a HuggingFaceLLM object with specified parameters
llm = HuggingFaceLLM(
    context_window=4096,  # Maximum context window size
    max_new_tokens=256,  # Maximum number of new tokens to generate
    generate_kwargs={"temperature": 0.7, "do_sample": False},  # Generation settings; temperature only applies when do_sample=True
    tokenizer_name="mistralai/Mixtral-8x7B-v0.1",  # Pre-trained tokenizer name
    model_name="mistralai/Mixtral-8x7B-v0.1",  # Pre-trained model name
    device_map="auto",  # Automatic device mapping
    stopping_ids=[50278, 50279, 50277, 1, 0],  # Tokens to stop generation
    tokenizer_kwargs={"max_length": 4096},  # Tokenizer arguments
    model_kwargs={"torch_dtype": torch.float16}  # Model arguments
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we will create a service context with the loaded LLM and the embedding model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Storing the Embedding in the Vector Store
&lt;/h3&gt;

&lt;p&gt;Here, we have used the Qdrant Vector Database to store the embeddings. Qdrant is a high-performance open-source vector search engine designed to efficiently index and search through large collections of high-dimensional vectors. It's particularly well-suited for use cases involving similarity search, where the goal is to find items that are most similar to a query vector within a large dataset. &lt;/p&gt;

&lt;p&gt;We will initialize the Qdrant client first and create a vector store with hybrid search enabled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index import StorageContext, VectorStoreIndex

# Initialize Qdrant client
client = qdrant_client.QdrantClient(location=":memory:")

# Create Qdrant vector store
vector_store = QdrantVectorStore(client=client, collection_name="my_collection", enable_hybrid=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will add the nodes to the vector store and create a storage context. We will also create an index from the documents, the service context, and the storage context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Add nodes to the vector store
vector_store.add(nodes)

# Create a storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, service_context=service_context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will then use a query and embedding model to create a query embedding, which we will use later as a reference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query_str = "Can you update me about shingles vaccine?"
query_embedding = embed_model.get_query_embedding(query_str)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using hybrid query mode, we will create a vector store query using LlamaIndex, where we will use the query embedding and save the query result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.vector_stores import VectorStoreQuery
query_mode = "hybrid"
vector_store_query = VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=2, mode=query_mode)
query_result = vector_store.query(vector_store_query)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
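
&lt;p&gt;Under the hood, hybrid mode runs a dense (embedding) search and a sparse (keyword-style) search and fuses the two rankings into one. Reciprocal Rank Fusion (RRF) is a common fusion scheme; the sketch below illustrates the idea only and is not Qdrant's exact internal algorithm.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Toy Reciprocal Rank Fusion: combine two ranked result lists into one.
# Illustrative only -- Qdrant's hybrid mode handles fusion internally.
def rrf(dense_ranking, sparse_ranking, k=60):
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking):
            # Each appearance contributes 1 / (k + rank), so high ranks in
            # either list push a document up the fused ordering
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # ordered by embedding similarity
sparse = ["doc_c", "doc_a", "doc_d"]  # ordered by keyword match
print(rrf(dense, sparse))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;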



&lt;p&gt;Then, we will parse the query result into a list of nodes with their similarity scores.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []

for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None and index &amp;lt; len(query_result.similarities):
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Building a Retrieval Pipeline
&lt;/h3&gt;

&lt;p&gt;To build a retrieval pipeline, we’ll wrap the logic above in a retriever class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from typing import Any, List, Optional
from llama_index.vector_stores import VectorStoreQuery
from llama_index.schema import NodeWithScore

class VectorDBRetriever(BaseRetriever):
    """Retriever over a qdrant vector store."""
    def __init__(self,
                 vector_store: QdrantVectorStore,
                 embed_model: Any,
                 query_mode: str = "hybrid",
                 similarity_top_k: int = 2) -&amp;gt; None:
        """Initialize parameters."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -&amp;gt; List[NodeWithScore]:
        """Retrieve."""
        query_embedding = self._embed_model.get_query_embedding(query_bundle.query_str)
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)
        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None and index &amp;lt; len(query_result.similarities):
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))
        return nodes_with_scores


retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="hybrid", similarity_top_k=2
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we will assemble our query engine with the help of RetrieverQueryEngine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(
    retriever, service_context=service_context
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Querying the Retriever Query Engine
&lt;/h3&gt;

&lt;p&gt;As our query engine is ready, now is the time to pass some queries and see some results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query_str = "Write a paragraph about GSK announcement about its shares."
response = query_engine.query(query_str)
print(str(response))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GSK plc announced the completion of its share consolidation on July 18, 2022. This followed the demerger of the Consumer Healthcare business from the GSK Group to form Haleon. The consolidation of GSK shares became effective at 8.00 a.m. on July 19, 2022. As part of the consolidation, a ratio of 4 new ordinary shares was applied for every 5 existing ordinary shares. Fractional entitlements that arose from the consolidation were aggregated and sold in the open market, with the net proceeds paid to each relevant shareholder according to their entitlement. Following the issuance and consolidation, the total number of voting rights in GSK as of July 19, 2022, was 4,067,352,076.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question 2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query_str = "Write a paragraph about GSK's RSV vaccine."
response = query_engine.query(query_str)
print(str(response))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GSK's Arexvy is the world's first respiratory syncytial virus (RSV) vaccine for older adults. The US Food and Drug Administration (FDA) approved Arexvy for the prevention of lower respiratory tract disease (LRTD) caused by RSV in individuals 60 years of age and older. This groundbreaking approval enables adults aged 60 years and older to be protected from RSV disease for the first time. The approval is based on data from the positive pivotal AReSVi-006 phase III trial that showed exceptional efficacy in older adults, including those with underlying medical conditions, and in those with severe RSV disease. The US launch was planned before the 2023/24 RSV season. RSV is a common, contagious virus that can lead to potentially serious respiratory illness. It causes approximately 177,000 hospitalizations and an estimated 14,000 deaths in the US in adults aged 65 years and older each year.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question 3:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query_str = "Write a paragraph about GSK's Endrometrial Cancer Drug Development."
response = query_engine.query(query_str)
print(str(response))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GSK has made significant improvement in the development of drugs for endometrial cancer. Their drug, Jemperli (dostarlimab), has been approved by the European Commission and the US Food and Drug Administration (FDA) for the treatment of adult patients with mismatch repair-deficient (dMMR)/microsatellite instability-high (MSI-H) primary advanced or recurrent endometrial cancer. Jemperli, in combination with carboplatin and paclitaxel (chemotherapy), is the first and only frontline immuno-oncology treatment in the European Union for this type of endometrial cancer. The FDA has also granted accelerated approval for Jemperli as a monotherapy for treating adult patients with dMMR/MSI-H recurrent or advanced endometrial cancer that has progressed on or following prior treatment with a platinum-containing regimen. This approval is based on the results from the dMMR/MSI-H population of Part 1 of the RUBY/ENGOT-EN6/GOG3031/NSGO phase III trial. GSK continues to evaluate Jemperli in the hopes of further expansion for the drug as data mature.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question 4:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query_str = "Write a paragraph about GSK's Hepatocellular Carcinoma Drug Development."
response = query_engine.query(query_str)
print(str(response))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GSK is making significant progress in the development of drugs for hepatocellular carcinoma (HCC). One of their drugs, Cobolimab, is currently in Phase II clinical trials for HCC. Cobolimab is a humanized monoclonal IgG4 antibody that inhibits T cell immunoglobulin mucin-3 (TIM-3), and is under development for the treatment of solid tumors including melanoma, squamous and non-squamous non-small cell lung carcinoma, HCC, and colorectal cancer. It is administered through the intravenous route. The drug's phase transition success rate (PTSR) and likelihood of approval (LoA) are being closely monitored. GSK's efforts in this area demonstrate their commitment to advancing treatments for HCC.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question 5:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query_str = "Write a paragraph about GSK's Uncomplicated Cervical And Urethral Gonorrhea Drug Development."
response = query_engine.query(query_str)
print(str(response))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GSK is currently developing a potential first-in-class antibiotic, Gepotidacin, for the treatment of uncomplicated cervical and urethral gonorrhea. This drug is in Phase III of clinical development. Gepotidacin is the first in a new chemical class of antibiotics called triazaacenaphthylene bacterial topoisomerase inhibitors. It is being investigated for use in uncomplicated urinary tract infection and urogenital gonorrhea, two infections not addressed by new oral antibiotics in 20 years. The Phase III programme comprises two studies, EAGLE-1 and EAGLE-2, testing Gepotidacin in two common infections caused by bacteria identified as antibiotic-resistant threats. The development of Gepotidacin is the result of a successful public-private partnership between GSK, the US government's Biomedical Advanced Research and Development Authority (BARDA), and Defense Threat Reduction Agency (DTRA).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final Words
&lt;/h2&gt;

&lt;p&gt;With the help of the LlamaIndex framework, we used Diffbot API to extract enterprise content that was related to a pharmaceutical company, GSK. Using Hugging Face embeddings, Qdrant Vector Store, and Mixtral 8x7B, the retrieval pipeline was built. The results obtained using the retrieval query engine were quite fascinating. Building an advanced AI-powered enterprise content pipeline has become easy with the help of Mixtral 8x7B.&lt;/p&gt;

&lt;p&gt;This article was originally published here: &lt;a href="https://blog.superteams.ai/how-to-build-an-advanced-ai-powered-enterprise-content-pipeline-using-mixtral-8x7b-and-qdrant-b01aa66e3884"&gt;https://blog.superteams.ai/how-to-build-an-advanced-ai-powered-enterprise-content-pipeline-using-mixtral-8x7b-and-qdrant-b01aa66e3884&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>ai</category>
      <category>database</category>
    </item>
    <item>
      <title>Steps to Build Chinese Language AI Using DeepSeek and Qdrant</title>
      <dc:creator>Akriti Upadhyay</dc:creator>
      <pubDate>Tue, 13 Feb 2024 04:45:00 +0000</pubDate>
      <link>https://dev.to/akritiu/steps-to-build-chinese-language-ai-using-deepseek-and-qdrant-48</link>
      <guid>https://dev.to/akritiu/steps-to-build-chinese-language-ai-using-deepseek-and-qdrant-48</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As we step into the Chinese New Year in 2024, the Year of the Dragon, I thought: why not build a Chinese news AI using DeepSeek and Qdrant? Although LLMs are growing in size and complexity, building an accurate and effective language AI beyond English still poses a range of challenges. In this context, DeepSeek LLM is a long-term project aimed at overcoming these shortcomings. DeepSeek LLM excels in Chinese, and here we’ll see how it performs when fetching news from a Chinese news dataset. We’ll use LlamaIndex, FastEmbed by Qdrant, and the Qdrant vector store to develop an application that helps us understand Chinese news through a robust RAG pipeline.&lt;/p&gt;

&lt;p&gt;Let’s dive deeper!&lt;/p&gt;

&lt;h2&gt;
  
  
  DeepSeek LLM: An Open-Source Language Model with Longtermism
&lt;/h2&gt;

&lt;p&gt;DeepSeek LLM is an advanced language model family that includes Base and Chat variants. It has been trained from scratch on a dataset of 2 trillion tokens in both English and Chinese. In terms of size, there are two varieties of DeepSeek LLM models: one comprises 7 billion parameters and the other, 67 billion parameters.&lt;/p&gt;

&lt;p&gt;The 7B model uses Multi-Head Attention, while the 67B model uses Grouped-Query Attention. These variants operate on the same architecture as the Llama 2 model, which is an autoregressive transformer decoder model.&lt;/p&gt;

&lt;p&gt;DeepSeek LLM is a project dedicated to advancing open-source large language models with a long-term perspective. DeepSeek LLM 67B outperforms Llama 2 70B in various domains, such as reasoning, mathematics, coding, and comprehension. Compared to other models, including GPT-3.5, DeepSeek LLM excels in Chinese language proficiency.&lt;/p&gt;

&lt;p&gt;The alignment pipeline of DeepSeek LLM consists of two stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Supervised Fine-Tuning:&lt;/strong&gt; The 7B model is fine-tuned for 4 epochs, while the 67B model is fine-tuned for 2 epochs. During supervised fine-tuning, the learning rate for the 7B model is 1e-5, while for the 67B model it is 5e-6. The model's repetition ratio tends to increase as the quantity of math SFT data increases, since the math SFT data occasionally includes similar patterns in reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct Preference Optimization:&lt;/strong&gt; To address the repetition problem, the model's ability was enhanced using DPO training, which proved to be an effective method for LLM alignment. The preference data for DPO training is constructed in terms of helpfulness and harmlessness.&lt;/li&gt;
&lt;/ol&gt;
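
&lt;p&gt;The DPO objective from stage 2 can be written down compactly: it rewards the policy for preferring the chosen response over the rejected one, relative to a frozen reference model. Below is a minimal numeric sketch with illustrative log-probabilities, not DeepSeek's training code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

# Sketch of the Direct Preference Optimization (DPO) loss for one preference pair.
# Inputs are total log-probabilities of the chosen/rejected responses under the
# policy being trained and under a frozen reference model; beta scales the margin.
def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # loss = -log(sigmoid(beta * margin)); small when the policy prefers
    # the chosen response more strongly than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

print(dpo_loss(policy_chosen=-12.0, policy_rejected=-20.0,
               ref_chosen=-14.0, ref_rejected=-18.0))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;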

&lt;p&gt;The model, along with all its variants, is available on &lt;a href="https://huggingface.co/deepseek-ai"&gt;Hugging Face&lt;/a&gt;. To learn more about DeepSeek LLM, visit their &lt;a href="https://arxiv.org/pdf/2401.02954.pdf"&gt;paper&lt;/a&gt; and &lt;a href="https://github.com/deepseek-ai/DeepSeek-LLM"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qdrant: A High-Performance Vector Database
&lt;/h2&gt;

&lt;p&gt;Qdrant is an open-source vector database and vector similarity search engine written in Rust, engineered to empower the next generation of AI applications with advanced and high-performing vector similarity search technology. Its key features include multilingual support, which enables versatility across various data types, and filters for a wide array of applications.&lt;/p&gt;

&lt;p&gt;Qdrant boasts speed and accuracy through a custom modification of the HNSW algorithm for Approximate Nearest Neighbor Search, which ensures state-of-the-art search capabilities by maintaining precise results. Moreover, it supports additional payload associated with vectors by allowing filterable results based on payload values. With a rich array of supported data types and query conditions, including string matching, numerical ranges, and geo-locations, Qdrant offers versatility in data management.&lt;/p&gt;

&lt;p&gt;As a cloud-native and horizontally scalable platform, Qdrant efficiently handles growing data volumes, using computational resources effectively through dynamic query planning and payload data indexing. Its applications include semantic text search, recommendations, user behavior analysis, and more, and it offers a production-ready service with a convenient API for storing, searching, and managing vectors along with additional payload. For further details, the &lt;a href="https://qdrant.tech/documentation/"&gt;Qdrant Documentation&lt;/a&gt; provides a comprehensive guide on installation, usage, tutorials, and examples.&lt;/p&gt;
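
&lt;p&gt;Conceptually, a filtered vector search scores only the points whose payload matches the filter and returns the nearest of those. The brute-force sketch below (toy data, plain Python) illustrates what Qdrant computes; its HNSW index arrives at this result approximately, and far faster at scale.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

# Brute-force filtered nearest-neighbor search over toy 2-D vectors,
# using cosine similarity. Illustrative only, not Qdrant's internals.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

points = [
    {"id": 1, "vector": [0.9, 0.1], "payload": {"lang": "zh"}},
    {"id": 2, "vector": [0.8, 0.2], "payload": {"lang": "en"}},
    {"id": 3, "vector": [0.1, 0.9], "payload": {"lang": "zh"}},
]

def search(query, lang, top_k=1):
    # Payload filter first, then rank the remaining points by similarity
    candidates = [p for p in points if p["payload"]["lang"] == lang]
    candidates.sort(key=lambda p: cosine(query, p["vector"]), reverse=True)
    return [p["id"] for p in candidates[:top_k]]

print(search([1.0, 0.0], lang="zh"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;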

&lt;h2&gt;
  
  
  Utilizing FastEmbed for Lightweight Embedding Generation
&lt;/h2&gt;

&lt;p&gt;FastEmbed is a lightweight, fast, and accurate Python library built specifically for embedding generation, with maintenance overseen by &lt;a href="https://qdrant.tech/"&gt;Qdrant&lt;/a&gt;. It achieves efficiency and speed through the utilization of quantized model weights and ONNX Runtime, which sidesteps the necessity for a PyTorch dependency.&lt;/p&gt;

&lt;p&gt;FastEmbed supports data parallelism for encoding vast datasets efficiently and is engineered with a CPU-centric approach. Moreover, it outperforms OpenAI Ada-002 on accuracy and recall metrics: its default model, Flag Embedding, ranks near the top of the MTEB leaderboard. It also supports other popular text embedding models, including the Jina embeddings. To learn more about the supported models, visit &lt;a href="https://qdrant.github.io/fastembed/examples/Supported_Models/"&gt;here&lt;/a&gt;.&lt;/p&gt;
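
&lt;p&gt;The data-parallel encoding pattern can be pictured with a small stand-in: split the corpus into batches and encode the batches on a worker pool. The fake_embed function below is only a placeholder for the real ONNX model call, and threads stand in for the worker processes FastEmbed actually uses.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def fake_embed(batch):
    # Placeholder for the ONNX model call: map each text to a tiny "vector"
    # (character count and space count), just to show the batching pattern.
    return [[float(len(text)), float(text.count(" "))] for text in batch]

def embed_corpus(texts, batch_size=2, workers=4):
    # Split the corpus into batches, then encode the batches in parallel.
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fake_embed, batches)
    # Flatten back into one embedding per input text, order preserved.
    return [vec for batch in results for vec in batch]

docs = ["hello world", "fast embeddings", "qdrant", "onnx runtime rocks"]
embeddings = embed_corpus(docs)
print(len(embeddings))  # one vector per document
```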

&lt;h2&gt;
  
  
  LlamaIndex Framework for Robust RAG
&lt;/h2&gt;

&lt;p&gt;LlamaIndex is a robust framework ideally suited for constructing Retrieval-Augmented Generation (RAG) applications. It facilitates the decoupling of chunks used for retrieval and synthesis, which is a crucial feature because the optimal representation for retrieval may differ from that for synthesis. As document volumes expand, LlamaIndex supports structured retrieval by ensuring more precise outcomes, particularly when a query is only relevant to a subset of documents.&lt;/p&gt;

&lt;p&gt;Moreover, LlamaIndex prioritizes optimized performance by offering an array of strategies to enhance the RAG pipeline efficiently. It aims to elevate retrieval and generation accuracy across complex datasets by mitigating hallucinations. LlamaIndex supports many embedding models as well as integration with large language models. Additionally, it seamlessly integrates with established technological platforms like LangChain, Flask, and Docker, which offers customization options such as seeding tree construction with custom summary prompts.&lt;/p&gt;
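
&lt;p&gt;The decoupling of retrieval and synthesis chunks can be sketched in a few lines: match the query against small chunks, but hand the model the larger parent passage the matched chunk came from. The passages and the naive substring match below are illustrative only.&lt;/p&gt;

```python
# Parent passages are what we synthesize from; small chunks are what we match on.
parents = [
    "Qdrant is a vector database written in Rust. It supports payload filters.",
    "DeepSeek LLM is strong in Chinese and English. It uses DPO fine-tuning.",
]

# Build small retrieval units that remember their parent passage.
small_chunks = []
for parent_id, passage in enumerate(parents):
    for sentence in passage.split(". "):
        small_chunks.append({"text": sentence, "parent_id": parent_id})

def retrieve_for_synthesis(query_word):
    # Match on the small chunk, but return the full parent for the LLM to read.
    for chunk in small_chunks:
        if query_word in chunk["text"]:
            return parents[chunk["parent_id"]]
    return None

print(retrieve_for_synthesis("Rust"))
```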

&lt;h2&gt;
  
  
  Understanding Chinese News with DeepSeek
&lt;/h2&gt;

&lt;p&gt;Since DeepSeek LLM excels in Chinese language proficiency, let’s build a Chinese News AI using Retrieval-Augmented Generation (RAG).&lt;/p&gt;

&lt;p&gt;To get started, let’s install all the dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install -q llama-index transformers datasets
%pip install -q llama-cpp-python
%pip install -q qdrant-client
%pip install -q llama_hub
%pip install -q fastembed

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, I have used this multilingual news &lt;a href="https://huggingface.co/datasets/intfloat/multilingual_cc_news"&gt;dataset&lt;/a&gt; and picked the Chinese subset. Load the dataset and save it to your directory. We'll then use LlamaIndex's SimpleDirectoryReader to read the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datasets import load_dataset
dataset = load_dataset("intfloat/multilingual_cc_news", languages=["zh"], split="train")
dataset.save_to_disk("Notebooks/dataset")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, using LlamaIndex, load the data from the directory where we saved our dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader("Notebooks/dataset").load_data()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, split the documents into small chunks using SentenceSplitter. Here, we need to maintain the relationship between the document and the source document index so that it helps in injecting the document metadata.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.node_parser.text import SentenceSplitter
text_parser = SentenceSplitter(chunk_size=1024,)
text_chunks = []
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we’ll construct nodes from text chunks manually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.schema import TextNode
nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(text=text_chunk,)
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each node, we’ll generate embeddings using the FastEmbed Embedding model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.embeddings import FastEmbedEmbedding

embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
for node in nodes:
    node_embedding = embed_model.get_text_embedding(node.get_content(metadata_mode="all"))
    node.embedding = node_embedding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, it's time to load the DeepSeek LLM using HuggingFaceLLM from LlamaIndex. Here, I used the &lt;a href="https://huggingface.co/deepseek-ai/deepseek-llm-7b-chat"&gt;chat model&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
from llama_index.llms import HuggingFaceLLM
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    tokenizer_name="deepseek-ai/deepseek-llm-7b-chat",
    model_name="deepseek-ai/deepseek-llm-7b-chat",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    model_kwargs={"torch_dtype": torch.float16}
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we'll define the ServiceContext, which consists of the embedding model and the large language model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we'll create a vector store collection using the Qdrant vector database and create a storage context using the vector store collection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
client = qdrant_client.QdrantClient(location=":memory:")
from llama_index.storage.storage_context import StorageContext
from llama_index import (VectorStoreIndex,
                         ServiceContext,
                         SimpleDirectoryReader,)
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
vector_store.add(nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll pass the documents, storage context, and service context into the VectorStoreIndex.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, service_context=service_context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll generate a query embedding using a query string to build a retrieval pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query_str = "Can you give me news around iPhone?"
query_embedding = embed_model.get_query_embedding(query_str)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we’ll construct a Vector Store query and query the vector database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.vector_stores import VectorStoreQuery
query_mode = "default"
vector_store_query = VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=2, mode=query_mode)
query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following will be the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;8%，僅排名第五，位居華為、OPPO和Vivo等本土手機廠商之後----這三家中國手機廠商的市場份額加起來達到47%。
普遍對iPhone 8更期待
除此之外，即將發布的iPhone 7恐怕會成為大獲成功的iPhone 6的"犧牲品"。去年第一季度，得益於iPhone 6銷量激增，蘋果在中國的營收增長了74%。一年後，由於iPhone 6S銷量疲軟，蘋果在全球的iPhone銷量首次出現下滑，而公司營收更是出現13年來的首次滑坡，盡管根據市場研究機構Strategy Analytics的數據，iPhone 6S是今年第二季度全球最暢銷的智能手機。
到目前為止，新浪微博網友對iPhone 7發布的討論，已經超過了去年iPhone 6S發布前的熱度。部分中國用戶甚至已經開始盤算著購買有望於明年發布的iPhone8。鑒於2017年是iPhone上市十周年，外界預計iPhone 8將作出更大的升級。
由於投資者擔心iPhone銷量已過巔峰，蘋果股價在今年始終承受著壓力。盡管今年以來蘋果股價累計上漲了2.35%，但仍然落後於標准普爾500指數的平均漲幅。
市場研究機構Stratechery科技行業分析師本·湯普森（Ben Thompson）說："相比2014年，今天最大的變化就是iPhone已無處不在。當人們第一次獲得購買iPhone的機會時，它還有巨大的增長空間，但如今那種潛力已經得到充分挖掘。"◎ 陳國雄
美國在1978年立法制定國際銀行法(International Banking Act of 1978)，該法將外國銀行業納入與國內銀行相同準則。在此之前，外國銀行設立係依據州法沒有一致性。
1978年制定國際銀行法後，外國銀行設立採雙規制(Dual System)，可向聯邦銀行管理機構OCC (Office of the Comptroller of the Currency)或州銀行 (State Banking Department)當局申請，如果向州申請設立毋須經過聯邦銀行同意。到了1991年外國銀行在美迅速成長，大約有280家外國銀行，資產值達6,260億美元，佔美國銀行總資產18%，大部份是依州法設立。
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response is: &lt;em&gt;‘At 8 percent, it ranked fifth behind local handset makers Huawei, OPPO, and Vivo - the three Chinese handset makers with a combined market share of 47 percent.&lt;br&gt;
Widespread anticipation for iPhone 8&lt;br&gt;
On top of that, the upcoming iPhone 7 is feared to be a "casualty" of the hugely successful iPhone 6. In the first quarter of last year, Apple's China revenue grew 74 percent, thanks to a surge in iPhone 6 sales. A year later, Apple's global iPhone sales fell for the first time due to weak iPhone 6S sales, and the company's revenue slipped for the first time in 13 years, even though the iPhone 6S was the world's best-selling smartphone in the second quarter of this year, according to market researcher Strategy Analytics.&lt;br&gt;
So far, discussions on Sina Weibo about the release of the iPhone 7 have surpassed the buzz surrounding the release of the iPhone 6S last year. Some Chinese users have even begun planning to buy the iPhone 8, which is expected to be released next year, and is expected to get even bigger upgrades given that 2017 marks the 10th anniversary of the iPhone's launch.&lt;br&gt;
As investors are worried that iPhone sales have peaked, Apple shares have been under pressure this year. Although Apple's stock price has risen 2.35% since the beginning of the year, it still lags behind the average rate of increase of the Standard &amp;amp; Poor's 500 Index.&lt;br&gt;
Market research organization Stratechery technology industry analyst Ben Thompson (Ben Thompson) said: "Compared to 2014, the biggest change today is that the iPhone is everywhere. When people first got the chance to buy an iPhone, there was huge room for growth, but today that potential has been fully realized." ◎ Chen Guoxiong&lt;br&gt;
The U.S. legislated the International Banking Act of 1978, which brought the foreign banking industry under the same criteria as domestic banks. Prior to that, foreign banks were established under state law with no consistency.&lt;br&gt;
After the enactment of the International Banking Act of 1978, foreign banks were established under a Dual System: they could apply either to the federal banking agency, the OCC (Office of the Comptroller of the Currency), or to a State Banking Department, and federal consent was not required if the application was made to a state. By 1991, foreign banks were growing rapidly in the U.S. There were about 280 foreign banks with assets of US$626 billion, accounting for 18% of total U.S. banking assets, most of which were established under state law.’&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Then we’ll parse the results into a set of nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, using the above, we'll create a retriever class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from typing import Any, List

class VectorDBRetriever(BaseRetriever):
    """Retriever over a qdrant vector store."""
    def __init__(self,
                 vector_store: QdrantVectorStore,
                 embed_model: Any,
                 query_mode: str = "default",
                 similarity_top_k: int = 2) -&amp;gt; None:
        """Init params."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -&amp;gt; List[NodeWithScore]:
        """Retrieve."""
        query_embedding = self._embed_model.get_query_embedding(
            query_bundle.query_str
        )
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)
        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))
        return nodes_with_scores

retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, create a retriever query engine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever, service_context=service_context
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, our query engine is ready to chat with. Let’s pass a query. &lt;/p&gt;

&lt;p&gt;The query is: “Tell me about South China Sea Issue.”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query_str = "告诉我南海问题"
response = query_engine.query(query_str)
print(str(response))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following will be the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;南海问题是指涉及南海地区多个国家的主权和海洋权益争议的问题。该地区包括南海诸岛及其附近海域，涉及中国、菲律宾、越南、马来西亚、文莱和台湾等国家和地区。南海地区拥有丰富的油气资源，因此争议各方在该地区的领土和资源开发上存在分歧。中国主张对南海诸岛及其附近海域拥有主权，并提出"九段线"主张，而其他国家则对此持有不同看法。南海问题涉及复杂的政治、经济和安全利益，是地区和国际社会关注的焦点之一。
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response is: &lt;em&gt;‘South China Sea issues refer to issues involving disputes over the sovereignty and maritime rights and interests of multiple countries in the South China Sea region. This area includes the South China Sea Islands and their adjacent waters, involving countries and regions such as China, the Philippines, Vietnam, Malaysia, Brunei and Taiwan. The South China Sea is rich in oil and gas resources, so the parties to the dispute have differences over the territory and resource development in the area. China claims sovereignty over the South China Sea islands and adjacent waters and has proposed a "nine-dash line" claim, while other countries hold different views on this. The South China Sea issue involves complex political, economic, and security interests and is one of the focuses of attention of the regional and international communities.’&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;DeepSeek LLM has performed very well in answering questions without facing challenges. Its architecture sets it apart from other models, and the most impressive aspect is its utilization of Direct Preference Optimization to enhance the model’s capabilities. It is a fine-tuned and optimized model in both Chinese and English languages, and we observed from the results how well such a fine-tuned and optimized model can perform.&lt;/p&gt;

&lt;p&gt;We utilized FastEmbed and Qdrant for embedding generation and vector similarity search. Retrieval was fast using Qdrant. One of Qdrant's most impressive features is its deployment flexibility: it can run via Docker, in the cloud, or entirely in memory. Qdrant is versatile in storing vector embeddings. It was intriguing for me to experiment with these tools.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks for reading!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/2401.02954.pdf"&gt;DeepSeek LLM: Scaling Open-Source Language Models with Longtermism&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article was originally published &lt;a href="https://medium.com/@akriti.upadhyay/steps-to-build-chinese-language-ai-using-deepseek-and-qdrant-74e5cda604c0"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Multi-Person Image Generation Using Stable Diffusion Models on Astria.ai</title>
      <dc:creator>Akriti Upadhyay</dc:creator>
      <pubDate>Wed, 07 Feb 2024 14:28:03 +0000</pubDate>
      <link>https://dev.to/akritiu/multi-person-image-generation-using-stable-diffusion-models-on-astriaai-43l0</link>
      <guid>https://dev.to/akritiu/multi-person-image-generation-using-stable-diffusion-models-on-astriaai-43l0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;One of the most exciting developments in AI is the ability to generate images from text prompts. How would it be if you could generate novel images not only of a single person but of multiple people, together in the same frame? This does sound interesting! &lt;/p&gt;

&lt;p&gt;With Astria.ai, this is a simple and quick process. Let’s dive deeper to know more about Astria and how it can help us generate multi-person images.&lt;/p&gt;

&lt;h2&gt;
  
  
  Astria.ai: Personalized AI for Life-Like Headshots
&lt;/h2&gt;

&lt;p&gt;Astria.ai is a leading AI-powered platform that specializes in image generation and tailored AI solutions. It is designed to simplify and expedite the creation of unique images. Astria offers Dreambooth API for crafting distinct visuals. This API streamlines the fine-tuning process, which eliminates the need for managing GPUs, Python scripts, or adjusting hyperparameters.&lt;/p&gt;

&lt;p&gt;Astria can animate concepts, breathing life into narratives without the need for pre-existing footage. This functionality enhances its storytelling potential and elevates the user experience. Astria facilitates image generation through text prompts to empower users to refine their creations effortlessly. Because of the optimal performance and stability, users and app developers can initiate their creative journey within minutes. &lt;/p&gt;

&lt;p&gt;Astria offers a comprehensive platform equipped with intuitive tools tailored for easily refining Stable Diffusion models. Its pre-configured features and accessible APIs, such as AI Photoshoot, Product Shots, InPainting, and Masking, along with a user-friendly tuning guide, help streamline the AI image generation process. &lt;/p&gt;

&lt;p&gt;A standout aspect of Astria.ai is its extensive API functionality, which allows complex workflows on the platform to be automated efficiently and cost-effectively. This accessibility empowers app developers to swiftly integrate Astria APIs into their applications by leveraging the advanced capabilities of Stable Diffusion models.&lt;/p&gt;
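
&lt;p&gt;As a rough sketch of what such an integration might look like, the snippet below only assembles a request payload for a hypothetical prompt-creation call and never sends it. The endpoint path, the ‘prompt[text]’ field name, and the bearer-token header are assumptions to verify against Astria’s current API documentation.&lt;/p&gt;

```python
def build_prompt_request(tune_id, text, api_key):
    # Assemble the pieces of a (hypothetical) prompt-creation request;
    # the URL shape and field name here are illustrative assumptions.
    return {
        "url": f"https://api.astria.ai/tunes/{tune_id}/prompts",
        "headers": {"Authorization": f"Bearer {api_key}"},
        "data": {"prompt[text]": text},
    }

req = build_prompt_request(982752, "ohwx man in Paris", "sk-example")
print(req["url"])
```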

&lt;p&gt;For developers who specialize in social applications, particularly within the photo editing category, this represents a significant opportunity. Users can embed Astria APIs within the mobile app framework and share the unique images with their friends. &lt;br&gt;
Astria supports two Stable Diffusion model architectures: SD15 and SDXL. &lt;/p&gt;

&lt;p&gt;Astria allows importing any open-source model, such as those from CivitAI. However, here are several popular base tune models on Astria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Realistic Vision 2.0: This model is an improved version of Stable Diffusion 1.5 and excels at generating limitless ultra-realistic images. &lt;/li&gt;
&lt;li&gt;runwayml/stable-diffusion-v1-5: It is a latent text-to-image model, which is capable of producing photorealistic images from textual inputs. It was initialized with the Stable Diffusion v1-2 checkpoint weights and fine-tuned on ‘laion-aesthetics v2 5+’ for 595k steps at a resolution of 512x512. &lt;/li&gt;
&lt;li&gt;Realistic Vision V5.1 (VAE): This model is part of the Stable Diffusion family; the integration of the Variational Autoencoder enhances the image quality.&lt;/li&gt;
&lt;li&gt;Deliberate: This model synergizes with CivitAI’s LoRA weights, which produce stylistic and artistic images. &lt;/li&gt;
&lt;li&gt;AnyLoRA: This is a diffuser model that is compatible with CivitAI’s LoRA weights. This model is developed by Lykon from CivitAI.&lt;/li&gt;
&lt;li&gt;DreamShaper 8: This model is fine-tuned on runwayml/stable-diffusion-v1-5. It is recognized for its production of high-quality images.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The Prompting Technique for Stable Diffusion Models
&lt;/h2&gt;

&lt;p&gt;Prompting is very important in image generation. Astria provides a ‘Negative Prompt’ field while fine-tuning the images. It also provides a ‘Detailed Description’ field to provide the details of the image. Detailed Description is the place where we do positive prompting, and the things that we don’t want the image to generate are part of the negative prompt. &lt;/p&gt;

&lt;p&gt;Here are some prompting tips for when you try this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Always use parentheses when you want the model to emphasize the specific text. Parentheses increase the weight of the token. However, square brackets de-emphasize the weight of the token. &lt;/li&gt;
&lt;li&gt;Choose your keywords very carefully. The right mix of structured, powerful keywords and clear details will generate the exact image you want.&lt;/li&gt;
&lt;li&gt;The right words first start with the subject and its attributes. It’s important to describe the visual characteristics consisting of camera angles, lighting, art styles, color schemes, and the surrounding environment. &lt;/li&gt;
&lt;li&gt;The prompt should include a description of the image quality you desire from the output.&lt;/li&gt;
&lt;li&gt;Negative prompts include concepts that are the exact opposite of positive prompts.&lt;/li&gt;
&lt;/ol&gt;
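&lt;p&gt;As an illustration of tip 1, the sketch below computes the weight a token would receive under the widely used Stable Diffusion WebUI convention, where each level of parentheses multiplies the weight by 1.1 and each level of square brackets by 0.9. Astria’s exact multipliers may differ, so treat the numbers as an assumption.&lt;/p&gt;

```python
def token_weight(token):
    # Each surrounding "(" multiplies the attention weight by 1.1, each "[" by 0.9
    # (the common Stable Diffusion WebUI convention; exact factors may differ on Astria).
    weight = 1.0
    weight *= 1.1 ** token.count("(")
    weight *= 0.9 ** token.count("[")
    return round(weight, 3)

print(token_weight("((ohwx man))"))  # two levels of emphasis
print(token_weight("[background]"))  # one level of de-emphasis
```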
&lt;h2&gt;
  
  
  Multi-Person Image Generation through Astria.ai
&lt;/h2&gt;

&lt;p&gt;As Valentine’s Day 2024 is just around the corner, let’s create an image of a happy couple visiting Paris. We’ll enhance it with Astria.ai’s powerful image-generation tools. To get started with multi-person image generation, go to the ‘Tunes’ tab on the page.&lt;br&gt;
Click on ‘New Finetune’. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnk4ebj4zexvs69f3bfj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnk4ebj4zexvs69f3bfj.png" alt="Tunes" width="500" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we will fine-tune the man’s image with LoRA and then perform the same with the woman’s image. Finally, we’ll use ControlNet for the final image.&lt;/p&gt;

&lt;p&gt;Let’s get started! &lt;/p&gt;
&lt;h2&gt;
  
  
  Finetuning Man’s Image with LoRA
&lt;/h2&gt;

&lt;p&gt;After clicking on ‘New Finetune’, you’ll move to a new page. Add the desired title. For the man’s image, select the Class name ‘man’. I downloaded a model’s image from Pexels. Now, upload the image. For best fine-tuning, upload more than 4 images of the subject, which should include full body, close-up, and medium-shot photos. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmfw7jtu5sno24xepxic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmfw7jtu5sno24xepxic.png" alt="New Finetune" width="500" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on ‘Advanced’ and select the Base to fine-tune the model and the Model type. I selected Realistic Vision V5.1 V5.1 (VAE) and the LoRA model type. For creating multi-person images, you can go either with the LoRA model type or the Checkpoint model type but with the sd15 type model, as mentioned in the documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0gbdrlvzf5ewxnh8y4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0gbdrlvzf5ewxnh8y4t.png" alt="Advanced" width="500" height="247"&gt;&lt;/a&gt;&lt;br&gt;
After that, click ‘Create’. You’ll be redirected to another page, where you’ll get a LoRA ID. Save that; we’ll use it for the final image.&lt;/p&gt;
&lt;h2&gt;
  
  
  Finetuning Woman’s Image with LoRA
&lt;/h2&gt;

&lt;p&gt;Similarly, for the woman’s image, click on ‘New Finetune’. Give the desired title. For the woman’s image, select the Class name ‘woman’. I downloaded a model’s image from Pexels. Now, upload the image. For best fine-tuning, upload more than 4 images of the subject, which should include full body, close-up, and medium-shot photos. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgybtztpbmw6re1wfz2z9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgybtztpbmw6re1wfz2z9.png" alt="New Finetune" width="500" height="347"&gt;&lt;/a&gt;&lt;br&gt;
Click on ‘Advanced’ and select the Base to fine-tune the model and the Model type. I selected Realistic Vision V5.1 V5.1 (VAE) and the LoRA model type. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97r3b8aaqhdl5jn5hb30.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97r3b8aaqhdl5jn5hb30.png" alt="Advanced" width="500" height="247"&gt;&lt;/a&gt;&lt;br&gt;
After that, click ‘Create’. You’ll be moved to another page, where you’ll get a LoRA ID. Save that; we’ll use it for the final image.&lt;/p&gt;

&lt;p&gt;Now, let’s fire up our imagination!&lt;/p&gt;
&lt;h2&gt;
  
  
  Prompting with ControlNet for the Final Images on Astria.ai
&lt;/h2&gt;

&lt;p&gt;When you separately fine-tune the model, you’ll see both of them on the ‘Tunes’ page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvq6v11wkrln4ynx1e2hc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvq6v11wkrln4ynx1e2hc.png" alt="Fine-tuned model" width="500" height="517"&gt;&lt;/a&gt;&lt;br&gt;
Now, we have to generate an image where both people go to Paris and celebrate their Valentine’s Day. &lt;/p&gt;

&lt;p&gt;Click on any of the models. I went with ‘The Woman’ fine-tuned model. In the Detailed Description, I put the following prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(wide shot) of ((ohwx man)) and ((ohwx woman)) standing together in (Paris) 
BREAK ((ohwx man)) wearing a (((grey shirt, white coat and pant))), ((ohwx woman)) wearing a (((black dress and yellow overcoat))), (Paris background), analog style, detailed limbs, detailed face, Amazing Details, Best Quality, Masterpiece, dramatic lighting, highly detailed, analog photo, overglaze, 80mm Sigma f/1.4 or any ZEISS lens, tiled upscale, Cinematic light, ((photorealistic++)), 8k high definition, RAW photo  
BREAK (ohwx man) &amp;lt;lora:982752:1.0&amp;gt; BREAK (ohwx woman) &amp;lt;lora:982753:1.0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see from the prompt how I have defined the subject first; then its attributes are defined, like the clothes they should be wearing and the surroundings that should serve as the backdrop. Then I defined the camera lens, lighting, art style, and the description of the image quality.&lt;/p&gt;

&lt;p&gt;After that, the LoRA IDs of both people are mentioned with the ‘man’ and ‘woman’ tokens, so that the model can understand which LoRA ID belongs to which subject. &lt;/p&gt;

&lt;p&gt;‘Ohwx’ is a token used in Stable Diffusion prompts. ‘Ohwx’ is used as an instance token for the naming process during training. It helps in identifying and differentiating the particular style or subject with which this token is used during the training process. &lt;/p&gt;

&lt;p&gt;Now comes the Negative prompt, which restricts the model from creating irrelevant content. &lt;/p&gt;

&lt;p&gt;For this image, I used the following negative prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;old, wrinkles, mole, blemish,(oversmoothed, 3d render) scar, sad, severe, 2d, sketch, painting, digital art, drawing, disfigured, elongated body (deformed iris, deformed pupils, semi-realistic, cgi, sketch, cartoon, drawing, anime), text, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, (extra fingers, mutated hands, poorly drawn hands, poorly drawn face), mutation, deformed, (blurry), dehydrated, bad anatomy, bad proportions, (extra limbs), cloned face, disfigured, gross proportions, (malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, NSFW), nude, underwear, muscular, elongated body, high contrast, airbrushed, blurry, disfigured, cartoon, blurry, dark lighting, low quality, low resolution, cropped, text, caption, signature, clay, kitsch, oversaturated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The negative prompt lists concepts that are the opposite of what we want in the output image, so the model steers away from them. &lt;/p&gt;
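&lt;p&gt;If you script generations instead of using the UI, the positive and negative prompts typically travel together in a single request. A hypothetical sketch — the field names here are illustrative, not Astria’s documented schema:&lt;/p&gt;

```python
def build_generation_request(prompt, negative_prompt, num_images=4, aspect_ratio="3:4"):
    # Bundle both prompts and the generation settings into one payload.
    return {
        "text": prompt,
        "negative_prompt": negative_prompt,
        "num_images": num_images,
        "aspect_ratio": aspect_ratio,
    }

request = build_generation_request(
    "wide shot of ohwx man and ohwx woman standing in a garden",
    "old, wrinkles, blurry, low quality, cartoon, deformed",
)
print(request["negative_prompt"])
```

&lt;p&gt;Keeping both prompts in one payload mirrors what the UI does when you click Create.&lt;/p&gt;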

&lt;p&gt;Now click on ‘Advanced’ and ‘ControlNet/Img2Img’.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmjlgn1vs6xg2wg76uq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmjlgn1vs6xg2wg76uq9.png" alt="Advanced" width="500" height="244"&gt;&lt;/a&gt;&lt;br&gt;
In the Image URL field, choose a reference image featuring a couple of your choice, and select ‘Pose’ from the ControlNet Hint. Then select the desired number of images and the aspect ratio. &lt;/p&gt;

&lt;p&gt;After this, click on Create. The job will be queued, and the model will then process the prompt and produce the final image.&lt;/p&gt;

&lt;p&gt;The final image, in our case, looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rvumjdaaehmerpu0zml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rvumjdaaehmerpu0zml.png" alt="Image generated using Astria.ai" width="500" height="666"&gt;&lt;/a&gt;&lt;br&gt;
Now let’s do another prompt where we want the couple to be holidaying in Italy...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wide shot of ohwx man and ohwx woman standing besides lake in Italy
BREAK (ohwx man) wearing (Blue shirt and sunglasses) and (ohwx woman) wearing (pink top, jeans, and sunglasses), (Italy lake background), symmetrical eyes, analog style, detailed face, amazing details, hands in the pocket, masterpiece, intricate details, photorealistic++, high contrast, detailed background, detailed face, ZEISS lens, insanely detailed hair, dramatic lighting, detailed glow, overglaze, best quality, high contrast, tiled upscale, 8k high definition
BREAK (ohwx man) &amp;lt;lora:982752:1.0&amp;gt; BREAK (ohwx woman) &amp;lt;lora:982753:1.0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the same negative prompt, advanced, and ControlNet settings, click on ‘Create’. In the ControlNet Image URL, try another image with a background from Italy.&lt;/p&gt;

&lt;p&gt;The final image will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffity7mm7uzylx9vrm8p4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffity7mm7uzylx9vrm8p4.png" alt="Image generated using Astria.ai" width="500" height="342"&gt;&lt;/a&gt;&lt;br&gt;
Let’s try one more prompt where the pair visits Switzerland.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;full body shot of ohwx man and ohwx woman sitting together in Switzerland
BREAK (ohwx man) wearing (Woollen cap, grey turtleneck, blue jeans, boots) and (ohwx woman) wearing (grey turtleneck, blue jeans, woolen cap, boots), (Switzerland snow background), (snowing background), symmetrical eyes, analog style, detailed face, amazing details, masterpiece, intricate details, photorealistic++, ultra realistic, high contrast, detailed background, ZEISS lens, insanely detailed hair, dramatic lighting, detailed glow, overglaze, best quality, 8k high definition
BREAK (ohwx man) &amp;lt;lora:982752:1.0&amp;gt; BREAK (ohwx woman) &amp;lt;lora:982753:1.0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the same negative prompt, advanced, and ControlNet settings, click on ‘Create’. In the ControlNet Image URL, try another image with Switzerland in the background.&lt;/p&gt;

&lt;p&gt;The final image will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl72ki0ytb50a9dvhwun8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl72ki0ytb50a9dvhwun8.png" alt="Image generated using Astria.ai" width="500" height="399"&gt;&lt;/a&gt;&lt;br&gt;
We can try a final prompt, where the couple is visiting a beach in the Maldives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;full body shot of ohwx man and ohwx woman standing together near Maldives beach.
BREAK (ohwx man) wearing (white shirt and pant) and (ohwx woman) wearing (white dress), (Maldives beach background), (Palm tree and evening time), ZEISS lens,  symmetrical eyes, analog style, detailed face, amazing details, masterpiece, intricate details, photorealistic++, ultra realistic, high contrast, detailed background, ZEISS lens, insanely detailed hair, dramatic lighting, detailed glow, overglaze, best quality, 8k high definition
BREAK (ohwx man) &amp;lt;lora:982752:1.0&amp;gt; BREAK (ohwx woman) &amp;lt;lora:982753:1.0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the same negative prompt, advanced, and ControlNet settings, click on ‘Create’. In the ControlNet Image URL, try another image with a Maldives beach in the background.&lt;/p&gt;

&lt;p&gt;The final image will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu28k4igv0m5ebmppdmp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu28k4igv0m5ebmppdmp2.png" alt="Image generated using Astria.ai" width="500" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With Astria, it is super simple to generate creative multi-person images in multiple locations. All you need is creativity and the right prompts. &lt;/p&gt;

&lt;p&gt;To reiterate, the right Stable Diffusion prompts are very important to generate the desired images. There’s no right or wrong prompt. It all depends on your specific requirements. So, go ahead and explore Astria.ai on your own!&lt;/p&gt;

&lt;p&gt;This article was originally published &lt;a href="https://medium.com/@akriti.upadhyay/multi-person-image-generation-using-stable-diffusion-models-on-astria-ai-abb8683d6d5d"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>api</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Integrating LlamaIndex and Qdrant Similarity Search for Patient Record Retrieval</title>
      <dc:creator>Akriti Upadhyay</dc:creator>
      <pubDate>Fri, 02 Feb 2024 05:20:25 +0000</pubDate>
      <link>https://dev.to/akritiu/integrating-llamaindex-and-qdrant-similarity-search-for-patient-record-retrieval-1j8l</link>
      <guid>https://dev.to/akritiu/integrating-llamaindex-and-qdrant-similarity-search-for-patient-record-retrieval-1j8l</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The medical field is currently experiencing a remarkable surge in data, a result of the progress in medical technologies, electronic health records (EHRs), and wearable health devices. The ability to effectively manage and analyze this intricate and varied data is vital for providing customized healthcare, advancing medical research, and enhancing patient health outcomes. Vector databases, which are specifically tailored for the efficient handling and storage of multi-dimensional data, are gaining recognition as an effective tool for a range of healthcare uses. &lt;/p&gt;

&lt;p&gt;For example, currently, past patient record data is rarely leveraged by medical professionals in real-time, even though they are a treasure trove of information and can assist in diagnosis. What if we could build systems where doctors, nurses and caregivers could quickly access past patient records using just natural language inputs? What if historical test results could help generate recommendations for new treatment options? &lt;/p&gt;

&lt;p&gt;This is the potential of AI in healthcare. From personalized diagnostics to targeted therapies, healthcare is on the cusp of becoming a whole lot smarter. In this article, I will demonstrate the capabilities and potential applications of vector databases in the healthcare sector.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Vector Search and LLMs?
&lt;/h2&gt;

&lt;p&gt;Vector Search enables rapid exploration of large datasets by transforming data into vectors within a high-dimensional space, where similar items are clustered closely. This approach facilitates efficient retrieval of relevant information, even from vast datasets. LLMs, on the other hand, are AI models trained on diverse internet texts, capable of comprehending and generating human-like text based on inputs.&lt;/p&gt;

&lt;p&gt;When combined, Vector Search and LLMs streamline the storage and search of patient records. Each record is embedded, converting it into a vector that represents its semantic meaning, and stored in a database. At retrieval time, a doctor’s search query is also converted into a vector, and Vector Search scans the database to locate the records closest to the query vector. This enables semantic search based on meaning rather than exact keywords.&lt;/p&gt;

&lt;p&gt;Subsequently, retrieved records are processed through an LLM, which generates a human-readable summary highlighting the most relevant information for the doctor. This integration empowers doctors to efficiently access and interpret patient records, which facilitates better-informed decisions and personalized care. Ultimately, this approach enhances patient outcomes by enabling healthcare professionals to provide tailored recommendations based on comprehensive data analysis.&lt;/p&gt;
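&lt;p&gt;As a rough illustration of the retrieval half of this flow, here is a minimal pure-Python sketch of embedding-based search. The “embeddings” below are hand-made toy vectors standing in for real model outputs:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# Toy "embedded" patient records: (record text, vector).
records = [
    ("Patient A: Diabetes, on Metformin", [0.9, 0.1, 0.0]),
    ("Patient B: Hypertension, on Lisinopril", [0.1, 0.9, 0.0]),
    ("Patient C: Diabetes, on Insulin", [0.8, 0.2, 0.1]),
]

def search(query_vector, top_k=2):
    # Rank every record by similarity to the query vector.
    scored = sorted(records, key=lambda r: cosine_similarity(query_vector, r[1]),
                    reverse=True)
    return [text for text, _ in scored[:top_k]]

# A query vector pointing in the "diabetes" direction retrieves diabetes records.
results = search([1.0, 0.0, 0.0])
print(results)
```

&lt;p&gt;A real system replaces the toy vectors with model-generated embeddings and the linear scan with an approximate nearest neighbor index such as HNSW.&lt;/p&gt;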

&lt;p&gt;Let’s see how this is going to work with the help of the Retrieval Augmented Generation (RAG) technique incorporated with LlamaIndex and Qdrant Vector DB.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG Architecture
&lt;/h2&gt;

&lt;p&gt;Retrieval Augmented Generation (RAG) enhances the effectiveness of large language model applications by incorporating custom data. By retrieving relevant data or documents related to a query or task, RAG provides context for LLMs, which improves their accuracy and relevance. &lt;/p&gt;

&lt;p&gt;The challenges addressed by RAG include LLMs’ knowledge being limited to their training data, and the need for AI applications to leverage custom data for specific responses. RAG tackles these issues by integrating external data into the LLM’s prompt, which allows it to generate more relevant and accurate responses without extensive retraining or fine-tuning.&lt;/p&gt;

&lt;p&gt;RAG’s benefits include reducing inaccuracies or hallucinations in LLM outputs, delivering domain-specific and relevant answers, and offering an efficient, cost-effective way to customize LLMs with external data.&lt;/p&gt;
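&lt;p&gt;In practice, the whole RAG loop reduces to “retrieve relevant text, then stuff it into the prompt”. A minimal sketch, with a toy word-overlap retriever standing in for real vector search:&lt;/p&gt;

```python
import re

def retrieve(query, documents, top_k=2):
    # Toy retriever: rank documents by word overlap with the query.
    query_words = set(re.findall(r"\w+", query.lower()))
    def overlap(doc):
        return len(query_words.intersection(re.findall(r"\w+", doc.lower())))
    return sorted(documents, key=overlap, reverse=True)[:top_k]

def build_rag_prompt(query, documents):
    # Inject the retrieved context into the LLM prompt.
    context = "\n".join(retrieve(query, documents))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

docs = [
    "Metformin is a first-line medication for Diabetes.",
    "Lisinopril treats Hypertension.",
    "Insulin therapy is used for Diabetes.",
]
prompt = build_rag_prompt("Which medications treat Diabetes?", docs)
print(prompt)
```

&lt;p&gt;The assembled prompt carries only the relevant context, which is exactly what keeps the LLM’s answer grounded in the custom data.&lt;/p&gt;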

&lt;h2&gt;
  
  
  Incorporating RAG with LlamaIndex
&lt;/h2&gt;

&lt;p&gt;LlamaIndex is a fantastic tool in the domain of large language model orchestration and deployment, with a particular focus on data storage and management. Its standout features include Data Agents, which execute actions based on natural-language inputs instead of merely generating responses, and the ability to deliver structured results by leveraging LLMs. &lt;/p&gt;

&lt;p&gt;Moreover, LlamaIndex offers composability by allowing the composition of indexes from other indexes. It also has seamless integration with existing technological platforms like LangChain, Flask, and Docker, and customization options such as seeding tree construction with custom summary prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qdrant DB: A High-Performance Vector Similarity Search Technology
&lt;/h2&gt;

&lt;p&gt;Qdrant acts as both a vector database and a similarity search engine, with a cloud-hosted platform that helps find the nearest high-dimensional vectors efficiently. It harnesses embeddings or neural network encoders to help developers build comprehensive applications involving tasks like matching, searching, recommending, and beyond. It uses a custom adaptation of the HNSW algorithm for approximate nearest neighbor search, allows additional payload to be associated with vectors, and enables filtering results based on payload values.&lt;/p&gt;

&lt;p&gt;Qdrant supports a wide array of data types and query conditions for vector payloads, encompassing string matching, numerical ranges, geo-locations, and more. It is built to be cloud-native and horizontally scalable, and it maximizes resource utilization with dynamic query planning and payload data indexing, implemented entirely in Rust.&lt;/p&gt;
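&lt;p&gt;The idea behind payload filtering is that every vector carries structured metadata that can constrain the search before ranking by similarity. Conceptually (a pure-Python sketch, not Qdrant’s actual API):&lt;/p&gt;

```python
points = [
    {"vector": [0.9, 0.1], "payload": {"condition": "Diabetes", "age": 34}},
    {"vector": [0.2, 0.8], "payload": {"condition": "Hypertension", "age": 67}},
    {"vector": [0.7, 0.3], "payload": {"condition": "Diabetes", "age": 71}},
]

def filtered_search(query_vector, condition, min_age, top_k=1):
    # Apply the payload filter first, then rank the survivors by dot product.
    candidates = [
        p for p in points
        if p["payload"]["condition"] == condition and p["payload"]["age"] >= min_age
    ]
    def score(p):
        return sum(a * b for a, b in zip(query_vector, p["vector"]))
    return sorted(candidates, key=score, reverse=True)[:top_k]

best = filtered_search([1.0, 0.0], condition="Diabetes", min_age=50)
print(best[0]["payload"])
```

&lt;p&gt;In Qdrant the filter is expressed declaratively and evaluated against payload indexes, but the filter-then-rank semantics are the same.&lt;/p&gt;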

&lt;h2&gt;
  
  
  The HNSW Algorithm
&lt;/h2&gt;

&lt;p&gt;There are many algorithms for approximate nearest neighbor search, such as locality-sensitive hashing and product quantization, which have demonstrated strong performance on high-dimensional datasets. &lt;/p&gt;

&lt;p&gt;Graph-based approaches, often referred to as proximity-graph ANN algorithms, however, suffer significant performance degradation when dealing with low-dimensional or clustered data. &lt;/p&gt;

&lt;p&gt;In response to this challenge, the HNSW algorithm was developed as a fully graph-based incremental approximate nearest neighbor solution.&lt;/p&gt;

&lt;p&gt;The HNSW algorithm builds upon the proximity-graph structure of the NSW algorithm by adding a hierarchy. While the NSW algorithm struggles with high-dimensional data, its hierarchical counterpart excels in that domain. The core concept of HNSW is to separate links by their length scales across multiple layers. This results in an incremental multi-layer structure comprising hierarchical sets of proximity graphs, each representing a nested subset of the stored elements. The top layer in which an element resides is chosen randomly, following an exponentially decaying probability distribution.&lt;/p&gt;
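&lt;p&gt;The random layer assignment can be sketched in a few lines. In HNSW, an element’s top layer is drawn as floor(-ln(U) * mL) for a uniform U, so each higher layer holds an exponentially smaller nested subset. The mL value below is an arbitrary choice for illustration:&lt;/p&gt;

```python
import math
import random

def assign_layer(m_l=0.5, rng=random):
    # P(layer is at least l) decays exponentially in l.
    # 1.0 - rng.random() lies in (0, 1], which keeps log() in its domain.
    return int(-math.log(1.0 - rng.random()) * m_l)

random.seed(0)
layers = [assign_layer() for _ in range(10000)]
# Layer 0 contains every element; higher layers are exponentially sparser.
print(max(layers), layers.count(0), layers.count(1), layers.count(2))
```

&lt;p&gt;With mL = 0.5, roughly 86% of elements stay on layer 0 alone, which is what keeps the upper layers of the graph small and fast to traverse.&lt;/p&gt;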

&lt;h2&gt;
  
  
  Building Medical Search System with LlamaIndex
&lt;/h2&gt;

&lt;p&gt;To get started with utilizing RAG for building a medical search system, let’s create a synthetic dataset first. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generating Synthetic Patient Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As this is going to be synthetic data, let’s install a dependency named ‘Faker’.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install -q faker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a CSV synthetic dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
import pandas as pd
from faker import Faker
fake = Faker()


medical_condition_data = {
    'Hypertension': {
        'medications': ['Lisinopril', 'Amlodipine', 'Losartan', 'Hydrochlorothiazide'],
        'cholesterol_range': (100, 200),
        'glucose_range': (70, 110),
        'blood_pressure_range': (140, 90)  # systolic/diastolic
    },
    'Diabetes': {
        'medications': ['Metformin', 'Insulin', 'Glipizide', 'Sitagliptin'],
        'cholesterol_range': (100, 200),
        'glucose_range': (130, 200),
        'blood_pressure_range': (130, 80)
    },

}

def generate_patient_records(num_patients):
    patient_records = []
    for _ in range(num_patients):
        patient_id = fake.uuid4()
        name = fake.name()
        age = random.randint(18, 90)
        gender = random.choice(['Male', 'Female'])
        blood_type = random.choice(['A+', 'B+', 'AB+', 'O+', 'A-', 'B-', 'AB-', 'O-'])
        medical_condition = random.choice(list(medical_condition_data.keys()))
        patient_records.append({
            'Patient_ID': patient_id,
            'Name': name,
            'Age': age,
            'Gender': gender,
            'Blood_Type': blood_type,
            'Medical_Condition': medical_condition
        })
    return patient_records

def generate_test_results(patient_records):
    # Reuse each patient's condition so the test values stay consistent with it.
    test_results = []
    for patient in patient_records:
        medical_condition = patient['Medical_Condition']
        cholesterol_range = medical_condition_data[medical_condition]['cholesterol_range']
        glucose_range = medical_condition_data[medical_condition]['glucose_range']
        blood_pressure_range = medical_condition_data[medical_condition]['blood_pressure_range']
        cholesterol = random.uniform(cholesterol_range[0], cholesterol_range[1])
        glucose = random.uniform(glucose_range[0], glucose_range[1])
        systolic = random.randint(blood_pressure_range[1], blood_pressure_range[0])
        diastolic = random.randint(60, systolic)
        blood_pressure = f"{systolic}/{diastolic}"
        test_results.append({
            'Cholesterol': cholesterol,
            'Glucose': glucose,
            'Blood_Pressure': blood_pressure
        })
    return test_results

def generate_prescriptions(patient_records):
    # Prescribe a medication appropriate for each patient's condition.
    prescriptions = []
    for patient in patient_records:
        medical_condition = patient['Medical_Condition']
        medication = random.choice(medical_condition_data[medical_condition]['medications'])
        dosage = f"{random.randint(1, 3)} pills"
        duration = f"{random.randint(1, 30)} days"
        prescriptions.append({
            'Medication': medication,
            'Dosage': dosage,
            'Duration': duration
        })
    return prescriptions

def generate_medical_history_dataset(num_patients):
    patient_records = generate_patient_records(num_patients)
    test_results = generate_test_results(patient_records)
    prescriptions = generate_prescriptions(patient_records)

    # Merge each patient's record, test results, and prescription into one row.
    medical_history = []
    for i in range(num_patients):
        record = {**patient_records[i], **test_results[i], **prescriptions[i]}
        medical_history.append(record)

    return pd.DataFrame(medical_history)


medical_history_dataset = generate_medical_history_dataset(100)


medical_history_dataset.to_csv('medical_history_dataset.csv', index=False)

print("Synthetic medical history dataset created and saved to 'medical_history_dataset.csv'")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the synthetic dataset is created, a confirmation message is printed, and you can see the file in your directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Synthetic medical history dataset created and saved to 'medical_history_dataset.csv'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s see what our data looks like!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
df = pd.read_csv("/content/medical_history_dataset.csv")
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data looks fine; let’s convert it into PDF format. While we could use the CSV as is, PDF is the format in which many documents are stored in legacy systems, so building on PDFs better reflects real-life scenarios. &lt;/p&gt;

&lt;p&gt;We will load the PDF data using LlamaIndex SimpleDirectoryReader.&lt;/p&gt;

&lt;p&gt;To convert the CSV dataset into a PDF document, install the following dependency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install -q reportlab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import ParagraphStyle
import pandas as pd

def create_pdf_from_dataframe(dataframe, output_file):
    doc = SimpleDocTemplate(output_file, pagesize=letter)
    styles = ParagraphStyle(name='Normal', fontSize=12)


    content = []


    for index, row in dataframe.iterrows():
        row_content = []
        for column_name, value in row.items():
            row_content.append(f"{column_name}: {value}")


        content.append(Paragraph(", ".join(row_content), styles))
        content.append(Paragraph("&amp;lt;br/&amp;gt;&amp;lt;br/&amp;gt;", styles))  

    doc.build(content)


create_pdf_from_dataframe(df, "output.pdf")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make a directory and move the output PDF document into it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

# Check current working directory
print(os.getcwd())

# Create 'static/' directory
if not os.path.exists('static/'):
    os.makedirs('static/')

!mv "/content/output.pdf" "static/"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that the dataset is ready, let’s move on to building a RAG pipeline on top of it. Install the key dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install -q llama-index transformers
!pip install -q llama-cpp-python
!pip install -q qdrant-client
!pip install -q llama_hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Load the data into SimpleDirectoryReader.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader("static").load_data()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the sentence splitter, split the documents into small chunks. Keep track of each chunk’s source document index so that document metadata can be injected into the nodes later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.node_parser.text import SentenceSplitter
text_parser = SentenceSplitter(
    chunk_size=1024,
)
text_chunks = []

doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, manually construct nodes from text chunks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.schema import TextNode

nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now generate embeddings for each node using the Hugging Face embeddings model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, it’s time to load a model using Llama CPP. Here, we’ll use the GGUF-quantized Llama 2 13B chat model, which Llama CPP downloads from the model URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.llms import LlamaCPP

model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"

llm = LlamaCPP(
    model_url=model_url,
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": 1},
    verbose=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s define the service context, which bundles the LLM and the embedding model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, create a vector store collection using Qdrant DB, and create a storage context for this vector store.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
client = qdrant_client.QdrantClient(location=":memory:")

from llama_index.storage.storage_context import StorageContext
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
    SimpleDirectoryReader,
)

vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, create a vector store index from the documents, using the storage context and service context defined above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the created nodes to the vector store.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vector_store.add(nodes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To build a retrieval pipeline, generate a query embedding using a query string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query_str = "Can you tell me about the key concepts for safety finetuning"

query_embedding = embed_model.get_query_embedding(query_str)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, construct a Vector Store query and query the vector database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.vector_stores import VectorStoreQuery

query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)

query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll get the following results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Patient_ID: 58b70a59-eb30-4caa-b4b5-7871321515dd, Name: Kimberly Brown, Age:
32, Gender: Female, Blood_Type: O-, Medical_Condition: Diabetes, Cholesterol:
161.7899842312819, Glucose: 107.778261077734, Blood_Pressure: 100/81,
Medication: Sitagliptin, Dosage: 2 pills, Duration: 30 days
.......
.......
.......
Patient_ID: d4c865a0-d695-4721-bed9-9d47f5393bf4, Name: Michael Rowe, Age: 56,
Gender: Female, Blood_Type: O+, Medical_Condition: Hypertension, Cholesterol:
121.20389761494744, Glucose: 75.29441955653576, Blood_Pressure: 90/80,
Medication: Hydrochlorothiazide, Dosage: 2 pills, Duration: 22 days
Patient_ID: b91f4f27-6a6a-4005-8d3b-4c3b53efe57b, Name: James Wright, Age: 54,
Gender: Female, Blood_Type: A-, Medical_Condition: Diabetes, Cholesterol:
192.42692819824364, Glucose: 92.35717875040676, Blood_Pressure: 104/101,
Medication: Metformin, Dosage: 3 pills, Duration: 13 days

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, parse the results into a set of nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, put them into a retriever.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from typing import Any, List


class VectorDBRetriever(BaseRetriever):
    """Retriever over a qdrant vector store."""

    def __init__(
        self,
        vector_store: QdrantVectorStore,
        embed_model: Any,
        query_mode: str = "default",
        similarity_top_k: int = 2,
    ) -&amp;gt; None:
        """Init params."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -&amp;gt; List[NodeWithScore]:
        """Retrieve."""
        # Use the instance's own embed model and vector store, not the globals.
        query_embedding = self._embed_model.get_query_embedding(
            query_bundle.query_str
        )
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)

        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))

        return nodes_with_scores

retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a Retriever Query Engine, and plug the above into it to synthesize the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever, service_context=service_context
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, it’s time to query the Retriever Query Engine and see the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query_str = "Write prescription for Diabetes"

response = query_engine.query(query_str)
print(str(response))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query can take a significant amount of time to run; be patient, and you will get a response like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metformin, Dosage: 3 pills, Duration: 12 days

Please note that the answer is based on the context information provided and not on any prior knowledge or real-world data.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s see the source node of this response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(response.source_nodes[0].get_content())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following is the Source Node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Patient_ID: ea9121cf-22d3-4053-9597-32816a087d6b, Name: Tracy Mendez, Age:
41, Gender: Male, Blood_Type: B+, Medical_Condition: Diabetes, Cholesterol:
155.00542923679996, Glucose: 142.74790733131314, Blood_Pressure: 95/83,
Medication: Metformin, Dosage: 2 pills, Duration: 30 days
Patient_ID: f970b6c9-2914-4374-8fce-985a9c3ad5c1, Name: Victor Burns, Age: 34,
Gender: Male, Blood_Type: AB+, Medical_Condition: Diabetes, Cholesterol:
123.4148196061812, Glucose: 135.01188456651374, Blood_Pressure: 108/82,
Medication: Insulin, Dosage: 1 pills, Duration: 26 days
Patient_ID: 5e051e8c-e507-44f1-b177-686222bc8402, Name: Edward Webb, Age:
25, Gender: Female, Blood_Type: O+, Medical_Condition: Diabetes, Cholesterol:
113.6267476444252, Glucose: 88.16188232757526, Blood_Pressure: 93/75,
Medication: Glipizide, Dosage: 3 pills, Duration: 5 days
Patient_ID: 47d9d8d3-870b-4084-b81b-2377742a0c45, Name: Yvonne Mosley, Age:
39, Gender: Female, Blood_Type: AB+, Medical_Condition: Hypertension,
Cholesterol: 110.03196972436749, Glucose: 83.9354746313523, Blood_Pressure:
137/117, Medication: Lisinopril, Dosage: 1 pills, Duration: 26 days
Patient_ID: 93d04076-5219-4f07-8d7c-26f9512864c9, Name: Jeffrey Solis, Age: 53,
Gender: Female, Blood_Type: O-, Medical_Condition: Hypertension, Cholesterol:
102.56021178266424, Glucose: 148.7046530174272, Blood_Pressure: 97/81,
Medication: Hydrochlorothiazide, Dosage: 3 pills, Duration: 5 days
Patient_ID: 144963e2-a9d8-4b9a-a6b8-87da14429a98, Name: Sabrina Figueroa,
Age: 66, Gender: Female, Blood_Type: B+, Medical_Condition: Hypertension,
Cholesterol: 163.80805126265315, Glucose: 98.1830736526342, Blood_Pressure:
92/64, Medication: Lisinopril, Dosage: 3 pills, Duration: 3 days
Patient_ID: 8e6f7b14-f84e-4415-bfa0-4b90bb998474, Name: Patricia Kline, Age: 34,
Gender: Female, Blood_Type: O-, Medical_Condition: Diabetes, Cholesterol:
175.96974315251947, Glucose: 93.85396377117868, Blood_Pressure: 112/99,
Medication: Metformin, Dosage: 1 pills, Duration: 12 days
Patient_ID: c20d1263-7757-4b36-94ef-92bc2d10cd88, Name: Michael Wilcox, Age:
18, Gender: Female, Blood_Type: A-, Medical_Condition: Diabetes, Cholesterol:
176.40579895801716, Glucose: 138.79382669587685, Blood_Pressure: 113/64,
Medication: Metformin, Dosage: 3 pills, Duration: 6 days
Patient_ID: b9793557-9e83-493b-a9aa-a248a1ebb222, Name: Brandon Tucker, Age:
29, Gender: Female, Blood_Type: O+, Medical_Condition: Diabetes, Cholesterol:
126.00771705145566, Glucose: 155.00180031226188, Blood_Pressure: 101/67,
Medication: Metformin, Dosage: 3 pills, Duration: 3 days

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: The medical dataset used here is synthetic and fake. It has no relation to real medicine dosage or duration.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Leveraging the RAG architecture with tools like LlamaIndex, Llama CPP, and Qdrant Vector Store has been a fascinating journey. Through the utilization of Qdrant's sophisticated HNSW algorithm, searching through patients' medical histories and records has become effortless and rapid. This integration highlights the potential of innovative technologies to enhance healthcare processes, which can ultimately lead to improved patient care and outcomes.&lt;/p&gt;

&lt;p&gt;This article was originally posted here: &lt;a href="https://medium.com/@akriti.upadhyay/integrating-llamaindex-and-qdrant-similarity-search-for-patient-record-retrieval-7090e77b971e"&gt;https://medium.com/@akriti.upadhyay/integrating-llamaindex-and-qdrant-similarity-search-for-patient-record-retrieval-7090e77b971e&lt;/a&gt;&lt;br&gt;
Thanks for reading!&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>database</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building Advanced RAG Applications Using FalkorDB, LangChain, Diffbot API, and OpenAI</title>
      <dc:creator>Akriti Upadhyay</dc:creator>
      <pubDate>Thu, 18 Jan 2024 07:02:18 +0000</pubDate>
      <link>https://dev.to/akritiu/building-advanced-rag-applications-using-falkordb-langchain-diffbot-api-and-openai-1o4k</link>
      <guid>https://dev.to/akritiu/building-advanced-rag-applications-using-falkordb-langchain-diffbot-api-and-openai-1o4k</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The arrival of knowledge graph databases in the evolving Large Language Model ecosystem has changed the way RAG applications are built. Because RAG mitigates knowledge limitations like hallucinations and knowledge cut-offs, it is a common foundation for QA chatbots. Knowledge Graphs store and query the original data while capturing the different entities and relations embedded in it. &lt;/p&gt;

&lt;p&gt;With the help of a Knowledge Graph, we’ll build an advanced RAG application using the OpenAI base model. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Before diving deeper, let’s understand more about Knowledge Graphs!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a Knowledge Graph?
&lt;/h2&gt;

&lt;p&gt;A knowledge graph is centered around a knowledge model: interconnected descriptions of concepts, entities, relationships, and events. These descriptions have formal semantics, enabling both humans and computers to process them efficiently and unambiguously. The descriptions contribute to one another and form a network in which each entity represents part of the description of the entities related to it. Based on the established knowledge model, diverse data is linked and described through semantic metadata.&lt;/p&gt;

&lt;p&gt;Knowledge graphs integrate features from various data management paradigms, functioning as a database, a network, and a knowledge base simultaneously. They serve as a database by allowing structured queries on the data, operate as a network that can be analyzed like any other network data structure, and function as a knowledge base because of the formal semantics of the data. The formal semantics enable the interpretation of data and the inference of new facts. In essence, a knowledge graph is a unified platform where the same data can assume different roles.&lt;/p&gt;

&lt;p&gt;Knowledge graphs function as databases when users utilize query languages for creating complex and structured queries to extract specific data. Unlike relational databases, knowledge graph schemas are dynamic, not requiring a predefined structure, and they avoid data normalization constraints. Additionally, knowledge graphs operate like any other graph, with vertices and labeled edges, which make them amenable to graph optimizations and operations. However, the true power of knowledge graphs lies in the formal semantics attached to vertices and edges, which allows both humans and machines to infer new information without introducing factual errors into the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Knowledge Graphs?
&lt;/h2&gt;

&lt;p&gt;Knowledge graphs stand out as a dynamic and scalable solution that effectively meets enterprise data management needs across various industries. Going beyond that, they act as central hubs for data, metadata, and content, providing a unified, consistent, and unambiguous perspective on data distributed across diverse systems. Moreover, knowledge graphs enhance their proprietary information by incorporating global knowledge as context for interpretation and as a source for enrichment, which adds intelligence to the data.&lt;/p&gt;

&lt;p&gt;Knowledge graphs offer a comprehensive solution to challenges in the global data ecosystem faced by organizations of all sizes. These challenges include dealing with diverse data sources and types, often spread across legacy systems used beyond their original purposes. Knowledge graphs provide a higher-level abstraction, by separating data formats from their intended purpose.&lt;/p&gt;

&lt;p&gt;Traditional data management solutions created a gap between real-world information and the limitations of software and hardware, making them less intuitive. Knowledge graphs bridge this gap by replicating the connective nature of how humans express and consume information by adapting to changes in understanding and information needs.&lt;/p&gt;

&lt;p&gt;The concept of rigid and unchanging data schemas becomes a challenge as business cases evolve. Knowledge graphs offer flexibility by allowing dynamic adjustments to schemas without altering the underlying data. This adaptability proves invaluable in a constantly changing world, where maintaining a fixed schema is impractical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Knowledge Graph vs Vector Database
&lt;/h2&gt;

&lt;p&gt;Vector databases have become the default choice for indexing, storing, and retrieving data that is later used as context for questions or tasks presented to Large Language Models. This involves chunking data into smaller pieces, creating embeddings for each piece, and storing the data and its embeddings in a vector database. However, this approach has a major limitation: it relies on retrieving semantically similar vectors from the database, and those chunks can lack information that is essential for the LLM.&lt;/p&gt;

&lt;p&gt;A knowledge graph provides an alternative approach: it stores and queries the original documents while capturing the entities and relations within the data. The process starts with constructing a knowledge graph from documents by identifying entities and relationships. This graph serves as a knowledge base for LLMs, which allows richer context construction. User questions are translated into graph queries, which leverage all connections within the graph for a more comprehensive context. This enriched context, combined with the original question, is presented to the LLM to generate more informed answers. This approach surpasses the limitations of relying solely on semantically similar vectors.&lt;/p&gt;
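&lt;p&gt;The difference is easy to see with a toy sketch. The snippet below (plain Python, not a FalkorDB or vector-database API; the entities are illustrative) stores a graph as an adjacency list and walks its relationships, returning connected second-hop facts that a pure nearest-neighbor lookup over embeddings would not surface:&lt;/p&gt;

```python
# Toy knowledge graph as an adjacency list: entity -> [(relation, entity)].
# Entities and relations are invented for illustration.
graph = {
    "University of Washington": [("located_in", "Washington")],
    "Washington": [("part_of", "United States")],
    "Jain Foundation": [("located_in", "Washington")],
}

def neighbors(entity, depth=2):
    """Collect all facts reachable from `entity` within `depth` hops."""
    facts, frontier = [], [entity]
    for _ in range(depth):
        nxt = []
        for node in frontier:
            for rel, target in graph.get(node, []):
                facts.append((node, rel, target))
                nxt.append(target)
        frontier = nxt
    return facts

# The walk returns the direct fact plus the connected second-hop fact,
# giving the LLM richer context than the single matched entity.
print(neighbors("University of Washington"))
```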

&lt;h2&gt;
  
  
  RAG Architecture with KGLLMs
&lt;/h2&gt;

&lt;p&gt;Knowledge Graphs serve as potent tools for structuring and querying data, capturing relationships, and enriching information from external sources. However, Knowledge Graphs face challenges with unstructured natural language data, which is often ambiguous and incomplete. Large Language Models, adept at generating natural language, offer a solution by understanding syntactic and semantic patterns. Yet, LLMs have limitations, including generating inaccurate or biased texts.&lt;/p&gt;

&lt;p&gt;An interaction between Knowledge Graphs and LLMs proves powerful. By combining them, we can address the weaknesses and leverage the strengths. A framework called Knowledge Graphs with LLMs (KGLLM) showcases practical applications like question-answering, text summarization, and creative text generation. KGLLM not only uses Knowledge Graphs to inform LLMs but also employs LLMs to generate Knowledge Graphs, which helps in achieving bidirectional communication. Techniques like entity linking, embedding, and knowledge injection enhance accuracy and diversity, while knowledge extraction, completion, and refinement expand Knowledge Graph coverage and maintain quality.&lt;/p&gt;

&lt;p&gt;We have seen how RAG works with vector databases. Let’s understand how RAG works with Knowledge Graphs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F239jjre0crzpekyu27i2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F239jjre0crzpekyu27i2.png" alt="Image description" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the context of the Knowledge Graph Large Language Model working with Retrieval-Augmented Generation, the process involves the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Documents:&lt;/strong&gt; Start with a diverse corpus of documents from various sources containing extensive information. Data from external sources like Wikipedia and DBPedia could be used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity Extractor:&lt;/strong&gt; Employ an entity extractor using Natural Language Processing techniques to identify and extract entities (people, places, events) and their relationships from the documents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Graph:&lt;/strong&gt; Utilize the extracted entities and relationships to construct a structured and semantic knowledge graph, by forming a network of interconnected entities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; Train a Large Language Model on broad text corpora encompassing human knowledge. LLMs simplify information retrieval from Knowledge Graphs, which offers user-friendly access without requiring a data expert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph Query:&lt;/strong&gt; The LLM retrieves relevant information from the Knowledge Graph using vector and semantic search when a query is passed. It augments the response with contextual data from the Knowledge Graph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content:&lt;/strong&gt; Generate the final content using the RAG LLM process, ensuring precision, accuracy, and contextually relevant output while preventing false information &lt;em&gt;(LLM hallucination)&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;
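&lt;p&gt;The six steps above can be sketched end to end in a few lines. Here a naive regular expression stands in for a real entity extractor, and simple string assembly stands in for the LLM call; the sentences and names are invented for illustration:&lt;/p&gt;

```python
import re

# Step 1: a tiny document corpus.
documents = [
    "Ada Lovelace worked with Charles Babbage.",
    "Charles Babbage designed the Analytical Engine.",
]

# Step 2: a stand-in entity/relation extractor (a real system would use NLP).
def extract_triples(text):
    m = re.match(r"(.+?) (worked with|designed) (the )?(.+)\.", text)
    return [(m.group(1), m.group(2).replace(" ", "_"), m.group(4))] if m else []

# Step 3: the extracted triples form the knowledge graph.
triples = [t for doc in documents for t in extract_triples(doc)]

# Step 5: a graph query retrieves every fact mentioning an entity.
def graph_query(entity):
    return [t for t in triples if entity in (t[0], t[2])]

# Step 6: the retrieved facts become context in the prompt sent to the LLM.
context = graph_query("Charles Babbage")
prompt = f"Context: {context}\nQuestion: Who designed the Analytical Engine?"
print(prompt)
```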

&lt;p&gt;This high-level overview explains how a Knowledge Graph LLM functions with RAG, which acknowledges potential variations based on specific implementations and use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  FalkorDB: An Open Knowledge Graph
&lt;/h2&gt;

&lt;p&gt;FalkorDB is a high-performance graph database designed for applications that prioritize fast response times without compromising on data modeling. It succeeds RedisGraph and is recognized for its exceptionally low latency, and users trust it for its uncompromising performance. It can be run easily using Docker.&lt;/p&gt;

&lt;p&gt;FalkorDB stands out as the preferred Knowledge Database for Large Language Models due to its distinctive features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Super Low Latency:&lt;/strong&gt; FalkorDB's exceptionally low latency is well-suited for applications that require rapid response times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Powerful Knowledge Graphs:&lt;/strong&gt; FalkorDB employs powerful knowledge graphs by efficiently representing and querying structured data, and capturing relationships and attributes of entities like people, places, events, and products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Combination with LLMs:&lt;/strong&gt; By integrating Knowledge Graphs with LLMs, FalkorDB capitalizes on the strengths of both, which mitigates their weaknesses. Knowledge Graphs offer structured data to LLMs, while LLMs contribute natural language generation and understanding capabilities to Knowledge Graphs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM Context Generation:&lt;/strong&gt; FalkorDB constructs a knowledge graph from documents, identifying entities and their relationships, to serve as a knowledge base for LLMs. Interestingly, LLMs themselves can be employed in this construction step.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These features make FalkorDB an influential tool for LLMs, which provides them with a structured, low-latency knowledge base. This integration enables LLMs to produce high-quality and relevant texts that establish FalkorDB as the top KnowledgeDB for LLMs. To get started with FalkorDB, visit their &lt;a href="https://www.falkordb.com/home"&gt;website&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing RAG with FalkorDB, Diffbot API, LangChain, and OpenAI
&lt;/h2&gt;

&lt;p&gt;To get started, first install all the dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install langchain
%pip install langchain-experimental
%pip install langchain-openai
%pip install falkordb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start FalkorDB locally using docker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -p 6379:6379 -it -rm falkordb/falkordb:edge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log in to Diffbot on its website to get an API key; you’ll see the API token on the upper right. To learn more about using the Diffbot API with FalkorDB, visit this &lt;a href="https://www.falkordb.com/blog/diffbot-graph-transformer"&gt;blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Diffbot API is a robust tool for extracting structured data from unstructured documents like web pages, PDFs, and emails. You can use it to build a knowledge graph, capturing the entities and relationships within your documents, and store the graph in FalkorDB. To query and retrieve information from the knowledge graph, LangChain can be employed: it handles intricate natural language queries and provides accurate, relevant answers based on the stored data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer

diffbot_api_key = "your-api-key"
diffbot_nlp = DiffbotGraphTransformer(diffbot_api_key=diffbot_api_key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we’ll use LangChain’s WikipediaLoader to load data from an external source. The loader retrieves text from Wikipedia articles using the Python package wikipedia. It accepts inputs in the form of page titles or keywords that uniquely identify a Wikipedia page. As it stands, the loader extracts only text and does not consider images, tables, or other elements. I’m querying ‘Washington’ here; let’s see what knowledge it can provide.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.document_loaders import WikipediaLoader
query = "Washington"

raw_documents = WikipediaLoader(query=query).load()

graph_documents = diffbot_nlp.convert_to_graph_documents(raw_documents)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we’ll use FalkorDBGraph to create and store a Knowledge Graph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.graphs import FalkorDBGraph
graph = FalkorDBGraph("falkordb")

graph.add_graph_documents(graph_documents)

graph.refresh_schema()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll need an OpenAI API key, so pass your key to the environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

os.environ["OPENAI_API_KEY"] = "your-api-key"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we’ll pass in the graph we created and the LLM we’re using here, OpenAI. We’re using OpenAI’s base chat model with temperature=0, which can be used with a free-tier OpenAI API key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_openai import ChatOpenAI
from langchain.chains import FalkorDBQAChain
chain = FalkorDBQAChain.from_llm(ChatOpenAI(temperature=0), graph=graph, verbose=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, pass your question in the chain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chain.run("Which university is in Washington")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following will be the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Entering new FalkorDBQAChain chain...
Generated Cypher:
MATCH (o:Organization)-[:ORGANIZATION_LOCATIONS]-&amp;gt;(l:Location)
WHERE l.name = 'Washington'
RETURN o.name
Full Context:
[['Jain Foundation'], ['University of Washington'], ['Washington']]

&amp;gt; Finished chain.
'The University of Washington is located in Washington.'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s try another question in the chain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chain.run("Is Washington D.C. and Washington same?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following will be the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Entering new FalkorDBQAChain chain...
Generated Cypher:
MATCH (l:Location)
WHERE l.name = 'Washington D.C.'
MATCH (w:Location)
WHERE w.name = 'Washington'
RETURN l, w
Full Context:
[]

&amp;gt; Finished chain.
'Yes, Washington D.C. and Washington are not the same. Washington D.C. is the capital of the United States, while Washington refers to the state located on the West Coast of the country.'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results were quite good, even with OpenAI’s base model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With the introduction of Knowledge Graphs, it’s easy to implement RAG in a way that reduces LLM hallucinations. Knowledge Graphs are simpler to reason about than vector databases: your data is not only stored and queried, but the relationships between its entities are captured as well.&lt;/p&gt;

&lt;p&gt;It was interesting for me to build a RAG application with the open Knowledge Graph database FalkorDB. Now, it’s your turn. Enjoy and have fun!&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

&lt;p&gt;This article was originally published here: &lt;a href="https://medium.com/@akriti.upadhyay/building-advanced-rag-applications-using-falkordb-langchain-diffbot-api-and-openai-083fa1b6a96c"&gt;https://medium.com/@akriti.upadhyay/building-advanced-rag-applications-using-falkordb-langchain-diffbot-api-and-openai-083fa1b6a96c&lt;/a&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>datascience</category>
      <category>coding</category>
      <category>openai</category>
    </item>
    <item>
      <title>How to Make an Automatic Speech Recognition System with Wav2Vec 2.0 on E2E’s Cloud GPU Server</title>
      <dc:creator>Akriti Upadhyay</dc:creator>
      <pubDate>Tue, 02 Jan 2024 04:31:01 +0000</pubDate>
      <link>https://dev.to/akritiu/how-to-make-an-automatic-speech-recognition-system-with-wav2vec-20-on-e2es-cloud-gpu-server-2eib</link>
      <guid>https://dev.to/akritiu/how-to-make-an-automatic-speech-recognition-system-with-wav2vec-20-on-e2es-cloud-gpu-server-2eib</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Creating an Automatic Speech Recognition (ASR) system using Wav2Vec 2.0 on E2E's Cloud GPU server is a compelling endeavor that brings together cutting-edge technology and robust infrastructure. Leveraging the power of Wav2Vec 2.0, a state-of-the-art framework for self-supervised learning of speech representations, and harnessing the capabilities of E2E's Cloud GPU server, this guide will walk you through the process of developing an efficient ASR system. From setting up the necessary components to fine-tuning your model on the cloud, this article will provide a comprehensive overview, enabling you to harness the potential of ASR for diverse applications.&lt;/p&gt;

&lt;p&gt;Let's embark on a journey to build a high-performance Automatic Speech Recognition System on E2E's Cloud GPU server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic Speech Recognition System
&lt;/h2&gt;

&lt;p&gt;Automatic Speech Recognition (ASR) is a technology that empowers computers to interpret and transcribe spoken language into written text. This innovative field, falling under the umbrella of computational linguistics, plays a pivotal role in applications ranging from voice-to-text dictation software to virtual assistants, call center systems, transcription services, and accessibility tools. Additionally, ASR systems can be employed for speech synthesis, generating spoken language from text or other inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of ASR Systems&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule-Based ASR&lt;/strong&gt;: Rule-based Automatic Speech Recognition (ASR) employs predefined rules and patterns to match acoustic features with linguistic units, prioritizing simplicity and speed. While effective in well-defined language and low-noise scenarios, it has limitations such as a restricted vocabulary and struggles with noisy or ambiguous speech. Consequently, rule-based ASR excels in specific contexts but may not be ideal for applications demanding flexibility in vocabulary or robust performance in challenging acoustic conditions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Statistical ASR:&lt;/strong&gt; Statistical Automatic Speech Recognition (ASR) utilizes statistical models to map acoustic features to linguistic units, demonstrating adaptability with variable vocabulary sizes and noise resilience. While versatile, it requires substantial training data and computational resources for accurate models. The statistical approach enables the system to extract intricate patterns from diverse datasets, enhancing its flexibility in transcribing speech. This makes statistical ASR well-suited for applications dealing with varying vocabulary sizes and common challenges like ambient noise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Neural Network-Based ASR:&lt;/strong&gt; Neural network-based Automatic Speech Recognition (ASR) utilizes deep learning to achieve high accuracy and robust performance in transcribing speech. By discerning intricate patterns in speech signals, it excels in recognizing nuances. However, this heightened performance requires substantial training data and computational resources. The training process involves exposing the network to large, diverse datasets, demanding significant computational power. Despite its resource-intensive nature, neural network-based ASR is a formidable choice for applications prioritizing precision and adaptability in handling diverse linguistic patterns.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Examples of Neural Network-Based ASR Models&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformer-Based ASR:&lt;/strong&gt; Transformer-based Automatic Speech Recognition (ASR) employs attention mechanisms to capture long-range dependencies between acoustic features and linguistic units. This approach has demonstrated exceptional performance, achieving state-of-the-art results on benchmark datasets like LibriSpeech and Common Voice. The use of attention mechanisms allows the model to effectively analyze and incorporate information from distant parts of the input, enhancing its ability to transcribe spoken language with high accuracy and efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CNN-Based ASR:&lt;/strong&gt; CNN-based Automatic Speech Recognition (ASR) utilizes convolutional filters to extract local features from acoustic signals and can be seamlessly combined with RNNs or transformers. This approach excels in handling variable-length sequences and effectively addresses long-term dependencies in speech data. By leveraging convolutional filters, the model efficiently captures local patterns, while its compatibility with other architectures enhances its adaptability to diverse speech recognition challenges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid Neural Network-Based ASR:&lt;/strong&gt; Hybrid Automatic Speech Recognition (ASR) combines various neural network architectures to capitalize on their individual strengths and mitigate limitations. An illustrative example involves integrating Recurrent Neural Networks (RNNs) to handle short-term dependencies and transformers to address long-term dependencies. This hybrid approach allows the system to benefit from the complementary features of different architectures, enhancing its overall performance and adaptability in recognizing and transcribing spoken language.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Wav2Vec2.0: Self-Supervised Speech Learning
&lt;/h2&gt;

&lt;p&gt;Wav2Vec 2.0 represents a cutting-edge framework for self-supervised learning of speech representations, offering a revolutionary approach to extracting meaningful features from speech audio without relying on human annotations or labeled data. The framework is underpinned by the innovative concept of contrastive learning, a technique that enables the model to distinguish between similar and dissimilar inputs. This is achieved by maximizing the similarity between positive pairs (e.g., different segments of the same speaker) while simultaneously minimizing the similarity between negative pairs (e.g., representations from different speakers).&lt;/p&gt;
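&lt;p&gt;The contrastive idea can be made concrete with a small, self-contained sketch. This is an InfoNCE-style loss in plain Python, not the exact Wav2Vec 2.0 objective; the vector sizes, temperature, and random data are illustrative only:&lt;/p&gt;

```python
import math
import random

random.seed(0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def contrastive_loss(context, positive, negatives, temperature=0.1):
    """Negative log-probability of the positive among all candidates."""
    sims = [cosine(context, positive)] + [cosine(context, n) for n in negatives]
    sims = [s / temperature for s in sims]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    return -math.log(exps[0] / sum(exps))

def vec():
    return [random.gauss(0, 1) for _ in range(16)]

c = vec()
# Loss is low when the positive matches the context representation...
loss_aligned = contrastive_loss(c, c, [vec() for _ in range(5)])
# ...and higher when the "positive" is unrelated, pushing dissimilar pairs apart.
loss_random = contrastive_loss(c, vec(), [vec() for _ in range(5)])
print(loss_aligned < loss_random)
```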

&lt;p&gt;&lt;strong&gt;Components&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Speech Encoder:&lt;/strong&gt; The speech encoder is a neural network that takes raw audio as input and produces a latent representation - a compact, high-dimensional vector capturing the essential characteristics of the speech signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contrastive Loss Function:&lt;/strong&gt; The contrastive loss function evaluates the fidelity of this representation by distinguishing the true latent representation from distractor representations drawn from different speakers or segments.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
The architecture of Wav2Vec 2.0 is designed to facilitate self-supervised learning of speech representations. Here are the key components of its architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Layer Convolutional Feature Encoder (f: X → Z):&lt;/strong&gt; The model begins with a multi-layer convolutional feature encoder. This encoder, denoted f: X → Z, takes raw audio input X and produces latent speech representations z_1,…,z_T over T time-steps. These representations capture essential characteristics of the speech signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformer (g: Z → C) for Contextualized Representations:&lt;/strong&gt; The latent speech representations z_1,…,z_T are then fed into a Transformer network, denoted as g: Z → C, to build contextualized representations c_1,…,c_T. The Transformer architecture is known for its effectiveness in capturing dependencies over entire sequences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantization Module (Z → Q):&lt;/strong&gt; The output of the feature encoder is discretized using a quantization module Z → Q. This module involves selecting quantized representations from multiple codebooks and concatenating them. The process is facilitated by a Gumbel softmax, which allows for the differentiable selection of discrete codebook entries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gumbel Softmax for Discrete Codebook Entries:&lt;/strong&gt; The Gumbel softmax is employed to choose discrete codebook entries in a fully differentiable manner: the hard Gumbel softmax selects entries in the forward pass, while the straight-through estimator supplies gradients during training. This mechanism lets the model choose discrete codebook entries while remaining differentiable for effective training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextualized Representations with Transformers:&lt;/strong&gt; The output of the feature encoder is further processed by a context network following the Transformer architecture. This network incorporates relative positional information using a convolutional layer and adds the output to the inputs, followed by a GELU activation function and layer normalization.&lt;/li&gt;
&lt;/ol&gt;
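&lt;p&gt;As a rough sketch of what the convolutional feature encoder does to the time axis: in the base configuration, the seven temporal convolutions (kernel widths 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2) have a total stride of 320 samples, so 16 kHz audio is encoded at roughly 49 latent frames per second. A small helper to verify the arithmetic:&lt;/p&gt;

```python
def conv_out_len(n, kernel, stride):
    """Output length of a 1-D convolution with no padding."""
    return (n - kernel) // stride + 1

def encoder_frames(n_samples):
    """Number of latent frames the Wav2Vec 2.0 feature encoder
    produces for a raw waveform of n_samples samples."""
    # (kernel, stride) for the seven temporal conv layers (base configuration)
    for kernel, stride in [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]:
        n_samples = conv_out_len(n_samples, kernel, stride)
    return n_samples

print(encoder_frames(16000))  # one second of 16 kHz audio
```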
&lt;h2&gt;
  
  
  The Need for a Cloud GPU
&lt;/h2&gt;

&lt;p&gt;Automatic Speech Recognition (ASR) is a technology enabling computers to convert spoken language into written text, falling within the domain of computational linguistics. ASR finds diverse applications, including voice-to-text dictation, virtual assistants, call centers, transcription services, and accessibility tools. ASR faces challenges due to the complexity and variability of speech signals, necessitating advanced techniques to extract meaningful features and map them to linguistic units.&lt;/p&gt;

&lt;p&gt;Deep learning, a popular approach in ASR, involves neural networks that learn hierarchical representations from raw or preprocessed audio data, capturing both low-level acoustic features and high-level semantic features. However, deep learning models demand significant computational resources. GPU-accelerated ASR addresses this by utilizing graphics processing units (GPUs) to expedite the training and inference processes. This acceleration enhances recognition accuracy, even on embedded and mobile systems, and facilitates rapid transcription of pre-recorded speech or multimedia content.&lt;/p&gt;

&lt;p&gt;A dedicated cloud GPU for ASR brings several advantages over general-purpose CPUs or other hardware devices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Efficient Handling of Large-Scale Datasets: GPUs efficiently handle large-scale datasets with scalability.&lt;/li&gt;
&lt;li&gt;Support for Multiple Models: GPUs can support multiple models with varying architectures and parameters.&lt;/li&gt;
&lt;li&gt;Reduced Latency and Bandwidth Consumption: Distributing workload across multiple GPUs reduces latency and bandwidth consumption.&lt;/li&gt;
&lt;li&gt;Real-time or Near-Real-time Applications: Cloud GPUs enable real-time or near-real-time applications that demand fast response times.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;E2E Networks: Advanced Cloud GPU&lt;/strong&gt;&lt;br&gt;
E2E Networks is a prominent hyperscaler from India, specializing in state-of-the-art Cloud GPU infrastructure. Its offerings include cutting-edge Cloud GPUs such as the A100/V100/H100 and the AI Supercomputer HGX 8xH100, providing accelerated cloud computing solutions at highly competitive rates. For detailed information on the products available from E2E Networks, visit their website. When choosing the optimal GPU for implementing the Automatic Speech Recognition system, base the decision on your specific requirements and budget. In my case, I used a dedicated GPU compute node (V100-8-120 GB) with CUDA 11 for efficient performance.&lt;/p&gt;

&lt;p&gt;First, log in to the node remotely via SSH from your local system. Now, let's start implementing the code.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementing ASR with Wav2Vec 2.0
&lt;/h2&gt;

&lt;p&gt;First, install all the libraries required for our implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install -q torch numpy transformers datasets evaluate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Import all the packages that we are going to need in our implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datasets import load_dataset, Audio
from transformers import AutoProcessor
import torch
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union
import evaluate
import numpy as np
from transformers import AutoModelForCTC, TrainingArguments, Trainer
from transformers import pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I have taken a multilingual ASR dataset, FLEURS (google/fleurs), from Hugging Face.&lt;br&gt;
Load the first 100 examples of the English (en_us) train split.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data= load_dataset("google/fleurs", name="en_us", split="train[:100]")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's split the dataset so that we can set aside a test set for evaluation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data= data.train_test_split(test_size=0.2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we'll load a Wav2Vec2 processor to handle the audio signal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we'll resample the audio to 16,000 Hz, the sampling rate the pretrained Wav2Vec2 model expects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data= data.cast_column("audio", Audio(sampling_rate=16000))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dataset's transcriptions are in lowercase, but the Wav2Vec2 tokenizer is trained only on uppercase characters, so we'll need to make sure the text matches the tokenizer's vocabulary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def uppercase(example):
 return {"transcription": example["transcription"].upper()}

data= data.map(uppercase)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, define a preprocessing function that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Invokes the audio column to load and resample the audio file.&lt;/li&gt;
&lt;li&gt;Retrieves the input_values from the audio file and tokenizes the transcription column using the processor.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def prepare_dataset(batch):
 audio = batch["audio"]
 batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"])
 batch["input_length"] = len(batch["input_values"][0])
 return batch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To apply the preprocessing function across the entire dataset, use the Datasets map function. You can speed up the mapping by increasing the number of processes with the num_proc parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;encoded_data = data.map(prepare_dataset, num_proc=4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Transformers currently lacks a dedicated data collator for ASR, so we adapt DataCollatorWithPadding to build batches of examples. The adapted collator dynamically pads text and labels to the length of the longest element within the batch, rather than to the longest element in the entire dataset, ensuring uniform length. Although padding is achievable in the tokenizer function by setting padding=True, dynamic padding proves more efficient.&lt;/p&gt;

&lt;p&gt;Unlike other data collators, this one must apply different padding methods to input_values and labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define a data collator for CTC (Connectionist Temporal Classification) with padding
@dataclass
class DataCollatorCTCWithPadding:
 # AutoProcessor is expected to be a processor for audio data, e.g., from Hugging Face transformers library
 processor: AutoProcessor
 # padding parameter can be a boolean or a string, defaults to "longest"
 padding: Union[bool, str] = "longest"
 # __call__ method is used to make an instance of the class callable
 def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -&amp;gt; Dict[str, torch.Tensor]:
 # Split inputs and labels since they have to be of different lengths and need different padding methods
 input_features = [{"input_values": feature["input_values"][0]} for feature in features]
 label_features = [{"input_ids": feature["labels"]} for feature in features]
 # Pad input features using the processor
 batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
 # Pad label features using the processor
 labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")
 # Replace padding with -100 to ignore loss correctly
 labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
 # Add labels to the batch
 batch["labels"] = labels
 return batch`
Now instantiate your DataCollatorForCTCWithPadding.
`data_collator = DataCollatorCTCWithPadding(processor=processor, padding="longest")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Incorporating a metric during training is frequently beneficial for assessing your model's performance. You can effortlessly load an evaluation method using the Evaluate library. For this particular task, load the Word Error Rate (WER) metric.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wer = evaluate.load("wer")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
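&lt;p&gt;As a quick intuition for what WER measures: it is the word-level edit distance (substitutions, insertions, and deletions) between hypothesis and reference, divided by the number of reference words. A minimal hand-rolled sketch of the computation the evaluate library performs for us:&lt;/p&gt;

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution_cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,               # deletion
                          d[i][j - 1] + 1,               # insertion
                          d[i - 1][j - 1] + substitution_cost)
    return d[len(r)][len(h)] / len(r)

# One word dropped out of six reference words: WER = 1/6
print(word_error_rate("a tornado is a spinning column", "a tornado is spinning column"))
```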



&lt;p&gt;Then create a function that passes your predictions and labels to the metric's compute method to calculate the WER.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define a function to compute metrics (Word Error Rate in this case) for model evaluation
def compute_metrics(pred):
 # Extract predicted logits from the predictions
 pred_logits = pred.predictions
 # Convert logits to predicted token IDs by taking the argmax along the last axis
 pred_ids = np.argmax(pred_logits, axis=-1)
 # Replace padding in label IDs with the tokenizer's pad token ID
 pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
 # Decode predicted token IDs into strings
 pred_str = processor.batch_decode(pred_ids)
 # Decode label token IDs into strings without grouping tokens
 label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
 # Compute Word Error Rate (WER) using the wer.compute function
 wer = wer.compute(predictions=pred_str, references=label_str)
 # Return the computed metric (WER) as a dictionary
 return {"wer": wer}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Load Wav2Vec2 using AutoModelForCTC. Specify the reduction to apply with the ctc_loss_reduction parameter. It is typically preferable to use the average instead of the default summation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = AutoModelForCTC.from_pretrained(
 "facebook/wav2vec2-base",
 ctc_loss_reduction="mean",
 pad_token_id=processor.tokenizer.pad_token_id,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To store your fine-tuned model, make a directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%mkdir my_model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this stage, the final three steps in the process involve setting up your training hyperparameters in TrainingArguments, with the essential parameter being output_dir, specifying the model's saving location. If you intend to share your model on the Hugging Face Hub, simply set push_to_hub=True, ensuring that you are signed in to Hugging Face for the upload.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Enable gradient checkpoints for memory efficiency during training
model.gradient_checkpoints_enable()
# Define training arguments for the Trainer class
training_args = TrainingArguments(
 output_dir="my_model", # Directory to save the trained model
 per_device_train_batch_size=8, # Batch size per GPU device during training
 gradient_accumulation_steps=2, # Number of updates to accumulate before performing a backward/update pass
 learning_rate=1e-5, # Learning rate for the optimizer
 warmup_steps=500, # Number of steps for warm-up in the learning rate scheduler
 max_steps=2000, # Maximum number of training steps
 fp16=True, # Enable mixed-precision training using 16-bit floats
 group_by_length=True, # Group batches by input sequence length for efficiency
 evaluation_strategy="steps", # Evaluate the model every specified number of training steps
 per_device_eval_batch_size=8, # Batch size per GPU device during evaluation
 save_steps=1000, # Save the model every specified number of training steps
 eval_steps=1000, # Evaluate the model every specified number of training steps
 logging_steps=25, # Log training information every specified number of training steps
 load_best_model_at_end=True, # Load the best model at the end of training
 metric_for_best_model="wer", # Metric used for selecting the best model
 greater_is_better=False, # Whether a higher value of the metric is considered better
 push_to_hub=False, #push the model to the model hub after training if you want
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At every evaluation interval (every 1,000 steps here), the Trainer will compute the Word Error Rate (WER) and save a training checkpoint. Next, pass these training arguments to the Trainer, along with the model, dataset, processor, data collator, and compute_metrics function. Lastly, call train() to start fine-tuning your model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trainer = Trainer(
 model=model,
 args=training_args,
 train_dataset=encoded_data["train"],
 eval_dataset=encoded_data["test"],
 tokenizer=processor,
 data_collator=data_collator,
 compute_metrics=compute_metrics,
)
trainer.train()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's test an audio file from the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;audio_file = data[0]["audio"]["path"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To test your fine-tuned model for inference, the most straightforward approach is to utilize it in a pipeline(). Create a pipeline for automatic speech recognition, instantiate it with your model, and provide your audio file as input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;transcriber = pipeline("automatic-speech-recognition", model="/root/my_model")
transcriber(audio_file)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll get the following as a result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'text': 'A TORNADO IS A SPINNING COLUMN OF VERY LOW-PRESSURE AIR WHICH SUCKS 
THE SURROUNDING AIR INWARD AND UPWARD'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It was interesting to implement the ASR system with Wav2Vec 2.0, and we got the result we needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, the journey of crafting an Automatic Speech Recognition System with Wav2Vec 2.0 on E2E's Cloud GPU server has been a fascinating and enriching experience. The synergy between the advanced capabilities of Wav2Vec 2.0 and the robust infrastructure provided by E2E's Cloud GPU server has paved the way for a powerful ASR solution.&lt;/p&gt;

&lt;p&gt;Exploring the intricacies of model fine-tuning, leveraging self-supervised learning, and optimizing performance on the cloud has not only broadened our understanding of ASR technology but has also showcased the potential for real-world applications. &lt;br&gt;
The seamless integration of innovative frameworks with high-performance cloud resources opens doors to diverse possibilities in speech recognition.&lt;/p&gt;

&lt;p&gt;As we reflect on this endeavor, it becomes evident that the collaborative interplay of cutting-edge technology and cloud infrastructure has not only made the process intriguing but has also positioned us at the forefront of advancements in Automatic Speech Recognition.&lt;/p&gt;

&lt;p&gt;This article was originally published here: &lt;a href="https://medium.com/@akriti.upadhyay/how-to-make-an-automatic-speech-recognition-system-with-wav2vec-2-0-on-e2es-cloud-gpu-server-f946e1e49196"&gt;https://medium.com/@akriti.upadhyay/how-to-make-an-automatic-speech-recognition-system-with-wav2vec-2-0-on-e2es-cloud-gpu-server-f946e1e49196&lt;/a&gt;&lt;/p&gt;

</description>
      <category>wav2vec</category>
      <category>asr</category>
      <category>cloudgpu</category>
      <category>e2ecloud</category>
    </item>
    <item>
      <title>How to Use Sparse Vectors for Medical Data with Qdrant 1.7.0</title>
      <dc:creator>Akriti Upadhyay</dc:creator>
      <pubDate>Fri, 22 Dec 2023 04:24:48 +0000</pubDate>
      <link>https://dev.to/akritiu/how-to-use-sparse-vectors-for-medical-data-with-qdrant-170-2il2</link>
      <guid>https://dev.to/akritiu/how-to-use-sparse-vectors-for-medical-data-with-qdrant-170-2il2</guid>
      <description>&lt;p&gt;In traditional vector databases, which were designed to query only dense vectors, handling sparse vectors posed significant challenges. The inherent sparsity of these vectors, where a majority of dimensions contain zero values, led to inefficient storage and retrieval methods in such databases. However, with the advent of Qdrant 1.7.0, a pioneering update in the vector search engine landscape, querying sparse vectors has become more accessible and efficient.&lt;/p&gt;

&lt;p&gt;This release addresses the historical difficulties associated with sparse vectors, allowing users to seamlessly integrate them into their database queries. Qdrant 1.7.0 introduces native support for sparse vectors, revolutionizing the way vector databases handle data representations.&lt;/p&gt;

&lt;p&gt;One specific area where this advancement holds immense promise is in the realm of medical data. Sparse medical data, characterized by its often irregular and incomplete nature, has historically posed challenges for traditional vector databases that primarily catered to dense vectors. The introduction of Qdrant 1.7.0 brings a tailored solution to the problem of sparse medical data. By offering efficient querying capabilities for sparse vectors, Qdrant is poised to enhance the exploration and analysis of medical datasets, facilitating more effective and streamlined medical research and decision-making processes.&lt;/p&gt;
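&lt;p&gt;As a toy illustration of why sparse vectors are naturally stored as index/value pairs: an embedding with tens of thousands of nominal dimensions but only a handful of non-zeros fits in a small mapping, and a similarity score only touches the shared indices. This is a minimal sketch of the idea, not Qdrant's API; the dimension indices and values below are made up.&lt;/p&gt;

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension_index: value} dicts."""
    # Iterate over the vector with fewer non-zeros; only shared indices contribute
    small, large = sorted((a, b), key=len)
    return sum(value * large[index] for index, value in small.items() if index in large)

# Toy "embeddings" in a nominally 100,000-dimensional space,
# each with only three non-zero dimensions
query = {3: 0.5, 812: 1.2, 40511: 0.7}
document = {3: 1.0, 40511: 2.0, 99000: 0.3}
print(sparse_dot(query, document))  # 0.5*1.0 + 0.7*2.0 = 1.9
```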

&lt;p&gt;For a deeper dive into the code implementation of sparse vectors with Qdrant DB, visit my Medium article: &lt;a href="https://medium.com/@akriti.upadhyay/how-to-use-sparse-vectors-for-medical-data-with-qdrant-1-7-0-5a19c97c76b0"&gt;Medium Article&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sparsevectors</category>
      <category>qdrant</category>
      <category>medicaltranscription</category>
    </item>
    <item>
      <title>How to Augment GPT-4 with Qdrant to Elevate Its Poetry Composition Capabilities</title>
      <dc:creator>Akriti Upadhyay</dc:creator>
      <pubDate>Thu, 14 Dec 2023 13:03:44 +0000</pubDate>
      <link>https://dev.to/akritiu/how-to-augment-gpt-4-with-qdrant-to-elevate-its-poetry-composition-capabilities-31hb</link>
      <guid>https://dev.to/akritiu/how-to-augment-gpt-4-with-qdrant-to-elevate-its-poetry-composition-capabilities-31hb</guid>
      <description>&lt;p&gt;In the evolving realm of artificial intelligence, the collaboration between GPT-4 and Qdrant reshapes creative applications. While GPT-4 excels in generating human-like text, occasional shortcomings in coherent outputs, especially in poetry, prompted the integration of Qdrant, a high-dimensional vector database. This synergy augments GPT-4 with enhanced data-handling finesse, guiding it away from nonsensical outputs. The result is a transformative approach to poetry composition, where GPT-4's linguistic eloquence harmonizes with Qdrant's precision, promising a future where machines contribute meaningfully to artistic expression.&lt;/p&gt;

&lt;p&gt;We’ll use an approach with a dataset of Shakespeare’s works which could potentially enhance its ability to generate poetry that mirrors the poet’s style and depth as accurately as possible.&lt;/p&gt;

&lt;p&gt;Shakespeare’s works are renowned for their rich vocabulary, intricate poetic structure, their rhythm, and their profound exploration of human nature. By training GPT-4 on this dataset, the model could learn to mimic these aspects of the poet and playwright’s work, thereby elevating its poetry composition capabilities.&lt;/p&gt;

&lt;p&gt;After the code implementation, we passed 5 questions, and the results were amazing. This is one of the questions I asked the &lt;code&gt;qa_chain&lt;/code&gt;; see the results below. Isn't this interesting?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;question = "Write a poem in sonnet style on the theme 'Time and Mortality'."
result_4 = qa_chain({'question':question})
print(result_4['answer'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results were:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;With Time's decree, mortality is bound,
A fleeting whisper in the cosmic round.
In life's rich tapestry of joy and strife,
Each thread is woven with the stuff of life.

The art of healing, a secret, sacred dance,
Brings solace, turns the tide of dire circumstance.
In Nature's remedies, true delight is found,
A joy that echoes with a timeless sound.

Yet mortal fear, like a shadow in the night,
Can taint the soul, turn day's delight to fright.
For Time, the great leveller, brings us to our knees,
In the face of life's most profound mysteries.

So heed this plea, upon thy life, refrain, 
From seeking glory, wealth - for all is vain.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visit the code implementation in my medium article here: &lt;a href="https://medium.com/@akriti.upadhyay/how-to-augment-gpt-4-with-qdrant-to-elevate-its-poetry-composition-capabilities-acbb7379346f"&gt;https://medium.com/@akriti.upadhyay/how-to-augment-gpt-4-with-qdrant-to-elevate-its-poetry-composition-capabilities-acbb7379346f&lt;/a&gt;&lt;/p&gt;

</description>
      <category>qdrant</category>
      <category>gpt4</category>
      <category>vectordatabase</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
