DEV Community: Siddhant saxena

Extract Valuable Insights from Your Data Using AutoGPT with Qdrant

Siddhant saxena — Fri, 02 Feb 2024 11:04:46 +0000

This article presents an in-depth guide to extracting valuable insights from raw data using AutoGPT and Qdrant Database.

Workflow of AutoGPT

In the ever-evolving landscape of data science, extracting meaningful insights from vast datasets has been akin to finding a needle in a haystack. But what if we could transform this daunting task into an intuitive, efficient, and surprisingly straightforward journey? Welcome to my exploration of AutoGPT and Qdrant, two revolutionary tools that reshape how we interact with and understand our data.

Whether you are a seasoned data scientist, a curious beginner, or somewhere in between, this exploration is designed to illuminate the path to extracting valuable insights from your data. So, fasten your seatbelts and let’s embark on this exciting adventure together!

In this article, we will give the autonomous prompting powers to GPT and develop a chat-like system for interacting with large data files. I will be using Langchain as an interface between AutoGPT for retrieval tasks and Qdrant as a cloud vector store. This is an impressive tech stack that provides seamless integrations for data profiling, retrieval, and generation tasks for modern-day ecosystems.

Let’s first have a look at AutoGPT and understand its capabilities for executing complicated tasks.

AutoGPT: Supercharge GPT-4 with Autonomous Task Execution

AutoGPT represents an innovative leap in the field of automation, moving beyond the conventional boundaries of a standalone model. It’s a groundbreaking experiment that effectively harnesses the impressive capabilities of advanced Large Language Models like GPT-4 and GPT-3. The core objective of AutoGPT is to automate a variety of tasks by utilizing the vast knowledge and understanding embedded within these models. It does so by generating a series of instructions from the LLM and then executing them, primarily focusing on tasks that involve programming logic and step-by-step execution.

Key differences between LLMs and AutoGPT

To put this into perspective, consider the task of conducting exploratory data analysis (EDA) on a dataset as an example. AutoGPT employs a logical, step-by-step methodology for this complex process. Initially, it identifies the dataset and understands the type of analysis required. Then, it proceeds to write and execute a Python script for importing the data, often from a CSV or Excel file. Next, AutoGPT performs various data cleansing steps, such as handling missing values or outliers, followed by executing a series of commands for data visualization, like creating histograms, box plots, or scatter plots to understand data distributions and relationships. This approach simplifies the intricate process of EDA, which is fundamental in data science.
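To make these EDA steps concrete, here is a minimal sketch of the kind of script an agent might write for the import and cleansing stages. It is an illustration under stated assumptions, not AutoGPT output: the sample data, the column names, and the helper names (load_rows, to_float, impute_median) are all hypothetical, and it uses only the Python standard library.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for a CSV file on disk.
RAW_CSV = """age,income
34,52000
,61000
45,
29,48000
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def to_float(value):
    """Convert a cell to float, returning None for missing or invalid cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def impute_median(rows, column):
    """Return the column as floats, with missing values replaced by the median."""
    values = [to_float(row[column]) for row in rows]
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]

rows = load_rows(RAW_CSV)
for column in ("age", "income"):
    cleaned = impute_median(rows, column)
    print(column, "mean:", statistics.mean(cleaned), "max:", max(cleaned))
```

Median imputation is only one of several reasonable ways to handle missing values; it is used here because it is robust to outliers.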

The true brilliance of AutoGPT lies not just in the automation of these analytical steps, but in its ability to dynamically create and adapt Python scripts tailored to the specific needs of the dataset and the analysis objectives, making the exploratory process both efficient and insightful.

All right, now that we understand the AutoGPT framework and how it differs from LLMs like ChatGPT, it is time to gain a deeper understanding of its workings and current interfaces. Let’s also execute some custom tasks for better evaluation.

Executing Generative LM tasks on AutoGPT

We can easily use the official AutoGPT agent in Google Colab. Here we will run inference tasks with AutoGPT. Let’s first pull the official GitHub repo, Significant-Gravitas/AutoGPT. Keep your OpenAI API key ready for the AutoGPT environment configuration.

!git clone https://github.com/Significant-Gravitas/Auto-GPT.git -b stable --single-branch
%cd Auto-GPT/
!pip install -r requirements.txt
!cp .env.template env.txt

Edit the “env.txt” file and add your API keys. Then load this new configuration into the environment using “!cp env.txt .env”. We will initialize the AutoGPT CLI using the command below.

!python -m autogpt # If you have GPT-4 accessible keys
!python -m autogpt --gpt3only # If you do not have GPT-4 keys

# We can also use the --continuous argument for recursive agent execution.

Here we do not enable Continuous Mode, so the AutoGPT agent asks for confirmation before each step instead of running recursively on complex tasks.

Welcome to Auto-GPT!

Once this workflow is activated, AutoGPT takes the initial step of inquiring about the foundational task at hand, in this scenario, ‘generation’. Based on this primary task, AutoGPT deftly assigns a task-specific GPT model tailored to meet the specific needs of the task. For generation tasks, AutoGPT automatically designates the GenGPT agent, a specialized module designed for this purpose.

Workflow adopted by GenGPT Agent

The GenGPT agent operates using four pivotal components, each playing a unique role in the generation process:

GenGPT Thoughts: This is the core idea generation component of GenGPT. It involves the gathering and processing of information relevant to the task. This component synthesizes data from its trained knowledge base and integrates it with the context of the current request, essentially forming the ‘thoughts’ behind the response.
Reasoning: Here, GenGPT applies logical analysis and critical thinking to the information at hand. This step is crucial for ensuring that the response is not just based on data but is also logically sound and contextually appropriate. It’s where the agent evaluates different aspects of the information, checks for consistency, and forms coherent arguments or explanations.
Criticism: In this stage, GenGPT engages in a self-evaluation process. It critically assesses the response it has formulated, looking for potential flaws, biases, or inaccuracies. This internal review mechanism is key to maintaining the quality and reliability of the responses, ensuring that they meet a high standard of accuracy and relevance.
Speak: The final component is the delivery of the response. ‘Speak’ encompasses the way GenGPT articulates its response, ensuring it’s in a clear, coherent, and user-friendly manner. This involves not only the linguistic aspects but also tailoring the response to fit the user’s style of inquiry and the platform’s requirements.

Together, these components enable GenGPT to generate responses that are not only informed and logical but also critically sound and effectively communicated, showcasing the advanced capabilities of AutoGPT in handling generation tasks.
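The four components above can be pictured as fields of a structured reply. The sketch below parses such a reply and checks that every component is present. The JSON layout, the sample text, and the helper name parse_thoughts are assumptions made for illustration; they mirror the components described above rather than a guaranteed AutoGPT schema.

```python
import json

# Hypothetical agent reply; field names mirror the four components described above.
SAMPLE_REPLY = """
{
  "thoughts": {
    "text": "I should outline a storyline first.",
    "reasoning": "An outline keeps the plot consistent.",
    "criticism": "The outline may drift from the original book.",
    "speak": "I will draft a storyline about Alice."
  }
}
"""

REQUIRED_FIELDS = ("text", "reasoning", "criticism", "speak")

def parse_thoughts(raw):
    """Parse a reply and return its four components, failing loudly if one is missing."""
    data = json.loads(raw)
    thoughts = data.get("thoughts", {})
    missing = [field for field in REQUIRED_FIELDS if field not in thoughts]
    if missing:
        raise ValueError(f"reply is missing fields: {missing}")
    return {field: thoughts[field] for field in REQUIRED_FIELDS}

for name, value in parse_thoughts(SAMPLE_REPLY).items():
    print(f"{name.upper()}: {value}")
```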

Alice in Wonderland

User Input: “Generate a storyline about Alice in Wonderland”

GenGPT Response and Next ACTION

Title: “Alice’s Whimsical Journey Through the Clockwork Garden”

In the quaint village of Darlington, young Alice, known for her boundless curiosity, discovered a hidden, moss-covered doorway in the heart of the old forest. Pushing the door open, she stepped into a world unlike any other: a Clockwork Garden, where time itself seemed to dance and swirl around. In this fantastical land, the trees were made of shimmering gears and the flowers ticked and tocked, blooming in sync with the rhythm of time. Alice was greeted by the Clockwork Rabbit, a dapper creature with a pocket watch embedded in its chest. “Hurry now, Alice, the Timekeeper’s Festival is about to begin,” he urged, leading her through the maze of gears and springs. As they journeyed, they met the Gearwork Caterpillar, lounging atop a mushroom made of cogs, who offered Alice riddles and wisdom about the nature of time. Further along, they encountered the Mad Tinker, a peculiar inventor hosting an endless tea party with robots and automatons, celebrating the moment’s fleeting beauty.
The heart of the Clockwork Garden was ruled by the Tick-Tock Queen, a majestic figure adorned in robes of woven silver threads, who guarded the Great Clock and controlled the flow of time in the garden. However, the Great Clock was malfunctioning, causing time to run amok. With her innate cleverness and the help of her newfound friends, Alice set out to fix the Great Clock. Through a series of adventures, from scaling the Tower of Seconds to navigating the Labyrinth of Hours, Alice learned the intricate dance of time and its precious value.

Powering AutoGPT with Qdrant Vector Database

Qdrant is an innovative vector database designed to handle complex, high-dimensional data efficiently. At its core, Qdrant specializes in storing and indexing vector embeddings, making it an invaluable tool for tasks that require rapid and accurate retrieval of similar items, such as in recommendation systems or image searches. What sets Qdrant apart is its user-friendly API, which simplifies the process of integrating advanced search capabilities into various applications. Furthermore, Qdrant offers a unique feature that allows for filtering results based on additional metadata, enhancing the relevance and precision of search outcomes. As an open-source alternative to other vector databases like Pinecone, Qdrant is not only accessible but also stands at the forefront of technology, offering state-of-the-art speed in nearest-neighbor searches. Its approach to handling vector data and its commitment to continuous improvement make it a standout choice for developers and organizations dealing with complex data landscapes.

Creating Collections in AutoGPT-DA Cluster

Now, let’s start building a cloud vector database as a cluster on the Qdrant cloud. First, we will create a free-tier cloud cluster for our experimental purposes. You can adjust the configuration for specific use cases and requirements.

If you are new to the Qdrant Cloud Database, check my previous post about setting up the Qdrant Cloud cluster and monitoring the vector database using the Thunder-HTTP client.

Let’s check the status of the AutoGPT-DA cluster using the Python Qdrant client. First, we need to install the following dependencies and export the necessary API keys to the environment:

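!pip install qdrant-client
import os
# A shell "!export" does not persist between Colab cells, so we set the keys from Python.
os.environ["OPENAI_API_KEY"] = "sk-SpCU2Iz2aoBEKS7F5QzDT3BlbkxxxxxxxxxxuyPX1hRCklJy"  # Your API Key
os.environ["Qdrant_API_KEY"] = "ZcnKdbf9617SH5sy-wklOxxxxxxxxxvs3vsPeSo0_Zv3cOjQbg"  # Your API Key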

Now we will import the qdrant python client and create our collection to store the vector embeddings as datapoints in the collection.

import os
from qdrant_client import QdrantClient
from qdrant_client.http import models
from qdrant_client.models import Distance, VectorParams

qdrant_client = QdrantClient(
   url="https://9444ba5f-a4a1-xxxx-xxxx-b24c6d459624.us-east4-0.gcp.cloud.qdrant.io",
   api_key=os.getenv("Qdrant_API_KEY"),
)

vectors_config = models.VectorParams(size=768, distance=models.Distance.COSINE)

qdrant_client.recreate_collection(
   collection_name="autogpt-collection",
   vectors_config=vectors_config,
)

The above Python script initiates a vector configuration for Qdrant, specifying that the vectors should be of size 768 and defining the distance metric for comparisons as the cosine distance. This configuration sets up the characteristics of vectors that will be used within the Qdrant system, determining their size and the specific metric used to measure the distance or similarity between vectors.
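To make the cosine metric concrete, here is a minimal pure-Python sketch of cosine similarity between two vectors. It is only an illustration of the formula; the function name is hypothetical, and Qdrant computes the metric internally with its own optimized implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
```

Because the score depends only on direction, vectors of different magnitudes that point the same way are treated as identical, which is why the dimension mismatch check matters more than vector length.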

Now that we have created our collection, we need some embeddings to store in our vector database.

Using the “get_collections()” method of the Qdrant client, we can list the collections in the cluster and confirm that ours was created.

Creating a Vector Database for Tabular Dataset

Now that we have created our collection, we will have to append the data points in the form of vectorized embeddings as per the vector configuration. In this article, I will pick a finance dataset related to stock markets from Kaggle. Let’s have a look at the initial format of our dataset.

Finance dataset: It contains 24 feature columns related to stock markets

import pandas as pd
import openai


file_path = "/content/Finance_data.csv"
df = pd.read_csv(file_path)
selected_columns = ['gender', 'age', 'Investment_Avenues', 'Expect', 'Avenue', 'Reason_Equity', 'Reason_Mutual', 'Reason_Bonds', 'Reason_FD', 'Source']
df['concatenated_text'] = df[selected_columns].astype(str).agg(' '.join, axis=1)
openai.api_key = "sk-SpCUxxxxxxxxJCw44h5sOuyPX1hRCklJy" 

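def get_embedding(text):
    """Return the OpenAI embedding for `text`, or None if the request fails."""
    try:
        # Legacy (pre-1.0) openai-python interface.
        response = openai.Embedding.create(input=text, engine="text-similarity-babbage-001")
        return response['data'][0]['embedding']
    except Exception as e:
        print(f"Error in getting embedding: {e}")
        return None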

df['content_vector'] = df['concatenated_text'].apply(get_embedding)
final_df = pd.DataFrame({
    'title_vector': list(df.index),
    'content_vector': df['content_vector']
})
output_file = 'title_content_embeddings.csv' 
final_df.to_csv(output_file, index=False)

The above Python script creates a vectorized dataset containing two columns “title_vector” which will store the indices of the rows, and “content_vector” which will store the OpenAI embeddings for the rows.
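Since get_embedding returns None when a request fails, it can help to check the embeddings before indexing them. The sketch below is one possible check, assuming the embeddings are held in a plain list; the helper name split_valid_embeddings is hypothetical.

```python
def split_valid_embeddings(vectors):
    """Return (kept_indices, dimension), skipping rows whose embedding is None."""
    kept = [i for i, vector in enumerate(vectors) if vector is not None]
    dims = {len(vectors[i]) for i in kept}
    if len(dims) > 1:
        # All vectors in one collection must share the configured size.
        raise ValueError(f"inconsistent embedding sizes: {sorted(dims)}")
    dimension = dims.pop() if dims else 0
    return kept, dimension

# Hypothetical embeddings: the second request failed and returned None.
sample = [[0.1, 0.2, 0.3], None, [0.4, 0.5, 0.6]]
kept, dimension = split_valid_embeddings(sample)
print("rows to index:", kept, "vector size:", dimension)
```

The returned dimension is the value the collection's vector size would need to match.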

Index Data

In Qdrant, a data management system designed for vector search, data is organized into structures called “collections.” Each collection serves as a container for multiple objects, with each object represented by one or more vectors. These vectors are essentially multi-dimensional data points that capture the essence of the object’s characteristics in a numerical form. Additionally, objects can be accompanied by “payloads,” which are metadata providing extra contextual information about the object.

In our case, we have established a collection named ‘autogpt-collection’ within Qdrant. Each object within this collection is characterized by two different types of vectors: one representing the “title” and the other representing the “content.” This dual-vector approach allows for a more nuanced and detailed representation of each object, enhancing the accuracy and relevance of search results within the collection.
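As a rough picture of what one such object looks like, the sketch below builds a point with an id, two named vectors, and a payload, and prints it as JSON. The tiny two-dimensional vectors, the payload values, and the helper name build_point are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import json

def build_point(point_id, title_vector, content_vector, payload):
    """Assemble one point: an id, named vectors, and a metadata payload."""
    return {
        "id": point_id,
        "vector": {"title": title_vector, "content": content_vector},
        "payload": payload,
    }

# Hypothetical values, shortened to two dimensions for readability.
point = build_point(0, [0.1, 0.2], [0.3, 0.4], {"gender": "Female", "age": 34})
print(json.dumps({"points": [point]}, indent=2))
```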

from qdrant_client.http import models as rest

vector_size = len(final_df["content_vector"][0])

qdrant_client.recreate_collection(
   collection_name="autogpt-collection",
   vectors_config={
       "title": rest.VectorParams(
           distance=rest.Distance.COSINE,
           size=vector_size,
       ),
       "content": rest.VectorParams(
           distance=rest.Distance.COSINE,
           size=vector_size,
       ),
   }
)

Next, we will upsert the vector payload into our collection using the following script:

qdrant_client.upsert(
   collection_name="autogpt-collection",
   points=[
       rest.PointStruct(
           id=k,
           vector={
               "title": v["title_vector"],
               "content": v["content_vector"],
           },
           payload=v.to_dict(),
       )
       for k, v in final_df.iterrows()
   ],
)
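For larger datasets, sending every point in a single request can become unwieldy, so a common approach is to upload in batches. The sketch below shows only the batching logic in plain Python; the helper name batched and the batch size are assumptions, and each yielded batch would be passed as the points argument of a separate upsert call.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical point ids standing in for PointStruct objects.
for batch in batched(range(5), 2):
    print("would upsert", len(batch), "points:", batch)
```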

Now that we have upserted all of our vector embeddings as points, we can move on to integrate the vector database with AutoGPT.

Enhancing AutoGPT with Qdrant Vector Database

We can easily integrate Qdrant with AutoGPT by updating the “env.txt” file with our keys from OpenAI and Qdrant, and then updating the environment with:

!cp env.txt .env

Now we will use the AutoGPT integration from Langchain and use the vector store as the agent’s memory. First, let’s install some required dependencies:

!pip install langchain google-search-results openai tiktoken

Langchain-Qdrant Integration

We currently have a vector store set up in the cloud, which can be accessed by any of our applications. With the appropriate access credentials, we can easily connect to this store without having to regenerate the embeddings each time. Our next step is to integrate this vector store with our application, enabling AutoGPT to use it for query processing in our question-answering tasks.

from langchain.embeddings import OpenAIEmbeddings
from langchain.docstore import InMemoryDocstore
from langchain.vectorstores import Qdrant
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

embeddings_model = OpenAIEmbeddings()
text = "Which investment avenue do most respondents prefer?"  # example query (placeholder)
embedded_query = embeddings_model.embed_query(text)

# Reuse the Qdrant client created earlier and the same embedding model.
vector_store = Qdrant(
  client=qdrant_client,
  collection_name="autogpt-collection",
  embeddings=embeddings_model,
)

Langchain-SerpApi Integration

We will use SerpAPI for search engine result page (SERP) queries. Let’s set up the tools for the agent using Langchain as follows:
SerpAPIWrapper is initialized as a search tool, wrapping an API for search engine result page (SERP) queries.

The tools list is created, consisting of multiple Tool instances, each designed for a specific functionality.
A search tool named “search” is added, tailored for answering questions about current events through targeted queries.
Additionally, tools for writing to (WriteFileTool()) and reading from (ReadFileTool()) files are included in the tools list, enhancing the agent's file management capabilities.

import os
os.environ['SERPAPI_API_KEY'] = "b5eafbade1f9a4423fxxxxxxxxxx006ace4f1c9c408f1f3f22f5705513e186050"
os.environ['OPENAI_API_KEY'] = "sk-SpCU2Iz2aoBEKS7F5QzDT3BlbkxxxxxxxxxxuyPX1hRCklJy"

from langchain.utilities import SerpAPIWrapper
from langchain.agents import Tool
from langchain.tools.file_management.write import WriteFileTool
from langchain.tools.file_management.read import ReadFileTool

search = SerpAPIWrapper()
tools = [
    Tool(
        name = "search",
        func=search.run,
        description="useful for when you need to answer questions about current events. You should ask targeted questions"
    ),
    WriteFileTool(),
    ReadFileTool(),
]

Langchain-AutoGPT Integration

We will be using the ChatOpenAI model and initialize an AutoGPT agent with specific configurations:

from langchain.experimental import AutoGPT
from langchain.chat_models import ChatOpenAI
agent = AutoGPT.from_llm_and_tools(
    ai_name="AutoEDA",
    ai_role="Analyst",
    tools=tools,
    llm=ChatOpenAI(temperature=0),
    memory=vector_store.as_retriever()
)
# Set verbose to be true
agent.chain.verbose = True

agent = AutoGPT.from_llm_and_tools(...) creates an instance of the AutoGPT agent, configuring it with a set of tools and a language model.
The agent is named “AutoEDA” and assigned the role of “Analyst”, indicating its purpose or functionality.
The language model used is ChatOpenAI with a temperature setting of 0, which controls the randomness of the model's responses.
The agent’s memory is linked to a vector store (vector_store.as_retriever()), enabling it to retrieve information from this store.

Additionally, the line agent.chain.verbose = True sets the agent's verbose mode to true, which enables detailed output of its operations.

Hurray! We have successfully developed an AutoGPT agent that can understand large raw datasets for question-answering tasks. I hope this journey has been enlightening, particularly in understanding vector databases, LangChain, and OpenAI. Keep an eye out for more exciting blog posts.

Conclusion

In this journey through AutoGPT and Qdrant, I’ve explored how these innovative tools can transform data analysis into an intuitive, efficient process. AutoGPT, with its autonomous task execution, pairs seamlessly with the Qdrant vector database, enabling effective handling of complex, high-dimensional data. This combination simplifies tasks such as exploratory data analysis, ensuring responses are not only data-driven but also contextually sound. My exploration into their integration and application in real-world scenarios highlights their potential in modern data ecosystems, offering a glimpse into the future of automated data processing and insight extraction.

Follow me on Twitter: @sidgraph

(Note: This blogpost is in collaboration with Superteams.ai.)

Chat with Large CSV Data Using Qdrant, Langchain, and OpenAI

Siddhant saxena — Thu, 21 Dec 2023 14:42:51 +0000

Today, chatbots are at the forefront of every organization. Due to the exponential increase in industry-scale Large Language Models (LLMs), chatbots have evolved rapidly. AI agents like ChatGPT, which are built on LLM-based models, excel at answering questions on a wide variety of tasks. However, they still struggle with analyzing large data points. In today’s data-centric society, almost all firms and individuals rely on the analysis of huge datasets to extract insightful information.
In this article, we will develop a chatbot-like system designed to interact with large CSV files. Our exploration will include an impressive tech stack that incorporates a vector database, Langchain, and OpenAI models. Langchain, with its ability to seamlessly integrate information retrieval and support third-party LLMs and Vector DBs, provides a potent conversational interface for querying information from CSV databases. This chat interface allows for the uploading of any CSV data, enabling analysts to pose questions in a human-readable format and receive answers. While we use a sales record as an example here, the system is compatible with any CSV-formatted data. This approach can significantly save time for data analysts when analyzing data. Moreover, it opens up possibilities for extracting further information from raw data by facilitating dialogues with the CSV content.

Before we delve into the use of the OpenAI API and Langchain’s retrieval API, let’s take a moment to explore Qdrant, our chosen vector database. Qdrant is an open-source alternative to Pinecone and offers a complimentary service for testing some of our model deployments.

Qdrant Vector Database: A High-Performance Vector Similarity Search Technology

Vector databases have recently gained significant popularity. They store and index vector embeddings to enable fast retrieval and similarity search. Qdrant provides an API service for finding the closest high-dimensional vectors. As an open-source alternative to Pinecone, it offers an easy-to-use API and nearest-neighbor search with state-of-the-art speed. Unlike Elasticsearch’s post-filtering, Qdrant supports filtering results based on the additional payload associated with vectors.
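To see why the filtering strategy matters, consider this small brute-force sketch. It contrasts filtering after taking the top results with filtering the candidates before ranking. It is a toy illustration, not how Qdrant or Elasticsearch work internally; the points, the "lang" payload field, and the function names are all hypothetical.

```python
def score(query, vector):
    """Dot-product score; higher means more similar."""
    return sum(q * v for q, v in zip(query, vector))

POINTS = [
    {"id": 1, "vector": [0.9, 0.1], "payload": {"lang": "en"}},
    {"id": 2, "vector": [0.8, 0.2], "payload": {"lang": "de"}},
    {"id": 3, "vector": [0.7, 0.3], "payload": {"lang": "de"}},
    {"id": 4, "vector": [0.1, 0.9], "payload": {"lang": "en"}},
]

def top_k(points, query, k):
    """Return the k highest-scoring points."""
    return sorted(points, key=lambda p: score(query, p["vector"]), reverse=True)[:k]

def post_filter(points, query, k, key, value):
    """Rank first, then drop results whose payload does not match."""
    return [p for p in top_k(points, query, k) if p["payload"].get(key) == value]

def filtered_search(points, query, k, key, value):
    """Keep only matching points, then rank them."""
    candidates = [p for p in points if p["payload"].get(key) == value]
    return top_k(candidates, query, k)

query = [1.0, 0.0]
print([p["id"] for p in post_filter(POINTS, query, 2, "lang", "en")])
print([p["id"] for p in filtered_search(POINTS, query, 2, "lang", "en")])
```

With post-filtering, the request for two results returns only one matching point, while filtering before ranking returns both matching points.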
Now, let’s start building a cloud vector database as a cluster on the Qdrant cloud. First, we will create a free-tier cloud cluster for our experimental purposes. You can adjust the configuration for specific use cases and requirements as described below:

Once the status of the cluster is flagged ‘healthy 1/1’, we can create an API key to access the cluster within the application.

Now you can copy the API key from the overview page (save it!), and we are ready to get started.

We have a URL that we are going to use to send requests and interact with the cluster:

https://eeaa3ee2-e210-4aa4-a0aa-e0e471b2b7ff.us-east4-0.gcp.cloud.qdrant.io:6333

In this article, we will use Thunder Client, which is an HTTP client for VS Code. Thunder is a lightweight and easy-to-use Rest Client for Testing APIs; it is available as a flexible extension in VS Code.

In Qdrant, we recently established a Cluster. A Cluster can encompass several Collections, which can be viewed as databases. Each Collection may contain one or more points, where each point represents a Vector. A Vector is essentially a numerical representation of our text. When we embed our text and send it to a database, it stores the text Vector as a point within the specified Collection. To determine the number of collections in the cluster, we can use a GET request via Thunder. We will execute this using the following GET request.

GET https://eeaa3ee2-e210-4aa4-a0aa-e0e471b2b7ff.us-east4-0.gcp.cloud.qdrant.io:6333

Once we add the API key to the HTTP Headers, we can issue a GET request to verify the status of our Qdrant client. By appending the collections endpoint, we can further extend this functionality.
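The same request can also be prepared from Python. The sketch below only builds the request object with the standard library and does not send it. The cluster URL and key are placeholders, the helper name is hypothetical, and the "api-key" header name is an assumption based on how Qdrant Cloud keys are typically passed.

```python
import urllib.request

def build_collections_request(base_url, api_key):
    """Build a GET request for the collections endpoint with the API key header."""
    url = base_url.rstrip("/") + "/collections"
    return urllib.request.Request(url, headers={"api-key": api_key}, method="GET")

# Placeholder URL and key; replace them with your own cluster details.
request = build_collections_request(
    "https://example-cluster.cloud.qdrant.io:6333", "YOUR_API_KEY"
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request).read()
```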

Now, let’s delve into the process of creating a client in Python to interact with our host. This includes creating a collection and adding points, or vectors, to that collection in our cluster. Additionally, if you’re interested in learning how to monitor the collection using the Thunder client, I recommend referring to my previous article on the topic here.

Installing Dependencies

We will need the following dependencies for the project. We install them all in one go so that we can use their classes later on.

!pip install -q -U torch
!pip install -q -U kaggle
!pip install -q -U wget
!pip install -q -U openai
!pip install -q -U langchain
!pip install qdrant-client

Importing the Dependencies

import os
import wget
from langchain.vectorstores import Qdrant
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
import torch
from qdrant_client import QdrantClient
from qdrant_client.http import models
import getpass

device = 'cuda' if torch.cuda.is_available() else 'cpu'

Set-Up API Keys and Qdrant Client

After importing the necessary libraries, let’s get an OpenAI key from here. Let’s prepare our OpenAI API key:

os.environ["OPENAI_API_KEY"] = "your API Key"

Now let’s first create a Qdrant client object, which will allow us to connect to our cluster and create collections.

from qdrant_client import QdrantClient

qdrant_client = QdrantClient(
    url="https://eeaa3ee2-e210-4aa4-a0aa-e0e471b2b7ff.us-east4-0.gcp.cloud.qdrant.io:6333",
    api_key="tDQpm--EuWtRKV7I0B_xH0jKhtmgltBIiOlG_bDW5LBeN2rnxleHVQ",
)

Using get_collections() method we can see existing collections in our Cluster/Qdrant client.

As we do not have any collections, this outputs an empty list of collections. So, right away, let’s create a new collection named ‘csv-collection’.

os.environ["QDRANT_COLLECTION_NAME"] = "csv-collection"

vectors_config = models.VectorParams(size=768, distance=models.Distance.COSINE)

qdrant_client.create_collection(
   collection_name=os.getenv("QDRANT_COLLECTION_NAME"),
   vectors_config= vectors_config,
)

The above Python script initiates a vector configuration for Qdrant, specifying that the vectors should be of a size 768 and defining the distance metric for comparisons as the cosine distance.

This configuration sets up the characteristics of vectors that will be used within the Qdrant system, determining their size and the specific metric used to measure the distance or similarity between vectors.

Now that we have created our collection, we need some embeddings to store in our vector database. Later on, we will use the Langchain integrations that are provided in the Qdrant documentation to retrieve the embeddings from the cloud vector store.

Looking at the Data: The Wikipedia Articles

In this section, we will load the data from OpenAI API examples that provide pre-computed embeddings of Wikipedia articles. You can download and extract the data files using the below script:

import wget
import zipfile

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

wget.download(embeddings_url)

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("../data")


Let’s have a look at the provided CSV file using Pandas. 


import pandas as pd
from ast import literal_eval

article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')
# Read vectors from strings back into a list
article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
article_df["content_vector"]=article_df.content_vector.apply(literal_eval)
article_df.head()

Index Data
As previously mentioned, Qdrant organizes data into collections, with each object being characterized by at least one vector and potentially additional metadata, known as payload. We have set up our collection under the name ‘csv-collection’, where each object is represented by vectors from both the title and the content. Our approach involves populating the collection with points without predefining any schema.

from qdrant_client.http import models as rest

vector_size = len(article_df["content_vector"][0])

qdrant_client.recreate_collection(
    collection_name="csv-collection",
    vectors_config={
        "title": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
        "content": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
    }
)

Next we will upsert the vector payload into our collection using the following script:

qdrant_client.upsert(
    collection_name="csv-collection",
    points=[
        rest.PointStruct(
            id=k,
            vector={
                "title": v["title_vector"],
                "content": v["content_vector"],
            },
            payload=v.to_dict(),
        )
        for k, v in article_df.iterrows()
    ],
)

Now, as our collection has the embeddings and the client, we will integrate the Qdrant class derived from Langchain vector stores. To begin, we’ll use the Thunder client to check the status of our collection, confirming that the embeddings have been successfully added.

Langchain-Qdrant Integration

We will create a vector store object using the Qdrant class from Langchain. This process requires an embedding model. In this case, we will utilize the OpenAI Embeddings model, which is designed for text-to-embedding generation with a dimension of 1536.

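from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()
text = "Famous battles in Scottish history"  # example query
embedded_query = embeddings_model.embed_query(text)

# Reuse the Qdrant client created earlier and the same embedding model.
vector_store = Qdrant(
   client=qdrant_client,
   collection_name="csv-collection",
   embeddings=embeddings_model,
)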

We now possess a persistent vector store in the cloud, accessible from any application at our disposal. As long as we maintain the necessary credentials, there’s no need to recreate the embeddings. Let’s proceed by connecting this vector store to our application and begin using it for querying in the QA task.

Chatting on CSV Data with OpenAI

First, we will establish a QA chain using Langchain. This QA chain will be designed to retrieve information from our vector database and feed it into a language model, enabling us to chat with that information.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

llm=OpenAI()
qa = RetrievalQA.from_chain_type(
   llm=llm,
   chain_type="stuff",
   retriever=vector_store.as_retriever())
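The “stuff” chain type places all retrieved documents into a single prompt before calling the language model. The sketch below shows that idea in plain Python. The prompt wording and the helper name build_stuff_prompt are assumptions for illustration and are not Langchain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Join all retrieved documents into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved snippets standing in for vector store results.
docs = [
    "The Battle of Bannockburn was fought in 1314.",
    "The Wars of Scottish Independence began in 1296.",
]
print(build_stuff_prompt("Which battles are famous in Scottish history?", docs))
```

Because every document is included in full, this approach suits a small number of short documents; longer result sets can exceed the model's context window.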

Now that we have our chain established, we can use it to query the vector database. Let’s test this on a few of the questions and responses the QA chain provides.

query = "Provide top-ranked articles with modern art in Europe"
response = qa({"query": query})
print(response['result'])

Museum of Modern Art

Western Europe

Renaissance art

query = "Find articles related to Famous battles in Scottish history"
response = qa({"query": query})
print(response['result'])

Sure, Here are related articles to Famous battles in Scottish history Battle of Bannockburn, Wars of Scottish Independence, First War of Scottish Independence.

Hurray! We have successfully developed a chatbot capable of processing large CSV datasets for question-answering tasks. I hope this journey has been enlightening, particularly in understanding vector databases, LangChain, and OpenAI. Keep an eye out for more exciting blog posts.

Follow me on Twitter: @sidgraph

(Note: This blogpost is in collaboration with Superteams.ai.)