What is a Vector Database?
Complex data is growing at breakneck speed. By complex data I mean unstructured data such as documents, images, videos, and plain text on the web. Many organizations would benefit from storing and analyzing complex data, but it is difficult to handle in traditional databases, which were built with structured data in mind. Classifying complex data with keywords and metadata alone may not be enough to represent all of its characteristics.
Fortunately, Machine Learning (ML) techniques can offer a far more helpful representation of complex data by transforming it into vector embeddings. Vector embeddings describe complex data objects as numeric values in hundreds or thousands of different dimensions.
Vector databases are purpose-built to handle the unique structure of vector embeddings. They index vectors for easy search and retrieval by comparing values and finding those that are most similar to one another.
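To make "finding those that are most similar" concrete, here is a minimal sketch of a similarity search. It assumes cosine similarity as the metric and uses tiny hand-made three-dimensional vectors in place of real embeddings; the names `cosine_similarity`, `top_k`, and `TOY_VECTORS` are made up for this illustration. It also scans every vector by brute force, whereas a real vector database would normally use an index to avoid that.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Toy "embeddings" (assumption: real ones have hundreds or thousands of dimensions).
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print(top_k([1.0, 0.0, 0.0], TOY_VECTORS, k=2))
```

The query vector points in roughly the same direction as "cat" and "kitten", so those two rank above "car".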
With the rise of machine learning, vector search has become more powerful than structured data search
With the release of the OpenAI API, tools such as LangChain, and vector database services such as Pinecone, vector search has become more accessible than ever before.
Over the past few months I have seen a lot of people start working with unstructured data and experimenting a lot. So what about structured data?
The approach offered by LLM toolchains such as LangChain appears very simple: have the language model generate SQL queries. This looks amazing at first glance, but when you actually try it, it is like swapping a wired game controller for a wireless one: the way you send SQL changes, but the picture on the screen does not change at all. In other words, while vector search seems to return results in a fairly natural way, relational database search seems to have been left behind by the times.
Some people will no doubt argue that relational databases are structured for rigorous, exact searches, so it is only to be expected that they do not respond well to natural language.
Is this really the case?
Mainly unstructured documents and the time and person associated with them
I think documents and messages suffer the most from this problem. Documents and messages are strongly unstructured in themselves, but they always carry information about when they were updated and who wrote them, so both the unstructured and the structured perspective are necessary. I think this is related to the fact that ChatGPT confuses old and new information in some cases.
Of course, this is less of a problem when the document itself is likely to contain date and author information, as on the web. The problem lies with internal documents and internal chats that are managed entirely in RDBs.
Proposed Solution
I have decided to take the following tentative step to address this document chat issue: embed the relevant structured information in the text to be vectorized.
Title: {{title}}
Author: {{author}}
UpdatedAt: {{updatedAt}}
Body: {{body}}
In my experiments, this let me get reasonable answers to questions such as who is knowledgeable about a given topic and which piece of information is the correct one.
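As a rough sketch of how the template above could be filled in before vectorization, assume each document arrives as a row from an RDB with `title`, `author`, `updated_at`, and `body` columns. Those column names, the `build_embedding_text` helper, and the sample row are hypothetical and only there for illustration. The resulting string is what would be passed to whichever embedding model you use.

```python
from datetime import datetime

# Same fields as the template above, written as a Python format string.
TEMPLATE = "Title: {title}\nAuthor: {author}\nUpdatedAt: {updated_at}\nBody: {body}"


def build_embedding_text(row):
    """Combine structured columns and the unstructured body into one text."""
    return TEMPLATE.format(
        title=row["title"],
        author=row["author"],
        # ISO 8601 keeps the timestamp unambiguous inside the text.
        updated_at=row["updated_at"].isoformat(),
        body=row["body"],
    )


# Hypothetical row, as it might come back from a SELECT on a documents table.
example_row = {
    "title": "Deployment checklist",
    "author": "Alice",
    "updated_at": datetime(2023, 5, 1, 9, 30),
    "body": "Run the migration before switching traffic.",
}

if __name__ == "__main__":
    print(build_embedding_text(example_row))
```

Because the author and update time are part of the embedded text itself, they are also part of what the language model sees when the document is retrieved.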
This is a suggestion. I would like to hear your opinions.