
Theo Vasilis for Apify

Posted on • Originally published at blog.apify.com

What is a vector database?

Vector databases are all the rage these days. The reason is simple: they've rapidly become a popular way to add long-term memory to LLMs such as GPT-4, LaMDA, and LLaMA. Learn how vector databases can store ML embeddings to integrate with tools like ChatGPT.

What are vectors?

In the context of AI and machine learning, particularly large language models, vector databases are really hot right now, and investment in them is booming. But what are they?

Before I answer that, I'd better explain vectors. Thankfully, this part is quite simple. A vector is an array of numbers like this:

[0, 1, 2, 3, 4]

Doesn't seem very impressive, does it? But what's really cool about these numbers is that they can represent more complex objects, such as words, sentences, images, and audio files, in an embedding.

What is an embedding, you ask? In the context of large language models, an embedding represents text as a dense vector of numbers that captures the meaning of words. Embeddings map semantically similar words (or similar features in just about any other data type) close together in vector space. These embeddings can then be used for search engines, recommendation systems, and generative AI tools such as ChatGPT.
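To make this concrete, here's a minimal sketch of how embeddings capture similarity. The three-dimensional vectors below are made up for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings (real models use far more dimensions).
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "cat" and "dog" point in similar directions, so their score is higher
# than "cat" vs. "car" — that's semantic similarity expressed as geometry.
print(cosine_similarity(embeddings["cat"], embeddings["dog"]))
print(cosine_similarity(embeddings["cat"], embeddings["car"]))
```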

Applications of ChatGPT and other LLMs in web scraping

A few use cases for large language models used for web scraping.


The question is, where do you store these embeddings, and how do you query them quickly? Vector databases are the answer. These databases contain arrays of numbers clustered together based on similarity, which can be queried with ultra-low latency. In other words, vector databases index vectors for easy search and retrieval by comparing values and finding those that are most similar to one another. That makes vector databases ideal for AI-driven applications.
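As an illustration of "index vectors and retrieve the most similar," here's a naive in-memory version. This is only a sketch: real vector databases use approximate nearest-neighbor indexes to achieve ultra-low latency at scale, not a brute-force scan like this one.

```python
import math

class NaiveVectorStore:
    """Toy vector store: keeps (id, vector) pairs and returns the most
    similar entries to a query vector by cosine similarity."""

    def __init__(self):
        self.items = []  # list of (item_id, vector) pairs

    def add(self, item_id, vector):
        self.items.append((item_id, vector))

    def query(self, vector, top_k=3):
        def score(v):
            dot = sum(x * y for x, y in zip(vector, v))
            norms = (math.sqrt(sum(x * x for x in vector))
                     * math.sqrt(sum(x * x for x in v)))
            return dot / norms
        # Brute-force ranking; real databases index vectors to avoid this.
        ranked = sorted(self.items, key=lambda item: score(item[1]), reverse=True)
        return ranked[:top_k]

store = NaiveVectorStore()
store.add("doc-1", [0.9, 0.1, 0.0])
store.add("doc-2", [0.8, 0.2, 0.1])
store.add("doc-3", [0.0, 0.1, 0.9])

# doc-1 and doc-2 are closest to the query vector, so they rank highest.
print(store.query([1.0, 0.0, 0.0], top_k=2))
```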

Building functional AI models for web scraping

Why are vector databases important for LLMs?

The main reason vector databases are in vogue is that they can extend large language models with long-term memory. You begin with a general-purpose model, like GPT-4, LLaMA, or LaMDA, but then you provide your own data in a vector database. When a user gives a prompt, you can query relevant documents from your database to update the context, which will customize the final response. What's more, vector databases integrate with tools like LangChain that combine multiple LLMs together.
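That flow, often called retrieval-augmented generation, can be sketched like this. The documents and their two-dimensional "embeddings" below are made up for illustration; a real system would embed the user's prompt with an embedding model and query a vector database instead.

```python
# Hypothetical documents paired with made-up 2-dimensional embeddings.
documents = {
    "Apify is a web scraping and automation platform.": [0.9, 0.1],
    "LangChain combines LLMs with external data sources.": [0.1, 0.9],
}

def most_similar(query_vector, docs, top_k=1):
    """Rank stored documents by dot product with the query vector."""
    scored = sorted(
        docs.items(),
        key=lambda item: sum(q * d for q, d in zip(query_vector, item[1])),
        reverse=True,
    )
    return [text for text, _ in scored[:top_k]]

def build_prompt(user_question, query_vector):
    """Prepend the most relevant stored document to the user's prompt,
    so the LLM answers with your own data as context."""
    context = "\n".join(most_similar(query_vector, documents))
    return f"Context:\n{context}\n\nQuestion: {user_question}"

# A query whose (hypothetical) embedding is close to the Apify document:
print(build_prompt("What is Apify?", [1.0, 0.0]))
```

The augmented prompt then goes to the LLM, which grounds its answer in the retrieved context rather than in its training data alone.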

➡️ What is LangChain?

What are some examples of vector databases?

Here are a few of the top vector databases around, though things are moving so fast in AI that this list could change quickly.

Pinecone

Pinecone is a very popular but closed-source vector database for machine learning applications. Once you have vector embeddings, you can manage and search through them in Pinecone to power semantic search, recommenders, and other applications that rely on relevant information retrieval.

Chroma

Chroma is an AI-native, open-source embedding database that uses ClickHouse under the hood. It's a vector store designed from the ground up to make it easy to build AI applications with embeddings.

Weaviate and Milvus

I've put Weaviate and Milvus together because both are open-source options written in Go. Both allow you to store data objects and vector embeddings generated by machine learning models, and both are built to scale.

➡️ What is data ingestion for large language models?

How to feed your vector database

That's all well and good, but you can't do much with a vector database if you don't have data in the first place, right? So now it's time to present a great tool for feeding your vector databases: Website Content Crawler.

Website Content Crawler (let's just call it WCC for brevity) was specifically designed to extract web data for feeding, fine-tuning, or training large language models. It automatically removes headers, footers, menus, ads, and other noise from web pages to return only the text content that can be directly fed to the models.

WCC has a simple input configuration, which means it can be easily integrated into customer-facing products. Customers can enter just the URL of the website they want to be indexed by LLMs. The results can be retrieved via an API in formats such as JSON or CSV and fed directly into your vector database or language model.
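For instance, assuming the crawler's JSON output contains records with `url` and `text` fields (a simplified shape for illustration, not WCC's exact schema), loading the results ahead of embedding might look like this:

```python
import json

# Simplified example records, standing in for a crawler's JSON output.
raw_results = """
[
  {"url": "https://example.com/docs", "text": "Example page content."},
  {"url": "https://example.com/blog", "text": "Another cleaned-up page."}
]
"""

records = json.loads(raw_results)

# Each record's text would next be passed to an embedding model, and the
# resulting vector stored in the database keyed by the page URL.
for record in records:
    print(record["url"], "->", len(record["text"]), "characters")
```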

You can find out more in the README, which contains examples of how Website Content Crawler works. You can also integrate Website Content Crawler with LangChain.

Fast, reliable data for your AI and machine learning · Apify

Get the data to train ChatGPT API and Large Language Models, fast.


WCC isn't the only tool suitable for LLMs. If you want to see what other GPT and AI-enhanced tools you could use to feed your vector databases, take your pick from Apify Store.

➡️ How to use GPT Scraper to let ChatGPT access the internet
