If you have ever built or used a Retrieval-Augmented Generation (RAG) pipeline, chances are you have felt frustrated waiting for your system to answer a query. For experienced users, it's easy to blame the delay on the type of LLM used, or on something as mundane as the network speed at that moment. Often, however, the problem lies in the vector database that stores the embeddings. You should ask yourself: is the caching technique used in this system efficient?
What are Caching Techniques?
Caching is a technique used widely in computing to temporarily store frequently accessed data so it can be recalled quickly when needed. Every Machine Learning Engineer wants an LLM system in which users never sit through awkward pauses and the backend is never overloaded with requests. To prevent these scenarios, ML Engineers employ various caching techniques such as storage-level caching, embedding caching, query result caching, index caching, and LLM output caching.
- Storage-level caching
In storage-level caching, frequently accessed data, such as vectors and index structures, is kept in faster storage media such as RAM. In a vector database, this data is referred to as "hot" vectors, and it sits closer to the CPU for faster retrieval. "Cold" data, such as the full dataset, lives in deeper storage layers like HDD. Vectors are usually large, so it is important for frequently queried vectors to be cached; otherwise the system has to go through the rigour of searching the whole database on every request.
- Embedding caching
This type of caching is done at the application layer, before the vectors are stored in the database. Frequently computed embeddings are stored so that the same text is not re-embedded every time. This reduces calls to the embedding API, which in turn reduces latency and cost (see the sketch after this list).
- Query result caching
This technique involves storing the results of common queries so that the system does not rerun the vector search every time. In situations where users tend to ask similar questions, this caching technique noticeably reduces latency.
- LLM output caching
Similar to query result caching, this technique applies when users ask similar questions. The full responses generated by the LLM are cached so they can be returned quickly the next time a matching question is asked (also covered in the sketch below).
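To make the application-layer techniques concrete, here is a minimal sketch of an exact-match embedding cache and LLM output cache. The `embed` and `call_llm` functions are hypothetical stand-ins for whatever embedding API and LLM provider your pipeline actually calls.

```python
import hashlib

# Hypothetical stand-ins for the real embedding and LLM calls in your pipeline.
def embed(text: str) -> list[float]:
    raise NotImplementedError("wire this up to your embedding API")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider")

_embedding_cache: dict[str, list[float]] = {}
_answer_cache: dict[str, str] = {}

def _key(text: str) -> str:
    # Hash the normalised text so cache keys stay small and consistent.
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

def embed_cached(text: str) -> list[float]:
    """Embedding caching: only call the embedding API for text not seen before."""
    key = _key(text)
    if key not in _embedding_cache:
        _embedding_cache[key] = embed(text)
    return _embedding_cache[key]

def answer_cached(question: str) -> str:
    """LLM output caching: reuse the full response for a repeated question."""
    key = _key(question)
    if key not in _answer_cache:
        _answer_cache[key] = call_llm(question)
    return _answer_cache[key]
```

Note that these are exact-match caches: a paraphrased question produces a different key and misses the cache. Semantic caches go a step further and compare the embedding of a new query against cached queries by similarity.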
Caching Mechanisms of Common Vector Databases
Pinecone
Pinecone is a serverless vector database that runs on the AWS, GCP, and Azure cloud platforms. These platforms enable Pinecone to keep vector data in distributed object storage. Besides allowing virtually unlimited scaling, this distributed object storage serves as the backbone of Pinecone's efficient storage and retrieval mechanism.
Pinecone's caching mechanism starts with organising vector data into immutable files called slabs. These slabs live in object storage and are organised so that smaller slabs use fast, simple indexing techniques while larger slabs use more sophisticated indexes that deliver comparable performance at scale. With this structure, each slab can be queried optimally.
When a user submits a query, the query router determines which slabs contain information relevant to that query. It then routes the query to the appropriate query executors, which search their assigned slabs and return the relevant results to the query router.
To keep retrieval fast, fetched slabs are kept in memory or on a local SSD. If a slab has not been accessed recently and is not already cached, the query executor caches it after retrieving it from object storage. When similar queries arrive later, the executor checks these cached slabs for relevant information before falling back to object storage. With this mechanism, Pinecone gets the economic benefits of a serverless platform while keeping latency to a minimum.
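The following is not Pinecone's actual code, only a simplified illustration of the cache-aside pattern the executors described above follow; the `SlabCache` class and the `object_storage.fetch` call are hypothetical names.

```python
from collections import OrderedDict

class SlabCache:
    """Cache-aside lookup over a slow object-storage tier (illustrative only)."""

    def __init__(self, capacity: int, object_storage):
        self.capacity = capacity
        self.object_storage = object_storage  # slow, durable tier
        self._cache: OrderedDict[str, bytes] = OrderedDict()  # fast local tier

    def get(self, slab_id: str) -> bytes:
        if slab_id in self._cache:
            # Cache hit: serve from memory/local SSD and mark as recently used.
            self._cache.move_to_end(slab_id)
            return self._cache[slab_id]
        # Cache miss: fetch from object storage, then cache it for later queries.
        slab = self.object_storage.fetch(slab_id)
        self._cache[slab_id] = slab
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used slab
        return slab
```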
Weaviate
The fundamental idea behind Weaviate's caching mechanism is to hold every imported vector in in-memory storage, otherwise known as its vector cache. The size of the vector cache is controlled by the vectorCacheMaxObjects parameter, which defaults to one trillion objects whenever a new collection is created.
Weaviate uses a Hierarchical Navigable Small World (HNSW) index, an algorithm that builds a multi-layer network of vectors to perform fast and accurate approximate nearest neighbour (ANN) search. The HNSW index is kept in memory alongside the vectors and is largely responsible for Weaviate's search speed.
In Weaviate, vectors read from disk for the first time are added to the cache. If a similar query comes in, it runs against the vectors already in the cache; otherwise, it runs against the vectors on disk. New vectors keep being added from disk to the cache until the cache fills up. When the cache is full, Weaviate drops the whole cache and the process starts afresh. Weaviate also supports techniques like Product Quantization (PQ) to shrink vectors so that more of them fit in memory. However, this approach is not always advisable; users who wish to manage in-memory space effectively are encouraged to use models that generate lower-dimensional vectors.
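As a rough sketch, the cache size can be tuned when a collection is created. The snippet below assumes the v4 Weaviate Python client and a locally running instance; the exact parameter and helper names reflect my understanding and may differ between client versions.

```python
import weaviate
from weaviate.classes.config import Configure

# Assumes a Weaviate instance running locally on the default ports.
client = weaviate.connect_to_local()

# Lower vectorCacheMaxObjects from its default (one trillion) so the vector
# cache holds at most one million objects for this collection.
client.collections.create(
    name="Articles",
    vector_index_config=Configure.VectorIndex.hnsw(
        vector_cache_max_objects=1_000_000,
    ),
)

client.close()
```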
ChromaDB
ChromaDB relies on disk storage for saving vector data and its metadata. After a document has been embedded and written to disk, subsequent queries run against that on-disk data. However, ChromaDB is flexible about its storage mode.
For persistent storage, all data about collections, databases, tenants, and documents is stored in a single SQLite database. This pre-computed data is reloaded from disk whenever a query needs to run. Like Weaviate, ChromaDB uses an HNSW index to search the data points in a collection efficiently; the index is built and persisted on disk to avoid re-indexing every time the database is queried.
The other option available to users is in-memory storage, which is mostly used for testing and prototyping. Vectors live primarily in memory and are not persisted to disk. Even when vectors are persisted to disk, ChromaDB relies on in-memory storage for its vector operations, which means embeddings are loaded from disk into memory to enable faster operations.
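A minimal sketch of both modes with the `chromadb` Python package is shown below; it assumes the package's default embedding function is available for embedding the documents.

```python
import chromadb

# In-memory client: everything lives in RAM and nothing survives a restart.
# Useful for tests and prototypes.
ephemeral = chromadb.Client()

# Persistent client: collections, metadata, and the HNSW index are stored
# on disk (SQLite plus index files) and reloaded on the next run.
persistent = chromadb.PersistentClient(path="./chroma_data")

collection = persistent.get_or_create_collection(name="docs")
collection.add(
    ids=["doc-1"],
    documents=["Caching keeps frequently accessed vectors close to the CPU."],
)

# Querying loads the persisted data back into memory for the actual search.
results = collection.query(query_texts=["What does caching do?"], n_results=1)
print(results["documents"])
```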