txtai is an all-in-one AI framework for semantic search, LLM orchestration and language model workflows.
The primary interface for building vector databases with txtai is the Embeddings class. txtai also supports accessing all of its features through lower-level APIs.
Let's dive in.
Install dependencies
Install txtai and all dependencies.
pip install txtai[ann]
Load a dataset
We'll use a subset of the FineFineWeb dataset. This dataset is a domain-labeled version of the general purpose FineWeb dataset.
from datasets import load_dataset
ds = load_dataset("m-a-p/FineFineWeb-test", split="train")
Building an Embeddings database
Before going into the low-level API, let's recap how we build an Embeddings database.
from txtai import Embeddings
embeddings = Embeddings()
embeddings.index(ds["text"][:10000])
for uid, score in embeddings.search("nasa", 1):
    print(score, ds[uid]["text"][:100])
0.6012564897537231 The National Aeronautics and Space Administration (NASA) is the United States’ civil space program.
This simple example abstracts the heavy lifting behind the Embeddings interface. Behind the scenes, it defaults to vectorizing text using all-MiniLM-L6-v2. Vectors are stored in a Faiss index.
The first 10K records are vectorized and stored in the vector index. Then at query time, the query is vectorized and a vector similarity search is run.
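To make the query-time step concrete, here is a minimal brute-force sketch of vector similarity search in plain Python. It is not how txtai or Faiss implement search; the helper names (normalize, bruteforce_search) and the 3-dimensional toy vectors are made up for illustration.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def bruteforce_search(index, query, limit=1):
    """Score every stored vector against the query and return the top (id, score) pairs."""
    query = normalize(query)
    scores = [(uid, sum(a * b for a, b in zip(row, query))) for uid, row in enumerate(index)]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:limit]

# Hypothetical 3-dimensional "embeddings"; all-MiniLM-L6-v2 produces 384 dimensions
index = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.1])]

# The query is "vectorized" by hand here; the closest stored vector is row 0
results = bruteforce_search(index, [0.9, 0.1, 0.0], 1)
print(results)
```

An ANN backend plays the same role as bruteforce_search, just with data structures built to handle millions of rows.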
While the Embeddings interface is convenient, it's also possible to access lower-level APIs.
Vectors Interface
First, let's vectorize our data using the low-level APIs. We'll use the default Hugging Face vectorizer available in txtai.
from txtai.ann import ANNFactory
from txtai.vectors import VectorsFactory
vectors = VectorsFactory.create({"path": "sentence-transformers/all-MiniLM-L6-v2"})
data = vectors.vectorize(ds["text"])
ANN Interface
Now that we have a NumPy array of vectors, let's store them in an Approximate Nearest Neighbor (ANN) backend. Recall that earlier we used the default Faiss backend. For this example, we're going to use the PyTorch ANN. This will allow us to use new features that are available as of txtai 9.1.
ann = ANNFactory.create({
    "backend": "torch",
    "torch": {
        "safetensors": True,
    }
})
ann.index(data)
ann.save("vectors.safetensors")
This ANN builds a Torch tensor with the vectors and stores them in a Safetensors file.
The code below shows how the file is simply a standard Safetensors file.
from safetensors import safe_open
def tensorinfo():
    memory = 0
    with safe_open("vectors.safetensors", framework="np") as f:
        for key in f.keys():
            array = f.get_tensor(key)
            print(key, array.shape)
            memory += array.nbytes
    print(f"Memory = {memory / 1024 / 1024:.2f} MB")
tensorinfo()
data (1411868, 384)
Memory = 2068.17 MB
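As a quick sanity check, the reported size follows directly from the tensor shape, assuming each value is stored as a 32-bit (4-byte) float:

```python
rows, dimensions = 1411868, 384  # shape printed by tensorinfo() above
bytes_per_float32 = 4            # assumption: vectors are stored as float32

size_mb = rows * dimensions * bytes_per_float32 / 1024 / 1024
print(f"Memory = {size_mb:.2f} MB")
```

This reproduces the 2068.17 MB figure, so the file holds essentially nothing beyond the raw vectors.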
Vector search
Now let's show how these low-level APIs can be used to implement vector search.
import textwrap
def search(text):
    result = ann.search(vectors.vectorize([text]), 1)
    index, score = result[0][0]
    print(textwrap.fill(ds[index]["text"], width=150), "\n\n", score)
search("How far is earth from mars?")
The answer to your question, that how many miles is it from earth to mars, is very easy to know. Because of huge satellites which are being sent to
mars in search of life from many countries, we have discovered a lot about mars. According to experts, earth and mars reaches to their closest points
in every 26 months. This situation is considered as opposition of mars as the location of sun and mars in totally opposite to each other in relation
to earth. When this opposition takes place, the planet is visible with a red tint in the sky from earth. And this also gives mars a name, i.e. the red
planet. Mars is also the fourth planet from sun, which is located between Jupiter and earth. Its distance from sun is not only opposite but is also
much further away, than that of the earth and sun. The distance between the sun and mars is said to be 140 million miles. Mars can reach about 128
million miles closer to the sun whereas it can even travel around 154 million miles away from it. The assumed distance between mars and earth is said
to be between 40 to 225 million miles. The distance between these two planets keeps on changing throughout the year because of the elliptical path in
which all the planets rotate. As the distance between mars, sun and earth is so much high, it takes a Martian year, for mars to go around the sun. The
Martian period includes a time of around 687 earth days. This means that, it takes more than 2 years for the mars to reach its initial rotation point.
If we talk about one Martian day, it is the total time which is taken by a planet to spin around once. This day usually lasts longer than our regular
earth days. So this was the actual reason which states the distance between earth and mars.
0.7060051560401917
Torch 4-bit quantization
txtai 9.1 adds a new feature: 4-bit vector quantization. This means that instead of using a 32-bit float for each vector dimension, this method uses 4 bits. The packed values take 4/32 = 12.5% of the original size, plus a small overhead for the metadata needed to decode them.
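To give a feel for what 4-bit quantization does, here is a simplified blockwise sketch in plain Python. It is not txtai's implementation: the block size of 64, the evenly spaced codebook (real NF4 uses 16 levels tuned for normally distributed values) and the function names are all assumptions made for illustration.

```python
import math

BLOCK = 64  # values per block (assumed for this sketch)

# Simplified codebook: 16 evenly spaced levels in [-1, 1], one per 4-bit code
CODEBOOK = [-1 + 2 * i / 15 for i in range(16)]

def quantize(values):
    """Quantize floats to 4-bit codes, two codes per byte. Assumes an even number of values."""
    absmax, packed = [], bytearray()
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        # Per-block scale factor: largest magnitude in the block (1.0 if the block is all zeros)
        scale = max(abs(x) for x in block) or 1.0
        absmax.append(scale)
        # Pick the nearest codebook level for each scaled value
        codes = [min(range(16), key=lambda c: abs(CODEBOOK[c] - x / scale)) for x in block]
        for high, low in zip(codes[0::2], codes[1::2]):
            packed.append((high << 4) | low)
    return packed, absmax

def dequantize(packed, absmax):
    """Rebuild approximate floats from packed codes and per-block scale factors."""
    values = []
    for i, byte in enumerate(packed):
        scale = absmax[(2 * i) // BLOCK]
        values.append(CODEBOOK[byte >> 4] * scale)
        values.append(CODEBOOK[byte & 0x0F] * scale)
    return values

# Hypothetical 384-dimensional vector
vector = [math.sin(i) * 0.1 for i in range(384)]
packed, absmax = quantize(vector)
restored = dequantize(packed, absmax)

error = max(abs(a - b) for a, b in zip(vector, restored))
print(len(packed), "bytes of codes,", len(absmax), "scale factors, max error", error)
```

The 384 floats shrink to 192 bytes of codes plus 6 scale factors, and the restored values are close to the originals but not identical, since quantization is lossy.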
ann = ANNFactory.create({
    "backend": "torch",
    "torch": {
        "safetensors": True,
        "quantize": {
            "type": "nf4"
        }
    }
})
ann.index(data)
ann.save("vectors.safetensors")
tensorinfo()
absmax (8471208,)
code (16,)
data (271078656, 1)
shape (2,)
Memory = 290.84 MB
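The printed shapes are consistent with two 4-bit codes packed per byte and one scale factor per block of 64 values. The sketch below checks that arithmetic; the relationships are inferred from the output above rather than from txtai's source, and float32 scale factors are an assumption that happens to reproduce the total.

```python
rows, dimensions = 1411868, 384
values = rows * dimensions

data_bytes = values // 2   # two 4-bit codes per byte, matches data (271078656, 1)
blocks = values // 64      # one scale factor per 64 values, matches absmax (8471208,)
absmax_bytes = blocks * 4  # assumption: scale factors stored as float32

total_mb = (data_bytes + absmax_bytes) / 1024 / 1024
print(data_bytes, blocks, f"{total_mb:.2f} MB")
```

The packed codes account for 12.5% of the original size and the scale factors add roughly another 1.6%. The tiny code and shape tensors are ignored here.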
Note how the unquantized vectors took 2068.17 MB and this only takes 290.84 MB, about 14% of the original! With quantization and ever-growing GPUs, this opens the possibility of pinning your entire vector database in GPU memory!
For example, let's extrapolate this dataset to 100M rows.
(290.84 MB / 1,411,868) * 100,000,000 = 20,599.7 MB
An entire 100M row dataset could fit into a single RTX 3090 / 4090 consumer GPU!
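The same extrapolation as a small script, compared against the 24 GB of memory on an RTX 3090 / 4090. This only accounts for the stored vectors; any working memory needed at query time is not included.

```python
quantized_mb, rows = 290.84, 1411868  # measured above
target_rows = 100_000_000

projected_mb = quantized_mb / rows * target_rows
gpu_mb = 24 * 1024  # RTX 3090 / 4090 ship with 24 GB

print(f"{projected_mb:,.1f} MB ({projected_mb / 1024:.1f} GB)")
print("Fits in 24 GB:", projected_mb < gpu_mb)
```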
Let's confirm search still works the same.
search("How far is earth from mars?")
The answer to your question, that how many miles is it from earth to mars, is very easy to know. Because of huge satellites which are being sent to
mars in search of life from many countries, we have discovered a lot about mars. According to experts, earth and mars reaches to their closest points
in every 26 months. This situation is considered as opposition of mars as the location of sun and mars in totally opposite to each other in relation
to earth. When this opposition takes place, the planet is visible with a red tint in the sky from earth. And this also gives mars a name, i.e. the red
planet. Mars is also the fourth planet from sun, which is located between Jupiter and earth. Its distance from sun is not only opposite but is also
much further away, than that of the earth and sun. The distance between the sun and mars is said to be 140 million miles. Mars can reach about 128
million miles closer to the sun whereas it can even travel around 154 million miles away from it. The assumed distance between mars and earth is said
to be between 40 to 225 million miles. The distance between these two planets keeps on changing throughout the year because of the elliptical path in
which all the planets rotate. As the distance between mars, sun and earth is so much high, it takes a Martian year, for mars to go around the sun. The
Martian period includes a time of around 687 earth days. This means that, it takes more than 2 years for the mars to reach its initial rotation point.
If we talk about one Martian day, it is the total time which is taken by a planet to spin around once. This day usually lasts longer than our regular
earth days. So this was the actual reason which states the distance between earth and mars.
0.6982609033584595
Same result. Note the score is slightly different (0.6983 vs 0.7060), which is expected since 4-bit quantization is lossy.
Wrapping up
While the Embeddings interface is the preferred way to build vector databases with txtai, it's entirely possible to also build with the low-level APIs!