Tim Kelly for MongoDB

Quantize Your Vectors, Speed Up Your Java AI Applications

Vector quantization is the process of shrinking full-fidelity vectors into fewer bits. By storing this reduced representation instead of the original vector, we cut the memory required per vector, lowering resource consumption and making our application more efficient, at the cost of some recall. MongoDB recommends quantization for applications with large numbers of vectors, i.e., over 100,000.

With MongoDB Atlas Vector Search, we can automatically quantize our float vector embeddings. Atlas Vector Search also supports ingesting and indexing pre-quantized scalar and binary vectors from certain embedding models.

First, we'll take a look into how vector quantization works, and then we'll implement it using the MongoDB Java sync driver. If you want to check out the code for this tutorial, it is all available in this GitHub repository.

Scalar quantization

Scalar quantization first identifies the minimum and maximum values for each dimension of the indexed vectors to establish a range for that dimension. The range is then divided into equally sized bins, and each float is mapped into a bin, converting our continuous float values into discrete integers. In MongoDB Atlas Vector Search, scalar quantization reduces a vector embedding's RAM cost to roughly 27% (1/3.75) of the pre-quantization cost.
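To build intuition for the binning step, here is a toy sketch (my own illustration of the idea, not Atlas's internal implementation): each dimension's observed range is split into 256 bins, and every float is replaced by its bin index, so a 4-byte float collapses into a single byte.

```java
// Toy scalar quantization: map floats in [min, max] onto one of 256 integer bins.
// Illustrative only -- Atlas performs the real binning internally per indexed dimension.
public class ScalarQuantizeSketch {

    // Returns the bin index (0..255) for a value within [min, max]
    static int quantize(double value, double min, double max) {
        double binWidth = (max - min) / 255.0;
        return (int) Math.round((value - min) / binWidth);
    }

    public static void main(String[] args) {
        double min = -1.0, max = 1.0; // observed range for this dimension
        for (double v : new double[]{-1.0, -0.25, 0.6, 1.0}) {
            System.out.println(v + " -> bin " + quantize(v, min, max));
        }
    }
}
```

The bin index is what gets compared at query time; the precision lost inside each bin is the recall trade-off described above.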

Binary quantization

When an embedding is normalized to length 1, such as with OpenAI's text-embedding-3-large, we can use binary quantization. Binary quantization involves assuming a midpoint of 0 for each dimension in the embedding. For each value in the vector, we then assign a binary value of 1 if the value is greater than the midpoint or 0 if the value is less than or equal to the midpoint.

In MongoDB Atlas Vector Search, binary quantization reduces the vector embeddings' RAM usage to just over 4% (1/24) of the pre-quantization cost. The reason it's not 1/32 is that the data structure containing the Hierarchical Navigable Small Worlds graph itself, which is separate from the vector values, isn't compressed.

When you run a query, Atlas Vector Search first converts the query's float values into a binary vector, using the same midpoint as the index, to enable efficient comparisons with the stored binary vectors. After this initial match, it rescores the candidates against the original full-fidelity float values to refine the ranking. Those full-fidelity vectors are kept in a separate on-disk data structure and are accessed only during rescoring when binary quantization is enabled, or when you perform exact search against binary- or scalar-quantized vectors.
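The midpoint rule is simple enough to sketch directly. This toy example (my own illustration, not Atlas's internal code) thresholds each dimension at 0:

```java
// Toy binary quantization: threshold each dimension at the assumed midpoint of 0.
// Illustrative only -- Atlas applies this to both indexed vectors and query vectors.
public class BinaryQuantizeSketch {

    // 1 if the value is greater than the midpoint, 0 otherwise
    static int[] binarize(double[] v) {
        int[] bits = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            bits[i] = v[i] > 0 ? 1 : 0;
        }
        return bits;
    }

    public static void main(String[] args) {
        double[] embedding = {0.12, -0.40, 0.03, -0.001, 0.0};
        // values <= 0 map to 0, values > 0 map to 1
        System.out.println(java.util.Arrays.toString(binarize(embedding))); // [1, 0, 1, 0, 0]
    }
}
```

Each 32-bit float becomes a single bit, which is where the roughly 1/24 RAM figure (after graph overhead) comes from.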

Requirements

In order to automatically quantize vectors, or ingest quantized vectors, there are some requirements that need to be met. The following table from the MongoDB vector quantization docs lays out these requirements:

| Requirement | int1 ingestion | int8 ingestion | Automatic scalar quantization | Automatic binary quantization |
|---|---|---|---|---|
| Requires index definition settings | No | No | Yes | Yes |
| Requires BSON binData format | Yes | Yes | No | No |
| Storage on mongod | binData(int1) | binData(int8) | binData(float32) or array(double) | binData(float32) or array(double) |
| Supported similarity methods | euclidean | cosine, euclidean, dotProduct | cosine, euclidean, dotProduct | cosine, euclidean, dotProduct |
| Supported number of dimensions | Multiple of 8 | 1 to 8192 | 1 to 8192 | 1 to 8192 |
| Supports ANN and ENN search | Yes | Yes | Yes | Yes |

Note: MongoDB Atlas stores all floating-point values as the double data type internally. Therefore, both 32-bit and 64-bit embeddings are compatible with automatic quantization without conversion.

Our Java application

For this tutorial, we'll walk through setting up our Java application and MongoDB database to embed some sample data, and run a vector search query on it. We'll take two main approaches: automatic quantization of our vectors, and ingesting pre-quantized vectors.

Prerequisites

Before we dive into the implementation, start by creating a new Maven application. Once the project is set up, open the application.properties file and add the configuration values for connecting to MongoDB and Voyage AI, or add them as environment variables.

VOYAGE_API_KEY=YOUR_VOYAGE_AI_API_KEY
MONGODB_URI=YOUR_MONGODB_CONNECTION_STRING

We'll need to bring in our dependencies to our pom.xml file:

<dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.19.2</version>
        </dependency>
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongodb-driver-sync</artifactId>
            <version>5.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>2.0.16</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>2.0.16</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20250517</version>
        </dependency>
        <dependency>
            <groupId>com.squareup.okhttp3</groupId>
            <artifactId>okhttp</artifactId>
            <version>4.12.0</version>
        </dependency>
    </dependencies>

Let’s quickly break down what each of these dependencies is doing in our project:

  • JUnit: Provides a simple testing framework so we can write unit tests for our application. While not strictly required for this tutorial, it’s good practice to keep tests in place as you iterate.
  • Jackson Databind: Used to map JSON responses (like the embeddings we get back from Voyage AI) into Java objects. It also makes it easy to serialize Java objects into JSON when needed.
  • MongoDB Driver Sync: The official synchronous Java Driver for MongoDB. This is how our Java application will connect to Atlas, insert documents, and run vector search queries.
  • SLF4J API and SLF4J Simple: SLF4J is a logging facade that gives us a consistent API for logging. The “Simple” binding is a lightweight implementation that prints logs to the console. Together, these help us see what’s going on under the hood when we run our app.
  • JSON: A lightweight library for constructing and manipulating JSON. We’ll use this when parsing small JSON objects or building payloads, especially when interacting with the embedding API.
  • OkHttp: A modern, efficient HTTP client for Java. We’ll use this to make requests to the Voyage AI API to fetch embeddings for our sample data.

Setting up our main class

For simplicity, we'll keep the logic for this application in a single Main class. This Main class sets up everything the app needs before we write any logic:

  • Environment-driven config: VOYAGE_API_KEY and MONGODB_URI are read from environment variables so you don’t hardcode secrets. If either is missing, the app fails fast with a clear error.
  • Project constants: Database/collection/index names (test, demo, vector_index) are what we'll use to connect to our MongoDB database and configure our vector search index.
  • HTTP details: VOYAGE_API_URL points at Voyage AI’s embeddings endpoint. The connection/read timeouts keep our HTTP calls predictable.
  • Toy dataset + query: DATA is a small list of documents we’ll embed and store; QUERY_TEXT is what we’ll use to run a vector search later. I was not too creative here, but you can absolutely plug in your own data, and even make the query text interactive if you want to play around with it.
public class Main {

    // Configurations
    private static final String VOYAGE_API_KEY = System.getenv("VOYAGE_API_KEY");
    private static final String MONGODB_URI = System.getenv("MONGODB_URI");
    private static final String DB_NAME = "test";
    private static final String COLLECTION_NAME = "demo";
    private static final String INDEX_NAME = "vector_index";
    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

    // Voyage AI API Endpoint
    private static final String VOYAGE_API_URL = "https://api.voyageai.com/v1/embeddings";

    // Timeout values for API requests
    private static final int CONNECTION_TIMEOUT = 30;
    private static final int READ_TIMEOUT = 60;

    // Sample Data
    private static final List<String> DATA = List.of(
            "The Great Wall of China is visible from space.",
            "The Eiffel Tower was completed in Paris in 1889.",
            "Mount Everest is the highest peak on Earth at 8,848m.",
            "Shakespeare wrote 37 plays and 154 sonnets during his lifetime.",
            "The Mona Lisa was painted by Leonardo da Vinci."
    );

    private static final String QUERY_TEXT = "Country landmarks";

    public static void main(String[] args) {
        if (VOYAGE_API_KEY == null || VOYAGE_API_KEY.isEmpty()) {
            throw new RuntimeException("API key not found. Set VOYAGE_API_KEY in your environment.");
        }
        if (MONGODB_URI == null || MONGODB_URI.isEmpty()) {
            throw new RuntimeException("MongoDB URI not found. Set MONGODB_URI in your environment.");
        }

        // we'll be adding more code here
    }

    // and here
}

We’ll add the HTTP call to Voyage, the MongoDB write, and the vector search inside main where the comments indicate.

Response records

We are going to create two records to handle our responses from the Voyage AI API. The first record is going to be called ResponseDouble.

When we use automatic quantization in MongoDB Atlas, we send and store float vectors (received as doubles in Java). This record mirrors the Voyage AI JSON shape for float embeddings:

  • data[i].embedding is List<Double>, which maps cleanly to MongoDB’s array<double> (what Atlas expects when it will auto-quantize for us).
  • This path keeps the original full-fidelity vectors available for rescoring while the index stores the quantized form in RAM.
package com.timkelly;  

import java.util.List;  

public record ResponseDouble (  
    String object,  
    List<DataItem> data,  
    String model,  
    Usage usage  
) {  
        public record DataItem(  
                String object,  
                List<Double> embedding, // stored as doubles in MongoDB  
                int index  
        ) {}  
        public record Usage(  
                int total_tokens  
        ) {}  
}

We'll use this model whenever we plan to set "quantization": "scalar" or "binary" in our MongoDB Atlas Vector Search index definition. We'll go more into depth on this later in the tutorial.

For pre-quantized ingestion, some embedding providers return reduced-precision vectors directly:

  • int8 (scalar-quantized) → a byte[] per vector
  • int1 (binary-quantized/packed bits) → also a byte[], but bit-packed

We are going to set up a ResponseBytes record for this:

package com.timkelly;  

import java.util.List;  

public record ResponseBytes(  
        String object,  
        List<DataItem> data,  
        String model,  
        Usage usage  
) {  
    public record DataItem(  
            String object,  
            byte[] embedding, // int8 values or packed bits  
            int index  
    ) {}  
    public record Usage(  
            int total_tokens  
    ) {}  
}

This record captures that response shape. When we insert into MongoDB, we’ll store the bytes as BSON binData so Atlas can index them as quantized vectors without extra conversion. One of the reasons we are using Voyage AI here is because we can get these pre-quantized values returned. If you are using a different model, your responses will look different, but there are numerous other models out there that can pre-quantize our vectors.

How to enable automatic quantization of vectors

We can configure MongoDB Atlas Vector Search to automatically quantize float vector embeddings in our collection to reduced representation types, such as int8 (scalar) and binary in our vector indexes.

{
  "fields":[
    {
      "type": "vector",
      "path": "<field-to-index>",
      "numDimensions": <number-of-dimensions>,
      "similarity": "euclidean | cosine | dotProduct",
      "quantization": "none | scalar | binary",
      "hnswOptions": {
        "maxEdges": <number-of-connected-neighbors>,
        "numEdgeCandidates": <number-of-nearest-neighbors>
      }
    },
    {
      "type": "filter",
      "path": "<field-to-index>"
    },
    ...
  ]
}

To set or change the quantization type, we specify a quantization field value of either scalar or binary in our index definition. This triggers an index rebuild, just like any other index definition change. The specified quantization type applies to all indexed vectors and to query vectors at query time, so we don't need to change our query: query vectors are quantized automatically. We'll create our index in code using the MongoDB Java Driver, but first, we'll create a method to get embeddings for our data.

Embedding our data

Underneath our main method, we'll create a method embedDataAndCreateDocument that takes in a list of strings (our data to embed) and returns a list of BSON documents with all of our embeddings, ready to be queried via automatic scalar or binary quantization.

private static List<Document> embedDataAndCreateDocument(List<String> data) {  
    OkHttpClient client = new OkHttpClient.Builder()  
            .connectTimeout(CONNECTION_TIMEOUT, TimeUnit.SECONDS)  
            .readTimeout(READ_TIMEOUT, TimeUnit.SECONDS)  
            .build();

    try {  
        ResponseDouble baselineValues = embedBaselineDoubles(client, data);   

        // Create a list of BSON documents containing all embeddings  
        return createCombinedEmbeddingsDocument(  
                data,  
                baselineValues);  
    } catch (IOException e) {  
        throw new RuntimeException("Error fetching embedding: ", e);  
    }  
}

First, we create an OkHttpClient to connect to the Voyage AI API. We then query the API and map the response onto our ResponseDouble record. We'll define our embedBaselineDoubles method to take in this client and the data to embed.

Getting our embeddings with Voyage AI

Now, we are going to need a method to communicate with the Voyage AI API. Let's define a method sendRequest() that will allow us to do this. We are going to reuse this method throughout this tutorial, so we will add configuration for the output type in the method parameters, along with the client and data to embed.

// Send API request to Voyage AI  
private static String sendRequest(  
        OkHttpClient client,  
        List<String> data,  
        String outputDataType  
) throws IOException {  

    String requestBody = new JSONObject()  
            .put("input", data)  
            .put("model", "voyage-3-large")  
            .put("input_type", "document")  
            .put("output_dtype", outputDataType)  
            .put("output_dimension", 1024)  
            .toString();  

    Request request = new Request.Builder()  
            .url(VOYAGE_API_URL)  
            .post(RequestBody.create(requestBody, MediaType.get("application/json")))  
            .addHeader("Authorization", "Bearer " + VOYAGE_API_KEY)  
            .build();  

    try (Response response = client.newCall(request).execute()) {  
        if (!response.isSuccessful()) {  
            throw new IOException("API error: HTTP " + response.code() + " - " + response.message());  
        }  

        if (response.body() == null) {  
            throw new IOException("Empty response body");  
        }  

        String json = response.body().string();  
        System.out.println("Raw API JSON:\n" + json);  
        return json;  
    }  
}

We also define the embedding model to use, the input type, and the output dimension we need. It's important that this output dimension matches the numDimensions we'll specify in our vector search index.

After this, we can define our method embedBaselineDoubles. It calls sendRequest with the output data type set to float. In MongoDB, we'll store this output as doubles, hence the naming convention I use.

private static ResponseDouble embedBaselineDoubles(OkHttpClient client, List<String> data) throws IOException {  
    String outputDataType = "float";  
    return OBJECT_MAPPER.readValue(sendRequest(client, data, outputDataType), ResponseDouble.class);  
}

This method takes a list of strings, sends them to our embedding service requesting the float-based embeddings, and then deserializes the JSON response into a ResponseDouble object.

Creating our BSON document

Lastly, we can define our method createCombinedEmbeddingsDocument, which converts these embeddings into the documents we'll store in MongoDB and provides the return value for our embedDataAndCreateDocument method.

// Create a list of BSON document containing all embedding responses  
private static List<Document> createCombinedEmbeddingsDocument(  
        List<String> texts,  
        ResponseDouble floatResponse 
) {  
    List<Document> embeddingsList = new ArrayList<>();  

    for (int i = 0; i < texts.size(); i++) {  
        Document embeddingDoc = new Document()  
                .append("text", texts.get(i))  
                .append("embeddings_float32", floatResponse.data().get(i).embedding())  
                .append("embeddings_auto_scalar", floatResponse.data().get(i).embedding())  
                .append("embeddings_auto_binary", floatResponse.data().get(i).embedding());  
        embeddingsList.add(embeddingDoc);  
    }  

    return embeddingsList;  
}

This method takes our texts and embeddings and builds MongoDB documents that include three parallel embedding fields:

  • One for float32 storage
  • One for auto scalar quantization
  • One for auto binary quantization

Even though they all store the same raw values right now, MongoDB will use different index definitions (float32, scalar quantized, binary quantized) depending on which field we target.

Storing in MongoDB

After all this, we can define our storeAllEmbeddings method. This will take in the MongoCollection we obtain in our main method, as well as the list of embedding documents, and insert them into the database and collection we defined as constants at the start of our class.

public static void storeAllEmbeddings(MongoCollection<Document> collection, List<Document> documents) {
    collection.insertMany(documents);
    System.out.println("Inserted embeddings documents into MongoDB");
}

We can create our embeddings, create our documents, and store them in MongoDB, but how can we use vector search to query our data? Well, we need to define a vector search index.

Creating our Vector Search index

We'll create a method setupVectorSearchIndex in which we define our index:

// Create the Vector Search index  
public static void setupVectorSearchIndex(MongoCollection<Document> collection) throws InterruptedException {  

    Bson definition = new Document(  
            "fields",  
            List.of(  
                    new Document("type", "vector")  
                            .append("path", "embeddings_float32")  
                            .append("numDimensions", 1024)  
                            .append("similarity", "dotProduct"),  
                    new Document("type", "vector")  
                            .append("path", "embeddings_auto_scalar")  
                            .append("numDimensions", 1024)  
                            .append("similarity", "dotProduct")  
                            .append("quantization", "scalar"),  
                    new Document("type", "vector")  
                            .append("path", "embeddings_auto_binary")  
                            .append("numDimensions", 1024)  
                            .append("similarity", "dotProduct")  
                            .append("quantization", "binary")
            )  
    );  

    SearchIndexModel indexModel = new SearchIndexModel(  
            INDEX_NAME,  
            definition,  
            SearchIndexType.vectorSearch()  
    );  

    collection.createSearchIndexes(Collections.singletonList(indexModel));

    waitForIndex(collection, INDEX_NAME);  
}

This method wires up the vector search index so we can actually query our embeddings. We define an index with three vector fields:

  • embeddings_float32: Raw float vectors (stored as doubles in BSON). No quantization.
  • embeddings_auto_scalar: Same vectors, but the index applies automatic scalar quantization.
  • embeddings_auto_binary: Same vectors, but with automatic binary quantization.

Each one uses numDimensions: 1024 and similarity: dotProduct. You’d adjust numDimensions if your model outputs a different size.

The reason for three fields is simple: It lets you compare trade-offs.

  • No quantization gives the highest recall but eats the most memory.
  • Scalar cuts memory roughly to a quarter while keeping recall high.
  • Binary is the leanest and fastest, but recall can dip depending on the data.

MongoDB handles the quantization inside the index. In your documents, you still just store doubles.
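To put rough numbers on these trade-offs, here is a back-of-the-envelope calculation (my own arithmetic, counting only the vector values; Atlas's real ratios are 1/3.75 and 1/24 because the HNSW graph structure itself is not compressed):

```java
// Back-of-the-envelope memory estimate for the vector values alone,
// e.g. 1M vectors at 1024 dimensions. The HNSW graph adds uncompressed overhead on top.
public class IndexMemorySketch {

    static long bytesFloat32(long vectors, int dims) { return vectors * dims * 4; } // 4 bytes per dimension
    static long bytesScalar(long vectors, int dims)  { return vectors * dims; }     // 1 byte per dimension
    static long bytesBinary(long vectors, int dims)  { return vectors * dims / 8; } // 1 bit per dimension

    public static void main(String[] args) {
        long vectors = 1_000_000L;
        int dims = 1024;
        System.out.printf("float32: %,d bytes (~%d MB)%n", bytesFloat32(vectors, dims), bytesFloat32(vectors, dims) >> 20);
        System.out.printf("scalar : %,d bytes (~%d MB)%n", bytesScalar(vectors, dims), bytesScalar(vectors, dims) >> 20);
        System.out.printf("binary : %,d bytes (~%d MB)%n", bytesBinary(vectors, dims), bytesBinary(vectors, dims) >> 20);
    }
}
```

At a million 1024-dimension vectors, the raw values alone drop from roughly 4 GB to roughly 1 GB (scalar) or 128 MB (binary), which is why MongoDB recommends quantization once collections grow past 100,000 vectors.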

Then, we can add our waitForIndex method that will poll the database until the build is complete before moving on:

// Wait for the index build to complete  
public static <T> void waitForIndex(final MongoCollection<T> collection, final String indexName) {  
    long startTime = System.nanoTime();  
    long timeoutNanos = TimeUnit.SECONDS.toNanos(60);  
    while (System.nanoTime() - startTime < timeoutNanos) {  
        Document indexRecord = StreamSupport.stream(collection.listSearchIndexes().spliterator(), false)  
                .filter(index -> indexName.equals(index.getString("name")))  
                .findAny().orElse(null);  
        if (indexRecord != null) {  
            if ("FAILED".equals(indexRecord.getString("status"))) {  
                throw new RuntimeException("Search index has FAILED status.");  
            }  
            if (indexRecord.getBoolean("queryable")) {  
                System.out.println(indexName + " index is ready to query");  
                return;  
            }  
        }  
        try {  
            Thread.sleep(400); // busy-wait, avoid in production  
        } catch (InterruptedException e) {  
            Thread.currentThread().interrupt();  
            throw new RuntimeException(e);  
        }  
    }  
    throw new RuntimeException("Timed out waiting for search index: " + indexName);  
}

Query with Vector Search

To query the data, we will define a runVectorSearchQueries method. We'll be using this method again later to query pre-quantized vectors, but for now, we will just prepare it to query our data that uses the automatically quantized data (as well as a baseline that uses no quantization):

// Run MongoDB vector search queries for all fields using query embeddings  
private static void runVectorSearchQueries(MongoCollection<Document> collection) {  
    OkHttpClient client = new OkHttpClient.Builder()  
            .connectTimeout(CONNECTION_TIMEOUT, TimeUnit.SECONDS)  
            .readTimeout(READ_TIMEOUT, TimeUnit.SECONDS)  
            .build();  

    List<String> queryInput = List.of(QUERY_TEXT);  

    try { 
        ResponseDouble qFloatResp = embedBaselineDoubles(client, queryInput);  
        List<Double> qFloat = qFloatResp.data().getFirst().embedding();

        runVectorSearchDouble(collection, qFloat);

    } catch (IOException e) {  
        throw new RuntimeException("Error embedding query vector: ", e);  
    }  
}

First, we create an embedding for our query using the embedBaselineDoubles method from earlier and extract the vector from the response. Next, we call a runVectorSearchDouble method. Let's create this method:

private static void runVectorSearchDouble(  
        MongoCollection<Document> collection,  
        List<Double> floatQuery  
) {  
    for (String embeddingType : List.of("embeddings_float32", "embeddings_auto_scalar", "embeddings_auto_binary")) {  
        List<Bson> pipeline = asList(  
                vectorSearch(  
                        fieldPath(embeddingType),  
                        floatQuery,  
                        INDEX_NAME,  
                        2, approximateVectorSearchOptions(5)  
                ),  
                project(fields(  
                        exclude("_id"),  
                        include("text"),  
                        metaVectorSearchScore("vectorSearchScore"))));  

        List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());  

        System.out.println("Outputting results: ");  
        for (Document result : results) {  
            System.out.println(result.toJson());  
        }  
    }  
}

We loop through each embedding type, embeddings_float32, embeddings_auto_scalar, and embeddings_auto_binary. For each one, we build a pipeline:

  1. vectorSearch stage
    • fieldPath(embeddingType) tells MongoDB which embedding field to search.
    • floatQuery is our query vector (from the earlier embedBaselineDoubles method).
    • INDEX_NAME points to the vector search index we created.
    • 2 means return the top two matches.
    • approximateVectorSearchOptions(5) says “use ANN with a candidate pool of 5.” That balances speed versus accuracy.
  2. project stage
    • Excludes _id (so results are cleaner).
    • Includes just text and the computed vectorSearchScore.

We then run the aggregation with collection.aggregate(pipeline) and dump the results into a list. Finally, the results are printed, showing each document’s text and score.

Now, let's update our main method to include the necessary calls to our methods (the constants and configuration from earlier stay exactly as they were):

public static void main(String[] args) {
    if (VOYAGE_API_KEY == null || VOYAGE_API_KEY.isEmpty()) {
        throw new RuntimeException("API key not found. Set VOYAGE_API_KEY in your environment.");
    }
    if (MONGODB_URI == null || MONGODB_URI.isEmpty()) {
        throw new RuntimeException("MongoDB URI not found. Set MONGODB_URI in your environment.");
    }

    // create embeddings and collect all responses
    System.out.println("Embedding data...");
    List<Document> allEmbeddingsDocuments = embedDataAndCreateDocument(DATA);

    try (MongoClient mongoClient = MongoClients.create(MONGODB_URI)) {
        MongoDatabase database = mongoClient.getDatabase(DB_NAME);
        MongoCollection<Document> collection = database.getCollection(COLLECTION_NAME);
        // store all embeddings in MongoDB
        storeAllEmbeddings(collection, allEmbeddingsDocuments);
        // create the vector search index
        setupVectorSearchIndex(collection);

        System.out.println("Querying data...");
        runVectorSearchQueries(collection);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("Interrupted while waiting for the index build", e);
    }
}

It is perfectly fine to run this now and make full use of MongoDB's auto-quantization, but next, we'll walk through how to update this class to handle pre-quantized vectors.

How to ingest pre-quantized vectors

Pre-quantized vectors are supported by some embedding models, such as Voyage AI’s voyage-3-large. Using them means the model itself does the quantization step for us, so we can store and index the compressed representation directly instead of relying on MongoDB to apply automatic quantization during index build.

This has two main benefits:

  • Smaller payloads from the API: Instead of receiving large float32 arrays, we get compact int8 or even int1 arrays.
  • Less memory overhead in MongoDB: We store the already-quantized representation, which keeps both storage size and index RAM cost lower.

MongoDB supports ingestion of:

  • int8 vectors (stored as BSON binData(int8)).
  • int1 vectors (stored as BSON binData(int1) with packed bits).

Unlike the auto-quantization paths, int8 and int1 ingestion do not require any quantization settings in the index definition.
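To see why int1 vectors require a dimension count that is a multiple of 8, here is a sketch of the bit-packing that a packed-bits output implies (my own illustration of the convention, packing the most significant bit first):

```java
// Toy bit-packing: 8 binary dimensions per byte, most significant bit first.
// This is why int1 vectors must have a dimension count that's a multiple of 8.
public class PackedBitsSketch {

    static byte[] pack(int[] bits) {
        byte[] packed = new byte[bits.length / 8];
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] != 0) {
                packed[i / 8] |= (byte) (1 << (7 - (i % 8)));
            }
        }
        return packed;
    }

    public static void main(String[] args) {
        // 1024 binary dimensions would pack into 128 bytes; here, 8 bits -> 1 byte
        int[] bits = {1, 0, 1, 0, 0, 0, 0, 1};
        System.out.printf("0x%02X%n", pack(bits)[0] & 0xFF); // 0b10100001 -> 0xA1
    }
}
```

With a dimension count that isn't a multiple of 8, the final byte would be only partially filled, which is why the requirements table rules that case out for int1 ingestion.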

Embedding our data

Now, to implement this, we will update our code from earlier. First, in embedDataAndCreateDocument, we will update it to embed our data to int8 and int1 and return to our ResponseBytes record.

private static List<Document> embedDataAndCreateDocument(List<String> data) {  
    OkHttpClient client = new OkHttpClient.Builder()  
            .connectTimeout(CONNECTION_TIMEOUT, TimeUnit.SECONDS)  
            .readTimeout(READ_TIMEOUT, TimeUnit.SECONDS)  
            .build();

    try {  
        ResponseDouble baselineValues = embedBaselineDoubles(client, data);  
        ResponseBytes preQuantizedInt8 = embedPreQuantizedInt8(client, data);  
        ResponseBytes preQuantizedInt1 = embedPreQuantizedInt1(client, data);  

        // Create a list of BSON documents containing all embeddings  
        return createCombinedEmbeddingsDocument(  
                data,  
                baselineValues,  
                preQuantizedInt8,  
                preQuantizedInt1);  
    } catch (IOException e) {  
        throw new RuntimeException("Error fetching embedding: ", e);  
    }  
}

We'll use two new methods to get our pre-quantized data. First, embedPreQuantizedInt8, which works much like our embedBaselineDoubles, but passes the output data type int8 to tell Voyage AI that we want pre-quantized scalar values:

private static ResponseBytes embedPreQuantizedInt8(OkHttpClient client, List<String> data) throws IOException {  
    String outputDataType = "int8";  
    return OBJECT_MAPPER.readValue(sendRequest(client, data, outputDataType), ResponseBytes.class);  
}

We then need an embedPreQuantizedInt1 method to pass in the output type ubinary to get our packed bits response:

private static ResponseBytes embedPreQuantizedInt1(OkHttpClient client, List<String> data) throws IOException {  
    String outputDataType = "ubinary";  
    return OBJECT_MAPPER.readValue(sendRequest(client, data, outputDataType), ResponseBytes.class);  
}
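Both methods delegate to a shared sendRequest helper, which isn't shown here. As a rough sketch, it assembles a JSON body carrying the input texts, the model name, and the output data type. The parameter name output_dtype below is an assumption about the Voyage AI embeddings API; verify it against their current API reference before relying on it:

```java
import java.util.List;
import java.util.stream.Collectors;

public class EmbeddingRequest {
    // Build the JSON body for an embeddings request. The "output_dtype"
    // key is an assumption about Voyage AI's API; check their docs for
    // the exact parameter name in the version you are calling.
    static String buildBody(List<String> data, String model, String outputDataType) {
        String inputs = data.stream()
                .map(s -> "\"" + s.replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(","));
        return "{\"input\":[" + inputs + "],"
                + "\"model\":\"" + model + "\","
                + "\"output_dtype\":\"" + outputDataType + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(buildBody(List.of("hello"), "voyage-3-large", "int8"));
    }
}
```

In the real helper, this body would be posted with the OkHttpClient we built earlier, and the raw response fed to OBJECT_MAPPER as before.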

How to store pre-quantized data in MongoDB

We also need to update our createCombinedEmbeddingsDocument to take in these new embeddings to store in MongoDB.

// Create a list of BSON documents containing all embedding responses  
private static List<Document> createCombinedEmbeddingsDocument(  
        List<String> texts,  
        ResponseDouble floatResponse,  
        ResponseBytes int8Response,  
        ResponseBytes int1Response  
) {  
    List<Document> embeddingsList = new ArrayList<>();  

    for (int i = 0; i < texts.size(); i++) {  
        Document embeddingDoc = new Document()  
                .append("text", texts.get(i))  
                .append("embeddings_float32", floatResponse.data().get(i).embedding())  
                .append("embeddings_auto_scalar", floatResponse.data().get(i).embedding())  
                .append("embeddings_auto_binary", floatResponse.data().get(i).embedding())  
                .append("embeddings_int8", BinaryVector.int8Vector(int8Response.data().get(i).embedding()))  
                .append("embeddings_int1", BinaryVector.packedBitVector(int1Response.data().get(i).embedding(), (byte) 0));  
        embeddingsList.add(embeddingDoc);  
    }  

    return embeddingsList;  
}

We still store doubles for the auto-quantization paths, and we now store pre-quantized bytes for the int8/int1 paths. MongoDB Atlas treats the latter as already quantized at ingestion, so no conversion takes place.

org.bson.BinaryVector is a helper type in the MongoDB Java Driver. It wraps raw byte[] so MongoDB knows we’re storing a vector type (binData(int8) or binData(int1)) instead of just a random blob of bytes.

MongoDB Atlas Vector Search requires these wrapped types to understand how to treat the data when building a vector index.

BinaryVector.int8Vector(...)

  • Takes the byte[] from our embedding model that already represents int8-quantized values.
  • Wraps it as a BSON binData(int8) vector.
  • MongoDB will store this as a compact 8-bit signed integer vector, and the index will consume it directly with no further conversion.

BinaryVector.packedBitVector(..., (byte) 0)

  • Takes the byte[] returned for 1-bit embeddings (ubinary).
  • These vectors are bit-packed. Each bit represents one dimension of the embedding (0 if ≤ midpoint, 1 if > midpoint).
  • BinaryVector.packedBitVector wraps that array and tags it as BSON binData(int1).
  • The (byte) 0 is a padding value. If the vector’s number of dimensions isn’t an exact multiple of eight, MongoDB Atlas needs to know how to pad the unused bits in the last byte (usually 0).
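To see what "packed bits" means concretely, here is a minimal sketch that packs one bit per dimension into bytes, most significant bit first, with unused trailing bits left at the padding value 0. The bit order here is illustrative; Atlas and the driver document the exact layout for binData(int1):

```java
public class BitPacker {
    // Pack an array of 0/1 dimension values into bytes, MSB first.
    // Unused bits in the final byte stay at the padding value 0.
    static byte[] pack(int[] bits) {
        byte[] packed = new byte[(bits.length + 7) / 8];
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] == 1) {
                packed[i / 8] |= (byte) (1 << (7 - (i % 8)));
            }
        }
        return packed;
    }

    public static void main(String[] args) {
        // 10 dimensions -> 2 bytes, with 6 padded bits in the second byte
        byte[] out = pack(new int[]{1, 0, 1, 1, 0, 0, 1, 0, 1, 1});
        System.out.printf("%02x %02x%n", out[0] & 0xFF, out[1] & 0xFF); // b2 c0
    }
}
```

A 1,024-dimension embedding divides evenly into 128 bytes, so in our case the padding byte never actually comes into play.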

Updating our Vector Search index

Now, we need to update the setupVectorSearchIndex to handle these new fields, and prepare them for vector search:

// Create the Vector Search index  
public static void setupVectorSearchIndex(MongoCollection<Document> collection) throws InterruptedException {  
       Bson definition = new Document(  
            "fields",  
            List.of(  
                    new Document("type", "vector")  
                            .append("path", "embeddings_float32")  
                            .append("numDimensions", 1024)  
                            .append("similarity", "dotProduct"),  
                    new Document("type", "vector")  
                            .append("path", "embeddings_auto_scalar")  
                            .append("numDimensions", 1024)  
                            .append("similarity", "dotProduct")  
                            .append("quantization", "scalar"),  
                    new Document("type", "vector")  
                            .append("path", "embeddings_auto_binary")  
                            .append("numDimensions", 1024)  
                            .append("similarity", "dotProduct")  
                            .append("quantization", "binary"),  
                    new Document("type", "vector")  
                            .append("path", "embeddings_int8")  
                            .append("numDimensions", 1024)  
                            .append("similarity", "dotProduct"),  
                    new Document("type", "vector")  
                            .append("path", "embeddings_int1")  
                            .append("numDimensions", 1024)  
                            .append("similarity", "euclidean")  
            )  
    );  

    SearchIndexModel indexModel = new SearchIndexModel(  
            INDEX_NAME,  
            definition,  
            SearchIndexType.vectorSearch()  
    );  

    collection.createSearchIndexes(Collections.singletonList(indexModel));

    waitForIndex(collection, INDEX_NAME);  
}

We've added our two new fields, embeddings_int8 and embeddings_int1, to our index definition. We still use the same number of dimensions, as this is determined by the embedding model. There are two big differences.

  1. We do not need to define quantization.
  2. For int1 vector quantization, we use the euclidean similarity function for searching. This is because int1 vectors do not support the dotProduct or cosine similarity functions.
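The euclidean requirement is less arbitrary than it looks: for 0/1 vectors, squared Euclidean distance equals the Hamming distance, i.e., the count of differing bits, which can be computed very cheaply on the packed representation. A sketch of that equivalence over packed bytes:

```java
public class HammingDistance {
    // For 0/1 vectors, squared Euclidean distance equals the number of
    // differing bits, which we can count on packed bytes via XOR.
    static int hamming(byte[] a, byte[] b) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            count += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0b10110010};
        byte[] b = {(byte) 0b10100011};
        System.out.println(hamming(a, b)); // 2 differing bits
    }
}
```

This is one reason int1 search is so fast: the distance computation reduces to XOR and popcount instructions.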

If you have run the code before and this index already exists on the database, you will need to delete it or update it manually to avoid duplicate index errors.

Searching pre-quantized vectors

Now that we have our pre-quantized data ready, we need to update our runVectorSearchQueries method to embed our queries as these pre-quantized values too.

// Run MongoDB vector search queries for all fields using query embeddings  
private static void runVectorSearchQueries(MongoCollection<Document> collection) {  
    OkHttpClient client = new OkHttpClient.Builder()  
            .connectTimeout(CONNECTION_TIMEOUT, TimeUnit.SECONDS)  
            .readTimeout(READ_TIMEOUT, TimeUnit.SECONDS)  
            .build();  

    List<String> queryInput = List.of(QUERY_TEXT);  

    try { 
        ResponseDouble qFloatResp = embedBaselineDoubles(client, queryInput);  
        List<Double> qFloat = qFloatResp.data().getFirst().embedding();  

        ResponseBytes qInt8Resp = embedPreQuantizedInt8(client, queryInput);  
        BinaryVector qInt8 = BinaryVector.int8Vector(qInt8Resp.data().getFirst().embedding());  

        ResponseBytes qInt1Resp = embedPreQuantizedInt1(client, queryInput);  
        BinaryVector qInt1 = BinaryVector.packedBitVector(qInt1Resp.data().getFirst().embedding(), (byte) 0);  

        runVectorSearchDouble(collection, qFloat);  

        runVectorSearchBinary(collection, qInt8, qInt1);  

    } catch (IOException e) {  
        throw new RuntimeException("Error embedding query vector: ", e);  
    }  
}

We'll use our embedPreQuantizedInt8 and embedPreQuantizedInt1 methods to get our embeddings for our query.

Lastly, we need a new runVectorSearchBinary method to query MongoDB with these values:

private static void runVectorSearchBinary(  
        MongoCollection<Document> collection,  
        BinaryVector qInt8,  
        BinaryVector qInt1  
) {  
    Map<String, BinaryVector> queryByField = new LinkedHashMap<>();  
    queryByField.put("embeddings_int8", qInt8); // int8  
    queryByField.put("embeddings_int1", qInt1); // packed 1-bit  

    for (Map.Entry<String, BinaryVector> entry : queryByField.entrySet()) {  
        List<Bson> pipeline = asList(  
                vectorSearch(  
                        fieldPath(entry.getKey()),  
                        entry.getValue(),  
                        INDEX_NAME,  
                        2, approximateVectorSearchOptions(5)  
                ),  
                project(fields(  
                        exclude("_id"),  
                        include("text"),  
                        metaVectorSearchScore("vectorSearchScore"))));  

        List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());  

        System.out.println("Outputting results: ");  
        for (Document result : results) {  
            System.out.println(result.toJson());  
        }  
    }  
}

Running our application

We can now run our application with the command:

mvn clean compile exec:java -Dexec.mainClass="com.timkelly.Main"

And we’ll see our results printed to the console:

Outputting results: 
{"text": "The Great Wall of China is visible from space.", "vectorSearchScore": 0.8666857481002808}
{"text": "Mount Everest is the highest peak on Earth at 8,848m.", "vectorSearchScore": 0.8523434400558472}
Outputting results: 
{"text": "The Great Wall of China is visible from space.", "vectorSearchScore": 0.8664929866790771}
{"text": "Mount Everest is the highest peak on Earth at 8,848m.", "vectorSearchScore": 0.8526520729064941}
Outputting results: 
{"text": "The Great Wall of China is visible from space.", "vectorSearchScore": 0.8666857481002808}
{"text": "Mount Everest is the highest peak on Earth at 8,848m.", "vectorSearchScore": 0.8523434400558472}
Outputting results: 
{"text": "The Great Wall of China is visible from space.", "vectorSearchScore": 0.5071338415145874}
{"text": "Mount Everest is the highest peak on Earth at 8,848m.", "vectorSearchScore": 0.5067881345748901}
Outputting results: 
{"text": "The Great Wall of China is visible from space.", "vectorSearchScore": 0.771484375}
{"text": "The Mona Lisa was painted by Leonardo da Vinci.", "vectorSearchScore": 0.7451171875}

You’ll notice that the various methods we used (baseline float32, automatic scalar, automatic binary, and the pre-quantized int8/int1 paths) all return slightly different results and scores.

That’s the trade-off in action:

  • Float32 (no quantization): Highest fidelity, highest memory usage.
  • Auto-scalar: Scores almost identical to float32, at roughly a quarter of the RAM cost.
  • Auto-binary: Much tighter memory, with recall dipping slightly depending on the dataset.
  • Pre-quantized int8/int1: Smallest payloads and lowest index overhead, but the scores shift more noticeably.

This is why MongoDB gives us all these options. We can pick the right balance of accuracy, memory footprint, and query latency for our application.

Conclusion

With this setup, we’ve seen how to embed data, store it in MongoDB, and query it with Atlas Vector Search across multiple quantization strategies. Automatic quantization makes it easy to reduce memory while keeping high recall, while pre-quantized vectors give you the smallest payloads and fastest queries when your embedding model supports them. The trade-offs between accuracy, resource usage, and speed become clear once you run the same query through each index path.

By experimenting with float32, scalar, binary, int8, and int1 vectors side by side, you can choose the right approach for your application’s needs, whether that’s maximum recall, minimal resource footprint, or a balance between the two.

If you found this tutorial useful, and want to learn more about what you can do with MongoDB in Java, check out how I turned my Obsidian vault into a searchable wiki or Secure Local RAG with Role-Based Access: Spring AI, Ollama & MongoDB.
