Goh Chun Lin

Posted on Aug 9 • Originally published at cuteprogramming.blog on Jan 20

When Pinecone Wasn’t Enough: My Journey to pgvector

#cloudcomputingamazon #experience #pinecone #postgres

If you work with machine learning or natural language processing, you have probably dealt with storing and searching through vector embeddings.

When I created the Honkai: Star Rail (HSR) relic recommendation system using Gemini, I started with Pinecone. Pinecone is a managed vector database that made it easy to index relic descriptions and character data as embeddings. It helped me find the best recommendations based on how similar they were.

Pinecone worked well, but as the project grew, I wanted more control, something open-source, and a cheaper option. That is when I found pgvector, a tool that adds vector search to PostgreSQL and gives the flexibility of an open-source database.

About HSR and Relic Recommendation System

Honkai: Star Rail (HSR) is a popular RPG that has captured the attention of players worldwide. One of the key features of the game is its relic system, where players equip their characters with relics like hats, gloves, or boots to boost stats and unlock special abilities. Each relic has unique attributes, and selecting the right sets of relics for a character can make a huge difference in gameplay.

An HSR streamer, Unreal Dreamer, learning the new relic feature. (Image Source: Unreal Dreamer YouTube)

As a casual player, I often found myself overwhelmed by the number of options and the subtle synergies between different relic sets. Finding the good relic combination for each character was time-consuming.

This is where LLMs like Gemini come into play. With the ability to process and analyse complex data, Gemini can help players make smarter decisions.

In November 2024, I started a project to develop a Gemini-powered HSR relic recommendation system which can analyse a player’s current characters to suggest the best options for them. In the project, I have been storing embeddings in Pinecone.

Embeddings and Vector Database

An embedding is a way to turn data, like text or images, into a list of numbers called a vector. These vectors make it easier for a computer to compare and understand the relationships between different pieces of data.

For example, in the HSR relic recommendation system, we use embeddings to represent descriptions of relic sets. The numbers in the vector capture the meaning behind the words, so similar relics and characters have embeddings that are closer together in a mathematical sense.

This is where vector databases like Pinecone or pgvector come in. Vector databases are designed for performing fast similarity searches on large collections of embeddings. This is essential for building systems that need to recommend, match, or classify data.

pgvector is an open-source extension for PostgreSQL that allows us to store and search for vectors directly in our database. It adds specialised functionality for handling vector data, like embeddings in our HSR project, making it easier to perform similarity searches without needing a separate system.

Unlike managed services like Pinecone, pgvector is open source. This meant we could use it freely and avoid vendor lock-in. This is a huge advantage for developers.

Finally, since pgvector runs on PostgreSQL, there is no need for additional managed service fees. This makes it a budget-friendly option, especially for projects that need to scale without breaking the bank.

Choosing the Right Model

While the choice of the vector database is important, it is not the key factor in achieving great results. The quality of our embeddings actually is determined by the model we choose.

For my HSR relic recommendation system, when our embeddings were stored in Pinecone, I started by using the multilingual-e5-large model from Microsoft Research offered in Pinecone.

When I migrated to pgvector, I had the freedom to explore other options. For this migration, I chose the all-MiniLM-L6-v2 model hosted on Hugging Face, which is a lightweight sentence-transformer designed for semantic similarity tasks. Switching to this model allowed me to quickly generate embeddings for relic sets and integrate them into pgvector, giving me a solid starting point while leaving room for future experimentation.

The all-MiniLM-L6-v2 model hosted on Hugging Face.

Using all-MiniLM-L6-v2 Model

Once we have decided to use the all-MiniLM-L6-v2 model, the next step is to generate vector embeddings for the relic descriptions. This model is from the sentence-transformers library, so we first need to install the library.

pip install sentence-transformers

The library offers SentenceTransformer class to load pre-trained models.

from sentence_transformers import SentenceTransformer

model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

At this point, the model is ready to encode text into embeddings.

The SentenceTransformer model takes care of tokenisation and other preprocessing steps internally, so we can directly pass text to it.

# Function to generate embedding for a single text
def generate_embedding(text):
    # No need to tokenise separately, it's done internally
    # No need to average the token embeddings
    embeddings = model.encode(text) 

    return embeddings

In this function, when we call model.encode(text), the model processes the text through its transformer layers, generating an embedding that captures its semantic meaning. The output is already optimised for tasks like similarity search.

Setting up the Database

After generating embeddings for each relic sets using the all-MiniLM-L6-v2 model, the next step is to store them in the PostgreSQL database with the pgvector extension.

For developers using AWS, there is a good news. In May 2023, AWS announced that Amazon Relational Database Service (RDS) for PostgreSQL would be supporting pgvector. In November 2024, Amazon RDS started to support pgvector 0.8.0.

pgvector is now supported on Amazon RDS for PostgreSQL.

To install the extension, we will run the following command in our database. This will introduce a new datatype called VECTOR.

CREATE EXTENSION vector;

After this, we can define our table as follows.

CREATE TABLE IF NOT EXISTS embeddings (
    id TEXT PRIMARY KEY,
    vector VECTOR(384),
    text TEXT
);

Besides the id column which is for the unique identifier, there are two other columns that are important.

The text column stores the original text for each relic (the two-piece and four-piece bonus descriptions).

The vector column stores the embeddings. The VECTOR(384) type is used to store embeddings, and 384 here refers to the number of dimensions in the vector. In our case, the embeddings generated by the all-MiniLM-L6-v2 model are 384-dimensional, meaning each embedding will have 384 numbers.

Here, a dimension refers to one of the “features” that helps describe something. When we talk about vectors and embeddings, each dimension is just one of the many characteristics used to represent a piece of text. These features could be things like the type of words used, their relationships, and even the overall meaning of the text.

Updating the Database

After the table is created, we can proceed to create INSERT INTO SQL statements to insert the embeddings and their associated text into the database.

In this step, I load the relic information from a JSON file and process it.

import json

# Load your relic set data from a JSON file
with open('/content/hsr-relics.json', 'r') as f:
    relic_data = json.load(f)

# Prepare data
relic_info_data = [
    {"id": relic['name'], "text": relic['two_piece'] + " " + relic['four_piece']} # Combine descriptions
    for relic in relic_data
]

The relic_info_data will then be passed to the following function to generate the INSERT INTO statements.

# Function to generate INSERT INTO statements with vectors
def generate_insert_statements(data):
    # Initialise list to store SQL statements
    insert_statements = []

    for record in data:
        # Extracting text and id from the record
        id = record.get('id')
        text = record.get('text')

        # Generate the embedding for the text
        embedding = generate_embedding(text)

        # Convert the embedding to a list
        embedding_list = embedding.tolist()

        # Create the SQL INSERT INTO statement
        sql_statement = f"""
        INSERT INTO embeddings (id, vector, text)
        VALUES (
          '{id.replace("'", "''")}', 
          ARRAY{embedding_list}, 
          '{text.replace("'", "''")}')
        ON CONFLICT (id) DO UPDATE
        SET vector = EXCLUDED.vector, text = EXCLUDED.text;
        """

        # Append the statement to the list
        insert_statements.append(sql_statement)

    return insert_statements

The embeddings of the relic sets are successfully inserted to the database.

How It All Fits Together: Query the Database

Once we have stored the vector embeddings of all the relic sets in our PostgreSQL database, the next step is to find the relic sets that are most similar to a given character’s relic needs.

Just like what we have done for storing relic set embeddings, we need to generate an embedding for the query describing the character’s relic needs. This is done by passing the query through the model as demonstrated in the following code.

def query_similar_embeddings(query_text):
    query_embedding = generate_embedding(query_text)

    return query_embedding.tolist()

The generated embedding is an array of 384 numbers. We simply use this array in our SQL query below.

SELECT id, text, vector <=> '[<embedding here>]' AS distance
FROM embeddings
ORDER BY distance
LIMIT 3;

The key part of the query is the <=> operator. This operator calculates the “distance” between two vectors based on cosine similarity. In our case, it measures how similar the query embedding is to each stored embedding. The smaller the distance, the more similar the embeddings are.

We use LIMIT 3 to get the top 3 most similar relic sets.

Test Case: Finding Relic Sets for Gallagher

Gallagher is a Fire and Abundance character in HSR. He is a sustain unit that can heal allies by inflicting a debuff on the enemy.

According to the official announcement, Gallagher is a healer. (Image Source: Honkai: Star Rail YouTube)

The following screenshot shows the top 3 relic sets which are closely related to a HSR character called Gallagher using the query “Suggest the best relic sets for this character: Gallagher is a Fire and Abundance character in Honkai: Star Rail. He can heal allies.”

The returned top 3 relic sets are indeed recommended for Gallagher.

One of the returned relic sets is called the “Thief of Shooting Meteor”. It is the official recommended relic set in-game, as shown in the screenshot below.

Gallagher’s official recommended relic set.

Future Work

In our project, we will not be implementing indexing because currently in HSR, there are only a small number of relic sets. Without an index, PostgreSQL will still perform vector similarity searches efficiently because the dataset is small enough that searching through it directly will not take much time. For small-scale apps like ours, querying the vector data directly is both simple and fast.

However, when our dataset grows larger in the future, it is a good idea to explore indexing options, such as the ivfflat index, to speed up similarity searches.

DEV Community