<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Robin Lee</title>
    <description>The latest articles on DEV Community by Robin Lee (@sl5035).</description>
    <link>https://dev.to/sl5035</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1083474%2F1a5f7cdb-8629-4aef-a299-cdd849d179be.jpeg</url>
      <title>DEV Community: Robin Lee</title>
      <link>https://dev.to/sl5035</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sl5035"/>
    <language>en</language>
    <item>
      <title>LLMs and Vector Databases</title>
      <dc:creator>Robin Lee</dc:creator>
      <pubDate>Mon, 22 May 2023 10:03:54 +0000</pubDate>
      <link>https://dev.to/sl5035/llms-and-vector-databases-40jo</link>
      <guid>https://dev.to/sl5035/llms-and-vector-databases-40jo</guid>
<description>&lt;p&gt;About a month ago, the vector database company &lt;a href="https://www.prnewswire.com/news-releases/weaviate-raises-50-million-series-b-funding-to-meet-soaring-demand-for-ai-native-vector-database-technology-301803296.html"&gt;Weaviate&lt;/a&gt; landed $50 million in Series B funding. About three weeks ago, &lt;a href="https://www.businessinsider.com/vector-database-startup-chroma-raises-seed-funding-generative-artificial-intelligence-2023-4"&gt;Chroma&lt;/a&gt;, an open-source project with only 5k GitHub stars, raised $18 million for its embeddings database. And about two weeks ago, &lt;a href="https://techcrunch.com/2023/04/27/pinecone-drops-100m-investment-on-750m-valuation-as-vector-database-demand-grows/"&gt;Pinecone&lt;/a&gt; announced a $100 million Series B investment at a $750 million post-money valuation. Naturally, a question arises: what is a vector database?&lt;/p&gt;

&lt;p&gt;To talk about vector databases, we first need to know what a vector is. A vector is just an array of numbers. However, vectors can represent more complex objects such as words, sentences, images, or audio files as points in a continuous, high-dimensional space; this representation is called an embedding.&lt;/p&gt;
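&lt;p&gt;To make this concrete, here is a minimal sketch of how similarity between embeddings is usually measured. The vectors below are made-up toy values, not the output of a real embedding model:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # angle-based similarity: 1.0 means identical direction, near 0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy 4-dimensional "embeddings" (real models output hundreds of dimensions;
# these numbers are made up purely for illustration)
king = [0.9, 0.8, 0.1, 0.2]
queen = [0.88, 0.82, 0.12, 0.21]
banana = [0.1, 0.05, 0.95, 0.9]

print(cosine_similarity(king, queen))   # high: similar concepts
print(cosine_similarity(king, banana))  # low: unrelated concepts
```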

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5KJoOIJd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ezfhdrqbc1vzj14izinh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5KJoOIJd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ezfhdrqbc1vzj14izinh.png" width="800" height="597"&gt;&lt;/a&gt;&lt;br&gt;Vector embeddings
  &lt;/p&gt;

&lt;p&gt;Embeddings place words with similar meanings close together, and they do the same for similar features in virtually any other data type. These embeddings can then power recommendation systems, search engines, and even text generation such as ChatGPT. But once you have your embeddings, the real question becomes: where do you store them, and how do you query them?&lt;/p&gt;

&lt;p&gt;That's where vector databases come in. In a relational database, you have rows and columns. In a document database, you have documents and collections. In a vector database, you have arrays of numbers clustered together by similarity, which can later be queried with ultra-low latency, making vector databases an ideal choice for AI-driven applications.&lt;/p&gt;
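&lt;p&gt;Under the hood, the core operation is a nearest-neighbor search over stored vectors. Here is a deliberately naive, brute-force sketch of that idea; real vector databases add approximate-nearest-neighbor indexes to stay fast at scale, and the class and data here are purely illustrative:&lt;/p&gt;

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class TinyVectorStore:
    """Brute-force stand-in for a vector database (illustration only)."""

    def __init__(self):
        self.rows = []  # (id, vector, metadata) triples

    def upsert(self, vec_id, vector, metadata=None):
        self.rows.append((vec_id, vector, metadata))

    def query(self, vector, top_k=3):
        # score every stored vector against the query, best first
        scored = [(cosine(vector, v), vec_id, meta) for vec_id, v, meta in self.rows]
        scored.sort(reverse=True)
        return scored[:top_k]

store = TinyVectorStore()
store.upsert("a", [1.0, 0.0], {"text": "dogs"})
store.upsert("b", [0.9, 0.1], {"text": "puppies"})
store.upsert("c", [0.0, 1.0], {"text": "stock prices"})
results = store.query([1.0, 0.05], top_k=2)
print(results)
```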

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A34VLwSN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vq6l67r8ksgetdd5w5l5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A34VLwSN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vq6l67r8ksgetdd5w5l5.png" width="800" height="235"&gt;&lt;/a&gt;&lt;br&gt;Relational vs. Document databases
  &lt;/p&gt;

&lt;p&gt;Relational databases such as PostgreSQL have extensions like pgvector to support this type of functionality, and Redis offers first-class vector support through RediSearch. A bunch of new native vector databases are popping up, too. Weaviate and Milvus are open-source options written in Go. Chroma, which runs on ClickHouse under the hood, is another open-source option. Another extremely popular option is Pinecone, but it is not open source.&lt;/p&gt;




&lt;p&gt;Let's jump into some code to see what this looks like. I will be using Pinecone and Python. Following the official guide, I will implement an abstractive question answering program using the ELI5 BART model. Abstractive question answering focuses on generating multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using that information to synthesize answers.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Our source data will be taken from the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. We will only utilize 5,000 passages that include "History" in the "section title" column &lt;em&gt;(due to memory constraints; you can use the complete dataset if you want, and the official guide used 50,000 passages)&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
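&lt;p&gt;As a rough sketch of that selection step, here is the kind of filter loop that runs over the streamed dataset. The records below are hand-written stand-ins, not real Wiki Snippets rows, and the tiny limit is only to keep the sketch readable:&lt;/p&gt;

```python
# toy records standing in for streamed Wiki Snippets entries; real records
# carry these same fields, among others
stream = [
    {"article_title": "Electric power system", "section_title": "History", "passage_text": "..."},
    {"article_title": "Banana", "section_title": "Cultivation", "passage_text": "..."},
    {"article_title": "Rome", "section_title": "History of Rome", "passage_text": "..."},
]

total = 2  # the post keeps 5,000 passages; a tiny limit keeps this sketch short
docs = []
for record in stream:
    # keep only passages whose section title mentions "History"
    if "History" in record["section_title"]:
        docs.append(record)
    if len(docs) == total:
        break

print(len(docs))
```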

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# create a pandas dataframe with the documents we extracted
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D1nQ1Jbw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2gitdrrjv16gmcl410c5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D1nQ1Jbw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2gitdrrjv16gmcl410c5.png" alt="A sneak peek at our source dataset" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To build our vector index, we must first establish a connection with Pinecone. Then, we create a new index. An index is the highest-level organizational unit of vector data in Pinecone. It accepts and stores vectors, serves queries over the vectors it contains, and performs other vector operations over its contents. We specify the metric type as "cosine" and the dimension as 768 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 768-dimensional vectors. The other available metrics are "euclidean" and "dotproduct."&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pinecone&lt;/span&gt;

&lt;span class="c1"&gt;# connect to pinecone environment
&lt;/span&gt;&lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"YOUR_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"us-central1-gcp"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;index_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"abstractive-question-answering"&lt;/span&gt;

&lt;span class="c1"&gt;# check if the abstractive-question-answering index exists
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;list_indexes&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# create the index if it does not exist
&lt;/span&gt;    &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"cosine"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# connect to abstractive-question-answering index we created
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. For the generator, we will use ELI5 BART, a sequence-to-sequence model trained on the "Explain Like I'm 5" (ELI5) dataset. Sequence-to-sequence models take a text sequence as input and produce a different text sequence as output. You can download both models from the &lt;a href="https://huggingface.co/"&gt;Hugging Face hub&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="c1"&gt;# set device to GPU if available
&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'cuda'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s"&gt;'cpu'&lt;/span&gt;
&lt;span class="c1"&gt;# load the retriever model from huggingface model hub
&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"flax-sentence-embeddings/all_datasets_v3_mpnet-base"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;retriever&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PDYENcWz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6i5cp6is2gs2b6yarf47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PDYENcWz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6i5cp6is2gs2b6yarf47.png" alt="Sentence Transformer" width="800" height="85"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BartTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BartForConditionalGeneration&lt;/span&gt;

&lt;span class="c1"&gt;# load bart tokenizer and model from huggingface
&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BartTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'vblagoje/bart_lfqa'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;generator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BartForConditionalGeneration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'vblagoje/bart_lfqa'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Next, we upload our data to the Pinecone index using the &lt;code&gt;index.upsert()&lt;/code&gt; command. If the operation is successful, you should see the following output.&lt;/p&gt;
&lt;/blockquote&gt;
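&lt;p&gt;As a sketch of how that upload is usually organized: the embeddings are upserted in small batches rather than one giant call. The helper below only builds the batches; in the real script, each batch would be passed to &lt;code&gt;index.upsert()&lt;/code&gt;. The vectors here are dummies:&lt;/p&gt;

```python
def to_batches(vectors, batch_size=64):
    # yield consecutive slices; each slice would be one index.upsert() call
    for i in range(0, len(vectors), batch_size):
        yield vectors[i:i + batch_size]

# dummy (id, embedding, metadata) tuples shaped like an upsert payload
vectors = [(str(i), [0.0] * 768, {"passage_text": "..."}) for i in range(130)]
batches = list(to_batches(vectors))
print(len(batches), len(batches[-1]))
```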

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P1cwgLk---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chvu166srupdr89qt2dh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P1cwgLk---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chvu166srupdr89qt2dh.png" alt="Uploading data" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Then let's write some helper functions to retrieve context passages from the Pinecone index and to format the query the way the generator expects its input.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_pinecone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# generate embeddings for the query
&lt;/span&gt;    &lt;span class="n"&gt;xq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# search pinecone index for context passage with the answer
&lt;/span&gt;    &lt;span class="n"&gt;xc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;include_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;xc&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# extract passage_text from Pinecone search result and add the &amp;lt;P&amp;gt; tag
&lt;/span&gt;    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;P&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'metadata'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'passage_text'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# concatenate all context passages
&lt;/span&gt;    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# concatenate the query and context passages
&lt;/span&gt;    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;

&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"when was the first electric power system built?"&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_pinecone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yEru-LjE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o9f7rddb187k9lcl5pth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yEru-LjE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o9f7rddb187k9lcl5pth.png" alt="Query result" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;
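&lt;p&gt;To see what &lt;code&gt;format_query&lt;/code&gt; produces without a live index, here is the same formatting logic applied to a hand-mocked result. The matches below imitate the shape of the &lt;code&gt;index.query()&lt;/code&gt; output, and the passage tag is built from character codes only to keep the snippet self-contained:&lt;/p&gt;

```python
P_TAG = chr(60) + "P" + chr(62)  # the passage marker tag the ELI5 BART generator expects

def format_query_demo(query, context):
    # same logic as format_query above, usable without a live Pinecone index
    passages = [P_TAG + " " + m["metadata"]["passage_text"] for m in context]
    return "question: " + query + " context: " + " ".join(passages)

# hand-mocked matches shaped like a Pinecone query result
matches = [
    {"metadata": {"passage_text": "Pearl Street Station began service in 1882."}},
    {"metadata": {"passage_text": "It was built by the Edison Illuminating Company."}},
]

prompt = format_query_demo("when was the first electric power system built?", matches)
print(prompt)
```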

&lt;blockquote&gt;
&lt;p&gt;Lastly, we'll write a helper function that generates the answer given a query.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pprint&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pprint&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# tokenize the query to get input_ids
&lt;/span&gt;    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"pt"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# use generator to predict output ids
&lt;/span&gt;    &lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"input_ids"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;num_beams&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# use tokenizer to decode the output ids
&lt;/span&gt;    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch_decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clean_up_tokenization_spaces&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;We use this function to test different queries as shown below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OjPf3Itu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i6ug3fnwmib9nx6t4vdd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OjPf3Itu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i6ug3fnwmib9nx6t4vdd.png" alt="Different queries and their results" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note that the answers are not complete since we only utilized 5,000 passages. You can adjust the number of passages and observe the results.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;The real reason these databases are so hot right now is that they can extend LLMs with long-term memory. You start with a general-purpose model like OpenAI's GPT-4, Meta's LLaMA, or Google's LaMDA, then provide your own data in a vector database. When the user makes a prompt, you can query relevant documents from your own database to update the context, which customizes the final response. You can also retrieve historical conversations the same way, giving the AI long-term memory. Vector databases also integrate with tools like LangChain that chain multiple LLM calls together.&lt;/p&gt;
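&lt;p&gt;A minimal sketch of that retrieval-augmented flow, with the vector-database lookup already done and only the prompt assembly shown; the prompt format here is made up, since it is entirely up to the application:&lt;/p&gt;

```python
def build_prompt(question, retrieved_passages, history):
    # stitch vector-database search results and past turns into one context
    context = "\n".join(retrieved_passages)
    memory = "\n".join(history)
    return (
        "Answer the question using the context and conversation history.\n"
        "Context:\n" + context + "\n"
        "History:\n" + memory + "\n"
        "Question: " + question
    )

prompt = build_prompt(
    "What did we decide about the launch date?",
    ["Doc 17: launch moved to June 3 after the review."],   # from the vector DB
    ["user: let's revisit the launch date",                  # long-term memory
     "assistant: noted, checking the docs"],
)
print(prompt)
```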

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Some parts transcribed from Fireship's video: Vector databases are so hot right now. WTF are they?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Official abstractive-question-answering guide: &lt;a href="https://docs.pinecone.io/docs/abstractive-question-answering"&gt;https://docs.pinecone.io/docs/abstractive-question-answering&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>database</category>
      <category>machinelearning</category>
      <category>news</category>
    </item>
    <item>
      <title>Twitter's Open-Source Recommendation Algorithm</title>
      <dc:creator>Robin Lee</dc:creator>
      <pubDate>Thu, 18 May 2023 06:02:32 +0000</pubDate>
      <link>https://dev.to/sl5035/twitters-open-source-recommendation-algorithm-2c08</link>
      <guid>https://dev.to/sl5035/twitters-open-source-recommendation-algorithm-2c08</guid>
      <description>&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--s4GS0ue5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.cms-twdigitalassets.com/content/dam/blog-twitter/engineering/en_us/main-template-assets/Eng_EXPLORE_Pink.png.twimg.768.png" height="288" class="m-0" width="768"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm" rel="noopener noreferrer" class="c-link"&gt;
          Twitter's Recommendation Algorithm
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Twitter aims to deliver you the best of what’s happening in the world right now. This blog is an introduction to how the algorithm selects Tweets for your timeline.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--NaYBMdGN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.twitter.com/etc/designs/blog-twitter/public/img/favicon.ico" width="48" height="48"&gt;
        blog.twitter.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Less than seven months ago, Elon Musk paid 44 billion dollars for Twitter. Since then, he has fired half the company and given blue check marks to everyone. Twitter is now worth only 20 billion dollars. Many users have moved to Mastodon, and the NYT lost its blue check. It looks like Twitter is collapsing. In reality, however, Elon is playing a long game of chess against mainstream news media like Fox News and CNN. He is trying to take their advertisers by making Twitter the future platform for all journalism.&lt;/p&gt;

&lt;p&gt;Twitter open-sourced part of its recommendation algorithm about a month ago. Although it is real production code, it is not 100 percent of the code base, so it is mainly useful for research and transparency. The code is mostly written in Scala, a JVM language similar to Java but more concise. Twitter was originally built with Ruby on Rails, but the company moved away from it over a decade ago.&lt;/p&gt;

&lt;p&gt;If you take a closer look at some of the files in the repo, you can spot some extremely interesting implementation details. Take the ranking parameters, for example &lt;em&gt;(getLinearRankingParams from the EarlybirdTensorflowBasedSimilarityEngine.scala file has been deprecated as of Apr 05, 2023)&lt;/em&gt;.&lt;/p&gt;







&lt;p&gt;We have a bunch of ranking parameters, each with a default value. Retweets provide a 20x boost while likes provide a 30x boost. Images and videos also provide a small boost. Not surprisingly, you also get a boost for being a paying Twitter Blue member.&lt;/p&gt;

&lt;p&gt;On the other hand, a tweet can get a negative boost if its account has a lot of mutes, blocks, or spam reports. Spelling errors and made-up words will also earn you a debuff.&lt;/p&gt;

&lt;p&gt;Offensive, spamming, and NSFW tweets can also get a debuff while trending, verified, and media tweets get a boost. There is also a long list of topics that won't be amplified: anything that has been flagged as misinformation, harassment, etc.&lt;/p&gt;
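&lt;p&gt;Conceptually, you can think of these parameters as multipliers applied to a tweet's base score. The sketch below is a toy model with made-up signal names and simplified math, not Twitter's actual Scala logic:&lt;/p&gt;

```python
# illustrative multipliers loosely mirroring the defaults discussed above;
# the names and structure here are invented for the sketch
BOOSTS = {
    "retweet": 20.0,
    "like": 30.0,
    "has_media": 2.0,
    "is_blue_member": 4.0,
    "spam_reports": 0.5,  # values below 1.0 act as debuffs
    "offensive": 0.1,
}

def rank_score(base, signals):
    # apply each active signal's multiplier to the base score
    score = base
    for name in signals:
        score *= BOOSTS[name]
    return score

print(rank_score(1.0, ["retweet", "has_media"]))  # boosted
print(rank_score(1.0, ["offensive"]))             # debuffed
```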




&lt;p&gt;How does Twitter actually select the tweets to display on our home page using these parameters, then? We can break the recommendation pipeline into three parts. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_qHtAMGB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rm3lmoqct13u6va03sci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_qHtAMGB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rm3lmoqct13u6va03sci.png" width="800" height="257"&gt;&lt;/a&gt;&lt;br&gt;How the twitter recommendation pipeline works
  &lt;/p&gt;

&lt;p&gt;The first step is to find a pool of roughly 1,500 tweets you might be interested in, using a technique called &lt;strong&gt;candidate sourcing&lt;/strong&gt;. Twitter sources candidates in three ways. The first pool, which makes up the majority of your home timeline, comes from the accounts you follow: your in-network source. For this, Twitter uses a model called &lt;a href="https://www.ueo-workshop.com/wp-content/uploads/2014/04/sig-alternate.pdf"&gt;Realgraph&lt;/a&gt;, which predicts the likelihood of engagement between two users. The second pool comes from accounts you don't follow yet, your out-of-network source, using two concepts: social graphs and embedding spaces. To select relevant tweets from your social graph, Twitter uses &lt;a href="https://www.vldb.org/pvldb/vol9/p1281-sharma.pdf"&gt;GraphJet&lt;/a&gt;, a graph processing engine that maintains and traverses a real-time interaction graph between users and tweets. For most of your out-of-network tweets, however, Twitter uses an algorithm called &lt;a href="https://dl.acm.org/doi/10.1145/3394486.3403370"&gt;SimClusters&lt;/a&gt; to discover communities anchored by clusters of influential users in an embedding space.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bwr__zgu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uxnvo25qidaofu7l1wzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bwr__zgu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uxnvo25qidaofu7l1wzo.png" width="800" height="524"&gt;&lt;/a&gt;&lt;br&gt;Communities in an embedding space grouped by the SimClusters algorithm
  &lt;/p&gt;

&lt;p&gt;From there, Twitter ranks that pool of tweets with a 48-million-parameter neural network. Lastly, it filters the ranked tweets using static rules, removing, for example, tweets from accounts you've blocked or muted.&lt;/p&gt;
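&lt;p&gt;Put together, the three stages look roughly like this. This is a toy sketch: in reality the ranker is a neural network and the filtering rules are far more elaborate:&lt;/p&gt;

```python
def recommend(candidates, score, is_filtered, limit=2):
    # stage 2: rank the sourced candidates (stage 1 built `candidates`)
    ranked = sorted(candidates, key=score, reverse=True)
    # stage 3: drop tweets failing static rules (blocked/muted authors, etc.)
    return [t for t in ranked if not is_filtered(t)][:limit]

# toy candidate pool with pre-computed scores
tweets = [
    {"id": 1, "score": 0.9, "author": "blocked_account"},
    {"id": 2, "score": 0.7, "author": "friend"},
    {"id": 3, "score": 0.4, "author": "celebrity"},
]
blocked = {"blocked_account"}
timeline = recommend(tweets, lambda t: t["score"], lambda t: t["author"] in blocked)
print([t["id"] for t in timeline])
```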




&lt;p&gt;Why would Elon do this? Why would he release his trade secrets to the public? Well, it kind of makes Twitter the Linux of social media: the public can identify parts of the algorithm that are unfair and address them in the open.&lt;/p&gt;

&lt;p&gt;In my opinion, it is mostly a marketing move to build trust. It no longer feels like Twitter is run by a mysterious figure who can de-boost content without any transparency. There is also a huge opportunity here: trust in the mainstream media has fallen so low that many people already use Twitter to consume the news. And although Twitter is currently losing money, the company has talked about compensating content creators just like YouTube and other platforms do. When that happens, journalists could potentially make a living on Twitter and put their best content there.&lt;/p&gt;

&lt;p&gt;Elon knows Twitter Blue is never going to make Twitter much money; rather, it is designed to uplift independent creators while embarrassing the establishment. The blue checks are now irrelevant, and by open-sourcing the code, Twitter is laying the groundwork to become the "fair and balanced," most trusted name in news. This may also force other social media platforms to become more transparent.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Some parts transcribed from Fireship's video: Twitter algorithm open-sourced...&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>twitter</category>
      <category>machinelearning</category>
      <category>news</category>
      <category>scala</category>
    </item>
  </channel>
</rss>
