John Napiorkowski

Using PostgreSQL pg_vector for AI: Part 2, Using Vectors for Natural Language Processing

Introduction

In the previous post we learned how to use pg_vector in SQL to search for similarities. In this follow-up we will use OpenAI to create vector embeddings for a list of meals, and then write a short script that uses natural language processing to query that list.

Using OpenAI to Create Vectors

When creating multidimensional vectors for your source content, we previously said the hard part was defining the dimensions and populating them with good values. What if there were a way to automatically create generic vector data for any arbitrary set of things, good enough to perform all sorts of similarity searches? Well, you're in luck, because that is exactly what OpenAI's embeddings API can do!

I'm not going to cover how to go to the OpenAI website, create an account, and get an API key, since that's well covered by many other resources (and it's trivial to figure out even without ChatGPT's help). Just a warning: the embeddings API isn't free, so you'll need to set up a billing account, but it's pretty cheap and the cost of fully replicating this tutorial should be negligible. Let's assume for the moment you've done that and you have an API key. Also let's assume we are building on the previous example, which means you have a PostgreSQL database with pg_vector installed. I'll add one more assumption: you have a command-line environment and an OS with some version of Perl installed, and you can use CPAN to install two modules that aren't in the standard library. If you need help with that, chat me up in the comments.
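
For example, assuming you use the cpanm client (the plain cpan shell works too), installing them looks something like this. Note you'll also want the DBD::Pg driver so DBI can talk to PostgreSQL, if it isn't already on your system:

$ cpanm DBI DBD::Pg AI::Embedding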

OK, so first let's choose a set of things to create vectors for. This time let's go for something much more complicated than colors: a list of meals from various times of the day and various cultures. I used ChatGPT to generate this list, but you can probably find many others. Here's a sample of what the list of meals looks like:

French Croissant with Jam and Coffee
American Pancakes with Maple Syrup
Japanese Tamago Kake Gohan (Rice with Raw Egg)
Indian Masala Dosa with Coconut Chutney
Mexican Huevos Rancheros

The full list and loader script can be found here.

Now, we could, with a lot of tedium, build a dimensional graph for these meals ourselves, probably including things like culture of origin, suitability for breakfast, lunch, or dinner, use (or avoidance) of certain types of meat, and so on. But that would be a drag. Let's instead use the OpenAI embeddings API to do it for us. This API will parse the name of the meal and return a huge set of dimensions (1,536 of them!) categorizing that meal in various ways based on its internal language model. As you will see later, this is good enough for us to do all sorts of generic meal searches. Let's look at the loader script:

use warnings;
use strict;

use DBI;
use AI::Embedding;

# Connect to the PostgreSQL database from the previous post
# (fill in your own connection details)
my $dbh = DBI->connect(
  'DBI:Pg:dbname=[YOUR DB]',
  '[USERNAME]',
  '[PASSWORD]',
) || die "Can't connect to DB";

# AI::Embedding wraps the OpenAI embeddings API
my $embedding_api = AI::Embedding->new(
    api => 'OpenAI',
    key => '[YOUR KEY HERE]'
);

my $insert_stmt = $dbh->prepare('INSERT INTO meals (name, vector_info) VALUES (?, ?)');

foreach my $meal (<DATA>) {
  chomp($meal);

  # embedding() returns the vector as a comma-separated string of floats
  my $embedding = $embedding_api->embedding($meal);
  die $embedding_api->error unless $embedding_api->success;

  # pg_vector expects the vector literal wrapped in square brackets
  $insert_stmt->execute($meal, "[${embedding}]")
    or die "Couldn't execute statement: $DBI::errstr";
}

$insert_stmt->finish();
$dbh->disconnect();

__DATA__
French Croissant with Jam and Coffee
American Pancakes with Maple Syrup
Japanese Tamago Kake Gohan (Rice with Raw Egg)
Indian Masala Dosa with Coconut Chutney
Mexican Huevos Rancheros
[SEE GIST LINKED ABOVE FOR FULL SCRIPT]

Here's the database definition for the table this script will populate:

CREATE TABLE meals (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    vector_info VECTOR(1536) NOT NULL
);
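
This assumes the pg_vector extension is already enabled from the first post; if it isn't, enable it before creating the table:

CREATE EXTENSION IF NOT EXISTS vector;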

As you can see, we are expecting OpenAI to return a ton of vector dimensions. The script loops over each meal name, sends it to the embeddings API, retrieves the full 1,536-dimension embedding, and saves it into the database as a vector. As I mentioned, this API isn't free, but for the 150 meals you are processing it won't break the bank.
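
Once the loader finishes, a quick sanity check is worth running; vector_dims() ships with pg_vector and should report 1536 for every row:

SELECT count(*) AS meals, vector_dims(vector_info) AS dims
FROM meals
GROUP BY 2;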

After you've run the script, we can query the database much as we did in the first post. Let's find some similar meals:

SELECT name
FROM meals
WHERE name <> 'Mexican Huevos Rancheros'
ORDER BY vector_info <=> (
  SELECT vector_info
  FROM meals WHERE name = 'Mexican Huevos Rancheros'
)
LIMIT 5;

                    name                    
 --------------------------------------------
 Mexican Chilaquiles (Tortilla Casserole)
 Mexican Enchiladas with Guacamole and Rice
 Mexican Chicken Enchiladas with Red Sauce
 Mexican Tamales
 Mexican Guacamole with Tortilla Chips
(5 rows)
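
The <=> operator here is pg_vector's cosine distance. If you're curious about the actual distance values behind that ranking, you can select them too; this is just a variation on the query above:

SELECT name, vector_info <=> (
  SELECT vector_info FROM meals WHERE name = 'Mexican Huevos Rancheros'
) AS cosine_distance
FROM meals
WHERE name <> 'Mexican Huevos Rancheros'
ORDER BY cosine_distance
LIMIT 5;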

Since my test set only has 150 meals in it, the quality of similarity isn't amazing, but it's good enough to get the idea. Now, this is useful in and of itself; I think you can see how you could use this for creating reports and so forth on your existing database. But let's make it a bit more interesting. I want to write a command-line application that accepts a natural language string, parses it, and returns a good matching meal based on what I asked. This is actually easy to do. Basically, I'm going to prompt the user to enter a sentence, parse that sentence using the exact same embeddings API to get its embedding, and then use SQL to match that against the list of meals. Since we process the meal list and the user query with the same embeddings API, both get the same set of generically useful vector embeddings, because both are using the same underlying language model. Let's see what that looks like:

use strict;
use warnings;

use DBI;
use AI::Embedding;

my $dbh = DBI->connect(
  'DBI:Pg:dbname=[YOUR DB]',
  '[USERNAME]',
  '[PASSWORD]',
) || die "Can't connect to DB";

my $embedding_api = AI::Embedding->new(
    api => 'OpenAI',
    key => '[YOUR KEY HERE]'
);

my @context = ();
my $sth = $dbh->prepare('SELECT name FROM meals ORDER BY vector_info <=> ? LIMIT 1');
while (1) {
    # Prompt the user for input
    print "Enter your meal query: ";
    my $query = <STDIN>;
    chomp $query;

    # create the embeddings via the API
    push @context, $query;
    my $prompt = "Answer the following question using this context: @{[ join ',', @context]}";

    my $embeddings = $embedding_api->embedding($prompt);
    die $embedding_api->error unless $embedding_api->success;

    # execute the SQL
    $sth->execute("[${embeddings}]") or die "Couldn't execute statement: $DBI::errstr";

    # Fetch and display the results
    print "Results of the SELECT query:\n";
    while (my $row = $sth->fetchrow_hashref()) {
      print "$row->{name}\n";
    }

    # Prompt the user to continue or quit
    print "Enter 'quit' to exit or press Enter to continue: ";
    my $input = <STDIN>;
    chomp $input;

    # If the user enters "quit", break the loop and exit the program
    last if lc $input eq 'quit';

    # Otherwise, continue to the next iteration of the loop
}
$sth->finish();
$dbh->disconnect();

print "Goodbye! Exiting the program.\n";

Again, remember you need to install the two Perl modules, DBI and AI::Embedding, for this to work. Let's run this from the command line and see what we get:

$ perl nlp_meals.pl

Enter your meal query: I want a lunch with beef
Results of the SELECT query:
Peruvian Lomo Saltado (Beef Stir-Fry)
Enter 'quit' to exit or press Enter to continue:

Enter your meal query: something French with beef
Results of the SELECT query:
French Boeuf Bourguignon (Beef Stew)
Enter 'quit' to exit or press Enter to continue: 

Enter your meal query: an indian meal that's easy to make
Results of the SELECT query:
Indian Chana Masala (Chickpea Curry)
Enter 'quit' to exit or press Enter to continue:

Enter your meal query: something an american that hated the french would hate the most to eat
Results of the SELECT query:
French Escargot (Snails)
Enter 'quit' to exit or press Enter to continue:

Enter your meal query: something an American would love for breakfast, something sweet.
Results of the SELECT query:
American Pancakes with Maple Syrup
Enter 'quit' to exit or press Enter to continue: quit

Goodbye! Exiting the program.

One thing you'll note is that I limited the search to just one result. This is because with only 150 meals in the list you very rapidly get poor results. With a much bigger list you could return more.
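
If you do want more rows, one way to keep quality up is to cap the allowed distance. This is only a sketch; the 0.5 cutoff is a made-up number you'd tune for your own data, and note that the same embedding gets bound to both placeholders:

SELECT name, vector_info <=> ? AS cosine_distance
FROM meals
WHERE vector_info <=> ? < 0.5
ORDER BY cosine_distance
LIMIT 5;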

And that's how you can use OpenAI's embedding API to do natural language processing on your custom data!

One More Thing

One neat thing about ChatGPT is its seemingly tremendous ability to maintain the context of a conversation. In this script we try to replicate that by simply growing the prompt based on previous queries. For our simple dataset it works OK:

$ perl nlp_meals.pl 
Enter your meal query: something with chicken
Results of the SELECT query:
Chinese General Tso's Chicken
Enter 'quit' to exit or press Enter to continue:

Enter your meal query: I prefer french food
Results of the SELECT query:
French Coq au Vin (Chicken in Red Wine Sauce)

Enter your meal query: sorry lets make that Japanese
Results of the SELECT query:
Japanese Yakitori (Grilled Skewered Chicken)

Stuff I Left Out

Please note this is a toy script; to make it robust you'll need to write more code around API error checking. Also, I'm just continuously growing the context, which at some point will get bigger than what the API will allow. At that point you'll either have to roll off old context or perhaps use the API to summarize some of it so it takes less space. My basic prompt could use a lot of tweaks as well. Again, this is just to show you the ropes.
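
Here's a minimal sketch of the roll-off approach, using a hypothetical $MAX_CONTEXT cap on how many past queries we keep:

my $MAX_CONTEXT = 10;  # hypothetical cap; tune for your token budget

push @context, $query;

# Drop the oldest queries once we exceed the cap
shift @context while @context > $MAX_CONTEXT;

my $prompt = "Answer the following question using this context: @{[ join ',', @context ]}";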

Your custom data is likely to be more than a simple list of meals; it might be long strings of text, or data in table or JSON format. You need to remember that OpenAI APIs have token length limits, which means that for very long content you will need to chunk it or summarize it in some manner. For now I'll leave that as an exercise for the reader, but if there's enough interest I'll return with a bigger example.
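
For example, here's a naive sketch that splits long text on whitespace into chunks of a hypothetical $CHUNK_WORDS size and embeds each chunk as its own row, reusing $embedding_api and $insert_stmt from the loader script above. A real implementation would count actual API tokens rather than words:

my $CHUNK_WORDS = 200;  # hypothetical size; tune to stay under the token limit

sub chunk_text {
    my ($text) = @_;
    my @words = split /\s+/, $text;
    my @chunks;
    push @chunks, join ' ', splice(@words, 0, $CHUNK_WORDS) while @words;
    return @chunks;
}

# Each chunk gets its own embedding and its own row in the table
foreach my $chunk (chunk_text($long_document)) {
    my $embedding = $embedding_api->embedding($chunk);
    die $embedding_api->error unless $embedding_api->success;
    $insert_stmt->execute($chunk, "[${embedding}]")
      or die "Couldn't execute statement: $DBI::errstr";
}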

Looking for a programmer to help train your team on what generative AI can do for your business, or have an AI project in mind but not sure how to start? Ping me at jjn1056@gmail.com; I'm an experienced consultant who can program and teach!
