A blog post handcrafted for developers by John Underwood
The current renaissance of information retrieval is exciting to those of us who have spent some years in the “search space.” The introduction of LLMs and embedding models into the mainstream of data engineering and information systems has both given us new tools, such as LLM judges and cross-encoding re-rankers, and shone a spotlight on existing techniques and concepts like query-intent understanding and expansion. As others have written (far more competently than I), much of this is not new, although our selection of tools has increased. So in this blog post, I won’t be aiming to introduce some novel technique or try to explain exactly what “semantic search” ought to mean. Instead, I’ll share a frictionless approach to building a semantic search application on “easy mode” for developers who are pressed for time but can’t compromise too far on retrieval capabilities. If that’s you, read on.
In this blog post, we’ll look at two useful features of MongoDB Atlas Vector Search which enable this “easy mode”: View support and auto-embedding (now in private preview).
A brief history of MongoDB Atlas Vector Search
Before we dive in, a brief history of Atlas Vector Search… In the beginning was $vectorSearch—the ability to store vector embeddings in your operational database and index them using HNSW for nearest neighbor similarity search. This attracted lots of interesting workloads onto the MongoDB Atlas platform and at the same time highlighted the key areas for development. In quick succession, MongoDB added:
- binData representations of vectors in the database to provide better compression.
- Binary and scalar quantization for more cost-effective indexes at scale.
- Dedicated search nodes for workload isolation and cost-optimised performance.
- Hybrid search operators for better relevance tuning.
- View support and auto-embeddings.
The key idea that’s been driving all these developments is providing a frictionless experience to application developers, at scale and at a competitive cost. In the rest of this short post, I’ll show you how some of these features come together to serve a real use case, outlining the most frictionless path to a semantic search app that’s available to developers today.
You’ve got your embedding model, now what?
Some of the early hype around vector databases and embedding models seemed to imply that once you had chosen an embedding model for producing dense vector representations of your data, and had picked a performant vector DB, you were all set to build your semantic search feature. But of course, buried in the implementation details is a whole series of further questions. One of the most obvious is: “What should I embed?” Now, it might seem like a nonsensical question at first—surely the answer is “your data.” But depending on your data model, this might not be as straightforward as it appears.
How it started…
Consider a movie dataset like the sample_mflix data that ships with MongoDB. A movie document looks something like this:
{
"_id": {
"$oid": "573a1390f29313caabcd42e8"
},
"plot": "A group of bandits stage a brazen train hold-up, only to find a determined posse hot on their heels.",
"cast": [
"A.C. Abadie",
"Gilbert M. 'Broncho Billy' Anderson",
"George Barnes",
"Justus D. Barnes"
],
"title": "The Great Train Robbery",
"directors": [
"Edwin S. Porter"
],
"year": 1903
}
A very simple embedding strategy would be to create a new plot_embedding field, populate it with the vector embedding for the text in the plot field, and call it a day. We can now perform vector search over the plot and capture different semantics of our users’ queries even if they don’t use exactly the right words.
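For reference, the query side of this simple strategy is just a standard $vectorSearch aggregation stage. Here’s a minimal mongosh sketch, assuming an index named plot_vector_index over the plot_embedding field (an illustrative name) and a queryVector produced from the user’s query text by the same embedding model used at index time:

// Minimal sketch: vector search over the plot embedding only.
// queryVector: the query text embedded with the same model used at index time
// (for example, via the Voyage AI API). Shown here as a placeholder.
const queryVector = [ /* embedding of the query text goes here */ ]

db.embedded_movies.aggregate([
  {
    $vectorSearch: {
      index: "plot_vector_index",   // illustrative index name
      path: "plot_embedding",       // field holding the plot vectors
      queryVector: queryVector,     // embedding of the user's query text
      numCandidates: 100,           // ANN candidates to consider
      limit: 5                      // results to return
    }
  },
  {
    $project: {
      title: 1,
      plot: 1,
      year: 1,
      score: { $meta: "vectorSearchScore" }
    }
  }
])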
How it’s going…
But then, you hit a snag. Your users are starting to search for things like “ghost comedy starring Bill Murray” or “great train robbery from 1900s”. You still get results, but they are often not what the user is really expecting.
Examples using my atlas-search CLI tool and embedding just the plot with voyage-3:
$: atlas-search vector "ghost comedy starring bill murray" --config embedded_movies --embedWithVoyage --limit 1
[
{
"_id": "573a139af29313caabcef0a4",
"plot": "A paranormal expert and his daughter bunk in an abandoned house populated by 3 mischievous ghosts and one friendly one.",
"title": "Casper",
"year": 1995
}
]
The cast data is missing from the embedding, so we don’t get Ghostbusters as our first result.
$: atlas-search vector "great train robbery from 1900s" --config embedded_movies --embedWithVoyage --limit 1
[
{
"_id": "573a1396f29313caabce39c5",
"plot": "Two Western bank/train robbers flee to Bolivia when the law gets too close.",
"title": "Butch Cassidy and the Sundance Kid",
"year": 1969
}
]
The movie title (as well as the year) is missing, so we don’t get any of the Great Train Robbery movies, let alone the one from 1903!
You realise that you need a more nuanced plan.
At this point, I’ll acknowledge that there are definitely multiple ways to crack this nut. For example, you could use a taxonomy search approach (similar to the one designed by my colleague Erik Hatcher for “search as you type”) to identify specific terms in the query and add the appropriate metadata filters to the vector search. This approach would yield the most deterministic, explainable, and possibly the most accurate results for certain query types. You could also pass the query to an LLM and have it extract the key metadata filters. This is more expensive and less deterministic, but (when it works) easier and, with the right LLM and training data, probably just as accurate.
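As a quick illustration of that first option, an extracted year range can be applied as a pre-filter on the vector search. A minimal sketch, assuming the index declares year as a filter field (the index name and placeholder queryVector are illustrative):

// Sketch: metadata pre-filter derived from "great train robbery from 1900s".
// The query text is still embedded, but candidates are restricted by year.
const queryVector = [ /* embedding of "great train robbery" goes here */ ]

db.embedded_movies.aggregate([
  {
    $vectorSearch: {
      index: "plot_vector_index",                    // must list year as a filter field
      path: "plot_embedding",
      queryVector: queryVector,
      filter: { year: { $gte: 1900, $lt: 1910 } },   // derived from "from 1900s"
      numCandidates: 100,
      limit: 1
    }
  }
])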
But you’re in a time crunch and you want to create your search experience with minimal tooling and manage it all in one place—the database. You might be wondering, “Can’t I just embed the metadata, as well?”
Great minds think alike!
More context needed!
You’re now considering how to add the metadata context to the plot summary. If you’ve read a bit about embedding models, you are probably realising that simply concatenating a bunch of values isn’t going to be the best bet. Rather than something like:
The Great Train Robbery A.C. Abadie,Gilbert M. 'Broncho Billy' Anderson,George Barnes,Justus D. Barnes ,Edwin S. Porter 1903 TV-G Among the earliest existing films in American cinema - notable as the first film that presented a narrative story to tell - it depicts a group of cowboy outlaws who hold up a train and rob the passengers. They are then pursued by a Sheriff's posse. Several scenes have color included - all hand tinted.
We’d rather have:
Movie Title: The Great Train Robbery
Cast: A.C. Abadie,Gilbert M. 'Broncho Billy' Anderson,George Barnes,Justus D. Barnes
Directors: Edwin S. Porter
Year: 1903
Rated: TV-G
Plot: Among the earliest existing films in American cinema - notable as the first film that presented a narrative story to tell - it depicts a group of cowboy outlaws who hold up a train and rob the passengers. They are then pursued by a Sheriff's posse. Several scenes have color included - all hand tinted.
Notice that the metadata field names are included in the source text. This “steers” the embedding model toward a representation which is closer to the meaning of the document—e.g., that “from 1900s” relates to the year that the movie was released.
Bringing it all together—views and auto-embedding
Now, you might be thinking that this doesn’t sound that frictionless. First, I’ve got to pollute my nice, clean database with this weird concatenated field. Then, I’ve got to call an embedding service, update my database, and populate my search index. Or at least some combination of those things. Well, this is where the magic of view support and auto-embedding in MongoDB Atlas Vector Search comes in.
You see, with the ability to create Atlas Vector Search indexes on a view of your MongoDB collection, you can model your data in place without polluting your schema with fields which are only needed for search. Moreover, when you make updates to the metadata, the data in your search index will automatically update!
But wait, there’s more! With the introduction of auto-embedding, you can now specify any text field in your document and have it automatically embedded using state-of-the-art Voyage AI embedding models and indexed for vector search. At query time, the same process happens over the input text. No more calling external embedding APIs or duct-taping different systems together.
Here’s how it works.
1. Define a view that concatenates metadata fields and plot
The view definition is going to use an $addFields stage to add a new field to each document called embedding_source—this is the text field from which the auto-embedding service will create a vector embedding. Within the stage, we’ll use the $concat operator to create a single string representation of the values of multiple fields in the document.

We can use $reduce to iterate over array values, such as cast and directors, and $toString to cast non-string values, like year. Within the string representation, we place prefixes like “Movie Title” to help steer the embedding model, and we include the whitespace padding characters “\n\n” to separate different fields. You could also opt to use Markdown representations as a way to structure the input.

We handle the case where some fields might be null by setting defaults for them, using the value of the plot field for fullplot and N/A for the rated field.
[
  // Set defaults so $concat never receives a null value
  {
    $addFields: {
      fullplot: { $ifNull: ["$fullplot", "$plot"] },
      rated: { $ifNull: ["$rated", "N/A"] }
    }
  },
  {
    $addFields: {
      embedding_source: {
        $concat: [
          "Movie Title: ", "$title", "\n\n",
          "Cast: ",
          {
            // Join the cast array into one string, then strip the leading ", "
            $substr: [
              {
                $reduce: {
                  input: "$cast",
                  initialValue: "",
                  in: { $concat: ["$$value", ", ", "$$this"] }
                }
              },
              2, -1
            ]
          },
          "\n\n",
          "Directors: ",
          {
            $substr: [
              {
                $reduce: {
                  input: "$directors",
                  initialValue: "",
                  in: { $concat: ["$$value", ", ", "$$this"] }
                }
              },
              2, -1
            ]
          },
          "\n\n",
          "Year: ", { $toString: "$year" }, "\n\n",
          "Rated: ", "$rated", "\n\n",
          "Plot: ", "$fullplot"
        ]
      }
    }
  }
]
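With the pipeline defined, the view itself can be created from mongosh (or through the Atlas UI). A minimal sketch, assuming the pipeline above has been assigned to a variable called embeddingSourcePipeline and using the sample_mflix database with illustrative view and collection names:

// Create the view the vector search index will be built on.
const mflix = db.getSiblingDB("sample_mflix")
mflix.createView(
  "auto_embedded_movies",     // view name (illustrative)
  "embedded_movies",          // underlying source collection
  embeddingSourcePipeline     // the two $addFields stages shown above
)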
2. Create an auto-embedding vector index on the view
This is very straightforward and can be achieved via the Atlas Vector Search UI or JSON editor. We specify the embedding model to use, the path to the field we created in the view, and that it is a text type.
{
"fields": [
{
"model": "voyage-3-large",
"path": "embedding_source",
"type": "text"
}
]
}
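If you prefer scripting to the UI, mongosh’s createSearchIndex helper takes the same JSON definition for vector search indexes. Whether the auto-embedding preview accepts this programmatic path, and index creation on views outside the UI, is an assumption on my part, so treat the following as a sketch only:

// Hedged sketch: programmatic index creation with the definition above.
// Support for the auto-embedding "text" field type outside the Atlas UI/JSON
// editor is an assumption, not something confirmed in this post.
db.getSiblingDB("sample_mflix")
  .getCollection("auto_embedded_movies")
  .createSearchIndex(
    "auto_embedded_index",    // illustrative index name
    "vectorSearch",
    {
      fields: [
        {
          model: "voyage-3-large",
          path: "embedding_source",
          type: "text"
        }
      ]
    }
  )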
If we go back to the queries from earlier, we can see that we get much more relevant results.
Now, we can get a movie about the great train robbery:
$: atlas-search vector "great train robbery" --config auto_embedded_movies --limit 1
[
{
"_id": "573a1395f29313caabce329c",
"plot": "A dramatization of the Great Train Robbery. While not a 'how to', it is very detail dependent, showing the care and planning that took place to pull it off.",
"cast": [
"Stanley Baker",
"Joanna Pettet",
"James Booth",
"Frank Finlay"
],
"title": "Robbery",
"directors": [
"Peter Yates"
],
"year": 1967
}
]
We can even get an earlier movie (from 1903) simply by adding “from 1900s” to the query—since the year is also part of the embedding:
$: atlas-search vector "great train robbery from 1900s" --config auto_embedded_movies --limit 1
[
{
"_id": "573a1390f29313caabcd42e8",
"plot": "A group of bandits stage a brazen train hold-up, only to find a determined posse hot on their heels.",
"cast": [
"A.C. Abadie",
"Gilbert M. 'Broncho Billy' Anderson",
"George Barnes",
"Justus D. Barnes"
],
"title": "The Great Train Robbery",
"directors": [
"Edwin S. Porter"
],
"year": 1903
}
]
And now, we also see Ghostbusters starring Bill Murray in the first spot, as expected:
$: atlas-search vector "ghost comedy starring bill murray" --config auto_embedded_movies --limit 1
[
{
"_id": "573a1398f29313caabce912c",
"plot": "Three unemployed parapsychology professors set up shop as a unique ghost removal service.",
"cast": [
"Bill Murray",
"Dan Aykroyd",
"Sigourney Weaver",
"Harold Ramis"
],
"title": "Ghostbusters",
"directors": [
"Ivan Reitman"
],
"year": 1984
}
]
Again, these results could absolutely be achieved with other techniques, such as the ones already mentioned or by creating vector embeddings for each field and searching across them all. But multi-embedding approaches do have the downside of incurring more costs at index time and more complexity at query time. You could also invest in a hybrid search approach—but then we’re really getting into proper search tuning!
The finished product
So there you have it! You now have an operational database with a semantic search index attached. Any updates you make on the collection will be automatically picked up by the MongoDB Atlas indexing process, and the document description text will be generated and automatically embedded using the Voyage AI models. This is arguably the most frictionless semantic search implementation available to an application developer.
So no excuses now—get building! 😃
Read about View support in MongoDB Atlas Vector Search
Apply for early access to auto-embedding in Atlas Vector Search