Ricardo Ferreira for Redis

Designing Data Systems with Vector Embeddings using Redis Vector Sets

Introduction

Every software engineer has faced this challenge: how do you model complex, multifaceted relationships in your data? Whether it's products in an e-commerce catalog, documents in a search engine, or content in a recommendation system, we often struggle to represent data entities with multiple attributes and complex relationships.

Today, we'll explore this fundamental problem through an unexpected lens: Pixar's Finding Nemo. By modeling Marlin's journey to find his son, Nemo, you'll learn how Redis Vector Sets and vector embeddings can elegantly solve problems that most traditional approaches struggle with.

Best of all? You'll learn how to do this hands-on.

The Challenge: Data Design

Imagine you're tasked with building a system representing Marlin's journey in Finding Nemo. If you haven't watched the movie (well, if that is true, then shame on you), here is a TL;DR of the storyline.

  1. Nemo, a clownfish, gets lost in the sea and is dragged to Sydney
  2. Marlin, his father, starts a rescue journey alone and afraid
  3. He meets Dory, a helpful but rather forgetful blue tang
  4. They encounter Bruce and his shark friends, reformed predators
  5. A school of moonfish gives Marlin and Dory directions
  6. Crush and the sea turtles help them ride the EAC (East Australian Current)
  7. A whale swallows them and takes them to Sydney by accident
  8. Nigel the pelican recognizes Marlin and shares where Nemo is
  9. Finally, Marlin reunites with Nemo

These are the requirements your system needs to keep in mind:

  • Preserve the journey order — Who does Marlin meet first, second, third?
  • Find similar characters — Which characters are most like Dory?
  • Filter by attributes — Show me all the "helpers" or "large creatures"
  • Handle flexible insertion — Add characters in any order, not just chronologically
  • Support proximity queries — Who does Marlin meet after the sharks?

Which well-known approaches would you use to address these requirements? Let's discuss some options and their trade-offs.

🔗 Using Linked Lists

package main

type Character struct {
    Name string
    Next *Character
}

func main() {
    marlin := &Character{Name: "Marlin"}
    dory := &Character{Name: "Dory"}
    marlin.Next = dory
    // Same for next ones...
}

Linked lists are great for preserving the journey order. Depending on the variant, they let you navigate the relationships forward-only (as in the singly linked list above), bidirectionally, or in a circle. But here is the problem:

💡 You can't query by attributes, there is no similarity search, and the insertion order is rigid.
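
To see why attribute queries are awkward here, consider this minimal sketch. It extends the struct above with a hypothetical Type field (not part of the original example) and looks characters up by that field. The helper names findByType and buildJourney are made up for illustration.

```go
package main

import "fmt"

// Character is the list node from above, extended with a hypothetical
// Type field so there is an attribute to search for.
type Character struct {
	Name string
	Type string // assumption: not part of the original struct
	Next *Character
}

// findByType walks the whole list from head and collects matching names.
// There is no index to consult, so every lookup is a full O(n) scan.
func findByType(head *Character, typ string) []string {
	var names []string
	for c := head; c != nil; c = c.Next {
		if c.Type == typ {
			names = append(names, c.Name)
		}
	}
	return names
}

// buildJourney links the first three characters in journey order.
func buildJourney() *Character {
	sharks := &Character{Name: "Sharks", Type: "reformed predators"}
	dory := &Character{Name: "Dory", Type: "helper", Next: sharks}
	return &Character{Name: "Marlin", Type: "father", Next: dory}
}

func main() {
	fmt.Println(findByType(buildJourney(), "helper")) // prints [Dory]
}
```

Every attribute lookup has to visit every node, and nothing in the structure helps answer "who is most like Dory?".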

🗄️ Relational Database

CREATE TABLE journey (
    id INT PRIMARY KEY,
    character_name VARCHAR(50),
    position INT,
    species VARCHAR(50),
    helpfulness FLOAT,
    size FLOAT
);

Relational databases are fairly attractive because they provide a proven programming model. If your data is in a table, you probably know how to query it with SQL. But here is the problem:

💡 Similarity queries require complex JOINs, "find nearest neighbors" is expensive, and multi-attribute distance calculations are cumbersome.

🕸️ Graph Database

(marlin:Character)-[:MEETS]->(dory:Character)-[:MEETS]->(sharks:Character)

Graph databases excel in scenarios requiring complex relationships. Let's face it, this is how the world looks most of the time. But here is the problem:

💡 While suitable for relationships, calculating multi-dimensional similarity is still complex, and ordering isn't inherent.

📄 Document Store with Full-Text Search

{
  "character": "Dory",
  "attributes": ["helpful", "forgetful", "blue tang"],
  "position": 2
}

Document stores are great because they provide flexible data models that let you adapt to changing query needs quickly and painlessly. But here is the problem:

💡 Text matching isn't the same as mathematical similarity, and there is no true distance calculation between documents.

What's Missing?

All these approaches treat the journey order and character attributes as separate concerns. As developers, we tend to pick whichever approach we are used to and then struggle to implement the remaining requirements. It's almost as if we hope it will work great in the end, like magic.

What if we could encode both in a unified mathematical representation? What if each character could exist as a point in multi-dimensional space, where its position stores both when the character appears and what the character is like? Well, this is what vector embeddings are for.

A New Perspective with Vectors

Instead of thinking of characters as records, nodes, or documents, imagine them as points in a multi-dimensional space. Each dimension represents an attribute that helps you design your data entity correctly:

  1. Journey Position (when Marlin meets them)
  2. Helpfulness (how much they assist)
  3. Size (physical dimensions)
  4. Swimming Style (movement patterns)
  5. Courage (bravery level)

With these five dimensions, finding "who comes next in the journey" becomes a nearest-neighbor search. Finding "similar characters" is just measuring distances in this space. Magic? No. Mathematics. Using dimensions with a vector store gives you the following:

  • Flexible Insertion: Add data in any order; relationships come from each record's coordinates rather than from explicit references between records.
  • Multi-Attribute Similarity: Distance calculations consider all dimensions simultaneously. There are no more per-field comparisons.
  • Semantic Queries: Search by example or by constructing meaningful query vectors. Search by meaning instead of by exact match.
  • Hybrid Search: Combine vector similarity with attribute filtering. Vectors for semantic search, filters for additional result precision.
  • Performance: HNSW indexes make even high-dimensional searches fast. Search massive amounts of data in roughly O(log N) time, at the cost of approximate results.
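
To make "points in a multi-dimensional space" concrete, here is a minimal Go sketch that compares characters with two common measures: Euclidean distance and cosine similarity. The Vec type and both helper functions are illustrative names, and the sample values are borrowed from the steps later in this post.

```go
package main

import (
	"fmt"
	"math"
)

// Vec holds the five dimensions in the order listed above:
// journey position, helpfulness, size, swimming style, courage.
type Vec [5]float64

// euclidean returns the straight-line distance between two points.
func euclidean(a, b Vec) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// cosine returns the cosine of the angle between two vectors
// (1 means same direction). Both vectors must be non-zero.
func cosine(a, b Vec) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	marlin := Vec{0.0, 0.5, 0.2, 0.1, 0.1}
	dory := Vec{1.0, 1.0, 0.2, 0.2, 0.2}
	nemo := Vec{7.0, 0.0, 0.2, 0.1, 0.8}
	fmt.Printf("marlin-dory: dist=%.3f cos=%.3f\n", euclidean(marlin, dory), cosine(marlin, dory))
	fmt.Printf("marlin-nemo: dist=%.3f cos=%.3f\n", euclidean(marlin, nemo), cosine(marlin, nemo))
}
```

By both measures, Dory sits closer to Marlin than Nemo does, which matches the story: she is the first character he meets.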

Building Marlin's Journey with Redis Vector Sets

Let's build this step by step. We'll use Redis Vector Sets, which provide high-performance vector similarity search and additional filtering capabilities. You will need Redis Open Source 8.0 or later for this.

Step 1: Creating Our Universe

First, let's confirm that we're starting fresh by checking the type of the key we're about to create:

TYPE finding-nemo

➡️ Expected output:

none

Great! We're starting with a clean slate. Now let's add our first character—Marlin. He is the anxious father starting his journey.

VADD finding-nemo VALUES 5 0.0 0.5 0.2 0.1 0.1 marlin SETATTR '{"species":"clownfish","type":"father","quote":"I have to find my son!"}'

➡️ Expected output:

(integer) 1

What just happened?

  • We've created a new vector set, under the key finding-nemo
  • We've added Marlin as a 5-dimensional point at position (0.0, 0.5, 0.2, 0.1, 0.1)
  • We've attached metadata to Marlin about species, role, and his iconic quote

The vector embedding [0.0, 0.5, 0.2, 0.1, 0.1] tells us Marlin starts at the first position (0.0), has moderate helpfulness (0.5), is small (0.2), swims cautiously (0.1), and begins with low courage (0.1). In this example, we have used only five dimensions because these are all the attributes we need to implement the scenario. However, you are not limited to this. You can use as many dimensions as you need. Have you ever heard about OpenAI's embedding models that create 1536 dimensions? This is one example of how far you can go!
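
To make the mapping from attributes to dimensions explicit, here is a minimal Go sketch that renders the same VADD command from a struct. The Character and Attrs types and the vaddCommand helper are hypothetical names for illustration, not part of any Redis client, and the sketch only builds the command text rather than sending it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// Attrs mirrors the JSON passed to SETATTR in the command above.
type Attrs struct {
	Species string `json:"species"`
	Type    string `json:"type"`
	Quote   string `json:"quote"`
}

// Character is a hypothetical record type. The order of the numeric
// fields defines the order of the vector's dimensions.
type Character struct {
	Name        string
	Position    float64
	Helpfulness float64
	Size        float64
	SwimStyle   float64
	Courage     float64
	Attrs       Attrs
}

// vaddCommand renders the VADD command line for one character.
// It assumes the attributes contain no single quotes.
func vaddCommand(key string, c Character) (string, error) {
	attrs, err := json.Marshal(c.Attrs)
	if err != nil {
		return "", err
	}
	dims := []float64{c.Position, c.Helpfulness, c.Size, c.SwimStyle, c.Courage}
	parts := []string{"VADD", key, "VALUES", strconv.Itoa(len(dims))}
	for _, d := range dims {
		parts = append(parts, strconv.FormatFloat(d, 'f', 1, 64))
	}
	parts = append(parts, c.Name, "SETATTR", "'"+string(attrs)+"'")
	return strings.Join(parts, " "), nil
}

// marlin returns the character added in Step 1.
func marlin() Character {
	return Character{
		Name: "marlin", Position: 0.0, Helpfulness: 0.5,
		Size: 0.2, SwimStyle: 0.1, Courage: 0.1,
		Attrs: Attrs{Species: "clownfish", Type: "father", Quote: "I have to find my son!"},
	}
}

func main() {
	cmd, err := vaddCommand("finding-nemo", marlin())
	if err != nil {
		panic(err)
	}
	fmt.Println(cmd)
}
```

Keeping the dimension order in one place like this helps avoid a subtle bug: every vector in the set must list its attributes in the same order, or the distances stop meaning anything.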

Step 2: Building the Journey (Out of Order!)

Here's where it gets interesting. In traditional systems, we'd need to insert characters in order. But with vectors, watch this:

VADD finding-nemo VALUES 5 6.0 0.9 0.5 0.7 0.7 nigel SETATTR '{"species":"pelican","type":"informant","quote":"Hop inside my mouth!"}'
VADD finding-nemo VALUES 5 1.0 1.0 0.2 0.2 0.2 dory SETATTR '{"species":"blue tang","type":"helper","quote":"Just keep swimming!"}'
VADD finding-nemo VALUES 5 5.0 0.9 1.0 0.9 0.6 whale SETATTR '{"species":"whale","type":"transporter","quote":"*whale sounds*"}'
VADD finding-nemo VALUES 5 3.0 0.7 0.7 0.4 0.3 moonfish SETATTR '{"species":"moonfish","type":"guides","quote":"Follow the EAC!"}'

Notice we're adding Nigel (position 6.0) before Dory (position 1.0). In a Linked List, this would break our ordering. But vectors don't care about insertion order. They care about position in space.

Let's complete our cast:

VADD finding-nemo VALUES 5 7.0 0.0 0.2 0.1 0.8 nemo SETATTR '{"species":"clownfish","type":"son","quote":"Dad!"}'
VADD finding-nemo VALUES 5 4.0 0.9 0.6 0.8 0.5 turtles SETATTR '{"species":"sea turtles","type":"transporters","quote":"Righteous! Righteous!"}'
VADD finding-nemo VALUES 5 2.0 0.7 0.8 0.3 0.3 sharks SETATTR '{"species":"sharks","type":"reformed predators","quote":"Fish are friends, not food!"}'

Step 3: Verifying Our Vector Universe

Let's examine what we've built. What type of data structure did we create?

TYPE finding-nemo

➡️ Expected output:

vectorset

Vector Sets provide a way for you to inspect your data very easily. Use the command VCARD to count how many characters are in our journey.

VCARD finding-nemo

➡️ Expected output:

(integer) 8

What if you want to find out how many dimensions you are using? This is quite common, as the team that loads vectors into the database is not always the same one that queries them. Use the command VDIM to find how many dimensions each character has.

VDIM finding-nemo

➡️ Expected output:

(integer) 5

If you need to retrieve general information about your vector set, use the command VINFO.

VINFO finding-nemo

➡️ Expected output:

1) "quant-type"
2) "int8"
3) "hnsw-m"
4) "16"
5) "vector-dim"
6) "5"
7) "projection-input-dim"
8) "0"
9) "size"
10) "8"
11) "max-level"
12) "1"
13) "attributes-count"
14) "8"
15) "vset-uid"
16) "0"
17) "hnsw-max-node-uid"
18) "8"
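
The reply above is a flat list of alternating field names and values. If you read it from application code, you may want to fold it into a map first. Here is a minimal Go sketch of that step. The pairsToMap helper is a made-up name, and it assumes the values arrive as strings, the way redis-cli prints them; a real client library may return some of them as integers.

```go
package main

import (
	"errors"
	"fmt"
)

// pairsToMap folds a flat reply of alternating field names and values,
// like the VINFO output above, into a map.
func pairsToMap(reply []string) (map[string]string, error) {
	if len(reply)%2 != 0 {
		return nil, errors.New("reply must have an even number of items")
	}
	info := make(map[string]string, len(reply)/2)
	for i := 0; i < len(reply); i += 2 {
		info[reply[i]] = reply[i+1]
	}
	return info, nil
}

func main() {
	// A shortened copy of the reply shown above.
	reply := []string{"quant-type", "int8", "vector-dim", "5", "size", "8"}
	info, err := pairsToMap(reply)
	if err != nil {
		panic(err)
	}
	fmt.Println(info["quant-type"], info["vector-dim"], info["size"]) // prints: int8 5 8
}
```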

Step 4: Tracing the Journey

Now for the moment of truth. Despite inserting the characters out of order, can we trace Marlin's journey in the correct order? The answer is yes. Start the query from a point just before the journey begins (-1.0) and let Redis return the characters from nearest to farthest.

VSIM finding-nemo VALUES 5 -1.0 0.5 0.2 0.1 0.1

➡️ Expected output:

1) "marlin"
2) "dory"
3) "sharks"
4) "moonfish"
5) "turtles"
6) "whale"
7) "nigel"
8) "nemo"

🎉 Perfect! But how did this work?

The query vector [-1.0, 0.5, 0.2, 0.1, 0.1] is cleverly designed:

  • Position -1.0 places us "before" the journey starts
  • The other values (0.5, 0.2, 0.1, 0.1) match Marlin's characteristics

Redis returns the nearest neighbors from closest to farthest, which here traces the path from start to finish. Because every other dimension stays between 0 and 1, the much larger gaps between journey positions (0, 1, 2, ... 7) let the first dimension dominate the comparison.
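
You can check this ordering without a server. The Go sketch below ranks the eight characters against the same query vector by brute force. It assumes cosine similarity on the exact float values, which is my understanding of how vector sets compare elements. Redis itself uses an approximate HNSW index and, as the VINFO output shows, int8 quantization, so its internal scores will differ slightly. The type and function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

type Vec [5]float64

type point struct {
	name string
	vec  Vec
}

// cast lists the characters in the same out-of-order sequence they were added.
var cast = []point{
	{"marlin", Vec{0.0, 0.5, 0.2, 0.1, 0.1}},
	{"nigel", Vec{6.0, 0.9, 0.5, 0.7, 0.7}},
	{"dory", Vec{1.0, 1.0, 0.2, 0.2, 0.2}},
	{"whale", Vec{5.0, 0.9, 1.0, 0.9, 0.6}},
	{"moonfish", Vec{3.0, 0.7, 0.7, 0.4, 0.3}},
	{"nemo", Vec{7.0, 0.0, 0.2, 0.1, 0.8}},
	{"turtles", Vec{4.0, 0.9, 0.6, 0.8, 0.5}},
	{"sharks", Vec{2.0, 0.7, 0.8, 0.3, 0.3}},
}

// cosine returns the cosine of the angle between two non-zero vectors.
func cosine(a, b Vec) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// rank returns all names ordered from most to least similar to query.
func rank(query Vec) []string {
	sorted := make([]point, len(cast))
	copy(sorted, cast)
	sort.SliceStable(sorted, func(i, j int) bool {
		return cosine(query, sorted[i].vec) > cosine(query, sorted[j].vec)
	})
	names := make([]string, len(sorted))
	for i, p := range sorted {
		names[i] = p.name
	}
	return names
}

func main() {
	fmt.Println(rank(Vec{-1.0, 0.5, 0.2, 0.1, 0.1}))
	// prints [marlin dory sharks moonfish turtles whale nigel nemo]
}
```

The insertion order in cast is scrambled on purpose. The ranking depends only on the coordinates, which is the point of this step.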

Step 5: Finding Similar Characters

Vector sets shine at similarity search. Let's explore relationships. For instance, who are the three characters most similar to Marlin?

VSIM finding-nemo ELE marlin COUNT 3

➡️ Expected output:

1) "marlin"
2) "dory"    # Makes sense - small fish, next in journey
3) "sharks"  # Next encounter after Dory

Who's closest to Nemo?

VSIM finding-nemo ELE nemo COUNT 3

➡️ Expected output:

1) "nemo"
2) "nigel"   # Met right before reunion
3) "whale"   # Carried Marlin to Sydney

The similarity considers all dimensions: not just journey position, but also size, helpfulness, and the other attributes.

Step 6: Finding Helpers with Filtered Searches

Here's where vector sets truly excel over traditional approaches. Let's find all helpers and transporters near Dory.

VSIM finding-nemo ELE dory FILTER '.type == "helper" || .type == "transporters"' COUNT 5

➡️ Expected output:

1) "dory"
2) "turtles"

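This combines vector similarity with attribute filtering, which would require complex queries in traditional databases. Note that the whale is missing from the results: its type was stored as "transporter" (singular), so it doesn't match the "transporters" value in the filter.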
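
The idea behind a filtered search can be sketched in a few lines of Go: keep only the elements whose attributes pass the predicate, then rank what is left by similarity. This is a simplified brute-force illustration under the assumption of cosine similarity, not how Redis evaluates FILTER internally, and all names in it are made up for the example.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

type Vec [5]float64

type point struct {
	name string
	typ  string // the "type" attribute stored with SETATTR
	vec  Vec
}

var cast = []point{
	{"marlin", "father", Vec{0.0, 0.5, 0.2, 0.1, 0.1}},
	{"dory", "helper", Vec{1.0, 1.0, 0.2, 0.2, 0.2}},
	{"sharks", "reformed predators", Vec{2.0, 0.7, 0.8, 0.3, 0.3}},
	{"moonfish", "guides", Vec{3.0, 0.7, 0.7, 0.4, 0.3}},
	{"turtles", "transporters", Vec{4.0, 0.9, 0.6, 0.8, 0.5}},
	{"whale", "transporter", Vec{5.0, 0.9, 1.0, 0.9, 0.6}},
	{"nigel", "informant", Vec{6.0, 0.9, 0.5, 0.7, 0.7}},
	{"nemo", "son", Vec{7.0, 0.0, 0.2, 0.1, 0.8}},
}

// cosine returns the cosine of the angle between two non-zero vectors.
func cosine(a, b Vec) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// nearestOfTypes keeps only points whose type is in types, then returns
// up to count names ordered from most to least similar to query.
func nearestOfTypes(query Vec, count int, types ...string) []string {
	allowed := make(map[string]bool, len(types))
	for _, t := range types {
		allowed[t] = true
	}
	var kept []point
	for _, p := range cast {
		if allowed[p.typ] {
			kept = append(kept, p)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool {
		return cosine(query, kept[i].vec) > cosine(query, kept[j].vec)
	})
	if len(kept) > count {
		kept = kept[:count]
	}
	names := make([]string, 0, len(kept))
	for _, p := range kept {
		names = append(names, p.name)
	}
	return names
}

func main() {
	dory := Vec{1.0, 1.0, 0.2, 0.2, 0.2}
	fmt.Println(nearestOfTypes(dory, 5, "helper", "transporters"))
	// prints [dory turtles]
}
```

Only two elements pass the filter, so asking for five still returns two.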

Step 7: Semantic Queries

Let's find the largest creatures by searching near a point that represents a large, helpful creature in the middle of the journey.

VSIM finding-nemo VALUES 5 3.5 0.8 1.0 0.5 0.5 COUNT 3

➡️ Expected output:

1) "whale"    # Largest (size=1.0)
2) "moonfish" # Also large (size=0.7)
3) "turtles"  # Also sizeable (size=0.6)

We didn't need to write WHERE size > 0.7 as vector space naturally clusters large creatures together. Semantic search is a powerful type of querying that exploits data's proximity instead of its precision.

Summary

By representing Marlin's journey as vectors, we've solved challenges that stump traditional approaches. Through the elegant mathematics of vector spaces, we can query by order, find similar entities, filter by attributes, and add data with flexible insertion. While the Finding Nemo example in this blog post may have been whimsical, all the underlying principles are foundational to modern AI and search systems.

Vector Sets, part of Redis Open Source, provide the perfect environment for exploring data modeling with vector embeddings, while providing a robust implementation of the HNSW algorithm.

So, the next time you face a complex data modeling challenge, ask yourself: could this be a vector? Who knows. Sometimes the best solutions come from seeing your data from a different dimension. Pun intended.
