It’s finally Halloween!! 🎃
That time of the year for carved pumpkins, sexy costumes, and eerie stories whispered around a flickering candle.
But if you’re like me, you never quite remember any creepy tales when you need them most. So I thought: why not build a tool that can go through a massive collection of stories and pick out the ones that will really give us the chills?
So that’s exactly what we’re building today.
The plan is simple.
We’ll take a dataset of Reddit horror stories, embed it, and set up a Qdrant collection we can search by theme and atmosphere, essentially capturing a “vibe” like ‘haunted house’ or ‘creepy forest.’
I’ll show all the steps you'll need to build an app like this: setting up the vector database, embedding and indexing the data, and conjuring the most cursed Halloween tales.
So let's get started.
1. Install the Libraries
First things first, let's start by installing the tools we'll be using:
pip install qdrant-client sentence-transformers datasets
2. Download the Dataset
We'll be using the Reddit horror stories dataset. Let's download it using the datasets library:
from datasets import load_dataset
ds = load_dataset("intone/horror_stories_reddit")
3. Load the Embedding Model
We'll use the sentence_transformers library to embed our data with the all-MiniLM-L6-v2 model. Here's how we'll set it up:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    'sentence-transformers/all-MiniLM-L6-v2', device='cpu'
)
If you have a GPU available and want to speed things up, simply change it to device='cuda:0'.
4. Create the Embeddings
The generate_embeddings function processes a dataset split (like "train") by breaking it into smaller groups called batches, based on the specified batch_size. This helps manage memory efficiently. For each batch, the function extracts a set of sentences (e.g., 32 at a time) and uses the loaded embedding model to embed them.
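Before running this on the real dataset, here's the slicing pattern in isolation: a toy list of ten placeholder "stories" split into batches of four (the names are made up for the example).

```python
# Toy illustration of the batching arithmetic: slice ten placeholder
# "stories" into batches of four; the last batch ends up smaller.
sentences = [f"story {i}" for i in range(10)]
batch_size = 4
batches = [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]
print([len(b) for b in batches])  # → [4, 4, 2]
```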
from tqdm import tqdm

def generate_embeddings(split, batch_size=32):
    embeddings = []
    # Recover the split's name ("train", etc.) for the progress bar
    split_name = [name for name, data_split in ds.items() if data_split is split][0]
    with tqdm(total=len(split), desc=f"Generating embeddings for {split_name} split") as pbar:
        for i in range(0, len(split), batch_size):
            batch_sentences = split['text'][i:i+batch_size]
            batch_embeddings = model.encode(batch_sentences)
            embeddings.extend(batch_embeddings)
            pbar.update(len(batch_sentences))
    return embeddings
The function collects all the batch embeddings into a single list, which we then add as a new column in the dataset. This way, the dataset is updated without overloading memory.
train_embeddings = generate_embeddings(ds['train'])
ds["train"] = ds["train"].add_column("embeddings", train_embeddings)
5. Set up a Client
Now we can start our Qdrant Client. If you’re working locally, just connect to the default endpoint and you’re good to go:
from qdrant_client import QdrantClient
# Connecting to a locally running instance
qdrant_client = QdrantClient(url="http://localhost:6333")
Simple enough, right? But in the real world, you’re likely working in the cloud. That means getting your Qdrant Cloud instance set up and authenticated.
Cloud Setup
To connect to your cloud instance, you’ll need the instance URL and an API key. Here’s how to do it:
from qdrant_client import QdrantClient
# Initialize the client with the Qdrant Cloud URL and API key
qdrant_client = QdrantClient(
    url="https://YOUR_CLOUD_INSTANCE_ID.aws.qdrant.tech",  # Replace with your cloud instance URL
    api_key="YOUR_API_KEY"  # Replace with your API key
)
Make sure to replace YOUR_CLOUD_INSTANCE_ID with your actual instance ID and YOUR_API_KEY with the API key you created. You’ll find both in your Qdrant Cloud Console.
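A small tip: rather than hardcoding credentials in the script, you can read them from environment variables. This is just a sketch; the variable names QDRANT_URL and QDRANT_API_KEY are my own convention, not anything Qdrant requires.

```python
import os

# Read connection details from the environment instead of hardcoding them.
# QDRANT_URL falls back to a local instance; QDRANT_API_KEY stays None if unset.
qdrant_url = os.environ.get("QDRANT_URL", "http://localhost:6333")
qdrant_api_key = os.environ.get("QDRANT_API_KEY")
```

You'd then pass these in as QdrantClient(url=qdrant_url, api_key=qdrant_api_key).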
6. Create a Collection
A collection in Qdrant is like a mini-database optimized for storing and querying vectors. When defining one, we need to set the size of our vectors and the metric to measure similarity. Here’s what that setup might look like:
from qdrant_client import models
collection_name = "halloween"

# Creating a collection to hold the story embeddings
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE)
)
We defined a collection halloween with 384-dimensional vectors, which is the size of the all-MiniLM-L6-v2 embeddings. Cosine distance is used here as our similarity metric. Depending on your data and use case, you might want to use a different distance metric like Distance.EUCLID or Distance.DOT.
7. Load the Vectors
Collections are nothing without data. It’s time to insert the embeddings we created earlier into it. Here’s a strategy to load embeddings in batches:
from itertools import islice

def batched(iterable, n):
    iterator = iter(iterable)
    while batch := list(islice(iterator, n)):
        yield batch

batch_size = 100
current_id = 0  # Initialize a counter
The batched function divides an iterable into smaller chunks of size n. It uses islice to extract consecutive elements and yields each chunk until the dataset is fully processed.
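To see what batched produces before pointing it at the dataset, here's the same helper run on a plain range of numbers:

```python
from itertools import islice

# Same helper as above, chunking 0..9 into groups of 4
def batched(iterable, n):
    iterator = iter(iterable)
    while batch := list(islice(iterator, n)):
        yield batch

print(list(batched(range(10), 4)))  # → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```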
from itertools import islice

for batch in batched(ds["train"], batch_size):
    # Generate a list of IDs using the counter
    ids = list(range(current_id, current_id + len(batch)))
    # Update the counter to continue from the next ID after the batch
    current_id += len(batch)

    vectors = [point.pop("embeddings") for point in batch]

    qdrant_client.upsert(
        collection_name=collection_name,
        points=models.Batch(
            ids=ids,
            vectors=vectors,
            payloads=batch,
        ),
    )
Each batch is sent to Qdrant using the upsert method, which inserts new points or updates existing ones. It takes a collection of IDs, vectors, and the remaining item data (payloads) to store in the Qdrant collection.
8. Conjuring the Cursed Tales
Finally, it's time.
With everything set up, it’s time to see if our horror story search tool can deliver some real scares. Let’s try searching for a theme like “creepy clown” and see what we get:
import textwrap

# Function to wrap and print long text
def print_wrapped(text, width=80):
    wrapped_text = textwrap.fill(text, width=width)
    print(wrapped_text)

# Search result query
search_result = qdrant_client.query_points(
    collection_name=collection_name,
    query=model.encode("creepy clown").tolist(),
    limit=1,
)

# Access the first result
if search_result.points:
    tale = search_result.points[0]

    # Print specific payload fields
    print("ID:", tale.id)
    print("Score:", tale.score)
    print("Original:", tale.payload.get('isOriginal', 'N/A'))
    print("Title:", tale.payload.get('title', 'N/A'))
    print("Author:", tale.payload.get('author', 'N/A'))
    print("Subreddit:", tale.payload.get('subreddit', 'N/A'))
    print("URL:", tale.payload.get('url', 'N/A'))

    # Print the text of the story separately with word wrapping for readability
    print("\nStory Text:\n")
    print_wrapped(tale.payload.get('text', 'No text available'), width=80)
else:
    print("No results found.")
The result popped up, and there it was: a story titled “Sneaky Peeky.”
And honestly, it was CREEPY.
Is it based on a true story? Honestly, I don’t know. It leaves you with that lingering unease, like something’s watching. It’s quite long, so I won’t post it here, but if you want to see for yourself, go ahead and run the program.
You can explore any other atmosphere: “haunted house,” “creepy forest,” “possessed doll,” or whatever you’re in the mood for. Who knows? You might find something even creepier.
If you do, please post it in the comments. I’d love to see what else this thing can discover.
Next Steps
Thanks for sticking with me through this Halloween experiment! If you’ve followed along, you’ve now taken your first step into the world of vector search and learned how to find stories that feel creepy rather than just containing spooky words.
If you’re ready to delve into the dark arts of vector search, there are plenty of more advanced topics to explore, like multitenancy, payload structures, and bulk uploads.
So go ahead and see just how deep you can go.
Happy hunting! 👻