It’s finally Halloween!! 🎃
That time of the year for carved pumpkins, sexy costumes, and eerie stories whispered around a flickering candle.
But if you’re like me, you never quite remember any creepy tales when you need them most. So I thought: why not build a tool that can go through a massive collection of stories and pick out the ones that will really give us the chills?
So that’s exactly what we’re building today.
The plan is simple.
We’ll take a dataset of Reddit horror stories, embed it, and set up a Qdrant collection we can search by theme and atmosphere, essentially capturing a “vibe” like ‘haunted house’ or ‘creepy forest.’
I’ll show all the steps you'll need to build an app like this: setting up the vector database, embedding and indexing the data, and conjuring the most cursed Halloween tales.
So let's get started.
1. Install the Libraries
First things first, let's start by installing the tools we'll be using:
pip install qdrant-client sentence-transformers datasets
2. Download the Dataset
We'll be using the Reddit horror stories dataset. Let's download it using the datasets library:
from datasets import load_dataset
ds = load_dataset("intone/horror_stories_reddit")
3. Load the Embedding Model
We'll use the sentence_transformers library to embed our data with the all-MiniLM-L6-v2 model. Here's how we'll set it up:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    'sentence-transformers/all-MiniLM-L6-v2', device='cpu'
)
If you have a GPU available and want to speed things up, simply change it to device='cuda:0'.
4. Create the Embeddings
The generate_embeddings function processes a dataset split (like "train") by breaking it into smaller groups called batches, based on the specified batch_size. This helps manage memory efficiently. For each batch, the function extracts a set of sentences (e.g., 32 at a time) and uses the loaded embedding model to embed them.
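Before running this on the real dataset, here's the slicing pattern in isolation: a toy list of ten placeholder "stories" split into batches of four (the names are made up for the example).

```python
# Toy illustration of the batching arithmetic: slice ten placeholder
# "stories" into batches of four; the last batch ends up smaller.
sentences = [f"story {i}" for i in range(10)]
batch_size = 4
batches = [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]
print([len(b) for b in batches])  # → [4, 4, 2]
```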
from tqdm import tqdm

def generate_embeddings(split, batch_size=32):
    embeddings = []
    # Recover the split's name ("train", etc.) for the progress bar
    split_name = [name for name, data_split in ds.items() if data_split is split][0]
    with tqdm(total=len(split), desc=f"Generating embeddings for {split_name} split") as pbar:
        for i in range(0, len(split), batch_size):
            batch_sentences = split['text'][i:i+batch_size]
            batch_embeddings = model.encode(batch_sentences)
            embeddings.extend(batch_embeddings)
            pbar.update(len(batch_sentences))
    return embeddings
The function collects all the batch embeddings into a single list, which we then add as a new column in the dataset. This way, the dataset is updated without overloading memory.
train_embeddings = generate_embeddings(ds['train'])
ds["train"] = ds["train"].add_column("embeddings", train_embeddings)
5. Set up a Client
Now we can start our Qdrant Client. If you’re working locally, just connect to the default endpoint and you’re good to go:
from qdrant_client import QdrantClient
# Connecting to a locally running instance
qdrant_client = QdrantClient(url="http://localhost:6333")
Simple enough, right? But in the real world, you’re likely working in the cloud. That means getting your Qdrant Cloud instance set up and authenticated.
Cloud Setup
To connect to your cloud instance, you’ll need the instance URL and an API key. Here’s how to do it:
from qdrant_client import QdrantClient
# Initialize the client with the Qdrant Cloud URL and API key
qdrant_client = QdrantClient(
    url="https://YOUR_CLOUD_INSTANCE_ID.aws.qdrant.tech",  # Replace with your cloud instance URL
    api_key="YOUR_API_KEY"  # Replace with your API key
)
Make sure to replace YOUR_CLOUD_INSTANCE_ID with your actual instance ID and YOUR_API_KEY with the API key you created. You’ll find both in your Qdrant Cloud Console.
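A small tip: rather than hardcoding credentials in the script, you can read them from environment variables. This is just a sketch; the variable names QDRANT_URL and QDRANT_API_KEY are my own convention, not anything Qdrant requires.

```python
import os

# Read connection details from the environment instead of hardcoding them.
# QDRANT_URL falls back to a local instance; QDRANT_API_KEY stays None if unset.
qdrant_url = os.environ.get("QDRANT_URL", "http://localhost:6333")
qdrant_api_key = os.environ.get("QDRANT_API_KEY")
```

You'd then pass these in as QdrantClient(url=qdrant_url, api_key=qdrant_api_key).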
6. Create a Collection
A collection in Qdrant is like a mini-database optimized for storing and querying vectors. When defining one, we need to set the size of our vectors and the metric to measure similarity. Here’s what that setup might look like:
from qdrant_client import models
collection_name = "halloween"

# Creating a collection to hold the story embeddings
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE)
)
We defined a collection halloween with 384-dimensional vectors, which is the size of the all-MiniLM-L6-v2 embeddings. Cosine distance is used here as our similarity metric. Depending on your data and use case, you might want to use a different distance metric like Distance.EUCLID or Distance.DOT.
7. Load the Vectors
Collections are nothing without data. It’s time to insert the embeddings we created earlier into it. Here’s a strategy to load embeddings in batches:
from itertools import islice

def batched(iterable, n):
    iterator = iter(iterable)
    while batch := list(islice(iterator, n)):
        yield batch

batch_size = 100
current_id = 0  # Initialize a counter
The batched function divides an iterable into smaller chunks of size n. It uses islice to extract consecutive elements and yields each chunk until the dataset is fully processed.
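To see what batched produces before pointing it at the dataset, here's the same helper run on a plain range of numbers:

```python
from itertools import islice

# Same helper as above, chunking 0..9 into groups of 4
def batched(iterable, n):
    iterator = iter(iterable)
    while batch := list(islice(iterator, n)):
        yield batch

print(list(batched(range(10), 4)))  # → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```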
from itertools import islice

for batch in batched(ds["train"], batch_size):
    # Generate a list of IDs using the counter
    ids = list(range(current_id, current_id + len(batch)))
    # Update the counter to continue from the next ID after the batch
    current_id += len(batch)

    vectors = [point.pop("embeddings") for point in batch]

    qdrant_client.upsert(
        collection_name=collection_name,
        points=models.Batch(
            ids=ids,
            vectors=vectors,
            payloads=batch,
        ),
    )
Each batch is sent to Qdrant using the upsert method, which inserts new points or updates existing ones. It takes a collection of IDs, vectors, and the remaining item data (payloads) to store in the Qdrant collection.
8. Conjuring the Cursed Tales
Finally, it's time.
With everything set up, it’s time to see if our horror story search tool can deliver some real scares. Let’s try searching for a theme like “creepy clown” and see what we get:
import textwrap

# Function to wrap and print long text
def print_wrapped(text, width=80):
    wrapped_text = textwrap.fill(text, width=width)
    print(wrapped_text)

# Search result query
search_result = qdrant_client.query_points(
    collection_name=collection_name,
    query=model.encode("creepy clown").tolist(),
    limit=1,
)

# Access the first result
if search_result.points:
    tale = search_result.points[0]

    # Print specific payload fields
    print("ID:", tale.id)
    print("Score:", tale.score)
    print("Original:", tale.payload.get('isOriginal', 'N/A'))
    print("Title:", tale.payload.get('title', 'N/A'))
    print("Author:", tale.payload.get('author', 'N/A'))
    print("Subreddit:", tale.payload.get('subreddit', 'N/A'))
    print("URL:", tale.payload.get('url', 'N/A'))

    # Print the text of the story separately with word wrapping for readability
    print("\nStory Text:\n")
    print_wrapped(tale.payload.get('text', 'No text available'), width=80)
else:
    print("No results found.")
The result popped up, and there it was: a story titled “Sneaky Peeky.”
And honestly, it was CREEPY.
Is it based on a true story? Honestly, I don’t know. It leaves you with that lingering unease, like something’s watching. It’s quite long, so I won’t post it here, but if you want to see for yourself, go ahead and run the program.
You can explore any other atmosphere: “haunted house,” “creepy forest,” “possessed doll,” or whatever you’re in the mood for. Who knows? You might find something even creepier.
If you do, please post it in the comments. I’d love to see what else this thing can discover.
Next Steps
Thanks for sticking with me through this Halloween experiment! If you’ve followed along, you’ve now taken your first step into the world of vector search and learned how to find stories that feel creepy rather than just containing spooky words.
If you’re ready to delve into the dark arts of vector search, there are plenty of more advanced topics to explore, like multitenancy, payload structures, and bulk uploads.
So go ahead and see just how deep you can go.
Happy hunting! 👻