This is a submission for the Redis AI Challenge: Real-Time AI Innovators.
How I built a spam-fighting system with Redis 8 Vector Sets and what I learned about the inner workings of machine learning
What I Built
A spam classification system for dev.to that allows you to trace the entire journey from raw text to the final "spam/not spam" decision. Every step is transparent, measurable, and explainable.
In the era of ChatGPT and ready-made AI APIs, it's easy to forget that complex mathematics lies behind beautiful interfaces. When I started this project, I had a simple goal: to understand how vector search actually works, rather than just using it as a "magical" function.
Demo
🛡️ Redis8 Spam Guard
An intelligent spam classification system for dev.to posts using Redis 8 Vector Sets and FastAPI.
🎯 Features
- Real-time classification - instant analysis of posts.
- Redis 8 Vector Sets - leveraging the latest vector search technology.
- FastAPI - a modern asynchronous API.
- Machine Learning - classification based on k-NN with vector embeddings.
- Interactive Web UI - a dashboard for moderators with real-time post classification, manual checking, statistics, and training logs.
- Dynamic Data Enrichment - automatic loading of additional data, such as the author's follower count, for more accurate classification.
- Feedback Loop - a system for improving the model based on moderator feedback.
🏗️ Architecture
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Dev.to API │ -> │ FastAPI Server │ -> │ Redis 8 │
│ (Posts, Users) │ │ (API, Web UI) │ │ (Vector Sets) │
└─────────────────┘ └──────────────────┘ └─────────────────┘
Components:
- Data Collector - collects posts and user data from…
The complete project code is available on GitHub. The system works in real-time and is ready for demonstration.
How I Used Redis 8
Vector Sets — More Than Just Search
Redis Vector Sets turned out to be not just a database, but a full-fledged semantic search engine. Creating the index:
from redis.commands.search.field import VectorField, TagField, TextField

schema = (
    VectorField("vector", "HNSW", {"TYPE": "FLOAT32", "DIM": self.vector_dim, "DISTANCE_METRIC": "COSINE"}),
    TagField("label"),
    TextField("title"),
    TagField("url")
)
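The schema by itself doesn't build anything; an explicit create-index call does. A minimal sketch of that step, assuming the redis-py search client (the project's actual call may differ):

    # Create the index over the schema defined above; Redis starts indexing
    # matching hashes immediately (a sketch, not the project's exact code).
    await self.redis_client.ft(self.index_name).create_index(schema)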
HNSW (Hierarchical Navigable Small World) isn't just an algorithm; it's an entire philosophy of search. It builds a multi-level graph of connections between vectors, allowing you to find approximate nearest neighbors in roughly logarithmic time.
The Heart of Vector Search: How Redis Finds Similar Posts
The real magic happens in the vector search itself. The search for vectors occurs in the `core.py` file, inside the `find_similar_posts` method of the `RedisVectorClassifier` class.
Here's the specific code section that sends the query to Redis:
import numpy as np
from typing import Any, Dict, List

async def find_similar_posts(self, query_vector: np.ndarray, k: int = 5) -> List[Dict[str, Any]]:
    """Finds similar posts."""
    if not self.redis_client:
        return []
    try:
        # 1. Query formation
        query = f"*=>[KNN {k} @vector $blob AS score]"
        # 2. Executing the search command
        results = await self.redis_client.execute_command(
            "FT.SEARCH", self.index_name, query,
            "PARAMS", "2", "blob", query_vector.tobytes(),
            "DIALECT", "2",
            "RETURN", "3", "score", "title", "url"
        )
        # ... (result processing follows)
How this works:
- `query = f"*=>[KNN {k} @vector $blob AS score]"`: this line forms the search query itself.
  - `*` searches across all documents in the index.
  - `=>` applies the following operation to the results.
  - `[KNN {k} @vector $blob AS score]` is the vector search part (KNN stands for k-Nearest Neighbors). It tells Redis: "Find the k nearest neighbors in the vector field, using the vector we'll pass in the `$blob` parameter, and name the similarity field `score`."
- `await self.redis_client.execute_command(...)`: this line sends the FT.SEARCH command to Redis, passing it:
  - The index name (`self.index_name`).
  - The formed query (`query`).
  - The actual vector for the search (`query_vector.tobytes()`) as the `$blob` parameter.
This command is what makes Redis use its powerful algorithm (HNSW) for ultra-fast search of the most similar vectors in the database.
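The snippet above stops before the result processing, so here is a minimal parsing sketch of my own (not the project's code) for the raw FT.SEARCH reply, which comes back as a flat list: the total count, then alternating document keys and field lists:

    def parse_search_reply(results) -> list:
        """Turn a raw FT.SEARCH reply into a list of dicts.
        Assumes the Redis client was created with decode_responses=True."""
        posts = []
        for i in range(1, len(results), 2):
            key, fields = results[i], results[i + 1]
            doc = {fields[j]: fields[j + 1] for j in range(0, len(fields), 2)}
            posts.append({
                "key": key,
                # with COSINE, the returned score is a distance: 0 means identical
                "score": float(doc.get("score", 0.0)),
                "title": doc.get("title", ""),
                "url": doc.get("url", ""),
            })
        return posts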
k-NN — Democracy of Vectors
Classification of new posts happens through "voting" of nearest neighbors:
from collections import Counter

# Find 9 nearest neighbors
similar_posts = await self.find_similar_posts(query_vector, k=9)
# Collect their labels
labels = [post['label'] for post in similar_posts]
# Majority wins
predicted_label, vote_count = Counter(labels).most_common(1)[0]
confidence = vote_count / len(labels)
The magic is in the details: why exactly k=9? After experimenting, I found that an odd number of neighbors avoids ties in a binary vote, and 9 provides a sufficient sample without losing accuracy.
Redis as the Architecture Foundation
The system is built entirely around Redis Vector Sets capabilities. Without Redis, there is no system — it's not just a database, but the core of all classification logic.
Why Look Under the Hood?
Anatomy of "Training": What Actually Happens?
First Discovery: We Don't Train the Model
When people say "model training," they usually imagine a neural network learning from data. In my case, everything turned out differently. I don't train a model; I use the ready-made `all-MiniLM-L6-v2` as a "translator" from human language to the language of mathematics.
from sentence_transformers import SentenceTransformer

# This isn't model training, it's creating a knowledge base
model = SentenceTransformer('all-MiniLM-L6-v2')
text_vector = model.encode("How to earn $1000 in a week!")
# We get an array of 384 numbers reflecting the text's meaning
The real "training" happens at the level of creating a database of vectors with correct labels.
But there's a catch: Redis still requires an external model for vectorization. It would be revolutionary if Redis could accept raw text and create the vectors itself. Imagine:
# Dream: pass text, get ready search
await redis_client.hset("post:123", mapping={
    "text": "How to earn $1000 in a week!",  # Just text!
    "label": "spam"
})
# Redis creates the vector under the hood
Such integration would eliminate the need to manage separate embedding models and make Redis a truly self-sufficient AI platform.
Second Discovery: Automatic Labeling is an Art of Heuristics
The biggest problem in any ML project is obtaining labeled data. I didn't have thousands of posts with "spam/not spam" labels, so I had to create `SpamLabelGenerator`, a rule system for automatic labeling:
def calculate_spam_score(self, article: Dict) -> float:
    score = 0.0
    # Pull the fields we score on (field names as returned by the dev.to API)
    title = article.get('title', '').lower()
    reading_time = article.get('reading_time_minutes', 0)
    reactions = article.get('public_reactions_count', 0)
    tags = article.get('tag_list', [])

    # Spam words in title (high weight)
    for keyword in self.spam_keywords:
        if keyword in title:
            score += 0.3
    # Short posts with low engagement
    if reading_time < 2 and reactions < 5:
        score += 0.3
    # Too many tags
    if len(tags) > 10:
        score += 0.2
    return min(max(score, 0.0), 1.0)
Challenge: How do you find the balance between precision and recall? If the rules are too strict, we miss spam; if they are too lenient, we block legitimate content.
Solution: I added an element of randomness and combined multiple criteria. The system analyzes not only the text, but also engagement metrics, the author's profile, and publication time.
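To make this concrete, here is a hypothetical sketch of turning the score into a label (the threshold and the jitter range are my assumptions, not the project's exact values):

    import random

    def label_article(self, article: dict) -> str:
        score = self.calculate_spam_score(article)
        # A small random jitter keeps borderline posts from always landing
        # on the same side of the threshold.
        score += random.uniform(-0.05, 0.05)
        return "spam" if score >= 0.5 else "legit"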
Third Discovery: Vectors Are Not Just Numbers
Initially, I thought a vector was just a "fingerprint" of text. In practice, it turned out you can create hybrid vectors by combining semantic embeddings with numerical features:
import numpy as np

# Text vector (384 dimensions)
text_vector = self.model.encode(combined_text)
# Numerical features (3 dimensions)
numeric_features = np.array([
    reading_time,
    user_followers,
    len(tags)
], dtype=np.float32)
# Combine into a single vector (387 dimensions)
final_vector = np.concatenate([text_vector, numeric_features])
Insight: Vector space can be "enriched" with any numerical data. Posts become similar not only in meaning but also in structure.
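One caveat: raw counts such as follower numbers live on a completely different scale than the unit-normalized embedding values, so without scaling they can dominate the cosine distance. A minimal sketch of squashing the numeric part before concatenation (my own illustration; the project may handle this differently):

    import numpy as np

    def scale_numeric_features(reading_time: float, user_followers: float, n_tags: int) -> np.ndarray:
        # log1p squashes large counts; dividing by a rough "typical" value keeps
        # the numeric part on the same order of magnitude as the embedding.
        return np.array([
            np.log1p(reading_time) / 3.0,
            np.log1p(user_followers) / 10.0,
            n_tags / 10.0,
        ], dtype=np.float32)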
Redis as a "Card Index of Meanings"
Challenges and Solutions
Problem 1: Quality of Automatic Labeling
Challenge: How to verify that automatic labels are correct?
Solution: Added detailed logging and quality metrics:
precision = tp / (tp + fp)
recall = tp / (tp + fn)
metrics = {
    'accuracy': (tp + tn) / total,
    'precision': precision,
    'recall': recall,
    'f1_score': 2 * (precision * recall) / (precision + recall)
}
Problem 2: Data Imbalance — Too Little Spam
Challenge: Data from the dev.to API contained very little real spam. The model couldn't learn effectively because it lacked examples of "bad" behavior.
Solution: Created a synthetic spam dataset and mixed it with real data:
import json

# Load synthetic spam posts
try:
    with open('spam_dataset.json', 'r', encoding='utf-8') as f:
        spam_articles = json.load(f)
    # Mark as known spam
    for article in spam_articles:
        article['is_known_spam'] = True
    articles.extend(spam_articles)
except FileNotFoundError:
    logger.warning("spam_dataset.json not found")
Synthetic dataset composition (50 examples):
- Quick money schemes and crypto scams
- Phishing links and fake security notifications
- SEO services sales and follower boosting
- Dubious courses and "miracle products"
System Evolution: From Static to Self-Learning
The first version of the system had a critical flaw — it couldn't learn from its mistakes. Every moderator correction was lost after page reload.
Feedback Loop in Action
The solution turned out to be elegant: with each moderator click, the complete post content is saved in Redis with an expert verdict:
import json

# Check for a moderator verdict stored in Redis
feedback_key = f"feedback:{post.id}"
feedback_data = await redis_classifier.redis_client.get(feedback_key)
if feedback_data:
    feedback_json = json.loads(feedback_data)
    moderator_verdict = "spam" if feedback_json.get("is_spam") else "legit"
During the next training, this data is loaded with the highest priority, turning every mistake into an opportunity for improvement.
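That snippet is the read side. The write side, triggered by the moderator's click, might look roughly like this (a sketch: the key name matches the read path above, the payload fields are my assumptions):

    import json

    async def save_moderator_feedback(redis_client, post: dict, is_spam: bool):
        # Store the expert verdict together with the post content so the next
        # training run can load it with the highest priority.
        feedback_key = f"feedback:{post['id']}"
        await redis_client.set(feedback_key, json.dumps({
            "is_spam": is_spam,
            "title": post.get("title", ""),
            "body": post.get("body_markdown", ""),
        }))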
Dramatic Results
Before feedback loop:
- Accuracy: 78-85%
- Precision: ~85%
After implementation (just 9 expert corrections):
- Accuracy: 96.4%
- Precision: 100% — complete absence of false positives
Insight: 9 expert moderator decisions proved more valuable than 1000 automatically labeled posts. Data quality beats quantity.
Training Control Panel
The interface includes a "Training Control Panel" that displays current model information:
- Last known accuracy
- Number of trained examples (vectors in the database)
Retraining itself starts only after clicking the "Start New Training" button, and the process can be observed in real time through automatically updated logs.
What I Learned About the "Black Box"
1. Vectors Have Geometry
Texts similar in meaning really are located close together in vector space. This isn't a metaphor; it's a mathematical reality that can be measured with cosine similarity or Euclidean distance.
2. Data Quality Matters More Than Algorithm
The most advanced algorithm won't help if the data is poorly labeled. 80% of project success is quality preparation of the training sample.
3. Explainability Can Be Built In
Every system decision is accompanied by detailed explanation:
reasoning = [
    "Similar to known spam posts (via Redis)",
    "Contains spam keywords: 'earn money'",
    "Low follower count (5)",
    "Short reading time (1 minute)"
]
4. Redis as Architecture Foundation
As noted earlier, the system is built entirely around Redis Vector Sets capabilities. Without Redis there is no system: it's not just a database, but the core of all classification logic.
Conclusions: Why Open the "Black Box"?
This project taught me that Redis Vector Sets isn't just a new feature, but a revolution in the approach to semantic search:
- Speed: Vector search works in milliseconds even on thousands of records
- Scalability: System is ready for growth without architectural changes
- Transparency: Every decision can be traced and explained
Redis Vector Sets transformed the complex task of semantic search into an elegant and understandable solution.
Looking to the future: The next logical step would be to embed vectorization directly into Redis. Instead of the chain "text → external model → vector → Redis" we would get simply "text → Redis". This would make Redis the only component needed for a full-fledged AI system — from raw data to ready search results.