This is a submission for the Redis AI Challenge: Real-Time AI Innovators.
How I built a spam-fighting system with Redis 8 Vector Sets and what I learned about the inner workings of machine learning
What I Built
A spam classification system for dev.to that allows you to trace the entire journey from raw text to the final "spam/not spam" decision. Every step is transparent, measurable, and explainable.
In the era of ChatGPT and ready-made AI APIs, it's easy to forget that complex mathematics lies behind beautiful interfaces. When I started this project, I had a simple goal: to understand how vector search actually works, rather than just using it as a "magical" function.
Demo
🛡️ Redis8 Spam Guard
An intelligent spam classification system for dev.to posts using Redis 8 Vector Sets and FastAPI.
🎯 Features
- Real-time classification - instant analysis of posts.
- Redis 8 Vector Sets - leveraging the latest vector search technology.
- FastAPI - a modern asynchronous API.
- Machine Learning - classification based on k-NN with vector embeddings.
- Interactive Web UI - a dashboard for moderators with real-time post classification, manual checking, statistics, and training logs.
- Dynamic Data Enrichment - automatic loading of additional data, such as the author's follower count, for more accurate classification.
- Feedback Loop - a system for improving the model based on moderator feedback.
🏗️ Architecture
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Dev.to API │ -> │ FastAPI Server │ -> │ Redis 8 │
│ (Posts, Users) │ │ (API, Web UI) │ │ (Vector Sets) │
└─────────────────┘ └──────────────────┘ └─────────────────┘
Components:
- Data Collector - collects posts and user data from…
The complete project code is available on GitHub. The system works in real-time and is ready for demonstration.
How I Used Redis 8
Vector Sets — More Than Just Search
Redis Vector Sets turned out to be not just a database, but a full-fledged semantic search engine. Creating the index:
from redis.commands.search.field import VectorField, TagField, TextField

schema = (
    VectorField("vector", "HNSW", {"TYPE": "FLOAT32", "DIM": self.vector_dim, "DISTANCE_METRIC": "COSINE"}),
    TagField("label"),
    TextField("title"),
    TagField("url")
)
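The schema by itself doesn't build anything; an explicit create-index call does. A minimal sketch of that step, assuming the redis-py search client (the project's actual call may differ):

    # Create the index over the schema defined above; Redis starts indexing
    # matching hashes immediately (a sketch, not the project's exact code).
    await self.redis_client.ft(self.index_name).create_index(schema)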
HNSW (Hierarchical Navigable Small World) isn't just an algorithm; it's an entire philosophy of search. It builds a multi-level graph of connections between vectors, allowing you to find approximate nearest neighbors in roughly logarithmic time.
The Heart of Vector Search: How Redis Finds Similar Posts
The real magic happens in the vector search itself. The search for vectors occurs in the `core.py` file, inside the `find_similar_posts` method of the `RedisVectorClassifier` class.
Here's the specific code section that sends the query to Redis:
import numpy as np
from typing import Any, Dict, List

async def find_similar_posts(self, query_vector: np.ndarray, k: int = 5) -> List[Dict[str, Any]]:
    """Finds similar posts."""
    if not self.redis_client:
        return []
    try:
        # 1. Query formation
        query = f"*=>[KNN {k} @vector $blob AS score]"
        # 2. Executing the search command
        results = await self.redis_client.execute_command(
            "FT.SEARCH", self.index_name, query,
            "PARAMS", "2", "blob", query_vector.tobytes(),
            "DIALECT", "2",
            "RETURN", "3", "score", "title", "url"
        )
        # ... (result processing follows)
How this works:
- `query = f"*=>[KNN {k} @vector $blob AS score]"`: this line forms the search query itself.
  - `*` searches across all documents in the index.
  - `=>` applies the following operation to the results.
  - `[KNN {k} @vector $blob AS score]` is the vector search part (KNN stands for k-Nearest Neighbors). It tells Redis: "Find the k nearest neighbors in the vector field, using the vector we'll pass in the `$blob` parameter, and name the similarity field `score`."
- `await self.redis_client.execute_command(...)`: this line sends the FT.SEARCH command to Redis, passing it:
  - The index name (`self.index_name`).
  - The formed query (`query`).
  - The actual vector for the search (`query_vector.tobytes()`) as the `$blob` parameter.
This command is what makes Redis use its powerful algorithm (HNSW) for ultra-fast search of the most similar vectors in the database.
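The snippet above stops before the result processing, so here is a minimal parsing sketch of my own (not the project's code) for the raw FT.SEARCH reply, which comes back as a flat list: the total count, then alternating document keys and field lists:

    def parse_search_reply(results) -> list:
        """Turn a raw FT.SEARCH reply into a list of dicts.
        Assumes the Redis client was created with decode_responses=True."""
        posts = []
        for i in range(1, len(results), 2):
            key, fields = results[i], results[i + 1]
            doc = {fields[j]: fields[j + 1] for j in range(0, len(fields), 2)}
            posts.append({
                "key": key,
                # with COSINE, the returned score is a distance: 0 means identical
                "score": float(doc.get("score", 0.0)),
                "title": doc.get("title", ""),
                "url": doc.get("url", ""),
            })
        return posts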
k-NN — Democracy of Vectors
Classification of new posts happens through "voting" of nearest neighbors:
from collections import Counter

# Find 9 nearest neighbors
similar_posts = await self.find_similar_posts(query_vector, k=9)
# Collect their labels
labels = [post['label'] for post in similar_posts]
# Majority wins
predicted_label, vote_count = Counter(labels).most_common(1)[0]
confidence = vote_count / len(labels)
The magic is in the details: why exactly k=9? After experimenting, I found that an odd number of neighbors avoids ties in a binary vote, and 9 provides a sufficient sample without losing accuracy.
Redis as the Architecture Foundation
The system is built entirely around Redis Vector Sets capabilities. Without Redis, there is no system — it's not just a database, but the core of all classification logic.
Why Look Under the Hood?
Anatomy of "Training": What Actually Happens?
First Discovery: We Don't Train the Model
When people say "model training," they usually imagine a neural network learning from data. In my case, everything turned out differently. I don't train a model; I use the ready-made `all-MiniLM-L6-v2` as a "translator" from human language to the language of mathematics.
from sentence_transformers import SentenceTransformer

# This isn't model training, it's creating a knowledge base
model = SentenceTransformer('all-MiniLM-L6-v2')
text_vector = model.encode("How to earn $1000 in a week!")
# We get an array of 384 numbers reflecting the text's meaning
The real "training" happens at the level of creating a database of vectors with correct labels.
But there's a catch: Redis still requires an external model for vectorization. It would be revolutionary if Redis could accept raw text and create the vectors itself. Imagine:
# Dream: pass text, get ready search
await redis_client.hset("post:123", mapping={
    "text": "How to earn $1000 in a week!",  # Just text!
    "label": "spam"
})
# Redis creates the vector under the hood
Such integration would eliminate the need to manage separate embedding models and make Redis a truly self-sufficient AI platform.
Second Discovery: Automatic Labeling is an Art of Heuristics
The biggest problem in any ML project is obtaining labeled data. I didn't have thousands of posts with "spam/not spam" labels, so I had to create `SpamLabelGenerator`, a rule system for automatic labeling:
def calculate_spam_score(self, article: Dict) -> float:
    score = 0.0
    # Pull the fields we score on (field names as returned by the dev.to API)
    title = article.get('title', '').lower()
    reading_time = article.get('reading_time_minutes', 0)
    reactions = article.get('public_reactions_count', 0)
    tags = article.get('tag_list', [])

    # Spam words in title (high weight)
    for keyword in self.spam_keywords:
        if keyword in title:
            score += 0.3
    # Short posts with low engagement
    if reading_time < 2 and reactions < 5:
        score += 0.3
    # Too many tags
    if len(tags) > 10:
        score += 0.2
    return min(max(score, 0.0), 1.0)
Challenge: How do you find the balance between precision and recall? If the rules are too strict, we miss spam; if they are too lenient, we block legitimate content.
Solution: I added an element of randomness and combined multiple criteria. The system analyzes not only the text, but also engagement metrics, the author's profile, and publication time.
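To make this concrete, here is a hypothetical sketch of turning the score into a label (the threshold and the jitter range are my assumptions, not the project's exact values):

    import random

    def label_article(self, article: dict) -> str:
        score = self.calculate_spam_score(article)
        # A small random jitter keeps borderline posts from always landing
        # on the same side of the threshold.
        score += random.uniform(-0.05, 0.05)
        return "spam" if score >= 0.5 else "legit"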
Third Discovery: Vectors Are Not Just Numbers
Initially, I thought a vector was just a "fingerprint" of text. In practice, it turned out you can create hybrid vectors by combining semantic embeddings with numerical features:
import numpy as np

# Text vector (384 dimensions)
text_vector = self.model.encode(combined_text)
# Numerical features (3 dimensions)
numeric_features = np.array([
    reading_time,
    user_followers,
    len(tags)
], dtype=np.float32)
# Combine into a single vector (387 dimensions)
final_vector = np.concatenate([text_vector, numeric_features])
Insight: Vector space can be "enriched" with any numerical data. Posts become similar not only in meaning but also in structure.
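One caveat: raw counts such as follower numbers live on a completely different scale than the unit-normalized embedding values, so without scaling they can dominate the cosine distance. A minimal sketch of squashing the numeric part before concatenation (my own illustration; the project may handle this differently):

    import numpy as np

    def scale_numeric_features(reading_time: float, user_followers: float, n_tags: int) -> np.ndarray:
        # log1p squashes large counts; dividing by a rough "typical" value keeps
        # the numeric part on the same order of magnitude as the embedding.
        return np.array([
            np.log1p(reading_time) / 3.0,
            np.log1p(user_followers) / 10.0,
            n_tags / 10.0,
        ], dtype=np.float32)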
Redis as a "Card Index of Meanings"
Challenges and Solutions
Problem 1: Quality of Automatic Labeling
Challenge: How to verify that automatic labels are correct?
Solution: Added detailed logging and quality metrics:
precision = tp / (tp + fp)
recall = tp / (tp + fn)
metrics = {
    'accuracy': (tp + tn) / total,
    'precision': precision,
    'recall': recall,
    'f1_score': 2 * (precision * recall) / (precision + recall)
}
Problem 2: Data Imbalance — Too Little Spam
Challenge: Data from the dev.to API contained very little real spam. The model couldn't learn effectively because it lacked examples of "bad" behavior.
Solution: Created a synthetic spam dataset and mixed it with real data:
import json

# Load synthetic spam posts
try:
    with open('spam_dataset.json', 'r', encoding='utf-8') as f:
        spam_articles = json.load(f)
    # Mark as known spam
    for article in spam_articles:
        article['is_known_spam'] = True
    articles.extend(spam_articles)
except FileNotFoundError:
    logger.warning("spam_dataset.json not found")
Synthetic dataset composition (50 examples):
- Quick money schemes and crypto scams
- Phishing links and fake security notifications
- SEO services sales and follower boosting
- Dubious courses and "miracle products"
System Evolution: From Static to Self-Learning
The first version of the system had a critical flaw — it couldn't learn from its mistakes. Every moderator correction was lost after page reload.
Feedback Loop in Action
The solution turned out to be elegant: with each moderator click, the complete post content is saved in Redis with an expert verdict:
import json

# Check for a moderator verdict stored in Redis
feedback_key = f"feedback:{post.id}"
feedback_data = await redis_classifier.redis_client.get(feedback_key)
if feedback_data:
    feedback_json = json.loads(feedback_data)
    moderator_verdict = "spam" if feedback_json.get("is_spam") else "legit"
During the next training, this data is loaded with the highest priority, turning every mistake into an opportunity for improvement.
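That snippet is the read side. The write side, triggered by the moderator's click, might look roughly like this (a sketch: the key name matches the read path above, the payload fields are my assumptions):

    import json

    async def save_moderator_feedback(redis_client, post: dict, is_spam: bool):
        # Store the expert verdict together with the post content so the next
        # training run can load it with the highest priority.
        feedback_key = f"feedback:{post['id']}"
        await redis_client.set(feedback_key, json.dumps({
            "is_spam": is_spam,
            "title": post.get("title", ""),
            "body": post.get("body_markdown", ""),
        }))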
Dramatic Results
Before feedback loop:
- Accuracy: 78-85%
- Precision: ~85%
After implementation (just 9 expert corrections):
- Accuracy: 96.4%
- Precision: 100% — complete absence of false positives
Insight: 9 expert moderator decisions proved more valuable than 1000 automatically labeled posts. Data quality beats quantity.
Training Control Panel
The interface includes a "Training Control Panel" that displays current model information:
- Last known accuracy
- Number of trained examples (vectors in the database)
Retraining itself starts only after clicking the "Start New Training" button, and the process can be observed in real time through automatically updated logs.
What I Learned About the "Black Box"
1. Vectors Have Geometry
Texts similar in meaning really are located close together in vector space. This isn't a metaphor; it's a mathematical reality that can be measured with cosine similarity or Euclidean distance.
2. Data Quality Matters More Than Algorithm
The most advanced algorithm won't help if the data is poorly labeled. 80% of project success is quality preparation of the training sample.
3. Explainability Can Be Built In
Every system decision is accompanied by detailed explanation:
reasoning = [
    "Similar to known spam posts (via Redis)",
    "Contains spam keywords: 'earn money'",
    "Low follower count (5)",
    "Short reading time (1 minute)"
]
4. Redis as Architecture Foundation
As noted earlier, the system is built entirely around Redis Vector Sets capabilities. Without Redis there is no system: it's not just a database, but the core of all classification logic.
Conclusions: Why Open the "Black Box"?
This project taught me that Redis Vector Sets isn't just a new feature, but a revolution in the approach to semantic search:
- Speed: Vector search works in milliseconds even on thousands of records
- Scalability: System is ready for growth without architectural changes
- Transparency: Every decision can be traced and explained
Redis Vector Sets transformed the complex task of semantic search into an elegant and understandable solution.
Looking to the future: The next logical step would be to embed vectorization directly into Redis. Instead of the chain "text → external model → vector → Redis" we would get simply "text → Redis". This would make Redis the only component needed for a full-fledged AI system — from raw data to ready search results.