The Problem
I wanted to monitor discussions around products, bugs and trends across communities.
Examples:
Reddit
Hacker News
GitHub Issues
Forums
Most solutions rely on:
OpenAI embeddings
Pinecone
Weaviate
Expensive vector search
For a side project, that cost didn't make sense.
My Goal
I wanted something that could:
Run on a cheap VPS
Process thousands of posts
Group similar discussions
Generate summaries
All of this without paying for embeddings.
The Approach
Instead of embeddings, I used:
Post
↓
Tokenize
↓
TF-IDF Vector
↓
Cosine Similarity
↓
Cluster
The implementation ended up surprisingly simple.
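As a rough illustration of that pipeline, here is a minimal dependency-free sketch in plain JavaScript. It is not the code from the repository: the function names (tokenize, tfidfVectors, cosine), the tokenizer rules and the smoothed IDF formula are all assumptions chosen for brevity.

```javascript
// Minimal TF-IDF + cosine similarity sketch (plain JavaScript, no dependencies).
// All names and formulas here are illustrative assumptions, not the repository's code.

// Lowercase, split on non-alphanumerics, drop empty and one-character tokens.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 1);
}

// Build one sparse TF-IDF vector (Map of term -> weight) per document.
function tfidfVectors(docs) {
  const tokenized = docs.map(tokenize);

  // Document frequency: in how many documents does each term appear?
  const docFreq = new Map();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) {
      docFreq.set(term, (docFreq.get(term) || 0) + 1);
    }
  }

  const n = docs.length;
  return tokenized.map((tokens) => {
    const vec = new Map();
    for (const term of tokens) vec.set(term, (vec.get(term) || 0) + 1);
    for (const [term, count] of vec) {
      // Smoothed IDF: a term present in every document still gets a nonzero weight.
      const idf = Math.log((1 + n) / (1 + docFreq.get(term))) + 1;
      vec.set(term, (count / tokens.length) * idf);
    }
    return vec;
  });
}

// Cosine similarity between two sparse vectors; 0 if either vector is empty.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [term, w] of a) {
    normA += w * w;
    if (b.has(term)) dot += w * b.get(term);
  }
  for (const w of b.values()) normB += w * w;
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

const posts = [
  "OpenAI API rate limits keep throttling my requests",
  "Hitting rate limits on the OpenAI API again",
  "Best mechanical keyboard for programming?",
];
const vectors = tfidfVectors(posts);
console.log(cosine(vectors[0], vectors[1]).toFixed(2)); // two related posts
console.log(cosine(vectors[0], vectors[2]).toFixed(2)); // unrelated posts
```

The first two posts share the terms "openai", "api", "rate" and "limits", so they score well above the third post, which shares no terms with them and scores zero.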
The Hardest Part Wasn't Coding
The hardest problem was choosing a similarity threshold.
I tested:
0.15
0.20
0.25
0.30
0.35
Results:
0.35: too strict; almost every thread became its own cluster
0.15: too loose; unrelated discussions merged together
0.25: good enough
This single number took longer to tune than the rest of the project.
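To make the effect of the threshold concrete, here is a small sketch that clusters four posts at three of the values above. Two things are assumptions: the post does not say which clustering algorithm the project uses, so this uses a simple greedy single-pass approach, and the similarity scores are made up for illustration rather than measured.

```javascript
// Greedy single-pass clustering over a precomputed similarity matrix.
// ASSUMPTION: this is one simple option, not necessarily what the project does.
function clusterByThreshold(sim, threshold) {
  const clusters = []; // each cluster is an array of post indices; index 0 is its seed
  for (let i = 0; i < sim.length; i++) {
    // Join the first cluster whose seed post is similar enough, else start a new one.
    const home = clusters.find((c) => sim[c[0]][i] >= threshold);
    if (home) home.push(i);
    else clusters.push([i]);
  }
  return clusters;
}

// Hypothetical similarity scores (NOT measured data).
// Posts 0 and 1: rate-limit complaints, post 2: billing question, post 3: unrelated thread.
const sim = [
  [1.0, 0.3, 0.18, 0.05],
  [0.3, 1.0, 0.2, 0.04],
  [0.18, 0.2, 1.0, 0.05],
  [0.05, 0.04, 0.05, 1.0],
];

for (const threshold of [0.15, 0.25, 0.35]) {
  console.log(threshold, JSON.stringify(clusterByThreshold(sim, threshold)));
}
// 0.15 [[0,1,2],[3]]      billing merges into the rate-limit cluster
// 0.25 [[0,1],[2],[3]]    rate-limit posts group, billing stays separate
// 0.35 [[0],[1],[2],[3]]  every post becomes its own cluster
```

With scores in this range, a shift of 0.10 in either direction is enough to change the clustering completely, which fits how sensitive the tuning turned out to be.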
Architecture
Crawler
↓
BullMQ
↓
TF-IDF Engine
↓
Cluster Engine
↓
PostgreSQL
↓
Groq Summary Generator
Stack:
Node.js
PostgreSQL
Redis
BullMQ
Groq
Vanilla JS
Why I Chose Groq
I didn't need GPT-4 quality.
I needed:
fast summaries
low cost
easy API
Groq was sufficient.
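For reference, a summary call can be quite small. The sketch below assumes Groq's OpenAI-compatible chat completions endpoint and the global fetch in Node.js 18+. The function names, prompt wording and temperature are illustrative, not taken from the repository, and the model id is deliberately left as a parameter because the post does not name one.

```javascript
// Sketch: summarize one cluster through Groq's OpenAI-compatible chat completions endpoint.
// Function names, prompt and temperature are illustrative assumptions.
const GROQ_URL = "https://api.groq.com/openai/v1/chat/completions";

// Build the fetch() arguments separately so they can be inspected without a network call.
function buildSummaryRequest(posts, model, apiKey) {
  const body = {
    model, // pass a model id from the Groq console; none is assumed here
    temperature: 0.2,
    messages: [
      {
        role: "system",
        content: "Summarize the shared topic of these posts in two sentences.",
      },
      // Number the posts so the model sees them as separate items.
      { role: "user", content: posts.map((p, i) => `${i + 1}. ${p}`).join("\n") },
    ],
  };
  return {
    url: GROQ_URL,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify(body),
    },
  };
}

// Send the request and return the summary text.
async function summarizeCluster(posts, model, apiKey) {
  const { url, options } = buildSummaryRequest(posts, model, apiKey);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Groq request failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content.trim();
}
```

A call would look like summarizeCluster(clusterPosts, process.env.GROQ_MODEL, process.env.GROQ_API_KEY), where clusterPosts is a hypothetical array of post texts and both environment variable names are placeholders.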
What Surprised Me
For discussion monitoring, TF-IDF works better than people expect.
Most conversations contain repeated terms:
rate limits
api quota
billing
token cost
Those keywords alone are often enough to group discussions accurately.
Current Results
Example topic:
OpenAI Rate Limits
Generated clusters:
API throttling complaints
Increase request discussions
Azure OpenAI alternatives
Workarounds and solutions
The summaries are generated automatically.
Future Plans
I'm considering adding:
Hacker News
GitHub Issues
RSS feeds
Telegram alerts
Trend detection
Feature request extraction
Repository
GitHub: https://github.com/melyx-id/discussion-radar
I'd love feedback from people building monitoring or intelligence tools.