DEV Community

Mervin

How I Built Semantic Discussion Clustering Without Embeddings (and Why It Was Good Enough)

The Problem

I wanted to monitor discussions around products, bugs and trends across communities.

Examples:

Reddit
Hacker News
GitHub Issues
Forums

Most solutions rely on:

OpenAI embeddings
Pinecone
Weaviate
Expensive vector search

For a side project, that cost didn't make sense.

My Goal

I wanted something that could:

Run on a cheap VPS
Process thousands of posts
Group similar discussions
Generate summaries

Without paying for embeddings.

The Approach

Instead of embeddings, I used:

Post -> Tokenize -> TF-IDF Vector -> Cosine Similarity -> Cluster

The implementation ended up surprisingly simple.
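To make the pipeline concrete, here is a minimal sketch of the tokenize, TF-IDF, and cosine similarity steps in plain Node.js. It is not the project's actual code: the function names, the tokenizer rules, the smoothed IDF formula, and the sample posts are all assumptions chosen for illustration.

```javascript
// Minimal TF-IDF + cosine similarity sketch (illustrative; not the project's code).

// Lowercase, split on non-alphanumerics, drop empty and one-character tokens.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 1);
}

// Build one sparse TF-IDF vector (Map of term -> weight) per document.
function tfidfVectors(docs) {
  const tokenized = docs.map(tokenize);

  // Document frequency: in how many documents does each term appear?
  const docFreq = new Map();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) {
      docFreq.set(term, (docFreq.get(term) || 0) + 1);
    }
  }

  const n = docs.length;
  return tokenized.map((tokens) => {
    const counts = new Map();
    for (const term of tokens) counts.set(term, (counts.get(term) || 0) + 1);

    const vec = new Map();
    for (const [term, count] of counts) {
      const tf = count / tokens.length;
      // Smoothed IDF (an assumption): rare terms weigh more, common terms less.
      const idf = Math.log((1 + n) / (1 + docFreq.get(term))) + 1;
      vec.set(term, tf * idf);
    }
    return vec;
  });
}

// Cosine similarity between two sparse vectors; 0 if either is empty.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [term, w] of a) {
    normA += w * w;
    if (b.has(term)) dot += w * b.get(term);
  }
  for (const w of b.values()) normB += w * w;
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical sample posts.
const posts = [
  "OpenAI API rate limits keep throttling my requests",
  "Hitting rate limits on the OpenAI API again",
  "Best mechanical keyboard for programming",
];
const vectors = tfidfVectors(posts);
console.log(cosine(vectors[0], vectors[1]).toFixed(2)); // the two rate-limit posts
console.log(cosine(vectors[0], vectors[2]).toFixed(2)); // the unrelated post
```

The first two posts share the terms "openai", "api", "rate", and "limits", so they score well above the keyboard post, which shares no terms with them and scores exactly 0.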

The Hardest Part Wasn't Coding

The hardest problem was choosing a similarity threshold.

I tested:

0.15
0.20
0.25
0.30
0.35

Results:

0.35: too strict; almost every thread became its own cluster
0.15: too loose; unrelated discussions merged together
0.25: good enough

This single number took longer to tune than the rest of the project.
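The effect of the threshold is easy to reproduce on a toy example. The sketch below assumes a simple greedy, single-pass clustering scheme (the post does not say which clustering algorithm the project actually uses) and a hand-written similarity matrix with hypothetical values.

```javascript
// Greedy threshold clustering sketch. The clustering scheme is an assumption,
// and the similarity values below are made up for illustration.

// sim[i][j] is the cosine similarity between posts i and j.
const sim = [
  [1.0, 0.32, 0.18, 0.02],
  [0.32, 1.0, 0.27, 0.01],
  [0.18, 0.27, 1.0, 0.03],
  [0.02, 0.01, 0.03, 1.0],
];

function clusterByThreshold(sim, threshold) {
  const clusters = []; // each cluster is an array of post indices
  for (let i = 0; i < sim.length; i++) {
    // Join the first cluster whose seed (its first member) is similar enough.
    const home = clusters.find((members) => sim[i][members[0]] >= threshold);
    if (home) home.push(i);
    else clusters.push([i]);
  }
  return clusters;
}

// Sweep a few of the thresholds from the list above.
for (const t of [0.15, 0.25, 0.35]) {
  console.log(t, JSON.stringify(clusterByThreshold(sim, t)));
}
```

On this toy matrix, 0.15 yields two clusters, 0.25 yields three, and 0.35 puts every post in its own cluster. That is the same loose-to-strict pattern described above, just on a much smaller scale.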

Architecture

Crawler -> BullMQ -> TF-IDF Engine -> Cluster Engine -> PostgreSQL -> Groq Summary Generator

Stack:

Node.js
PostgreSQL
Redis
BullMQ
Groq
Vanilla JS

Why I Chose Groq

I didn't need GPT-4 quality.

I needed:

fast summaries
low cost
easy API

Groq was sufficient.
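For a sense of what "easy API" means in practice, here is a rough sketch of a summary call against Groq's OpenAI-compatible chat completions endpoint. The prompt wording, the function names, and the GROQ_API_KEY and GROQ_MODEL environment variable names are assumptions for this example, not taken from the project. No model id is hard-coded because available models change over time.

```javascript
// Sketch of a cluster summary request to Groq's OpenAI-compatible endpoint.
// Prompt text, function names, and env variable names are assumptions.

const GROQ_URL = "https://api.groq.com/openai/v1/chat/completions";

// Build the JSON body for one cluster of posts.
// `model` must be a model id that your Groq account can use.
function buildSummaryRequest(posts, model) {
  const joined = posts.map((p, i) => `${i + 1}. ${p}`).join("\n");
  return {
    model,
    temperature: 0.2,
    messages: [
      {
        role: "system",
        content: "Summarize the shared topic of these posts in two sentences.",
      },
      { role: "user", content: joined },
    ],
  };
}

// Send the request and return the summary text.
// Requires Node 18+ for the global fetch.
async function summarizeCluster(posts) {
  const res = await fetch(GROQ_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
    },
    body: JSON.stringify(buildSummaryRequest(posts, process.env.GROQ_MODEL)),
  });
  if (!res.ok) throw new Error(`Groq request failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Keeping the request builder separate from the network call makes the prompt easy to inspect and test without spending API quota.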

What Surprised Me

For discussion monitoring, TF-IDF works better than people expect.

Discussions about the same problem tend to repeat the same terms:

rate limits
api quota
billing
token cost

Those keywords alone are often enough to group discussions accurately.
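A quick way to see this is to list the terms two posts have in common. The sketch below is only an illustration: the stopword list and both example posts are made up, and the function names are hypothetical.

```javascript
// Illustration only: list the terms two posts share after dropping stopwords.
// The stopword list and the example posts are made up for this sketch.
const STOPWORDS = new Set([
  "the", "a", "an", "on", "my", "is", "are", "for", "and", "to", "of", "i", "in",
]);

// Unique, lowercased, non-stopword terms of a post.
function terms(text) {
  return new Set(
    text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t && !STOPWORDS.has(t))
  );
}

// Terms of `a` that also appear in `b`, in the order they occur in `a`.
function sharedTerms(a, b) {
  const other = terms(b);
  return [...terms(a)].filter((t) => other.has(t));
}

console.log(
  sharedTerms(
    "Rate limits and API quota are killing my billing",
    "Is the API quota tied to billing or to rate limits?"
  )
);
```

The two posts are phrased differently but share "rate", "limits", "api", "quota", and "billing". Shared terms like these are what push their TF-IDF cosine similarity up.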

Current Results

Example topic:

OpenAI Rate Limits

Generated clusters:

API throttling complaints
Increase request discussions
Azure OpenAI alternatives
Workarounds and solutions

The summaries are generated automatically.

Future Plans

I'm considering adding:

Hacker News
GitHub Issues
RSS feeds
Telegram alerts
Trend detection
Feature request extraction

Repository

GitHub:

https://github.com/melyx-id/discussion-radar

I'd love feedback from people building monitoring or intelligence tools.
