Most “what watch should I buy?” discussions online skew heavily male. A friend wanted to launch a women’s watch, so I helped with a small data analysis.
In this post I’ll walk through a small but complete Python pipeline I built:
- Scrape relevant posts and comments from Reddit with no API keys
- Filter out irrelevant posts (e.g. men asking for themselves)
- Run NLP analysis: sentiment, brands, features, prices, keywords, clustering, topic modeling
- Generate visualizations and CSVs you can explore further
Everything here is powered by standard Python libraries: requests, pandas, nltk, scikit‑learn, and wordcloud.
1. Collecting Reddit data without API keys
We didn’t use the official Reddit API; instead we hit the public JSON endpoints directly using requests.
At the top of reddit_json_scraper.py we define search URLs across multiple subreddits:
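A trimmed, illustrative version might look like this (the actual subreddits and query variants in the script may differ):

SEARCH_URLS = [
    # illustrative subreddits/queries -- the real list is longer
    "https://www.reddit.com/r/Watches/search.json?q=women%20watch&restrict_sr=1&limit=100",
    "https://www.reddit.com/r/femalefashionadvice/search.json?q=watch&restrict_sr=1&limit=100",
    "https://www.reddit.com/r/IndianFashionAddicts/search.json?q=watch&restrict_sr=1&limit=100",
]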
Each URL returns a JSON blob; we wrap that in a helper:
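Something like the sketch below, where the name fetch_json and the User-Agent string are illustrative. One real gotcha: Reddit's public endpoints reject requests that use the default requests User-Agent, so set your own.

import requests

HEADERS = {"User-Agent": "watch-research-script/0.1"}  # Reddit blocks default UAs

def fetch_json(url):
    """Fetch a Reddit JSON endpoint, returning None on failure."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        print(f"Request failed for {url}: {e}")
        return None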
Reddit’s listing JSON has a fairly nested structure, so we created extract_post_data to normalize it into a flat dictionary with the fields we actually care about (ID, subreddit, title, body, score, comment count, timestamps, etc.):
from datetime import datetime  # used for the human-readable created_date field

def extract_post_data(post_json):
    """
    Extract relevant information from a Reddit post JSON
    """
    try:
        data = post_json['data']
        return {
            'post_id': data.get('id', ''),
            'subreddit': data.get('subreddit', ''),
            'title': data.get('title', ''),
            'text': data.get('selftext', ''),
            'author': data.get('author', ''),
            'score': data.get('score', 0),
            'upvote_ratio': data.get('upvote_ratio', 0),
            'num_comments': data.get('num_comments', 0),
            'created_utc': data.get('created_utc', 0),
            'created_date': datetime.fromtimestamp(data.get('created_utc', 0)).strftime('%Y-%m-%d %H:%M:%S'),
            'url': f"https://reddit.com{data.get('permalink', '')}",
            'post_url': data.get('url', ''),
            'is_video': data.get('is_video', False),
            'over_18': data.get('over_18', False)
        }
    except (KeyError, TypeError):
        # Malformed entries (e.g. deleted posts) are skipped by the caller
        return None
The main collection loop simply iterates through all search URLs, fetches JSON, and appends normalized posts into a list:
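In outline, reusing the fetch_json sketch from above:

import time

all_posts = []
for url in SEARCH_URLS:
    listing = fetch_json(url)
    if listing is None:
        continue
    # A Reddit listing nests posts under data.children
    for child in listing['data']['children']:
        post = extract_post_data(child)
        if post is not None:
            all_posts.append(post)
    time.sleep(2)  # be polite to the unauthenticated endpoint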
We also fetch comments for the most “interesting” posts, sorted by engagement (score + num_comments), by hitting each post’s .json endpoint and walking the comment tree.
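The tree walk looks roughly like this (walk_comments, the body/score fields we keep, and the top-50 cutoff are illustrative):

def walk_comments(children, collected):
    """Recursively collect comments from a Reddit comment listing."""
    for child in children:
        if child.get('kind') != 't1':   # 't1' = comment; skips 'more' stubs
            continue
        data = child['data']
        collected.append({'body': data.get('body', ''),
                          'score': data.get('score', 0)})
        replies = data.get('replies')
        if isinstance(replies, dict):   # empty replies come back as ""
            walk_comments(replies['data']['children'], collected)

# A post's .json endpoint returns [post_listing, comment_listing]
top_posts = sorted(all_posts,
                   key=lambda p: p['score'] + p['num_comments'],
                   reverse=True)[:50]
all_comments = []
for post in top_posts:
    thread = fetch_json(f"https://www.reddit.com/comments/{post['post_id']}.json")
    if thread:
        walk_comments(thread[1]['data']['children'], all_comments)
    time.sleep(2)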
At the end of main() we save everything to CSV and run a quick text summary (brand and keyword counts, simple price stats).
2. Filtering: keeping posts that are really about women’s watches
Search results are noisy. Some posts mention “women” but are actually men asking for themselves.
filter_posts.py applies a simple but effective regex filter. We flag posts that contain phrases like “as a man” or “for men”:
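A representative version (the name FILTER_PATTERNS and the exact alternatives here are illustrative):

# illustrative -- the real pattern may list more phrases
FILTER_PATTERNS = r"(as a (man|guy|dude)|for (men|guys|myself)|i'?m a (man|guy))"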
…but we keep posts that clearly talk about buying for a woman, e.g. “gift for my wife”:
NON_FILTER_PATTERNS = r"(for|gift|buying|getting|choosing|help).*(mum|mom|mother|wife|girlfriend|partner|daughter|sister|woman|female|her|she)"
filter_check combines a post’s title and text and applies both patterns; filtered_posts_csv then writes the surviving rows to a cleaned filtered_posts.csv, which becomes the starting point for our analysis.
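Put together, the check reads roughly like this (letting keep-patterns win over flag-patterns is our design choice; the real script may order them differently):

import re

def filter_check(title, text):
    """Return True if the post should be kept."""
    combined = f"{title} {text}".lower()
    if re.search(NON_FILTER_PATTERNS, combined):
        return True   # clearly buying for a woman
    if re.search(FILTER_PATTERNS, combined):
        return False  # man asking for himself
    return True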
3. Analyzing the conversations with WatchDataAnalyzer
The main analysis lives in watch_analyzer.py as a single class. On construction it will (see the skeleton after this list):
- Load the filtered posts and comments
- Combine titles, bodies, and comment text into all_text
- Set up NLTK and VADER sentiment
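In skeleton form, it looks something like this (the comments CSV path and its body column are assumptions; adjust to whatever the scraper produced):

import nltk
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

class WatchDataAnalyzer:
    def __init__(self, posts_csv='filtered_posts.csv', comments_csv='comments.csv'):
        self.posts_df = pd.read_csv(posts_csv)
        self.comments_df = pd.read_csv(comments_csv)  # assumed filename/schema
        # One flat list of every piece of text we have
        self.all_text = (self.posts_df['title'].fillna('').tolist()
                         + self.posts_df['text'].fillna('').tolist()
                         + self.comments_df['body'].fillna('').tolist())
        nltk.download('vader_lexicon', quiet=True)  # VADER needs its lexicon once
        self.sia = SentimentIntensityAnalyzer()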
3.1. Light text cleaning
We remove URLs and normalize whitespace, then build a combined_text column per post:
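A minimal sketch of those two steps (clean_text is an illustrative helper name; preprocess_all_text is what the rest of the class calls):

# inside WatchDataAnalyzer (assumes `import re` at the top of the module)
def clean_text(self, text):
    """Strip URLs and collapse runs of whitespace."""
    text = re.sub(r'http\S+|www\.\S+', '', str(text))
    return re.sub(r'\s+', ' ', text).strip()

def preprocess_all_text(self):
    self.posts_df['combined_text'] = (
        self.posts_df['title'].fillna('') + ' ' + self.posts_df['text'].fillna('')
    ).apply(self.clean_text)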
3.2. Sentiment on posts and comments
Using VADER, we compute a compound score and label each post/comment as positive, neutral, or negative:
self.posts_df['sentiment_scores'] = self.posts_df['combined_text'].apply(
    lambda x: self.sia.polarity_scores(x)
)
self.posts_df['sentiment_compound'] = self.posts_df['sentiment_scores'].apply(lambda x: x['compound'])
self.posts_df['sentiment_label'] = self.posts_df['sentiment_compound'].apply(
    lambda x: 'positive' if x > 0.05 else ('negative' if x < -0.05 else 'neutral')
)
We do the same for comments and then plot the distribution, saving sentiment_dist.png.
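plot_sentiment_distribution is essentially a bar chart of those label counts; a minimal version might be:

import matplotlib.pyplot as plt

def plot_sentiment_distribution(self, filename='sentiment_dist.png'):
    counts = self.posts_df['sentiment_label'].value_counts()
    counts.plot(kind='bar', title='Post sentiment')
    plt.ylabel('Number of posts')
    plt.tight_layout()
    plt.savefig(filename)
    plt.close()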
3.3. Brands, price ranges, and features
We look at three practical angles:
- Brand mentions — a curated list from Titan and Seiko to Rolex and Omega, counted across all text.
def extract_brands(self):
    # Common watch brands
    brands = [
        'casio', 'seiko', 'citizen', 'timex', 'fossil', 'orient', 'tissot',
        'michael kors', 'daniel wellington', 'mvmt', 'skagen', 'swatch',
        'rolex', 'omega', 'cartier', 'tag heuer', 'breitling', 'patek philippe',
        'audemars piguet', 'vacheron constantin', 'baume mercier', 'longines',
        'hamilton', 'bulova', 'invicta', 'bering', 'titan', 'fastrack',
        'sonata', 'maxima', 'hmt', 'raymond weil', 'zenith', 'iwc'
    ]
    brand_mentions = {}
    # ...
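    # (a sketch of the elided loop: plain substring counts over all_text)
    for brand in brands:
        count = sum(str(text).lower().count(brand) for text in self.all_text)
        if count > 0:
            brand_mentions[brand] = count
    return dict(sorted(brand_mentions.items(), key=lambda kv: kv[1], reverse=True))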
- Price — regexes to capture Indian price patterns with ₹/rs/inr or “rupees”, then bucketed into budget/mid‑range/premium/luxury ranges.
def extract_prices(self):
    # Patterns for price extraction
    patterns = [
        r'(?:₹|rs\.?|inr)\s*(\d+(?:,\d{3})*(?:\.\d+)?)',
        r'(\d+(?:,\d{3})*(?:\.\d+)?)\s*(?:₹|rs\.?|inr)',
        r'(\d+(?:,\d{3})*(?:\.\d+)?)\s*(?:rupees|rupee)',
    ]
    all_prices = []
    for text in self.all_text:
        for pattern in patterns:
            matches = re.findall(pattern, str(text), re.IGNORECASE)
            for match in matches:
                # Strip thousands separators but keep the decimal point,
                # so "1,500.00" becomes 1500 rather than 150000
                all_prices.append(int(float(match.replace(',', ''))))
    ranges = {
        'Budget (<₹5,000)': sum(1 for p in all_prices if p < 5000),
        'Mid-range (₹5,000-₹20,000)': sum(1 for p in all_prices if 5000 <= p < 20000),
        'Premium (₹20,000-₹1,00,000)': sum(1 for p in all_prices if 20000 <= p < 100000),
        'Luxury (>₹1,00,000)': sum(1 for p in all_prices if p >= 100000)
    }
- Features — categories like size, material, movement, style, strap, and “features” (water resistance, sapphire, chronograph, etc.), each with their own keyword list.
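The category lists are plain keyword dictionaries inside extract_features; a trimmed, illustrative version (the keywords shown here are abridged):

feature_categories = {
    'size': ['small', 'petite', '28mm', '32mm', '36mm'],
    'material': ['steel', 'gold', 'rose gold', 'leather', 'ceramic'],
    'movement': ['quartz', 'automatic', 'mechanical', 'solar'],
    'style': ['minimalist', 'dressy', 'casual', 'vintage'],
    'strap': ['bracelet', 'mesh', 'nato', 'leather strap'],
    'features': ['water resistance', 'sapphire', 'chronograph', 'date'],
}
# Count every keyword's occurrences across all collected text
feature_counts = {
    category: {kw: sum(str(t).lower().count(kw) for t in self.all_text)
               for kw in keywords}
    for category, keywords in feature_categories.items()
}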
This gives a quick picture of which brands dominate, what price bands people discuss, and which attributes come up most.
3.4. Keywords, clusters, and topics
Using scikit‑learn:
- TF‑IDF keywords — we build a TfidfVectorizer over combined_text and save the top terms to keywords_tfidf.csv.
def extract_keywords(self):
    self.preprocess_all_text()
    vectorizer = TfidfVectorizer(
        max_features=80,
        stop_words='english',
        min_df=2
    )
    texts = self.posts_df['combined_text'].fillna('').tolist()
    X = vectorizer.fit_transform(texts)
    feature_names = vectorizer.get_feature_names_out()
    scores = X.mean(axis=0).A1
    # Create keyword dataframe
    keywords_df = pd.DataFrame({
        'keyword': feature_names,
        'tfidf_score': scores
    }).sort_values('tfidf_score', ascending=False)
- Clustering — we cluster posts into 5 groups using K‑Means over TF‑IDF vectors, then inspect top words per cluster.
def cluster_posts(self, n_clusters=5):
    """
    Cluster posts based on text similarity
    """
    # ...
    vectorizer = TfidfVectorizer(
        max_features=50,
        stop_words='english',
        min_df=2
    )
    texts = self.posts_df['combined_text'].fillna('').tolist()
    X = vectorizer.fit_transform(texts)
    # K-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(X)
    self.posts_df['cluster'] = clusters
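    # Inspect top words per cluster: the largest-weight terms in each centroid
    # (a sketch of the elided inspection step)
    terms = vectorizer.get_feature_names_out()
    for i, centroid in enumerate(kmeans.cluster_centers_):
        top_terms = [terms[j] for j in centroid.argsort()[-8:][::-1]]
        print(f"Cluster {i}: {', '.join(top_terms)}")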
- Topic modeling — we run LDA/NMF over the same vectors to discover high‑level themes (“budget gifts”, “small wrists and office wear”, “sporty/outdoor”, etc.).
def topic_modeling(self, n_topics=5, method='lda'):
    """
    Perform topic modeling using LDA or NMF
    """
    # ...
    vectorizer = TfidfVectorizer(
        max_features=100,
        stop_words='english',
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.95
    )
    # ...
    if method.lower() == 'lda':
        model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    else:  # NMF
        model = NMF(n_components=n_topics, random_state=42)
    # ...
    for idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[-10:][::-1]]
        print(f"\n  Topic {idx + 1}: {', '.join(top_words)}")
4. Putting it all together
The generate_report() method runs the full pipeline:
- Preprocess text
- Run sentiment, brand/feature/price extraction
- Compute keywords, clusters, and topics
- Generate a word cloud and sentiment plot
- Save everything to CSVs you can open in Excel or a notebook
def generate_report(self):
    # 1. Preprocess the text
    self.preprocess_all_text()
    # 2. Analyze sentiment
    sentiment_df = self.analyze_sentiment()
    # 3. Brand mentions
    brands = self.extract_brands()
    # 4. Features
    features = self.extract_features()
    # 5. Prices
    prices = self.extract_prices()
    # 6. Keywords
    keywords_df = self.extract_keywords()
    # 7. Clustering
    clusters = self.cluster_posts(n_clusters=5)
    # 8. Topic modeling
    print("\n🔍 Running topic modeling (this may take a moment)...")
    topic_model, vectorizer = self.topic_modeling(n_topics=5, method='lda')
    # 9. Visualizations
    print("\n🎨 Creating visualizations...")
    self.create_wordcloud('wordcloud.png')
    self.plot_sentiment_distribution('sentiment_dist.png')
    # Save results ...
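Running the whole thing is then a two-liner (constructor arguments as in the skeleton sketched earlier):

analyzer = WatchDataAnalyzer(posts_csv='filtered_posts.csv', comments_csv='comments.csv')
analyzer.generate_report()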
It’s a compact example of how to go from raw Reddit JSON to structured insights about a very specific question: what are people really saying when they talk about women’s watches?