DEV Community: Visesh Agarwal

Reddit Data Analysis: Insights from Machine Learning Models

Visesh Agarwal — Wed, 21 Aug 2024 05:02:17 +0000

Introduction

In the age of social media, Reddit stands out as a unique platform where users engage in discussions across a wide range of topics. This article presents an in-depth analysis of Reddit comments from various subreddits related to data science, programming, and technology. We'll explore the sentiment, emotions, and content of these comments using several machine learning techniques, including sentiment analysis, topic modeling, and text classification.

Data Collection and Preprocessing

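Our analysis begins with data collection from eight subreddits: Python, DataScience, MachineLearning, DataAnalysis, DataMining, Data, DataSets, and DataCenter. We used the PRAW (Python Reddit API Wrapper) library to collect comments from these subreddits through the Reddit API. The async/await syntax in the snippet below corresponds to the asynchronous variant of the library, Async PRAW.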

Here's a snippet of the code used for data collection:

async def get_comments(subreddit_name, num_comments=2000):
    subreddit = await reddit.subreddit(subreddit_name)
    comments = []
    async for comment in subreddit.comments(limit=num_comments):
        comments.append({
            "subreddit": subreddit_name,
            "comment_body": comment.body,
            "upvotes": comment.score,
        })
    return comments
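
The coroutine above handles a single subreddit. As a minimal sketch of how all eight subreddits could be fetched concurrently, the example below uses asyncio.gather. The fetch_stub coroutine is a hypothetical stand-in for get_comments so that the example runs without Reddit credentials; it is not part of the original code.

```python
import asyncio

SUBREDDITS = ["Python", "DataScience", "MachineLearning", "DataAnalysis",
              "DataMining", "Data", "DataSets", "DataCenter"]

async def fetch_stub(subreddit_name, num_comments=2):
    """Hypothetical stand-in for get_comments: returns fake rows, no network."""
    await asyncio.sleep(0)  # yield control, as a real API call would
    return [
        {"subreddit": subreddit_name, "comment_body": f"comment {i}", "upvotes": 0}
        for i in range(num_comments)
    ]

async def collect_all(subreddits, fetch=fetch_stub):
    # Run one fetch per subreddit concurrently and flatten the results.
    per_subreddit = await asyncio.gather(*(fetch(name) for name in subreddits))
    return [row for rows in per_subreddit for row in rows]

all_comments = asyncio.run(collect_all(SUBREDDITS))
print(len(all_comments))
```

Passing the real get_comments as the fetch argument would reuse the same structure, and the flattened list can then be loaded into a DataFrame.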

After collecting the data, we performed several preprocessing steps to clean and prepare the text for analysis:

Removing missing values and duplicates
Filtering out comments with fewer than three words
Tokenizing the text
Removing special characters and words with digits
Converting to lowercase
Removing stopwords
Lemmatizing the words

Here's a snippet of the preprocessing function:

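import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words('english'))  # build once, not once per token
NOISE_WORDS = {'http', 'https', 'com', 'www', 'org', 'reddit', 'comment', 'comments', 'jpg', 'jpeg', 'png', 'gif'}
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    tokens = word_tokenize(text)
    # Replace runs of non-alphanumeric characters with a space
    tokens = [re.sub(r"[^a-zA-Z0-9]+", ' ', word) for word in tokens]
    # Drop tokens containing digits, then lowercase
    tokens = [word.lower() for word in tokens if not any(c.isdigit() for c in word)]
    tokens = [word for word in tokens if word not in STOP_WORDS]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Rejoin, strip leftover punctuation, and remove URL and Reddit boilerplate words
    text = re.sub(r'[^\w\s]', '', ' '.join(tokens))
    return ' '.join(word for word in text.split() if word not in NOISE_WORDS)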
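
The first two steps in the list, removing missing values and duplicates and filtering out very short comments, are not shown in the snippet above. A minimal sketch of how they could look on the list of dictionaries returned by get_comments follows. The function name filter_comments and the min_words parameter are illustrative assumptions, not the author's code.

```python
def filter_comments(comments, min_words=3):
    """Drop missing, duplicate, and very short comments (illustrative helper)."""
    seen = set()
    kept = []
    for row in comments:
        body = row.get("comment_body")
        if not body:                        # missing or empty body
            continue
        if len(body.split()) < min_words:   # fewer than min_words words
            continue
        if body in seen:                    # exact duplicate of an earlier comment
            continue
        seen.add(body)
        kept.append(row)
    return kept

# Made-up rows, for illustration only
sample = [
    {"subreddit": "Python", "comment_body": "Use a virtual environment"},
    {"subreddit": "Python", "comment_body": "Use a virtual environment"},
    {"subreddit": "Data", "comment_body": "Thanks!"},
    {"subreddit": "Data", "comment_body": None},
]
print(filter_comments(sample))
```

Only the first row survives here: the second is a duplicate, the third is too short, and the fourth has no body.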

Exploratory Data Analysis

Comment Distribution Across Subreddits

We first examined the distribution of comments across the different subreddits:

print(df['subreddit'].value_counts())

The results showed:

DataCenter        934
Python            859
DataMining        847
Data              803
DataScience       763
DataSets          763
MachineLearning   733
DataAnalysis      645

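This distribution shows how many comments each subreddit contributes to the dataset. Because the collection function caps the number of comments fetched per subreddit, the counts are only a rough indication of relative activity during the collection period.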

Word Cloud Visualization

To get a quick overview of the most frequent words in our dataset, we created a word cloud:

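import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join all cleaned comments into one string and render the most frequent words
text = " ".join(comment for comment in df.cleaned_comment)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()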

The word cloud highlights the most frequent terms across all comments, giving us a visual representation of the dominant topics and terms in our dataset.

Sentiment Analysis

We performed sentiment analysis using the TextBlob library to understand the overall sentiment of the comments:

df['polarity'] = df['cleaned_comment'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['sentiment'] = df['polarity'].apply(lambda x: 'positive' if x > 0 else 'negative' if x < 0 else 'neutral')
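
The labelling rule above depends only on the sign of the polarity, so even a score very close to zero counts as positive or negative. As a small sketch, the same rule is written below as a standalone function, together with the share of each label, which is what the pie chart further down reports. The polarity values are made up for illustration.

```python
from collections import Counter

def label_sentiment(polarity):
    """Same rule as the lambda above: the sign of the polarity decides the label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# Made-up polarity scores, for illustration only
polarities = [0.5, 0.0, -0.2, 0.1, 0.0]
labels = [label_sentiment(p) for p in polarities]
shares = {label: count / len(labels) for label, count in Counter(labels).items()}
print(shares)
```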

We then visualized the sentiment distribution:

plt.figure(figsize=(10, 6))
sns.histplot(df['polarity'], kde=True)
plt.title('Sentiment Distribution')
plt.xlabel('Polarity')
plt.ylabel('Count')
plt.show()

plt.figure(figsize=(10, 6))
df['sentiment'].value_counts().plot.pie(autopct='%1.1f%%')
plt.title('Sentiment Distribution')
plt.ylabel('')
plt.show()

The sentiment analysis revealed that the majority of comments had a neutral to slightly positive sentiment. This suggests that discussions in these tech-related subreddits tend to be more informative and objective rather than highly emotional.

Topic Modeling

To uncover the main topics discussed across these subreddits, we employed Latent Dirichlet Allocation (LDA) for topic modeling:

lda_model = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=5, id2word=dictionary, passes=15
)

topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)
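
The corpus and dictionary passed to the model are not defined in the snippet. gensim's LDA works on a bag-of-words corpus, in which each document is a list of (token_id, count) pairs. The sketch below builds that structure in plain Python to show the format. In practice, gensim.corpora.Dictionary and its doc2bow method produce the same kind of output. The helper names and sample documents are made up for illustration.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Made-up cleaned comments, for illustration only
docs = ["data center power cooling", "data science data analysis"]
tokenized = [doc.split() for doc in docs]
token2id = build_dictionary(tokenized)
bow_corpus = [doc_to_bow(doc, token2id) for doc in tokenized]
print(bow_corpus)
```

The model only sees token counts per document, not word order, which is why generic high-frequency words can end up dominating several topics.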

The LDA model identified five main topics:

General discussion and etiquette (keywords: post, please, r, message, thank)
Data center infrastructure (keywords: cooling, power, rack, ups, system)
Data analysis and tools (keywords: data, n, like, would, get)
Data science applications (keywords: data, n, use, would, need)
Data center operations (keywords: data, power, center, get, like)

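These topics provide insight into the main areas of discussion across the analyzed subreddits, ranging from technical discussions about data center operations to more general data science and analysis topics. Topics three to five share generic keywords (data, like, would, get), so their labels are loose interpretations rather than clear-cut themes, and the stray token n is most likely a tokenization artifact.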

Emotion Analysis

To gain a deeper understanding of the emotional content of the comments, we performed emotion analysis using the NRCLex library:

df["emotions"] = df["cleaned_comment"].apply(analyze_emotions)
emotion_df = df["emotions"].apply(pd.Series).fillna(0)
df = pd.concat([df, emotion_df], axis=1)

emotion_totals = emotion_df.sum().sort_values(ascending=False)
plt.figure(figsize=(12, 8))
sns.barplot(x=emotion_totals.index, y=emotion_totals.values, palette="viridis")
plt.title("Total Emotion Counts in Reddit Comments")
plt.xlabel("Emotion")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()
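
The analyze_emotions function is not defined in the snippet. NRCLex is a lexicon-based tool: it looks words up in the NRC emotion lexicon and counts the emotions associated with them. The sketch below shows that idea with a tiny made-up lexicon (TOY_LEXICON is hypothetical and far smaller than the real one). It returns a dictionary of emotion counts, which is the shape the apply(pd.Series) step expects.

```python
from collections import Counter

# Hypothetical mini-lexicon for illustration; the real NRC lexicon is much larger.
TOY_LEXICON = {
    "reliable": ["trust"],
    "excited": ["anticipation", "joy"],
    "outage": ["fear", "sadness"],
}

def analyze_emotions(text, lexicon=TOY_LEXICON):
    """Count how many words in the text are associated with each emotion."""
    counts = Counter()
    for word in text.split():
        counts.update(lexicon.get(word, []))
    return dict(counts)

print(analyze_emotions("excited reliable vendor outage excited"))
```

Because the counts come from individual words, a single comment can contribute to several emotions at once, which helps explain why trust and fear can both rank highly in the same dataset.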

The emotion analysis revealed that the most prevalent emotions in the comments were:

Trust
Anticipation
Joy
Fear
Sadness

This distribution suggests that while the overall sentiment tends to be neutral or slightly positive, there's a complex emotional landscape in these tech-related discussions. The high levels of trust and anticipation might indicate a generally optimistic and collaborative atmosphere in these communities.

Named Entity Recognition

To identify key entities mentioned in the comments, we performed Named Entity Recognition (NER) using the spaCy library:

def extract_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

df["entities"] = df["cleaned_comment"].apply(extract_entities)

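# Flatten the per-comment entity lists into one list of (text, label) pairs
all_entities = [entity for entities in df["entities"] for entity in entities]
entities_df = pd.DataFrame(all_entities, columns=["Entity", "Label"])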
label_counts = entities_df["Label"].value_counts()

plt.figure(figsize=(12, 6))
label_counts.plot(kind="bar", color="skyblue")
plt.title("Distribution of Entity Labels")
plt.xlabel("Entity Label")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

The NER analysis highlighted the most common types of entities mentioned in the comments, which included:

Organizations (ORG)
People (PERSON)
Products (PRODUCT)
Locations (GPE)

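This distribution gives us insight into the types of entities that are frequently discussed in these tech-related subreddits, with a focus on organizations and people involved in the field. Because the entities were extracted from the cleaned, lowercased text and NER models rely partly on capitalization, these counts should be read with some caution.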

Text Classification Models

To predict the subreddit of a given comment, we implemented and compared several machine learning models:

Support Vector Machine (SVM)
Logistic Regression
Random Forest
K-Nearest Neighbors (KNN)
Long Short-Term Memory (LSTM) neural network

Here's a summary of the performance metrics for each model:

Model                 Accuracy  Precision  Recall    F1 Score  ROC AUC
SVM                   0.523622  0.533483   0.523622  0.525747  0.853577
Logistic Regression   0.541732  0.545137   0.541732  0.536176  0.857676
Random Forest         0.485039  0.488424   0.485039  0.477831  0.819124
KNN                   0.230709  0.326545   0.230709  0.151028  0.720319
LSTM                  0.483302  0.483103   0.483302  0.478265  NaN
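
In every row of the table the recall value equals the accuracy value. That is what class-weighted averaging produces, since support-weighted recall is mathematically identical to accuracy. Assuming the metrics were averaged this way (the article does not say), the sketch below computes weighted precision, recall, and F1 in plain Python on made-up labels.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        prec = tp / predicted if predicted else 0.0
        rec = tp / count
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = count / n  # each class counts in proportion to its support
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return accuracy, precision, recall, f1

# Made-up labels, for illustration only
y_true = ["Python", "Python", "Data", "Data", "Data", "DataSets"]
y_pred = ["Python", "Data", "Data", "Data", "Python", "DataSets"]
accuracy, precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(accuracy, recall)
```

With this averaging the recall column carries no information beyond accuracy, so precision and F1 are the more useful columns for comparing models.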

We visualized the performance of these models:

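import matplotlib.pyplot as plt

# Values taken from the summary table above (rounded to four decimals)
models = ["SVM", "Logistic Regression", "Random Forest", "KNN", "LSTM"]
accuracies = [0.5236, 0.5417, 0.4850, 0.2307, 0.4833]
precisions = [0.5335, 0.5451, 0.4884, 0.3265, 0.4831]
recalls = [0.5236, 0.5417, 0.4850, 0.2307, 0.4833]
f1_scores = [0.5257, 0.5362, 0.4778, 0.1510, 0.4783]
roc_auc_scores = [0.8536, 0.8577, 0.8191, 0.7203, float("nan")]  # not computed for LSTM

plt.figure(figsize=(12, 6))
plt.plot(models, accuracies, marker="o", label="Accuracy")
plt.plot(models, precisions, marker=".", label="Precision")
plt.plot(models, recalls, marker=".", label="Recall")
plt.plot(models, f1_scores, marker=".", label="F1 Score")
plt.plot(models, roc_auc_scores, marker=".", label="ROC AUC")
plt.title("Model Comparison")
plt.xlabel("Model")
plt.ylabel("Score")
plt.legend()
plt.xticks(rotation=45)
plt.show()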

Model Performance Analysis

Logistic Regression performed the best overall, with the highest accuracy (54.17%), precision (54.51%), recall (54.17%), and F1 score (53.62%). It also had the highest ROC AUC score (0.8577), indicating good discrimination ability.
SVM was a close second, with performance metrics very similar to Logistic Regression. This suggests that linear classifiers are well-suited for this text classification task, assuming the SVM was trained with a linear kernel.
Random Forest performed slightly worse than the linear models but still achieved reasonable results. Its lower performance might indicate that the decision tree-based approach is less effective for capturing the nuances in the text data compared to linear models.
The LSTM model showed comparable performance to Random Forest in terms of accuracy, precision, recall, and F1 score. However, we couldn't calculate its ROC AUC score due to limitations in the implementation.
KNN performed significantly worse than the other models across all metrics. This poor performance suggests that the nearest neighbor approach might not be suitable for high-dimensional text data.

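The relatively close performance of different models (except KNN) suggests that the task of predicting subreddits based on comment content is challenging. For context, with eight classes of roughly similar size, always guessing the largest subreddit would be right only about 15% of the time (934 of the 6,347 collected comments), so the best models are well above chance even though they misclassify nearly half of the comments. The remaining errors could be due to overlapping topics across different subreddits or the presence of general discussion that isn't specific to any particular subreddit.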

Conclusion

Our analysis of Reddit comments from tech-related subreddits has provided valuable insights into the nature of discussions in these online communities:

Sentiment and Emotions: The overall sentiment tends to be neutral to slightly positive, with trust and anticipation being the dominant emotions. This suggests a generally constructive and forward-looking atmosphere in these tech-focused discussions.
Topics: The main topics identified through LDA include general discussion etiquette, data center infrastructure, data analysis tools, data science applications, and data center operations. This diverse range of topics reflects the broad scope of discussions in these tech-related subreddits.
Entities: Organizations and people are the most frequently mentioned entities, highlighting the importance of industry players and thought leaders in these discussions.
Text Classification: While our models achieved moderate success in predicting subreddits based on comment content, the task proved challenging. Logistic Regression and SVM performed best, suggesting that linear models are well-suited for this type of text classification task.

These findings provide valuable insights for community managers, data scientists, and researchers interested in understanding the dynamics of tech-related discussions on Reddit. Future work could explore more advanced natural language processing techniques, such as transformer-based models like BERT, to potentially improve classification performance and extract even more nuanced insights from the text data.

GitHub Repo with all the code and detailed analysis
GitHub Repo Link

"Mind Meets Machine: The Evolution of AI through Cognitive Psychology"

Visesh Agarwal — Sat, 04 May 2024 19:37:40 +0000

Introduction

Integrating cognitive psychology principles into the development of AI and machine learning algorithms marks a significant stride toward creating more adaptable, transparent, and efficient systems. As AI permeates various aspects of society, understanding the interplay between human cognition and artificial intelligence becomes increasingly crucial. In this context, exploring future directions and emerging topics at the intersection of cognitive psychology and computer science offers exciting prospects for advancing both fields and addressing contemporary challenges.

Future Directions in Cognitive Psychology and Computer Science:

Cognitive Neuroscience and AI:

Deepening our understanding of the neural underpinnings of cognitive processes holds immense potential for enhancing AI algorithms. Future research might focus on integrating insights from cognitive neuroscience, such as brain-computer interfaces and neuroimaging techniques, to inform the development of more biologically inspired AI models. This interdisciplinary approach could lead to innovations in areas like brain-inspired computing and neuromorphic engineering, enabling AI systems to mimic human cognitive functions more closely.

Human-AI Interaction:

As AI systems become more prevalent in daily life, understanding how humans perceive, interact with, and trust these systems is paramount. Future research could explore cognitive psychology aspects such as user experience, trust, and cognitive biases in human-AI interaction. This exploration could inform the design of AI interfaces that are intuitive, trustworthy, and conducive to effective collaboration between humans and machines.

Cognitive Robotics:

Integrating cognitive psychology principles with robotics can lead to the development of robots capable of understanding human intentions, emotions, and social cues. Future research might focus on incorporating theories of social cognition and theory of mind into robotic systems, enabling them to interact with humans in more nuanced and socially appropriate ways. This interdisciplinary endeavour could pave the way for the widespread adoption of cognitive robots in various domains, including healthcare, education, and entertainment.

Emerging Topics in Cognitive Psychology and Computer Science:

Explainable and Ethical AI:

With the increasing complexity of AI models, there is a growing need for algorithms that are accurate, explainable, and ethically sound. Future research might explore how cognitive psychology insights can inform the development of explainable AI (XAI) systems that provide transparent explanations for their decisions. Additionally, integrating ethical principles derived from cognitive psychology, such as moral reasoning and fairness perceptions, could lead to the creation of AI algorithms that align with societal values and norms.

Cognitive Computing:

Cognitive computing, which combines AI techniques with human-like cognitive abilities such as natural language understanding and reasoning, represents a promising frontier in cognitive psychology and computer science. Future research might focus on advancing cognitive computing technologies, such as cognitive assistants and virtual agents, to enhance their ability to understand and respond to human needs and preferences. This interdisciplinary endeavour could lead to breakthroughs in areas such as personalized healthcare, virtual education, and customer service.

Adaptive Learning Systems:

Leveraging cognitive psychology principles such as spaced repetition, retrieval practice, and metacognition, future research could develop adaptive learning systems that optimize individual learning experiences. These systems could revolutionize education and training across diverse domains by incorporating AI algorithms that adaptively tailor educational content and strategies to learners' cognitive strengths and weaknesses. This interdisciplinary approach holds promise for addressing challenges such as individual differences in learning styles and preferences and improving long-term knowledge retention and transfer.

Conclusion:

Integrating cognitive psychology insights into the development of AI and machine learning algorithms opens up new avenues for innovation and collaboration between cognitive psychology and computer science. Future directions and emerging topics at the intersection of these disciplines hold promise for advancing both fields and addressing pressing societal challenges. By leveraging the synergies between human cognition and artificial intelligence, researchers can create more adaptable, transparent, and human-centred AI systems that profoundly enhance our lives.