Next, we split the data into training and test sets so that we can check the model's accuracy on comments it has not seen.
df = pd.read_csv('labeled_reddit_comments.csv')
This line reads the previously saved CSV file (labeled_reddit_comments.csv) containing cleaned Reddit comments and their corresponding labels into a Pandas DataFrame (df).
Splitting Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_comment'], df['label'], test_size=0.2, random_state=42)
Here, we split the data into two parts:
X_train and y_train: These variables contain 80% of the data (df['cleaned_comment'] and df['label']) which will be used for training the model.
X_test and y_test: These variables contain the remaining 20% of the data, which will be used to evaluate how well the trained model performs on new, unseen data.
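To see what the split produces, here is a minimal sketch on a made-up DataFrame. The toy_df data is hypothetical and only stands in for the real CSV. The stratify argument is also an assumption that is not part of the call above; it keeps the ratio of labels the same in both sets, which can help when one class is rarer than the other.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for labeled_reddit_comments.csv
toy_df = pd.DataFrame({
    "cleaned_comment": [f"comment {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy_df["cleaned_comment"],
    toy_df["label"],
    test_size=0.2,             # 20% of the rows go to the test set
    random_state=42,           # same split on every run
    stratify=toy_df["label"],  # assumption: not used in the original call
)

print(len(X_train), len(X_test))  # 8 2
```

With 10 rows and test_size=0.2, 8 rows end up in the training set and 2 in the test set.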
Creating a Pipeline with a Random Forest Classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=42))
])
This sets up a pipeline (pipeline) that sequentially applies two steps to the data:
Step 1 ('tfidf', TfidfVectorizer()): Converts the text data (X_train and X_test) into numerical TF-IDF (Term Frequency-Inverse Document Frequency) vectors.
Step 2 ('clf', RandomForestClassifier(random_state=42)): Trains a Random Forest classifier on the TF-IDF vectors. The random_state=42 ensures reproducibility of results.
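The two steps can be tried on a handful of sentences to see what each one does. This is only a sketch: the toy_texts and toy_labels below are invented for illustration and are not from the Reddit dataset, and n_estimators=10 is chosen just to keep it fast.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical examples: 1 = sarcastic, 0 = not sarcastic
toy_texts = [
    "oh great another monday",
    "i love sunny days",
    "yeah right that will work",
    "the meeting starts at noon",
]
toy_labels = [1, 0, 1, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# fit() learns the TF-IDF vocabulary, then trains the forest on the vectors
toy_pipeline.fit(toy_texts, toy_labels)

# Step 1 on its own: one row per comment, one column per vocabulary word
tfidf_matrix = toy_pipeline.named_steps["tfidf"].transform(toy_texts)
print(tfidf_matrix.shape)

# Both steps together: raw text in, predicted label out
prediction = toy_pipeline.predict(["oh great another meeting"])
print(prediction)
```

Because the vectorizer lives inside the pipeline, raw strings can be passed straight to fit() and predict() without calling the vectorizer by hand.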
Defining Hyperparameters for Tuning
param_grid = {
    'tfidf__max_features': [10000, 20000, None],
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [None, 10],
    'clf__min_samples_split': [2, 5],
    'clf__min_samples_leaf': [1, 2]
}
This dictionary (param_grid) specifies different hyperparameter values to explore during the grid search process:
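'tfidf__max_features': Limits the number of features generated by TfidfVectorizer (None means no limit).
'clf__n_estimators', 'clf__max_depth', 'clf__min_samples_split', 'clf__min_samples_leaf': Parameters that control the behavior of the Random Forest classifier.
Each key is the pipeline step name, a double underscore, and then the parameter name, which is how the grid search knows which step a value belongs to.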
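It helps to know how many combinations this grid contains before running the search. A quick way to count them, sketched here with scikit-learn's ParameterGrid helper (the n_folds variable is just a stand-in for the cv value passed to the search):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "tfidf__max_features": [10000, 20000, None],
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [None, 10],
    "clf__min_samples_split": [2, 5],
    "clf__min_samples_leaf": [1, 2],
}

# Every combination of the values above is one candidate model
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)  # 48  (3 * 2 * 2 * 2 * 2)

# With 5-fold cross-validation each candidate is trained 5 times
n_folds = 5
total_fits = n_candidates * n_folds
print(total_fits)  # 240
```

So the search trains 240 models during cross-validation, plus one final refit of the best candidate, which is why it can take a while on a large dataset.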
Performing GridSearchCV for Hyperparameter Tuning
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', verbose=1, error_score='raise')
grid_search.fit(X_train, y_train)
Here, GridSearchCV is used to search for the best combination of hyperparameters (param_grid) for the pipeline (pipeline). It:
Divides the training data into 5 folds (cv=5) for cross-validation, so each combination is trained on 4 folds and scored on the remaining one, 5 times over.
Uses accuracy (scoring='accuracy') as the metric to evaluate the performance of each combination of hyperparameters.
Prints progress messages (verbose=1) during the search and, because of error_score='raise', stops with an exception if any fit fails instead of recording a NaN score and moving on.
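Once the search has finished, the winning settings and their cross-validated score can be read off the fitted object. The sketch below runs a much smaller search on invented sentences so that it finishes instantly; the texts, labels, small_grid, and cv=3 are all assumptions for illustration, not the settings used in this post.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical examples: 1 = sarcastic, 0 = not sarcastic
texts = [
    "oh great another monday", "yeah right that will work",
    "wow what a surprise", "sure because that always helps",
    "nice job breaking it", "oh perfect more rain",
    "the meeting starts at noon", "i love sunny days",
    "the bus was on time", "dinner was really good",
    "thanks for the help", "the park is open today",
]
labels = [1] * 6 + [0] * 6

demo_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# A deliberately tiny grid: 2 * 2 = 4 candidates
small_grid = {
    "clf__n_estimators": [5, 10],
    "clf__max_depth": [None, 3],
}

search = GridSearchCV(demo_pipeline, small_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)               # the winning combination
print(search.best_score_)                # its mean accuracy across the folds
print(len(search.cv_results_["params"])) # 4 candidates were tried
```

Note that best_score_ is an average over the validation folds of the training data, so it is a separate number from the test accuracy.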
Evaluating the Best Model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
# Print evaluation metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))
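After the search finds the best set of hyperparameters, grid_search.best_estimator_ holds the pipeline refit on the full training set with those settings, and we store it in best_model. The code then evaluates this model on the test data (X_test and y_test) that was set aside earlier.
It: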
Predicts labels (y_pred) for the test data.
Calculates and prints the accuracy score (accuracy_score) of the predictions compared to the actual labels (y_test).
Prints a detailed classification report (classification_report) showing precision, recall, F1-score, and support for each class (sarcasm and non-sarcasm).
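To make the numbers in the report easier to read, here is a small worked example with hand-written labels. The y_true and y_pred lists are made up for illustration and are not output from the model in this post.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels: 1 = sarcasm, 0 = non-sarcasm
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# 5 of the 8 predictions match the true labels
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)  # 0.625

# output_dict=True returns the same report as a nested dictionary
report = classification_report(y_true, y_pred, output_dict=True)

# For the sarcasm class: 3 comments were predicted as sarcasm and 2 of them
# really were (precision 2/3); 4 comments were sarcasm and 2 were found (recall 2/4)
print(report["1"]["precision"])
print(report["1"]["recall"])
print(report["1"]["support"])  # number of true sarcasm examples
```

Accuracy alone can hide a weak class, so it is worth checking precision and recall for the sarcasm class as well, especially if the labels are imbalanced.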
After training and testing, I got an accuracy of 97% on the test set.
Testing with sample text
Checking on the top 5 comments on a post on Reddit
GITHUB: https://github.com/stevie1mat/Sarcasm-Detection-With-Reddit-Comments
Author: Steven Mathew