DEV Community

Cover image for Sarcasm Detection AI Model (97% Accuracy) Trained With Reddit Comments - Training & Testing
Steven Mathew
Steven Mathew

Posted on • Edited on

Sarcasm Detection AI Model (97% Accuracy) Trained With Reddit Comments - Training & Testing

Now we are going to split the data to train and test the data to check the accuracy.

df = pd.read_csv('labeled_reddit_comments.csv')
Enter fullscreen mode Exit fullscreen mode

This line reads the previously saved CSV file (labeled_reddit_comments.csv) containing cleaned Reddit comments and their corresponding labels into a Pandas DataFrame (df).

Splitting Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(df['cleaned_comment'], df['label'], test_size=0.2, random_state=42)
Enter fullscreen mode Exit fullscreen mode

Here, we split the data into two parts:
X_train and y_train: These variables contain 80% of the data (df['cleaned_comment'] and df['label']) which will be used for training the model.

X_test and y_test: These variables contain the remaining 20% of the data, which will be used to evaluate how well the trained model performs on new, unseen data.

Creating a Pipeline with a Random Forest Classifier

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=42))
])
Enter fullscreen mode Exit fullscreen mode

This sets up a pipeline (pipeline) that sequentially applies two steps to the data:
Step 1 ('tfidf', TfidfVectorizer()): Converts the text data (X_train and X_test) into numerical TF-IDF (Term Frequency-Inverse Document Frequency) vectors.

Step 2 ('clf', RandomForestClassifier(random_state=42)): Trains a Random Forest classifier on the TF-IDF vectors. The random_state=42 ensures reproducibility of results.

Defining Hyperparameters for Tuning

param_grid = {
    'tfidf__max_features': [10000, 20000, None],
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [None, 10],
    'clf__min_samples_split': [2, 5],
    'clf__min_samples_leaf': [1, 2]
}
Enter fullscreen mode Exit fullscreen mode

This dictionary (param_grid) specifies different hyperparameter values to explore during the grid search process:
'tfidf_max_features': Limits the number of features generated by TfidfVectorizer.
'clf
n_estimators', 'clfmax_depth', 'clfmin_samples_split', 'clf_min_samples_leaf': Parameters that control the behavior of the Random Forest classifier.

Performing GridSearchCV for Hyperparameter Tuning

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', verbose=1, error_score='raise')
grid_search.fit(X_train, y_train)
Enter fullscreen mode Exit fullscreen mode

Here, GridSearchCV is used to search for the best combination of hyperparameters (param_grid) for the pipeline (pipeline). It:
Divides the data into 5 folds (cv=5) for cross-validation.

Uses accuracy (scoring='accuracy') as the metric to evaluate the performance of each combination of hyperparameters.
Prints detailed messages (verbose=1) during the search process and raises errors (error_score='raise') if an error occurs.

Evaluating the Best Model

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Print evaluation metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))
Enter fullscreen mode Exit fullscreen mode

After finding the best set of hyperparameters (best_model), the code evaluates this model's performance on the test data (X_test) that was set aside earlier (y_test).

It:
Predicts labels (y_pred) for the test data.
Calculates and prints the accuracy score (accuracy_score) of the predictions compared to the actual labels (y_test).

Prints a detailed classification report (classification_report) showing precision, recall, F1-score, and support for each class (sarcasm and non-sarcasm).

After training and Testing I got an accuracy of 97%

Image description

Testing with sample text

Image description

Checking on the top 5 comments on a post on Reddit

Image description

GITHUB: https://github.com/stevie1mat/Sarcasm-Detection-With-Reddit-Comments

Author: Steven Mathew

Sentry image

Hands-on debugging session: instrument, monitor, and fix

Join Lazar for a hands-on session where you’ll build it, break it, debug it, and fix it. You’ll set up Sentry, track errors, use Session Replay and Tracing, and leverage some good ol’ AI to find and fix issues fast.

RSVP here →

Top comments (0)

AWS Security LIVE!

Join us for AWS Security LIVE!

Discover the future of cloud security. Tune in live for trends, tips, and solutions from AWS and AWS Partners.

Learn More

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay