Sarcasm Detection AI Model (97% Accuracy) Trained With Reddit Comments - Training & Testing

Next, we split the data into training and test sets, train the model, and check its accuracy on the held-out test data.

import pandas as pd
df = pd.read_csv('labeled_reddit_comments.csv')

This line reads the previously saved CSV file (labeled_reddit_comments.csv) containing cleaned Reddit comments and their corresponding labels into a Pandas DataFrame (df).

Splitting Data into Training and Testing Sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_comment'], df['label'], test_size=0.2, random_state=42)

Here, we split the data into two parts:
X_train and y_train: These variables contain 80% of the data (df['cleaned_comment'] and df['label']) which will be used for training the model.

X_test and y_test: These variables contain the remaining 20% of the data, which will be used to evaluate how well the trained model performs on new, unseen data.
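To make the 80/20 split concrete, here is a minimal sketch on made-up data. The `comments` and `labels` lists are hypothetical stand-ins for df['cleaned_comment'] and df['label']. The `stratify` argument is not part of the call above; it is shown only as an optional variation that keeps the class ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real comments and labels.
comments = [f"comment number {i}" for i in range(10)]
labels = [0, 1] * 5  # five examples of each class

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
# stratify=labels (an optional extra, not in the original call) keeps the
# class proportions equal in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))
print(sorted(y_test))
```

With 10 rows, 8 end up in the training part and 2 in the test part.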

Creating a Pipeline with a Random Forest Classifier

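from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=42))
])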

This sets up a pipeline (pipeline) that sequentially applies two steps to the data:
Step 1 ('tfidf', TfidfVectorizer()): Converts the text data (X_train and X_test) into numerical TF-IDF (Term Frequency-Inverse Document Frequency) vectors.

Step 2 ('clf', RandomForestClassifier(random_state=42)): Trains a Random Forest classifier on the TF-IDF vectors. The random_state=42 ensures reproducibility of results.
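To see what the first step produces, the sketch below runs TfidfVectorizer on its own over three made-up sentences (the `docs` list is hypothetical, not from the dataset). Each row of the resulting matrix is one comment and each column is one vocabulary term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example sentences, not taken from the Reddit data.
docs = [
    "oh great another monday",
    "great weather today",
    "i love monday meetings",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

# The default tokenizer lowercases text and drops one-character tokens
# such as "i", so they never become features.
print(matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

Inside the pipeline the same transformation happens automatically: the vocabulary is learned when the pipeline is fitted and then reused when it predicts.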

Defining Hyperparameters for Tuning

param_grid = {
    'tfidf__max_features': [10000, 20000, None],
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [None, 10],
    'clf__min_samples_split': [2, 5],
    'clf__min_samples_leaf': [1, 2]
}

This dictionary (param_grid) specifies different hyperparameter values to explore during the grid search process:
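'tfidf__max_features': Limits the number of features (vocabulary terms) generated by TfidfVectorizer; None keeps every term.
'clf__n_estimators', 'clf__max_depth', 'clf__min_samples_split', 'clf__min_samples_leaf': Control the Random Forest classifier: the number of trees, the maximum depth of each tree, the minimum number of samples needed to split a node, and the minimum number of samples allowed in a leaf. The double underscore joins the pipeline step name ('tfidf' or 'clf') to the parameter name.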
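It helps to know how much work this grid implies before running it. As a rough sketch, scikit-learn's ParameterGrid can enumerate the combinations. The 5 used below is an assumption that the search runs with 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'tfidf__max_features': [10000, 20000, None],
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [None, 10],
    'clf__min_samples_split': [2, 5],
    'clf__min_samples_leaf': [1, 2],
}

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(param_grid))  # 3 * 2 * 2 * 2 * 2

# Assuming 5-fold cross-validation, each candidate is fitted 5 times.
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```

That comes to 48 candidates and 240 fits, so adding even one more value to a list noticeably increases the run time.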

Performing GridSearchCV for Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', verbose=1, error_score='raise')
grid_search.fit(X_train, y_train)

Here, GridSearchCV is used to search for the best combination of hyperparameters (param_grid) for the pipeline (pipeline). It:
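Splits the training data into 5 folds (cv=5) for cross-validation: each combination is trained on 4 folds and scored on the remaining one, rotating through all 5.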

Uses accuracy (scoring='accuracy') as the metric to evaluate the performance of each combination of hyperparameters.
Prints a brief progress summary (verbose=1) during the search and re-raises any exception from a failed fit (error_score='raise') instead of recording a placeholder score.
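Once a search has finished, its results can be inspected through the best_params_, best_score_, and cv_results_ attributes. The sketch below shows this on a tiny made-up dataset with a much smaller grid and cv=2 so it runs in a moment. The `texts`, `labels`, and `small_grid` names are hypothetical, and the scores it prints say nothing about the real model.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = sarcastic, 0 = not sarcastic (assumed encoding).
texts = [
    "oh great another monday", "wow what a surprise",
    "yeah right sure", "totally love waiting in line",
    "the weather is nice today", "i had lunch at noon",
    "the train arrives at nine", "she reads a book every week",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# A deliberately small grid: 2 * 2 = 4 candidates.
small_grid = {"clf__n_estimators": [5, 10], "clf__max_depth": [None, 3]}

search = GridSearchCV(pipeline, small_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)               # winning combination
print(search.best_score_)                # its mean cross-validated accuracy
print(len(search.cv_results_["params"])) # number of candidates tried
```

Note that best_score_ is the mean accuracy across the validation folds, not the accuracy on the test set.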

Evaluating the Best Model

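from sklearn.metrics import accuracy_score, classification_report

# Best pipeline found by the search, refitted on the full training set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Print evaluation metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))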

After the search, grid_search.best_estimator_ holds the pipeline with the best hyperparameter combination, refitted on the whole training set (best_model). The code then evaluates this model on the test data (X_test and y_test) that was set aside earlier.

It:
Predicts labels (y_pred) for the test data.
Calculates and prints the accuracy score (accuracy_score) of the predictions compared to the actual labels (y_test).

Prints a detailed classification report (classification_report) showing precision, recall, F1-score, and support for each class (sarcasm and non-sarcasm).
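A confusion matrix is a useful companion to these metrics because it shows which kind of mistake the model makes. The sketch below uses hand-written `y_test` and `y_pred` lists (hypothetical values, not results from the model) and assumes label 1 means sarcastic and 0 means non-sarcastic.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes (order: 0, 1).
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

print(cm)
print(acc)
```

Here 3 of 4 non-sarcastic and 3 of 4 sarcastic examples are classified correctly, giving an accuracy of 0.75 on this toy data.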