My First Data Science Project

I started reading a book called "Deep Work" by Cal Newport, and it kickstarted my whole journey of becoming a data scientist again. I have always had this dream at the back of my mind, but I never really got to it. I started three different data science courses that I bought from Udemy, but I never finished them. I listed out projects that I thought I would work on, but never did. I lost confidence in myself. I gave up. I convinced myself that I didn't really want this.

I am glad that I forgave myself and decided to learn how to teach myself. This book is changing my life. I haven't turned over a new leaf overnight. I still feel like I won't be good enough for recruiters to want me, or even good enough to work on projects with other people. But I have resolved to make small, incremental changes every day. To do something new every day. Surely I will be at least a tad better than I am today in one, two, three months' time.
With this in mind, I tackled my first machine learning project: a fake news detection model.
With the assistance of GeeksforGeeks and ChatGPT, I managed to write the code and understand it properly. I'll break it down in simple language so you can understand it too.

Step One
I imported the main Python libraries that I would need: pandas, seaborn, matplotlib, nltk, and scikit-learn.
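
For reference, the import block looked roughly like this (a sketch; your exact list may differ depending on which parts you follow along with):

import pandas as pd                    # loading and handling the dataset
import seaborn as sns                  # statistical plots
import matplotlib.pyplot as plt        # general plotting
import nltk                            # stopwords and other text utilities
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score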

Step Two
I preprocessed my data. This covered everything needed to clean up the data and remove unwanted text, punctuation, columns, rows, or anything else that would be unnecessary for the machine learning model. I used regular expressions to iterate through the dataset and strip out the specified strings, text, and punctuation. That said, I'd suggest profiling this step, because applying regular expressions row by row on a large dataset can slow down your preprocessing considerably.
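
As a rough sketch of what that cleaning step can look like (the DataFrame name df and the column name text are assumptions here, not necessarily what my notebook uses):

import re
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean_text(text):
    # lowercase, then strip links, bracketed text, punctuation and digits
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\[.*?\]", " ", text)
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\d+", " ", text)
    # drop common stopwords such as "the" and "is"
    return " ".join(w for w in text.split() if w not in stop_words)

df["text"] = df["text"].apply(clean_text)   # assumes df has a 'text' column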

Step Three
I created a WordCloud to visualize the real news and the fake news separately. It is important to visualize words in a project like this (a short sketch of the code follows the list) because:

  1. A WordCloud helps to identify the important words in each category. Larger words indicate a higher frequency, which gives an idea of the themes or topics that may be common in one class.
  2. A WordCloud gives a deeper understanding of the dataset. A clear depiction of the words and their frequencies helps you understand the dataset even better and gives an intuitive feel for the classification problem.
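
Here is a minimal sketch of that visualization, assuming df has a cleaned text column and a label column where 1 means fake and 0 means real (the wordcloud package provides the WordCloud class):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

fake_text = " ".join(df.loc[df["label"] == 1, "text"])
real_text = " ".join(df.loc[df["label"] == 0, "text"])

for title, text in [("Fake news", fake_text), ("Real news", real_text)]:
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()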

Step Four
I plotted a bar graph of the top twenty most frequent words in the dataset. First I imported CountVectorizer from sklearn.feature_extraction.text.
CountVectorizer is a class that converts the text in the data into a numerical format that the machine learning algorithm can understand.
I then wrote a function that returns a matrix of all the words and the number of times they appear. From there I retrieved the top 20 most frequent words in the dataset and plotted a bar chart to illustrate them.
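
A sketch of that function and the bar chart, under the same df["text"] assumption:

from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

def get_top_words(corpus, n=20):
    # build the bag-of-words matrix and sum the counts for each word
    vec = CountVectorizer(stop_words="english")
    bag_of_words = vec.fit_transform(corpus)
    word_counts = bag_of_words.sum(axis=0)
    freqs = [(word, word_counts[0, idx]) for word, idx in vec.vocabulary_.items()]
    return sorted(freqs, key=lambda pair: pair[1], reverse=True)[:n]

top_words = get_top_words(df["text"], n=20)
words, counts = zip(*top_words)

plt.figure(figsize=(12, 6))
plt.bar(words, counts)
plt.xticks(rotation=45, ha="right")
plt.title("Top 20 most frequent words")
plt.tight_layout()
plt.show()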

Step Five
I prepared to split my dataset into training and testing sets for the machine learning model, specifically Logistic Regression. The training data is the part of the dataset used to train the model, whereas the testing data is the portion used to evaluate how well the model performs.
x_train, x_test correspond to the data used to train and test on the features of your dataset.
y_train, y_test correspond to the labels or targets, which represent whether the news is "real" or "fake".
test_size = 0.25 means that 25% of the dataset is used for testing whilst 75% is used to train the model.
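
In code, the split looks something like this (again assuming the text and label column names):

from sklearn.model_selection import train_test_split

x = df["text"]      # features: the cleaned article text
y = df["label"]     # target: real or fake

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)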

Step Six
Logistic Regression is a machine learning model typically used for binary classification, such as determining whether a piece of information is "real" or "fake". After splitting the data into training and testing sets, the data is converted into numerical format by the CountVectorizer, as mentioned earlier, and then fed into the Logistic Regression model. This is a breakdown of the process (a compact sketch follows the list):
CountVectorizer: Converts the text data into numerical form (bag-of-words).
train_test_split: Splits the data into training and test sets.
LogisticRegression(): Initializes the logistic regression model.
model.fit(x_train, y_train): Trains the model on the training data.
model.predict(x_test): Makes predictions on the test data.
accuracy_score(y_test, y_pred): Measures how accurate the model's predictions are.
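
Putting those pieces together, a compact sketch of the whole pipeline could look like this (column names and parameters are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

x_train_text, x_test_text, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.25, random_state=42
)

vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(x_train_text)   # learn the vocabulary from the training text
x_test = vectorizer.transform(x_test_text)         # reuse it for the test text

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("Accuracy:", accuracy_score(y_test, y_pred))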

Step Seven
TfidfVectorizer (Term Frequency-Inverse Document Frequency) is a way to convert text into numbers, similar to CountVectorizer, but it goes one step further. Instead of just counting how many times each word appears in a document, it also considers how important each word is. Words that appear frequently across many documents (like "the" or "is") get less importance, while words that are more unique to specific documents (like "machine" or "news") get more importance.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorization = TfidfVectorizer()
x_train = vectorization.fit_transform(x_train)
x_test = vectorization.transform(x_test)

vectorization = TfidfVectorizer():
This creates a TfidfVectorizer object. It will be used to transform the text data into a matrix of TF-IDF features.

x_train = vectorization.fit_transform(x_train):
This line does two things:
fit: It learns the vocabulary from the training data and calculates the importance of each word using the TF-IDF algorithm.
transform: It converts the training data into a matrix of TF-IDF features (numbers).

x_test = vectorization.transform(x_test):
This transforms the test data using the same vocabulary and TF-IDF calculations learned from the training data. It ensures the test data is transformed consistently without learning from it.
Important: We only transform the test data and do not refit the vectorizer (refitting would let it learn from the test data, a form of data leakage we want to avoid).

Step Eight
I trained the logistic regression model and evaluated its accuracy.
model = LogisticRegression(): This initializes a Logistic Regression model.
model.fit(x_train, y_train): This line trains the model using the training data (x_train, which contains the transformed text, and y_train, which contains the labels or classes).

model.predict(x_train): This predicts the labels (or classes) for the training data (x_train), based on what the Logistic Regression model has learned.

accuracy_score(y_train, model.predict(x_train)): This calculates the accuracy by comparing the true labels (y_train) with the predicted labels from the training set. It tells you how well the model performed on the data it was trained on.

accuracy_score(y_test, model.predict(x_test)): Similarly, this calculates the accuracy on the test data, which the model hasn't seen before. This is a more realistic measure of how well the model will perform on unseen data.
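A short sketch of that evaluation, using the TF-IDF matrices from Step Seven:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)   # x_train/x_test are the TF-IDF matrices from Step Seven

train_accuracy = accuracy_score(y_train, model.predict(x_train))
test_accuracy = accuracy_score(y_test, model.predict(x_test))
print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)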

Training Accuracy: The accuracy on the training data. A high value here means that the model fits the training data well. If this value is too high (close to 100%), there might be overfitting (the model memorizes the training data).
Test Accuracy: The accuracy on the test data. This is more important because it shows how well the model generalizes to new, unseen data.
Typical Results:
If your training accuracy is much higher than the test accuracy, your model might be overfitting, meaning it's not generalizing well to new data.
If both accuracies are relatively close and reasonably high, your model is performing well.

As an alternative, I also used a Decision Tree Classifier instead of Logistic Regression to classify the text data.

model = DecisionTreeClassifier(): Initializes a Decision Tree Classifier.
model.fit(x_train, y_train): Trains the Decision Tree using your x_train (transformed text) and y_train (labels).
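
A minimal sketch of the decision tree variant, reusing the same features:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(x_train, y_train)
tree_pred = tree_model.predict(x_test)
print("Decision Tree accuracy:", accuracy_score(y_test, tree_pred))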

I then used a confusion matrix to visually represent the performance of the Decision Tree classifier:
A confusion matrix is a table used to evaluate the performance of a classification algorithm. It provides insight into how well the model's predictions match the actual labels, with a breakdown of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

Here's how the confusion matrix is structured:

                 Predicted False        Predicted True
Actual False     TN (True Negative)     FP (False Positive)
Actual True      FN (False Negative)    TP (True Positive)
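
A sketch of how that can be plotted with seaborn (the tick labels depend on how your real/fake labels are encoded, so treat them as an assumption):

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, tree_pred)   # rows are actual labels, columns are predicted labels

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Real", "Fake"], yticklabels=["Real", "Fake"])
plt.xlabel("Predicted label")
plt.ylabel("Actual label")
plt.title("Decision Tree confusion matrix")
plt.show()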
