We're students of Information Technology (IT) at the University of Pamulang (Universitas Pamulang). It's one of the best private universities, providing excellent classes for various majors.
Student Names:
- Abdul Saboor Hamedi
- Esa Rizki Hari Utama
- Anydya Relbi Wayah Pandeyani
- Moh. Erland Sumantri
This blog is an assignment for our Computer System and Networking subject. In this blog, we will go through a paper sourced from Scopus, titled "Machine learning algorithm for detecting suspicious email messages using Natural Language Processing NLP."
You can access the paper here...
Introduction
In our increasingly connected world, email isn't just for sending holiday snaps or coordinating a Friday arvo barbie. It's a fundamental part of global connectivity and even drives economic growth. But with this convenience comes a serious downside: email is a prime target for cyber threats. We're talking sophisticated phishing schemes and sneaky malware distribution that can hit individuals, companies, and even institutions hard.
Traditional security measures, bless 'em, are finding it tough to keep up with how quickly these nasty tactics evolve. And let's be fair, there's a serious shortage of cybersecurity pros to fight this battle – one survey across eight countries found about 82% of employers are feeling the pinch. Data from the US also shows unfilled cybersecurity jobs have jumped over 50% since 2015, with projections of a global deficit reaching a whopping 1.8 million roles soon. This talent gap just makes the problem worse.
When security systems can't adapt, we end up with frustrating classification errors: false positives (FP) and false negatives (FN). FPs are when a perfectly harmless email gets flagged as a threat, ruining the reliability of email communication. Even worse are FNs, where a genuinely harmful email slips through the net. These errors can lead to data breaches, losing your hard-earned cash, or damaging reputations. With email being crucial for business, sorting out email security is a huge deal, demanding fresh ideas to fix the gaps in older systems.
The Rise of Machine Learning and NLP
Over the years, folks have tried various ways to combat email threats. Early efforts moved beyond just labelling emails as 'ham' (good) or 'spam' (bad) to a three-class system using Artificial Neural Networks (ANN). Hybrid machine learning techniques also showed promise. More recently, tackling targeted malicious emails (TMEs) has been a focus. Some approaches have used methods like SpamAssassin and ClamAV, while others have found success with Support Vector Machine (SVM) algorithms. However, dealing with the sheer volume and complexity of spam means selecting the right features in the email data is crucial for boosting performance.
Despite the progress, existing systems still chuck up too many false positives, don't quite grasp the full context of a phishing attempt, and struggle to adapt to new threats. This is where combining Machine Learning (ML) and Natural Language Processing (NLP) comes in. NLP helps systems analyse the actual content of emails, spotting tricky language and common patterns found in phishing attempts. This improved accuracy, cutting down on both FPs and FNs. Later on, NLP started looking at the context and meaning of the content, working out the intent behind the text to tell suspicious emails apart from everyday ones. Analysing linguistic patterns and sentiment became key, flagging emails that use persuasive or urgent language. It turns out NLP works particularly well when teamed up with SVM models.
Our Proposed Solution: SVC with NLP and BERT
This research builds on previous efforts by bringing together a Support Vector Classifier (SVC) with NLP-based feature extraction, including the advanced BERT model, to really nail down classification accuracy and cut those pesky false positives. While past studies often used Random Forest, Naïve Bayes, or standard SVM models, our work shows that an optimised SVC model using smart feature selection techniques achieves higher accuracy (an impressive 98.65%) and is more effective at filtering spam.
How Does It Work? Unpacking the Methodology
Let's break down the process.
- The Data: The study used the Kaggle Email Spam Classification Dataset. This dataset is a benchmark and contains details from 5172 emails, each labelled as either spam (1) or non-spam (0). Each email is represented by word counts across a massive 3002 columns, plus its label. About 39.4% were spam and 60.6% non-spam, showing the dataset had a class imbalance.
- Getting the Data Ready (Preprocessing): Before the ML model could chew on the data, it needed a clean-up. This involved a few steps:
  - Lowercasing: All text was converted to lowercase to ensure consistency.
  - Tokenization: Emails were broken down into individual words or "tokens" for detailed analysis.
  - Stop Word Removal: Common, uninformative words like "the", "is", and "and" were removed to focus on words that actually carry meaning. Other cleaning included removing emojis, HTML tags, special characters, and URLs.
  - Lemmatization: Words were reduced to their base form.
- Extracting Features: Once the text was clean, important features were pulled out using NLP techniques.
  - One-Hot Encoding: Categorical features (like email subject) were turned into a binary format. While efficient, it doesn't capture meaning or relationships between words, limiting its use for complex phishing detection.
  - TF-IDF Vectorization: This technique turns the preprocessed text into numerical features, giving words a weight based on how often they appear in an email compared to the whole dataset. It's simple and good for basic text classification, but misses the context between words, which is a limitation for sophisticated phishing.
  - BERT Embeddings: To fix the limitations of TF-IDF and One-Hot Encoding, BERT was brought in. BERT is a state-of-the-art NLP model that creates contextual embeddings, helping the model understand the meaning of words based on their surroundings. This is a game-changer for spotting subtle linguistic cues in phishing emails, though it does require a fair bit of computational power. Combining One-Hot, TF-IDF, and BERT showed the best performance in feature extraction tests.
- Handling Imbalance: Because there were more non-spam emails than spam, the dataset was imbalanced. To counter this bias and improve performance, especially in reducing false negatives, several techniques were used:
  - SMOTE: This technique creates synthetic examples for the minority class (spam) by interpolating between existing ones. This boosted recall.
  - Under-sampling: This reduces the number of majority class (non-spam) examples. This slightly reduced overall accuracy but improved precision for spam detection.
  - Algorithmic Adjustments: The class_weight parameter in the SVC was set to 'balanced', giving more importance to the minority class during training. Combining these methods helped balance recall and precision for better overall reliability.
- Choosing and Training the Model: An SVC model was selected because it's great at binary classification (suspicious or not suspicious), handles high-dimensional data well (like text features), and finds an optimal separation boundary that helps prevent overfitting. Its performance was compared against other popular classifiers like Random Forest, Neural Networks, Decision Trees, and Naive Bayes. The SVC model came out on top, especially for critical metrics like recall and F1-score, which are vital for catching true threats. The training involved using k-fold cross-validation (with 5 folds) to check for overfitting and evaluate the model better. GridSearchCV was used to find the best settings (hyperparameters) for the SVC model. Specific parameters for the SVC included C=1.0, kernel='rbf' (the RBF kernel, for non-linear data), gamma='scale', and class_weight='balanced'. The model achieved a training accuracy of 98.89%.
- Understanding the SVC Structure: At its heart, the SVC finds the best hyperplane (a decision boundary) to separate the different classes in the data. It does this by maximising the margin (distance) between the boundary and the closest data points from each class, known as Support Vectors. The model uses a Kernel Function to calculate similarity between data points.
- The Procedure in Steps: The overall process involved Exploratory Data Analysis (EDA) to visualise the dataset, the preprocessing and splitting (80% for training, 20% for testing), loading and initialising the SVC model, training the model, and finally testing it. The process can be visualised as data preprocessing -> handling imbalance -> model training -> model evaluation -> output.
- Tools Used: The research employed standard computing gear and several software tools, including Python (3.7), Pandas, NumPy, Scikit-learn, NLTK, Matplotlib/Seaborn, BeautifulSoup, Joblib, Uvicorn, and FastAPI.
- Deployment: The proposed email security solution is envisioned as a browser extension installed on a user's personal computer. The ML and NLP modules would work within a security engine in the extension to analyse emails and provide results to the user.
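To make a few of the steps above more concrete, here's a minimal sketch in plain Python of the text clean-up, TF-IDF weighting, and the 'balanced' class-weight idea. It is not the paper's code: the stop-word list, the sample emails, and the function names are all made up for illustration, lemmatization is left out (it needs a dictionary such as the one NLTK provides), and the TF-IDF formula is the textbook version rather than the smoothed variant that libraries like Scikit-learn use.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); NLTK ships a fuller one.
STOP_WORDS = {"the", "is", "and", "a", "to", "of", "your", "for", "at"}


def preprocess(text):
    """Lowercase, strip HTML tags and URLs, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags first
    text = re.sub(r"https?://\S+", " ", text)  # then remove URLs
    tokens = re.findall(r"[a-z]+", text)       # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


# Hypothetical sample emails, with labels 1 = spam and 0 = non-spam.
emails = [
    "<p>URGENT: verify your account at http://example.test/login</p>",
    "Meeting notes for the project are attached",
    "Verify the invoice and the project budget",
]
tokens = [preprocess(e) for e in emails]
vectors = tf_idf(tokens)
weights = balanced_class_weights([1, 0, 0])
```

In this toy example "urgent" appears in only one email, so it gets a higher weight than "verify", which appears in two. The rarer spam class also ends up with the larger class weight, which is the effect class_weight='balanced' is meant to have during training.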
The Results Are In!
Our model achieved an impressive accuracy of 98.65% on the test set. This is pretty darn good at telling spam/phishing emails from legitimate ones. The study used a test set of 1034 emails. The results, shown in the confusion matrix, highlight the model's effectiveness:
- True Positives (TP): 731 phishing emails correctly identified.
- True Negatives (TN): 290 non-phishing emails correctly identified.
- False Positives (FP): Only 11 non-suspicious emails were wrongly flagged as suspicious.
- False Negatives (FN): Only 3 suspicious emails were wrongly missed.
This high precision means the model is reliable for real-world use.
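As a quick sanity check, the usual metrics can be recomputed from the four confusion-matrix counts quoted above. This is just a small sketch using the standard textbook formulas; the function name is our own and is not taken from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary-classification metrics from raw counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)  # share of flagged emails that were truly bad
    recall = tp / (tp + fn)     # share of bad emails that were caught
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts as listed in the results above.
metrics = classification_metrics(tp=731, tn=290, fp=11, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With the counts as listed (which add up to 1,035 emails), accuracy works out to about 98.65%, matching the headline figure, with precision at roughly 98.5% and recall at roughly 99.6%.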
Comparing our SVC approach with others mentioned in the literature showed it performed very competitively:
| Paper | Model Used | Accuracy |
|------------------------|---------------------------|----------|
| Amin et al. | Random Forest Classifier | 91% |
| Khamis et al. | SVM | 88.80% |
| Ghaleb et al. | MOGOA and EGOA | 98.3% |
| Magdy et al. | ANN | 99.5% |
| M. Dewis and T. Viana | MLP | 94% |
| Y. Li | Naïve Bayes | 99.2% |
| Our SVC Approach | Integrated NLP and SVC | 98.65% |
Table derived from source.
While some studies showed slightly higher accuracy (like Magdy et al. and Y. Li), our approach particularly excels in reducing FPs and FNs, which is crucial for reliability.
Balancing Performance and Efficiency
It's also worth looking at how the models perform computationally:
| Classifier | Training time (s) | Inference latency (ms/email) | Memory usage (MB) | Accuracy (%) |
|:---|:---:|:---:|:---:|:---:|
| Random Forest | 18.5 | 1.8 | 210 | 97.20 |
| Neural Network | 210.3 | 5.6 | 350 | 97.80 |
| Naive Bayes | 4.2 | 0.9 | 80 | 95.40 |
| Gradient Boosting | 35.7 | 2.1 | 180 | 97.60 |
| SVC (Proposed) | 42.1 | 3.2 | 120 | 98.65 |
Table derived from source.
Our SVC model takes longer to train (42.1 s) than Naive Bayes (4.2 s), Random Forest (18.5 s), or Gradient Boosting (35.7 s), mainly because of its kernel-based optimisation. Its memory use (120 MB) is higher than Naive Bayes (80 MB) but lower than Gradient Boosting (180 MB), Random Forest (210 MB), and the Neural Network (350 MB). Its inference latency (3.2 ms/email) is slower than every model except the Neural Network (5.6 ms/email), but in return it delivers the highest accuracy in the table (98.65%). The Neural Network came second on accuracy (97.80%), but with significant computational cost (210.3 s to train, 350 MB), which limits scalability for large-scale deployments. This shows that the choice of model depends on what you need: SVC is great for high accuracy, but others might suit if computing resources are tight.
Dealing with Errors: False Positives and False Negatives
Let's revisit those classification errors, as they matter a lot in detecting phishing emails.
- False Positives (FPs) are annoying. They can make users distrust the system, lead to lost productivity from checking quarantined emails, and mean you might miss important messages.
- False Negatives (FNs) are dangerous. When a phishing email slips through, it can lead to successful attacks, compromising sensitive info, and causing financial and reputational damage. FNs are considered more critical in this context.
How can we tackle these?
- Threshold Adjustment: Tweaking the model's decision threshold can help balance catching threats (sensitivity) with not flagging good emails (specificity). This was tested and reduced FNs by 20%.
- Ensemble Methods: Combining multiple models (like SVC with others) can make the system tougher and more accurate, helping to reduce both FPs and FNs.
- Cost-sensitive Learning: Designing models that penalise missing a threat (FN) more heavily than flagging a safe email (FP) can bias the model towards minimising FNs.
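Here's a small sketch of how threshold adjustment and cost-sensitive thinking fit together: score each email, try a few thresholds, and pick the one with the lowest weighted error cost. The scores, labels, candidate thresholds, and cost values below are all hypothetical, chosen only to illustrate the idea, and the function names are our own rather than anything from the paper.

```python
def confusion_counts(scores, labels, threshold):
    """Count FPs and FNs when emails scoring >= threshold are flagged."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


def pick_threshold(scores, labels, candidates, fn_cost=5.0, fp_cost=1.0):
    """Choose the candidate threshold with the lowest weighted error cost.

    fn_cost > fp_cost encodes that a missed threat is worse than a false alarm.
    """
    def cost(threshold):
        fp, fn = confusion_counts(scores, labels, threshold)
        return fp_cost * fp + fn_cost * fn

    return min(candidates, key=cost)


# Hypothetical classifier scores (higher = more suspicious) and true labels.
scores = [0.95, 0.80, 0.45, 0.35, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0]

default = confusion_counts(scores, labels, 0.5)
best = pick_threshold(scores, labels, candidates=[0.5, 0.4, 0.3])
print("FP/FN at 0.5:", default)
print("chosen threshold:", best, "FP/FN:", confusion_counts(scores, labels, best))
```

In this toy data, lowering the threshold from 0.5 to 0.3 removes both false negatives at the price of one false positive, which is the trade-off you accept when missed threats are treated as the more costly error.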
Even with few errors, there's always room to improve. The complexity of email content and very subtle text details can still cause misclassifications.
Conclusion and What's Next
This study put forward a solid email security framework combining machine learning and NLP to get better at spotting suspicious emails. By teaming SVC with advanced feature extraction like BERT embeddings, the model hit an accuracy of 98.65%, outdoing many older spam detection methods. The results clearly show this system is effective at cutting down both false positives and false negatives, making email communication more reliable.
However, there are still challenges:
- Dataset Scope: The dataset used is a benchmark, which is good for evaluation, but it might not fully represent the wild diversity and constantly changing nature of real-world phishing emails, especially in businesses. It lacks some real-world context like sender reputation.
- Overfitting Risks: Complex models like BERT can risk overfitting, although cross-validation helps. Using more data and regularization techniques could further mitigate this.
- Language and Domain: The model was mainly trained on English emails. It might not work as well for other languages or phishing attacks specific to different cultures or domains.
- Dataset Biases: The dataset might have biases from its source or how it was labelled, potentially affecting performance in varied real-world situations.
Looking ahead, future work will focus on:
- Cross-Lingual Training: Making the framework work for emails in different languages using models like mBERT.
- Bias Mitigation and External Validation: Testing the model on real corporate datasets and using techniques to identify and correct biases from public datasets.
- Generalizability: Using transfer learning to help the model adapt to new email types, including regional threats and emerging tactics like AI-generated phishing.
These steps are about making the model continuously better and adaptable to the ever-changing world of email threats.
Ultimately, this research highlights the huge potential of using AI-driven solutions to enhance email security, helping to protect our digital communication channels from cyber threats.