DEV Community: saboor

Your Thesis is Being Used to Train Its Own Replacement: The Ethics of Cloud-Based AI Detection

saboor — Thu, 18 Jun 2026 01:28:22 +0000

Team

Abdul Saboor Hamedi
Rosyaida Saara Hellena
Akhmad ghozal
Novar Kurniawan

You can find all the source code in this repo.

The Hidden Cost of Good Writing

There is a growing fear in schools and universities today. Students who spend hours writing perfect, professional papers are being wrongly accused of cheating.

Because online AI detectors are too simple, they cannot tell the difference between a highly disciplined human writer and a machine. To a basic detector, clean and correct English looks exactly like a robot wrote it.

To fix this, we built the Neural Lab—a private, offline software tool that lets you check your own writing on your own computer without losing your privacy.

Free Online Detectors Steal Your Work

Most popular online AI detectors seem free, but you actually pay with your data.

The Risk: When you upload your essay or thesis to a cloud website, you give away your ownership. The company uses your original ideas to train and improve their future AI systems.
The Solution: The Neural Lab runs completely on your own computer hardware. It keeps your data safe and uses a fast local system to scan a massive document in under 2 seconds.

How Neural Lab Works Under the Hood

To keep your data 100% private, the entire system is built to run locally on your own machine, completely bypassing the cloud. The following figure shows how the project handles your text using a fast, secure setup:

1. Ingestion (The Gateway)

The system workflow begins with how it safely receives your documents.

Frontend Editor (A): This is the simple screen where you paste your raw text or upload your PDF files to be checked.
FastAPI Gateway (B): This acts like a private front door on your computer. It safely takes your document without letting any outside websites spy on it, keeping your files completely private.

2. Pre-processing (The Smart Traffic Controller)

Once inside, the tool checks the document size and breaks it down.

Document Length Check (C): The system automatically counts the characters to choose the fastest way to handle your paper.
Fast Heuristic Router (D): If your document is massive (over 30,000 characters), it takes this shortcut path to keep your computer from slowing down or freezing.
Deep Neural Router (E): If your document is standard or short (under 30,000 characters), it takes this path for a deeper, highly accurate test.
Splitter (F): No matter the length, the tool cuts the text into separate sentences so it can scan your writing piece by piece.
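The routing and splitting steps above can be sketched in a few lines of Python. This is only an illustration: the 30,000-character threshold comes from the description above, but the function names, the handling of a document that is exactly 30,000 characters long, and the simple punctuation-based splitter are assumptions, not the project's actual code.

```python
import re

# Threshold taken from the text above; the constant name is hypothetical.
FAST_PATH_THRESHOLD = 30_000


def choose_route(text: str) -> str:
    """Pick the fast heuristic path for very long documents, else the deep path.

    Assumption: a document of exactly 30,000 characters goes to the deep path.
    """
    return "heuristic" if len(text) > FAST_PATH_THRESHOLD else "neural"


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = "Short one. Is this a question? Yes it is!"
print(choose_route(sample))
print(split_sentences(sample))
```

A real splitter would also need to handle abbreviations, decimals, and quoted speech, which this sketch ignores.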

3. Forensic Instruments (The Three Examiners)

Next, three different local tests scan your text fragments at the exact same time:

Neural Analysis (G): Uses an AI model (GPT-2) running directly on your computer to measure how predictable and robotic your word patterns look.
Lexical Analysis (H): Checks your personal writing habits by looking at your sentence rhythm and how many unique words you choose.
Integrity Audit (I): Uses a fast local trigram index in PostgreSQL (pg_trgm) to check your text against your own private saved drafts to see if you are reusing your older work.

Global Verdict: 81% Neural

What Your Score Means: The engine runs a deep mathematical calculation across every line to give you an overall result. For example, a document scoring an 81% Neural Signature shows very consistent machine-like text patterns, meaning the system is highly confident the writing mimics an AI model.

4. Score Fusion (The Math Combiner)

After gathering data from all three tests, the system blends the signals together so it doesn't make mistakes or rely on unfair guesses.

Forensic Calculator (J): This acts as the central brain, combining the neural scores, vocabulary patterns, and draft history.
Sigmoid Normalization: It runs these mixed numbers through a smooth mathematical curve that squeezes each sentence's result into a score between 0 and 1. The overall score then weights each sentence by its length, so long, critical paragraphs carry more weight than tiny connector phrases.

5. Persistence & Presentation (The Safe and Display)

The final step securely logs your data and builds your visual dashboard.

PostgreSQL Database (L): Your scan histories, text pieces, and final scores are safely saved right on your own hard drive under your complete control.
3-Tier Assembly Engine (M): Uses a clean system build to instantly package your saved metrics for your screen.
Visual UI (N): Brings up a clear, color-coded editor where you can view your highlighted sentences alongside automated metric graphs.

Why Being "Too Perfect" Makes You Look Like a Bot

Basic detectors use a single-metric approach, meaning they only look at one thing: word patterns.

AI is trained on highly polished, formal English.
If a human writes a flawless paper, a basic detector flags it as AI simply because it lacks mistakes.
The Neural Lab solves this by testing multiple patterns at once, including how tightly packed the information is.

Real Human Writing Has a "Heartbeat"

Human writers naturally change their rhythm. We talk and write in irregular bursts:

We constantly mix short, punchy statements with long, complicated sentences.
This creates a bumpy, natural reading rhythm.
AI models are tuned to be smooth and fluent, so their sentences tend to share a similar length and complexity, making them sound flat and robotic.

Territory Coverage Breakdown

Mapping Your Document: The system splits your physical text into three clear zones based on probability thresholds, showing you exactly how much of your document real estate looks human versus machine.

Exact AI Match: High AI signals (probability over 60%).
Minor AI Changes: Mixed text that looks like a hybrid blend (probability 25% to 60%).
Human Written: Clean text showing authentic human signatures (probability under 25%).
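To make the three zones concrete, here is a small sketch of how the thresholds above could be applied. The 25% and 60% cut-offs come from the list above; measuring coverage by character count and the function names are assumptions made for illustration.

```python
ZONES = ("Exact AI Match", "Minor AI Changes", "Human Written")


def zone(prob: float) -> str:
    """Map a sentence-level AI probability (0..1) to one of the three zones."""
    if prob > 0.60:
        return "Exact AI Match"
    if prob >= 0.25:
        return "Minor AI Changes"
    return "Human Written"


def coverage(sentences):
    """Share of the document (by characters, an assumption) in each zone.

    `sentences` is a list of (text, probability) pairs.
    """
    shares = {name: 0.0 for name in ZONES}
    total = sum(len(text) for text, _ in sentences)
    if total == 0:
        return shares
    for text, prob in sentences:
        shares[zone(prob)] += len(text) / total
    return shares


# Made-up example: 50, 30 and 20 characters with different probabilities.
doc = [("a" * 50, 0.90), ("b" * 30, 0.40), ("c" * 20, 0.10)]
print(coverage(doc))
```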

The Element of Surprise

AI text is highly predictable because the machine tends to select the most statistically likely word to come next.

The Neural Lab measures how surprised its local baseline engine is by your text.
Because human minds are creative, we choose unexpected words and unique phrasing that break the machine’s mathematical expectations.

Your Typos Prove You Are Human

In a digital world where perfection belongs to machines, human mistakes have become a superpower.

The Neural Lab scans your text for natural human typing mistakes, like swapped letters or missed apostrophes.
While an AI can be told to fake a mistake, it cannot copy the chaotic, organic pattern of a real human slip of the finger.
If the tool finds a genuine human typo, it instantly drops the AI suspicion score.

The "Furthermore" Trap

AI models tend to constantly recycle a small list of transition words like furthermore, moreover, consequently, and essentially.

If your paper repeats these specific words too often, your vocabulary score drops.
Humans naturally use a much wider variety of words when explaining complex ideas.
The Neural Lab uses a dashboard vocabulary gauge to track this; using a rich, diverse set of words provides a strong indicator that a human wrote the text.

Model Classification Matrix

Behind the Accuracy Numbers: To protect authentic academic writers from false plagiarism accusations, the local engine monitors its own system performance constants to keep its precision levels incredibly high.

Precision Control: 95.20% accuracy in separating human text from AI.
Confidence Rating: 94.90% reliability score based on signal data.
Recall Sensitivity: 96.10% success rate in catching actual AI patterns.

Protecting Your Voice

AI detectors must stop using single, blind guesses to accuse students. The Neural Lab fixes this by making sure your main paragraphs hold more weight than small connecting sentences.

As AI continues to make all writing look identical, your unique rhythm, diverse vocabulary, and even your mistakes are your best defense to prove your humanity.

A Private Laboratory for Forensic AI Detection and Academic Rigor

saboor — Wed, 06 May 2026 12:26:11 +0000

This post serves as my master's degree midterm at the University of Pamulang. The subject is Advanced Computer Vision.
Author: Abdul Saboor Hamedi
Repo

Taking Back Our Data: Why We Built a Private AI Detector (And How It Works)

If you have spent any time in the modern academic landscape, you already know about the climate of fear surrounding AI writing tools. While large language models offer incredible assistance, they have also created a massive problem: students and researchers live in constant fear of being wrongly accused of AI plagiarism.

Have you ever written a highly structured, perfectly grammatical paper, only to have a cloud-based detector slap a 90% AI-generated label on it? You aren't alone. This is known as "Standardization Bias." Highly disciplined writers and non-native English speakers frequently get flagged simply because their writing is "too perfect." Most mainstream detectors rely on a flawed, single-metric approach that struggles to tell the difference between Standardized Academic English and machine-generated text.

Even worse is the data privacy crisis. When you upload your thesis or a novel research paper to a cloud-based AI detector, you are often participating in a "pay with your data" model. You are essentially handing your intellectual property over to a third-party corporation to train future models.

To solve this, a new system was developed called the Neural Lab. It is designed as a localized, private laboratory for forensic AI detection, giving the power of auditing back to the user before their work ever reaches a teacher's desk. Let’s dive into the architecture, the detection methodology, and some of the cool engineering challenges solved along the way.

The Architecture Stack

The Neural Lab was built for industrial production, running entirely on the user's local hardware to guarantee full data sovereignty.

The backend relies on a high-performance FastAPI web server with a PostgreSQL database acting as the persistence layer to safely store research history. On the machine learning side, the engine integrates a PyTorch backend to run the foundational transformer model, which seamlessly auto-detects CUDA availability to load the model in half-precision (float16) on GPU machines, saving VRAM and boosting throughput.

The Four Pillars of Detection (The Secret Sauce)

Instead of relying on a single, easily confused metric, the Neural Lab utilizes a four-metric fusion approach to dramatically reduce false positives.

The baseline model powering this is GPT-2. Why an older model? Because GPT-2 represents the foundational autoregressive logic used by all modern LLMs. When you feed GPT-2 text generated by a more advanced model like GPT-4 or Claude, it recognizes the familiar probability-maximization patterns.

The first core metric is Perplexity, which basically measures how "surprised" the model is by the text. Advanced AI text registers as low perplexity because GPT-2 easily anticipates it, whereas idiosyncratic human text registers as high perplexity.
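As a rough illustration, perplexity is the exponential of the average negative log-probability the model assigned to each token it actually saw. The sketch below uses made-up token probabilities instead of real GPT-2 outputs, so the numbers are only there to show the direction of the effect.

```python
import math


def perplexity(token_probs):
    """exp(mean negative log-probability) of the observed tokens."""
    negative_log_probs = [-math.log(p) for p in token_probs]
    return math.exp(sum(negative_log_probs) / len(negative_log_probs))


# Hypothetical probabilities a language model might assign to each token.
predictable_text = [0.90, 0.80, 0.95, 0.85]   # the model saw these words coming
surprising_text = [0.05, 0.20, 0.01, 0.10]    # the model was caught off guard

print(perplexity(predictable_text))
print(perplexity(surprising_text))
```

The first value comes out close to 1 (barely surprised), while the second is far higher.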

The second metric is Shannon Entropy, which measures predictability at the token level. AI is highly confident at each step, taking highly predictable token paths (low entropy). Humans, on the other hand, are uncertain, and our text could go in many different directions at any given point (high entropy).
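Here is a minimal sketch of Shannon entropy over a next-token distribution. The two distributions are invented for illustration; in a real pipeline they would come from the model's output probabilities at each position.

```python
import math


def shannon_entropy(distribution):
    """Entropy in bits of a probability distribution over next tokens."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)


# Hypothetical next-token distributions over four candidate tokens.
confident = [0.97, 0.01, 0.01, 0.01]   # one obvious continuation
uncertain = [0.25, 0.25, 0.25, 0.25]   # four equally likely continuations

print(shannon_entropy(confident))
print(shannon_entropy(uncertain))
```

Four equally likely options give exactly 2 bits, while the confident distribution stays well under 1 bit.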

The third metric is my personal favorite: Sentence Burstiness, which acts as the "Human Pulse." Human writers naturally vary their rhythm. We mix short, punchy sentences with long, drawn-out elaborations, creating high statistical variance. AI models are optimized for coherence and fluency, meaning they tend to spit out sentences of highly uniform complexity resulting in low burstiness.
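One simple way to put a number on burstiness is the coefficient of variation of sentence lengths (standard deviation divided by mean). That definition is an assumption chosen for this sketch; the project may compute it differently.

```python
import re
import statistics


def burstiness(text: str) -> float:
    """Std-dev of sentence lengths (in words) divided by their mean."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


uniform = "One two three four. Five six seven eight. Nine ten eleven twelve."
varied = ("No. This sentence, by contrast, wanders on for quite a while "
          "before it stops. Short again.")

print(burstiness(uniform))
print(burstiness(varied))
```

The uniform sample scores 0 because every sentence has the same length, while the varied sample scores well above it.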

Finally, the system calculates a Lexical Uniqueness Ratio. This catches the AI's tendency to constantly lean on common transitional vocabulary (think words like "furthermore," "moreover," and "essentially").
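A rough sketch of this idea is a type-token ratio (unique words divided by total words) plus a count of the transition words mentioned above. Both formulas and the function names are assumptions for illustration, not the project's exact metric.

```python
import re

# Transition words named in the text; a real list would be longer.
TRANSITIONS = {"furthermore", "moreover", "essentially"}


def tokenize(text: str):
    return re.findall(r"[a-z']+", text.lower())


def lexical_uniqueness(text: str) -> float:
    """Unique words divided by total words (type-token ratio)."""
    words = tokenize(text)
    return len(set(words)) / len(words) if words else 0.0


def transition_share(text: str) -> float:
    """Fraction of words that are stock transition words."""
    words = tokenize(text)
    return sum(w in TRANSITIONS for w in words) / len(words) if words else 0.0


print(lexical_uniqueness("the cat saw the cat"))
print(transition_share("Furthermore, it works. Moreover, it scales."))
```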

These four metrics are fused together using a non-linear sigmoid calibration to create a final AI probability score. The developers even added a brilliant "Human Override" pipeline that scans for structural keystroke typos—if a human error is found, it applies a massive static deduction to the AI score.
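The fusion step can be pictured as a weighted sum pushed through a sigmoid, followed by the typo deduction. Everything numeric in this sketch is hypothetical: the weights, the bias, the size of the deduction, and the assumption that each metric arrives as a standardized score where higher means more human-like.

```python
import math

# Hypothetical weights: higher (more human-like) metric values lower the AI score.
WEIGHTS = {"perplexity": -1.2, "entropy": -0.8, "burstiness": -1.0, "lexical": -0.6}
BIAS = 0.0
TYPO_DEDUCTION = 0.30  # hypothetical size of the static "Human Override" deduction


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ai_probability(metrics: dict, has_human_typo: bool = False) -> float:
    """Fuse standardized metrics into one AI probability between 0 and 1."""
    linear = BIAS + sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)
    prob = sigmoid(linear)
    if has_human_typo:
        prob = max(0.0, prob - TYPO_DEDUCTION)
    return prob


neutral = {"perplexity": 0.0, "entropy": 0.0, "burstiness": 0.0, "lexical": 0.0}
machine_like = {"perplexity": -1.0, "entropy": -1.0, "burstiness": -1.0, "lexical": -1.0}

print(ai_probability(neutral))
print(ai_probability(machine_like))
print(ai_probability(neutral, has_human_typo=True))
```

The sigmoid keeps the result between 0 and 1 no matter how extreme the inputs are, which is why it is a common choice for calibration.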

Forensic Audit Performance Metrics

To demonstrate the system's precision, the following data represents a typical industrial audit performance. The engine uses a fused scoring model to ensure high-fidelity results.

Neural Signature & Coverage

| Metric | Value | Explanation |
|---|---|---|
| Global Neural Confidence | 35% | The character-weighted average of AI intensity across the entire document. |
| Human Authorship | 65% | The percentage of the document identified as purely human-written. |
| Exact AI Match | 0% | Territory coverage for sentences with 100% machine signature. |
| Minor AI Changes | 70% | Territory coverage for sentences likely refined or edited by AI. |
| Human Written | 30% | Territory coverage for sentences with no machine signature. |

Integrity Audit & Neural Accuracy

| Metric | Value | Explanation |
|---|---|---|
| Plagiarism Index | 25% | Percentage of this document found in previously archived records. |
| Archive Size | 176 | Total records currently stored in the private forensic archive. |
| Source Traces | 61 | Individual matches detected across the internal database. |
| F1-Score | 95.80% | The harmonic mean of the engine's precision and recall. |
| Audit Confidence | 86.20% | The statistical reliability of the current forensic audit. |
| Precision | 95.20% | The share of text flagged as AI that really is AI-generated. |
| Recall | 96.10% | The engine's ability to detect all instances of machine text. |

Streamlined Ingestion: Drag & Drop PDF Support

A critical feature for researcher productivity is the native support for Drag & Drop PDF ingestion. Instead of traditional file picking, users can simply drag a PDF file directly onto the editor workspace. This triggers an automated pipeline:

Extraction: The system uses PyMuPDF to extract text with per-word precision.
Coordinates: It maintains spatial bounding boxes (bbox) for every word.
Visualization: The frontend renders a visual text layer over the original PDF layout, allowing for spatial forensic highlighting without altering the source document.

Overcoming Engineering Challenges

Building a high-fidelity local detector brought up some fascinating edge cases.

One major issue was the Calibration Consistency Problem. Early versions of the app would sometimes show a global score of 97% AI, but the editor would highlight almost no sentences in red. This happened because the global score and the sentence scores were being calculated by two completely independent algorithms. To fix this UI nightmare, the global score was rewritten to be a character-length-weighted average of the individual sentence probabilities. This means a 50-word analytical paragraph now correctly carries more weight than a 5-word transition sentence, ensuring the UI is exactly 100% mathematically consistent with what the user sees highlighted.
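The character-length-weighted average described above is short enough to sketch directly. The sample sentences and probabilities are made up; only the weighting idea comes from the text.

```python
def global_score(sentences):
    """Character-length-weighted average of sentence-level AI probabilities.

    `sentences` is a list of (text, probability) pairs.
    """
    total_chars = sum(len(text) for text, _ in sentences)
    if total_chars == 0:
        return 0.0
    return sum(len(text) * prob for text, prob in sentences) / total_chars


# A long, human-looking paragraph and a short, AI-looking transition.
doc = [("x" * 300, 0.20), ("y" * 30, 0.90)]

weighted = global_score(doc)
unweighted = sum(prob for _, prob in doc) / len(doc)
print(round(weighted, 3), round(unweighted, 3))
```

Here the weighted score lands near 0.26 while a plain average would report 0.55, which is the kind of mismatch between the headline number and the highlights that the fix removes.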

Another massive win was optimizing the Integrity Audit Engine (the plagiarism checker). Scanning a massive document by running exact string matches against a database is a fast track to an expensive $O(N)$ database loop. Scanning a 78,000-character document used to take about 60 seconds and required over 520 individual queries.

The engineering fix? They partitioned the text into fixed 55-character overlapping fingerprints and utilized PostgreSQL's pg_trgm trigram indexing. By passing the extracted fingerprints in a single batched execution using UNNEST and ILIKE, they reduced the database scan time from 60 seconds down to under 2 seconds.
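Here is a hedged sketch of the fingerprinting idea. The 55-character window comes from the text; the 40-character step (which gives a 15-character overlap), the table name `archived_documents`, and the column names are all hypothetical, and the SQL is shown only as a string to illustrate the single-batch shape with UNNEST and ILIKE.

```python
FINGERPRINT_LEN = 55  # from the text
STEP = 40             # hypothetical stride; anything below 55 makes windows overlap


def fingerprints(text: str, size: int = FINGERPRINT_LEN, step: int = STEP):
    """Cut whitespace-normalized text into fixed-size overlapping windows."""
    normalized = " ".join(text.split())
    if len(normalized) <= size:
        return [normalized] if normalized else []
    return [normalized[i:i + size]
            for i in range(0, len(normalized) - size + 1, step)]


# One round trip for all fingerprints instead of one query per fingerprint.
# Table and column names are placeholders. A '%' or '_' inside a fingerprint
# would act as a wildcard unless escaped, which this sketch does not handle.
BATCH_QUERY = """
SELECT f.fp, d.id
FROM unnest($1::text[]) AS f(fp)
JOIN archived_documents AS d
  ON d.content ILIKE '%' || f.fp || '%';
"""

sample = "".join(chr(97 + i % 26) for i in range(135))
fps = fingerprints(sample)
print(len(fps), [len(fp) for fp in fps])
```

A trigram GIN index on the searched column is what lets PostgreSQL answer ILIKE patterns like these without scanning every row.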

Wrapping Up

The Neural Lab proves that high-fidelity, mathematically rigorous AI detection doesn't require uploading your intellectual property to a corporate cloud. By leveraging local hardware, smart database indexing, and a multi-metric forensic approach, developers can build tools that actually protect students and researchers from false accusations while keeping their data completely safe and sovereign.

Dev Snippet — A Local-First Markdown Editor That Thinks With You

saboor — Fri, 02 Jan 2026 02:56:59 +0000

Free. Offline. No Cloud. No Tracking. Just pure, high-performance knowledge crafting.

After nine years of coding and months of focused development, I built the Markdown editor I always wished existed — one that respects your focus, your privacy, and your intelligence.

Why Dev Snippet stands out:

True Local-First: All data stays on your machine via SQLite and a secure snippet:// protocol. No accounts. No forced sync.
Flow Mode: A borderless, shadowed editor designed to induce deep work — neither VS Code nor Obsidian offers this.
Scientist Mode: Typewriter scrolling and minimal UI, optimized for thesis writing and long-form technical drafting.
Full Mermaid and Mathematical Support: Render diagrams, equations, and architectural sketches directly in the live preview.
Semantic Linking: Connect ideas with wiki-links [[snippet_name]], categorize with tags (#), and reference concepts with mentions (@).
Hybrid Search: Press Ctrl+Shift+F to search across all snippets by tag, mention, language, or content — powered by FTS5 and BM25.
Stable Editing Experience: A cursor-aware rendering architecture eliminates layout jumps when switching between modes.
Thoughtful Theming: Four built-in themes inspired by modern CLI tools, with real-time sync between editor and preview.
Built for:

Researchers drafting papers in Markdown (arXiv-ready)
Developers documenting systems and code
Students organizing lecture notes with structure and links
Anyone who values privacy, speed, and cognitive clarity
Tech Stack: Electron, React, CodeMirror 6, SQLite (WAL mode), and unified.js
License: MIT — fully open source, no hidden costs

Download v1.2.2 for Windows, macOS, and Linux: releases

Source and documentation: repo

Tools should adapt to cognition — not force cognition to adapt to tools.

I welcome your feedback — especially from researchers, developers, and serious note-takers. What would make this your daily driver?

Quick Snippets — A Small Tool for Big Focus

saboor — Mon, 01 Dec 2025 13:15:56 +0000

Built for developers who hate interrupting their flow.
For future updates, see dev-dialect.
When you're deep in the zone and need to save that perfect regex, API response, or config snippet, you don't want to open a new IDE tab or search through messy text files. Quick Snippets gives you a lightning-fast, distraction-free space for those micro-knowledge pieces that deserve to be saved but don't belong in a full project.

The Philosophy: Less Friction, More Flow

Quick Snippets was born from frustration with existing tools that were either:

Too heavy (full IDEs for 10 lines of code)
Too simple (plain text files with no organization)
Too slow (cloud apps requiring authentication)

I wanted something that feels like an extension of my muscle memory — a tool that appears when I need it and disappears when I don't.

What Makes It Special

Instant Capture

Ctrl/Cmd+N → New snippet immediately focused and ready
Drag & drop files directly into the app
Auto-save every keystroke (never lose work)

Smart Organization

Command Palette (Ctrl/Cmd+P) – Fuzzy-search through all snippets instantly
Live Markdown Preview – See formatted results as you type
SQLite Backend – Local, fast, reliable storage that syncs nothing to the cloud

Keyboard-First Workflow

Ctrl+R – Rename selected snippet
Ctrl+Shift+C – Copy snippet to clipboard
Delete key – Remove with confirmation
Esc – Smart modal hierarchy (closes only what's relevant)

Clean, Focused Interface

// No bloat. No distractions.
// Just your code and a live preview.

See It in Action

Split-Pane Productivity

Left: Clean editor. Right: Instant Markdown rendering. No switching tabs.

Perfect For...

Daily Developer Tasks

Saving one-off commands you always forget
API examples and curl commands
Config snippets for different environments
Bug reproduction templates
Code review notes and templates

Beyond Just Code

Meeting notes in Markdown
Quick calculations
Project ideas
Contact templates
Issue descriptions

Getting Started

Installation

npm install
npm run dev  # For development
npm run build  # For production

First 60 Seconds

Ctrl+N – Create your first snippet
Give it a name (.js extension auto-detects language)
Type some code – watch it auto-save
Ctrl+P – Search for it
Ctrl+Shift+C – Copy it back to your main project

Why This Works

The magic is in the constraints:

No folders – Search instead of organizing
No cloud sync – Local-first means instant
No tabs – Single focus reduces cognitive load
No settings – It just works

Roadmap Ideas

Snippet tagging and collections
Quick export to gist/git
Theme customization
Plugin system for syntax highlighting

License

MIT – Use it, modify it, share it. Just keep the credits if you redistribute.

Want to Contribute?

Found a bug? Have a feature idea? The code is intentionally simple so you can jump right in. Issues and PRs are welcome!

Quick Snippets isn't trying to be your main editor. It's the sticky note on your monitor that actually gets used.

Cracking Down on Cyber Scams: A Breakthrough in Email Threat Detection Using AI

saboor — Wed, 11 Jun 2025 12:09:22 +0000

We're students of Information Technology (IT) at the University of Pamulang (Universitas Pamulang). It's one of the best private universities, providing excellent classes for various majors.
Student Names:

Abdul Saboor Hamedi
Esa Rizki Hari Utama
Anydya Relbi Wayah Pandeyani
Moh. Erland Sumantri

This blog is an assignment for our Computer System and Networking subject. In this blog, we will go through a paper sourced from Scopus, titled "Machine learning algorithm for detecting suspicious email messages using Natural Language Processing NLP."
You can access the paper here.

Introduction

In our increasingly connected world, email isn't just for sending holiday snaps or coordinating a Friday arvo barbie. It's a fundamental part of global connectivity and even drives economic growth. But with this convenience comes a serious downside: email is a prime target for cyber threats. We're talking sophisticated phishing schemes and sneaky malware distribution that can hit individuals, companies, and even institutions hard.

Traditional security measures, bless 'em, are finding it tough to keep up with how quickly these nasty tactics evolve. And let's be fair, there's a serious shortage of cybersecurity pros to fight this battle – one survey across eight countries found about 82% of employers are feeling the pinch. Data from the US also shows unfilled cybersecurity jobs have jumped over 50% since 2015, with projections of a global deficit reaching a whopping 1.8 million roles soon. This talent gap just makes the problem worse.

When security systems can't adapt, we end up with frustrating classification errors: false positives (FP) and false negatives (FN). FPs are when a perfectly harmless email gets flagged as a threat, ruining the reliability of email communication. Even worse are FNs, where a genuinely harmful email slips through the net. These errors can lead to data breaches, losing your hard-earned cash, or damaging reputations. With email being crucial for business, sorting out email security is a huge deal, demanding fresh ideas to fix the gaps in older systems.

The Rise of Machine Learning and NLP

Over the years, folks have tried various ways to combat email threats. Early efforts moved beyond just labelling emails as 'ham' (good) or 'spam' (bad) to a three-class system using Artificial Neural Networks (ANN). Hybrid machine learning techniques also showed promise. More recently, tackling targeted malicious emails (TMEs) has been a focus. Some approaches have used methods like SpamAssassin and ClamAV, while others have found success with Support Vector Machine (SVM) algorithms. However, dealing with the sheer volume and complexity of spam means selecting the right features in the email data is crucial for boosting performance.

Despite the progress, existing systems still chuck up too many false positives, don't quite grasp the full context of a phishing attempt, and struggle to adapt to new threats. This is where combining Machine Learning (ML) and Natural Language Processing (NLP) comes in. NLP helps systems analyse the actual content of emails, spotting tricky language and common patterns found in phishing attempts. This improved accuracy, cutting down on both FPs and FNs. Later on, NLP started looking at the context and meaning of the content, working out the intent behind the text to tell suspicious emails apart from everyday ones. Analysing linguistic patterns and sentiment became key, flagging emails that use persuasive or urgent language. It turns out NLP works particularly well when teamed up with SVM models.

Our Proposed Solution: SVC with NLP and BERT

This research builds on previous efforts by bringing together a Support Vector Classifier (SVC) with NLP-based feature extraction, including the advanced BERT model, to really nail down classification accuracy and cut those pesky false positives. While past studies often used Random Forest, Naïve Bayes, or standard SVM models, our work shows that an optimised SVC model using smart feature selection techniques achieves higher accuracy (an impressive 98.65%) and is more effective at filtering spam.

How Does It Work? Unpacking the Methodology

Let's break down the process.

The Data: The study used the Kaggle Email Spam Classification Dataset. This dataset is a benchmark and contains details from 5172 emails, each labelled as either spam (1) or non-spam (0). Each email is represented by word counts across a massive 3002 columns, plus its label. About 39.4% were spam and 60.6% non-spam, showing the dataset had a class imbalance.
Getting the Data Ready (Preprocessing): Before the ML model could chew on the data, it needed a clean-up. This involved a few steps:
- Lowercasing: All text was converted to lowercase to ensure consistency.
- Tokenization: Emails were broken down into individual words or "tokens" for detailed analysis.
- Stop Word Removal: Common, uninformative words like "the", "is", and "and" were removed to focus on words that actually carry meaning. Other cleaning included removing emojis, HTML tags, special characters, and URLs.
- Lemmatization: Words were reduced to their base form.
Extracting Features: Once the text was clean, important features were pulled out using NLP techniques.
- One-Hot Encoding: Categorical features (like email subject) were turned into a binary format. While efficient, it doesn't capture meaning or relationships between words, limiting its use for complex phishing detection.
- TF-IDF Vectorization: This technique turns the preprocessed text into numerical features, giving words a weight based on how often they appear in an email compared to the whole dataset. It's simple and good for basic text classification, but misses the context between words, which is a limitation for sophisticated phishing.
- BERT Embeddings: To fix the limitations of TF-IDF and One-Hot Encoding, BERT was brought in. BERT is a state-of-the-art NLP model that creates contextual embeddings, helping the model understand the meaning of words based on their surroundings. This is a game-changer for spotting subtle linguistic cues in phishing emails, though it does require a fair bit of computational power. Combining One-Hot, TF-IDF, and BERT showed the best performance in feature extraction tests.
Handling Imbalance: Because there were more non-spam emails than spam, the dataset was imbalanced. To counter this bias and improve performance, especially in reducing false negatives, several techniques were used:
- SMOTE: This technique creates synthetic examples for the minority class (spam) by interpolating between existing ones. This boosted recall.
- Under-sampling: This reduces the number of majority class (non-spam) examples. This slightly reduced overall accuracy but improved precision for spam detection.
- Algorithmic Adjustments: The class_weight parameter in the SVC was set to 'balanced', giving more importance to the minority class during training. Combining these methods helped balance recall and precision for better overall reliability.
Choosing and Training the Model: An SVC model was selected because it's great at binary classification (suspicious or not suspicious), handles high-dimensional data well (like text features), and finds an optimal separation boundary that helps prevent overfitting. Its performance was compared against other popular classifiers like Random Forest, Neural Networks, Decision Trees, and Naive Bayes. The SVC model came out on top, especially for critical metrics like recall and F1-score, which are vital for catching true threats. The training involved using k-fold cross-validation (with 5 folds) to check for overfitting and evaluate the model better. GridSearchCV was used to find the best settings (hyperparameters) for the SVC model. Specific parameters for the SVC included C=1.0, kernel='rbf' (the RBF kernel, for non-linear data), gamma='scale', and class_weight='balanced'. The model achieved a training accuracy of 98.89%.
Understanding the SVC Structure: At its heart, the SVC finds the best hyperplane (a decision boundary) to separate the different classes in the data. It does this by maximising the margin (distance) between the boundary and the closest data points from each class, known as Support Vectors. The model uses a Kernel Function to calculate similarity between data points.
The Procedure in Steps: The overall process involved Exploratory Data Analysis (EDA) to visualise the dataset, the preprocessing and splitting (80% for training, 20% for testing), loading and initialising the SVC model, training the model, and finally testing it. The process can be visualised as data preprocessing -> handling imbalance -> model training -> model evaluation -> output.
Tools Used: The research employed standard computing gear and several software tools, including Python (3.7), Pandas, NumPy, Scikit-learn, NLTK, Matplotlib/Seaborn, BeautifulSoup, Joblib, Uvicorn, and FastAPI.
Deployment: The proposed email security solution is envisioned as a browser extension installed on a user's personal computer. The ML and NLP modules would work within a security engine in the extension to analyse emails and provide results to the user.
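To give a feel for the preprocessing and TF-IDF steps described above, here is a small standard-library sketch. It is not the paper's code: the stop-word list is tiny, lemmatization is skipped, the sample emails are invented, and the TF-IDF formula is the plain textbook version (term frequency times log of inverse document frequency) rather than any particular library's smoothed variant.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "is", "and", "a", "to", "of"}  # tiny illustrative list


def preprocess(email: str):
    """Lowercase, strip HTML tags and URLs, tokenize, drop stop words."""
    text = email.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    tokens = re.findall(r"[a-z]+", text)       # also drops digits, emojis, symbols
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return vectors


emails = [
    "Claim the FREE prize now <b>now</b> http://spam.example",
    "The meeting is at noon",
    "Free lunch and the meeting notes",
]
tokenized = [preprocess(e) for e in emails]
vectors = tf_idf(tokenized)
print(tokenized[0])
print(vectors[0])
```

In the first email, "now" ends up with a higher weight than "free" because it appears twice and in no other email, which is exactly the behaviour TF-IDF is meant to capture.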

The Results Are In!

Our model achieved an impressive accuracy of 98.65% on the test set. This is pretty darn good at telling spam/phishing emails from legitimate ones. The study used a test set of 1035 emails. The results, shown in the confusion matrix, highlight the model's effectiveness:

True Positives (TP): 731 phishing emails correctly identified.
True Negatives (TN): 290 non-phishing emails correctly identified.
False Positives (FP): Only 11 non-suspicious emails were wrongly flagged as suspicious.
False Negatives (FN): Only 3 suspicious emails were wrongly missed.

This high precision means the model is reliable for real-world use.
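The headline metrics can be recomputed from the four confusion-matrix counts listed above. This is a quick arithmetic check using the standard definitions; the variable names are our own.

```python
# Counts from the confusion matrix above.
tp, tn, fp, fn = 731, 290, 11, 3

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the emails flagged, how many were truly phishing
recall = tp / (tp + fn)      # of the phishing emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  {accuracy:.2%}")   # 98.65%
print(f"precision {precision:.2%}")  # 98.52%
print(f"recall    {recall:.2%}")     # 99.59%
print(f"f1        {f1:.2%}")         # 99.05%
```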

Comparing our SVC approach with others mentioned in the literature showed it performed very competitively:

| Paper | Model Used | Accuracy |
|------------------------|---------------------------|----------|
| Amin et al. | Random Forest Classifier | 91% |
| Khamis et al. | SVM | 88.80% |
| Ghaleb et al. | MOGOA and EGOA | 98.3% |
| Magdy et al. | ANN | 99.5% |
| M. Dewis and T. Viana | MLP | 94% |
| Y. Li | Naïve Bayes | 99.2% |
| Our SVC Approach | Integrated NLP and SVC | 98.65% |

Table derived from source.

While some studies showed slightly higher accuracy (like Magdy et al. and Y. Li), our approach particularly excels in reducing FPs and FNs, which is crucial for reliability.

Balancing Performance and Efficiency

It's also worth looking at how the models perform computationally:
| Classifier | Training time (s) | Inference latency (ms/email) | Memory usage (MB) | Accuracy (%) |
|:---|:---:|:---:|:---:|:---:|
| Random Forest | 18.5 | 1.8 | 210 | 97.20 |
| Neural Network | 210.3 | 5.6 | 350 | 97.80 |
| Naive Bayes | 4.2 | 0.9 | 80 | 95.40 |
| Gradient Boosting | 35.7 | 2.1 | 180 | 97.60 |
| SVC (Proposed) | 42.1 | 3.2 | 120 | 98.65 |
Table derived from source.

Our SVC model takes considerably longer to train (42.1 s) than Naive Bayes (4.2 s) or Random Forest (18.5 s), mainly because of its kernel-based optimisation. Its memory footprint (120 MB) is higher than Naive Bayes (80 MB) but lower than Random Forest (210 MB). In return, it offers a reasonable inference speed (3.2 ms/email) and the highest accuracy in the table (98.65%). The Neural Network came closest on accuracy (97.80%), but at a significant computational cost (210.3 s of training, 350 MB of memory), which limits scalability for large-scale deployments. This shows that the choice of model depends on what you need: SVC is great for high accuracy, but others might suit better if computing resources are tight.
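
One way to read the table is as a constrained choice: pick the most accurate classifier that still fits your latency and memory budget. The sketch below encodes the table's rows and does exactly that. The `Benchmark` class and `pick_classifier` function are hypothetical helpers for illustration, not part of the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Benchmark:
    name: str
    train_s: float
    latency_ms: float
    memory_mb: float
    accuracy_pct: float


# Rows copied from the comparison table above.
BENCHMARKS = [
    Benchmark("Random Forest", 18.5, 1.8, 210, 97.20),
    Benchmark("Neural Network", 210.3, 5.6, 350, 97.80),
    Benchmark("Naive Bayes", 4.2, 0.9, 80, 95.40),
    Benchmark("Gradient Boosting", 35.7, 2.1, 180, 97.60),
    Benchmark("SVC (Proposed)", 42.1, 3.2, 120, 98.65),
]


def pick_classifier(max_latency_ms: float = float("inf"),
                    max_memory_mb: float = float("inf")) -> Optional[Benchmark]:
    """Return the most accurate classifier within the given budget, or None."""
    candidates = [
        b for b in BENCHMARKS
        if b.latency_ms <= max_latency_ms and b.memory_mb <= max_memory_mb
    ]
    return max(candidates, key=lambda b: b.accuracy_pct, default=None)


print(pick_classifier().name)                   # SVC (Proposed)
print(pick_classifier(max_memory_mb=100).name)  # Naive Bayes
```

With no limits the SVC wins on accuracy, but a tight memory budget pushes the choice to Naive Bayes, which is the trade-off described above.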

Dealing with Errors: False Positives and False Negatives

Let's revisit those classification errors, as they matter a lot in detecting phishing emails.

False Positives (FPs) are annoying. They can make users distrust the system, lead to lost productivity from checking quarantined emails, and mean you might miss important messages.
False Negatives (FNs) are dangerous. When a phishing email slips through, it can lead to successful attacks, compromising sensitive info, and causing financial and reputational damage. FNs are considered more critical in this context.

How can we tackle these?

Threshold Adjustment: Tweaking the model's decision threshold can help balance catching threats (sensitivity) with not flagging good emails (specificity). This was tested and reduced FNs by 20%.
Ensemble Methods: Combining multiple models (like SVC with others) can make the system tougher and more accurate, helping to reduce both FPs and FNs.
Cost-sensitive Learning: Designing models that penalise missing a threat (FN) more heavily than flagging a safe email (FP) can bias the model towards minimising FNs.
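
To see how threshold adjustment and cost weighting interact, here is a small sketch in plain Python. The scores and labels are made up for illustration and do not come from the study; the function names and cost values are likewise assumptions. Lowering the threshold catches more phishing (fewer FNs) at the price of more false alarms, and weighting FNs more heavily pushes the chosen threshold down.

```python
# Toy data only: (score, label) pairs invented for this example.
# label 1 = phishing, 0 = legitimate; score = estimated phishing probability.
SAMPLES = [
    (0.95, 1), (0.80, 1), (0.62, 1), (0.45, 1), (0.35, 1),
    (0.55, 0), (0.40, 0), (0.20, 0), (0.10, 0), (0.05, 0),
]


def count_errors(samples, threshold):
    """Return (false_positives, false_negatives); score >= threshold means 'phishing'."""
    fp = sum(1 for score, label in samples if score >= threshold and label == 0)
    fn = sum(1 for score, label in samples if score < threshold and label == 1)
    return fp, fn


def best_threshold(samples, candidates, fn_cost=5.0, fp_cost=1.0):
    """Pick the candidate threshold with the lowest weighted error cost."""
    def cost(threshold):
        fp, fn = count_errors(samples, threshold)
        return fn_cost * fn + fp_cost * fp
    return min(candidates, key=cost)


candidates = [0.3, 0.4, 0.5, 0.6, 0.7]
print(count_errors(SAMPLES, 0.5))  # (1, 2): one false alarm, two missed threats
print(count_errors(SAMPLES, 0.3))  # (2, 0): two false alarms, nothing missed
print(best_threshold(SAMPLES, candidates, fn_cost=5.0, fp_cost=1.0))  # 0.3
print(best_threshold(SAMPLES, candidates, fn_cost=1.0, fp_cost=5.0))  # 0.6
```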

Even with few errors, there's always room to improve. The complexity of email content and very subtle text details can still cause misclassifications.

Conclusion and What's Next

This study put forward a solid email security framework combining machine learning and NLP to get better at spotting suspicious emails. By teaming SVC with advanced feature extraction like BERT embeddings, the model hit an accuracy of 98.65%, outdoing many older spam detection methods. The results clearly show this system is effective at cutting down both false positives and false negatives, making email communication more reliable.

However, there are still challenges:

Dataset Scope: The dataset used is a benchmark, which is good for evaluation, but it might not fully represent the wild diversity and constantly changing nature of real-world phishing emails, especially in businesses. It lacks some real-world context like sender reputation.
Overfitting Risks: Complex models like BERT can risk overfitting, although cross-validation helps. Using more data and regularization techniques could further mitigate this.
Language and Domain: The model was mainly trained on English emails. It might not work as well for other languages or phishing attacks specific to different cultures or domains.
Dataset Biases: The dataset might have biases from its source or how it was labelled, potentially affecting performance in varied real-world situations.

Looking ahead, future work will focus on:

Cross-Lingual Training: Making the framework work for emails in different languages using models like mBERT.
Bias Mitigation and External Validation: Testing the model on real corporate datasets and using techniques to identify and correct biases from public datasets.
Generalizability: Using transfer learning to help the model adapt to new email types, including regional threats and emerging tactics like AI-generated phishing.

These steps are about making the model continuously better and adaptable to the ever-changing world of email threats.

Ultimately, this research highlights the huge potential of using AI-driven solutions to enhance email security, helping to protect our digital communication channels from cyber threats.

MySQL charset

saboor — Sat, 31 Oct 2020 07:30:50 +0000

I don't know if this is the right place to ask this type of question.
BTW, I have a problem with MySQL's "utf8mb4" charset. I have built a website and everything is working fine except for Arabic. Here is the case:
I can insert and update Arabic records in my table through a PHP script. When I open Workbench or DBeaver, I get the characters "ØµØ¨ÙˆØ±", which should be "صبور", but on my website I get the correct result. It looks like an encrypt and decrypt process.
The opposite also happens: when I update this record directly through MySQL, I see my name, "صبور", but when I fetch it on my website I get question marks: "?????"
Any help would be a big help. Thank you in advance.

How to decrypt hash password

saboor — Mon, 12 Oct 2020 09:12:05 +0000

Hello everyone! This is my first time posting here. I have a question: is it possible to decrypt hashed passwords and fetch them all in a table?
Here is the example