
Abdul Ghani

Posted on May 21, 2020

Real-time Phishing Attack Detection using ML 💻

#devgrad2020 #octograd2020 #showdev #githubsdp

My Final Project

So, I built this project, RPAD-ML, in my final year. It is essentially an Android app coupled with a machine learning backend server that detects 🕵️ possible phishing links in REAL TIME ⚡. It can detect malicious/phishing links from any app: open any app that has external links 🔗, and RPAD-ML will detect them in no time and give you a warning message ⚠️ right away.

Demo

Download RPAD-ML Demo APK

I know there are lots of tools available, like Google Safe Browsing, but those are mostly limited to web browsers like Chrome. So what I've done is combine a machine learning model trained on phishing sites with Google Safe Browsing: given a URL, it predicts whether it is a phishing website or not.

Link to Code

abdulghanitech / rpad-ml

Real-time Phishing Attack Detection using ML 💻

The repo contains the code for both the ML server and the Android app used to detect phishing sites in real time. Below is a flow chart of it.


How I built it

I built a machine learning model using a dataset of phishing sites.

DATA SELECTION

The dataset was downloaded from the UCI Machine Learning Repository. It contains 31 columns (30 features and 1 target) and 2456 observations.

MODELS

To fit the models, the dataset is split into training and testing sets with a 75-25 ratio, where 75% goes to the training set.
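A 75-25 split like this can be sketched with scikit-learn's `train_test_split`. This is only an illustration on placeholder data, not the project's actual code: the random arrays just mimic the 2456 x 30 shape described above, and `random_state` and `stratify` are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the shape described above (30 features, 1 target).
# The values are random stand-ins; the real project loads the UCI dataset.
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(2456, 30))
y = rng.choice([0, 1], size=2456)

# test_size=0.25 keeps 75% of the rows for training.
# stratify=y keeps the class balance similar in both parts (an assumption,
# the post does not say whether the split was stratified).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```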

Now the training set is used to train the classifier. The classifiers chosen are:

* Logistic Regression

* Random Forest Classification

* Support Vector Machine

We will see which one fits our dataset best.

1. Logistic Regression

After fitting logistic regression and building a confusion matrix of predicted versus actual values, I got 92.3% accuracy, which is good for a logistic regression model.
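To make the link between the confusion matrix and the accuracy figure concrete, here is a small sketch using scikit-learn. It runs on synthetic data from `make_classification` (an assumption, used only so the example is runnable), so the numbers it prints will not match the 92.3 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phishing dataset (30 features, binary target).
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

# Accuracy is the diagonal (correct predictions) divided by the total.
print(cm)
print(accuracy, cm.trace() / cm.sum())
```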

2. Support Vector Machine

A support vector machine with an RBF kernel, with GridSearchCV used to find the best parameters, was a really good choice. Fitting the model with those best parameters, I got 96.47% accuracy, which is pretty good.
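A grid search over an RBF-kernel SVM might look roughly like the sketch below. The parameter grid, the 5-fold cross-validation, and the synthetic data are all assumptions for illustration; the post does not say which values were actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the phishing dataset.
X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid: C controls regularisation, gamma the RBF kernel width.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

# Every combination is scored with cross-validation on the training set only;
# the best one is then refitted on the whole training set (refit=True default).
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, test_accuracy)
```

Keeping the test set out of the search means the final accuracy is measured on data the tuning never saw.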

3. Random Forest Classification

The next model I wanted to try was random forest, which also gives feature importances. Again using GridSearchCV to find the best parameters and fitting the model with them, I got a very good accuracy of 97.26%.
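The same tuning pattern works for a random forest, and the fitted model exposes `feature_importances_`. The sketch below assumes scikit-learn, a made-up parameter grid, and synthetic data, so the ranking it prints says nothing about the real phishing features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the phishing dataset.
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid; the values searched in the project are not stated.
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

best_rf = search.best_estimator_
importances = best_rf.feature_importances_

# Rank feature indices from most to least important and keep the top five.
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
top5 = ranked[:5]
for index, score in top5:
    print(f"feature {index}: {score:.3f}")
```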

Random forest gave the best accuracy of the three models. An artificial neural network might also be worth trying to improve accuracy further.

FEATURE IMPORTANCES

ML Model: Phishcoop

Hosting online as a server

I used the Heroku platform (Hobby plan, provided through GitHub Education) to host this machine learning model online. I used pickle to save and load the model and served it with Flask.
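The pickle-plus-Flask setup can be sketched as follows. This is not the project's server code: the `/predict` route, the JSON shape, the 1 = phishing label encoding, and the tiny stand-in model are all hypothetical. The real service receives a URL and has to turn it into the 30 features first, a step that is left out here.

```python
import pickle

from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression

# Stand-in for the trained model. Hypothetical encoding: 1 = phishing,
# 0 = legitimate. A real server would unpickle a file saved after training.
train_X = [[-1] * 30, [1] * 30] * 10
train_y = [1, 0] * 10
model_bytes = pickle.dumps(LogisticRegression().fit(train_X, train_y))

# Load the pickled model once at start-up, not on every request.
model = pickle.loads(model_bytes)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    """Classify one feature vector sent as JSON: {"features": [30 numbers]}."""
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    if not isinstance(features, list) or len(features) != 30:
        return jsonify(error="expected 'features' as a list of 30 numbers"), 400
    label = int(model.predict([features])[0])
    return jsonify(phishing=(label == 1))


if __name__ == "__main__":
    app.run()
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.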

The idea was to expose this as a service and then call it from the Android app.

Android App

Essentially, this is the front end that calls this service. I used Android's accessibility API to capture the URLs being opened in any app.

After getting a URL, I first call the Google Safe Browsing API to check whether it is a phishing site. If it is, I show a warning dialog. Otherwise, I call the machine learning backend server and show the warning dialog if its result says the URL is a phishing site.
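The two-stage check described above boils down to a short decision function. The sketch below is written in Python purely for illustration (the app itself is an Android app), and `check_url` plus the two callables passed into it are hypothetical names standing in for the Safe Browsing lookup and the call to the ML server.

```python
from typing import Callable, Optional


def check_url(
    url: str,
    safe_browsing_flags: Callable[[str], bool],
    ml_model_flags: Callable[[str], bool],
) -> Optional[str]:
    """Return which check flagged the URL, or None if neither did."""
    # Stage 1: Safe Browsing. A hit means warn right away and skip stage 2.
    if safe_browsing_flags(url):
        return "safe_browsing"
    # Stage 2: only URLs that pass stage 1 are sent to the ML backend.
    if ml_model_flags(url):
        return "ml_model"
    return None


# Usage with stub checks in place of the real network calls.
verdict = check_url(
    "http://example.com/login",
    safe_browsing_flags=lambda u: False,
    ml_model_flags=lambda u: True,
)
if verdict is not None:
    print(f"Warning: possible phishing site (flagged by {verdict})")
```

Putting the Safe Browsing check first means the ML server is only contacted for URLs that are not already known to be bad.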

Additional Thoughts / Feelings / Stories

This was more of a prototype. It is not perfect, but hey, it works 🙌🏻. And the best thing is that I've learnt so much by working on this project 🤓