Detecting Android Malware Using Only App Permissions: A Lightweight ML Approach

πŸ‘‹ Introduction

Hey everyone! πŸ‘‹

Welcome to my capstone project, "Malware Detection Using Android App Permissions," built as part of my learning journey with NxtWave. It's a great example of how basic yet powerful machine learning concepts can solve a real-world problem without reaching for heavy or complex algorithms.

πŸ“± The Problem

Many Android apps request multiple permissions like access to your location, SMS, or microphone. While some are necessary, others may indicate malicious intent. So I asked: "Can we detect malware just by analyzing the permissions an app asks for?"

πŸš€ The Goal

Build a supervised machine learning classifier that determines whether an Android app is benign or malicious, using only the permissions it requests.

πŸ”§ Why No Advanced Models (Yet)?

You won't find XGBoost, Random Forests, or deep learning here, and that's on purpose. This project is about mastering the fundamentals of supervised learning with models like Logistic Regression, KNN, SVM, and Decision Trees, and about making smart decisions through clean code and solid evaluation.

πŸ’‘ What You'll Learn

  • How to clean and prepare real-world datasets
  • How to explore and visualize permission patterns
  • How to train and evaluate classifiers
  • The impact of hyperparameter tuning (and when not to use it!)
  • Why sometimes a simple model like Logistic Regression can outperform the rest

I hope this blog helps you understand the power of simplicity in machine learning. Feel free to drop your thoughts in the comments or ask questions; I'd love to connect!

Let's dive in πŸ‘‡


πŸ“₯ Step 1: Load and View the Dataset

We'll begin by importing the dataset directly from the provided CSV URL and taking a quick look at its structure.

import pandas as pd

# Load dataset
url = 'https://new-assets.ccbp.in/frontend/content/aiml/classical-ml/android_dataset.csv'
data = pd.read_csv(url)

# Preview first 5 rows
data.head()

Now let's check the column names, dimensions, and basic information:

# Check column names
print(data.columns)

# Check dataset shape
print("Dataset shape:", data.shape)

# Info and data types
data.info()

# Basic statistical summary
data.describe()

πŸ‘‰ Outcome:

You'll see a dataset where each row represents an Android app, each column represents a permission, and the target column Result indicates if it's benign (0) or malware (1).
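A quick way to confirm that label distribution yourself (a small sketch; it assumes the target column is named Result, as described above):

# Sanity check: how many benign vs. malware apps?
print(data['Result'].value_counts())
print(data['Result'].value_counts(normalize=True).round(3))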


🧹 Step 2: Data Cleaning

The dataset contains many permission columns with long and redundant names. We'll clean these up for easier interpretation and handling during modeling.

βœ… Rename and Standardize Column Names

import re

# Function to remove common prefixes
def clean_column(col_name):
    return re.sub(r'(android\.permission\.|com\.|me\.)', '', col_name)

# Apply to all columns except 'Result'
data.columns = [clean_column(col) if col != 'Result' else col for col in data.columns]
data.head()

πŸ” Further Cleanup to Make Column Names Unique and Clear

def ultra_clean_column(col_name):
    if col_name == 'Result':
        return 'Result'
    parts = col_name.upper().split('.')
    ignore_keywords = {
        "COM", "ANDROID", "HTC", "HUAWEI", "SONY", "GOOGLE", "SEC", "OPPO",
        "VENDING", "GMS", "EVERYTHING", "LAUNCHER", "ALARM", "DEVICE",
        "PROVIDERS", "BADGER", "PERMISSION", "FINSKY", "INSERT", "COUNT",
        "GSF", "C2DM", "MAJEUR", "ANDDOES", "SONYMOBILE", "SAMSUNG", "AMAZON"
    }
    cleaned = [word for word in parts if word not in ignore_keywords and len(word) > 2]
    return "_".join(cleaned)

def clean_and_make_unique(df):
    cols = df.columns.tolist()
    cleaned_cols = [ultra_clean_column(col) for col in cols]
    seen = {}
    unique_cols = []
    for item in cleaned_cols:
        original_item = item
        count = 1
        while item in seen:
            item = f"{original_item}_{count}"
            count += 1
        seen[item] = True
        unique_cols.append(item)
    df.columns = unique_cols
    return df

# Apply deep cleaning
data = clean_and_make_unique(data)
data.head()

🧼 Remove Missing and Duplicate Records

# Check for missing values
print("Missing values per column:\n", data.isna().sum())

# Check for duplicates
print("Duplicate rows:", data.duplicated().sum())

# Drop duplicates
data.drop_duplicates(inplace=True)
print("Shape after dropping duplicates:", data.shape)

πŸ‘‰ Outcome:

  • Cleaner, readable column names.
  • Duplicates removed (important: dataset initially had ~70% duplicates!) we need to drop it since the model won't learn anything meaningful from it.
  • No missing values to worry about.

πŸ“Š Step 3: Exploratory Data Analysis (EDA)

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Result', data=data, hue='Result', palette='viridis')
plt.title("Target Variable Distribution")
plt.xlabel("App Type (0: Benign, 1: Malware)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

  • Benign apps outnumber malware apps, indicating a class imbalance in the dataset.
permission_counts = data.drop('Result', axis=1).sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10,6))
sns.barplot(x=permission_counts.values, y=permission_counts.index, palette='magma')
plt.title("Top 10 Most Requested Permissions")
plt.xlabel("Frequency")
plt.ylabel("Permissions")
plt.show()

  • Apps mostly request internet, network, and storage access; location and vibration are less common.
top_permissions = permission_counts.index.tolist()
malware_ratio = {}

for col in top_permissions:
    if col in data.columns:
        # P(malware | app requests this permission)
        malware_ratio[col] = data[data[col] == 1]['Result'].mean()
    else:
        print(f"Warning: Column '{col}' not found after cleaning. Skipping.")

plt.figure(figsize=(10,6))
sns.barplot(x=list(malware_ratio.values()), y=list(malware_ratio.keys()), palette='coolwarm', hue=list(malware_ratio.keys()))
plt.title("Malware Probability per Top 10 Permission")
plt.xlabel("Malware Rate (P(Malware|Permission))")
plt.ylabel("Permissions")
plt.xlim(0,1)
plt.show()


perm_cols = ['INTERNET', 'ACCESS_FINE_LOCATION', 'READ_SMS', 'RECORD_AUDIO']
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

for i, col in enumerate(perm_cols):
    ax = axes[i // 2, i % 2]
    ctab = pd.crosstab(data[col], data['Result'], normalize='index')
    ctab.plot(kind='bar', stacked=True, colormap='coolwarm', ax=ax)
    ax.set_title(f"{col} vs Result")
    ax.set_xlabel(f"{col} (0=No, 1=Yes)")
    ax.set_ylabel("Proportion")
    ax.legend(title='Result', labels=['Benign', 'Malware'])

plt.tight_layout()
plt.show()

  • READ_SMS and ACCESS_FINE_LOCATION are strongly linked to malware, while RECORD_AUDIO shows a higher link to benign apps.
malware_prob = data.groupby('Result').mean().T
malware_prob.columns = ['Benign_Prob', 'Malware_Prob']
malware_prob['Diff'] = malware_prob['Malware_Prob'] - malware_prob['Benign_Prob']
top_diff = malware_prob.sort_values('Diff', ascending=False).head(10)

top_diff[['Malware_Prob', 'Benign_Prob']].plot(kind='bar', figsize=(12, 6))
plt.title("Top 10 Permissions with Highest Malware Bias")
plt.ylabel("Probability")
plt.xlabel("Permission")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

  • READ_PHONE_STATE_1, BOOT_COMPLETED, and LOCATION permissions show a strong malware bias, with very low benign association.
top4 = top_diff.index[:4].tolist()
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

for ax, col in zip(axes.flatten(), top4):
    sns.countplot(data=data, x=col, hue='Result', palette='Set2', ax=ax)
    ax.set_title(f"{col} vs Result")
    ax.set_xlabel(f"{col} (0 = No, 1 = Yes)")
    ax.set_ylabel("Count")
    ax.legend(title="Result", labels=["Benign", "Malware"])

plt.tight_layout()
plt.suptitle("Top 4 Discriminative Permissions vs Result", y=1.02, fontsize=14)
plt.show()

  • Permissions like READ_PHONE_STATE_1 and RECEIVE_BOOT_COMPLETED are strongly associated with malware, making them key discriminators.
perm_only = data.drop(columns=['Result'])
corr_matrix = perm_only.corr()

plt.figure(figsize=(15, 12))
sns.heatmap(corr_matrix, cmap='YlGnBu', linewidths=0.1)
plt.title("Correlation Heatmap Among Permissions")
plt.show()

I've walked through 8-10 key analyses here, but remember: EDA isn't just a formality; it's the compass of any data-driven solution. It uncovers hidden patterns, exposes bias, and tells the story your model needs to hear. Feel free to dig even deeper; data always has more to say.


πŸ›  Feature Engineering

We'll extract meaningful features from the existing permission data to help our machine learning models make better predictions.

βœ… 1. Total Permissions Used by Each App

# Calculate the total number of permissions used by each app
data['Total_Permissions'] = data.drop(columns='Result').sum(axis=1)

πŸ“Œ Explanation:

We're summing across all permission columns (except Result) for each app to get a single feature: how many total permissions the app requests. Malicious apps tend to request more permissions.
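To sanity-check that claim on this dataset, you can compare the average permission count per class (a quick sketch using the feature we just created):

# Do malware apps really request more permissions on average?
print(data.groupby('Result')['Total_Permissions'].agg(['mean', 'median']))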

βœ… 2. Count of Top 4 Most Discriminative Permissions

# Top 4 permissions most strongly associated with malware
top4 = ['READ_PHONE_STATE_1', 'BOOT_COMPLETED', 'RECEIVE_SMS', 'RECEIVE_BOOT_COMPLETED']
data['Top4_Permission_Count'] = data[top4].sum(axis=1)

πŸ“Œ Explanation:

We identified 4 key permissions during EDA that correlate highly with malware. This feature captures how many of those risky permissions are used per app.

βœ… 3. Suspicious Permission Flag

# Flag apps that use any known suspicious permissions
suspicious_perms = ['ACCESS_FINE_LOCATION', 'ACCESS_COARSE_LOCATION', 'RECORD_AUDIO', 'READ_SMS', 'READ_LOGS']
data['Suspicious_Flag'] = (data[suspicious_perms].sum(axis=1) > 0).astype(int)

πŸ“Œ Explanation:

This is a binary flag:

1 β†’ If an app uses any suspicious permissions

0 β†’ If it uses none

It's a quick indicator of potentially dangerous behavior.


πŸ”„ Data Preparation: Splitting Features and Target

We'll now separate our dataset into:

X β†’ All the input features (independent variables)

y β†’ The target variable (Result) that indicates whether an app is benign (0) or malware (1)

# Split features and target
X = data.drop(columns='Result')
y = data['Result']

πŸ“Œ Explanation:

We drop the Result column from X because it's what we want to predict - it becomes our y.

βœ‚οΈ Train-Test Split

Now let's split the data into training and test sets:

from sklearn.model_selection import train_test_split

# Use stratification to maintain class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

πŸ“Œ Explanation:

  • test_size=0.2 β†’ 20% of the data is reserved for testing.
  • stratify=y β†’ Ensures both sets maintain the same class distribution (malware vs benign).

πŸš€ Model Training & Evaluation

We will now train and evaluate four classification models:

  1. Logistic Regression
  2. K-Nearest Neighbors (KNN)
  3. Support Vector Machine (SVM)
  4. Decision Tree

1️⃣ Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, RocCurveDisplay

# Instantiate and train
log_reg = LogisticRegression(solver='liblinear', random_state=42)
log_reg.fit(X_train, y_train)

# Predict
y_pred_lr = log_reg.predict(X_test)
y_proba_lr = log_reg.predict_proba(X_test)[:, 1]

# Evaluate
print("Logistic Regression")
print(classification_report(y_test, y_pred_lr))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_lr))

# ROC Curve
RocCurveDisplay.from_estimator(log_reg, X_test, y_test)
plt.title("Logistic Regression - ROC Curve")
plt.show()

πŸ“Œ Explanation:

Logistic Regression is a linear model that's lightweight and interpretable.

We also check ROC-AUC to evaluate its classification ability on imbalanced data.
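We imported confusion_matrix above but never plotted it; here's one way to visualize it (a sketch using scikit-learn's ConfusionMatrixDisplay):

from sklearn.metrics import ConfusionMatrixDisplay

# See exactly where the model confuses benign and malware apps
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred_lr, display_labels=['Benign', 'Malware'], cmap='Blues'
)
plt.title("Logistic Regression - Confusion Matrix")
plt.show()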

2️⃣ K-Nearest Neighbors (KNN)

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)
y_proba_knn = knn.predict_proba(X_test)[:, 1]

print("K-Nearest Neighbors")
print(classification_report(y_test, y_pred_knn))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_knn))

RocCurveDisplay.from_estimator(knn, X_test, y_test)
plt.title("KNN - ROC Curve")
plt.show()

πŸ“Œ Explanation:

KNN is a distance-based model; it finds the k-nearest samples and predicts based on majority class.

Performance heavily depends on the choice of k.
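To see that sensitivity for yourself, here's a rough sketch that sweeps a few k values (a proper search comes later during hyperparameter tuning):

# Quick sweep: how does test accuracy change with k?
for k in [3, 5, 7, 9, 11, 15]:
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    print(f"k={k:2d}  test accuracy = {acc:.4f}")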

3️⃣ Support Vector Machine (SVM)

from sklearn.svm import SVC

svm = SVC(probability=True, kernel='rbf', random_state=42)
svm.fit(X_train, y_train)

y_pred_svm = svm.predict(X_test)
y_proba_svm = svm.predict_proba(X_test)[:, 1]

print("Support Vector Machine")
print(classification_report(y_test, y_pred_svm))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_svm))

RocCurveDisplay.from_estimator(svm, X_test, y_test)
plt.title("SVM - ROC Curve")
plt.show()

πŸ“Œ Explanation:

SVM with an RBF kernel handles non-linear patterns well.

Setting probability=True enables predict_proba, which is what we use for the ROC-AUC score and curve.
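If you'd rather avoid the extra cost of probability=True (it fits an internal Platt-scaling calibration), ROC-AUC can also be computed from the decision function. A quick sketch:

# Alternative: use the SVM's raw decision scores instead of probabilities
scores = svm.decision_function(X_test)  # signed distance from the hyperplane
print("ROC-AUC (decision_function):", roc_auc_score(y_test, scores))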

4️⃣ Decision Tree

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)
y_proba_dt = dt.predict_proba(X_test)[:, 1]

print("Decision Tree")
print(classification_report(y_test, y_pred_dt))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_dt))

RocCurveDisplay.from_estimator(dt, X_test, y_test)
plt.title("Decision Tree - ROC Curve")
plt.show()

πŸ“Œ Explanation:

Decision Trees are simple to interpret and capture non-linear relationships.

May overfit without proper pruning or depth control.
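A quick way to spot that overfitting is to compare train and test accuracy (a sketch, including a hypothetical depth-capped tree for contrast):

# A large train/test gap is a telltale sign of overfitting
print("Train accuracy:", dt.score(X_train, y_train))
print("Test accuracy: ", dt.score(X_test, y_test))

# Hypothetical fix: cap the depth and re-check the gap
dt_pruned = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
print("Pruned test accuracy:", dt_pruned.score(X_test, y_test))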

As you can see, Logistic Regression already gives a good result, but let's see whether hyperparameter tuning can squeeze out an even better model.

πŸ”§ Hyperparameter Tuning

We'll now use GridSearchCV to fine-tune each model by testing various combinations of parameters to find the best performing ones. This helps in optimizing model performance.

1️⃣ Logistic Regression - Grid Search

from sklearn.model_selection import GridSearchCV

param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

grid_lr = GridSearchCV(LogisticRegression(random_state=42), param_grid_lr, cv=5, scoring='accuracy')
grid_lr.fit(X_train, y_train)

print("Best Parameters (Logistic Regression):", grid_lr.best_params_)
print("Best Score:", grid_lr.best_score_)

πŸ“Œ Explanation:

  • C controls regularization strength (smaller values = stronger regularization).
  • penalty chooses between L1 (sparse) and L2 (ridge) regularization.
  • liblinear is used for small datasets and supports L1.

2️⃣ K-Nearest Neighbors - Grid Search

param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, scoring='accuracy')
grid_knn.fit(X_train, y_train)

print("Best Parameters (KNN):", grid_knn.best_params_)
print("Best Score:", grid_knn.best_score_)

πŸ“Œ Explanation:

  • n_neighbors: number of nearest neighbors.
  • weights: uniform (equal) or distance-weighted voting.
  • metric: distance calculation method.

3️⃣ Support Vector Machine - Grid Search

param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

grid_svm = GridSearchCV(SVC(probability=True, random_state=42), param_grid_svm, cv=5, scoring='accuracy')
grid_svm.fit(X_train, y_train)

print("Best Parameters (SVM):", grid_svm.best_params_)
print("Best Score:", grid_svm.best_score_)

πŸ“Œ Explanation:

  • C: regularization.
  • kernel: transformation applied to input.
  • gamma: influence of a single data point.

4️⃣ Decision Tree - Grid Search

param_grid_dt = {
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid_dt, cv=5, scoring='accuracy')
grid_dt.fit(X_train, y_train)

print("Best Parameters (Decision Tree):", grid_dt.best_params_)
print("Best Score:", grid_dt.best_score_)

πŸ“Œ Explanation:

  • max_depth: depth of the tree.
  • min_samples_split: minimum samples to split an internal node.
  • criterion: function to measure quality of split.

βœ… Retrain Models with Best Hyperparameters (from GridSearchCV)

We'll now use the best parameters obtained from GridSearchCV to retrain our models and evaluate them on the test set.

πŸ” Refit All Models

# Best models from GridSearch (refit=True by default, so these are
# already trained on the full training set)
model_lr  = grid_lr.best_estimator_
model_knn = grid_knn.best_estimator_
model_svm = grid_svm.best_estimator_
model_dt  = grid_dt.best_estimator_

# Explicit refit on the training data (redundant given refit=True, but harmless)
model_lr.fit(X_train, y_train)
model_knn.fit(X_train, y_train)
model_svm.fit(X_train, y_train)
model_dt.fit(X_train, y_train)

πŸ“Š Evaluate All Models

from sklearn.metrics import classification_report, roc_curve, auc
import matplotlib.pyplot as plt

models = {
    "Logistic Regression": model_lr,
    "KNN": model_knn,
    "SVM": model_svm,
    "Decision Tree": model_dt
}

plt.figure(figsize=(10, 7))
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print(f"\n{name}")
    print(classification_report(y_test, y_pred, digits=4))

    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc:.2f})")

# Plot random classifier
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

🧠 Outcome Insight:

Each model is now optimized and re-evaluated using the best hyperparameters. You'll clearly observe how well each performs on the ROC curve and through the precision/recall/f1 metrics.

From our observations, the base models performed significantly better than the hyperparameter-tuned versions, so the tuned models are not the right choice here. Let's explore alternative approaches, such as StratifiedKFold, RandomizedSearchCV, and more cross-validation folds, to see if we can push performance further.
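One way to make that comparison concrete is to score both variants side by side (a sketch; log_reg, knn, svm, and dt are the base models fitted earlier):

# Test accuracy: base models vs. GridSearchCV-tuned models
pairs = {
    "Logistic Regression": (log_reg, grid_lr.best_estimator_),
    "KNN": (knn, grid_knn.best_estimator_),
    "SVM": (svm, grid_svm.best_estimator_),
    "Decision Tree": (dt, grid_dt.best_estimator_),
}
for name, (base, tuned) in pairs.items():
    print(f"{name:20s} base={base.score(X_test, y_test):.4f}  "
          f"tuned={tuned.score(X_test, y_test):.4f}")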

πŸ” RandomizedSearchCV with StratifiedKFold Cross-Validation

Now we will perform more robust tuning using RandomizedSearchCV combined with StratifiedKFold to preserve label proportions across folds and improve model generalization.

βš™οΈ Define Cross-Validation Strategy

from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
import numpy as np

cv_strategy = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = 'accuracy'

🎯 1. Logistic Regression with RandomizedSearchCV

from sklearn.linear_model import LogisticRegression

param_dist_lr = {
    'C': np.logspace(-3, 2, 100),
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

rand_lr = RandomizedSearchCV(
    LogisticRegression(random_state=42),
    param_distributions=param_dist_lr,
    n_iter=30,
    cv=cv_strategy,
    scoring=scoring,
    random_state=42,
    n_jobs=-1
)
rand_lr.fit(X_train, y_train)

print("Best Params (Logistic Regression):", rand_lr.best_params_)

βœ… This step allows us to explore a wide range of regularization strengths (C values) using a logarithmic scale and two different penalty types, enabling more flexibility than grid search.

🎯 2. K-Nearest Neighbors (KNN) with RandomizedSearchCV

from sklearn.neighbors import KNeighborsClassifier

param_dist_knn = {
    'n_neighbors': np.arange(3, 30),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

rand_knn = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_dist_knn,
    n_iter=30,
    cv=cv_strategy,
    scoring=scoring,
    random_state=42,
    n_jobs=-1
)
rand_knn.fit(X_train, y_train)

print("Best Params (KNN):", rand_knn.best_params_)

βœ… Here, we explore different neighbor values and distance metrics to optimize how the KNN model measures similarity between apps.

🎯 3. Support Vector Machine (SVM) with RandomizedSearchCV

from sklearn.svm import SVC

param_dist_svm = {
    'C': np.logspace(-2, 2, 20),
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

rand_svm = RandomizedSearchCV(
    SVC(probability=True, random_state=42),
    param_distributions=param_dist_svm,
    n_iter=20,
    cv=cv_strategy,
    scoring=scoring,
    random_state=42,
    n_jobs=-1
)
rand_svm.fit(X_train, y_train)

print("Best Params (SVM):", rand_svm.best_params_)

βœ… This allows us to tune regularization and kernel types. SVC is sensitive to hyperparameters, so random search helps avoid exhaustive computation.

🎯 4. Decision Tree with RandomizedSearchCV

from sklearn.tree import DecisionTreeClassifier

param_dist_dt = {
    'max_depth': [None, 5, 10, 20, 50],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

rand_dt = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist_dt,
    n_iter=20,
    cv=cv_strategy,
    scoring=scoring,
    random_state=42,
    n_jobs=-1
)
rand_dt.fit(X_train, y_train)

print("Best Params (Decision Tree):", rand_dt.best_params_)

βœ… This tuning balances between underfitting and overfitting by varying tree depth and split criteria.

βœ… Fit the Best Models from RandomizedSearchCV

# RandomizedSearchCV also refits the best estimator by default,
# so these models are already trained on the full training set
model_lr  = rand_lr.best_estimator_
model_knn = rand_knn.best_estimator_
model_svm = rand_svm.best_estimator_
model_dt  = rand_dt.best_estimator_

# Explicit refit (redundant, but makes the step visible)
model_lr.fit(X_train, y_train)
model_knn.fit(X_train, y_train)
model_svm.fit(X_train, y_train)
model_dt.fit(X_train, y_train)

πŸ”§ We're using the best configurations found from RandomizedSearchCV to retrain each model on the full training data.

πŸ“Š Final Evaluation & ROC Curve Comparison

from sklearn.metrics import classification_report, roc_curve, auc
import matplotlib.pyplot as plt

models = {
    "Logistic Regression": model_lr,
    "KNN": model_knn,
    "SVM": model_svm,
    "Decision Tree": model_dt
}

plt.figure(figsize=(10, 7))
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print(f"\n{name}")
    print(classification_report(y_test, y_pred, digits=4))

    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc:.2f})")

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

πŸ“ˆ This final visualization compares model performance using the Area Under the ROC Curve (AUC). A higher AUC indicates better discrimination between benign and malicious apps.
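To complement the curves, you can also collect the headline numbers into a single table (a sketch reusing the models dict defined above):

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# One-table summary of test-set performance for the tuned models
rows = []
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "F1 (malware)": f1_score(y_test, y_pred),
        "ROC-AUC": roc_auc_score(y_test, y_prob),
    })
print(pd.DataFrame(rows).round(4).to_string(index=False))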


βœ… Final Model Selection

After evaluating all the models using RandomizedSearchCV with StratifiedKFold cross-validation, here's a quick summary of our key insights:

Logistic Regression performed the best overall:

βœ… High accuracy

βœ… High precision and recall for both classes

βœ… Lightweight and fast

βœ… Interpretable and easily deployable

While SVM showed slightly better precision in some cases, Logistic Regression provides a great balance between performance, simplicity, and interpretability - making it ideal for our malware classification use case.
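That interpretability is easy to demonstrate: the logistic regression coefficients show which permissions push a prediction toward malware (a sketch; positive weights increase the malware log-odds):

# Permissions with the largest positive weights are the strongest malware signals
coefs = pd.Series(model_lr.coef_[0], index=X_train.columns)
print("Top malware indicators:\n", coefs.sort_values(ascending=False).head(10))
print("\nTop benign indicators:\n", coefs.sort_values().head(10))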

So, we proceed with:

final_model = model_lr  # Logistic Regression as final model

πŸ’Ύ Model Deployment

πŸ’Ύ 1. Save the Final Model

We'll use joblib to save the trained logistic regression model:

import joblib

# Save the final model
joblib.dump(final_model, 'logistic_malware_model.pkl')
feature_columns = X_train.columns.tolist()
joblib.dump(feature_columns, 'feature_columns.pkl')

🧠 2. Load and Predict with New Data

Here's how to load the model and make predictions on new Android app permission data:

# Load model and features
model = joblib.load('logistic_malware_model.pkl')
features = joblib.load('feature_columns.pkl')

# Example: New app's permission vector (as a dictionary or DataFrame row)
import pandas as pd
new_app = pd.DataFrame([{
    'INTERNET': 1,
    'ACCESS_FINE_LOCATION': 1,
    'READ_SMS': 0,
    'RECORD_AUDIO': 0,
    # ... include all other columns used during training
    'Total_Permissions': 5,
    'Top4_Permission_Count': 2,
    'Suspicious_Flag': 1
}])

# Ensure correct column order
new_app = new_app.reindex(columns=features, fill_value=0)

# Predict
prediction = model.predict(new_app)
probability = model.predict_proba(new_app)[:, 1]

print("Prediction (0=Benign, 1=Malware):", prediction[0])
print("Malware Probability Score:", probability[0])

βœ… Ready for Deployment

You can now:

  • Deploy this model in a Flask API or FastAPI server
  • Use it in Android threat detection tools
  • Wrap it in a web app for interactive prediction

🏁 Conclusion

We successfully built a Supervised Learning Classification model to detect malware in Android apps using only the permissions they request.

πŸ”‘ Key Takeaways:

πŸ” Performed thorough Exploratory Data Analysis (EDA) to uncover how different app permissions correlate with malware behavior.

🧠 Trained multiple ML models - Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Decision Tree - to compare performance.

βš™οΈ Applied techniques like GridSearchCV, RandomizedSearchCV, and StratifiedKFold to fine-tune hyperparameters and improve model accuracy.

πŸ“Š Evaluated all models using classification metrics (Precision, Recall, F1-score) and ROC-AUC curves for robustness.

βœ… Finalized Logistic Regression as the best model - due to its lightweight, interpretable nature and strong performance metrics.

πŸ“Œ This model can now be integrated into mobile security tools to proactively flag suspicious apps before they reach the user.


πŸ’‘ Personal Reflection:

This project helped me revisit core concepts in machine learning and reinforced how critical foundational knowledge is. While it's tempting to always chase complex architectures like XGBoost or deep neural nets, this work reminded me that simple, interpretable models often deliver powerful results when well-tuned.

It's all about:

  • Thinking critically,
  • Iterating consistently,
  • And experimenting openly with different ML strategies.

πŸ” Sometimes, going back to basics is exactly what moves you forward.


πŸ”— Connect with Me

πŸ“Œ This project is part of my learning journey with NxtWave's CCBP 4.0 Academy.

πŸ“– Blog by Naresh B. A.

πŸ‘¨β€πŸ’» Aspiring Full Stack Developer | Passionate Machine Learning and AI Innovation

🌐 Portfolio: Naresh B A

πŸ“« Let's connect on LinkedIn | GitHub

πŸ’‘ Thanks for reading! If you found this useful, drop a like or comment; feedback keeps the learning alive.
