Detecting Android Malware Using Only App Permissions: A Lightweight ML Approach

πŸ‘‹ Introduction

Hey everyone! πŸ‘‹

Welcome to my capstone project, "Malware Detection Using Android App Permissions," built as part of my learning journey with NxtWave. It's a great example of how basic yet powerful machine learning concepts can solve a real-world problem without reaching for heavy or complex algorithms.

πŸ“± The Problem

Many Android apps request multiple permissions like access to your location, SMS, or microphone. While some are necessary, others may indicate malicious intent. So I asked: "Can we detect malware just by analyzing the permissions an app asks for?"

πŸš€ The Goal

Build a supervised machine learning classifier that determines whether an Android app is benign or malicious, using only the permissions it requests.

πŸ”§ Why No Advanced Models (Yet)?

You won't find XGBoost, Random Forests, or deep learning here, and that's on purpose. This project is about mastering the fundamentals of supervised learning with models like Logistic Regression, KNN, SVM, and Decision Trees, and about making smart decisions through clean code and solid evaluation.

πŸ’‘ What You'll Learn

  • How to clean and prepare real-world datasets
  • How to explore and visualize permission patterns
  • How to train and evaluate classifiers
  • The impact of hyperparameter tuning (and when not to use it!)
  • Why sometimes a simple model like Logistic Regression can outperform the rest

I hope this blog helps you understand the power of simplicity in machine learning. Feel free to drop your thoughts in the comments or ask questions; I'd love to connect!

Let's dive in πŸ‘‡


πŸ“₯ Step 1: Load and View the Dataset

We'll begin by importing the dataset directly from the provided CSV URL and taking a quick look at its structure.

import pandas as pd

# Load dataset
url = 'https://new-assets.ccbp.in/frontend/content/aiml/classical-ml/android_dataset.csv'
data = pd.read_csv(url)

# Preview first 5 rows
data.head()

Now let's check the column names, dimensions, and basic information:

# Check column names
print(data.columns)

# Check dataset shape
print("Dataset shape:", data.shape)

# Info and data types
data.info()

# Basic statistical summary
data.describe()

πŸ‘‰ Outcome:

You'll see a dataset where each row represents an Android app, each column represents a permission, and the target column Result indicates if it's benign (0) or malware (1).
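A quick way to confirm that label distribution yourself (a small sketch; it assumes the target column is named Result, as described above):

# Sanity check: how many benign vs. malware apps?
print(data['Result'].value_counts())
print(data['Result'].value_counts(normalize=True).round(3))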


🧹 Step 2: Data Cleaning

The dataset contains many permission columns with long and redundant names. We'll clean these up for easier interpretation and handling during modeling.

βœ… Rename and Standardize Column Names

import re

# Function to remove common prefixes
def clean_column(col_name):
    return re.sub(r'(android\.permission\.|com\.|me\.)', '', col_name)

# Apply to all columns except 'Result'
data.columns = [clean_column(col) if col != 'Result' else col for col in data.columns]
data.head()

πŸ” Further Cleanup to Make Column Names Unique and Clear

def ultra_clean_column(col_name):
    if col_name == 'Result':
        return 'Result'
    parts = col_name.upper().split('.')
    ignore_keywords = {
        "COM", "ANDROID", "HTC", "HUAWEI", "SONY", "GOOGLE", "SEC", "OPPO",
        "VENDING", "GMS", "EVERYTHING", "LAUNCHER", "ALARM", "DEVICE",
        "PROVIDERS", "BADGER", "PERMISSION", "FINSKY", "INSERT", "COUNT",
        "GSF", "C2DM", "MAJEUR", "ANDDOES", "SONYMOBILE", "SAMSUNG", "AMAZON"
    }
    cleaned = [word for word in parts if word not in ignore_keywords and len(word) > 2]
    return "_".join(cleaned)

def clean_and_make_unique(df):
    cols = df.columns.tolist()
    cleaned_cols = [ultra_clean_column(col) for col in cols]
    seen = {}
    unique_cols = []
    for item in cleaned_cols:
        original_item = item
        count = 1
        while item in seen:
            item = f"{original_item}_{count}"
            count += 1
        seen[item] = True
        unique_cols.append(item)
    df.columns = unique_cols
    return df

# Apply deep cleaning
data = clean_and_make_unique(data)
data.head()

🧼 Remove Missing and Duplicate Records

# Check for missing values
print("Missing values per column:\n", data.isna().sum())

# Check for duplicates
print("Duplicate rows:", data.duplicated().sum())

# Drop duplicates
data.drop_duplicates(inplace=True)
print("Shape after dropping duplicates:", data.shape)

πŸ‘‰ Outcome:

  • Cleaner, readable column names.
  • Duplicates removed (important: dataset initially had ~70% duplicates!) we need to drop it since the model won't learn anything meaningful from it.
  • No missing values to worry about.

πŸ“Š Step 3: Exploratory Data Analysis (EDA)

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Result', data=data, hue='Result', palette='viridis')
plt.title("Target Variable Distribution")
plt.xlabel("App Type (0: Benign, 1: Malware)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

  • Benign apps outnumber malware apps, indicating a class imbalance in the dataset.
permission_counts = data.drop('Result', axis=1).sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10,6))
sns.barplot(x=permission_counts.values, y=permission_counts.index, palette='magma')
plt.title("Top 10 Most Requested Permissions")
plt.xlabel("Frequency")
plt.ylabel("Permissions")
plt.show()

  • Apps mostly request internet, network, and storage access; location and vibration are less common.
top_permissions = permission_counts.index.tolist()
malware_ratio = {}

for col in top_permissions:
    if col in data.columns:
        # P(malware | app requests this permission)
        malware_ratio[col] = data[data[col] == 1]['Result'].mean()
    else:
        print(f"Warning: Column '{col}' not found after cleaning. Skipping.")

plt.figure(figsize=(10,6))
sns.barplot(x=list(malware_ratio.values()), y=list(malware_ratio.keys()), palette='coolwarm', hue=list(malware_ratio.keys()))
plt.title("Malware Probability per Top 10 Permission")
plt.xlabel("Malware Rate (P(Malware|Permission))")
plt.ylabel("Permissions")
plt.xlim(0,1)
plt.show()


perm_cols = ['INTERNET', 'ACCESS_FINE_LOCATION', 'READ_SMS', 'RECORD_AUDIO']
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

for i, col in enumerate(perm_cols):
    ax = axes[i // 2, i % 2]
    ctab = pd.crosstab(data[col], data['Result'], normalize='index')
    ctab.plot(kind='bar', stacked=True, colormap='coolwarm', ax=ax)
    ax.set_title(f"{col} vs Result")
    ax.set_xlabel(f"{col} (0=No, 1=Yes)")
    ax.set_ylabel("Proportion")
    ax.legend(title='Result', labels=['Benign', 'Malware'])

plt.tight_layout()
plt.show()

  • READ_SMS and ACCESS_FINE_LOCATION are strongly linked to malware, while RECORD_AUDIO shows a higher link to benign apps.
malware_prob = data.groupby('Result').mean().T
malware_prob.columns = ['Benign_Prob', 'Malware_Prob']
malware_prob['Diff'] = malware_prob['Malware_Prob'] - malware_prob['Benign_Prob']
top_diff = malware_prob.sort_values('Diff', ascending=False).head(10)

top_diff[['Malware_Prob', 'Benign_Prob']].plot(kind='bar', figsize=(12, 6))
plt.title("Top 10 Permissions with Highest Malware Bias")
plt.ylabel("Probability")
plt.xlabel("Permission")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

  • READ_PHONE_STATE_1, BOOT_COMPLETED, and LOCATION permissions show a strong malware bias, with very low benign association.
top4 = top_diff.index[:4].tolist()
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

for ax, col in zip(axes.flatten(), top4):
    sns.countplot(data=data, x=col, hue='Result', palette='Set2', ax=ax)
    ax.set_title(f"{col} vs Result")
    ax.set_xlabel(f"{col} (0 = No, 1 = Yes)")
    ax.set_ylabel("Count")
    ax.legend(title="Result", labels=["Benign", "Malware"])

plt.tight_layout()
plt.suptitle("Top 4 Discriminative Permissions vs Result", y=1.02, fontsize=14)
plt.show()

  • Permissions like READ_PHONE_STATE_1 and RECEIVE_BOOT_COMPLETED are strongly associated with malware, making them key discriminators.
perm_only = data.drop(columns=['Result'])
corr_matrix = perm_only.corr()

plt.figure(figsize=(15, 12))
sns.heatmap(corr_matrix, cmap='YlGnBu', linewidths=0.1)
plt.title("Correlation Heatmap Among Permissions")
plt.show()

I've walked through 8-10 key analyses here, but remember: EDA isn't just a formality; it's the compass of any data-driven solution. It uncovers hidden patterns, exposes bias, and tells the story your model needs to hear. Feel free to dig even deeper; data always has more to say.


πŸ›  Feature Engineering

We'll extract meaningful features from the existing permission data to help our machine learning models make better predictions.

βœ… 1. Total Permissions Used by Each App

# Calculate the total number of permissions used by each app
data['Total_Permissions'] = data.drop(columns='Result').sum(axis=1)

πŸ“Œ Explanation:

We're summing across all permission columns (except Result) for each app to get a single feature: how many total permissions the app requests. Malicious apps tend to request more permissions.
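To sanity-check that claim on this dataset, you can compare the average permission count per class (a quick sketch using the feature we just created):

# Do malware apps really request more permissions on average?
print(data.groupby('Result')['Total_Permissions'].agg(['mean', 'median']))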

βœ… 2. Count of Top 4 Most Discriminative Permissions

# Top 4 permissions most strongly associated with malware
top4 = ['READ_PHONE_STATE_1', 'BOOT_COMPLETED', 'RECEIVE_SMS', 'RECEIVE_BOOT_COMPLETED']
data['Top4_Permission_Count'] = data[top4].sum(axis=1)

πŸ“Œ Explanation:

We identified 4 key permissions during EDA that correlate highly with malware. This feature captures how many of those risky permissions are used per app.

βœ… 3. Suspicious Permission Flag

# Flag apps that use any known suspicious permissions
suspicious_perms = ['ACCESS_FINE_LOCATION', 'ACCESS_COARSE_LOCATION', 'RECORD_AUDIO', 'READ_SMS', 'READ_LOGS']
data['Suspicious_Flag'] = (data[suspicious_perms].sum(axis=1) > 0).astype(int)

πŸ“Œ Explanation:

This is a binary flag:

1 β†’ If an app uses any suspicious permissions

0 β†’ If it uses none

It's a quick indicator of potentially dangerous behavior.


πŸ”„ Data Preparation: Splitting Features and Target

We'll now separate our dataset into:

X β†’ All the input features (independent variables)

y β†’ The target variable (Result) that indicates whether an app is benign (0) or malware (1)

# Split features and target
X = data.drop(columns='Result')
y = data['Result']

πŸ“Œ Explanation:

We drop the Result column from X because it's what we want to predict - it becomes our y.

βœ‚οΈ Train-Test Split

Now let's split the data into training and test sets:

from sklearn.model_selection import train_test_split

# Use stratification to maintain class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

πŸ“Œ Explanation:

  • test_size=0.2 β†’ 20% of the data is reserved for testing.
  • stratify=y β†’ Ensures both sets maintain the same class distribution (malware vs benign).

πŸš€ Model Training & Evaluation

We will now train and evaluate four classification models:

  1. Logistic Regression
  2. K-Nearest Neighbors (KNN)
  3. Support Vector Machine (SVM)
  4. Decision Tree

1️⃣ Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, RocCurveDisplay

# Instantiate and train
log_reg = LogisticRegression(solver='liblinear', random_state=42)
log_reg.fit(X_train, y_train)

# Predict
y_pred_lr = log_reg.predict(X_test)
y_proba_lr = log_reg.predict_proba(X_test)[:, 1]

# Evaluate
print("Logistic Regression")
print(classification_report(y_test, y_pred_lr))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_lr))

# ROC Curve
RocCurveDisplay.from_estimator(log_reg, X_test, y_test)
plt.title("Logistic Regression - ROC Curve")
plt.show()

πŸ“Œ Explanation:

Logistic Regression is a linear model that's lightweight and interpretable.

We also check ROC-AUC to evaluate its classification ability on imbalanced data.
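We imported confusion_matrix above but never plotted it; here's one way to visualize it (a sketch using scikit-learn's ConfusionMatrixDisplay):

from sklearn.metrics import ConfusionMatrixDisplay

# See exactly where the model confuses benign and malware apps
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred_lr, display_labels=['Benign', 'Malware'], cmap='Blues'
)
plt.title("Logistic Regression - Confusion Matrix")
plt.show()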

2️⃣ K-Nearest Neighbors (KNN)

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)
y_proba_knn = knn.predict_proba(X_test)[:, 1]

print("K-Nearest Neighbors")
print(classification_report(y_test, y_pred_knn))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_knn))

RocCurveDisplay.from_estimator(knn, X_test, y_test)
plt.title("KNN - ROC Curve")
plt.show()

πŸ“Œ Explanation:

KNN is a distance-based model; it finds the k-nearest samples and predicts based on majority class.

Performance heavily depends on the choice of k.
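To see that sensitivity for yourself, here's a rough sketch that sweeps a few k values (a proper search comes later during hyperparameter tuning):

# Quick sweep: how does test accuracy change with k?
for k in [3, 5, 7, 9, 11, 15]:
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    print(f"k={k:2d}  test accuracy = {acc:.4f}")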

3️⃣ Support Vector Machine (SVM)

from sklearn.svm import SVC

svm = SVC(probability=True, kernel='rbf', random_state=42)
svm.fit(X_train, y_train)

y_pred_svm = svm.predict(X_test)
y_proba_svm = svm.predict_proba(X_test)[:, 1]

print("Support Vector Machine")
print(classification_report(y_test, y_pred_svm))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_svm))

RocCurveDisplay.from_estimator(svm, X_test, y_test)
plt.title("SVM - ROC Curve")
plt.show()

πŸ“Œ Explanation:

SVM with an RBF kernel handles non-linear patterns well.

Setting probability=True enables predict_proba, which is what we use for the ROC-AUC score and curve.
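If you'd rather avoid the extra cost of probability=True (it fits an internal Platt-scaling calibration), ROC-AUC can also be computed from the decision function. A quick sketch:

# Alternative: use the SVM's raw decision scores instead of probabilities
scores = svm.decision_function(X_test)  # signed distance from the hyperplane
print("ROC-AUC (decision_function):", roc_auc_score(y_test, scores))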

4️⃣ Decision Tree

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)
y_proba_dt = dt.predict_proba(X_test)[:, 1]

print("Decision Tree")
print(classification_report(y_test, y_pred_dt))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_dt))

RocCurveDisplay.from_estimator(dt, X_test, y_test)
plt.title("Decision Tree - ROC Curve")
plt.show()

πŸ“Œ Explanation:

Decision Trees are simple to interpret and capture non-linear relationships.

May overfit without proper pruning or depth control.
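A quick way to spot that overfitting is to compare train and test accuracy (a sketch, including a hypothetical depth-capped tree for contrast):

# A large train/test gap is a telltale sign of overfitting
print("Train accuracy:", dt.score(X_train, y_train))
print("Test accuracy: ", dt.score(X_test, y_test))

# Hypothetical fix: cap the depth and re-check the gap
dt_pruned = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
print("Pruned test accuracy:", dt_pruned.score(X_test, y_test))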

As you can see, Logistic Regression already gives a good result, but let's see whether hyperparameter tuning can squeeze out an even better model.

πŸ”§ Hyperparameter Tuning

We'll now use GridSearchCV to fine-tune each model by testing various combinations of parameters to find the best performing ones. This helps in optimizing model performance.

1️⃣ Logistic Regression - Grid Search

from sklearn.model_selection import GridSearchCV

param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

grid_lr = GridSearchCV(LogisticRegression(random_state=42), param_grid_lr, cv=5, scoring='accuracy')
grid_lr.fit(X_train, y_train)

print("Best Parameters (Logistic Regression):", grid_lr.best_params_)
print("Best Score:", grid_lr.best_score_)

πŸ“Œ Explanation:

  • C controls regularization strength (smaller values = stronger regularization).
  • penalty chooses between L1 (sparse) and L2 (ridge) regularization.
  • liblinear is used for small datasets and supports L1.

2️⃣ K-Nearest Neighbors - Grid Search

param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, scoring='accuracy')
grid_knn.fit(X_train, y_train)

print("Best Parameters (KNN):", grid_knn.best_params_)
print("Best Score:", grid_knn.best_score_)

πŸ“Œ Explanation:

  • n_neighbors: number of nearest neighbors.
  • weights: uniform (equal) or distance-weighted voting.
  • metric: distance calculation method.

3️⃣ Support Vector Machine - Grid Search

param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

grid_svm = GridSearchCV(SVC(probability=True, random_state=42), param_grid_svm, cv=5, scoring='accuracy')
grid_svm.fit(X_train, y_train)

print("Best Parameters (SVM):", grid_svm.best_params_)
print("Best Score:", grid_svm.best_score_)

πŸ“Œ Explanation:

  • C: regularization.
  • kernel: transformation applied to input.
  • gamma: influence of a single data point.

4️⃣ Decision Tree - Grid Search

param_grid_dt = {
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid_dt, cv=5, scoring='accuracy')
grid_dt.fit(X_train, y_train)

print("Best Parameters (Decision Tree):", grid_dt.best_params_)
print("Best Score:", grid_dt.best_score_)

πŸ“Œ Explanation:

  • max_depth: depth of the tree.
  • min_samples_split: minimum samples to split an internal node.
  • criterion: function to measure quality of split.

βœ… Retrain Models with Best Hyperparameters (from GridSearchCV)

We'll now use the best parameters obtained from GridSearchCV to retrain our models and evaluate them on the test set.

πŸ” Refit All Models

# Best models from GridSearch (refit=True by default, so these are
# already trained on the full training set)
model_lr  = grid_lr.best_estimator_
model_knn = grid_knn.best_estimator_
model_svm = grid_svm.best_estimator_
model_dt  = grid_dt.best_estimator_

# Explicit refit on the training data (redundant given refit=True, but harmless)
model_lr.fit(X_train, y_train)
model_knn.fit(X_train, y_train)
model_svm.fit(X_train, y_train)
model_dt.fit(X_train, y_train)

πŸ“Š Evaluate All Models

from sklearn.metrics import classification_report, roc_curve, auc
import matplotlib.pyplot as plt

models = {
    "Logistic Regression": model_lr,
    "KNN": model_knn,
    "SVM": model_svm,
    "Decision Tree": model_dt
}

plt.figure(figsize=(10, 7))
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print(f"\n{name}")
    print(classification_report(y_test, y_pred, digits=4))

    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc:.2f})")

# Plot random classifier
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

🧠 Outcome Insight:

Each model is now optimized and re-evaluated using the best hyperparameters. You'll clearly observe how well each performs on the ROC curve and through the precision/recall/f1 metrics.

From our observations, the base models performed significantly better than the hyperparameter-tuned versions, so the tuned models are not the right choice here. Let's explore alternative approaches, such as StratifiedKFold, RandomizedSearchCV, and more cross-validation folds, to see if we can push performance further.
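One way to make that comparison concrete is to score both variants side by side (a sketch; log_reg, knn, svm, and dt are the base models fitted earlier):

# Test accuracy: base models vs. GridSearchCV-tuned models
pairs = {
    "Logistic Regression": (log_reg, grid_lr.best_estimator_),
    "KNN": (knn, grid_knn.best_estimator_),
    "SVM": (svm, grid_svm.best_estimator_),
    "Decision Tree": (dt, grid_dt.best_estimator_),
}
for name, (base, tuned) in pairs.items():
    print(f"{name:20s} base={base.score(X_test, y_test):.4f}  "
          f"tuned={tuned.score(X_test, y_test):.4f}")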

πŸ” RandomizedSearchCV with StratifiedKFold Cross-Validation

Now we will perform more robust tuning using RandomizedSearchCV combined with StratifiedKFold to preserve label proportions across folds and improve model generalization.

βš™οΈ Define Cross-Validation Strategy

from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
import numpy as np

cv_strategy = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = 'accuracy'

🎯 1. Logistic Regression with RandomizedSearchCV

from sklearn.linear_model import LogisticRegression

param_dist_lr = {
    'C': np.logspace(-3, 2, 100),
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

rand_lr = RandomizedSearchCV(
    LogisticRegression(random_state=42),
    param_distributions=param_dist_lr,
    n_iter=30,
    cv=cv_strategy,
    scoring=scoring,
    random_state=42,
    n_jobs=-1
)
rand_lr.fit(X_train, y_train)

print("Best Params (Logistic Regression):", rand_lr.best_params_)

βœ… This step allows us to explore a wide range of regularization strengths (C values) using a logarithmic scale and two different penalty types, enabling more flexibility than grid search.

🎯 2. K-Nearest Neighbors (KNN) with RandomizedSearchCV

from sklearn.neighbors import KNeighborsClassifier

param_dist_knn = {
    'n_neighbors': np.arange(3, 30),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

rand_knn = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_dist_knn,
    n_iter=30,
    cv=cv_strategy,
    scoring=scoring,
    random_state=42,
    n_jobs=-1
)
rand_knn.fit(X_train, y_train)

print("Best Params (KNN):", rand_knn.best_params_)

βœ… Here, we explore different neighbor values and distance metrics to optimize how the KNN model measures similarity between apps.

🎯 3. Support Vector Machine (SVM) with RandomizedSearchCV

from sklearn.svm import SVC

param_dist_svm = {
    'C': np.logspace(-2, 2, 20),
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

rand_svm = RandomizedSearchCV(
    SVC(probability=True, random_state=42),
    param_distributions=param_dist_svm,
    n_iter=20,
    cv=cv_strategy,
    scoring=scoring,
    random_state=42,
    n_jobs=-1
)
rand_svm.fit(X_train, y_train)

print("Best Params (SVM):", rand_svm.best_params_)

βœ… This allows us to tune regularization and kernel types. SVC is sensitive to hyperparameters, so random search helps avoid exhaustive computation.

🎯 4. Decision Tree with RandomizedSearchCV

from sklearn.tree import DecisionTreeClassifier

param_dist_dt = {
    'max_depth': [None, 5, 10, 20, 50],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

rand_dt = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist_dt,
    n_iter=20,
    cv=cv_strategy,
    scoring=scoring,
    random_state=42,
    n_jobs=-1
)
rand_dt.fit(X_train, y_train)

print("Best Params (Decision Tree):", rand_dt.best_params_)

βœ… This tuning balances between underfitting and overfitting by varying tree depth and split criteria.

βœ… Fit the Best Models from RandomizedSearchCV

# RandomizedSearchCV also refits the best estimator by default,
# so these models are already trained on the full training set
model_lr  = rand_lr.best_estimator_
model_knn = rand_knn.best_estimator_
model_svm = rand_svm.best_estimator_
model_dt  = rand_dt.best_estimator_

# Explicit refit (redundant, but makes the step visible)
model_lr.fit(X_train, y_train)
model_knn.fit(X_train, y_train)
model_svm.fit(X_train, y_train)
model_dt.fit(X_train, y_train)

πŸ”§ We're using the best configurations found from RandomizedSearchCV to retrain each model on the full training data.

πŸ“Š Final Evaluation & ROC Curve Comparison

from sklearn.metrics import classification_report, roc_curve, auc
import matplotlib.pyplot as plt

models = {
    "Logistic Regression": model_lr,
    "KNN": model_knn,
    "SVM": model_svm,
    "Decision Tree": model_dt
}

plt.figure(figsize=(10, 7))
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print(f"\n{name}")
    print(classification_report(y_test, y_pred, digits=4))

    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc:.2f})")

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

πŸ“ˆ This final visualization compares model performance using the Area Under the ROC Curve (AUC). A higher AUC indicates better discrimination between benign and malicious apps.
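To complement the curves, you can also collect the headline numbers into a single table (a sketch reusing the models dict defined above):

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# One-table summary of test-set performance for the tuned models
rows = []
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "F1 (malware)": f1_score(y_test, y_pred),
        "ROC-AUC": roc_auc_score(y_test, y_prob),
    })
print(pd.DataFrame(rows).round(4).to_string(index=False))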


βœ… Final Model Selection

After evaluating all the models using RandomizedSearchCV with StratifiedKFold cross-validation, here's a quick summary of our key insights:

Logistic Regression performed the best overall:

βœ… High accuracy

βœ… High precision and recall for both classes

βœ… Lightweight and fast

βœ… Interpretable and easily deployable

While SVM showed slightly better precision in some cases, Logistic Regression provides a great balance between performance, simplicity, and interpretability - making it ideal for our malware classification use case.
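That interpretability is easy to demonstrate: the logistic regression coefficients show which permissions push a prediction toward malware (a sketch; positive weights increase the malware log-odds):

# Permissions with the largest positive weights are the strongest malware signals
coefs = pd.Series(model_lr.coef_[0], index=X_train.columns)
print("Top malware indicators:\n", coefs.sort_values(ascending=False).head(10))
print("\nTop benign indicators:\n", coefs.sort_values().head(10))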

So, we proceed with:

final_model = model_lr  # Logistic Regression as final model

πŸ’Ύ Model Deployment

πŸ’Ύ 1. Save the Final Model

We'll use joblib to save the trained logistic regression model:

import joblib

# Save the final model
joblib.dump(final_model, 'logistic_malware_model.pkl')
feature_columns = X_train.columns.tolist()
joblib.dump(feature_columns, 'feature_columns.pkl')

🧠 2. Load and Predict with New Data

Here's how to load the model and make predictions on new Android app permission data:

# Load model and features
model = joblib.load('logistic_malware_model.pkl')
features = joblib.load('feature_columns.pkl')

# Example: New app's permission vector (as a dictionary or DataFrame row)
import pandas as pd
new_app = pd.DataFrame([{
    'INTERNET': 1,
    'ACCESS_FINE_LOCATION': 1,
    'READ_SMS': 0,
    'RECORD_AUDIO': 0,
    # ... include all other columns used during training
    'Total_Permissions': 5,
    'Top4_Permission_Count': 2,
    'Suspicious_Flag': 1
}])

# Ensure correct column order
new_app = new_app.reindex(columns=features, fill_value=0)

# Predict
prediction = model.predict(new_app)
probability = model.predict_proba(new_app)[:, 1]

print("Prediction (0=Benign, 1=Malware):", prediction[0])
print("Malware Probability Score:", probability[0])

βœ… Ready for Deployment

You can now:

  • Deploy this model in a Flask API or FastAPI server
  • Use it in Android threat detection tools
  • Wrap it in a web app for interactive prediction

🏁 Conclusion

We successfully built a Supervised Learning Classification model to detect malware in Android apps using only the permissions they request.

πŸ”‘ Key Takeaways:

πŸ” Performed thorough Exploratory Data Analysis (EDA) to uncover how different app permissions correlate with malware behavior.

🧠 Trained multiple ML models - Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Decision Tree - to compare performance.

βš™οΈ Applied techniques like GridSearchCV, RandomizedSearchCV, and StratifiedKFold to fine-tune hyperparameters and improve model accuracy.

πŸ“Š Evaluated all models using classification metrics (Precision, Recall, F1-score) and ROC-AUC curves for robustness.

βœ… Finalized Logistic Regression as the best model - due to its lightweight, interpretable nature and strong performance metrics.

πŸ“Œ This model can now be integrated into mobile security tools to proactively flag suspicious apps before they reach the user.


πŸ’‘ Personal Reflection:

This project helped me revisit core concepts in machine learning and reinforced how critical foundational knowledge is. While it's tempting to always chase complex architectures like XGBoost or deep neural nets, this work reminded me that simple, interpretable models often deliver powerful results when well-tuned.

It's all about:

  • Thinking critically,
  • Iterating consistently,
  • And experimenting openly with different ML strategies.

πŸ” Sometimes, going back to basics is exactly what moves you forward.


πŸ”— Connect with Me

πŸ“Œ This project is part of my learning journey with NxtWave's CCBP 4.0 Academy.

πŸ“– Blog by Naresh B. A.

πŸ‘¨β€πŸ’» Aspiring Full Stack Developer | Passionate Machine Learning and AI Innovation

🌐 Portfolio: Naresh B A

πŸ“« Let's connect on LinkedIn | GitHub

πŸ’‘ Thanks for reading! If you found this useful, drop a like or comment; feedback keeps the learning alive.
