Introduction
Hey everyone!
Welcome to my capstone project "Malware Detection Using Android App Permissions" built as part of my learning journey with NxtWave. This project is a great example of how we can apply basic yet powerful machine learning concepts to solve real-world problems without diving into heavy or complex algorithms.
The Problem?
Many Android apps request multiple permissions like access to your location, SMS, or microphone. While some are necessary, others may indicate malicious intent. So I asked: "Can we detect malware just by analyzing the permissions an app asks for?"
The Goal
Build a supervised machine learning classifier that determines whether an Android app is benign or malicious, using only the permissions it requests.
Why No Advanced Models (Yet)?
You won't find XGBoost, Random Forests, or deep learning here, and that's on purpose. This project is about mastering the fundamentals of supervised learning, using models like Logistic Regression, KNN, SVM, and Decision Trees, and making smart decisions with clean code and solid evaluation.
What You'll Learn
- How to clean and prepare real-world datasets
- How to explore and visualize permission patterns
- How to train and evaluate classifiers
- The impact of hyperparameter tuning (and when not to use it!)
- Why sometimes a simple model like Logistic Regression can outperform the rest
I hope this blog helps you understand the power of simplicity in machine learning. Feel free to drop your thoughts in the comments or ask questions; I'd love to connect!
Let's dive in!
Step 1: Load and View the Dataset
We'll begin by importing the dataset directly from the provided CSV URL and taking a quick look at its structure.
import pandas as pd
# Load dataset
url = 'https://new-assets.ccbp.in/frontend/content/aiml/classical-ml/android_dataset.csv'
data = pd.read_csv(url)
# Preview first 5 rows
data.head()
Now let's check the column names, dimensions, and basic information:
# Check column names
print(data.columns)
# Check dataset shape
print("Dataset shape:", data.shape)
# Info and data types
data.info()
# Basic statistical summary
data.describe()
Outcome:
You'll see a dataset where each row represents an Android app, each column represents a permission, and the target column Result indicates if it's benign (0) or malware (1).
Step 2: Data Cleaning
The dataset contains many permission columns with long and redundant names. We'll clean these up for easier interpretation and handling during modeling.
Rename and Standardize Column Names
import re
# Function to remove common prefixes
def clean_column(col_name):
return re.sub(r'(android\.permission\.|com\.|me\.)', '', col_name)
# Apply to all columns except 'Result'
data.columns = [clean_column(col) if col != 'Result' else col for col in data.columns]
data.head()
Further Cleanup to Make Column Names Unique and Clear
def ultra_clean_column(col_name):
if col_name == 'Result':
return 'Result'
parts = col_name.upper().split('.')
ignore_keywords = {
"COM", "ANDROID", "HTC", "HUAWEI", "SONY", "GOOGLE", "SEC", "OPPO",
"VENDING", "GMS", "EVERYTHING", "LAUNCHER", "ALARM", "DEVICE",
"PROVIDERS", "BADGER", "PERMISSION", "FINSKY", "INSERT", "COUNT",
"GSF", "C2DM", "MAJEUR", "ANDDOES", "SONYMOBILE", "SAMSUNG", "AMAZON"
}
cleaned = [word for word in parts if word not in ignore_keywords and len(word) > 2]
return "_".join(cleaned)
def clean_and_make_unique(df):
cols = df.columns.tolist()
cleaned_cols = [ultra_clean_column(col) for col in cols]
seen = {}
unique_cols = []
for item in cleaned_cols:
original_item = item
count = 1
while item in seen:
item = f"{original_item}_{count}"
count += 1
seen[item] = True
unique_cols.append(item)
df.columns = unique_cols
return df
# Apply deep cleaning
data = clean_and_make_unique(data)
data.head()
Remove Missing and Duplicate Records
# Check for missing values
print("Missing values per column:\n", data.isna().sum())
# Check for duplicates
print("Duplicate rows:", data.duplicated().sum())
# Drop duplicates
data.drop_duplicates(inplace=True)
print("Shape after dropping duplicates:", data.shape)
Outcome:
- Cleaner, readable column names.
- Duplicates removed (important: the dataset initially had ~70% duplicate rows). We drop them because the model learns nothing new from repeated records, and identical rows shared between the train and test splits would leak information.
- No missing values to worry about.
Step 3: Exploratory Data Analysis (EDA)
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='Result', data=data, hue='Result', palette='viridis')
plt.title("Target Variable Distribution")
plt.xlabel("App Type (0: Benign, 1: Malware)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
- Benign apps outnumber malware apps, indicating a class imbalance in the dataset.
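To put a number on that imbalance, here's a quick check (a small sketch; it just reuses the DataFrame loaded above):
# Exact class proportions behind the countplot above
class_share = data['Result'].value_counts(normalize=True)
print(class_share.rename({0: 'Benign', 1: 'Malware'}))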
permission_counts = data.drop('Result', axis=1).sum().sort_values(ascending=False).head(10)
plt.figure(figsize=(10,6))
sns.barplot(x=permission_counts.values, y=permission_counts.index, palette='magma')
plt.title("Top 10 Most Requested Permissions")
plt.xlabel("Frequency")
plt.ylabel("Permissions")
plt.show()
- Apps mostly request internet, network, and storage access; location and vibration are less common.
top_permissions = permission_counts.index.tolist()
malware_ratio = {}
for col in top_permissions:
    if col in data.columns:
        # P(malware | app requests this permission)
        malware_ratio[col] = data[data[col] == 1]['Result'].mean()
    else:
        print(f"Warning: Column '{col}' not found after cleaning. Skipping.")
plt.figure(figsize=(10,6))
sns.barplot(x=list(malware_ratio.values()), y=list(malware_ratio.keys()), palette='coolwarm', hue=list(malware_ratio.keys()))
plt.title("Malware Probability per Top 10 Permission")
plt.xlabel("Malware Rate (P(Malware|Permission))")
plt.ylabel("Permissions")
plt.xlim(0,1)
plt.show()
perm_cols = ['INTERNET', 'ACCESS_FINE_LOCATION', 'READ_SMS', 'RECORD_AUDIO']
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, col in enumerate(perm_cols):
ax = axes[i // 2, i % 2]
ctab = pd.crosstab(data[col], data['Result'], normalize='index')
ctab.plot(kind='bar', stacked=True, colormap='coolwarm', ax=ax)
ax.set_title(f"{col} vs Result")
ax.set_xlabel(f"{col} (0=No, 1=Yes)")
ax.set_ylabel("Proportion")
ax.legend(title='Result', labels=['Benign', 'Malware'])
plt.tight_layout()
plt.show()
- READ_SMS and ACCESS_FINE_LOCATION are strongly linked to malware, while RECORD_AUDIO shows a higher link to benign apps.
malware_prob = data.groupby('Result').mean().T
malware_prob.columns = ['Benign_Prob', 'Malware_Prob']
malware_prob['Diff'] = malware_prob['Malware_Prob'] - malware_prob['Benign_Prob']
top_diff = malware_prob.sort_values('Diff', ascending=False).head(10)
top_diff[['Malware_Prob', 'Benign_Prob']].plot(kind='bar', figsize=(12, 6))
plt.title("Top 10 Permissions with Highest Malware Bias")
plt.ylabel("Probability")
plt.xlabel("Permission")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
- READ_PHONE_STATE_1, BOOT_COMPLETED, and LOCATION permissions show a strong malware bias, with very low benign association.
top4 = top_diff.index[:4].tolist()
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.flatten(), top4):
sns.countplot(data=data, x=col, hue='Result', palette='Set2', ax=ax)
ax.set_title(f"{col} vs Result")
ax.set_xlabel(f"{col} (0 = No, 1 = Yes)")
ax.set_ylabel("Count")
ax.legend(title="Result", labels=["Benign", "Malware"])
plt.tight_layout()
plt.suptitle("Top 4 Discriminative Permissions vs Result", y=1.02, fontsize=14)
plt.show()
- Permissions like READ_PHONE_STATE_1 and RECEIVE_BOOT_COMPLETED are strongly associated with malware, making them key discriminators.
perm_only = data.drop(columns=['Result'])
corr_matrix = perm_only.corr()
plt.figure(figsize=(15, 12))
sns.heatmap(corr_matrix, cmap='YlGnBu', linewidths=0.1)
plt.title("Correlation Heatmap Among Permissions")
plt.show()
I've performed 8-10 key analyses, but remember: EDA isn't just a formality, it's the compass of any data-driven solution. It uncovers hidden patterns, exposes bias, and tells the story your model needs to hear. Feel free to explore even deeper; data always has more to say.
Feature Engineering
We'll extract meaningful features from the existing permission data to help our machine learning models make better predictions.
1. Total Permissions Used by Each App
# Calculate the total number of permissions used by each app
data['Total_Permissions'] = data.drop(columns='Result').sum(axis=1)
Explanation:
We're summing across all permission columns (except Result) for each app to get a single feature: how many total permissions the app requests. Malicious apps tend to request more permissions.
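As a quick sanity check of that claim (a sketch using the column we just created), you can compare the average permission count per class:
# Average number of requested permissions, split by class (0 = benign, 1 = malware)
print(data.groupby('Result')['Total_Permissions'].mean())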
2. Count of Top 4 Most Discriminative Permissions
# Top 4 permissions most strongly associated with malware
top4 = ['READ_PHONE_STATE_1', 'BOOT_COMPLETED', 'RECEIVE_SMS', 'RECEIVE_BOOT_COMPLETED']
data['Top4_Permission_Count'] = data[top4].sum(axis=1)
Explanation:
We identified 4 key permissions during EDA that correlate highly with malware. This feature captures how many of those risky permissions are used per app.
3. Suspicious Permission Flag
# Flag apps that use any known suspicious permissions
suspicious_perms = ['ACCESS_FINE_LOCATION', 'ACCESS_COARSE_LOCATION', 'RECORD_AUDIO', 'READ_SMS', 'READ_LOGS']
data['Suspicious_Flag'] = data[suspicious_perms].sum(axis=1).apply(lambda x: 1 if x > 0 else 0)
Explanation:
This is a binary flag:
- 1 if the app uses any of the listed suspicious permissions
- 0 if it uses none
It's a quick indicator of potentially dangerous behavior.
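To see whether the flag actually carries signal (a minimal sketch, assuming the column was created as above), compare the malware rate for flagged vs. unflagged apps:
# Malware rate among apps with vs. without any suspicious permission
print(data.groupby('Suspicious_Flag')['Result'].mean())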
Data Preparation: Splitting Features and Target
We'll now separate our dataset into:
- X: all the input features (independent variables)
- y: the target variable (Result), which indicates whether an app is benign (0) or malware (1)
# Split features and target
X = data.drop(columns='Result')
y = data['Result']
Explanation:
We drop the Result column from X because it's what we want to predict - it becomes our y.
Train-Test Split
Now let's split the data into training and test sets:
from sklearn.model_selection import train_test_split
# Use stratification to maintain class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
Explanation:
- test_size=0.2: 20% of the data is reserved for testing.
- stratify=y: ensures both sets maintain the same class distribution (malware vs. benign).
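You can verify that the stratified split worked as expected with a quick comparison of class proportions (a small sketch; nothing new is imported here):
# Class balance should be nearly identical in the train and test sets
print("Train:", y_train.value_counts(normalize=True).round(3).to_dict())
print("Test :", y_test.value_counts(normalize=True).round(3).to_dict())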
Model Training & Evaluation
We will now train and evaluate four classification models:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Decision Tree
1. Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, RocCurveDisplay
# Instantiate and train
log_reg = LogisticRegression(solver='liblinear', random_state=42)
log_reg.fit(X_train, y_train)
# Predict
y_pred_lr = log_reg.predict(X_test)
y_proba_lr = log_reg.predict_proba(X_test)[:, 1]
# Evaluate
print("Logistic Regression")
print(classification_report(y_test, y_pred_lr))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_lr))
# ROC Curve
RocCurveDisplay.from_estimator(log_reg, X_test, y_test)
plt.title("Logistic Regression - ROC Curve")
plt.show()
Explanation:
Logistic Regression is a linear model that's lightweight and interpretable.
We also check ROC-AUC to evaluate its classification ability on imbalanced data.
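Because the model is linear, its coefficients double as a rough feature-importance ranking. Here's a minimal sketch (assuming log_reg was fit as above; coefficients are directional signals, not calibrated importances):
# Permissions with large positive coefficients push predictions toward malware
coef = pd.Series(log_reg.coef_[0], index=X_train.columns)
print("Most malware-leaning features:\n", coef.sort_values(ascending=False).head(10))
print("\nMost benign-leaning features:\n", coef.sort_values().head(10))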
2. K-Nearest Neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
y_proba_knn = knn.predict_proba(X_test)[:, 1]
print("K-Nearest Neighbors")
print(classification_report(y_test, y_pred_knn))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_knn))
RocCurveDisplay.from_estimator(knn, X_test, y_test)
plt.title("KNN - ROC Curve")
plt.show()
Explanation:
KNN is a distance-based model; it finds the k-nearest samples and predicts based on majority class.
Performance heavily depends on the choice of k.
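If you want to see that sensitivity directly, here's a small sketch that sweeps a few values of k and prints the test accuracy for each (illustration only; for a fair comparison you would cross-validate on the training set instead):
# Quick sweep over k to see how sensitive KNN is to this choice
for k in [3, 5, 7, 9, 11, 15]:
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    print(f"k={k:>2}  test accuracy = {acc:.4f}")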
3. Support Vector Machine (SVM)
from sklearn.svm import SVC
svm = SVC(probability=True, kernel='rbf', random_state=42)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
y_proba_svm = svm.predict_proba(X_test)[:, 1]
print("Support Vector Machine")
print(classification_report(y_test, y_pred_svm))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_svm))
RocCurveDisplay.from_estimator(svm, X_test, y_test)
plt.title("SVM - ROC Curve")
plt.show()
Explanation:
SVM with RBF kernel handles non-linear patterns well.
Setting probability=True allows ROC curve plotting.
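One caveat: probability=True fits an extra calibration step and slows training. If you only need the ROC-AUC score, the raw decision function works just as well (a small sketch using the SVC fit above):
# ROC-AUC computed from the SVM decision function, no probability calibration required
print("ROC-AUC (decision_function):", roc_auc_score(y_test, svm.decision_function(X_test)))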
4. Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
y_proba_dt = dt.predict_proba(X_test)[:, 1]
print("Decision Tree")
print(classification_report(y_test, y_pred_dt))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba_dt))
RocCurveDisplay.from_estimator(dt, X_test, y_test)
plt.title("Decision Tree - ROC Curve")
plt.show()
Explanation:
Decision Trees are simple to interpret and capture non-linear relationships.
They may overfit without proper pruning or depth control.
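To make that overfitting visible, here's a short sketch comparing train vs. test accuracy for the unconstrained tree against a depth-limited one (the depth of 5 is just an illustrative value):
# A large train/test gap for the unconstrained tree signals overfitting
for depth in [None, 5]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(f"max_depth={depth}: train = {tree.score(X_train, y_train):.4f}, "
          f"test = {tree.score(X_test, y_test):.4f}")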
As you can see, Logistic Regression already gives a good result, but let's see whether hyperparameter tuning can push the models any further.
Hyperparameter Tuning
We'll now use GridSearchCV to fine-tune each model by testing various combinations of parameters to find the best performing ones. This helps in optimizing model performance.
1. Logistic Regression - Grid Search
from sklearn.model_selection import GridSearchCV
param_grid_lr = {
'C': [0.01, 0.1, 1, 10, 100],
'penalty': ['l1', 'l2'],
'solver': ['liblinear']
}
grid_lr = GridSearchCV(LogisticRegression(random_state=42), param_grid_lr, cv=5, scoring='accuracy')
grid_lr.fit(X_train, y_train)
print("Best Parameters (Logistic Regression):", grid_lr.best_params_)
print("Best Score:", grid_lr.best_score_)
Explanation:
- C controls regularization strength (smaller values = stronger regularization).
- penalty chooses between L1 (sparse) and L2 (ridge) regularization.
- liblinear is used for small datasets and supports L1.
2. K-Nearest Neighbors - Grid Search
param_grid_knn = {
'n_neighbors': [3, 5, 7, 9],
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan']
}
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, scoring='accuracy')
grid_knn.fit(X_train, y_train)
print("Best Parameters (KNN):", grid_knn.best_params_)
print("Best Score:", grid_knn.best_score_)
Explanation:
- n_neighbors: number of nearest neighbors.
- weights: uniform (equal) or distance-weighted voting.
- metric: distance calculation method.
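One caveat worth knowing: KNN is distance-based, so features with larger ranges (like the engineered Total_Permissions column) can dominate the distance over the binary permission flags. A common fix, not used in this project, is to scale features inside a Pipeline and tune it the same way; a sketch under that assumption:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical variant: scale features before KNN; parameter names take the 'knn__' prefix
knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
param_grid_knn_scaled = {f'knn__{key}': values for key, values in param_grid_knn.items()}
grid_knn_scaled = GridSearchCV(knn_pipe, param_grid_knn_scaled, cv=5, scoring='accuracy')
# grid_knn_scaled.fit(X_train, y_train)  # uncomment to compare against the unscaled search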
3. Support Vector Machine - Grid Search
param_grid_svm = {
'C': [0.1, 1, 10],
'kernel': ['linear', 'rbf'],
'gamma': ['scale', 'auto']
}
grid_svm = GridSearchCV(SVC(probability=True, random_state=42), param_grid_svm, cv=5, scoring='accuracy')
grid_svm.fit(X_train, y_train)
print("Best Parameters (SVM):", grid_svm.best_params_)
print("Best Score:", grid_svm.best_score_)
Explanation:
- C: regularization strength.
- kernel: the function used to map inputs into a higher-dimensional space (linear or RBF here).
- gamma: how far the influence of a single training example reaches (for the RBF kernel).
4. Decision Tree - Grid Search
param_grid_dt = {
'max_depth': [None, 5, 10, 20],
'min_samples_split': [2, 5, 10],
'criterion': ['gini', 'entropy']
}
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid_dt, cv=5, scoring='accuracy')
grid_dt.fit(X_train, y_train)
print("Best Parameters (Decision Tree):", grid_dt.best_params_)
print("Best Score:", grid_dt.best_score_)
Explanation:
- max_depth: maximum depth of the tree.
- min_samples_split: minimum number of samples required to split an internal node.
- criterion: function used to measure the quality of a split.
Retrain Models with Best Hyperparameters (from GridSearchCV)
We'll now use the best parameters obtained from GridSearchCV to retrain our models and evaluate them on the test set.
Refit All Models
# Best models from GridSearch
model_lr = grid_lr.best_estimator_
model_knn = grid_knn.best_estimator_
model_svm = grid_svm.best_estimator_
model_dt = grid_dt.best_estimator_
# Refit on the training data (optional: GridSearchCV already refits best_estimator_ on the full training set)
model_lr.fit(X_train, y_train)
model_knn.fit(X_train, y_train)
model_svm.fit(X_train, y_train)
model_dt.fit(X_train, y_train)
Evaluate All Models
from sklearn.metrics import classification_report, roc_curve, auc
import matplotlib.pyplot as plt
models = {
"Logistic Regression": model_lr,
"KNN": model_knn,
"SVM": model_svm,
"Decision Tree": model_dt
}
plt.figure(figsize=(10, 7))
for name, model in models.items():
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(f"\n{name}")
print(classification_report(y_test, y_pred, digits=4))
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc:.2f})")
# Plot random classifier
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
Outcome Insight:
Each model is now optimized and re-evaluated using the best hyperparameters. You'll clearly observe how well each performs on the ROC curve and through the precision/recall/f1 metrics.
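If you prefer the numbers side by side rather than reading four classification reports, a small helper (a sketch reusing the models dict defined above) collects test accuracy and ROC-AUC into one table:
from sklearn.metrics import accuracy_score

# Test accuracy and ROC-AUC per model, collected into a single comparison table
summary = pd.DataFrame({
    name: {
        'Accuracy': accuracy_score(y_test, m.predict(X_test)),
        'ROC-AUC': roc_auc_score(y_test, m.predict_proba(X_test)[:, 1]),
    }
    for name, m in models.items()
}).T
print(summary.round(4))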
From our observations, the base models performed significantly better than the hyperparameter-tuned versions. Therefore, the tuned models are not suitable in this case. Let's explore alternative approaches such as using StratifiedKFold, RandomizedSearchCV, and increasing the number of cross-validation folds to see if we can further improve the model's performance.
RandomizedSearchCV with StratifiedKFold Cross-Validation
Now we will perform more robust tuning using RandomizedSearchCV combined with StratifiedKFold to preserve label proportions across folds and improve model generalization.
Define Cross-Validation Strategy
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
import numpy as np
cv_strategy = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = 'accuracy'
1. Logistic Regression with RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
param_dist_lr = {
'C': np.logspace(-3, 2, 100),
'penalty': ['l1', 'l2'],
'solver': ['liblinear']
}
rand_lr = RandomizedSearchCV(
LogisticRegression(random_state=42),
param_distributions=param_dist_lr,
n_iter=30,
cv=cv_strategy,
scoring=scoring,
random_state=42,
n_jobs=-1
)
rand_lr.fit(X_train, y_train)
print("Best Params (Logistic Regression):", rand_lr.best_params_)
This step allows us to explore a wide range of regularization strengths (C values) using a logarithmic scale and two different penalty types, enabling more flexibility than grid search.
2. K-Nearest Neighbors (KNN) with RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
param_dist_knn = {
'n_neighbors': np.arange(3, 30),
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan']
}
rand_knn = RandomizedSearchCV(
KNeighborsClassifier(),
param_distributions=param_dist_knn,
n_iter=30,
cv=cv_strategy,
scoring=scoring,
random_state=42,
n_jobs=-1
)
rand_knn.fit(X_train, y_train)
print("Best Params (KNN):", rand_knn.best_params_)
Here, we explore different neighbor values and distance metrics to optimize how the KNN model measures similarity between apps.
3. Support Vector Machine (SVM) with RandomizedSearchCV
from sklearn.svm import SVC
param_dist_svm = {
'C': np.logspace(-2, 2, 20),
'kernel': ['linear', 'rbf'],
'gamma': ['scale', 'auto']
}
rand_svm = RandomizedSearchCV(
SVC(probability=True, random_state=42),
param_distributions=param_dist_svm,
n_iter=20,
cv=cv_strategy,
scoring=scoring,
random_state=42,
n_jobs=-1
)
rand_svm.fit(X_train, y_train)
print("Best Params (SVM):", rand_svm.best_params_)
This allows us to tune regularization and kernel types. SVC is sensitive to hyperparameters, so random search helps avoid exhaustive computation.
4. Decision Tree with RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
param_dist_dt = {
'max_depth': [None, 5, 10, 20, 50],
'min_samples_split': [2, 5, 10],
'criterion': ['gini', 'entropy']
}
rand_dt = RandomizedSearchCV(
DecisionTreeClassifier(random_state=42),
param_distributions=param_dist_dt,
n_iter=20,
cv=cv_strategy,
scoring=scoring,
random_state=42,
n_jobs=-1
)
rand_dt.fit(X_train, y_train)
print("Best Params (Decision Tree):", rand_dt.best_params_)
This tuning balances underfitting and overfitting by varying tree depth and split criteria.
Fit the Best Models from RandomizedSearchCV
model_lr = rand_lr.best_estimator_
model_knn = rand_knn.best_estimator_
model_svm = rand_svm.best_estimator_
model_dt = rand_dt.best_estimator_
model_lr.fit(X_train, y_train)
model_knn.fit(X_train, y_train)
model_svm.fit(X_train, y_train)
model_dt.fit(X_train, y_train)
We're using the best configurations found by RandomizedSearchCV to retrain each model on the full training data.
Final Evaluation & ROC Curve Comparison
from sklearn.metrics import classification_report, roc_curve, auc
import matplotlib.pyplot as plt
models = {
"Logistic Regression": model_lr,
"KNN": model_knn,
"SVM": model_svm,
"Decision Tree": model_dt
}
plt.figure(figsize=(10, 7))
for name, model in models.items():
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(f"\n{name}")
print(classification_report(y_test, y_pred, digits=4))
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
This final visualization compares model performance using the Area Under the ROC Curve (AUC). A higher AUC indicates better discrimination between benign and malicious apps.
Final Model Selection
After evaluating all the models using RandomizedSearchCV and Stratified K-Folds, here's a quick summary of our key insights:
Logistic Regression performed the best overall:
- High accuracy
- High precision and recall for both classes
- Lightweight and fast
- Interpretable and easily deployable
While SVM showed slightly better precision in some cases, Logistic Regression provides a great balance between performance, simplicity, and interpretability - making it ideal for our malware classification use case.
So, we proceed with:
final_model = model_lr # Logistic Regression as final model
Model Deployment
1. Save the Final Model
We'll use joblib to save the trained Logistic Regression model, along with the list of feature columns it was trained on:
import joblib
# Save the final model
joblib.dump(final_model, 'logistic_malware_model.pkl')
feature_columns = X_train.columns.tolist()
joblib.dump(feature_columns, 'feature_columns.pkl')
2. Load and Predict with New Data
Here's how to load the model and make predictions on new Android app permission data:
# Load model and features
model = joblib.load('logistic_malware_model.pkl')
features = joblib.load('feature_columns.pkl')
# Example: New app's permission vector (as a dictionary or DataFrame row)
import pandas as pd
new_app = pd.DataFrame([{
'INTERNET': 1,
'ACCESS_FINE_LOCATION': 1,
'READ_SMS': 0,
'RECORD_AUDIO': 0,
# ... include all other columns used during training
'Total_Permissions': 5,
'Top4_Permission_Count': 2,
'Suspicious_Flag': 1
}])
# Ensure correct column order
new_app = new_app.reindex(columns=features, fill_value=0)
# Predict
prediction = model.predict(new_app)
probability = model.predict_proba(new_app)[:, 1]
print("Prediction (0=Benign, 1=Malware):", prediction[0])
print("Malware Probability Score:", probability[0])
Ready for Deployment
You can now:
- Deploy this model behind a Flask or FastAPI endpoint (a minimal sketch follows below)
- Use it in Android threat detection tools
- Wrap it in a web app for interactive prediction
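As a starting point for the first option, here's a minimal Flask sketch. The file names mirror the joblib artifacts saved above; the endpoint name and payload format are illustrative assumptions, not a production setup:
# app.py - minimal Flask wrapper around the saved model (illustrative sketch)
from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load('logistic_malware_model.pkl')
features = joblib.load('feature_columns.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON object like {"INTERNET": 1, "READ_SMS": 0, ...}; missing columns default to 0
    payload = request.get_json(force=True)
    row = pd.DataFrame([payload]).reindex(columns=features, fill_value=0)
    proba = float(model.predict_proba(row)[0, 1])
    return jsonify({'prediction': int(model.predict(row)[0]), 'malware_probability': proba})

if __name__ == '__main__':
    app.run(debug=True)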
Conclusion
We successfully built a Supervised Learning Classification model to detect malware in Android apps using only the permissions they request.
Key Takeaways:
- Performed thorough Exploratory Data Analysis (EDA) to uncover how different app permissions correlate with malware behavior.
- Trained multiple ML models - Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Decision Tree - to compare performance.
- Applied techniques like GridSearchCV, RandomizedSearchCV, and StratifiedKFold to fine-tune hyperparameters and improve model accuracy.
- Evaluated all models using classification metrics (Precision, Recall, F1-score) and ROC-AUC curves for robustness.
- Finalized Logistic Regression as the best model due to its lightweight, interpretable nature and strong performance metrics.
- This model can now be integrated into mobile security tools to proactively flag suspicious apps before they reach the user.
Personal Reflection:
This project helped me revisit core concepts in machine learning and reinforced how critical foundational knowledge is. While it's tempting to always chase complex architectures like XGBoost or deep neural nets, this work reminded me that simple, interpretable models often deliver powerful results when well-tuned.
It's all about:
- Thinking critically,
- Iterating consistently,
- And experimenting openly with different ML strategies.
Sometimes, going back to basics is exactly what moves you forward.
Connect with Me
This project is part of my learning journey with NxtWave's CCBP 4.0 Academy.
Blog by Naresh B. A.
Aspiring Full Stack Developer | Passionate about Machine Learning and AI Innovation
Portfolio: Naresh B A
Let's connect on LinkedIn | GitHub
Thanks for reading! If you found this useful, drop a like or a comment; feedback keeps the learning alive.