How I Built a Password Strength Analyzer with Python

Hi, I'm Maryam Hareem, a Computer Science student passionate about cybersecurity and AI. One of my favorite projects so far was building a Password Strength Analyzer using Python. This tool checks how strong a password is based on things like length, digits, uppercase letters, and symbols.
In this blog, I'll guide you through how I designed and trained the analyzer using real-world datasets, scikit-learn, and visualization tools - plus how I learned to build smarter models using AI tips and good security practices.
If you're also a student or Python beginner, this project is a great place to start!
Libraries Used
The project uses several powerful Python libraries for data handling, visualization, and machine learning:
NumPy and Pandas: For efficient data manipulation and numerical computations.
Matplotlib and Seaborn: For creating informative and visually appealing charts.
Scikit-learn:
CountVectorizer and TfidfVectorizer: To convert passwords into numerical features.
train_test_split: For splitting the dataset into training and testing sets.
LogisticRegression, PassiveAggressiveClassifier, SVC, MultinomialNB, RandomForestClassifier: Various machine learning models to classify password strength.
accuracy_score, confusion_matrix, classification_report: For performance evaluation.
Scipy's csr_matrix: For handling sparse matrix representations efficiently.
Regular Expressions (re): For analyzing and cleaning password patterns.

import itertools
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 1: Collecting and Preparing the Data
I used the Password Strength Classifier Dataset by bhavikbb on Kaggle, which contains around 670,000 unique passwords labeled as weak (0), medium (1), or strong (2) based on commercial password strength meters. https://www.kaggle.com/datasets/bhavikbb/password-strength-classifier-dataset?resource=download
How I Loaded the Data

# step 1: loading the csv file
password_strength_analyzer = pd.read_csv('data.csv', encoding='UTF8', on_bad_lines='skip')
password_strength_analyzer = password_strength_analyzer.dropna(subset=['password'])  # remove rows with no password
password_strength_analyzer['password'] = password_strength_analyzer['password'].apply(str)  # make every password a string
password_strength_analyzer = password_strength_analyzer.rename(columns={'strength': 'strength_category'})
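
Before training, it can help to check how the three labels are distributed, because a heavily imbalanced dataset makes plain accuracy misleading. The following is a minimal sketch that uses a small made-up DataFrame in place of data.csv; the sample rows are assumptions for illustration, and only the column names follow the loading code above.

```python
import pandas as pd

# Tiny stand-in for the Kaggle CSV; these rows are made up for illustration.
sample = pd.DataFrame({
    'password': ['abc123', 'qwerty', 'Tr0ub4dor&3xY!', 'hello2020', None],
    'strength_category': [0, 0, 2, 1, 1],
})

# Same cleaning as in the loading step: drop missing passwords, force strings.
sample = sample.dropna(subset=['password'])
sample['password'] = sample['password'].astype(str)

# Number of passwords per class (0 = weak, 1 = medium, 2 = strong).
class_counts = sample['strength_category'].value_counts().sort_index()
print(class_counts)

# Share of each class, handy for spotting imbalance at a glance.
class_share = class_counts / class_counts.sum()
print(class_share.round(2))
```

On the real dataset, the same two calls on password_strength_analyzer['strength_category'] would show whether one class dominates.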

Feature Engineering
To help the model learn, I extracted both manual and text-based features:

# step 2: adding new columns (manual features)
def has_digit(password):
    return bool(re.search(r'\d', password))

def has_lowercase(password):
    return bool(re.search(r'[a-z]', password))

def has_uppercase(password):
    return bool(re.search(r'[A-Z]', password))

def has_symbol(password):
    # \W matches any non-word character; adding _ also counts the underscore as a symbol
    return bool(re.search(r'[\W_]', password))

password_strength_analyzer['length'] = password_strength_analyzer['password'].apply(len)
password_strength_analyzer['has_digit'] = password_strength_analyzer['password'].apply(has_digit)
password_strength_analyzer['has_uppercase'] = password_strength_analyzer['password'].apply(has_uppercase)
password_strength_analyzer['has_lowercase'] = password_strength_analyzer['password'].apply(has_lowercase)
password_strength_analyzer['has_symbol'] = password_strength_analyzer['password'].apply(has_symbol)
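
To see what these checks actually detect, here is a small self-contained sketch that runs the same regular expressions on a few sample passwords. The describe helper and the sample passwords are hypothetical additions for illustration, not part of the original notebook.

```python
import re

def has_digit(password):
    return bool(re.search(r'\d', password))

def has_lowercase(password):
    return bool(re.search(r'[a-z]', password))

def has_uppercase(password):
    return bool(re.search(r'[A-Z]', password))

def has_symbol(password):
    # [\W_] matches non-word characters plus the underscore.
    return bool(re.search(r'[\W_]', password))

def describe(password):
    """Hypothetical helper: return the five manual features as a dict."""
    return {
        'length': len(password),
        'has_digit': has_digit(password),
        'has_uppercase': has_uppercase(password),
        'has_lowercase': has_lowercase(password),
        'has_symbol': has_symbol(password),
    }

# Made-up sample passwords covering different character classes.
for pw in ['password', 'Passw0rd', 'my_pass', 'P@ssw0rd!2024']:
    print(pw, describe(pw))
```

Note that 'my_pass' counts as containing a symbol only because the pattern includes the underscore explicitly; \W on its own treats the underscore as a word character.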

Step 2: Training and Evaluating the Model
Once the password features were ready, I split the dataset into training and testing sets so I could train the model and then evaluate how well it performed on unseen data.
Train-Test Split

# step 3: dividing the data into train and test sets
x = password_strength_analyzer[['password', 'length', 'has_digit', 'has_uppercase', 'has_symbol', 'has_lowercase']]
y = password_strength_analyzer['strength_category']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
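
If the three classes are unevenly represented, one optional variation is a stratified split, which keeps the class proportions the same in the training and testing sets. This is a sketch on made-up data, not what the original notebook does; the toy DataFrame and the variable names are assumptions, while stratify is a standard train_test_split parameter.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 4 weak, 12 medium, 4 strong.
toy = pd.DataFrame({
    'password': [f'pw{i}' for i in range(20)],
    'strength_category': [0] * 4 + [1] * 12 + [2] * 4,
})
x_toy = toy[['password']]
y_toy = toy['strength_category']

# stratify=y_toy keeps the 20% / 60% / 20% class mix in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(y_tr.value_counts().sort_index())
print(y_te.value_counts().sort_index())
```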

Text Vectorization with n-grams
I used CountVectorizer from scikit-learn with character-level n-grams, which turns each password into a vector of counts of the character sequences it contains.
With an n-gram range of (1, 3), the vectorizer captures single characters, bigrams, and trigrams. You can also try (1, 1), (2, 2), or (3, 3).

# step 4: making the sparse matrix
ngram = (1, 3)
# analyzer='char' gives character-level n-grams (the default analyzer is word-level);
# lowercase=False keeps uppercase and lowercase letters as separate features
vectorizer = CountVectorizer(analyzer='char', ngram_range=ngram, lowercase=False)
x_train_vectorized = vectorizer.fit_transform(x_train['password'])
x_test_vectorized = vectorizer.transform(x_test['password'])
# .values converts the DataFrame into a plain NumPy array, because hstack()
# works on numerical arrays rather than labeled tables with column names and indexes
x_train_manual = x_train[['length', 'has_digit', 'has_uppercase', 'has_symbol', 'has_lowercase']].astype(int).values
x_test_manual = x_test[['length', 'has_digit', 'has_uppercase', 'has_symbol', 'has_lowercase']].astype(int).values

Combine Manual Features

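# step 5: combine the n-gram counts with the manual features
from scipy.sparse import hstack
# convert the dense manual features to sparse before stacking, then store the result as CSR
x_train_combined = hstack([x_train_vectorized, csr_matrix(x_train_manual)]).tocsr()
x_test_combined = hstack([x_test_vectorized, csr_matrix(x_test_manual)]).tocsr()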
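
Stacking simply places the manual feature columns to the right of the n-gram columns, row by row. Here is a minimal sketch with made-up numbers that shows the resulting shape; the arrays are stand-ins, not real vectorizer output.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-ins: 3 passwords x 4 n-gram counts, and 3 passwords x 2 manual features.
ngram_counts = csr_matrix(np.array([[1, 0, 2, 0],
                                    [0, 1, 0, 0],
                                    [0, 0, 0, 3]]))
manual = np.array([[6, 1],
                   [8, 0],
                   [12, 1]])

# Convert the dense block to sparse first, then stack column-wise.
combined = hstack([ngram_counts, csr_matrix(manual)]).tocsr()

print(combined.shape)      # rows stay the same, columns add up
print(combined.toarray())
```

Both blocks must have the same number of rows, in the same order, or the features of one password would be attached to another.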

Model Training and Prediction
I started with Logistic Regression, a simple but effective classifier:

# step 6: train
# Initialize and train the classifier
classifier = LogisticRegression()
classifier.fit(x_train_combined, y_train)
# Make predictions
y_pred = classifier.predict(x_test_combined)
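
The other classifiers imported at the top can be trained the same way. The following is a rough sketch of such a comparison on a handful of made-up passwords and labels, so the scores it prints say nothing about real performance; the variable names and the max_iter value are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Made-up passwords and labels (0 = weak, 1 = medium, 2 = strong), illustration only.
train_passwords = ['123456', 'password', 'qwerty', 'abc123',
                   'hello2020', 'Summer2021', 'xK9#mQ2$vL7!pZ', 'T7&bn!Qz93@kLw']
train_labels = [0, 0, 0, 0, 1, 1, 2, 2]
test_passwords = ['111111', 'Winter2022', 'R4$tY8!uI2@oP6']
test_labels = [0, 1, 2]

# Character n-grams, fitted on the training passwords only.
vec = CountVectorizer(analyzer='char', ngram_range=(1, 3), lowercase=False)
train_x = vec.fit_transform(train_passwords)
test_x = vec.transform(test_passwords)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'MultinomialNB': MultinomialNB(),
}

# Fit each model on the same features and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(train_x, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(test_x))
    print(f'{name}: accuracy = {scores[name]:.2f}')
```

Keeping the vectorizer and the split fixed while swapping only the model makes the comparison fair.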

Step 3: Evaluation: Accuracy and Confusion Matrix

# Evaluate the model
dict_report = classification_report(y_test, y_pred, output_dict=True)
df_report = pd.DataFrame(dict_report).transpose()
plt.figure(figsize=(10, 6))
sns.heatmap(df_report.iloc[:, :3], annot=True, cmap='YlGnBu', fmt='.2f')  # Only precision, recall, f1-score
plt.title('Classification Report Visualization (Logistic Regression)')
plt.xlabel('Metrics')
plt.ylabel('Strength Category')
plt.show()
conf_matrix = confusion_matrix(y_test, y_pred)
# Create a heatmap for the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='YlOrRd', xticklabels=['Weak', 'Medium', 'Strong'], yticklabels=['Weak', 'Medium', 'Strong'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for LogisticRegression')
plt.show()
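
For a single headline number, accuracy_score (imported at the top) gives the fraction of test passwords that were classified correctly. A minimal sketch with made-up labels follows; in the real project the two arguments would be y_test and y_pred.

```python
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels (0 = weak, 1 = medium, 2 = strong).
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]

# Fraction of positions where the prediction matches the true label.
accuracy = accuracy_score(true_labels, predicted_labels)
print(f'Accuracy: {accuracy:.2%}')

# The same number computed by hand, to show what the metric means.
matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
manual_accuracy = matches / len(true_labels)
print(f'By hand:  {manual_accuracy:.2%}')
```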

Try It Yourself: Real-Time Password Checker
Finally, I added a simple function that lets you enter any password and check its strength:

# Map the numeric labels used in the dataset to readable names
category = {0: 'Weak', 1: 'Medium', 2: 'Strong'}

def password_analyzer(password):
    # Vectorize the password into n-grams
    sparse_password = vectorizer.transform([password])
    # Extract manual features
    length = len(password)
    has_digit = int(bool(re.search(r'\d', password)))
    has_lowercase = int(bool(re.search(r'[a-z]', password)))
    has_upper = int(bool(re.search(r'[A-Z]', password)))
    has_symbol = int(bool(re.search(r'[\W_]', password)))
    # Keep the same column order as in training:
    # length, has_digit, has_uppercase, has_symbol, has_lowercase
    manual_features = np.array([[length, has_digit, has_upper, has_symbol, has_lowercase]])
    # Convert manual features to sparse format before stacking
    manual_sparse = csr_matrix(manual_features)
    # Combine n-gram vector + manual features
    combined_features = hstack([sparse_password, manual_sparse])
    # Predict
    predicted = classifier.predict(combined_features)[0]
    strength = category[predicted]
    print(f'Password: {password}\nPredicted Strength: {strength}')

password = input('Enter password to check its strength: ')
password_analyzer(password)
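
The manual features passed at prediction time have to be in the same column order that was used for training, otherwise the model reads, say, the lowercase flag as the digit flag. One way to make that hard to get wrong is to build the features through a single helper in both places. This is a sketch under that assumption; FEATURE_ORDER, manual_row, and build_features are hypothetical names, and the two fitted passwords are made up.

```python
import re
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Column order used when the training matrix was built.
FEATURE_ORDER = ['length', 'has_digit', 'has_uppercase', 'has_symbol', 'has_lowercase']

def manual_row(password):
    """Return the manual features of one password, always in FEATURE_ORDER."""
    values = {
        'length': len(password),
        'has_digit': int(bool(re.search(r'\d', password))),
        'has_uppercase': int(bool(re.search(r'[A-Z]', password))),
        'has_symbol': int(bool(re.search(r'[\W_]', password))),
        'has_lowercase': int(bool(re.search(r'[a-z]', password))),
    }
    return [values[name] for name in FEATURE_ORDER]

def build_features(passwords, fitted_vectorizer):
    """Stack n-gram counts and manual features for a list of passwords."""
    ngrams = fitted_vectorizer.transform(passwords)
    manual = csr_matrix(np.array([manual_row(p) for p in passwords]))
    return hstack([ngrams, manual]).tocsr()

# Demo with a vectorizer fitted on two made-up passwords.
demo_vec = CountVectorizer(analyzer='char', ngram_range=(1, 3), lowercase=False)
demo_vec.fit(['abc123', 'Passw0rd!'])
features = build_features(['Ab1_'], demo_vec)
print(features.shape)
print(features.toarray()[0, -5:])  # the five manual features, in FEATURE_ORDER
```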

You can run this in your terminal and enter any password to see what the model predicts!
Confusion Matrix
The confusion matrix is a key tool for assessing the performance of a classification model. It compares the actual (true) labels with the predicted labels: each row is a true class and each column is a predicted class, so the diagonal holds the correct predictions and every other cell shows which classes the model confuses with each other.
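
Per-class recall and precision can be read straight off the matrix: divide the diagonal by the row sums for recall, and by the column sums for precision. A small worked sketch follows, using made-up counts rather than results from this project.

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predicted classes,
# in the order Weak, Medium, Strong.
conf = np.array([[50,  8,  2],
                 [ 5, 80, 15],
                 [ 1,  9, 30]])

correct = conf.diagonal()

# Recall: of all passwords that truly belong to a class, how many were found.
recall = correct / conf.sum(axis=1)

# Precision: of all passwords predicted as a class, how many really belong to it.
precision = correct / conf.sum(axis=0)

for label, r, p in zip(['Weak', 'Medium', 'Strong'], recall, precision):
    print(f'{label}: recall = {r:.2f}, precision = {p:.2f}')
```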

Conclusion
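This Password Strength Analyzer demonstrates how machine learning can be applied to a real-world cybersecurity problem in a simple and user-friendly way. By combining structural features of passwords with character n-grams and training a classifier such as Logistic Regression, we can predict the strength label of a password, and the classification report and confusion matrix show how well the model does on each class. The other imported classifiers (PassiveAggressiveClassifier, SVC, MultinomialNB, RandomForestClassifier) can be trained the same way for comparison.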
This project not only promotes cyber hygiene but also showcases how beginner-friendly tools like Scikit-learn, Pandas, and Seaborn can be used to build impactful applications. With further improvements - like adding a live web interface or expanding the dataset - this project can be extended into a full-fledged password auditing tool.

View the full Jupyter Notebook on GitHub

DEV Community
