Data Analyst Guide: Why Gen Z Job Applications Get Rejected (Real Talk)

Business Problem Statement

The job market is highly competitive, and many Gen Z applicants are being rejected. As a data analyst, you can help by identifying the reasons behind these rejections and turning them into insights that improve applicants' chances of being hired. In this tutorial, we'll work through a scenario in which a company struggles to attract and hire Gen Z talent, which lowers the return on its recruiting spend.

Consider a company that spends an average of $1,000 per job posting and receives around 100 applications per posting. Only about 10% of these applicants are hired, so most of the screening effort goes to candidates who are ultimately rejected. By analyzing why applicants are rejected, we can identify areas for improvement and optimize the hiring process to increase the ROI.

Step-by-Step Technical Solution

Step 1: Data Preparation (Pandas/SQL)

First, we need to collect and prepare the data. Let's assume we have a dataset containing information about the job applicants, including their demographics, skills, and application status.

import pandas as pd
import numpy as np

# Sample dataset
data = {
    'ApplicantID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Jane', 'Bob', 'Alice', 'Mike'],
    'Age': [22, 25, 28, 24, 26],
    'Skills': ['Python, Data Science', 'Marketing, Sales', 'Java, Development', 'Data Analysis, Visualization', 'Machine Learning, AI'],
    'ApplicationStatus': ['Rejected', 'Hired', 'Rejected', 'Rejected', 'Hired']
}

df = pd.DataFrame(data)

print(df)

We can also use SQL to retrieve the data from a database. For example:

CREATE TABLE JobApplicants (
    ApplicantID INT PRIMARY KEY,
    Name VARCHAR(255),
    Age INT,
    Skills VARCHAR(255),
    ApplicationStatus VARCHAR(255)
);

INSERT INTO JobApplicants (ApplicantID, Name, Age, Skills, ApplicationStatus)
VALUES
(1, 'John', 22, 'Python, Data Science', 'Rejected'),
(2, 'Jane', 25, 'Marketing, Sales', 'Hired'),
(3, 'Bob', 28, 'Java, Development', 'Rejected'),
(4, 'Alice', 24, 'Data Analysis, Visualization', 'Rejected'),
(5, 'Mike', 26, 'Machine Learning, AI', 'Hired');

SELECT * FROM JobApplicants;
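
To connect the SQL table to the pandas workflow, the query result can be loaded straight into a DataFrame. The snippet below is a minimal sketch that assumes an in-memory SQLite database purely so it runs on its own; the name df_sql is illustrative, and a real setup would connect to the company's actual database instead.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the real applicant database
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE JobApplicants (
        ApplicantID INT PRIMARY KEY,
        Name VARCHAR(255),
        Age INT,
        Skills VARCHAR(255),
        ApplicationStatus VARCHAR(255)
    )
""")
conn.executemany(
    "INSERT INTO JobApplicants VALUES (?, ?, ?, ?, ?)",
    [
        (1, 'John', 22, 'Python, Data Science', 'Rejected'),
        (2, 'Jane', 25, 'Marketing, Sales', 'Hired'),
        (3, 'Bob', 28, 'Java, Development', 'Rejected'),
        (4, 'Alice', 24, 'Data Analysis, Visualization', 'Rejected'),
        (5, 'Mike', 26, 'Machine Learning, AI', 'Hired'),
    ],
)
conn.commit()

# Load the query result into a DataFrame for the pandas steps that follow
df_sql = pd.read_sql_query("SELECT * FROM JobApplicants", conn)
conn.close()

print(df_sql)
```

The resulting DataFrame has the same columns as the hand-built sample dataset, so the later steps work with either source.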

Step 2: Analysis Pipeline

Next, we'll create an analysis pipeline to identify the reasons behind the rejections. We'll use the pandas library to perform data manipulation and analysis.

# Convert the Skills column to a list of skills
df['Skills'] = df['Skills'].apply(lambda x: x.split(', '))

# Create a new column to store the number of skills
df['NumSkills'] = df['Skills'].apply(len)

# Store the average age of hired applicants as a reference column (the same value on every row)
hired_applicants = df[df['ApplicationStatus'] == 'Hired']
avg_age = hired_applicants['Age'].mean()
df['AvgAge'] = avg_age

print(df)
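
One simple way to look for reasons behind rejections is to compare hire rates across skills. The sketch below rebuilds the two relevant columns of the sample dataset so it runs independently; df_demo, skills_long, and hire_rate_by_skill are illustrative names, and with only five applicants the rates are not statistically meaningful.

```python
import pandas as pd

# Rebuild the relevant columns of the sample dataset so this snippet is self-contained
df_demo = pd.DataFrame({
    'Skills': ['Python, Data Science', 'Marketing, Sales', 'Java, Development',
               'Data Analysis, Visualization', 'Machine Learning, AI'],
    'ApplicationStatus': ['Rejected', 'Hired', 'Rejected', 'Rejected', 'Hired'],
})

# One row per (applicant, skill) pair
skills_long = df_demo.assign(Skills=df_demo['Skills'].str.split(', ')).explode('Skills')
skills_long['Hired'] = (skills_long['ApplicationStatus'] == 'Hired').astype(int)

# Number of applicants listing each skill and the share of them who were hired
hire_rate_by_skill = (
    skills_long.groupby('Skills')['Hired']
    .agg(applicants='count', hire_rate='mean')
    .sort_values('hire_rate', ascending=False)
)

print(hire_rate_by_skill)
```

On a larger dataset, skills with many applicants but a low hire rate would be the first place to look for a mismatch between what applicants offer and what the role requires.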

Step 3: Model/Visualization Code

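Now, we'll create a model to predict the likelihood of an applicant getting hired based on their number of skills and their age. We'll use the scikit-learn library to train a logistic regression model. With only five sample rows, the test set contains a single applicant, so the scores below only illustrate the workflow.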

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Define the features and target variable
X = df[['NumSkills', 'Age']]
y = df['ApplicationStatus']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
print('Classification Report:')
print(classification_report(y_test, y_pred))
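
Because a 20% test split of five rows leaves a single test applicant, a single accuracy number says very little. As a hedged alternative, the sketch below uses leave-one-out cross-validation, which scores the model once per applicant. The feature values are copied from the sample dataset; X_demo, y_demo, and scores are illustrative names.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Features and labels copied from the sample dataset (every applicant lists two skills)
X_demo = pd.DataFrame({'NumSkills': [2, 2, 2, 2, 2], 'Age': [22, 25, 28, 24, 26]})
y_demo = pd.Series(['Rejected', 'Hired', 'Rejected', 'Rejected', 'Hired'])

# Each fold trains on four applicants and tests on the remaining one
scores = cross_val_score(
    LogisticRegression(), X_demo, y_demo, cv=LeaveOneOut(), scoring='accuracy'
)

print('Per-applicant scores:', scores)
print('Mean accuracy:', scores.mean())
```

Leave-one-out is only practical for small datasets; with thousands of applications, k-fold cross-validation would be the more usual choice.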

We can also use visualization techniques to gain insights into the data. For example, we can use a bar chart to show the distribution of skills among hired and rejected applicants.

import matplotlib.pyplot as plt

# Create a bar chart of skill counts, split by application status
skills_long = df.explode('Skills')
pd.crosstab(skills_long['Skills'], skills_long['ApplicationStatus']).plot(kind='bar')
plt.title('Distribution of Skills by Application Status')
plt.xlabel('Skill')
plt.ylabel('Count')
plt.show()

Step 4: Performance Evaluation

To evaluate the performance of our model, we'll use metrics such as accuracy, precision, and recall.

from sklearn.metrics import precision_score, recall_score

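# Calculate the precision and recall for the 'Hired' class.
# The labels are strings, so pos_label must be set explicitly (the default of 1 raises an error).
precision = precision_score(y_test, y_pred, pos_label='Hired', zero_division=0)
recall = recall_score(y_test, y_pred, pos_label='Hired', zero_division=0)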

print('Precision:', precision)
print('Recall:', recall)
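
Precision and recall are easier to interpret next to the confusion matrix they are derived from. The sketch below uses hypothetical true and predicted labels (not output from the model above) so that the counts are easy to follow; y_true_demo and y_pred_demo are illustrative names.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only, not results from the tutorial's model
y_true_demo = ['Hired', 'Rejected', 'Rejected', 'Hired', 'Rejected']
y_pred_demo = ['Hired', 'Rejected', 'Hired', 'Rejected', 'Rejected']

# Rows are true classes and columns are predicted classes, in the order given by labels
labels = ['Hired', 'Rejected']
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

true_pos = cm[0, 0]    # hired applicants predicted as hired
false_neg = cm[0, 1]   # hired applicants predicted as rejected
false_pos = cm[1, 0]   # rejected applicants predicted as hired

print(cm)
print('Precision (Hired):', true_pos / (true_pos + false_pos))
print('Recall (Hired):', true_pos / (true_pos + false_neg))
```

In a hiring context, a false negative means a suitable applicant was screened out, which is usually the costlier mistake to track.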

Step 5: Production Deployment

Finally, we'll wrap the trained model in a prediction function, which is the piece a production service would call to score new applicants.

# Create a function to make predictions on new data
def predict_hire(applicant_data):
    # Convert the skills to a list, ignoring blank entries so an empty string counts as zero skills
    skills = [s.strip() for s in applicant_data['Skills'].split(',') if s.strip()]

    # Create a new dataframe with the applicant data
    new_df = pd.DataFrame({'NumSkills': [len(skills)], 'Age': [applicant_data['Age']]})

    # Make a prediction using the trained model
    prediction = model.predict(new_df)

    return prediction

# Test the function
applicant_data = {'Skills': 'Python, Data Science', 'Age': 25}
prediction = predict_hire(applicant_data)
print('Prediction:', prediction)
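
Since the goal is to predict the likelihood of a hire, returning a probability is often more useful than a hard label. The sketch below is one possible variant that relies on scikit-learn's predict_proba; the function name predict_hire_probability and the variable clf are hypothetical, and the model is refit on all five sample rows here only to keep the snippet self-contained.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Refit on the full sample dataset so this snippet runs on its own
train = pd.DataFrame({
    'NumSkills': [2, 2, 2, 2, 2],
    'Age': [22, 25, 28, 24, 26],
    'ApplicationStatus': ['Rejected', 'Hired', 'Rejected', 'Rejected', 'Hired'],
})
clf = LogisticRegression().fit(train[['NumSkills', 'Age']], train['ApplicationStatus'])


def predict_hire_probability(applicant_data, fitted_model):
    """Return the estimated probability that the applicant is hired."""
    # Ignore blank entries so an empty Skills string counts as zero skills
    skills = [s.strip() for s in applicant_data['Skills'].split(',') if s.strip()]
    row = pd.DataFrame({'NumSkills': [len(skills)], 'Age': [applicant_data['Age']]})

    # predict_proba returns one column per class, ordered as in classes_
    proba = fitted_model.predict_proba(row)[0]
    hired_index = list(fitted_model.classes_).index('Hired')
    return float(proba[hired_index])


p_hired = predict_hire_probability({'Skills': 'Python, Data Science', 'Age': 25}, clf)
print('Probability of hire:', p_hired)
```

Applicants can then be ranked by this probability, which is what makes prioritizing a shortlist possible.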

Metrics/ROI Calculations

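To put the model's value in business terms, we'll start from the figures in the problem statement:

Cost per job posting: $1,000
Number of applicants: 100
Number of hires: 10
Cost per applicant: $10 ($1,000 / 100 applicants)
Cost per hire: $100 ($1,000 / 10 hires)

Using our model, we can predict the likelihood of an applicant getting hired and prioritize those with the highest likelihood. This can help reduce the screening effort spent on each hire. A full ROI figure also needs an estimate of the value each hire brings, which this scenario does not specify, so the code below computes the baseline metrics that any improvement would be measured against.

# Baseline hiring metrics from the problem statement
cost_per_posting = 1000
num_hires = 10
num_applicants = 100

# Derived metrics
cost_per_applicant = cost_per_posting / num_applicants
cost_per_hire = cost_per_posting / num_hires
hire_rate = num_hires / num_applicants

print('Cost per applicant:', cost_per_applicant)
print('Cost per hire:', cost_per_hire)
print('Hire rate:', hire_rate)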

Edge Cases

To handle edge cases, we'll consider the following scenarios:

What if the applicant has no skills listed?
What if the applicant is under 18 or over 65?
What if the applicant has a high number of skills but is not hired?

To handle these edge cases, we can add additional features to our model, such as:

A flag to indicate if the applicant has no skills listed
A flag to indicate if the applicant is under 18 or over 65
A feature to capture the applicant's work experience

# Add additional features to handle edge cases
# An empty Skills string splits to [''], so check for non-blank entries rather than list length
df['NoSkills'] = df['Skills'].apply(lambda x: 0 if any(s.strip() for s in x) else 1)
df['Under18'] = (df['Age'] < 18).astype(int)
df['Over65'] = (df['Age'] > 65).astype(int)
# Rough proxy only: assume work experience starts at 18, floored at 0 for younger applicants
df['WorkExperience'] = (df['Age'] - 18).clip(lower=0)

print(df)
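
The same edge cases can also be caught before an applicant record ever reaches the model. The sketch below is one hypothetical way to do that; the function name validate_applicant, the message strings, and the default age bounds of 18 and 65 are assumptions taken from the scenarios listed above, not part of any library.

```python
def validate_applicant(applicant_data, min_age=18, max_age=65):
    """Return the cleaned skill list and a list of problems found in the record."""
    problems = []

    # Treat a missing or blank Skills value as an empty skill list
    raw_skills = applicant_data.get('Skills') or ''
    skills = [s.strip() for s in raw_skills.split(',') if s.strip()]
    if not skills:
        problems.append('no skills listed')

    # Flag missing ages and ages outside the assumed bounds
    age = applicant_data.get('Age')
    if age is None:
        problems.append('age missing')
    elif age < min_age or age > max_age:
        problems.append('age outside expected range')

    return skills, problems


print(validate_applicant({'Skills': 'Python, Data Science', 'Age': 25}))
print(validate_applicant({'Skills': '', 'Age': 17}))
```

Records with a non-empty problem list can be routed to manual review instead of being scored automatically.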

Scaling Tips

To scale our model, we can consider the following strategies:

Use a more powerful machine learning algorithm, such as a random forest or a neural network
Use a larger dataset to train the model
Use distributed computing to train the model on multiple machines
Use a cloud-based platform to deploy the model

# Use a more powerful machine learning algorithm
from sklearn.ensemble import RandomForestClassifier

# Train a random forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed for reproducible results
model.fit(X_train, y_train)

# Evaluate the model
accuracy = accuracy_score(y_test, model.predict(X_test))
print('Accuracy:', accuracy)
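
Deploying to another machine or a cloud platform usually starts with saving the trained model to disk so that the serving process can load it without retraining. The sketch below uses joblib for this; the file name hire_model.joblib and the variables rf, X_demo, and y_demo are illustrative, and a temporary directory stands in for real model storage.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Features and labels copied from the sample dataset so this snippet is self-contained
X_demo = pd.DataFrame({'NumSkills': [2, 2, 2, 2, 2], 'Age': [22, 25, 28, 24, 26]})
y_demo = pd.Series(['Rejected', 'Hired', 'Rejected', 'Rejected', 'Hired'])

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Assumption: a temporary directory stands in for real model storage
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'hire_model.joblib')
    joblib.dump(rf, model_path)          # save the fitted model
    restored = joblib.load(model_path)   # load it back, as a serving process would

# The restored model should reproduce the original model's predictions
same_predictions = bool((restored.predict(X_demo) == rf.predict(X_demo)).all())
print('Restored model matches original:', same_predictions)
```

A model file should be loaded with the same scikit-learn version that produced it, and only from a trusted source, because loading executes pickled code.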

By following these steps and accounting for edge cases and scaling strategies, we can build a baseline model for predicting the likelihood of Gen Z job applicants getting hired and refine it as more application data becomes available.