Data Analyst Guide: Mastering Portfolio Projects That Impress Hiring Managers
Business Problem Statement
A leading e-commerce company, "EcomShop", wants to improve its customer retention rate by identifying the factors that influence customer churn. The company has collected data on customer demographics, purchase history, and customer support interactions. The goal is to develop a predictive model that can identify high-risk customers and provide recommendations to reduce churn.
The estimated ROI impact of this project is:
- 10% reduction in customer churn rate, resulting in a $1 million increase in annual revenue
- 5% increase in customer retention rate, resulting in a $500,000 increase in annual revenue
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
First, we need to prepare the data for analysis. We will use pandas to load and manipulate the data, and SQL to query the database.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load data from CSV files
customers = pd.read_csv('customers.csv')
purchases = pd.read_csv('purchases.csv')
support_interactions = pd.read_csv('support_interactions.csv')
# Merge into a single DataFrame (assumes purchases and support
# interactions have been aggregated to one row per customer first)
data = pd.merge(customers, purchases, on='customer_id')
data = pd.merge(data, support_interactions, on='customer_id')
# Handle missing values
data.fillna(data.mean(numeric_only=True), inplace=True)
# Convert categorical variables to numerical variables
data['gender'] = data['gender'].map({'Male': 0, 'Female': 1})
data['support_channel'] = data['support_channel'].map({'Phone': 0, 'Email': 1, 'Chat': 2})
Alternatively, the same tables can be pulled straight from the database with SQL:
SELECT * FROM customers;
SELECT * FROM purchases;
SELECT * FROM support_interactions;
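If the tables live in a database rather than CSV files, the loading step can go through `pd.read_sql` instead of `pd.read_csv`. A minimal sketch using an in-memory SQLite database (the connection and sample rows are illustrative; in production the connection string would point at the company's warehouse):

```python
import sqlite3
import pandas as pd

# Illustrative in-memory database standing in for the real warehouse
conn = sqlite3.connect(":memory:")
pd.DataFrame({"customer_id": [1, 2], "age": [25, 30]}).to_sql(
    "customers", conn, index=False)

# Run the same SELECT and get a DataFrame back directly
customers = pd.read_sql("SELECT * FROM customers;", conn)
print(customers.shape)  # (2, 2)
```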
Step 2: Analysis Pipeline
Next, we will develop an analysis pipeline to identify the factors that influence customer churn.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('churn', axis=1), data['churn'], test_size=0.2, random_state=42)
# Scale data using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train a random forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train_scaled, y_train)
# Make predictions on testing set
y_pred = rfc.predict(X_test_scaled)
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
print('Classification Report:')
print(classification_report(y_test, y_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
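Since the stated business goal is identifying the factors that drive churn, it is worth inspecting the forest's `feature_importances_` after evaluation. A self-contained sketch on a tiny synthetic frame (column names and values are illustrative; on the real data you would use `X_train`'s columns):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the scaled training set
X = pd.DataFrame({"age": [25, 30, 35, 40, 45, 50],
                  "purchase_history": [100, 200, 300, 150, 50, 80]})
y = [0, 0, 0, 1, 1, 1]

rfc = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Importances sum to 1; sort descending to rank churn drivers
importances = pd.Series(rfc.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```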
Step 3: Model/Visualization Code
We will use the trained model to make predictions on new data and visualize the results.
import matplotlib.pyplot as plt
import seaborn as sns
# Score new customers (note: in practice the new frame must contain the
# same columns, in the same order, as the training data)
new_data = pd.DataFrame({'age': [25, 30, 35], 'purchase_history': [100, 200, 300], 'support_interactions': [5, 10, 15]})
new_data_scaled = scaler.transform(new_data)
new_pred = rfc.predict(new_data_scaled)
# Visualize results
plt.figure(figsize=(8, 6))
sns.countplot(x='churn', data=data)
plt.title('Customer Churn Distribution')
plt.show()
plt.figure(figsize=(8, 6))
sns.scatterplot(x='age', y='purchase_history', hue='churn', data=data)
plt.title('Customer Churn vs. Age and Purchase History')
plt.show()
Step 4: Performance Evaluation
We will evaluate the performance of the model using metrics such as accuracy, precision, recall, and F1 score.
from sklearn.metrics import precision_score, recall_score, f1_score
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)
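A single train/test split can give an optimistic or pessimistic read, so cross-validated scores are a more stable check before deployment. A sketch on synthetic data (the real pipeline would pass the scaled training features instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for the churn set
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Five-fold cross-validated F1 scores for the same model family
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y, cv=5, scoring="f1")
print(scores.mean())
```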
Step 5: Production Deployment
We will deploy the model to a production environment using a Flask API.
from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)

# Load the trained model and the fitted scaler (both must be persisted
# with joblib.dump after training; file names here are illustrative)
model = joblib.load('rfc_model.pkl')
scaler = joblib.load('scaler.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    new_data = pd.DataFrame([data])
    new_data_scaled = scaler.transform(new_data)
    pred = model.predict(new_data_scaled)
    # numpy integers are not JSON serializable, so cast to int
    return jsonify({'prediction': int(pred[0])})

if __name__ == '__main__':
    app.run(debug=True)
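The endpoint's contract can be exercised without standing up a server by using Flask's built-in test client. The toy app below returns a fixed prediction (the hypothetical handler body is a placeholder for the scale-then-predict logic above), but the route shape and JSON exchange match the deployment code:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # A real handler would scale `payload` and call model.predict here
    return jsonify({"prediction": 0, "received_keys": sorted(payload)})

# Exercise the route in-process with the test client
client = app.test_client()
resp = client.post("/predict", json={"age": 28, "purchase_history": 120})
result = resp.get_json()
print(result)
```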
Metrics/ROI Calculations
We estimate ROI as the net revenue gain from reduced churn divided by the cost of the project, expressed as a percentage.
# Assumption from the problem statement: a 10% reduction in churn is
# worth $1M in annual revenue; the project cost is an illustrative placeholder
reduction_in_churn_rate = 0.10
increase_in_revenue = 1_000_000
project_cost = 200_000  # illustrative
# ROI = net gain / cost, as a percentage
roi = (increase_in_revenue - project_cost) / project_cost * 100
print('ROI: {:.0f}%'.format(roi))  # 400%
Edge Cases
We will handle edge cases such as missing values, outliers, and categorical variables with high cardinality.
# Handle missing values in numeric columns
data.fillna(data.mean(numeric_only=True), inplace=True)
# Handle outliers with the IQR rule
Q1 = data['purchase_history'].quantile(0.25)
Q3 = data['purchase_history'].quantile(0.75)
IQR = Q3 - Q1
data = data[~((data['purchase_history'] < (Q1 - 1.5 * IQR)) | (data['purchase_history'] > (Q3 + 1.5 * IQR)))]
# Label-encode low-cardinality categoricals; columns with many unique
# values need a different treatment (frequency or target encoding)
data['support_channel'] = data['support_channel'].map({'Phone': 0, 'Email': 1, 'Chat': 2})
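For a genuinely high-cardinality column, where one-hot encoding would explode the feature space, frequency encoding is a simple alternative: each category is replaced by its relative frequency. A sketch with a hypothetical `city` column (not in the original dataset):

```python
import pandas as pd

# Hypothetical high-cardinality column for illustration
df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "SF", "NYC", "LA"]})

# Map each category to its share of the rows
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
print(df["city_freq"].round(3).tolist())
```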
Scaling Tips
To scale training on larger datasets, parallelize tree construction within the forest, and use joblib for coarse-grained parallelism across independent training jobs.
# Parallelize tree construction across all CPU cores
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rfc.fit(X_train, y_train)
# Coarse-grained parallelism with joblib: train several candidate models
# (here, different random seeds) concurrently
from joblib import Parallel, delayed
def train_model(X_train, y_train, seed):
    rfc = RandomForestClassifier(n_estimators=100, random_state=seed)
    rfc.fit(X_train, y_train)
    return rfc
models = Parallel(n_jobs=-1)(delayed(train_model)(X_train, y_train, seed) for seed in range(10))
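The same joblib pattern gives genuine data parallelism at scoring time: split the rows into chunks and predict each chunk in a separate worker. A self-contained sketch on synthetic data (chunk and worker counts are illustrative):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a large scoring batch
X, y = make_classification(n_samples=400, n_features=5, random_state=42)
rfc = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

# Split rows into chunks and predict each chunk in parallel
chunks = np.array_split(X, 4)
parts = Parallel(n_jobs=2)(delayed(rfc.predict)(c) for c in chunks)
preds = np.concatenate(parts)
print(preds.shape)  # (400,)
```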