Data Analyst Guide: Mastering Portfolio Projects That Impress Hiring Managers
As a data analyst, having a strong portfolio is crucial to showcasing your skills and experience to potential hiring managers. In this tutorial, we will walk through a real-world business problem and provide a step-by-step technical solution to help you master portfolio projects that impress.
Business Problem Statement
A retail company wants to analyze customer purchase behavior and identify the most profitable customer segments. The company has a large dataset of customer transactions, including demographic information, purchase history, and transaction amounts. The goal is to develop a predictive model that can identify high-value customers and provide recommendations for targeted marketing campaigns.
The company estimates that a 10% increase in sales from high-value customers can result in an additional $1 million in revenue per year. Therefore, the ROI impact of this project is significant, and the company is looking for a data analyst who can develop a robust and scalable solution.
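To make that estimate concrete, a 10% lift worth $1 million implies roughly $10 million in current annual revenue from the high-value segment. A quick back-of-envelope check (these figures are the company's estimates, not outputs of the analysis):
# Sanity-check the business case using the company's own estimates
projected_lift = 1_000_000  # $1M incremental annual revenue
lift_rate = 0.10            # assumed 10% sales increase
implied_baseline = projected_lift / lift_rate
print(f"Implied baseline revenue from high-value customers: ${implied_baseline:,.0f}")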
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
First, we need to prepare the data for analysis. We will use a combination of pandas and SQL to load, clean, and transform the data.
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
# Load data from database
engine = create_engine('postgresql://user:password@host:port/dbname')
query = """
SELECT *
FROM customer_transactions
WHERE transaction_date >= '2020-01-01'
"""
data = pd.read_sql_query(query, engine)
# Clean and transform data
data['transaction_date'] = pd.to_datetime(data['transaction_date'])
data['customer_birthdate'] = pd.to_datetime(data['customer_birthdate'])
# Compute age at the time of purchase instead of hard-coding a reference year
data['customer_age'] = (data['transaction_date'] - data['customer_birthdate']).dt.days // 365
data['transaction_amount'] = (data['transaction_amount'] * 100).astype(int)  # convert dollars to cents
# Handle missing values (column means are only defined for numeric columns)
numeric_cols = data.select_dtypes(include='number').columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
# Save data to CSV
data.to_csv('customer_transactions.csv', index=False)
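Before moving on, it is worth running a few lightweight sanity checks on the loaded data. This is a minimal sketch that assumes the column names used above:
# Basic data-quality checks (assumes the columns loaded above)
assert data['transaction_id'].is_unique, "Duplicate transaction IDs found"
assert (data['transaction_amount'] >= 0).all(), "Negative transaction amounts found"
print(data.dtypes)
print(f"Date range: {data['transaction_date'].min()} to {data['transaction_date'].max()}")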
For reference, the source table can be created and seeded with sample data in PostgreSQL:
-- Create table for customer transactions
CREATE TABLE customer_transactions (
transaction_id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL,
transaction_date DATE NOT NULL,
transaction_amount DECIMAL(10, 2) NOT NULL,
customer_birthdate DATE NOT NULL
);
-- Insert sample data
INSERT INTO customer_transactions (customer_id, transaction_date, transaction_amount, customer_birthdate)
VALUES
(1, '2020-01-01', 100.00, '1990-01-01'),
(2, '2020-01-15', 200.00, '1995-06-01'),
(3, '2020-02-01', 50.00, '1980-03-01');
Step 2: Analysis Pipeline
Next, we will build an analysis pipeline that flags high-value purchase activity, which we can then use to profile the most profitable customer segments.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Define the target: flag transactions above an illustrative $500 (50,000 cents) spend threshold
data['high_value'] = (data['transaction_amount'] > 50000).astype(int)
# Drop IDs, raw dates, and the label-defining column from the feature matrix
X = data.drop(['customer_id', 'transaction_id', 'transaction_date',
               'customer_birthdate', 'transaction_amount', 'high_value'], axis=1)
y = data['high_value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train random forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)
# Make predictions on testing set
y_pred = rfc.predict(X_test)
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
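The classifier above scores individual transactions, but segment-level insight usually comes from per-customer summaries. A common complement is an RFM (recency, frequency, monetary) profile clustered into segments; here is a minimal sketch assuming the columns from Step 1 (the choice of 4 clusters is arbitrary):
from sklearn.cluster import KMeans
# Aggregate transactions into per-customer RFM features
snapshot = data['transaction_date'].max()
rfm = data.groupby('customer_id').agg(
    recency=('transaction_date', lambda d: (snapshot - d.max()).days),
    frequency=('transaction_id', 'count'),
    monetary=('transaction_amount', 'sum'),
)
# Cluster customers into segments; inspect average RFM per segment to find the profitable ones
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm['segment'] = kmeans.fit_predict(rfm)
print(rfm.groupby('segment').mean())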
Step 3: Model/Visualization Code
Now, we will complement the classifier with a regression model that predicts customer spend, visualize its fit, and use the results to flag high-value customers and generate recommendations for targeted marketing campaigns.
import matplotlib.pyplot as plt
import seaborn as sns
# Develop a regression model that predicts spend (in cents) per transaction
from sklearn.linear_model import LinearRegression
y_amount = data['transaction_amount']
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X, y_amount, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train_r, y_train_r)
# Make predictions on the regression test set
y_pred_r = lr.predict(X_test_r)
# Visualize results
plt.scatter(y_test_r, y_pred_r)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Predictive Model Performance")
plt.show()
# Identify high-value customers using the label defined in Step 2
high_value_customers = data[data['high_value'] == 1]
print("High-Value Customers:")
print(high_value_customers)
# Provide recommendations for targeted marketing campaigns
recommendations = [
    {
        "customer_id": customer_id,
        "recommendation": "Targeted marketing campaign with personalized offers"
    }
    for customer_id in high_value_customers['customer_id'].unique()
]
print("Recommendations:")
print(recommendations)
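To make these recommendations more defensible, it helps to show which features drive the high-value prediction. A short sketch using the random forest trained in Step 2:
# Inspect which features the classifier relies on
importances = pd.Series(rfc.feature_importances_, index=X.columns).sort_values()
importances.plot(kind='barh')
plt.xlabel("Feature Importance")
plt.title("Drivers of High-Value Classification")
plt.tight_layout()
plt.show()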
Step 4: Performance Evaluation
We will evaluate the performance of our predictive model using metrics such as mean absolute error (MAE) and mean squared error (MSE).
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Evaluate the regression model on its held-out test set
mae = mean_absolute_error(y_test_r, y_pred_r)
mse = mean_squared_error(y_test_r, y_pred_r)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
Step 5: Production Deployment
Finally, we will deploy our predictive model to a production environment using a cloud-based platform such as AWS SageMaker.
import boto3
# Create a SageMaker client (note: SageMaker resource names must use hyphens, not underscores)
sagemaker = boto3.client('sagemaker')
# Register the trained model artifact with SageMaker
model_name = "customer-segmentation-model"
sagemaker.create_model(
    ModelName=model_name,
    ExecutionRoleArn="arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-123456789012",
    PrimaryContainer={
        "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-decision-trees:1.0.4",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz"
    }
)
# Create an endpoint configuration; ProductionVariants belongs here, not on create_endpoint
config_name = "customer-segmentation-config"
sagemaker.create_endpoint_config(
    EndpointConfigName=config_name,
    ProductionVariants=[
        {
            "VariantName": "variant-1",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.xlarge"
        }
    ]
)
# Deploy the model to production by creating the endpoint from the configuration
endpoint_name = "customer-segmentation-endpoint"
sagemaker.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=config_name
)
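Once the endpoint reports InService, new customers can be scored over HTTPS. A minimal sketch, assuming the endpoint name above and a container that accepts CSV input (the payload format depends on your serving container):
import boto3
# Runtime client for invoking deployed endpoints
runtime = boto3.client('sagemaker-runtime')
# Hypothetical single-row CSV payload; adjust to your container's expected schema
payload = "34\n"  # e.g. customer_age
response = runtime.invoke_endpoint(
    EndpointName="customer-segmentation-endpoint",
    ContentType="text/csv",
    Body=payload
)
print(response['Body'].read().decode('utf-8'))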
Metrics/ROI Calculations
We will calculate the ROI of our project by estimating the increase in sales from high-value customers.
# Calculate ROI as net gain divided by project cost
increase_in_sales = 1_000_000  # $1 million in incremental annual revenue
project_cost = 100_000         # estimated project cost
roi = (increase_in_sales - project_cost) / project_cost
print(f"ROI: {roi:.0%}")
Edge Cases
We will handle edge cases such as missing values, outliers, and non-linear relationships between variables.
# Handle missing values (numeric columns only, as in Step 1)
numeric_cols = data.select_dtypes(include='number').columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
# Handle outliers with a robust loss (HuberRegressor lives in sklearn.linear_model)
from sklearn.linear_model import HuberRegressor
hr = HuberRegressor()
hr.fit(X_train_r, y_train_r)
# Handle non-linear relationships
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2)
X_train_poly = pf.fit_transform(X_train)
X_test_poly = pf.transform(X_test)
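Another common guard against outliers is to winsorize extreme values before modeling. A minimal sketch using interquartile-range fences (the 1.5 multiplier is a conventional choice, not a requirement):
# Clip transaction amounts to IQR fences to limit outlier influence
q1 = data['transaction_amount'].quantile(0.25)
q3 = data['transaction_amount'].quantile(0.75)
iqr = q3 - q1
data['transaction_amount'] = data['transaction_amount'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)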
Scaling Tips
For larger datasets, consider scaling techniques such as distributed computing, parallel processing, and cloud-based platforms.
# Use distributed computing
from joblib import Parallel, delayed
def train_model(X_chunk, y_chunk):
    # Train an independent model on one chunk of the data
    model = RandomForestClassifier(n_estimators=10, random_state=42)
    model.fit(X_chunk, y_chunk)
    return model
X_train_split = np.array_split(X_train, 10)
y_train_split = np.array_split(y_train, 10)
results = Parallel(n_jobs=10)(delayed(train_model)(X_train_split[i], y_train_split[i]) for i in range(10))
# Use parallel processing
from multiprocessing import Pool
def train_model_args(args):
    # Unpack the (features, labels) tuple and delegate to train_model
    X_chunk, y_chunk = args
    return train_model(X_chunk, y_chunk)
# In a standalone script, wrap Pool usage in `if __name__ == "__main__":` to avoid spawn issues
pool = Pool(processes=10)
results = pool.map(train_model_args, [(X_train_split[i], y_train_split[i]) for i in range(10)])
pool.close()
pool.join()
# Use cloud-based platforms: persist the model, then upload the artifact to object storage
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
joblib.dump(rfc, "model.pkl")
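On the serving side, the same artifact can be loaded back, for example inside a container or an inference script. A minimal sketch (the file name matches the dump above):
import joblib
# Load the persisted model and score a batch of customers
model = joblib.load("model.pkl")
predictions = model.predict(X_test)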
By following this tutorial, you can develop a robust and scalable solution to identify high-value customers and provide recommendations for targeted marketing campaigns. Remember to handle edge cases, calculate ROI, and provide scaling tips to ensure the success of your project.