Data Analyst Guide: Mastering Portfolio Projects That Impress Hiring Managers
A strong portfolio is crucial for a data analyst who wants to impress hiring managers and land their dream job. In this tutorial, we will work through a real-world project that demonstrates the skills and expertise required to succeed in the field.
Business Problem Statement
A retail company wants to analyze its customer purchase behavior and identify factors that influence sales. The goal is to develop a predictive model that can forecast sales and provide insights to improve marketing strategies. The company has provided a dataset containing customer demographics, purchase history, and sales data.
ROI Impact:
- Increase sales by 10% through targeted marketing campaigns
- Reduce customer churn by 5% through personalized recommendations
- Improve customer satisfaction by 15% through data-driven product development
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
First, we need to prepare the data for analysis. We will use pandas to handle missing values, encode categorical variables, and perform data normalization.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Load the dataset
data = pd.read_csv('customer_data.csv')
# Handle missing values
data.fillna(data.mean(numeric_only=True), inplace=True)  # numeric_only avoids errors on text columns
# Encode categorical variables
data['gender'] = data['gender'].map({'Male': 0, 'Female': 1})
data['age_group'] = pd.cut(data['age'], bins=[18, 25, 35, 45, 55, 65], labels=[0, 1, 2, 3, 4], include_lowest=True)  # include_lowest keeps age 18 in the first bin
# Normalize the data
scaler = StandardScaler()
data[['income', 'spend']] = scaler.fit_transform(data[['income', 'spend']])
We will also use SQL to extract relevant data from the database.
-- Create a table to store customer data
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(255),
email VARCHAR(255),
age INT,
income DECIMAL(10, 2),
spend DECIMAL(10, 2)
);
-- Insert data into the table
INSERT INTO customers (customer_id, name, email, age, income, spend)
VALUES
(1, 'John Doe', 'john.doe@example.com', 30, 50000.00, 2000.00),
(2, 'Jane Doe', 'jane.doe@example.com', 25, 40000.00, 1500.00),
(3, 'Bob Smith', 'bob.smith@example.com', 40, 60000.00, 2500.00);
-- Extract relevant data from the table
SELECT * FROM customers
WHERE age BETWEEN 25 AND 45 AND income > 40000.00;
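In practice, the SQL extract above feeds directly into the pandas step. Here is a minimal sketch using Python's built-in sqlite3 module, with an in-memory database standing in for the company's warehouse (same table and sample rows as above):

```python
import sqlite3
import pandas as pd

# In-memory SQLite database stands in for the real warehouse
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT, email TEXT,
        age INTEGER, income REAL, spend REAL
    )
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 'John Doe', 'john.doe@example.com', 30, 50000.00, 2000.00),
        (2, 'Jane Doe', 'jane.doe@example.com', 25, 40000.00, 1500.00),
        (3, 'Bob Smith', 'bob.smith@example.com', 40, 60000.00, 2500.00),
    ],
)

# Run the same filter query and load the result as a DataFrame
query = "SELECT * FROM customers WHERE age BETWEEN 25 AND 45 AND income > 40000.00"
df = pd.read_sql_query(query, conn)
print(df[['customer_id', 'name', 'age', 'income']])
```

Note that Jane Doe is filtered out: `income > 40000.00` is strict, so an income of exactly 40000.00 does not qualify.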
Step 2: Analysis Pipeline
Next, we will develop an analysis pipeline to identify factors that influence sales.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Split the data into training and testing sets.
# Keep only numeric feature columns; encoded categoricals such as age_group
# should be cast to numbers (e.g. .astype(float)) before this step.
features = data.drop('sales', axis=1).select_dtypes(include='number')
X_train, X_test, y_train, y_test = train_test_split(features, data['sales'], test_size=0.2, random_state=42)
# Develop a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
# Visualize the results
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.title('Sales Prediction')
plt.show()
Step 3: Model/Visualization Code
We will use the following code to develop a predictive model and visualize the results.
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Develop a random forest regressor model
model = RandomForestRegressor()
param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [5, 10, 15]}
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = grid_search.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
# Visualize the results
sns.set()
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred)
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.title('Sales Prediction')
plt.show()
Step 4: Performance Evaluation
We will use the following metrics to evaluate the performance of the model.
from sklearn.metrics import mean_absolute_error, r2_score
# Calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae:.2f}')
# Calculate the R-squared score
r2 = r2_score(y_test, y_pred)
print(f'R-squared Score: {r2:.2f}')
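A single train/test split can give a noisy estimate, especially on a small dataset. Cross-validation averages the score over several splits instead. Here is a minimal sketch on synthetic data, where make_regression stands in for the real customer dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the customer dataset
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)

# 5-fold cross-validated R-squared: a more stable estimate than one split
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f'R-squared per fold: {np.round(scores, 2)}')
print(f'Mean R-squared: {scores.mean():.2f} (+/- {scores.std():.2f})')
```

The fold-to-fold spread is itself useful: a large standard deviation suggests the score from any single split should not be trusted.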
Step 5: Production Deployment
Finally, we will deploy the model in a production-ready environment.
import pickle
# Save the best model to a file (best_estimator_ rather than the whole search object)
with open('sales_model.pkl', 'wb') as f:
    pickle.dump(grid_search.best_estimator_, f)
# Load the model from the file
with open('sales_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
# Make predictions on new data.
# Caveat: the new DataFrame must contain exactly the feature columns the model
# was trained on (including encoded ones such as gender and age_group).
new_data = pd.DataFrame({'age': [30], 'income': [50000.00], 'spend': [2000.00]})
new_prediction = loaded_model.predict(new_data)
print(f'Predicted Sales: {new_prediction[0]:.2f}')
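Because a column mismatch at this step fails at runtime (or, worse, silently misaligns features), a small validation wrapper is worth sketching. This is a minimal illustration with a hypothetical predict_sales helper and a toy stand-in model, not production code:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in model trained on three named features
train = pd.DataFrame({
    'age': [25.0, 30.0, 40.0, 50.0],
    'income': [40_000.0, 50_000.0, 60_000.0, 70_000.0],
    'spend': [1_500.0, 2_000.0, 2_500.0, 3_000.0],
})
sales = pd.Series([100.0, 150.0, 200.0, 260.0])
model = LinearRegression().fit(train, sales)

def predict_sales(model, new_data, feature_cols):
    """Predict only if new_data has every training feature column."""
    missing = set(feature_cols) - set(new_data.columns)
    if missing:
        raise ValueError(f'missing feature columns: {sorted(missing)}')
    # Reorder columns to the training layout before predicting
    return model.predict(new_data[list(feature_cols)])

pred = predict_sales(
    model,
    pd.DataFrame({'age': [35.0], 'income': [55_000.0], 'spend': [2_200.0]}),
    feature_cols=['age', 'income', 'spend'],
)
print(f'Predicted Sales: {pred[0]:.2f}')
```

Rejecting bad input with a clear error message is much easier to debug in production than a shape error deep inside scikit-learn.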
Metrics/ROI Calculations
The calculations below are illustrative placeholders rather than true ROI measurements: the gap between predicted and actual sales measures forecast error, not revenue, churn, or satisfaction. In practice, those business metrics are estimated by comparing outcomes before and after deployment, ideally with an A/B test.
# Projected revenue uplift under the stated 10% sales-lift goal (a business assumption, not a model output)
projected_uplift = 0.10 * y_test.sum()
print(f'Projected Revenue Uplift: ${projected_uplift:,.2f}')
# Mean relative forecast error as a sanity check on model quality
relative_error = (y_pred - y_test).abs().mean() / y_test.mean()
print(f'Mean Relative Forecast Error: {relative_error:.1%}')
Edge Cases
We will handle the following edge cases.
- Missing values: We will use the mean or median of the respective feature to replace missing values.
- Outliers: We will use the interquartile range (IQR) method to detect and remove outliers.
- Data imbalance: We will use oversampling or undersampling techniques to balance the data.
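The IQR rule mentioned above is straightforward to sketch. Here, remove_outliers_iqr is a hypothetical helper, and k=1.5 is the conventional multiplier:

```python
import pandas as pd

def remove_outliers_iqr(df, col, k=1.5):
    """Drop rows where df[col] falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[col].between(lower, upper)]

# Example: one extreme spend value gets dropped
df = pd.DataFrame({'spend': [1500, 2000, 2200, 2500, 2100, 50_000]})
clean = remove_outliers_iqr(df, 'spend')
print(clean)
```

Whether to drop, cap, or keep outliers depends on the business context; a single very large purchase may be a data error or a genuine whale customer.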
Scaling Tips
The following techniques help the pipeline scale as the dataset grows.
- Data parallelism: split the data into smaller chunks and process them in parallel.
- Model parallelism: split the model itself (for example, the trees of a random forest) across workers.
- Distributed computing: process the data and train the model across multiple machines, e.g. with Dask or Spark.
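As a minimal sketch of the data-parallelism idea, pandas can stream a large CSV in fixed-size chunks so each piece fits in memory (and could be handed off to a separate worker process). The in-memory CSV here is a stand-in for a large customer_data.csv:

```python
import io
import pandas as pd

# Stand-in for a customer_data.csv too large to load at once:
# 100 rows with spend = 10 * customer_id
csv = io.StringIO(
    'customer_id,spend\n' + '\n'.join(f'{i},{i * 10}' for i in range(1, 101))
)

# Process the file chunk by chunk instead of loading it all at once
total_spend = 0.0
for chunk in pd.read_csv(csv, chunksize=25):
    total_spend += chunk['spend'].sum()

print(f'Total spend: {total_spend:.2f}')
```

For the model-training side, scikit-learn estimators such as RandomForestRegressor accept n_jobs=-1 to train trees on all available cores.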
By following these steps and adapting the provided code, you can build a predictive model that forecasts sales and surfaces insights to improve marketing strategies. Remember to handle the edge cases above and apply the scaling techniques as your data grows.