amal org

Data Analyst Guide: Mastering Portfolio Projects That Impress Hiring Managers


Business Problem Statement

Real Scenario + ROI Impact

Let's consider a real-world scenario where an e-commerce company wants to analyze customer purchasing behavior and identify the most profitable customer segments. The goal is to develop a data-driven strategy to increase sales and revenue.

Problem Statement:
Analyze customer data to identify the most profitable customer segments and develop a targeted marketing strategy to increase sales and revenue.

ROI Impact:
By identifying the most profitable customer segments, the company projects a 15% increase in sales and a 20% increase in revenue within the next six months, for an estimated return of $1.2 million.
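
A quick sanity check on that projection: a $1.2 million return from a 20% revenue lift implies a roughly $6 million revenue baseline. That baseline is an illustrative assumption here, not a figure from the scenario:

```python
# Illustrative ROI arithmetic; baseline_revenue is an assumed figure,
# not part of the original scenario
baseline_revenue = 6_000_000   # assumed current annual revenue
revenue_lift = 0.20            # projected 20% revenue increase

incremental_revenue = baseline_revenue * revenue_lift
print(incremental_revenue)  # 1200000.0 -> matches the $1.2M estimate
```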

Step-by-Step Technical Solution

1. Data Preparation (pandas/SQL)

First, we need to prepare the data for analysis. We'll use a sample dataset containing customer information, purchase history, and demographic data.

SQL Query:

-- Create a sample dataset
CREATE TABLE customers (
    customer_id INT,
    name VARCHAR(255),
    email VARCHAR(255),
    age INT,
    income DECIMAL(10, 2),
    purchase_history VARCHAR(255)
);

INSERT INTO customers (customer_id, name, email, age, income, purchase_history)
VALUES
(1, 'John Doe', 'john.doe@example.com', 30, 50000.00, 'product1,product2'),
(2, 'Jane Doe', 'jane.doe@example.com', 25, 40000.00, 'product3,product4'),
(3, 'Bob Smith', 'bob.smith@example.com', 40, 60000.00, 'product1,product5'),
(4, 'Alice Johnson', 'alice.johnson@example.com', 35, 55000.00, 'product2,product6'),
(5, 'Mike Brown', 'mike.brown@example.com', 45, 70000.00, 'product3,product7');

-- Create a table for purchase history
CREATE TABLE purchases (
    purchase_id INT,
    customer_id INT,
    product_id VARCHAR(255),
    purchase_date DATE,
    purchase_amount DECIMAL(10, 2)
);

INSERT INTO purchases (purchase_id, customer_id, product_id, purchase_date, purchase_amount)
VALUES
(1, 1, 'product1', '2022-01-01', 100.00),
(2, 1, 'product2', '2022-01-15', 200.00),
(3, 2, 'product3', '2022-02-01', 50.00),
(4, 3, 'product1', '2022-03-01', 150.00),
(5, 4, 'product2', '2022-04-01', 250.00),
(6, 5, 'product3', '2022-05-01', 300.00);

Python Code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the data from the SQLite database
import sqlite3

conn = sqlite3.connect('database.db')
customers = pd.read_sql_query('SELECT * FROM customers', conn)
purchases = pd.read_sql_query('SELECT * FROM purchases', conn)
conn.close()

# Merge the two datasets
merged_data = pd.merge(customers, purchases, on='customer_id')

# Convert the comma-separated purchase history into a purchase count
merged_data['purchase_count'] = merged_data['purchase_history'].apply(lambda x: len(x.split(',')))

# Define the features and target variable
# (product_id serves as an illustrative classification target here)
X = merged_data[['age', 'income', 'purchase_count']]
y = merged_data['product_id']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
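
One habit worth showing in a portfolio project: validate the join before trusting it. A minimal sketch using pandas' `validate` and `indicator` options, with toy frames standing in for the real tables:

```python
import pandas as pd

# Toy stand-ins for the customers and purchases tables
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'age': [30, 25, 40]})
purchases = pd.DataFrame({'customer_id': [1, 1, 2], 'purchase_amount': [100.0, 200.0, 50.0]})

# validate='one_to_many' raises MergeError if customer_id is duplicated on the left;
# indicator=True adds a _merge column recording each row's provenance
merged = customers.merge(purchases, on='customer_id', how='left',
                         validate='one_to_many', indicator=True)

# Customer 3 has no purchases, so their row shows up as 'left_only'
print(merged['_merge'].value_counts())
```

An inner join (as in the pipeline above) would silently drop customer 3; the `left_only` rows make that loss visible.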

2. Analysis Pipeline

Next, we'll develop an analysis pipeline to identify the most profitable customer segments.

Python Code:

# Train a random forest classifier on the training data
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = rfc.predict(X_test)

# Evaluate the model's performance
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

# Identify the most profitable customer segments
segment1 = merged_data[(merged_data['age'] >= 30) & (merged_data['income'] >= 50000)]
segment2 = merged_data[(merged_data['age'] < 30) & (merged_data['income'] < 50000)]

print('Segment 1:')
print(segment1.describe())
print('Segment 2:')
print(segment2.describe())
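
Beyond `describe()`, a revenue rollup per segment makes profitability explicit. A minimal sketch on toy data, using the same age/income cutoffs as `segment1` above (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy customer data mirroring the merged dataset's columns
data = pd.DataFrame({
    'age': [30, 25, 40, 35, 45],
    'income': [50000, 40000, 60000, 55000, 70000],
    'purchase_amount': [100.0, 50.0, 150.0, 250.0, 300.0],
})

# Label each row with the same cutoffs used for segment1/segment2
data['segment'] = np.where((data['age'] >= 30) & (data['income'] >= 50000),
                           'older_high_income', 'other')

# Total, average, and count of purchases per segment
revenue = data.groupby('segment')['purchase_amount'].agg(['sum', 'mean', 'count'])
print(revenue)
```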

3. Model/Visualization Code

Now, we'll develop a model to predict the purchase amount based on the customer's age and income.

Python Code:

# Import the necessary libraries
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Define the features and target variable
X = merged_data[['age', 'income']]
y = merged_data['purchase_amount']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model on the training data
lr = LinearRegression()
lr.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = lr.predict(X_test)

# Evaluate the model's performance
print('Mean Absolute Error:', np.mean(np.abs(y_test - y_pred)))

# Visualize actual vs. predicted purchase amounts by age
# (sort by age so the regression line isn't drawn in a jagged order)
order = np.argsort(X_test['age'].to_numpy())
plt.scatter(X_test['age'], y_test, label='Actual')
plt.plot(X_test['age'].to_numpy()[order], y_pred[order], color='red', label='Predicted')
plt.xlabel('Age')
plt.ylabel('Purchase Amount')
plt.title('Predicted Purchase Amount vs Age')
plt.legend()
plt.show()

4. Performance Evaluation

We'll evaluate the performance of the model using various metrics.

Python Code:

# Calculate the mean absolute error
mae = np.mean(np.abs(y_test - y_pred))
print('Mean Absolute Error:', mae)

# Calculate the mean squared error
mse = np.mean((y_test - y_pred) ** 2)
print('Mean Squared Error:', mse)

# Calculate the root mean squared error
rmse = np.sqrt(mse)
print('Root Mean Squared Error:', rmse)
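
The same metrics also ship with scikit-learn, which keeps hand-rolled formulas from drifting out of sync with the library. A quick check on toy arrays:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy actual vs. predicted purchase amounts
y_true = np.array([100.0, 200.0, 150.0])
y_hat = np.array([110.0, 190.0, 160.0])

mae = mean_absolute_error(y_true, y_hat)   # 10.0
mse = mean_squared_error(y_true, y_hat)    # 100.0
rmse = np.sqrt(mse)                        # 10.0

# Matches the manual NumPy versions above
assert np.isclose(mae, np.mean(np.abs(y_true - y_hat)))
```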

5. Production Deployment

Finally, we'll deploy the model to a production environment.

Python Code:

# Import the necessary libraries
from flask import Flask, request, jsonify
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23+

# Load the trained model (saved earlier with joblib.dump)
model = joblib.load('model.pkl')

# Create a Flask app
app = Flask(__name__)

# Define a route for predicting the purchase amount
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    age = data['age']
    income = data['income']
    prediction = model.predict([[age, income]])
    # Cast the NumPy scalar to a plain float so it is JSON-serializable
    return jsonify({'prediction': float(prediction[0])})

# Run the app
if __name__ == '__main__':
    app.run(debug=True)
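
The route above assumes `model.pkl` already exists on disk. The save/load round trip can be sketched with the stdlib `pickle` module, using an illustrative dict in place of the trained regressor (for real scikit-learn estimators, `joblib.dump(lr, 'model.pkl')` is the usual equivalent):

```python
import pickle

# Stand-in for a trained model's learned parameters (illustrative values)
model_params = {'coef_': [1.5, 0.002], 'intercept_': 42.0}

# Serialize once, after training
with open('model.pkl', 'wb') as f:
    pickle.dump(model_params, f)

# The Flask app then reloads it once at startup, not per request
with open('model.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored == model_params)  # True
```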

Metrics/ROI Calculations

We'll calculate the ROI of the project based on the predicted purchase amount.

Python Code:

# Calculate the total predicted purchase amount
total_predicted_purchase_amount = np.sum(y_pred)

# Calculate the ROI against an assumed campaign cost of $100,000:
# ROI (%) = (gain - cost) / cost * 100
campaign_cost = 100000
roi = (total_predicted_purchase_amount - campaign_cost) / campaign_cost * 100
print('ROI (%):', roi)

Edge Cases

We'll handle edge cases such as missing values and outliers.

Python Code:

# Handle missing values (impute numeric columns with their column means;
# taking the mean of string columns like name or email would fail)
numeric_cols = merged_data.select_dtypes(include='number').columns
merged_data[numeric_cols] = merged_data[numeric_cols].fillna(merged_data[numeric_cols].mean())

# Handle outliers
Q1 = merged_data['purchase_amount'].quantile(0.25)
Q3 = merged_data['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1
merged_data = merged_data[~((merged_data['purchase_amount'] < (Q1 - 1.5 * IQR)) | (merged_data['purchase_amount'] > (Q3 + 1.5 * IQR)))]
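
The IQR rule is easy to verify on a small series where the outlier is obvious. A sketch with made-up amounts:

```python
import pandas as pd

# Five ordinary purchases plus one obvious outlier
amounts = pd.Series([100.0, 120.0, 110.0, 130.0, 115.0, 5000.0])

# Same 1.5 * IQR fence as above
Q1, Q3 = amounts.quantile(0.25), amounts.quantile(0.75)
IQR = Q3 - Q1
kept = amounts[~((amounts < Q1 - 1.5 * IQR) | (amounts > Q3 + 1.5 * IQR))]

print(kept.max())  # 130.0 -- the 5000.0 outlier is dropped
```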

Scaling Tips

A few tips for scaling this project beyond a sample dataset:

  • Use distributed computing frameworks such as Apache Spark or Hadoop to handle large datasets.
  • Use cloud-based services such as AWS or Google Cloud to deploy the model and handle high traffic.
  • Use containerization tools such as Docker to ensure consistency across different environments.
  • Use orchestration tools such as Kubernetes to manage and scale the deployment.
