amal org

Data Analyst Guide: Mastering Portfolio Projects That Impress Hiring Managers


Business Problem Statement

Real Scenario + ROI Impact

Let's consider a real-world scenario where an e-commerce company wants to analyze customer purchasing behavior and identify the most profitable customer segments. The goal is to develop a data-driven strategy to increase sales and revenue.

Problem Statement:
Analyze customer data to identify the most profitable customer segments and develop a targeted marketing strategy to increase sales and revenue.

ROI Impact:
By identifying the most profitable customer segments, the company projects a 15% increase in sales and a 20% increase in revenue within the next six months, for an estimated return of $1.2 million.
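
A quick sanity check on that projection: a $1.2 million return from a 20% revenue lift implies a roughly $6 million revenue baseline. That baseline is an illustrative assumption here, not a figure from the scenario:

```python
# Illustrative ROI arithmetic; baseline_revenue is an assumed figure,
# not part of the original scenario
baseline_revenue = 6_000_000   # assumed current annual revenue
revenue_lift = 0.20            # projected 20% revenue increase

incremental_revenue = baseline_revenue * revenue_lift
print(incremental_revenue)  # 1200000.0 -> matches the $1.2M estimate
```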

Step-by-Step Technical Solution

1. Data Preparation (pandas/SQL)

First, we need to prepare the data for analysis. We'll use a sample dataset containing customer information, purchase history, and demographic data.

SQL Query:

-- Create a sample dataset
CREATE TABLE customers (
    customer_id INT,
    name VARCHAR(255),
    email VARCHAR(255),
    age INT,
    income DECIMAL(10, 2),
    purchase_history VARCHAR(255)
);

INSERT INTO customers (customer_id, name, email, age, income, purchase_history)
VALUES
(1, 'John Doe', 'john.doe@example.com', 30, 50000.00, 'product1,product2'),
(2, 'Jane Doe', 'jane.doe@example.com', 25, 40000.00, 'product3,product4'),
(3, 'Bob Smith', 'bob.smith@example.com', 40, 60000.00, 'product1,product5'),
(4, 'Alice Johnson', 'alice.johnson@example.com', 35, 55000.00, 'product2,product6'),
(5, 'Mike Brown', 'mike.brown@example.com', 45, 70000.00, 'product3,product7');

-- Create a table for purchase history
CREATE TABLE purchases (
    purchase_id INT,
    customer_id INT,
    product_id VARCHAR(255),
    purchase_date DATE,
    purchase_amount DECIMAL(10, 2)
);

INSERT INTO purchases (purchase_id, customer_id, product_id, purchase_date, purchase_amount)
VALUES
(1, 1, 'product1', '2022-01-01', 100.00),
(2, 1, 'product2', '2022-01-15', 200.00),
(3, 2, 'product3', '2022-02-01', 50.00),
(4, 3, 'product1', '2022-03-01', 150.00),
(5, 4, 'product2', '2022-04-01', 250.00),
(6, 5, 'product3', '2022-05-01', 300.00);

Python Code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the data from the SQLite database
import sqlite3

conn = sqlite3.connect('database.db')
customers = pd.read_sql_query('SELECT * FROM customers', conn)
purchases = pd.read_sql_query('SELECT * FROM purchases', conn)
conn.close()

# Merge the two datasets
merged_data = pd.merge(customers, purchases, on='customer_id')

# Convert the comma-separated purchase history into a purchase count
merged_data['purchase_count'] = merged_data['purchase_history'].apply(lambda x: len(x.split(',')))

# Define the features and target variable
# (product_id serves as an illustrative classification target here)
X = merged_data[['age', 'income', 'purchase_count']]
y = merged_data['product_id']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
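
One habit worth showing in a portfolio project: validate the join before trusting it. A minimal sketch using pandas' `validate` and `indicator` options, with toy frames standing in for the real tables:

```python
import pandas as pd

# Toy stand-ins for the customers and purchases tables
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'age': [30, 25, 40]})
purchases = pd.DataFrame({'customer_id': [1, 1, 2], 'purchase_amount': [100.0, 200.0, 50.0]})

# validate='one_to_many' raises MergeError if customer_id is duplicated on the left;
# indicator=True adds a _merge column recording each row's provenance
merged = customers.merge(purchases, on='customer_id', how='left',
                         validate='one_to_many', indicator=True)

# Customer 3 has no purchases, so their row shows up as 'left_only'
print(merged['_merge'].value_counts())
```

An inner join (as in the pipeline above) would silently drop customer 3; the `left_only` rows make that loss visible.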

2. Analysis Pipeline

Next, we'll develop an analysis pipeline to identify the most profitable customer segments.

Python Code:

# Train a random forest classifier on the training data
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = rfc.predict(X_test)

# Evaluate the model's performance
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

# Identify the most profitable customer segments
segment1 = merged_data[(merged_data['age'] >= 30) & (merged_data['income'] >= 50000)]
segment2 = merged_data[(merged_data['age'] < 30) & (merged_data['income'] < 50000)]

print('Segment 1:')
print(segment1.describe())
print('Segment 2:')
print(segment2.describe())
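
Beyond `describe()`, a revenue rollup per segment makes profitability explicit. A minimal sketch on toy data, using the same age/income cutoffs as `segment1` above (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy customer data mirroring the merged dataset's columns
data = pd.DataFrame({
    'age': [30, 25, 40, 35, 45],
    'income': [50000, 40000, 60000, 55000, 70000],
    'purchase_amount': [100.0, 50.0, 150.0, 250.0, 300.0],
})

# Label each row with the same cutoffs used for segment1/segment2
data['segment'] = np.where((data['age'] >= 30) & (data['income'] >= 50000),
                           'older_high_income', 'other')

# Total, average, and count of purchases per segment
revenue = data.groupby('segment')['purchase_amount'].agg(['sum', 'mean', 'count'])
print(revenue)
```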

3. Model/Visualization Code

Now, we'll develop a model to predict the purchase amount based on the customer's age and income.

Python Code:

# Import the necessary libraries
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Define the features and target variable
X = merged_data[['age', 'income']]
y = merged_data['purchase_amount']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model on the training data
lr = LinearRegression()
lr.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = lr.predict(X_test)

# Evaluate the model's performance
print('Mean Absolute Error:', np.mean(np.abs(y_test - y_pred)))

# Visualize actual vs. predicted purchase amounts by age
# (sort by age so the regression line isn't drawn in a jagged order)
order = np.argsort(X_test['age'].to_numpy())
plt.scatter(X_test['age'], y_test, label='Actual')
plt.plot(X_test['age'].to_numpy()[order], y_pred[order], color='red', label='Predicted')
plt.xlabel('Age')
plt.ylabel('Purchase Amount')
plt.title('Predicted Purchase Amount vs Age')
plt.legend()
plt.show()

4. Performance Evaluation

We'll evaluate the performance of the model using various metrics.

Python Code:

# Calculate the mean absolute error
mae = np.mean(np.abs(y_test - y_pred))
print('Mean Absolute Error:', mae)

# Calculate the mean squared error
mse = np.mean((y_test - y_pred) ** 2)
print('Mean Squared Error:', mse)

# Calculate the root mean squared error
rmse = np.sqrt(mse)
print('Root Mean Squared Error:', rmse)
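
The same metrics also ship with scikit-learn, which keeps hand-rolled formulas from drifting out of sync with the library. A quick check on toy arrays:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy actual vs. predicted purchase amounts
y_true = np.array([100.0, 200.0, 150.0])
y_hat = np.array([110.0, 190.0, 160.0])

mae = mean_absolute_error(y_true, y_hat)   # 10.0
mse = mean_squared_error(y_true, y_hat)    # 100.0
rmse = np.sqrt(mse)                        # 10.0

# Matches the manual NumPy versions above
assert np.isclose(mae, np.mean(np.abs(y_true - y_hat)))
```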

5. Production Deployment

Finally, we'll deploy the model to a production environment.

Python Code:

# Import the necessary libraries
from flask import Flask, request, jsonify
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23+

# Load the trained model (saved earlier with joblib.dump)
model = joblib.load('model.pkl')

# Create a Flask app
app = Flask(__name__)

# Define a route for predicting the purchase amount
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    age = data['age']
    income = data['income']
    prediction = model.predict([[age, income]])
    # Cast the NumPy scalar to a plain float so it is JSON-serializable
    return jsonify({'prediction': float(prediction[0])})

# Run the app
if __name__ == '__main__':
    app.run(debug=True)
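
The route above assumes `model.pkl` already exists on disk. The save/load round trip can be sketched with the stdlib `pickle` module, using an illustrative dict in place of the trained regressor (for real scikit-learn estimators, `joblib.dump(lr, 'model.pkl')` is the usual equivalent):

```python
import pickle

# Stand-in for a trained model's learned parameters (illustrative values)
model_params = {'coef_': [1.5, 0.002], 'intercept_': 42.0}

# Serialize once, after training
with open('model.pkl', 'wb') as f:
    pickle.dump(model_params, f)

# The Flask app then reloads it once at startup, not per request
with open('model.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored == model_params)  # True
```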

Metrics/ROI Calculations

We'll calculate the ROI of the project based on the predicted purchase amount.

Python Code:

# Calculate the total predicted purchase amount
total_predicted_purchase_amount = np.sum(y_pred)

# Calculate the ROI against an assumed campaign cost of $100,000:
# ROI (%) = (gain - cost) / cost * 100
campaign_cost = 100000
roi = (total_predicted_purchase_amount - campaign_cost) / campaign_cost * 100
print('ROI (%):', roi)

Edge Cases

We'll handle edge cases such as missing values and outliers.

Python Code:

# Handle missing values (impute numeric columns with their column means;
# taking the mean of string columns like name or email would fail)
numeric_cols = merged_data.select_dtypes(include='number').columns
merged_data[numeric_cols] = merged_data[numeric_cols].fillna(merged_data[numeric_cols].mean())

# Handle outliers
Q1 = merged_data['purchase_amount'].quantile(0.25)
Q3 = merged_data['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1
merged_data = merged_data[~((merged_data['purchase_amount'] < (Q1 - 1.5 * IQR)) | (merged_data['purchase_amount'] > (Q3 + 1.5 * IQR)))]
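
The IQR rule is easy to verify on a small series where the outlier is obvious. A sketch with made-up amounts:

```python
import pandas as pd

# Five ordinary purchases plus one obvious outlier
amounts = pd.Series([100.0, 120.0, 110.0, 130.0, 115.0, 5000.0])

# Same 1.5 * IQR fence as above
Q1, Q3 = amounts.quantile(0.25), amounts.quantile(0.75)
IQR = Q3 - Q1
kept = amounts[~((amounts < Q1 - 1.5 * IQR) | (amounts > Q3 + 1.5 * IQR))]

print(kept.max())  # 130.0 -- the 5000.0 outlier is dropped
```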

Scaling Tips

A few tips for scaling this project beyond a sample dataset:

  • Use distributed computing frameworks such as Apache Spark or Hadoop to handle large datasets.
  • Use cloud-based services such as AWS or Google Cloud to deploy the model and handle high traffic.
  • Use containerization tools such as Docker to ensure consistency across different environments.
  • Use orchestration tools such as Kubernetes to manage and scale the deployment.
