Data Analyst Guide: Mastering Portfolio Projects That Impress Hiring Managers

A strong portfolio is crucial for showcasing your skills and experience to hiring managers. In this tutorial, we will walk through a real-world business problem and build a step-by-step technical solution you can adapt for your own portfolio projects.

Business Problem Statement

A retail company wants to analyze customer purchase behavior and identify the most profitable customer segments. The company has a large dataset of customer transactions, including demographic information, purchase history, and transaction amounts. The goal is to develop a predictive model that can identify high-value customers and provide recommendations for targeted marketing campaigns.

The company estimates that a 10% increase in sales from high-value customers can result in an additional $1 million in revenue per year. Therefore, the ROI impact of this project is significant, and the company is looking for a data analyst who can develop a robust and scalable solution.
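
As a quick sanity check on the business case, the implied baseline is easy to back out. The source states only the 10% figure and the $1 million uplift; the arithmetic below just makes the hidden assumption explicit:

# Back-of-the-envelope check on the stated business case
uplift = 1_000_000   # additional revenue per year (given)
uplift_rate = 0.10   # 10% sales increase (given)
implied_baseline = uplift / uplift_rate  # assumes uplift = rate * baseline
print(f"Implied current high-value sales: ${implied_baseline:,.0f}/year")  # $10,000,000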

Step-by-Step Technical Solution

Step 1: Data Preparation (pandas/SQL)

First, we need to prepare the data for analysis. We will use a combination of pandas and SQL to load, clean, and transform the data.

import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# Load data from database
engine = create_engine('postgresql://user:password@host:port/dbname')
query = """
    SELECT *
    FROM customer_transactions
    WHERE transaction_date >= '2020-01-01'
"""
data = pd.read_sql_query(query, engine)

# Clean and transform data
data['transaction_date'] = pd.to_datetime(data['transaction_date'])
data['customer_birthdate'] = pd.to_datetime(data['customer_birthdate'])

# Age at transaction time (avoids hard-coding the current year)
data['customer_age'] = (
    (data['transaction_date'] - data['customer_birthdate']).dt.days // 365
)

# Keep transaction amounts in dollars so later thresholds stay readable
data['transaction_amount'] = data['transaction_amount'].astype(float)

# Handle missing values (mean imputation on numeric columns only)
data.fillna(data.mean(numeric_only=True), inplace=True)

# Save data to CSV
data.to_csv('customer_transactions.csv', index=False)

For reference, here is the schema and sample seed data for the source table:

-- Create table for customer transactions
CREATE TABLE customer_transactions (
    transaction_id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    transaction_date DATE NOT NULL,
    transaction_amount DECIMAL(10, 2) NOT NULL,
    customer_birthdate DATE NOT NULL
);

-- Insert sample data
INSERT INTO customer_transactions (customer_id, transaction_date, transaction_amount, customer_birthdate)
VALUES
    (1, '2020-01-01', 100.00, '1990-01-01'),
    (2, '2020-01-15', 200.00, '1995-06-01'),
    (3, '2020-02-01', 50.00, '1980-03-01');
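
Before modeling, it usually pays to roll transactions up to one row per customer. Here is a minimal sketch of RFM-style (recency, frequency, monetary) feature engineering, assuming the cleaned data frame from Step 1; the feature names are illustrative choices, not part of the original schema:

# Aggregate transactions to one row per customer (RFM-style features)
snapshot = data['transaction_date'].max()
customers = data.groupby('customer_id').agg(
    recency_days=('transaction_date', lambda d: (snapshot - d.max()).days),
    frequency=('transaction_id', 'count'),
    monetary=('transaction_amount', 'sum'),
    age=('customer_age', 'max'),
).reset_index()
print(customers.head())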

Step 2: Analysis Pipeline

Next, we will develop an analysis pipeline to identify the most profitable customer segments.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define the prediction target: is a customer high-value?
# "High value" here means total spend in the top quartile (a business choice).
customer_totals = data.groupby('customer_id')['transaction_amount'].transform('sum')
data['high_value'] = (customer_totals >= customer_totals.quantile(0.75)).astype(int)

# Features: drop identifiers, raw dates, and the label itself.
# In a real project, derive the label from a later period than the features
# to avoid leakage.
X = data.drop(['customer_id', 'transaction_id', 'transaction_date',
               'customer_birthdate', 'high_value'], axis=1)
y = data['high_value']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train random forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

# Make predictions on testing set
y_pred = rfc.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
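
A single train/test split can be noisy, especially on smaller datasets. Cross-validation gives a steadier estimate, and feature importances show what the model actually relies on; a short sketch using the objects defined above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy is more stable than a single split
scores = cross_val_score(rfc, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Rank features by how much the forest relies on them
for name, importance in sorted(zip(X.columns, rfc.feature_importances_),
                               key=lambda t: -t[1]):
    print(f"{name}: {importance:.3f}")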

Step 3: Model/Visualization Code

Next, we will build a regression model that estimates customer value, then use it to flag high-value customers and generate recommendations for targeted marketing campaigns.

import matplotlib.pyplot as plt
import seaborn as sns  # optional styling; not required below

# Develop a regression model: predict transaction amount as a proxy for value
from sklearn.linear_model import LinearRegression

X_reg = X.drop('transaction_amount', axis=1)
y_reg = data['transaction_amount']
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

lr = LinearRegression()
lr.fit(Xr_train, yr_train)

# Make predictions on the testing set
yr_pred = lr.predict(Xr_test)

# Visualize results
plt.scatter(yr_test, yr_pred)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Predictive Model Performance")
plt.show()

# Identify high-value customers by total spend ($1,000 threshold is illustrative)
totals = data.groupby('customer_id')['transaction_amount'].sum()
high_value_customers = totals[totals > 1000].index.tolist()
print("High-Value Customers:")
print(high_value_customers)

# Provide recommendations for targeted marketing campaigns
recommendations = [
    {
        "customer_id": customer,
        "recommendation": "Targeted marketing campaign with personalized offers",
    }
    for customer in high_value_customers
]
print("Recommendations:")
print(recommendations)
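
Supervised models answer "who is high value" for a label we defined ourselves; unsupervised clustering can surface segments we did not define up front. A minimal sketch with k-means on the per-customer RFM features built earlier (k=4 is an assumption; validate it with the elbow method or silhouette score):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale features so no single one dominates the distance metric
features = customers[['recency_days', 'frequency', 'monetary']]
scaled = StandardScaler().fit_transform(features)

# k=4 is illustrative, not a recommendation
km = KMeans(n_clusters=4, random_state=42, n_init=10)
customers['segment'] = km.fit_predict(scaled)
print(customers.groupby('segment')[['recency_days', 'frequency', 'monetary']].mean())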

Step 4: Performance Evaluation

We will evaluate the performance of our predictive model using metrics such as mean absolute error (MAE) and mean squared error (MSE).

from sklearn.metrics import mean_absolute_error, mean_squared_error

# Evaluate the Step 3 regression model
mae = mean_absolute_error(yr_test, yr_pred)
mse = mean_squared_error(yr_test, yr_pred)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
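
Error numbers are hard to interpret in isolation. Comparing against a trivial baseline that always predicts the training mean shows whether the model adds any value; a quick sketch:

from sklearn.dummy import DummyRegressor

# Baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy='mean').fit(Xr_train, yr_train)
baseline_mae = mean_absolute_error(yr_test, baseline.predict(Xr_test))
print("Baseline MAE:", baseline_mae)
print("Model beats baseline:", mae < baseline_mae)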

Step 5: Production Deployment

Finally, we will deploy our predictive model to a production environment using a cloud-based platform such as AWS SageMaker.

import boto3

# Create SageMaker client
sagemaker = boto3.client('sagemaker')

# Register the model artifact (SageMaker names allow hyphens, not underscores)
model_name = "customer-segmentation-model"
sagemaker.create_model(
    ModelName=model_name,
    ExecutionRoleArn="arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-123456789012",
    PrimaryContainer={
        "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-decision-trees:1.0.4",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz"
    }
)

# An endpoint config binds the model to instance type and count
config_name = "customer-segmentation-config"
sagemaker.create_endpoint_config(
    EndpointConfigName=config_name,
    ProductionVariants=[
        {
            "VariantName": "variant-1",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.xlarge"
        }
    ]
)

# Create the endpoint; this provisions the instance and deploys the model
endpoint_name = "customer-segmentation-endpoint"
sagemaker.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=config_name
)
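
Once the endpoint reports InService, applications can call it through the runtime API. A sketch of an invocation; the payload format depends entirely on the serving container (CSV is common for scikit-learn containers), and the feature row shown is illustrative:

import boto3

runtime = boto3.client('sagemaker-runtime')

# Payload format depends on the serving container; CSV is typical for sklearn
response = runtime.invoke_endpoint(
    EndpointName="customer-segmentation-endpoint",
    ContentType="text/csv",
    Body="120,5,2500.00,34\n",  # illustrative feature row
)
print(response['Body'].read().decode('utf-8'))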

Metrics/ROI Calculations

We will calculate the ROI of our project by estimating the increase in sales from high-value customers.

# Calculate ROI = (gain - cost) / cost
increase_in_sales = 1_000_000  # estimated additional revenue per year
project_cost = 100_000         # estimated project cost
roi = (increase_in_sales - project_cost) / project_cost
print(f"ROI: {roi:.0%}")  # 900% under these estimates
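
Both inputs are estimates, so a quick sensitivity sweep shows how the conclusion holds up if the uplift comes in lower than hoped (the candidate values below are arbitrary illustrations):

# Sensitivity: ROI across a range of uplift estimates, cost held fixed
for uplift in (250_000, 500_000, 1_000_000):
    print(f"Uplift ${uplift:,}: ROI {(uplift - project_cost) / project_cost:.0%}")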

Edge Cases

We will handle edge cases such as missing values, outliers, and non-linear relationships between variables.

# Handle missing values (mean imputation on numeric columns only)
data.fillna(data.mean(numeric_only=True), inplace=True)

# Handle outliers: Huber regression down-weights extreme observations
from sklearn.linear_model import HuberRegressor  # note: lives in linear_model
hr = HuberRegressor()
hr.fit(Xr_train, yr_train)

# Handle non-linear relationships with polynomial features
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2)
Xr_train_poly = pf.fit_transform(Xr_train)
Xr_test_poly = pf.transform(Xr_test)
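
Robust estimators are one option; another is to cap extreme values in the data itself. A sketch of IQR-based clipping (the 1.5x multiplier is the conventional rule of thumb, not a tuned value):

# Cap transaction amounts outside 1.5x the interquartile range
q1 = data['transaction_amount'].quantile(0.25)
q3 = data['transaction_amount'].quantile(0.75)
iqr = q3 - q1
data['transaction_amount'] = data['transaction_amount'].clip(
    lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr
)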

Scaling Tips

We will provide scaling tips such as using distributed computing, parallel processing, and cloud-based platforms.

# Use distributed computing: train on data shards in parallel with joblib
from joblib import Parallel, delayed

def train_model(X_part, y_part):
    # Train one model per data shard (fill in with your estimator)
    pass

X_train_split = np.array_split(X_train, 10)
y_train_split = np.array_split(y_train, 10)
results = Parallel(n_jobs=10)(
    delayed(train_model)(X_train_split[i], y_train_split[i]) for i in range(10)
)

# Use parallel processing with multiprocessing
# (on Windows and macOS, guard this with `if __name__ == '__main__':`)
from multiprocessing import Pool

def train_model_on_shard(args):
    X_part, y_part = args
    # Train one model per shard
    pass

pool = Pool(processes=10)
results = pool.map(
    train_model_on_shard,
    [(X_train_split[i], y_train_split[i]) for i in range(10)]
)
pool.close()
pool.join()

# Persist the trained model for cloud deployment
# (`sklearn.externals.joblib` was removed from scikit-learn; import joblib directly)
import joblib
joblib.dump(rfc, "model.pkl")
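
Before reaching for custom sharding, note that many scikit-learn estimators already parallelize internally, which is often the simplest win:

# Let the forest use all available cores
rfc = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)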

By working through this tutorial, you can build a robust, scalable solution that identifies high-value customers and supports targeted marketing campaigns. When presenting a project like this in your portfolio, remember to handle edge cases, quantify the ROI, and show that your solution can scale.
