Data Analyst Guide: Mastering Portfolio Projects That Impress Hiring Managers
As a data analyst, having a strong portfolio is crucial to showcasing your skills and experience to potential hiring managers. In this tutorial, we will walk through a real-world business problem and provide a step-by-step technical solution to help you master portfolio projects that impress.
Business Problem Statement
A retail company wants to analyze customer purchase behavior and identify the most profitable customer segments. The company has a large dataset of customer transactions, including demographic information, purchase history, and transaction amounts. The goal is to develop a predictive model that can identify high-value customers and provide recommendations for targeted marketing campaigns.
The company estimates that a 10% increase in sales from high-value customers can result in an additional $1 million in revenue per year. Therefore, the ROI impact of this project is significant, and the company is looking for a data analyst who can develop a robust and scalable solution.
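To make that estimate concrete, a 10% lift worth $1 million implies roughly $10 million in current annual revenue from the high-value segment. A quick back-of-envelope check (these figures are the company's estimates, not outputs of the analysis):
# Sanity-check the business case using the company's own estimates
projected_lift = 1_000_000  # $1M incremental annual revenue
lift_rate = 0.10            # assumed 10% sales increase
implied_baseline = projected_lift / lift_rate
print(f"Implied baseline revenue from high-value customers: ${implied_baseline:,.0f}")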
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
First, we need to prepare the data for analysis. We will use a combination of pandas and SQL to load, clean, and transform the data.
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
# Load data from database
engine = create_engine('postgresql://user:password@host:port/dbname')
query = """
SELECT *
FROM customer_transactions
WHERE transaction_date >= '2020-01-01'
"""
data = pd.read_sql_query(query, engine)
# Clean and transform data
data['transaction_date'] = pd.to_datetime(data['transaction_date'])
data['customer_birthdate'] = pd.to_datetime(data['customer_birthdate'])
# Compute age at the time of purchase instead of hard-coding a reference year
data['customer_age'] = (data['transaction_date'] - data['customer_birthdate']).dt.days // 365
data['transaction_amount'] = (data['transaction_amount'] * 100).astype(int)  # convert dollars to cents
# Handle missing values (column means are only defined for numeric columns)
numeric_cols = data.select_dtypes(include='number').columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
# Save data to CSV
data.to_csv('customer_transactions.csv', index=False)
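Before moving on, it is worth running a few lightweight sanity checks on the loaded data. This is a minimal sketch that assumes the column names used above:
# Basic data-quality checks (assumes the columns loaded above)
assert data['transaction_id'].is_unique, "Duplicate transaction IDs found"
assert (data['transaction_amount'] >= 0).all(), "Negative transaction amounts found"
print(data.dtypes)
print(f"Date range: {data['transaction_date'].min()} to {data['transaction_date'].max()}")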
For reference, the source table can be created and seeded with sample data in PostgreSQL:
-- Create table for customer transactions
CREATE TABLE customer_transactions (
transaction_id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL,
transaction_date DATE NOT NULL,
transaction_amount DECIMAL(10, 2) NOT NULL,
customer_birthdate DATE NOT NULL
);
-- Insert sample data
INSERT INTO customer_transactions (customer_id, transaction_date, transaction_amount, customer_birthdate)
VALUES
(1, '2020-01-01', 100.00, '1990-01-01'),
(2, '2020-01-15', 200.00, '1995-06-01'),
(3, '2020-02-01', 50.00, '1980-03-01');
Step 2: Analysis Pipeline
Next, we will build an analysis pipeline that flags high-value purchase activity, which we can then use to profile the most profitable customer segments.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Define the target: flag transactions above an illustrative $500 (50,000 cents) spend threshold
data['high_value'] = (data['transaction_amount'] > 50000).astype(int)
# Drop IDs, raw dates, and the label-defining column from the feature matrix
X = data.drop(['customer_id', 'transaction_id', 'transaction_date',
               'customer_birthdate', 'transaction_amount', 'high_value'], axis=1)
y = data['high_value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train random forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)
# Make predictions on testing set
y_pred = rfc.predict(X_test)
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
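The classifier above scores individual transactions, but segment-level insight usually comes from per-customer summaries. A common complement is an RFM (recency, frequency, monetary) profile clustered into segments; here is a minimal sketch assuming the columns from Step 1 (the choice of 4 clusters is arbitrary):
from sklearn.cluster import KMeans
# Aggregate transactions into per-customer RFM features
snapshot = data['transaction_date'].max()
rfm = data.groupby('customer_id').agg(
    recency=('transaction_date', lambda d: (snapshot - d.max()).days),
    frequency=('transaction_id', 'count'),
    monetary=('transaction_amount', 'sum'),
)
# Cluster customers into segments; inspect average RFM per segment to find the profitable ones
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm['segment'] = kmeans.fit_predict(rfm)
print(rfm.groupby('segment').mean())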
Step 3: Model/Visualization Code
Now, we will complement the classifier with a regression model that predicts customer spend, visualize its fit, and use the results to flag high-value customers and generate recommendations for targeted marketing campaigns.
import matplotlib.pyplot as plt
import seaborn as sns
# Develop a regression model that predicts spend (in cents) per transaction
from sklearn.linear_model import LinearRegression
y_amount = data['transaction_amount']
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X, y_amount, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train_r, y_train_r)
# Make predictions on the regression test set
y_pred_r = lr.predict(X_test_r)
# Visualize results
plt.scatter(y_test_r, y_pred_r)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Predictive Model Performance")
plt.show()
# Identify high-value customers using the label defined in Step 2
high_value_customers = data[data['high_value'] == 1]
print("High-Value Customers:")
print(high_value_customers)
# Provide recommendations for targeted marketing campaigns
recommendations = [
    {
        "customer_id": customer_id,
        "recommendation": "Targeted marketing campaign with personalized offers"
    }
    for customer_id in high_value_customers['customer_id'].unique()
]
print("Recommendations:")
print(recommendations)
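To make these recommendations more defensible, it helps to show which features drive the high-value prediction. A short sketch using the random forest trained in Step 2:
# Inspect which features the classifier relies on
importances = pd.Series(rfc.feature_importances_, index=X.columns).sort_values()
importances.plot(kind='barh')
plt.xlabel("Feature Importance")
plt.title("Drivers of High-Value Classification")
plt.tight_layout()
plt.show()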
Step 4: Performance Evaluation
We will evaluate the performance of our predictive model using metrics such as mean absolute error (MAE) and mean squared error (MSE).
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Evaluate the regression model on its held-out test set
mae = mean_absolute_error(y_test_r, y_pred_r)
mse = mean_squared_error(y_test_r, y_pred_r)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
Step 5: Production Deployment
Finally, we will deploy our predictive model to a production environment using a cloud-based platform such as AWS SageMaker.
import boto3
# Create a SageMaker client (note: SageMaker resource names must use hyphens, not underscores)
sagemaker = boto3.client('sagemaker')
# Register the trained model artifact with SageMaker
model_name = "customer-segmentation-model"
sagemaker.create_model(
    ModelName=model_name,
    ExecutionRoleArn="arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-123456789012",
    PrimaryContainer={
        "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-decision-trees:1.0.4",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz"
    }
)
# Create an endpoint configuration; ProductionVariants belongs here, not on create_endpoint
config_name = "customer-segmentation-config"
sagemaker.create_endpoint_config(
    EndpointConfigName=config_name,
    ProductionVariants=[
        {
            "VariantName": "variant-1",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.xlarge"
        }
    ]
)
# Deploy the model to production by creating the endpoint from the configuration
endpoint_name = "customer-segmentation-endpoint"
sagemaker.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=config_name
)
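Once the endpoint reports InService, new customers can be scored over HTTPS. A minimal sketch, assuming the endpoint name above and a container that accepts CSV input (the payload format depends on your serving container):
import boto3
# Runtime client for invoking deployed endpoints
runtime = boto3.client('sagemaker-runtime')
# Hypothetical single-row CSV payload; adjust to your container's expected schema
payload = "34\n"  # e.g. customer_age
response = runtime.invoke_endpoint(
    EndpointName="customer-segmentation-endpoint",
    ContentType="text/csv",
    Body=payload
)
print(response['Body'].read().decode('utf-8'))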
Metrics/ROI Calculations
We will calculate the ROI of our project by estimating the increase in sales from high-value customers.
# Calculate ROI as net gain divided by project cost
increase_in_sales = 1_000_000  # $1 million in incremental annual revenue
project_cost = 100_000         # estimated project cost
roi = (increase_in_sales - project_cost) / project_cost
print(f"ROI: {roi:.0%}")
Edge Cases
We will handle edge cases such as missing values, outliers, and non-linear relationships between variables.
# Handle missing values (numeric columns only, as in Step 1)
numeric_cols = data.select_dtypes(include='number').columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
# Handle outliers with a robust loss (HuberRegressor lives in sklearn.linear_model)
from sklearn.linear_model import HuberRegressor
hr = HuberRegressor()
hr.fit(X_train_r, y_train_r)
# Handle non-linear relationships
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2)
X_train_poly = pf.fit_transform(X_train)
X_test_poly = pf.transform(X_test)
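Another common guard against outliers is to winsorize extreme values before modeling. A minimal sketch using interquartile-range fences (the 1.5 multiplier is a conventional choice, not a requirement):
# Clip transaction amounts to IQR fences to limit outlier influence
q1 = data['transaction_amount'].quantile(0.25)
q3 = data['transaction_amount'].quantile(0.75)
iqr = q3 - q1
data['transaction_amount'] = data['transaction_amount'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)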
Scaling Tips
For larger datasets, consider scaling techniques such as distributed computing, parallel processing, and cloud-based platforms.
# Use distributed computing
from joblib import Parallel, delayed
def train_model(X_chunk, y_chunk):
    # Train an independent model on one chunk of the data
    model = RandomForestClassifier(n_estimators=10, random_state=42)
    model.fit(X_chunk, y_chunk)
    return model
X_train_split = np.array_split(X_train, 10)
y_train_split = np.array_split(y_train, 10)
results = Parallel(n_jobs=10)(delayed(train_model)(X_train_split[i], y_train_split[i]) for i in range(10))
# Use parallel processing
from multiprocessing import Pool
def train_model_args(args):
    # Unpack the (features, labels) tuple and delegate to train_model
    X_chunk, y_chunk = args
    return train_model(X_chunk, y_chunk)
# In a standalone script, wrap Pool usage in `if __name__ == "__main__":` to avoid spawn issues
pool = Pool(processes=10)
results = pool.map(train_model_args, [(X_train_split[i], y_train_split[i]) for i in range(10)])
pool.close()
pool.join()
# Use cloud-based platforms: persist the model, then upload the artifact to object storage
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
joblib.dump(rfc, "model.pkl")
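On the serving side, the same artifact can be loaded back, for example inside a container or an inference script. A minimal sketch (the file name matches the dump above):
import joblib
# Load the persisted model and score a batch of customers
model = joblib.load("model.pkl")
predictions = model.predict(X_test)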
By following this tutorial, you can develop a robust and scalable solution to identify high-value customers and provide recommendations for targeted marketing campaigns. Remember to handle edge cases, calculate ROI, and provide scaling tips to ensure the success of your project.