Data Analyst Guide: Mastering the Data Analyst Mindset (Curiosity > Perfection)
Business Problem Statement
In this tutorial, we'll explore a real-world scenario where a company, "E-commerce Inc.," wants to analyze its customer purchasing behavior to identify trends and opportunities for growth. The goal is to develop a data-driven approach to increase sales and improve customer satisfaction.
The company has an e-commerce platform where customers can purchase products from various categories. The platform generates a large amount of data, including customer demographics, purchase history, and product information. The company wants to leverage this data to:
- Identify the most profitable customer segments
- Develop targeted marketing campaigns
- Optimize product offerings and pricing
The potential ROI is significant: the company's targets are a 10-15% increase in sales and a 20-25% improvement in customer-satisfaction scores.
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
First, we need to prepare the data for analysis. We'll use a combination of pandas and SQL to load, clean, and transform the data.
```python
import sqlite3

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load data from the SQLite database
conn = sqlite3.connect('ecommerce.db')
cursor = conn.cursor()

# SQL query to extract the customer table
query = """
SELECT
    customer_id,
    age,
    gender,
    purchase_history,
    product_category,
    purchase_amount
FROM
    customer_data
"""

# Execute the query and load the result into a pandas DataFrame
df = pd.read_sql_query(query, conn)

# Clean and transform the data
df = df.dropna()  # drop rows with missing values
df['purchase_history'] = pd.to_datetime(df['purchase_history'])
df['age'] = df['age'].astype(int)
df['gender'] = df['gender'].astype('category')
df['product_category'] = df['product_category'].astype('category')

# Build the feature matrix: drop the target, the identifier, and the raw
# timestamp, then one-hot encode the categorical columns for scikit-learn
X = pd.get_dummies(
    df.drop(columns=['purchase_amount', 'customer_id', 'purchase_history']),
    columns=['gender', 'product_category'],
)
y = df['purchase_amount']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
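Note that scikit-learn estimators require numeric inputs, so categorical columns like `gender` and `product_category` must be encoded before model fitting; `pd.get_dummies` is one simple option. A self-contained toy illustration (not the real schema):

```python
import pandas as pd

# Toy frame mirroring the categorical columns in the schema above
toy = pd.DataFrame({'gender': ['F', 'M', 'F'], 'age': [25, 40, 31]})

# One-hot encode: each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, columns=['gender'])
print(sorted(encoded.columns))  # ['age', 'gender_F', 'gender_M']
```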
Step 2: Analysis Pipeline
Next, we'll develop an analysis pipeline to identify the most profitable customer segments.
```python
# Calculate customer lifetime value (CLV): total amount spent per customer
def calculate_clv(customer_id):
    # SQL query to sum the purchase history for one customer
    query = """
    SELECT
        SUM(purchase_amount) AS total_spent
    FROM
        customer_data
    WHERE
        customer_id = ?
    """
    cursor.execute(query, (customer_id,))
    result = cursor.fetchone()
    return result[0] or 0  # SUM returns NULL for no rows; map it to 0

# Calculate CLV for each customer
df['clv'] = df['customer_id'].apply(calculate_clv)

# Identify the top 10% of customers by CLV
top_customers = df.sort_values(by='clv', ascending=False).head(int(0.1 * len(df)))

# Profile the demographics of the top customers
print(top_customers['age'].describe())
print(top_customers['gender'].value_counts())
print(top_customers['product_category'].value_counts())
```
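Issuing one SQL query per customer means one database round trip per row. Since the full table is already loaded into `df`, the same CLV can be computed in a single vectorized pass with a pandas `groupby`. A minimal sketch on synthetic data (column names mirror the schema above):

```python
import pandas as pd

# Synthetic stand-in for the customer_data table
df = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'purchase_amount': [10.0, 20.0, 5.0, 7.0, 8.0, 9.0],
})

# One aggregation instead of one query per customer
clv = df.groupby('customer_id')['purchase_amount'].sum()
df['clv'] = df['customer_id'].map(clv)

print(clv.to_dict())  # {1: 30.0, 2: 5.0, 3: 24.0}
```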
Step 3: Model/Visualization Code
Now, we'll develop a model to predict customer purchase behavior and visualize the results.
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# purchase_amount is a continuous target, so this is a regression
# problem: use a random forest regressor, not a classifier
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf.predict(X_test)

# Evaluate with regression metrics (accuracy, precision, and recall
# only apply to classification targets)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))

# Visualize the top product categories with a bar chart
import matplotlib.pyplot as plt

category_counts = top_customers['product_category'].value_counts()
plt.bar(category_counts.index.astype(str), category_counts.values)
plt.xlabel('Product Category')
plt.ylabel('Count')
plt.title('Top Product Categories for High-Value Customers')
plt.show()
```
Step 4: Performance Evaluation
We'll evaluate the pipeline with regression metrics: mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²). Precision and recall are classification metrics and don't apply to a continuous target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Bundle the regression metrics into one helper
def calculate_metrics(y_test, y_pred):
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    return mae, rmse, r2

# Calculate metrics for our model
mae, rmse, r2 = calculate_metrics(y_test, y_pred)
print("MAE:", mae)
print("RMSE:", rmse)
print("R^2:", r2)
```
Step 5: Production Deployment
Finally, we'll deploy our analysis pipeline to a production environment using a cloud-based platform such as AWS or Google Cloud.
```python
# A minimal Flask API that serves predictions from the trained model.
# In production this would run behind a WSGI server (e.g. gunicorn)
# on AWS or Google Cloud rather than the Flask debug server.
from flask import Flask, request, jsonify
import pandas as pd

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON list of feature records,
    # e.g. [{"age": 34, "gender_F": 1, ...}, ...]
    input_data = pd.DataFrame(request.get_json())
    # Align the columns with the training features
    input_data = input_data.reindex(columns=X_train.columns, fill_value=0)
    # Make predictions and return them as JSON
    # (an ndarray is not JSON-serializable, hence .tolist())
    predictions = rf.predict(input_data)
    return jsonify(predictions.tolist())

if __name__ == '__main__':
    app.run()
```
Metrics/ROI Calculations
We'll estimate the ROI of the analysis pipeline. A standard definition is net gain divided by cost; the revenue-growth target from the problem statement supplies the gain, and the dollar figures below are illustrative placeholders, not real company numbers.

```python
# ROI = (incremental gain - cost) / cost
def calculate_roi(incremental_gain, cost):
    return (incremental_gain - cost) / cost

# Illustrative numbers: a 15% revenue lift on $2M annual revenue,
# against a $100K investment in the analytics initiative
baseline_revenue = 2_000_000
revenue_growth = 0.15  # 15% revenue-growth target
incremental_gain = baseline_revenue * revenue_growth  # $300,000
cost = 100_000

roi = calculate_roi(incremental_gain, cost)
print("ROI:", roi)  # 2.0, i.e. a 200% return
```
Edge Cases
We'll handle edge cases such as missing data, outliers, and non-linear relationships.
```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import PolynomialFeatures

# Impute missing numeric values with the column mean
# (an alternative to dropna() when you can't afford to lose rows)
def handle_missing_data(df):
    numeric_cols = df.select_dtypes(include='number').columns
    imputer = SimpleImputer(strategy='mean')
    df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
    return df

# Flag outliers with an isolation forest and keep only the inliers
# (fit_predict returns 1 for inliers, -1 for outliers)
def handle_outliers(df):
    isolation_forest = IsolationForest(contamination=0.1, random_state=42)
    labels = isolation_forest.fit_predict(df.select_dtypes(include='number'))
    return df[labels == 1]

# Add polynomial and interaction terms to capture non-linear relationships
def handle_non_linear_relationships(df):
    polynomial_features = PolynomialFeatures(degree=2)
    transformed = polynomial_features.fit_transform(df)
    return pd.DataFrame(
        transformed,
        columns=polynomial_features.get_feature_names_out(df.columns),
    )
```
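For small tables, a model-based detector can be overkill. A lighter-weight alternative (not from the original pipeline, offered as a sketch) is the classic IQR rule: keep only values within 1.5 × IQR of the quartiles.

```python
import pandas as pd

def iqr_filter(s, k=1.5):
    # Keep values within [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - k * iqr) & (s <= q3 + k * iqr)]

amounts = pd.Series([10, 12, 11, 13, 12, 500])  # 500 is an obvious outlier
print(iqr_filter(amounts).tolist())  # [10, 12, 11, 13, 12]
```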
Scaling Tips
We'll provide scaling tips such as using distributed computing, parallel processing, and data partitioning.
The pipeline above runs on a single in-memory DataFrame; each tactic below relaxes a different bottleneck.

```python
import numpy as np
import pandas as pd

# 1. Distributed computing: Dask partitions a DataFrame across workers
#    and runs pandas-style operations on the partitions in parallel
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=8)
clv_distributed = ddf.groupby('customer_id')['purchase_amount'].sum().compute()

# 2. Parallel processing: joblib fans independent chunks across CPU cores
from joblib import Parallel, delayed

def summarize_chunk(chunk):
    # Placeholder per-chunk work; replace with the real aggregation
    return chunk['purchase_amount'].sum()

chunks = np.array_split(df, 8)
results = Parallel(n_jobs=-1)(delayed(summarize_chunk)(c) for c in chunks)

# 3. Data partitioning: stream the source table in chunks instead of
#    loading it all at once (pandas supports chunked SQL reads)
for chunk in pd.read_sql_query(query, conn, chunksize=100_000):
    summarize_chunk(chunk)
```
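The chunked-processing idea can be exercised without a database: `pandas.read_csv` accepts the same `chunksize` argument and yields one DataFrame per chunk. A minimal runnable sketch using an in-memory CSV:

```python
import io
import pandas as pd

# Ten rows of synthetic purchase amounts as an in-memory "file"
csv = io.StringIO("purchase_amount\n" + "\n".join(str(i) for i in range(1, 11)))

# Stream the file 4 rows at a time and aggregate incrementally,
# so peak memory is bounded by the chunk size, not the file size
total = 0.0
for chunk in pd.read_csv(csv, chunksize=4):
    total += chunk['purchase_amount'].sum()

print(total)  # 55.0
```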