Data Analyst Guide: Mastering the Data Analyst Mindset (Curiosity > Perfection)
Business Problem Statement
In this tutorial, we'll explore a real-world scenario where a company, "E-commerce Inc.," wants to analyze its customer purchasing behavior to identify trends and opportunities for growth. The goal is to develop a data-driven approach to increase sales and improve customer satisfaction.
The company has an e-commerce platform where customers can purchase products from various categories. The platform generates a large amount of data, including customer demographics, purchase history, and product information. The company wants to leverage this data to:
- Identify the most profitable customer segments
- Develop targeted marketing campaigns
- Optimize product offerings and pricing
The potential ROI is significant: the company's targets are a 10-15% increase in sales and a 20-25% improvement in customer-satisfaction scores.
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
First, we need to prepare the data for analysis. We'll use a combination of pandas and SQL to load, clean, and transform the data.
```python
import sqlite3

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load data from the SQLite database
conn = sqlite3.connect('ecommerce.db')
cursor = conn.cursor()

# SQL query to extract the customer table
query = """
SELECT
    customer_id,
    age,
    gender,
    purchase_history,
    product_category,
    purchase_amount
FROM
    customer_data
"""

# Execute the query and load the result into a pandas DataFrame
df = pd.read_sql_query(query, conn)

# Clean and transform the data
df = df.dropna()  # drop rows with missing values
df['purchase_history'] = pd.to_datetime(df['purchase_history'])
df['age'] = df['age'].astype(int)
df['gender'] = df['gender'].astype('category')
df['product_category'] = df['product_category'].astype('category')

# Build the feature matrix: drop the target, the identifier, and the raw
# timestamp, then one-hot encode the categorical columns for scikit-learn
X = pd.get_dummies(
    df.drop(columns=['purchase_amount', 'customer_id', 'purchase_history']),
    columns=['gender', 'product_category'],
)
y = df['purchase_amount']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
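Note that scikit-learn estimators require numeric inputs, so categorical columns like `gender` and `product_category` must be encoded before model fitting; `pd.get_dummies` is one simple option. A self-contained toy illustration (not the real schema):

```python
import pandas as pd

# Toy frame mirroring the categorical columns in the schema above
toy = pd.DataFrame({'gender': ['F', 'M', 'F'], 'age': [25, 40, 31]})

# One-hot encode: each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, columns=['gender'])
print(sorted(encoded.columns))  # ['age', 'gender_F', 'gender_M']
```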
Step 2: Analysis Pipeline
Next, we'll develop an analysis pipeline to identify the most profitable customer segments.
```python
# Calculate customer lifetime value (CLV): total amount spent per customer
def calculate_clv(customer_id):
    # SQL query to sum the purchase history for one customer
    query = """
    SELECT
        SUM(purchase_amount) AS total_spent
    FROM
        customer_data
    WHERE
        customer_id = ?
    """
    cursor.execute(query, (customer_id,))
    result = cursor.fetchone()
    return result[0] or 0  # SUM returns NULL for no rows; map it to 0

# Calculate CLV for each customer
df['clv'] = df['customer_id'].apply(calculate_clv)

# Identify the top 10% of customers by CLV
top_customers = df.sort_values(by='clv', ascending=False).head(int(0.1 * len(df)))

# Profile the demographics of the top customers
print(top_customers['age'].describe())
print(top_customers['gender'].value_counts())
print(top_customers['product_category'].value_counts())
```
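Issuing one SQL query per customer means one database round trip per row. Since the full table is already loaded into `df`, the same CLV can be computed in a single vectorized pass with a pandas `groupby`. A minimal sketch on synthetic data (column names mirror the schema above):

```python
import pandas as pd

# Synthetic stand-in for the customer_data table
df = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'purchase_amount': [10.0, 20.0, 5.0, 7.0, 8.0, 9.0],
})

# One aggregation instead of one query per customer
clv = df.groupby('customer_id')['purchase_amount'].sum()
df['clv'] = df['customer_id'].map(clv)

print(clv.to_dict())  # {1: 30.0, 2: 5.0, 3: 24.0}
```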
Step 3: Model/Visualization Code
Now, we'll develop a model to predict customer purchase behavior and visualize the results.
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# purchase_amount is a continuous target, so this is a regression
# problem: use a random forest regressor, not a classifier
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf.predict(X_test)

# Evaluate with regression metrics (accuracy, precision, and recall
# only apply to classification targets)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))

# Visualize the top product categories with a bar chart
import matplotlib.pyplot as plt

category_counts = top_customers['product_category'].value_counts()
plt.bar(category_counts.index.astype(str), category_counts.values)
plt.xlabel('Product Category')
plt.ylabel('Count')
plt.title('Top Product Categories for High-Value Customers')
plt.show()
```
Step 4: Performance Evaluation
We'll evaluate the pipeline with regression metrics: mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²). Precision and recall are classification metrics and don't apply to a continuous target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Bundle the regression metrics into one helper
def calculate_metrics(y_test, y_pred):
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    return mae, rmse, r2

# Calculate metrics for our model
mae, rmse, r2 = calculate_metrics(y_test, y_pred)
print("MAE:", mae)
print("RMSE:", rmse)
print("R^2:", r2)
```
Step 5: Production Deployment
Finally, we'll deploy our analysis pipeline to a production environment using a cloud-based platform such as AWS or Google Cloud.
```python
# A minimal Flask API that serves predictions from the trained model.
# In production this would run behind a WSGI server (e.g. gunicorn)
# on AWS or Google Cloud rather than the Flask debug server.
from flask import Flask, request, jsonify
import pandas as pd

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON list of feature records,
    # e.g. [{"age": 34, "gender_F": 1, ...}, ...]
    input_data = pd.DataFrame(request.get_json())
    # Align the columns with the training features
    input_data = input_data.reindex(columns=X_train.columns, fill_value=0)
    # Make predictions and return them as JSON
    # (an ndarray is not JSON-serializable, hence .tolist())
    predictions = rf.predict(input_data)
    return jsonify(predictions.tolist())

if __name__ == '__main__':
    app.run()
```
Metrics/ROI Calculations
We'll estimate the ROI of the analysis pipeline. A standard definition is net gain divided by cost; the revenue-growth target from the problem statement supplies the gain, and the dollar figures below are illustrative placeholders, not real company numbers.

```python
# ROI = (incremental gain - cost) / cost
def calculate_roi(incremental_gain, cost):
    return (incremental_gain - cost) / cost

# Illustrative numbers: a 15% revenue lift on $2M annual revenue,
# against a $100K investment in the analytics initiative
baseline_revenue = 2_000_000
revenue_growth = 0.15  # 15% revenue-growth target
incremental_gain = baseline_revenue * revenue_growth  # $300,000
cost = 100_000

roi = calculate_roi(incremental_gain, cost)
print("ROI:", roi)  # 2.0, i.e. a 200% return
```
Edge Cases
We'll handle edge cases such as missing data, outliers, and non-linear relationships.
```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import PolynomialFeatures

# Impute missing numeric values with the column mean
# (an alternative to dropna() when you can't afford to lose rows)
def handle_missing_data(df):
    numeric_cols = df.select_dtypes(include='number').columns
    imputer = SimpleImputer(strategy='mean')
    df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
    return df

# Flag outliers with an isolation forest and keep only the inliers
# (fit_predict returns 1 for inliers, -1 for outliers)
def handle_outliers(df):
    isolation_forest = IsolationForest(contamination=0.1, random_state=42)
    labels = isolation_forest.fit_predict(df.select_dtypes(include='number'))
    return df[labels == 1]

# Add polynomial and interaction terms to capture non-linear relationships
def handle_non_linear_relationships(df):
    polynomial_features = PolynomialFeatures(degree=2)
    transformed = polynomial_features.fit_transform(df)
    return pd.DataFrame(
        transformed,
        columns=polynomial_features.get_feature_names_out(df.columns),
    )
```
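For small tables, a model-based detector can be overkill. A lighter-weight alternative (not from the original pipeline, offered as a sketch) is the classic IQR rule: keep only values within 1.5 × IQR of the quartiles.

```python
import pandas as pd

def iqr_filter(s, k=1.5):
    # Keep values within [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - k * iqr) & (s <= q3 + k * iqr)]

amounts = pd.Series([10, 12, 11, 13, 12, 500])  # 500 is an obvious outlier
print(iqr_filter(amounts).tolist())  # [10, 12, 11, 13, 12]
```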
Scaling Tips
We'll provide scaling tips such as using distributed computing, parallel processing, and data partitioning.
The pipeline above runs on a single in-memory DataFrame; each tactic below relaxes a different bottleneck.

```python
import numpy as np
import pandas as pd

# 1. Distributed computing: Dask partitions a DataFrame across workers
#    and runs pandas-style operations on the partitions in parallel
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=8)
clv_distributed = ddf.groupby('customer_id')['purchase_amount'].sum().compute()

# 2. Parallel processing: joblib fans independent chunks across CPU cores
from joblib import Parallel, delayed

def summarize_chunk(chunk):
    # Placeholder per-chunk work; replace with the real aggregation
    return chunk['purchase_amount'].sum()

chunks = np.array_split(df, 8)
results = Parallel(n_jobs=-1)(delayed(summarize_chunk)(c) for c in chunks)

# 3. Data partitioning: stream the source table in chunks instead of
#    loading it all at once (pandas supports chunked SQL reads)
for chunk in pd.read_sql_query(query, conn, chunksize=100_000):
    summarize_chunk(chunk)
```
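The chunked-processing idea can be exercised without a database: `pandas.read_csv` accepts the same `chunksize` argument and yields one DataFrame per chunk. A minimal runnable sketch using an in-memory CSV:

```python
import io
import pandas as pd

# Ten rows of synthetic purchase amounts as an in-memory "file"
csv = io.StringIO("purchase_amount\n" + "\n".join(str(i) for i in range(1, 11)))

# Stream the file 4 rows at a time and aggregate incrementally,
# so peak memory is bounded by the chunk size, not the file size
total = 0.0
for chunk in pd.read_csv(csv, chunksize=4):
    total += chunk['purchase_amount'].sum()

print(total)  # 55.0
```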