Data Analyst Guide: Mastering Linear Regression Assumptions Every Analyst Must Know

Business Problem Statement

In retail, sales forecasts drive decisions about inventory management, pricing, and marketing. Consider a retailer like Walmart, which operates thousands of stores worldwide: if it can predict sales accurately, it can optimize inventory levels, reduce waste, and increase revenue. The potential ROI of accurate sales prediction is significant, with benefits including:

Reduced inventory costs: a 5-10% reduction in inventory costs can translate into millions of dollars in savings
Increased revenue: a 2-5% increase in sales can translate into tens of millions of dollars in additional revenue
Improved customer satisfaction: products are in stock when customers want them, which supports satisfaction and loyalty

Step-by-Step Technical Solution

Step 1: Data Preparation

To prepare the data for analysis, we will use pandas to load the data and perform initial cleaning and preprocessing. We will also use SQL to retrieve the data from a database.

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Load data from SQL database (replace the placeholder user, password, host, port, and dbname)
engine = create_engine('postgresql://user:password@host:port/dbname')
query = """
    SELECT 
        sales,
        advertising,
        promotion,
        seasonality,
        trend
    FROM 
        sales_data
"""
data = pd.read_sql_query(query, engine)

# Drop any rows with missing values
data.dropna(inplace=True)

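# Convert the categorical seasonality column to one-hot (dummy) columns;
# drop_first=True drops one category to avoid perfectly collinear dummies
data = pd.get_dummies(data, columns=['seasonality'], drop_first=True, dtype=float)

# Define features and target variable
seasonality_cols = [col for col in data.columns if col.startswith('seasonality_')]
X = data[['advertising', 'promotion'] + seasonality_cols]
y = data['sales']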

The query above assumes a sales_data table along these lines (PostgreSQL syntax):

-- Create table to store sales data
CREATE TABLE sales_data (
    id SERIAL PRIMARY KEY,
    sales DECIMAL(10, 2),
    advertising DECIMAL(10, 2),
    promotion DECIMAL(10, 2),
    seasonality VARCHAR(50),
    trend DECIMAL(10, 2)
);

-- Insert data into table
INSERT INTO sales_data (sales, advertising, promotion, seasonality, trend)
VALUES
    (1000.00, 100.00, 50.00, 'Winter', 0.10),
    (1200.00, 150.00, 75.00, 'Spring', 0.15),
    (1500.00, 200.00, 100.00, 'Summer', 0.20),
    (1800.00, 250.00, 125.00, 'Fall', 0.25);
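
To see what the categorical-to-numerical conversion produces, here is a minimal sketch that rebuilds the four sample rows above in memory instead of reading them from the database. The choice of drop_first=True is an assumption made for illustration: it drops one season as the baseline so the dummy columns are not perfectly collinear.

```python
import pandas as pd

# The four sample rows from the INSERT statement, rebuilt in memory
sample = pd.DataFrame({
    "sales": [1000.0, 1200.0, 1500.0, 1800.0],
    "advertising": [100.0, 150.0, 200.0, 250.0],
    "promotion": [50.0, 75.0, 100.0, 125.0],
    "seasonality": ["Winter", "Spring", "Summer", "Fall"],
})

# One-hot encode the season; the alphabetically first category ("Fall")
# becomes the baseline and gets no column of its own
encoded = pd.get_dummies(sample, columns=["seasonality"], drop_first=True, dtype=float)

# Everything except the target is a candidate feature
feature_cols = [col for col in encoded.columns if col != "sales"]
print(feature_cols)
print(encoded)
```

In these sample rows promotion is exactly half of advertising, so the two columns are perfectly collinear. That is worth keeping in mind when reading the multicollinearity check in Step 2.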

Step 2: Analysis Pipeline

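To analyze the data, we will fit a linear regression model that predicts sales from advertising, promotion, and seasonality. We will then check the five standard assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and absence of multicollinearity.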

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on testing set
y_pred = model.predict(X_test)

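# Check linear regression assumptions
# The residual-based checks below use the test-set residuals
import statsmodels.api as sm
residuals = y_test - y_pred
# Design matrix with an explicit intercept column ('const') at index 0
exog = sm.add_constant(X_test.astype(float), has_constant='add')

# 1. Linearity: the scatter should show a roughly straight-line relationship
plt.scatter(X_test['advertising'], y_test)
plt.xlabel('Advertising')
plt.ylabel('Sales')
plt.show()

# 2. Independence: Durbin-Watson on the residuals (values near 2 suggest no
#    autocorrelation; only meaningful when rows are in time order)
from statsmodels.stats.stattools import durbin_watson
dw_stat = durbin_watson(residuals)
print('Durbin-Watson statistic:', dw_stat)

# 3. Homoscedasticity: Goldfeld-Quandt test, ordering observations by advertising (column 1)
from statsmodels.stats.diagnostic import het_goldfeldquandt
gq_stat, gq_pvalue, _ = het_goldfeldquandt(y_test, exog, idx=1)
print('Goldfeld-Quandt statistic:', gq_stat, 'p-value:', gq_pvalue)

# 4. Normality: Shapiro-Wilk test on the residuals, not on the raw target
from scipy import stats
shapiro_stat, shapiro_pvalue = stats.shapiro(residuals)
print('Shapiro-Wilk statistic:', shapiro_stat, 'p-value:', shapiro_pvalue)

# 5. No multicollinearity: one VIF per feature (the constant is skipped)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = {col: variance_inflation_factor(exog.values, i)
       for i, col in enumerate(exog.columns) if col != 'const'}
print('Variance inflation factors:', vif)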
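
To make the Durbin-Watson statistic and the variance inflation factor less of a black box, the sketch below computes both by hand with NumPy. The helper names and the synthetic data are assumptions for illustration only; in practice the statsmodels functions used above are the natural choice.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def vif(features, j):
    """VIF for column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    features = np.asarray(features, dtype=float)
    target = features[:, j]
    others = np.column_stack([np.ones(len(features)), np.delete(features, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    r_squared = 1.0 - resid.var() / target.var()
    return float(1.0 / (1.0 - r_squared))

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
advertising = rng.uniform(100, 250, size=200)
promotion = 0.5 * advertising + rng.normal(0, 1, size=200)  # nearly collinear with advertising
unrelated = rng.normal(0, 1, size=200)                       # independent of both
features = np.column_stack([advertising, promotion, unrelated])

print(durbin_watson_stat([1, -1, 1, -1]))          # alternating residuals give 3.0
print(durbin_watson_stat(rng.normal(size=1000)))   # independent residuals land near 2
print([round(vif(features, j), 1) for j in range(features.shape[1])])
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity. Here the two nearly collinear columns score far above that, while the unrelated column stays close to 1.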

Step 3: Model/Visualization Code

To visualize the results, we will use matplotlib to plot the predicted sales against the actual sales.

# Plot predicted sales against actual sales
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.show()

# Plot residuals against predicted sales
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.xlabel('Predicted Sales')
plt.ylabel('Residuals')
plt.show()

Step 4: Performance Evaluation

To evaluate the performance of the model, we will use metrics such as mean squared error and R-squared.

# Calculate mean squared error
mse = metrics.mean_squared_error(y_test, y_pred)
print('Mean squared error:', mse)

# Calculate R-squared
r2 = metrics.r2_score(y_test, y_pred)
print('R-squared:', r2)
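
As a worked check on what these two metrics measure, the following sketch computes them by hand. The actual values are the sales figures from the sample rows in Step 1; the predictions are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import numpy as np

y_true = np.array([1000.0, 1200.0, 1500.0, 1800.0])  # sales from the sample rows
y_hat = np.array([1100.0, 1200.0, 1400.0, 1800.0])   # hypothetical predictions

# Mean squared error: average of the squared prediction errors
mse = np.mean((y_true - y_hat) ** 2)

# Root mean squared error: same information, expressed in sales units
rmse = np.sqrt(mse)

# R-squared: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse)             # 5000.0
print(round(rmse, 2))  # 70.71
print(round(r2, 4))    # 0.9456
```

Because MSE is in squared units, the RMSE is often easier to communicate: here the typical error is about 71 sales units.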

Step 5: Production Deployment

To deploy the model in production, we will use a Python script that loads the data, makes predictions, and saves the results to a database.

# Load data from database
engine = create_engine('postgresql://user:password@host:port/dbname')
query = """
    SELECT 
        advertising,
        promotion,
        seasonality
    FROM 
        sales_data
"""
data = pd.read_sql_query(query, engine)

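# Apply the same one-hot encoding used in training, then align the columns
# with the training features (categories missing from this batch become 0)
data = pd.get_dummies(data, columns=['seasonality'], dtype=float)
data = data.reindex(columns=X_train.columns, fill_value=0)

# Make predictions
predictions = model.predict(data)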

# Save predictions to database
predictions_df = pd.DataFrame(predictions, columns=['predicted_sales'])
predictions_df.to_sql('predicted_sales', engine, if_exists='replace', index=False)
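
One thing to watch when scoring new data is that one-hot encoding a fresh batch can yield different columns than the model saw during training, for example when a batch contains only some of the seasons. The sketch below shows one way to align them with reindex. The training_columns list and the new rows are hypothetical values for illustration.

```python
import pandas as pd

# Hypothetical: the feature columns the model was fitted on
training_columns = [
    "advertising", "promotion",
    "seasonality_Spring", "seasonality_Summer", "seasonality_Winter",
]

# Hypothetical new batch that contains only two of the four seasons
new_data = pd.DataFrame({
    "advertising": [120.0, 180.0],
    "promotion": [60.0, 90.0],
    "seasonality": ["Winter", "Fall"],
})

# Encode, then force the exact training column set and order;
# dummy columns absent from this batch are filled with 0
encoded = pd.get_dummies(new_data, columns=["seasonality"], dtype=float)
aligned = encoded.reindex(columns=training_columns, fill_value=0.0)
print(aligned)
```

The "Fall" row ends up with zeros in every seasonality column, which is correct when "Fall" was the baseline category dropped during training.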

Edge Cases

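Handling missing values: The code in Step 1 simply drops rows with missing values. When that would discard too much data, imputation techniques such as mean or median imputation are an alternative.
Handling outliers: Linear regression is sensitive to extreme values, so techniques such as winsorization (capping values at chosen percentiles) or trimming (removing them) can be used to handle outliers.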
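
The two techniques named above can be sketched with pandas as follows. The values and the 5th/95th percentile limits are hypothetical choices for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical advertising values with one missing entry and one extreme value
advertising = pd.Series([100.0, 150.0, np.nan, 250.0, 5000.0])

# Median imputation: the median ignores NaN and is robust to the outlier
filled = advertising.fillna(advertising.median())

# Winsorization: cap values at the 5th and 95th percentiles instead of dropping them
lower, upper = filled.quantile([0.05, 0.95])
winsorized = filled.clip(lower=lower, upper=upper)

print(filled.tolist())
print(winsorized.tolist())
```

To avoid leaking information from the test set, the median and the percentile limits would normally be computed on the training data only and then reused on new data.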

Scaling Tips

Use distributed computing: Frameworks such as Apache Spark can fit regression models on datasets that do not fit in memory on a single machine.
Use parallel processing: Libraries such as joblib can run independent pieces of work in parallel, for example cross-validation folds or separate models per store.
Use cloud computing: Platforms such as AWS or Google Cloud provide the compute and storage to run either approach on large datasets.
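
As a small illustration of the joblib option, the sketch below fits one simple regression per store in parallel. The per-store setup, the fit_store helper, and the synthetic data are all assumptions made for this example; threads are used here only to keep the sketch lightweight.

```python
import numpy as np
from joblib import Parallel, delayed

def fit_store(advertising, sales):
    """Least-squares intercept and slope for a single store (hypothetical helper)."""
    design = np.column_stack([np.ones(len(advertising)), advertising])
    coef, *_ = np.linalg.lstsq(design, sales, rcond=None)
    return coef

# Synthetic per-store data, for illustration only
rng = np.random.default_rng(0)
stores = {}
for store_id in range(4):
    adv = rng.uniform(100, 250, size=50)
    stores[store_id] = (adv, 500.0 + 5.0 * adv + rng.normal(0, 10, size=50))

# Fit the stores in parallel; results come back in the same order as the inputs
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(fit_store)(adv, sales) for adv, sales in stores.values()
)
coefs = dict(zip(stores.keys(), results))
print({store_id: coef.round(2).tolist() for store_id, coef in coefs.items()})
```

For a single linear regression the fit itself is rarely the bottleneck, so parallelism tends to pay off when there are many independent fits like these.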

Metrics/ROI Calculations

Mean squared error: The average squared difference between actual and predicted sales, computed in Step 4; lower is better.
R-squared: The share of variance in sales explained by the model, also computed in Step 4; closer to 1 is better.
Return on investment (ROI): The net gain divided by the cost, used to evaluate the financial performance of the model.

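# Calculate ROI (the revenue and cost figures are illustrative placeholders)
revenue = 1000000  # additional revenue attributed to the model
cost = 500000      # cost of building and running the model
roi = (revenue - cost) / cost
print(f'Return on investment: {roi:.0%}')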

Note: This is a simplified example and may not reflect the actual ROI of a real-world project.
