Data Analyst Guide: Mastering Linear Regression Assumptions Every Analyst Must Know
Business Problem Statement
In retail, understanding the relationship between advertising spend and sales revenue is crucial for setting marketing budgets. Suppose we are data analysts at a retail company, tasked with quantifying the impact of advertising spend on sales revenue. We will build a linear regression model to predict sales revenue from advertising spend; to trust its output, however, we first need to verify the linear regression assumptions.
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
First, let's create a sample dataset using Python's pandas library. We'll also include a SQL query to demonstrate how to retrieve the data from a database.
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt
import sqlite3
# Create a sample dataset; a little Gaussian noise is added to the revenue
# so the fit is not artificially perfect and the residual diagnostics
# later in this guide have something to show
rng = np.random.default_rng(42)
data = {
    'advertising_spend': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    'sales_revenue': [500, 700, 900, 1100, 1300, 1500, 1700, 1900, 2100, 2300]
}
df = pd.DataFrame(data)
df['sales_revenue'] = df['sales_revenue'] + rng.normal(0, 50, size=len(df))
# SQL query to retrieve data from a database
conn = sqlite3.connect('retail_database.db')
cursor = conn.cursor()
cursor.execute('''
SELECT advertising_spend, sales_revenue
FROM sales_data
''')
rows = cursor.fetchall()
df_sql = pd.DataFrame(rows, columns=['advertising_spend', 'sales_revenue'])
conn.close()
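If pandas is doing the loading anyway, the cursor/fetchall/DataFrame sequence can be collapsed into a single call to pandas.read_sql_query. The sketch below builds a throwaway in-memory SQLite table so it runs on its own; in the guide's setting you would pass the real retail_database.db connection instead:

```python
import sqlite3
import pandas as pd

# Build a throwaway SQLite database so the example is self-contained
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sales_data (advertising_spend REAL, sales_revenue REAL)')
conn.executemany('INSERT INTO sales_data VALUES (?, ?)',
                 [(100, 500), (200, 700), (300, 900)])

# read_sql_query loads the query result directly into a DataFrame,
# with column names taken from the SELECT clause
df_sql = pd.read_sql_query(
    'SELECT advertising_spend, sales_revenue FROM sales_data', conn
)
conn.close()
print(df_sql.shape)  # (3, 2)
```

This also drops the need to spell out the column names a second time when constructing the DataFrame.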
Step 2: Analysis Pipeline
Next, let's split our dataset into training and testing sets, and then create a linear regression model using scikit-learn.
# Split the dataset into training and testing sets
# (with only 10 rows, a 20% test split leaves just 2 test observations;
# fine for illustration, far too small for a real evaluation)
X = df[['advertising_spend']]
y = df['sales_revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
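Once fitted, the slope and intercept summarize the relationship. A self-contained check using the raw sample values (no noise), where revenue is exactly 300 + 2 × spend:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The raw sample values: revenue = 300 + 2 * spend, exactly
X = np.array([[100], [200], [300], [400], [500], [600], [700], [800], [900], [1000]])
y = np.array([500, 700, 900, 1100, 1300, 1500, 1700, 1900, 2100, 2300])

model = LinearRegression().fit(X, y)
# coef_ holds the slope per feature; intercept_ is the baseline revenue
print(model.coef_[0], model.intercept_)  # 2.0 300.0 (up to floating-point error)
```

Reading the fitted parameters this way is often the first sanity check: a slope of 2 means each extra dollar of spend is associated with two extra dollars of revenue.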
Step 3: Model/Visualization Code
Now, let's use our model to make predictions and visualize the results.
# Make predictions using the model
y_pred = model.predict(X_test)
# Visualize the results
plt.scatter(X_test, y_test, label='Actual')
plt.plot(X_test, y_pred, label='Predicted', color='red')
plt.xlabel('Advertising Spend')
plt.ylabel('Sales Revenue')
plt.title('Linear Regression Model')
plt.legend()
plt.show()
Step 4: Performance Evaluation
To evaluate the performance of our model, we'll calculate the mean absolute error (MAE) and mean squared error (MSE).
# Calculate the mean absolute error (MAE)
mae = metrics.mean_absolute_error(y_test, y_pred)
print(f'MAE: {mae}')
# Calculate the mean squared error (MSE)
mse = metrics.mean_squared_error(y_test, y_pred)
print(f'MSE: {mse}')
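MAE and MSE are both scale-dependent, and MSE is in squared revenue units. Two companions round out the picture: RMSE (back in revenue units) and R² (the share of variance explained). A sketch with made-up actual/predicted values, purely for illustration:

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual vs. predicted revenue, for illustration only
y_true = np.array([1300, 2100])
y_hat = np.array([1250, 2150])

# RMSE is the square root of MSE, so it has the same units as revenue
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_hat))
# R^2 = 1 - SS_res / SS_tot: the fraction of variance the model explains
r2 = metrics.r2_score(y_true, y_hat)
print(rmse, r2)  # 50.0 0.984375
```

An RMSE of 50 reads directly as "predictions are off by about $50 of revenue on average", which is easier to communicate to stakeholders than an MSE of 2500.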
Step 5: Production Deployment
Finally, let's prepare our model for production use. We'll wrap prediction in a function that takes new advertising-spend values and returns a sales forecast.
# Create a function to make predictions
def predict_sales(advertising_spend):
    prediction = model.predict([[advertising_spend]])
    return prediction[0]
# Test the function
new_advertising_spend = 1200
predicted_sales = predict_sales(new_advertising_spend)
print(f'Predicted sales revenue for ${new_advertising_spend} advertising spend: ${predicted_sales}')
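In practice, "deployment" usually starts with persisting the fitted model so a serving process can load it without retraining. joblib is the standard tool for scikit-learn estimators; a minimal sketch (the file name is illustrative):

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a small model, then save it to disk and reload it,
# as a serving process would
X = np.array([[100], [200], [300]])
y = np.array([500, 700, 900])  # exactly 300 + 2 * spend
model = LinearRegression().fit(X, y)

joblib.dump(model, 'sales_model.joblib')    # save in the training job
loaded = joblib.load('sales_model.joblib')  # load in the serving process
print(loaded.predict([[400]])[0])           # 1100.0
```

The loaded estimator predicts identically to the original, so the prediction function above can run in a separate service that never sees the training data.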
Linear Regression Assumptions
To ensure the reliability and accuracy of our model, we need to check the following linear regression assumptions:
- Linearity: The relationship between the independent variable and dependent variable should be linear.
- Independence: Each observation should be independent of the others.
- Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variable.
- Normality: The residuals should be normally distributed.
- No multicollinearity: The independent variables should not be highly correlated with each other.
We can check these assumptions using various statistical tests and visualizations.
# Check for linearity
plt.scatter(X, y)
plt.xlabel('Advertising Spend')
plt.ylabel('Sales Revenue')
plt.title('Scatter Plot')
plt.show()
# Check for independence: the Durbin-Watson test is applied to the
# residuals (not the predictions); values near 2 suggest no autocorrelation
from statsmodels.stats.stattools import durbin_watson
dw_stat = durbin_watson(y - model.predict(X))
print(f'Durbin-Watson statistic: {dw_stat:.3f}')
# Check for homoscedasticity: plot residuals against fitted values and look
# for a constant spread (we use the full dataset here; the two-point test
# set is far too small for residual diagnostics)
fitted = model.predict(X)
residuals = y - fitted
plt.scatter(fitted, residuals)
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
# Check for normality (the Shapiro-Wilk test needs at least 3 observations,
# another reason to use the full-data residuals)
from scipy import stats
normality_test = stats.shapiro(residuals)
print(f'Shapiro-Wilk statistic: {normality_test.statistic:.3f}')
print(f'p-value: {normality_test.pvalue:.3f}')
# Check for multicollinearity: with a single predictor there is nothing to
# check, but with several features compute one VIF per column
# (variance_inflation_factor takes the design matrix and a column index,
# not the whole DataFrame)
from statsmodels.stats.outliers_influence import variance_inflation_factor
if X.shape[1] > 1:
    for i, col in enumerate(X.columns):
        vif = variance_inflation_factor(X.values, i)
        print(f'VIF for {col}: {vif:.2f}')
else:
    print('Only one predictor: multicollinearity does not apply')
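Our single-predictor model makes the VIF check moot, so here is a two-feature sketch with hypothetical tv_spend and online_spend columns (deliberately correlated) showing the per-column call:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical two-channel spend data; online spend tracks TV spend,
# so the columns are strongly correlated on purpose
rng = np.random.default_rng(0)
tv = rng.uniform(100, 1000, 50)
online = 0.5 * tv + rng.normal(0, 50, 50)
X = pd.DataFrame({'tv_spend': tv, 'online_spend': online})

# variance_inflation_factor expects the design matrix and a column index;
# values above roughly 5-10 are commonly read as problematic collinearity
for i, col in enumerate(X.columns):
    print(f'{col}: VIF = {variance_inflation_factor(X.values, i):.2f}')
```

High VIFs here would tell us the two spend channels carry largely redundant information, making individual coefficients unstable even if overall predictions stay good.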
ROI Calculations
To calculate the return on investment (ROI) of our advertising spend, we can use the following formula:
ROI = (Gain from Investment - Cost of Investment) / Cost of Investment
In this simplified example, the gain from investment is the predicted sales revenue and the cost is the advertising spend; a real ROI analysis would use incremental revenue and profit margin rather than total predicted sales.
# Calculate the ROI
def calculate_roi(advertising_spend):
    predicted_sales = predict_sales(advertising_spend)
    roi = (predicted_sales - advertising_spend) / advertising_spend
    return roi
# Test the function
new_advertising_spend = 1200
roi = calculate_roi(new_advertising_spend)
print(f'ROI for ${new_advertising_spend} advertising spend: {roi:.2%}')
Edge Cases
To handle edge cases, we can add error checking and handling to our code. For example, we can check if the input advertising spend is valid (i.e., non-negative).
# Add error checking and handling
def predict_sales(advertising_spend):
    if advertising_spend < 0:
        raise ValueError('Advertising spend must be non-negative')
    prediction = model.predict([[advertising_spend]])
    return prediction[0]
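A quick, self-contained check that the guard behaves as intended (StubModel is a hypothetical stand-in for the fitted scikit-learn model so the sketch runs on its own):

```python
# Trivial stand-in for the fitted model: predicts 2 * spend + 300
class StubModel:
    def predict(self, X):
        return [2 * X[0][0] + 300]

model = StubModel()

def predict_sales(advertising_spend):
    if advertising_spend < 0:
        raise ValueError('Advertising spend must be non-negative')
    return model.predict([[advertising_spend]])[0]

print(predict_sales(100))  # 500
try:
    predict_sales(-5)
except ValueError as e:
    print(e)  # Advertising spend must be non-negative
```

Raising early on invalid input keeps nonsense values (negative budgets, here) from silently producing nonsense forecasts downstream.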
Scaling Tips
To scale our model to larger datasets, we can use techniques such as:
- Data parallelism: Split the data into smaller chunks and process them in parallel.
- Model parallelism: Split the model into smaller components and train them in parallel.
- Distributed computing: Use distributed computing frameworks such as Apache Spark or Hadoop to process large datasets.
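The chunked-processing idea can be sketched with scikit-learn's SGDRegressor, which learns incrementally via partial_fit. The chunks below are simulated; in practice they would come from pd.read_csv(..., chunksize=...) or a database cursor:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
model = SGDRegressor(random_state=0)
scaler = StandardScaler()

# Simulate streaming chunks of (spend, revenue) data where
# revenue = 2 * spend + 300 plus noise; fit the scaler on the first
# chunk, then update the model one chunk at a time
for i in range(20):
    X_chunk = rng.uniform(100, 1000, size=(100, 1))
    y_chunk = 2 * X_chunk[:, 0] + 300 + rng.normal(0, 10, 100)
    if i == 0:
        scaler.fit(X_chunk)
    model.partial_fit(scaler.transform(X_chunk), y_chunk)

# After 2000 samples the prediction at spend=500 should be near 1300
print(model.predict(scaler.transform([[500]]))[0])
```

Because each partial_fit call sees only one chunk, peak memory stays bounded by the chunk size no matter how large the full dataset is.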
We can also standardize the input features to improve numerical stability, and, when training gradient-based models (e.g. with Keras), clip gradients to keep updates stable.
# Standardize features (feature scaling; note this is not batch
# normalization, which is a neural-network layer applied during training)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Clip gradients when training a Keras model
# (the `lr` argument is deprecated; use `learning_rate`)
from keras.optimizers import Adam
optimizer = Adam(learning_rate=0.001, clipnorm=1.0)