Prayson Wilfred Daniel

Data Transformation is All You Need

In the vast realm of machine learning and data science, sophisticated algorithms and models often grab the limelight. However, sometimes, the crux of a successful predictive model isn't just about choosing the right algorithm, but more about presenting the data in a manner that the algorithm can best learn from.

This short article, born from my procrastination, explores a linear regression model in light of the power of data visualization and transformation. My goal is to show how a simple change in data representation can improve model performance.

Setting the Stage: Basic Linear Regression

To illustrate this point, consider a dataset that captures the relationship between TV advertisements and sales. We start our exploration by loading our data and visualizing the relationship.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


# jupyter labextension install plotlywidget
pd.options.plotting.backend = "plotly"

ADS_DATA_RI = "https://raw.githubusercontent.com/justmarkham/scikit-learn-videos/master/data/Advertising.csv"

dataf = pd.read_csv(ADS_DATA_RI, index_col=0)

# split my data to test the performance of the model
X, y = dataf[["TV"]], dataf["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=7)


(dataf
     .assign(split=lambda d: np.where(d.index.isin(X_train.index), "train", "test"))
     .plot(kind="scatter",
           x="TV",
           y="Sales",
           color="split",
           trendline="ols",
           trendline_scope="overall", 
           trendline_color_override="darkblue",
           opacity=.7,
           title="Sales ~ TV Ads")
)

[Figure: Sales vs. TV ad spend, coloured by train/test split, with an overall OLS trendline]

The scatter plot shows a relationship between TV advertisements and sales that is not entirely linear, and our linear regression model fails to capture the trend in sales at lower levels of TV ad spending.

Let's reverse engineer the linear model equation using scikit-learn:


from numpy.typing import NDArray
from sklearn.linear_model import LinearRegression
from sklearn import metrics

lm = LinearRegression(fit_intercept=True)
lm.fit(X_train, y_train)

def get_equation(features: NDArray[np.str_], coef: NDArray[np.float64], intercept: float, y: str = "Sales") -> str:
    terms = [f"{c:.5f} * {feature}" for feature, c in zip(features, coef)]
    equation = " + ".join([*terms, f"{intercept:.5f}"])

    return f"{y} ~ {equation}"


y_func = get_equation(features=lm.feature_names_in_, coef=lm.coef_, intercept=lm.intercept_)
print(y_func)
# 'Sales ~ 0.04727 * TV + 7.17408'


And the R-squared score, which gives an idea about how well our model explains the variability of the response data around its mean, is:

y_pred = lm.predict(X_test)
R2 = f"{metrics.r2_score(y_true=y_test, y_pred=y_pred):.7f}"
print(R2)
# '0.6779489'
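For intuition, the R-squared score is just one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, using the y_test and y_pred defined above:

# R² by hand: 1 - SS_residual / SS_total (the same quantity metrics.r2_score returns)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print(1 - ss_res / ss_tot)  # matches the r2_score above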

However, on plotting the residuals, the differences between actual and predicted values, we observe a pattern:

...
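# fig_size is assumed to be a shared dict of Plotly figure kwargs defined
# earlier in the notebook, e.g. fig_size = dict(width=800, height=500)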
(pd
     .DataFrame(y_test - y_pred)
     .rename(columns={"Sales": "Residuals"})
     .assign(TV=X_test["TV"],
             Position=X_test.index)
     .plot(kind="scatter",
           y="Residuals",
           x="Position",
           title=f"True Sales - Predicted Sales: {R2=} ",
           **fig_size
          )
     .add_hline(y=0)
     .update_traces(mode="markers", hovertemplate=None)
     .update_layout(hovermode="y unified")
)

[Figure: residuals (true minus predicted sales) across the test observations]

This pattern in the residuals suggests our linear model isn't capturing everything: if the model had captured the underlying structure of the data accurately, the residuals wouldn't exhibit any pattern.

Bend It Like Beckham

The visualization suggests that the relationship between TV ads and sales isn't strictly linear but looks inverse-exponential at first: the first circa $1–$50 spent on advertising appears to yield slower returns in sales before picking up momentum.

A simple way to capture this potential relationship is by transforming both the input (TV ads) and the output (Sales) using a logarithm.
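Why a logarithm? Fitting a straight line in log-log space is equivalent to fitting a power law in the original units:

log10(Sales) = β0 + β1 · log10(TV)  ⟺  Sales = 10^β0 · TV^β1

so a slope β1 different from 1 lets the fitted curve bend with the data instead of forcing a straight line through the raw values.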

# Data Transformation

...
(dataf
     .assign(split=lambda d: np.where(d.index.isin(X_train.index), "train", "test"))
     .plot(kind="scatter",
           x="TV",
           y="Sales",
           color="split",
           trendline="ols",
           trendline_options=dict(log_x=True, log_y=True), # power of transformation
           trendline_scope="overall", 
           trendline_color_override="darkblue",
           opacity=.7,
           title="log(Sales) ~ log(TV) Ads + c",
           **fig_size)
)


[Figure: Sales vs. TV with a log-log OLS trendline]

Our transformed scatter plot shows a more linear relationship between log-transformed TV ads and sales, which supports our initial hypothesis of a logarithmic relationship.

Transformed Linear Regression: A Better Fit?

Let's utilize the log-transformed data and fit a linear regression model:

from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline


def inverse_log10(X: float) -> float:
    return 10 ** X

transformer = FunctionTransformer(func=np.log10, inverse_func=inverse_log10)
linear_model = TransformedTargetRegressor(regressor=LinearRegression(), func=np.log10, inverse_func=inverse_log10)

lmx = make_pipeline(transformer, linear_model)
lmx.fit(X_train, y_train)

[Figure: the fitted scikit-learn pipeline]
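If you are curious about what was learned in log space, the coefficients can be pulled out of the fitted TransformedTargetRegressor inside the pipeline. A quick sketch, reusing the get_equation helper from earlier:

# the fitted LinearRegression lives inside the TransformedTargetRegressor
inner_lm = lmx[-1].regressor_
print(get_equation(features=np.array(["log10(TV)"]), coef=inner_lm.coef_,
                   intercept=inner_lm.intercept_, y="log10(Sales)"))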

The R-squared score for this new model is:

y_pred = lmx.predict(X_test)
R2 = f"{metrics.r2_score(y_true=y_test, y_pred=y_pred):.7f}"
print(R2)
# 0.6962380 

This score is higher than our initial model's (0.678), indicating that the transformed model explains more of the variability in the sales data.

Visualizing our transformed model alongside the actual data, we get:

(dataf
     .assign(split=lambda d: np.where(d.index.isin(X_train.index), "train", "test"))
     .plot(kind="scatter",
           x="TV",
           y="Sales",
           color="split",
           opacity=.7,
           title="log10(Sales) ~ log10(TV) + c",
           **fig_size)
).add_traces(
    (dataf
     .assign(pred=lmx.predict(dataf[["TV"]]))
     .sort_values(by=["TV"])
     .plot(kind="line",
           x="TV",
           y="pred",
           **fig_size)
     .update_traces(line_color="darkblue")
    ).data)

[Figure: the transformed model's fitted curve over the train/test data]

Next Step: Bayesian Linear Regression Model

We've noticed that as TV ad spending increases, sales deviate more from our fitted line. Bayesian linear regression is a natural next step: instead of a single best-fit line, we get a distribution of plausible lines, which captures the uncertainty across the range of TV ads.

Say hello to PyMC. We can build our linear model as follows:


import arviz as az
import pymc as pm


log_Sales, log_TV = np.log10(y_train.to_numpy()), np.log10(X_train["TV"].to_numpy())

with pm.Model() as model:
    log_Sales_ = pm.MutableData("log_Sales", log_Sales)
    log_TV_ = pm.MutableData("log_TV", log_TV)

    # weakly informative priors for the intercept, slope, and noise scale
    β0 = pm.Normal("β0", mu=0, sigma=20)
    β1 = pm.Normal("β1", mu=0, sigma=20)
    sigma = pm.HalfNormal("σ", sigma=10)


    mu = β0 + β1 * log_TV_
    pm.Normal("log(Sales)", mu=mu, sigma=sigma, observed=log_Sales_, shape=mu.shape)

# sudo apt install graphviz
pm.model_to_graphviz(model)

[Figure: graphviz rendering of the PyMC model]

With Bayesian modelling, we sample from the posterior to capture the uncertainty in our model's parameters, given the observed data and our prior information.

with model:
    idata = pm.sample(3000, chains=4, target_accept=0.85, random_seed=42)
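Before predicting, it is worth glancing at the posterior. A quick sketch with ArviZ shows the estimated intercept, slope, and noise, along with convergence diagnostics such as r_hat:

# posterior means, credible intervals, and convergence diagnostics
az.summary(idata, var_names=["β0", "β1", "σ"])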


# to predict and evaluate our model
with model:
    pm.set_data({"log_TV":  np.log10(X_test["TV"].to_numpy())})


    pm.sample_posterior_predictive(
        idata,
        predictions=True,
        extend_inferencedata=True,
        random_seed=42,

    )

y_pred = 10 ** idata.predictions["log(Sales)"].mean(dim=["chain", "draw"]).data

R2, RMSE = metrics.r2_score(y_test, y_pred), metrics.mean_squared_error(y_test, y_pred, squared=False)

print(R2, RMSE)
# 0.6949903971655622, 2.8620992998783623
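The posterior also gives a readable point estimate of the power-law relationship. A small sketch using the posterior means (exact numbers will vary from run to run):

# back-transform the posterior means to the original scale
b0 = idata.posterior["β0"].mean().item()
b1 = idata.posterior["β1"].mean().item()
print(f"Sales ≈ {10 ** b0:.3f} * TV^{b1:.3f}")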

Plotting the fitted lines


from collections import defaultdict

container = defaultdict(list)

# thin the posterior, keeping every 25th draw from each chain
β0 = idata.posterior["β0"][:, ::25].to_numpy().flatten()
β1 = idata.posterior["β1"][:, ::25].to_numpy().flatten()

for i, (b0, b1) in enumerate(zip(β0, β1)):
    container[f"Sales_{i}"].extend(10 ** (b0 + b1 * np.log10(X_train["TV"])))

container["Sales_mean"].extend(10 ** (β0.mean().item() + β1.mean().item() * np.log10(X_train["TV"])))
container["Sales"].extend(y_train)
container["TV"].extend(X_train["TV"])


p = pd.DataFrame(container)

fig = (p
     .plot(kind="scatter",
           x="TV",
           y="Sales",
           opacity=.7,
           title=f"{R2=:.4f} {RMSE=:.4f} | log10(Sales) ~ log10(TV) + c",
           **fig_size)


)

# add each sampled posterior line (all columns except Sales_mean, Sales, TV)
for column in p.columns[:-3]:
    fig = fig.add_traces(
        (p
         .sort_values(by=["TV"])
         .plot(kind="line",
               x="TV",
               y=column,
               **fig_size)
         .update_traces(line_color="lightblue", opacity=0.07)
        ).data)

# add the mean line
fig.add_traces(
    (p
     .sort_values(by=["TV"])
     .plot(kind="line",
           x="TV",
           y="Sales_mean",
           **fig_size)
     .update_traces(line_color="darkblue")
    ).data)

[Figure: posterior lines (light blue) and posterior mean (dark blue) over the training data]

What makes Bayesian modeling intriguing is that our initial model is just a beginning; we have complete freedom to build on it. Next, we could segment the data into TV ad buckets, or bins, and fit a separate linear model for each one. That could be a future article in itself 🤭.


Takeaway: 🍔

The main takeaway here is that the success of a predictive model doesn't always lie in choosing a more complex algorithm. Before you reach for a more intricate model, take a moment to step back and consider if a data transformation could simplify things. Remember, sometimes, data transformation is all you need.

Until then, keep on coding data science simplified.
