Prayson Wilfred Daniel

SRP: Refactoring Data Science Beyond Classes

Picture a grand symphony orchestra: woodwinds, brass, strings, and percussion, each section distinct yet harmoniously united. The conductor waves the baton, and every instrument plays its part, contributing to a cohesive, magnificent whole. This musical harmony, the flawless interplay of instruments, is what the SOLID principles seek to instil in our data science canvas, with the Single Responsibility Principle (SRP) as the first instrument.

Woohoo! You're still here. You're thirsty for more! Not only have you journeyed this far, but you've also conquered "Building the Bedrock: Employing SOLID Principles in Data Science". Now, with that fire in our belly, we're all geared up to explore how SRP can revolutionize a data science codebase.


The S in SOLID: The Maestros Behind the Curtain

SRP isn’t just a software design principle. It’s an art form, a meticulous choreography of roles and responsibilities. Beyond the confines of classes, SRP sings a computer science symphony, ensuring every 'instrument' - be it a function, a class, or a file - resonates with its unique note, yet contributes to the collective melody of the project.

If a file, a function, or a class has more than one responsibility, it becomes coupled. A change to one responsibility often forces a modification of the others. Such code becomes not only difficult to maintain but also unreliable.

Given that the ultimate goal is to produce software that is not only robust and reliable but also easy to maintain, SRP becomes the heart and soul of our software design principles. When each part of the software focuses on one primary responsibility, it becomes less tangled and, therefore, easier to understand and modify.
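To ground the idea before we refactor a full pipeline, here is a minimal, hypothetical sketch of the same preprocessing written twice: first as one coupled function, then split so each function owns a single responsibility (the column names are borrowed from the Fish-Market dataset used later in this article):

# srp_sketch.py: one coupled function versus single-responsibility functions
import pandas as pd

# Coupled: loading, cleaning, and encoding share one function, so a change
# to any one concern forces changes (and retests) to all of them.
def prepare(uri: str) -> pd.DataFrame:
    data = pd.read_csv(uri)
    data = data[data["Weight"] > 0]
    return pd.get_dummies(data, columns=["Species"])

# SRP: each function now has exactly one reason to change.
def load(uri: str) -> pd.DataFrame:
    return pd.read_csv(uri)

def drop_non_positive_weights(data: pd.DataFrame) -> pd.DataFrame:
    return data[data["Weight"] > 0]

def encode_species(data: pd.DataFrame) -> pd.DataFrame:
    return pd.get_dummies(data, columns=["Species"])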

Let's Code: Creating a Data Preprocessing Pipeline

In a bustling orchestra hall, imagine a Jupyter notebook resonating with Pythonic pandas compositions: data transformations flowing like a symphony, one pipeline step feeding the next, creating a harmonious tune. But as the concert progresses, the melodies intertwine, becoming complex and hard to follow.

Enter the SRP. Like a masterful composer, we rearrange the musical pieces, refactoring our notebook's intricate sections into separate scores. Each score now holds its unique melody, making the entire composition more resilient to change and a breeze to perform. Our once tangled symphony is now a graceful ballet of data movements, each with its own spotlight.

# Task Prototype. Each Task has to implement *run* method 
# file: prototypes.py
from abc import ABC, abstractmethod

class Task(ABC):
    @property
    def name(self) -> str:
        return self.__class__.__name__

    @abstractmethod
    def run(self, *args, **kwargs):
        pass
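The tasks in the next block pull their column lists from a small srp.config module that the article never shows. Here is a minimal sketch of what it might contain; the exact values are hypothetical, loosely based on the Fish-Market columns used in the final example:

# file: srp/config.py (hypothetical; this module is not shown in the article)
COLUMNS_TO_LOG_TRANSFORM = ["Weight"]
COLUMNS_TO_ONEHOTENCODE = ["Species"]
COLUMNS_TO_DROP = ["Length2", "Length3"]
COLUMNS_TO_FLOATS_TO_INTEGER = ["Width"]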

Now, the instruments that do one thing well:

# Tasks that can be divided into their own files
# from prototypes import Task

from typing import Literal

import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn import set_config

from srp import config

set_config(transform_output="pandas")


transformer = make_column_transformer(
    (FunctionTransformer(np.log), config.COLUMNS_TO_LOG_TRANSFORM),
    (OneHotEncoder(sparse_output=False), config.COLUMNS_TO_ONEHOTENCODE),
    verbose_feature_names_out=False,
    remainder="passthrough",
)

class DropZerosTask(Task):
    def run(self, data: pd.DataFrame) -> pd.DataFrame:
        cleaned_data = data[~data.select_dtypes("number").eq(0).any(axis=1)]
        return cleaned_data

class DropColumnsTask(Task):
    def __init__(self, columns: list[str]):
        self.columns = columns

    def run(self, data: pd.DataFrame) -> pd.DataFrame:
        cleaned_data = data.drop(columns=self.columns)
        return cleaned_data

# TODO: decorator to save and load transformer
class TransformerTask(Task):
    def __init__(self, stage: Literal["train", "predict"] = "train"):
        self.stage = stage

    def run(self, data: pd.DataFrame) -> pd.DataFrame:
        if self.stage == "predict":
            cleaned_data = transformer.transform(data)
        else:
            cleaned_data = transformer.fit_transform(data)

        return cleaned_data

class Floats2IntsTask(Task):
    def __init__(self, columns: list[str]):
        self.columns = columns

    def run(self, data: pd.DataFrame) -> pd.DataFrame:
        cleaned_data = data.copy()  # work on a copy; do not mutate the caller's DataFrame
        cleaned_data[self.columns] = cleaned_data[self.columns].transform(
            pd.to_numeric,
            errors="coerce",
            downcast="integer",
        )
        return cleaned_data

This modular approach offers flexibility and testability. As needs and technologies change, it's much simpler to test, adjust, or replace individual parts than to overhaul the entire system.
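For example, because each task does exactly one thing, it can be exercised in isolation with a couple of lines. A minimal pytest-style sketch (the test file and the toy data are hypothetical):

# file: tests/test_tasks.py (hypothetical)
# from data.tasks import DropZerosTask, DropColumnsTask
import pandas as pd

def test_drop_zeros_removes_rows_containing_zeros():
    data = pd.DataFrame({"Weight": [242.0, 0.0], "Height": [11.52, 12.48]})
    cleaned = DropZerosTask().run(data)
    assert len(cleaned) == 1

def test_drop_columns_removes_only_the_named_columns():
    data = pd.DataFrame({"Weight": [242.0], "Height": [11.52]})
    cleaned = DropColumnsTask(columns=["Height"]).run(data)
    assert list(cleaned.columns) == ["Weight"]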

Time to meet the conductor, responsible for adding and calling the instruments:

# DataTasks gather and runner
# file: data/datatasks.py
# from prototypes import Task

from __future__ import annotations
from queue import PriorityQueue
import pandas as pd
from loguru import logger


class DataTasks:
    def __init__(self, tasks: PriorityQueue | None = None):
        # Start with an empty queue unless one is supplied.
        self.tasks = tasks if tasks is not None else PriorityQueue()

    def set_task(self, priority: int, task: Task) -> DataTasks:
        # Lower numbers run first; keep priorities unique, since on a tie
        # PriorityQueue would fall back to comparing Task objects.
        self.tasks.put((priority, task))
        return self

    def run(self, data: pd.DataFrame) -> pd.DataFrame:
        while not self.tasks.empty():
            priority, task = self.tasks.get()
            logger.debug(f"priority: {priority}, task: {task.name}")
            data = task.run(data)
        return data

Since each Task only needs to implement a run method that takes and returns a DataFrame, this part of the software does not change when individual Tasks change, making it both adaptable to future requirements and resilient in operation.
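To see that resilience in action, suppose we later need to drop duplicate rows. We add a new instrument without touching the conductor; the task below is a hypothetical example, not from the original article:

# A new task slots straight into the chain; DataTasks itself stays untouched.
# from prototypes import Task
import pandas as pd

class DropDuplicatesTask(Task):
    def run(self, data: pd.DataFrame) -> pd.DataFrame:
        return data.drop_duplicates()

# Usage: give it a priority and chain it like any other task, e.g.
# data_chain.set_task(priority=5, task=DropDuplicatesTask())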

Finally, a grand symphony. A seamless pipeline that is robust, maintainable, and adaptable to change.


# Implementation
# from data.datatasks import DataTasks
# from data.tasks import ...
import pandas as pd

from srp import config

def process_data(data: pd.DataFrame) -> pd.DataFrame:
    data_chain = DataTasks()
    (
        data_chain.set_task(priority=2, task=DropZerosTask())
        .set_task(
            priority=1,
            task=DropColumnsTask(columns=config.COLUMNS_TO_DROP),
        )
        .set_task(priority=3, task=TransformerTask())
        .set_task(
            priority=4,
            task=Floats2IntsTask(columns=config.COLUMNS_TO_FLOATS_TO_INTEGER),
        )
        # Add more or remove tasks to the chain as needed
    )

    # Send data through the chain
    return data_chain.run(data)

# in the main
# import process_data

if __name__ == "__main__":
    import pandas as pd

    URI = "https://raw.githubusercontent.com/Ankit152/Fish-Market/main/Fish.csv"
    dataf = pd.read_csv(URI)
    clean_data = process_data(dataf)

Flow Example

The Guiding Stars: Avoiding the Black Holes

While SRP and design patterns offer a universe of possibilities, beware of the black holes of over-engineering and complexity. Not every problem demands this symphony; sometimes, a simple melody will do. We still need to tailor our approach to the narrative of our project.

SRP begins to transform our data science projects into masterpieces: cohesive yet complex, simple yet sophisticated. Every component, like a perfectly tuned instrument, plays its part, contributing to the entire symphony and ensuring our projects are maintainable, scalable, and extensible.


The cosmic masterpiece's journey does not end here; it has just started. Our next SOLID guiding star is "O": the Open-Closed Principle (OCP), which guides us to build units that are open for extension but closed for modification. A saga for the next article in this series.

Up Next: "OCP: Refactoring the Data Science Project"

Until then, keep on coding data science SOLID-ly.

const artist = "Shahnoza Bekbulaeva";
console.log(artist)
