Does non-repetitive code really translate to better performance?

#programming #machinelearning #codequality #discuss

Problem Statement: I wanted a function that converts the values in a dataframe column to strings if they are dictionaries or lists. This is required when loading the data into a MySQL server, since standard SQL column types cannot store Python lists and dictionaries directly.

This can be done using two approaches:

Approach 1: We create a function that takes in a dataframe and goes column by column, applying the required logic to every row.

# Approach 1 inputs the entire dataframe

def convert_to_string_all(df):
    """Convert list and dict values to strings in every column of df."""

    for col in df.columns:
        # Values that are not lists or dicts pass through unchanged
        df[col] = df[col].apply(lambda x: str(x) if isinstance(x, (list, dict)) else x)
    return df

test_all = convert_to_string_all(test_all)

Approach 2: We pass a single column as input and apply the logic to each of its rows. With this approach we do not have to process the entire dataframe; we can pass only the specific columns that need it. The only problem is that we have to repeat the function call for each column.

def convert_to_string_few(column):
    """Convert list and dict values to strings in a single column (a pandas Series)."""

    # Values that are not lists or dicts pass through unchanged
    return column.apply(lambda x: str(x) if isinstance(x, (list, dict)) else x)

test_few["feature"] = convert_to_string_few(test_few["feature"])
test_few["imageURL"] = convert_to_string_few(test_few["imageURL"])
test_few["imageURLHighRes"] = convert_to_string_few(test_few["imageURLHighRes"])
test_few["also_view"] = convert_to_string_few(test_few["also_view"])
test_few["also_buy"] = convert_to_string_few(test_few["also_buy"])

DRY (Don't Repeat Yourself) is a fundamental principle of clean code. The problem comes when we treat principles as requirements. In this example, avoiding repetition by writing one function that sweeps the whole dataframe costs us performance.

If we do an analysis of the performance:

Approach 1 (all columns):
O(c × n), where c = total number of columns and n = number of rows
Processes all 10 columns regardless of content
Performs 10n total operations

Approach 2 (specific columns):
O(k × n), where k = number of columns needing conversion
Processes only the 5 columns containing lists/dictionaries
Performs 5n total operations

With 10 columns of which only 5 need conversion, Approach 2 does half the work: 5n operations instead of 10n.
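To make the 10n versus 5n count concrete, here is a minimal sketch that counts how many cells the conversion function actually visits under each approach. The dataframe, its column names (`num_*`, `list_*`), and the counter are made up for illustration; they are not the dataset used above.

```python
import pandas as pd

# Mutable counter so the conversion function can record each cell it visits
counter = {"calls": 0}

def count_and_convert(x):
    counter["calls"] += 1
    return str(x) if isinstance(x, (list, dict)) else x

n = 1_000  # number of rows

# Hypothetical data: 5 plain integer columns and 5 columns holding lists
data = {f"num_{i}": list(range(n)) for i in range(5)}
data.update({f"list_{i}": [[j, j + 1] for j in range(n)] for i in range(5)})
df = pd.DataFrame(data)

# Approach 1: visit every column
counter["calls"] = 0
for col in df.columns:
    df[col].apply(count_and_convert)
calls_all = counter["calls"]

# Approach 2: visit only the columns known to hold lists
counter["calls"] = 0
for col in [c for c in df.columns if c.startswith("list_")]:
    df[col].apply(count_and_convert)
calls_few = counter["calls"]

print(calls_all, calls_few)  # 10000 5000
```

The count only measures how many cells are touched, not wall-clock time. On real data the gap may differ, because converting a list to a string costs more than checking an integer and passing it through.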

On the other hand, some people argue that non-repeating code is an absolute requirement because it helps with maintenance and scaling.
Many developers prioritize code readability and maintainability over performance.

This perspective argues that:
Code is read more often than written
Premature optimization leads to complexity
Modern hardware can handle inefficiencies

While these points have merit, they can lead to a dangerous complacency about computational waste, especially in data-intensive applications.

Conversely, some developers believe performance should be prioritized above all other considerations.

This approach risks:
Creating unmaintainable code
Over-engineering solutions
Ignoring the 80/20 rule of bottlenecks

So what is the way forward?

For me, it has always been a balanced approach. We do not have to blindly follow the principles of clean code; they are suggestions and best practices, and they do not account for the overall context of the code. But we should not ignore the need for maintainability either. We should design a code structure that serves both performance and maintainability.
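One way such a structure could look for the example above is a single function that takes an optional list of columns, so the call is not repeated and untouched columns are never scanned. This is only a sketch: the name `convert_to_string`, its `columns` parameter, and the sample data are my own assumptions, not part of the original code.

```python
import pandas as pd

def convert_to_string(df, columns=None):
    """Return a copy of df with list/dict values converted to strings.

    `columns` is a hypothetical parameter: when omitted, every column is
    processed (Approach 1); when given, only those columns are (Approach 2).
    """
    if columns is None:
        columns = df.columns
    out = df.copy()
    for col in columns:
        out[col] = out[col].apply(
            lambda x: str(x) if isinstance(x, (list, dict)) else x
        )
    return out

# Made-up sample data for illustration
sample = pd.DataFrame({
    "title": ["a", "b"],
    "feature": [["x", "y"], {"k": 1}],
    "also_buy": [[], ["B001"]],
})

converted = convert_to_string(sample, ["feature", "also_buy"])
print(converted["feature"].tolist())  # ["['x', 'y']", "{'k': 1}"]
```

The function works on a copy so the caller's dataframe is left as it was; dropping the copy would save memory at the cost of mutating the input, as Approach 1 does.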

DEV Community