5 software engineering practices for Data Science

Victoria Ubaldo (vikyale) ・4 min read

When we work on data science projects, we come across different roles and functions. My software engineering experience has let me apply various tasks and activities from that field to data science projects. Depending on the project and the time available, here are five basic software engineering concepts that every analyst or developer should learn or review. Here we go!

Documentation:

Documentation makes clear what each part of the code does and what each component is for. What do we see in data science projects? Different files in Python, R, SQL, or Scala that are usually passed from analyst to analyst. Having documentation saves time in understanding the code and improves productivity.

The types of documentation are:

  • Line level:
import pandas as pd
# Let's read the dataset and check the content
titanic_data = pd.read_csv('drive/My Drive/Datasets/titanic-data.csv')
# Compute the mean age in the dataset
titanic_data["Age"].mean()
  • Function or module level:
def getExchangeRates(amount, exchange_rate):
    """
    Convert an amount of money using an exchange rate.

    Parameters
    ----------
    amount : float
        a quantity of money
    exchange_rate : float
        rate at which one currency will be exchanged for another

    Returns
    -------
    float
        amount converted at the exchange rate
    """
    return amount * exchange_rate
  • Project level:
project_ml_bank/
│
├── project/  # Project source code
├── docs/
├── datasets/
│   ├── train/
│   └── test/
├── README
├── HOW_TO_CONTRIBUTE
├── CODE_OF_CONDUCT
├── model.py
└── test.py

Version control:

As mentioned before, you will work with a lot of code. How do we keep it in order, whether working alone or in a team? Ideally, with a version control repository. We install Git and, with a GitHub, GitLab, or Bitbucket account, create a repository that holds our code. A few commands in the terminal or cmd are enough for these tasks.
In data science, version control is also essential because we iterate frequently until our prediction model reaches the appropriate metrics, and we may need to test against previous versions while improving precision with other features or variables.

Testing:

When working with code, it is important to detect faults early and to have confidence in the results. In data science, errors are not always easy to detect. What can we avoid with testing in data science projects? Here are some examples:

  • Incorrect encoding: the code does not detect UTF-8 encoding problems in the data (typically in dates, emails, coordinates).
  • Inappropriate results: the code does not clean the data correctly.
  • Unexpected results: the model shows a lot of bias when evaluated with real data.

How is it solved with testing?

To detect these errors in data science, we must review the quality and accuracy of our analysis, in addition to the quality of the code.
The most commonly used techniques are:

  • Test-Driven Development: a development process in which you write the tests for each task before writing the code that implements it.
  • Unit test: a type of test that covers a single unit of code, such as a function, independently of the rest of the program.
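
As an illustration of a unit test for a data-cleaning step, here is a minimal sketch using Python's built-in unittest module. The function clean_ages, its accepted age range of 0 to 120, and the test names are hypothetical and chosen only for this example. In a test-driven workflow, the test class would be written first and the function afterwards.

```python
import unittest


def clean_ages(raw_ages):
    """Convert raw age values to floats, dropping missing or invalid entries.

    Hypothetical cleaning helper used only for this illustration.
    """
    cleaned = []
    for value in raw_ages:
        try:
            age = float(value)
        except (TypeError, ValueError):
            continue  # skip None, empty strings, and non-numeric text
        if 0 <= age <= 120:  # assumed plausible range for a human age
            cleaned.append(age)
    return cleaned


class TestCleanAges(unittest.TestCase):
    def test_drops_missing_and_non_numeric_values(self):
        self.assertEqual(clean_ages(["22", None, "", "abc", "38.5"]), [22.0, 38.5])

    def test_drops_out_of_range_values(self):
        self.assertEqual(clean_ages(["-1", "45", "300"]), [45.0])

    def test_empty_input_returns_empty_list(self):
        self.assertEqual(clean_ages([]), [])
```

Saved in a file, these tests can be run with python -m unittest followed by the file name. Each test covers one behavior of the function, so a failure points directly at the cleaning rule that broke.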

Logging:

Logs help us review the events that occur while our program runs. Logging is the process of recording messages that describe those events. For example, if you need to run a model on a very large dataset, you can leave it running overnight and then review the log to see what happened: whether it finished successfully or hit errors.

The log levels are:

  • DEBUG: use this level to record each step or event that happens in the program.
  • ERROR: use this level to record any errors that occur.
  • INFO: use this level to record informative actions performed by the system, such as regularly scheduled operations.

Some tips for good logging: be professional and clear, be concise, use normal capitalization (upper and lowercase), provide any useful information, and choose the appropriate logging level.
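
The three levels above can be sketched with Python's standard logging module. This is only a minimal example: the logger name "training" and the train_model function are hypothetical, and the "training" step is a placeholder that computes a mean.

```python
import logging

# Show DEBUG and above, with a timestamp and the level on every line.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("training")  # hypothetical logger name


def train_model(rows):
    """Placeholder training step: returns the mean of the values."""
    logger.info("Training started on %d rows", len(rows))
    if not rows:
        logger.error("Training aborted: the dataset is empty")
        return None
    logger.debug("First row: %r", rows[0])
    result = sum(rows) / len(rows)
    logger.info("Training finished, result=%.3f", result)
    return result


train_model([1.0, 2.0, 3.0])
```

For a long overnight run, passing a filename argument to logging.basicConfig writes the messages to a file that can be reviewed the next morning.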

Code Review:


Code reviews help a team promote good programming practices and prepare code for production. They also help enforce standards, keep code readable, and share knowledge across the whole team.

Conclusions

These points are basic practices to apply in data science projects, and they have helped me maintain order and quality. In data science, the software perspective is essential: with your software engineering experience you can contribute and grow in this exciting world of data, reducing errors and improving productivity for you and your team.

I invite you to follow me:

Linkedin: https://www.linkedin.com/in/victoriaubaldo/
Twitter: https://twitter.com/VikyAle
