Bhaskar Karambelkar

Posted on Apr 14, 2017

Data Scientists and Software Engineering

#datascience #softwareengineering

Originally published on my personal blog

Coding != Software Engineering

As has been mentioned ad nauseam: anyone can code. What has not been said enough is that there is a lot more to coding than merely assembling a set of instructions in a programming language of your choice to make the machine do what you want. Not as catchy as 'Anyone can code!', is it?

As a professional software developer turned data scientist, I feel compelled to share some software engineering wisdom with my fellow data scientists who may have followed a less coding-heavy path. If you are an academic researcher turned data scientist, or perhaps a data analyst used to point-and-click GUI tools or Excel (the horror!), then learning to code in a programming language can be a liberating and exhilarating, but also very scary, experience. But fear no more. This post and the follow-up posts in this series are just for professionals like you. This series will introduce you to software engineering and encourage you to explore it in more detail, so that you become proficient in writing good code no matter the programming language.

So What is Software Engineering?

Instead of referencing a formal definition, which you can easily look up using Google, let me tell you what I think the practice of software engineering aims to accomplish. Writing more or less working code is the easy part. What software engineering aims to accomplish is making the code portable, concise, relatively bug-free, secure, performant within given constraints, and reusable, all with limited manpower and budget. And believe me, this is not as easy as you might think, nor is it a natural extension of the practice of coding. By that I mean that you can't learn software engineering, let alone master it, just by coding more and more. Proper software engineering is an art and a science unto itself, of which coding skills are an important but nonetheless small piece.

The reason I chose to explain software engineering this way is that the term itself is somewhat controversial and debated. So instead of drowning you in the controversy about the term, I present my understanding of the intent behind software engineering. But be forewarned that this is one person's opinion, so take it for what it's worth.

So Why Must Data Scientists Care?

For many reasons. As I mentioned previously, more and more data analysis is now done in code rather than in point-and-click GUI tools. This places the added burden of learning how to code on a data analyst or researcher who may not have had exposure to coding before. Even if you have taken a programming language class, it was mostly meant to teach you the syntax of the language rather than software engineering.

The implication of the above is that you may write code that works but has all sorts of issues. It may not be portable on account of using non-portable APIs/libraries. It may not be optimized or concise, and as a result may perform poorly at scale. It may have a large surface area for bugs and security vulnerabilities. It may not be maintainable in the long run, and hence may prevent you from reproducing your results in the future. Given all this, I would go as far as to argue that if you care about the veracity and reproducibility of your research/analysis, then you absolutely must care about software engineering.

Cliff Notes for Software Engineering

I must warn you that this page is only meant to get you started on software engineering, not to tell you all there is to know about it. Even so, following most of my advice below will make you a better coder (ahem, software engineer).

Have a basic understanding of how computer systems work. The hardware, the software (including OS/Kernel), the network, the Internet. This will go a long way, I promise.
Try and pick up at least 2 or 3 programming languages. Broaden your repertoire.
Learn about various programming paradigms like imperative, object-oriented, functional etc.
Many modern languages don't strictly fall under a single paradigm. Form a habit of recognizing which features adhere to which paradigms.
Embrace each programming language's idiosyncrasies rather than fight them. If in doubt, always remember that people a lot smarter than you put hours into developing the language/API/library you are using.
Have a good understanding of the standard library of a programming language. It will prevent you from inefficiently duplicating functionality that was at your disposal from the get-go.
In addition to the standard library, research some of the leading third-party libraries/APIs available. Someone has almost always solved the problem before you. Your skill lies in finding that solution rather than duplicating it.
Start picking up on how to distinguish efficient from inefficient code. Efficiency can be defined in terms of performance, conciseness, resource consumption etc.
Teach yourself the principles of application security and secure coding. Not being a full-time software developer doesn't absolve you from writing secure code.
Think beyond your immediate use case. Think of use cases in the future or use cases by users other than yourself. "It suffices for my needs" is a narrow mindset.
Write less code and more comments. Think of that someone who has to read your code six months or a year from now. Even if that someone is you, I can tell you from experience that reading properly commented code can do wonders to lower your stress levels.
Be critical of your coding abilities rather than being confident about them. Let that imposter syndrome be your motivation to improve.
Automate your testing. Be it unit tests or integration tests, take the human out of the equation as much as possible.
Learn about software delivery pipelines. Continuous integration (CI), automated deployments, and DevOps are not just buzzwords. They play a critical part in your overall product development.
Familiarize yourself with distributed computing, cloud environments, virtualization and container technologies.
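
To make two of the points above a little more concrete (lean on the standard library, and automate your testing), here is a minimal sketch in Python. The language is my choice for illustration, and the function name `most_common_label` and the sample data are hypothetical. It uses the standard library's `collections.Counter` instead of a hand-rolled counting loop, and pairs the function with `unittest` tests so that nobody has to check the output by eye.

```python
import unittest
from collections import Counter


def most_common_label(labels):
    """Return the label that occurs most often in `labels`.

    Hypothetical helper, for illustration only.
    """
    if not labels:
        # Why, not what: an empty input usually signals an upstream data
        # problem, and silently returning None would hide it.
        raise ValueError("labels must not be empty")
    # Counter already implements counting; no need to hand-roll a dict loop.
    label, _count = Counter(labels).most_common(1)[0]
    return label


class MostCommonLabelTest(unittest.TestCase):
    """Automated checks that can run on every change."""

    def test_picks_the_most_frequent_label(self):
        self.assertEqual(most_common_label(["a", "b", "a", "c"]), "a")

    def test_rejects_empty_input(self):
        with self.assertRaises(ValueError):
            most_common_label([])
```

Saved in a file, the tests can be run with `python -m unittest`, which is also the kind of command a CI pipeline would invoke.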

The last four are special and deserve to be separated out from the rest.
Always follow them no matter how big or small your program/script is, and no matter how pressed for time you are. Excuse the shouting, but they are that important.

NEVER CODE WITHOUT A VERSION CONTROL SYSTEM, PREFERABLY GIT
DON'T HARD CODE STRINGS, NUMBERS, FILE/DIRECTORY NAMES EVER!
ALWAYS METICULOUSLY DOCUMENT YOUR "CLEVER" HACKS. ALWAYS!

Finally...

DON'T BLINDLY COPY CODE FROM STACK OVERFLOW.
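
As a rough illustration of the rule about hard-coded values, here is a minimal Python sketch. All names in it (`DEFAULT_INPUT_DIR`, `DEFAULT_MIN_ROWS`, `parse_args`, `input_file`, and the directory and threshold values) are hypothetical. The idea is simply that paths and numbers live in one named place and can be overridden from the command line, rather than being scattered through the code as literals.

```python
import argparse
from pathlib import Path

# Named defaults in one place instead of literals scattered through the code.
# Both values are hypothetical and exist only for illustration.
DEFAULT_INPUT_DIR = Path("data")
DEFAULT_MIN_ROWS = 100


def parse_args(argv=None):
    """Read settings from the command line, falling back to the defaults."""
    parser = argparse.ArgumentParser(description="Hypothetical analysis script.")
    parser.add_argument("--input-dir", type=Path, default=DEFAULT_INPUT_DIR,
                        help="directory containing the input files")
    parser.add_argument("--min-rows", type=int, default=DEFAULT_MIN_ROWS,
                        help="minimum number of rows a file must have")
    return parser.parse_args(argv)


def input_file(input_dir, name):
    """Build a file path without hard-coding a path separator."""
    # Path's "/" operator joins with the right separator on every OS.
    return input_dir / name
```

A script built this way (call it `analysis.py`, again a made-up name) could be run as `python analysis.py --input-dir /path/to/data --min-rows 500`, so changing a directory or threshold never requires editing the code.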

Anything More?

Yes! A lot more. Over the course of this series I will expand on each of my bullet points in a separate blog post that takes a deep dive into the point. In the meantime, feel free to look up software engineering and the controversy around it. If you have any comments to share, find me on Twitter (link at the bottom).

Top comments (15)

Fernando Calatayud • Apr 17 '17

All looked good... until you advised adding comments. Good code doesn't need comments because it's self-explanatory. If your code is more readable with comments, add them... or try to write better code.

Bhaskar Karambelkar • Apr 17 '17

Thanks for reading the post and your comments. I appreciate a good thoughtful discussion. Although I never suspected that that point would be controversial.

Goodness of code is a subjective measure and not easy to quantify. So saying that code doesn't need documentation if it's good code is somewhat of a hard sell. In a similar vein, someone could say good code doesn't need unit tests because it's good code and ergo bug-free.

Having said that, I will say that writing good comments is a skill and an art unto itself. Simply describing what the code is doing is not good commenting. Describing why the code is taking the approach it has taken can be illuminating for someone who is reading the code and wondering about the choices the coder made in his implementation.

Fernando Calatayud • Apr 19 '17

I have a constructive suggestion... paste here a piece of code which benefits from your comments, and I'll try to write the uncommented version. You'll decide which version is better ;-)

mrseanpaul81 • Apr 18 '17

I couldn't agree more... Ultimately the comment will deviate from the code so will become a lie!

Your code should be your comment (there may be a very rare exception but it should be extremely rare like one comment per year!)

Fernando Calatayud • Apr 19 '17

Indeed, that's the main point: people rarely update the comments when the code gets changed, so the comment becomes misleading.

But it's not the only reason... most times, comments are a smell that some piece of code is too complex. Of course, the problem is the complex code, not the comments... but if you feel the need to comment, better to refrain and try to refactor instead.

About when to comment... in my case, I do it when I write ugly code on purpose. It may be due to a performance tweak, ugly but needed, or because the cleaner version causes an unexpected problem. Both happen rarely, but they do happen, and then it is better to advise the next person against trying to refactor (or to be very careful).

Dr Janet Bastiman • Jun 15 '17

Great post - I have a similar one drafted somewhere. It's difficult seeing individuals coming out of academia without an understanding of how to fit into an engineering team.

Re the comments issue - for a data scientist making the transition to an engineer, comments are helpful. Going from a solo effort to a team effort and the mind shift of portability/reuse isn't going to happen overnight. Magic numbers, poorly named variables etc will sneak in occasionally. In my team, we accept this and so I encourage comments, particularly on the "clever hacks". When they get to a point that the comments aren't necessary then they get dropped.

I have an extra rule regarding git that I make sure I state explicitly: no developing in the master branch - it's easy to give a quick overview of source control and forget about branching :)

Unfrozen Caveman Dev • Apr 18 '17

Exactly! I've got several years under my belt as a full stack web developer, and am currently moving more and more into data integration. Data scientists, business intelligence devs., analysts, etc. are all still doing software engineering. The ETL tools I'm using act a lot like code -- and are in fact extensible with code. The ideas are the same: make it readable, think about testing, automation, etc.

Everyone in my list above uses some combination of Python, R, Javascript, Java, C#, VBscript, etc. for modeling, mockups, analysis, etc. Even full on database engineering involves the same sorts of worries that programmers have in terms of input vs. output, how to best model data, usage, etc.

Finally, I'd even note that I used to work with actuaries and insurance underwriters. They set up massive Excel spreadsheets with dozens of macros and pivot tables to model and calculate various rating inputs for risk scenarios and charging for insurance premiums. If that's not software engineering, I don't know what is.

Espoir Murhabazi • May 14 '17

Thanks a lot for the article!! As a newbie coming from school, I discovered that getting things done is not the most important thing; getting them done well is! I will follow all the suggestions given in your article!!
I've started by learning git and haven't finished yet (branches still give me headaches),
but I'm already using it in my project!
Next I will learn TDD
and CI.
All the best

Bhaskar Karambelkar • Jun 15 '17

Thanks Espoir, and all the best for your learning.

Phil Ashby • Apr 24 '17

Thanks for a nice start on thinking about engineering, not just cutting code :)

I find a lot of articles and advice assume that the person coding^H^H^H engineering software already understands their problem/task; this is frequently not the case! One of the best lectures I ever attended (back in the 90's!) was with Grady Booch, who advocated modelling the problem using flexible, physical objects (quite possibly post-it notes and string, although a whiteboard and pen also work) to properly understand what 'things' you are dealing with, how they interact and how they behave, before choosing a language and committing to code, as the effort required to change a code model is typically much greater.

My take on the comments/no-comments discussion: comments are "labels on the trees in the forest", they can help if describing non-obvious choices but they come with maintenance costs, and don't provide the bigger picture that a "map of the forest" can (the aforementioned model of the problem).

Full disclosure: I'm a technical architect by day, code junkie by night!

t0ss • Jun 15 '17 • Edited

Regarding the discussion below, "Don't write comments" is a mantra I was once guilty of shouting at every opportunity as well. It really seems like something we just yell anytime comments are mentioned now.
Write good comments, when you need them. Other devs new to your codebase will thank you for it. No matter how clean you write it, there are times you need to explain why you're doing something. If you're making a programmer backtrack through interfaces, objects, functions, etc to see why something is happening you're not writing clean code.

Bhaskar Karambelkar • Jun 15 '17

Thanks for your comments! I am merely making suggestions based on my experiences and I know that there are people who are completely anti commenting. To each his own I guess. :)

Bill White • Jun 15 '17

Consider that the comments could be for YOU, the developer, sometime down the road (years possibly!), not necessarily for anybody else. If they are for others, then yes, you are in team development territory, and the code has to fit into the last stage of the software engineering calculus: maintenance and extension. The software development lifecycle (SDLC) is for serious software, like aircraft avionics, high-speed finance, and control systems, and thus cannot cut corners on comments, documentation and training manuals/delivery. Writing web interfaces, storefronts and the like that get changed by the day is something different, and rarely anything I would consider engineering. I still have code/custom hardware running from 1995 (Visual BASIC 3), and thank goodness it's commented!

When I was a child I wrote no comments, then I put away childish things and learned to write in coherent sentences.