DEV Community

Renan Moura

Posted on • Originally published at renanmf.com

State of Data Science 2021: Popularity of Python

Python continues to be an excellent choice if you are entering the data science field.

Python still dominates and is the most popular language, particularly among younger generations.

88% of students surveyed are learning Python in preparation for a data science career.

63% of the respondents said they use it frequently or always.

71% of educators are teaching Python.

Most used programming languages

It is also interesting to notice SQL ranking 2nd, right after Python.

Most structured data still lives in relational databases, so a good knowledge of both Python and SQL is a must to deal with data.

The good news is that both are very accessible and good starting points for working with code.
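To give an idea of how the two fit together, here is a minimal sketch that uses Python's built-in sqlite3 module to load a few rows into an in-memory database and summarize them with SQL. The table name, the column names, the count_by_language helper, and the rows themselves are all made up for illustration; they are not taken from the report.

```python
import sqlite3

# Hypothetical survey rows, invented for illustration only (not from the report).
rows = [
    ("student", "Python"),
    ("student", "SQL"),
    ("educator", "Python"),
    ("professional", "Python"),
    ("professional", "SQL"),
    ("professional", "R"),
]


def count_by_language(records):
    """Load (role, language) records into an in-memory SQLite table
    and return (language, count) pairs, most frequent first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE responses (role TEXT, language TEXT)")
        # Parameterized inserts: Python supplies the data, SQL stores it.
        conn.executemany("INSERT INTO responses VALUES (?, ?)", records)
        cursor = conn.execute(
            "SELECT language, COUNT(*) AS n FROM responses "
            "GROUP BY language ORDER BY n DESC, language ASC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for language, n in count_by_language(rows):
        print(language, n)
```

SQL does the grouping and counting, while Python handles loading the data and working with the result. With a real database you would swap the in-memory connection for a connection to that database.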

Comments about the other languages

R is an alternative to Python, but I don't see any advantage in learning it if you are already on the Python path, since R won't bring anything to the table that Python doesn't.

Then we have JavaScript and HTML/CSS, which makes sense: your results won't live in a Word document on your computer, and a good way to display them is on the web with some nice interactivity.

Bash/Shell is super useful. The command line is one of the most powerful tools in a coder's tool belt, and many data engineering tools, like Hadoop, rely heavily on command-line interfaces that can be easily automated with a shell script.

If you are wondering why Java ranks so high on this list: Hadoop, Hive, HDFS, and others are written in Java, and many data pipelines depend on JVM-powered tools like Kafka.

So while you may never touch Java as a Data Scientist, you will most probably have to deal with it as a Data Engineer at some point.

C/C++ ranks high due to the number of libraries coded in these languages for high performance.

Many of Python's most used data and Machine Learning libraries rely on C/C++ for their performance-critical parts (pandas, for example, builds on NumPy and compiled extensions), while Python provides a nicer API to work with.

The other languages (C#, TypeScript, PHP, Rust, Julia, and Go) have their place, of course, but from my point of view they are not worth further study at the moment.

They are used for more specific use cases or simply fall into "that's what my team and I know best".

Julia would be the best contender to replace Python, but it still has a way to go before it deserves the time and energy needed to learn it.

Go would be the high-level, performant alternative to Java, but it doesn't yet have an ecosystem with as many tools behind it.

So, out of this list, the ones I think will pay you the most dividends for your investment in time and effort are Python, SQL, JavaScript, HTML/CSS, Bash/Shell, and Java.

These languages are more than enough to let you work at any stage of a Data Science project or pipeline.

You can read the full report on State of Data Science 2021.

Read this article directly on my blog.
