I recently answered this question on Quora. I didn't phrase the question, but it's a good starting point. I usually stay away from language debates, as you will see, but this one really interested me: I had debated it with myself a lot and was researching this specific question because I wanted to know which language to use for my next data project. Here are my personal insights. (Please let me know what you think! :)
This is how I treat the R vs. Scala vs. Python "which to choose" saga. I use each one for its strengths, and here is the recipe. This is my personal view and usage of the languages.
Use R as a replacement for a spreadsheet. Together with RStudio it makes a killer statistics, plotting and data analytics application.
You can take log files, parse them, graph them, pivot them and filter them, all with great support from RStudio. It's a killer data analysis language and workspace, and you should study it as a replacement for your spreadsheet work.
Do you want to grep some lines from a text file? No problem, just use: dateLines <- grep(x = mylog, pattern = "--", value = TRUE). It's a double-edged sword: it's easy to write once you know the command you need, but it is often very difficult to figure out which command that is. Practice is the key, and so is note taking. You need time for that, so consider whether you have it. If not, just use R as your little spreadsheet from time to time until you get better with it. Keep a note or doc with useful R commands and you will find that with a few commands plus a few plotting commands you are a small king in its realm. This grep example is only one of a million crazy abilities; add matrix manipulation, plotting and RStudio and you will be doing analytics on data like crazy.
If you have no time for the above, I still highly recommend that you install RStudio and use it from time to time to get the hang of it. As far as I know there is nothing else so good for quick data analysis and quick statistics. Just give it a shot and try to replace your routine calculations and quick data manipulation tasks with it.
You can also move on and do machine learning in R. It has extremely powerful libraries for that (rpart, caret, e1071, …), and by all means, if you and your team are fluent with it, feel free to move on. Personally, I would use it only for exploration, quick analysis or quick models. I stop there. It can be very quick, but this is when I turn to language number 2, Python.
Use Python for small to medium sized data processing applications. Python, though it introduced optional type hints in recent releases (which is awesome), is an interpreted language (just like R), but it's more of a standard programming language, so you get the great benefit of speed of programming: you just write your code and run it. The caveat is that you don't have the amazing compiler and features (the good ones, not the kitchen-sink ones) of Scala.
So, as long as your project is small to medium sized, Python is going to be very helpful: you will use NLTK, matplotlib, NumPy and pandas, and you will have a great time learning and using them. This will take you on the fast route to machine learning, with great examples bundled in the libraries.
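To give a feel for the "write your code and run" style together with the type hints mentioned above, here is a minimal sketch using only the standard library. The function name top_words and the sample text are made up for illustration. Keep in mind that Python does not enforce these hints at runtime; a separate checker such as mypy has to be run over the code, which is part of why I say you don't get a Scala-style compiler.

```python
from __future__ import annotations

from collections import Counter


def top_words(text: str, n: int = 3) -> list[tuple[str, int]]:
    """Return the n most common lowercase words in text with their counts.

    The annotations document intent and help external type checkers;
    the interpreter itself will not reject a wrongly typed argument.
    """
    words = text.lower().split()
    return Counter(words).most_common(n)


if __name__ == "__main__":
    # Hypothetical sample input, just to have something to run.
    print(top_words("spark spark spark pandas pandas numpy", n=2))
```

Save it as a file, run it with the python interpreter, and you have a result. No build step and no project setup, which is exactly the speed I like for small projects.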
I'm not saying you cannot do it with R or Scala with great success. I'm saying that for my personal use this is the most intuitive way, and I use each language for what it's best at.
If I want a quick analysis of a CSV, I turn to R. If I want a bulletproof, fast app that scales over time, I use Scala. If my project is expected to be a big one with many developers, this is where I turn to language/framework number 3 - Java/Scala.
Use Scala (or Java) for larger, robust projects to ease maintenance. While many would argue that Scala is bad for maintenance, I would argue that it's not necessarily the case. Java and Scala are strongly typed and compiled, which makes them great languages for large scale work. You have Spark and OpenNLP libraries for your machine learning and big data. They are robust and they work at scale. It's true that it takes longer to code than in Python, but maintenance and onboarding of new personnel are easier, at least in my cases:
Data is modeled with case classes.
Proper function signatures.
Proper separation of concerns.
While all of this could be applied in any of the languages above, it comes more naturally with Scala/Java.
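To make those three points concrete without switching languages mid-post, here is a rough Python sketch of the same ideas. It is not Scala and it is not taken from a real project: the LogEntry record, the "LEVEL message" line format and both function names are assumptions made up for the example. A frozen dataclass stands in for a case class, every function has a typed signature, and parsing is kept apart from aggregation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """Immutable record, roughly the role a case class plays in Scala."""
    level: str
    message: str


def parse_line(line: str) -> LogEntry:
    """Parsing only: split a hypothetical 'LEVEL message' line in two."""
    level, _, message = line.partition(" ")
    return LogEntry(level=level, message=message)


def count_by_level(entries: list[LogEntry]) -> dict[str, int]:
    """Aggregation only: knows nothing about the raw line format."""
    counts: dict[str, int] = {}
    for entry in entries:
        counts[entry.level] = counts.get(entry.level, 0) + 1
    return counts


if __name__ == "__main__":
    lines = ["ERROR disk full", "INFO started", "ERROR timeout"]
    print(count_by_level([parse_line(line) for line in lines]))
```

In Python all of this is discipline you have to choose to follow and check with extra tools, while in Scala/Java the compiler pushes you in that direction, which is the point I am making above.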
But if you don't have time for all of the above, or you want to work with them all, here is the short version of what I would do:
R - Research, plot, data analysis.
Python - small/medium scale project to build models and analyze data, fast startup or small team.
Scala/Java - Robust programming with many developers and teams. Fewer machine learning utilities than Python and R, but they make up for it with easier code maintenance across many developers and teams.
It's a challenge to learn them all, and I'm still in the middle of it. It's a true headache, but in the end you benefit. If you want only one of them, I would ask:
- Am I managing a project with many teams and many workers, where speed is not the topmost priority and stability is? Java/Scala.
- Do I have a few personal projects, or a startup that needs quick results and quick machine learning? Python.
- Do I just want to hack on my laptop, improve on my spreadsheet data analysis and build my machine learning skills? R.