Python Isn't Going Anywhere

#python #machinelearning

I've seen a few articles recently predicting the demise of Python for machine learning and data science in favor of the faster, the simpler, the better-for-all-things-machine-learning Julia language. I've heard it mentioned in meetings at work and at a recent conference I attended. This article is one example. Every time I hear it or see it I have a pretty visceral reaction.

I don't buy it.

At the time, in the moment, I didn't have anything like logic backing me up on that; it was just a feeling. I've spent a couple of days thinking it through and I'm convinced that my skepticism about the impending demise of Python is warranted. I really don't buy it. Here's why.

Okay, real quick: obviously this is an opinion piece and I'm just one person. But I've been doing ML - and specifically ML in production - for a while now, so naturally I've got some thoughts and arguments behind my feelings. I'm not trying to start a flame war. As you'll see shortly, I don't care about any programming language at all. I care about deployed models and shipped products. That's why I'm skeptical.

Julia is Better (for Models)

I'll start by saying this: Julia is a better language for data science and machine learning. It's really really fast. It's very expressive, combining the simplicity of Python with the metaprogramming capabilities of R and LISP-y languages. It's really pleasant to work with. At the end of the day it's a technical language, closer to Matlab / R than to Python. That's what makes it more effective than Python for building high-powered machine learning algorithms. That's also why it won't unseat Python.

Technical languages are specialized. That's kind of the point - it's faster and easier to build your model / algorithm in a language designed for models and algorithms. However, models don't make money. Deployed models make money. And that's where the technical languages come up short.

Python makes money tho

Deploying a model is an immense amount of work, and a very significant and very challenging part of that work doesn't involve the model at all. You need a web server, containers, database connections, monitoring, CI/CD, package and version management ... you get the idea. That's all the stuff that the software engineers (or if you're lucky, machine learning engineers) ~~have~~ get to deal with and solve. Your company is probably not paying data scientists to do that work. For one thing, a data scientist who can do the work is really hard to find. The practical aspects of using software to make money are not part of the standard data science curriculum. Also, most data scientists just don't want to do it. That's fair.

But is your software engineering team going to learn your language that's optimized for machine learning, taking on the very significant risk of deploying something that they're not only unfamiliar with, but that also lacks the tooling for all that stuff I mentioned above?

Python has all that stuff. And it's been there for years. The reason is simple: Python isn't a technical language. Immediately that means there are more web services and products you use on a daily basis running on Python than Matlab, R and Julia combined, multiplied by at least 100, probably a whole lot more. There are significantly more Python developers out there than data scientists. And most of them probably know more about shipping software - meaning making money with software - than the average data scientist.

So which is more economical: developing machine learning libraries in Python so your models can plug right in to all that stuff without rewriting it, or implementing web servers, security / authentication, CI/CD and testing, deployment, monitoring and alerting, etc. in the Best Technical Language Evar?
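To make "plug right in" a bit more concrete, here's a minimal sketch using only the Python standard library. The `predict` function is a hypothetical stand-in for a trained model (a fixed linear scorer I made up for illustration), and `app` is a plain WSGI callable, the interface most Python web servers already speak. A real deployment would look different; the point is only how little glue sits between a model and a web endpoint.

```python
import json


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear scorer."""
    weights = [0.5, -0.25]  # made-up weights, for illustration only
    return sum(w * x for w, x in zip(weights, features))


def app(environ, start_response):
    """WSGI app: read a JSON list of features from the body, return a score."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    features = json.loads(environ["wsgi.input"].read(length))
    body = json.dumps({"score": predict(features)}).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]
```

For local experiments, something like `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` would serve it, and the same callable can be handed to a production WSGI server without touching the model code.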

What would it take?

So what would it take for Julia, or whatever technical language of the future, to dethrone Python? I can think of three things, only one of which seems remotely possible.

1. Julia is so much better than Python that Python isn't worth learning. No data scientists learn Python, so companies that want Data Science Money have to adopt Julia. Julia wins.
2. Some new machine learning hotness comes along that is implemented in Julia first. Because Julia is so much better for this sort of thing, companies eat the cost of adopting and deploying it to use the Hot New Thing in machine learning. It takes too long for Python to get it, and Python for DS and ML gets dusted.
3. Software gets released that makes Julia easy and fast to interoperate with Python. Models get developed in Julia, and are deployed with Python (or whatever ... doesn't matter), and nobody knows the difference or cares. All internet language flame wars cease. Pandas are no longer endangered, but the pandas library is.

Julia is way better

Point numero uno: Julia exists right now and is competing with Python right now. Is it really that much easier? Yes it's easier and yes it's simpler. I can import Python packages directly into Julia and can get basically the best of both worlds. But is it so much better that companies are willing to shell out money?

Compared with other general-purpose languages like Java or C, the answer is yes. It's hard to write prototype software in those languages. Machine learning needs fast iteration cycles to work, and Java / C don't cut it. Development is too slow. Not for Python though. Python meets the basic requirement of being fast enough (mostly because machine learning libraries are actually written in C with Python bindings) to make the work happen and flexible enough for prototyping. It's also got all the production bells and whistles needed to get the software out into the world and making money. Because of that it's not hard for industry as a whole to tell data scientists to suck it up and index from zero.
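If "written in C with Python bindings" sounds abstract, here's a tiny sketch of the idea using the standard library's `ctypes` module to call `sqrt` from the system's C math library. Real ML libraries use heavier binding machinery than this, and the library lookup below assumes a Unix-like system (Linux or macOS), so treat it as an illustration of the pattern, not of how any particular library does it. `c_sqrt` is just a name I picked.

```python
import ctypes
import ctypes.util

# Locate and load the C math library. The lookup is platform-dependent;
# this sketch assumes a Unix-like system.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double sqrt(double).
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double


def c_sqrt(x: float) -> float:
    """Python-facing wrapper; the arithmetic itself runs in compiled C."""
    return libm.sqrt(x)


print(c_sqrt(9.0))
```

The caller only ever sees a Python function. That's the whole trick: the slow language handles the glue, and the fast language does the number crunching.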

New hotness, just for Julia

This actually happened, just not with Julia. When deep learning became the sweet hotness that all companies needed, there wasn't much software that could do this stuff efficiently. Early implementer advantage went to this thing called Torch. When industry started exploring and deploying deep learning, Torch was there. Torch is scripted in Lua (on top of a C core): a fast, simple but fairly specialized and not widely adopted language. Did the world pivot to Lua so we could get deep learning?

No. Python ate deep learning. Facebook put a Python front end on Torch's C libraries and made PyTorch. The reason Python ate deep learning (and will probably eat the Next Hot Thing in ML too) is simple. Shipped software is the dog. ML is the tail. The tail does not wag the dog. No matter how popular data science gets, there will always be more developers than data scientists because software developers get the software making money. A one-time investment in porting a library or model to Python (which again is hugely flexible because it can bind to superfast C libraries) is much cheaper than building a dev team and all the associated tooling to deploy in a specialized language.

Everyone plays nice

The final path is plausible. If we can guarantee a straightforward and efficient interoperation between Julia and Python (or really whatever runtime we want to deploy in) then presumably it won't matter which language the model is built in. This is kind of starting to happen already. In the data engineering world, Apache Spark is king. Its core is written in Scala, which means it runs on the JVM. It has Python bindings, including user defined functions. For a long time Python UDFs were the slowest thing on the block in the Spark world, because executing arbitrary Python code from a Java runtime meant copying and transferring data via (essentially) shell pipes. Then there came this little feature called the Pandas UDF, which lets Python execute in Spark without that expensive row-by-row copying across runtimes. How? A piece of magic called Apache Arrow.

Apache Arrow is an in-memory representation for columnar data that is standard across runtimes. That means that I can use Java bindings to read Arrow data frames generated by Python, or vice versa. Or I can use Julia to generate a data frame and share it efficiently with the Python runtime that's doing the web service thing. I actually think Arrow is the most important open source project in the data science and machine learning space precisely because it will remove critical efficiency issues between the tooling ecosystems for data engineering, data analysis, model development, and model deployment. If Julia's going to be supreme monarch of data science and machine learning, this is probably how it would happen. That said, right now Python has the most libraries. Am I really going to use Julia to import Python packages only to export the results back to Python?
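To be clear, the sketch below is not Arrow. It's a standard-library analogy for the underlying idea: if two pieces of code agree on how a column of numbers is laid out in memory, the second one can read the first one's data without copying or deserializing it. The names `prices`, `raw_bytes` and `consumer_view` are made up for illustration, and both "sides" here live in one Python process, which Arrow of course goes well beyond.

```python
from array import array

# One "column" of float64 values in a single contiguous buffer, which is
# roughly how a columnar format lays out data.
prices = array("d", [9.5, 12.0, 7.25])

# The "producer" hands out its raw bytes. memoryview does not copy them.
raw_bytes = memoryview(prices).cast("B")

# The "consumer" knows the agreed layout (native float64), so it can
# reinterpret the very same bytes as numbers without deserializing anything.
consumer_view = raw_bytes.cast("d")

# A write on the producer side is visible to the consumer, which shows that
# both names point at one block of memory.
prices[0] = 10.0
print(consumer_view.tolist())
```

Swap "two views in one process" for "two language runtimes" and "native float64" for a published layout specification, and you have the gist of why a shared in-memory standard removes the copying tax.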

In Conclusion

Unseating Python is hard because it has one key advantage over technical languages like Julia: it isn't one. Most software isn't deployed with technical languages, but with general ones. And deployed software makes the money. That means it's more economical to move machine learning and data science to Python than to move everything else to Julia (or Matlab, or R). Until some tool like Arrow comes along that enables these runtimes to work together so that nobody has to know or care what made the model, I don't think Python is going anywhere.