I used to love Jupyter Notebooks, and I still think they are a wonderful tool for many tasks, like exploratory data analysis and presenting insights to colleagues nicely and easily. However, while they are great for data science some of the time, at other times they are a headache. Like any software tool, they have their downsides. Here are the five worst things about Jupyter Notebooks for data science:
Jupyter Notebooks are terrible for code versioning. The problem is that they are stored as JSON files: nested dictionaries that mix code, outputs, and metadata. When you try to diff two notebooks, you get a wall of noisy JSON rather than a readable code diff. This makes working in a team with several notebooks extremely tedious and difficult.
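A common workaround (tools like nbstripout automate this) is to strip outputs and execution counts before committing, so diffs only show code changes. A minimal sketch of the idea, using a hand-built notebook dictionary in the `.ipynb` JSON layout:

```python
import json

def strip_notebook(nb: dict) -> dict:
    """Remove outputs and execution counts so diffs show only code changes."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# A minimal notebook as Jupyter stores it on disk:
nb = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 42,
            "source": ["x = 1\n"],
            "outputs": [{"output_type": "execute_result",
                         "data": {"text/plain": ["1"]}}],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

clean = strip_notebook(json.loads(json.dumps(nb)))
print(clean["cells"][0]["outputs"])          # → []
print(clean["cells"][0]["execution_count"])  # → None
```

Even with stripping, cell-level JSON is still what gets versioned; it only makes the problem bearable, not solved.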
Jupyter Notebooks have a non-linear workflow: you can execute cells out of order, which can lead to confusion and errors. This is of course also one of Jupyter's big selling points, but it is mostly useful for early data analysis and exploration, and therefore ends up being a downside more often than not.
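To see why this bites, here is a toy model of a kernel (my own sketch, not how Jupyter is implemented): cells share one namespace, and the *run order*, not the cell order on screen, determines the result.

```python
# Three "cells" sharing one namespace, like a Jupyter kernel.
cells = {
    1: "x = 10",
    2: "y = x * 2",
    3: "x = 5",
}

def run(order):
    namespace = {}
    for i in order:
        exec(cells[i], namespace)
    return namespace["y"]

print(run([1, 2, 3]))  # top to bottom → 20
print(run([1, 3, 2]))  # cell 3 re-run before cell 2 → 10
```

The notebook looks identical in both cases; only the hidden execution history differs, which is exactly the kind of bug that is painful to track down.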
Jupyter is not well suited for running long, asynchronous tasks. This is because all cells in a notebook run in the same kernel, and the kernel executes one cell at a time. If one cell is running a long task, it blocks the execution of every other cell.
This can be a major problem when you're working with data that takes a long time to process, or when you're working with real-time data that needs to be updated regularly. In these cases, it can be much better to use a tool like Dask, which is designed for parallel computing.
Jupyter can be slow to start up, and it can be slow to execute code. This is because Jupyter is an interactive tool: it loads the entire notebook, including any stored outputs, into memory in order to provide its interactive features.
If you're working with large data sets or large notebooks, this can be a major problem. Jupyter is simply not designed for large data sets.
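Part of the bloat comes from outputs being embedded directly in the JSON file: a base64-encoded plot lives inside the `.ipynb` next to the one line of code that produced it. A hypothetical illustration (the "image" here is fake padding, just to show the size effect):

```python
import json

# One cell whose output is a stand-in for ~200 KB of base64 image data,
# the way Jupyter embeds plots directly in the .ipynb JSON.
fake_png = "A" * 200_000
nb = {
    "cells": [{
        "cell_type": "code",
        "execution_count": 1,
        "source": ["plot_something()\n"],  # hypothetical plotting call
        "outputs": [{"output_type": "display_data",
                     "data": {"image/png": fake_png}}],
    }],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}

with_outputs = len(json.dumps(nb))
nb["cells"][0]["outputs"] = []
without_outputs = len(json.dumps(nb))

print(with_outputs, without_outputs)  # one line of code, >200 KB of file
```

Multiply that by a few dozen plots and the notebook itself, not just the data, becomes the thing that is slow to load.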
This is just my opinion, but not having linting and code styling warnings is a big downside for Jupyter. IDE features are simply too convenient: the ability to jump between function declarations, code styling, and other features make it a lesser developer experience compared to a full-fledged IDE.
Now, this is a bit of a lie, because I have been using Jupyter through PyCharm Professional; being able to use PyCharm's debugger in cells is often the best of both worlds.
It's often important to consider where computations are run. For code that's easy to put into Docker, deploying to a cloud solution is straightforward. For notebooks there are also good options, though they tend to lock you into specific solutions.
If you want to run Jupyter notebooks in the cloud, Amazon SageMaker and Kubeflow are both worth looking into.
In conclusion, Jupyter Notebooks are not the ideal tool for data science projects. They are ideal for prototyping, but for your own sanity, migrate away from them before writing serious production code.