Machine Learning competition & research code sucks. What to do about it?
Having used code from Kaggle competitions a few times already, we realized it isn’t all rainbows, unicorns, and leprechauns. It’s more like a Frankenstein: a work made of parts of other works, glued together and badly integrated. Machine learning competition code, like machine learning research code, often suffers from deep architectural issues. What to do about it? Using neat design patterns is the solution.
It’s so common to see code from Kaggle competitions that lacks the right abstractions for later deploying the pipeline to production - and for a logical reason: Kagglers have no incentive to prepare their code for production, as they only need to win the competition by generating good results. The same goes for most research code behind published papers: often, beating the benchmarks is all the coders are after. The situation is roughly the same in academia, where researchers too often just try to get results to publish a paper and ditch the code afterwards. Moreover, many machine learning coders didn’t learn to code properly in the first place, which makes things harder.
Here are a few examples of bad patterns we’ve seen:
- Coding a pipeline as a bunch of small “main” files to be executed manually in a certain order, some in parallel and some one after the other. Yes, we saw that many times.
- Forcing disk persistence between the aforementioned small “main” files, which strongly couples the code to the storage and makes it impossible to run the code without saving to disk. Yes, many times.
- Making the disk persistence mechanism different for each of those small “main” files. For example, using a mix of JSON, then CSV, then Pickles, sometimes HDF5, and otherwise raw numpy array dumps. Or even worse: mixing up many databases and bigger frameworks instead of keeping it simple and writing to disk. Yikes! Keep it simple!
- Providing no instructions whatsoever on how to run things in the right order. Bullcrap is left as an exercise for the reader.
- Having no unit tests, or unit tests that do test the algorithm but also require writing to disk or reading what was already written to disk. Ugh. And by the time you execute that untested code again, you end up with an updated dependency for which no pinned version was provided, and nothing works as it did anymore.
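One way to avoid several of the patterns above is to keep each step a plain function working on in-memory data, and to push file I/O into thin wrappers at the edges. The sketch below is only an illustration under that assumption: the names `clean_rows`, `load_rows` and `save_rows` are hypothetical, and the CSV columns are made up. Because the wrappers accept any text stream, the unit test runs on an in-memory buffer and never touches the disk.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO


def clean_rows(rows: Iterable[Dict[str, str]]) -> List[dict]:
    """Pure step: drop rows with a missing label and cast the feature to float."""
    cleaned = []
    for row in rows:
        if row.get("label") in (None, ""):
            continue
        cleaned.append({"feature": float(row["feature"]), "label": row["label"]})
    return cleaned


def load_rows(stream: TextIO) -> List[Dict[str, str]]:
    """I/O wrapper: reads CSV from any text stream (an open file or a buffer)."""
    return list(csv.DictReader(stream))


def save_rows(rows: List[dict], stream: TextIO) -> None:
    """I/O wrapper: writes CSV to any text stream."""
    writer = csv.DictWriter(stream, fieldnames=["feature", "label"])
    writer.writeheader()
    writer.writerows(rows)


def test_clean_rows_without_disk() -> None:
    # The "file" lives in memory, so the test needs no fixtures on disk.
    raw = io.StringIO("feature,label\n1.5,cat\n2.0,\n3,dog\n")
    result = clean_rows(load_rows(raw))
    assert result == [
        {"feature": 1.5, "label": "cat"},
        {"feature": 3.0, "label": "dog"},
    ]


test_clean_rows_without_disk()
```

In a real script, the same functions would simply be called with `open(path)` handles, so saving to disk becomes a choice made by the caller rather than something baked into every step.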
Those bad patterns don’t only apply to code written in programming competition environments (such as this code of mine written in a rush - yes, I can do it too when unavoidably pressured). Here are some examples of code with checkpoints using the disk:
- Most winning Kaggle competition code. We dove into such code many times, and we never got to see the proper abstractions.
- BERT. Bear with me - just try to refactor “run_squad.py” for a second, and you’ll realize that all levels of abstraction are coupled together. To name a few, the console argument parsing logic sits at the same level as the model definition logic, and the code is full of global flag variables. On top of that, the data loading and saving logic that uses the cloud is mixed in as well, all in one huge file of more than 1k lines of code in a small project.
- FastText. The Python API is made for loading text files from disk, training on them, and dumping the result to disk. Couldn’t dumping to disk and using text files as training input be optional?
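The coupling described in these examples can be avoided by giving each concern its own layer. The following is a hypothetical sketch, not how BERT or FastText are actually structured: `TrainConfig`, `parse_args`, `train` and `main` are names invented for illustration, and the “model” is a toy word counter. The point is that only `main` knows about both the console and the disk, settings travel in a plain object instead of global flags, and saving is opt-in.

```python
import argparse
import json
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class TrainConfig:
    """Plain settings object: replaces global flag variables."""
    min_count: int = 1
    output_path: Optional[str] = None  # None means "do not write to disk"


def parse_args(argv: List[str]) -> TrainConfig:
    """Console layer: the only place that knows about command-line flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-count", type=int, default=1)
    parser.add_argument("--output-path", default=None)
    args = parser.parse_args(argv)
    return TrainConfig(min_count=args.min_count, output_path=args.output_path)


def train(sentences: Iterable[str], config: TrainConfig) -> Dict[str, int]:
    """Model layer: works on in-memory data, knows nothing about files or flags."""
    counts = Counter(word for s in sentences for word in s.split())
    return {w: c for w, c in counts.items() if c >= config.min_count}


def main(argv: List[str], sentences: Iterable[str]) -> Dict[str, int]:
    """Glue layer: wires the pieces together; persistence is opt-in."""
    config = parse_args(argv)
    model = train(sentences, config)
    if config.output_path is not None:
        with open(config.output_path, "w", encoding="utf-8") as f:
            json.dump(model, f)
    return model


# No --output-path is given, so nothing is written to disk.
model = main(["--min-count", "2"], ["a b a", "b c"])
```

With this split, `train` can be unit-tested or reused from another program without ever parsing a command line or creating a file.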
Companies can sometimes draw inspiration from code on Kaggle, but I’d advise them to code their own production-proof pipelines, as taking such competition code as-is is risky. There is a saying that competition code is the worst code for companies to use, and even that the people winning competitions are the worst ones to hire - because they write poor code.
I wouldn’t go as far as that saying (as I myself earn podiums in coding competitions most of the time) - it’s rather that competition code is written without thinking of the future, as the goal is to win. Ironically, it’s at that moment that reading Clean Code and Clean Coder gets important. Using good pipeline abstractions helps machine learning projects survive.
Here are the things you want when building a machine learning pipeline whose goal is to be sent to production:
- You ideally want a pipeline that can process your data by calling just one function, not lots of small executable files. You might enable some caching if things are too slow to run, but you keep caching as minimal as possible, and your caching might not be checkpoints exactly.
- You want the possibility of simply not using any data checkpoints between pipeline steps: you should be able to deactivate all your pipeline’s checkpoints easily. Checkpoints are good for training the model, debugging it, and actively coding the pipeline, but in production they are just heavy and must be easy to disable.
- You want the ability to scale your ML pipeline on a cluster of machines.
- Finally, you want the whole thing to be robust to errors and to make good predictions. Automatic Machine Learning can help here.
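The first two points in this list can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions, not a production design: the `Pipeline` class and the `scale` and `total` steps are hypothetical. The whole pipeline runs with one call to `run`, and checkpoints are just an optional mapping, so passing nothing disables them entirely. A `dict` keeps them in memory while debugging, and a disk-backed mapping such as a `shelve` shelf could be passed instead.

```python
import hashlib
import pickle
from typing import Any, Callable, List, MutableMapping, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class Pipeline:
    """Runs named steps in order; checkpoints are optional and off by default."""

    def __init__(self, steps: List[Step],
                 checkpoints: Optional[MutableMapping[str, Any]] = None) -> None:
        self.steps = steps
        # None disables checkpoints; any mutable mapping enables them.
        self.checkpoints = checkpoints

    def run(self, data: Any) -> Any:
        for name, func in self.steps:
            if self.checkpoints is None:
                data = func(data)
                continue
            # Key on the step name and its input, so a changed input is recomputed.
            key = name + ":" + hashlib.sha256(pickle.dumps(data)).hexdigest()
            if key not in self.checkpoints:
                self.checkpoints[key] = func(data)
            data = self.checkpoints[key]
        return data


calls = []  # records which steps really executed


def scale(xs):
    calls.append("scale")
    return [x * 2 for x in xs]


def total(xs):
    calls.append("total")
    return sum(xs)


steps = [("scale", scale), ("total", total)]

production = Pipeline(steps)                 # no checkpoints at all
debugging = Pipeline(steps, checkpoints={})  # same steps, cached in memory

production_result = production.run([1, 2, 3])
first = debugging.run([1, 2, 3])
second = debugging.run([1, 2, 3])  # served from the checkpoints, no step re-executes
```

The step functions are identical in both setups; only the object that wires them together decides whether anything is cached, which is what makes checkpoints easy to turn off.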
And if your goal is instead to continue to do competitions, please at least note that I started winning more competitions after learning how to write clean code. So the solutions above should apply in competitions as well: you’ll have more mental clarity and, as a result, more speed too, even if designing and thinking about your code beforehand seems to take a lot of precious time. You’ll start saving time quickly, even in the short run. Now that we’ve built a framework easing the process of writing clean pipelines, we have a hard time picturing how we’d ever go back to our previous habits. Clean must be the new habit.
In all cases, using good patterns and good practices will almost always save time, even in the short or medium term. One example is using the pipe and filter design pattern with Neuraxle.
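To make the pipe and filter idea concrete, here is a minimal plain-Python sketch. It is not the Neuraxle API: the `Filter`, `Center`, `Clip` and `Pipe` classes are hypothetical names chosen for illustration, loosely following the familiar fit/transform convention. Each filter does one thing, and the pipe chains filters by feeding the output of one into the next.

```python
from typing import List, Sequence


class Filter:
    """Hypothetical base filter: learns in fit, then maps data in transform."""

    def fit(self, data: Sequence[float]) -> "Filter":
        return self  # stateless filters have nothing to learn

    def transform(self, data: Sequence[float]) -> List[float]:
        raise NotImplementedError


class Center(Filter):
    """Subtracts the mean seen during fit."""

    def fit(self, data: Sequence[float]) -> "Center":
        self.mean_ = sum(data) / len(data)
        return self

    def transform(self, data: Sequence[float]) -> List[float]:
        return [x - self.mean_ for x in data]


class Clip(Filter):
    """Clamps every value into the range [low, high]."""

    def __init__(self, low: float, high: float) -> None:
        self.low, self.high = low, high

    def transform(self, data: Sequence[float]) -> List[float]:
        return [min(max(x, self.low), self.high) for x in data]


class Pipe(Filter):
    """A pipe is itself a filter, so pipes can be nested inside other pipes."""

    def __init__(self, filters: List[Filter]) -> None:
        self.filters = filters

    def fit(self, data: Sequence[float]) -> "Pipe":
        # Each filter is fitted on the output of the filters before it.
        for f in self.filters:
            data = f.fit(data).transform(data)
        return self

    def transform(self, data: Sequence[float]) -> List[float]:
        for f in self.filters:
            data = f.transform(data)
        return data


pipe = Pipe([Center(), Clip(-1.0, 1.0)]).fit([1.0, 2.0, 3.0])
result = pipe.transform([0.0, 2.0, 5.0])  # [-1.0, 0.0, 1.0]
```

Because every filter exposes the same two methods, steps can be added, removed, or reordered without touching the others, and the whole chain is still run with a single call.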
It’s hard to write good code when pressured by the deadline of a competition, and harder still with no incentive to build reusable and deployable code. We created Neuraxle so that good abstractions are easy to use even when in a rush. As a result, it’s a good thing to refactor competition code into Neuraxle code, and a good idea to write all your future code using a framework like Neuraxle.