Dang Hoang Nhu Nguyen


[BTY] Day 10: Real-time machine learning: challenges and solutions - Huyen Chip

This post captures many data scientists’ pain points, mine included, about how to move our ML pipelines to real time.

She does a great job of outlining solutions for (1) online prediction and (2) continual learning, with step-by-step use cases, considerations, and the technologies required at each level.

Take a look: https://huyenchip.com/2022/01/02/real-time-machine-learning-challenges-and-solutions.html#continual-learning

Notes

I leave here some notes for my future self. 🤗

Online Prediction

To move your ML workflow to the "Online prediction with complex streaming + batch features" stage, you have to prepare an efficient stream processing engine, a feature store, and a model store. How do they fit together at serving time?
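
As a mental model, here is a minimal Python sketch of serving with combined features. The `FeatureStoreClient`, its methods, and the feature names are hypothetical stand-ins, not a real feature store API (a real system would use something like Feast or Tecton):

```python
# Minimal sketch of assembling batch + streaming features at request time.
# FeatureStoreClient and all feature names are hypothetical placeholders.

class FeatureStoreClient:
    """Stand-in for a real feature store client."""

    def get_batch_features(self, user_id: str) -> dict:
        # Precomputed periodically (e.g. daily), served from an online store.
        return {"lifetime_orders": 42, "avg_basket_value": 31.5}

    def get_streaming_features(self, user_id: str) -> dict:
        # Computed by a stream processing engine over recent events.
        return {"clicks_last_10min": 7, "cart_value_now": 18.0}

def build_feature_vector(store: FeatureStoreClient, user_id: str) -> dict:
    """Join batch and streaming features for one prediction request."""
    features = store.get_batch_features(user_id)
    features.update(store.get_streaming_features(user_id))
    return features

if __name__ == "__main__":
    store = FeatureStoreClient()
    print(build_feature_vector(store, "user_123"))
```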

Bandits for model evaluation: this method has been shown to be far more data-efficient than A/B testing. In many cases, bandits are even optimal.
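
To make the idea concrete, here is a toy Thompson-sampling router over two model versions. The models, reward signal, and success rates are invented for illustration; unlike a fixed 50/50 A/B split, traffic shifts toward the better model as evidence accumulates:

```python
# Toy Thompson sampling for model evaluation: route each request to a
# model version sampled from its Beta posterior over success rate.
import random

class ThompsonRouter:
    def __init__(self, arms):
        # Beta(1, 1) prior per model: [successes + 1, failures + 1].
        self.stats = {arm: [1, 1] for arm in arms}

    def pick_model(self) -> str:
        # Sample a plausible success rate per arm; serve the best draw.
        draws = {a: random.betavariate(s[0], s[1]) for a, s in self.stats.items()}
        return max(draws, key=draws.get)

    def record(self, arm: str, success: bool) -> None:
        self.stats[arm][0 if success else 1] += 1

router = ThompsonRouter(["model_v1", "model_v2"])
for _ in range(1000):
    arm = router.pick_model()
    # Hypothetical reward: model_v2 is actually better (60% vs 50%).
    router.record(arm, random.random() < (0.6 if arm == "model_v2" else 0.5))
print(router.stats)  # most traffic should have gone to model_v2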

Contextual bandits as an exploration strategy?

Continual Learning

Most companies do stateless retraining – the model is trained from scratch each time. Continual learning means allowing stateful training – the model continues training on new data (fine-tuning).
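
A minimal sketch of the difference, using scikit-learn's SGDClassifier (one of the estimators that supports incremental updates via partial_fit) on synthetic data:

```python
# Stateless retraining vs. stateful training, on synthetic data.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)
X_new, y_new = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)

# Stateless: fit a fresh model from scratch on the full data each time.
stateless = SGDClassifier(random_state=0)
stateless.fit(np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))

# Stateful: keep the fitted model and continue training on new data only.
stateful = SGDClassifier(random_state=0)
stateful.fit(X_old, y_old)          # the previous (base) training run
stateful.partial_fit(X_new, y_new)  # fine-tune on the fresh batch
```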

Stage 1. Manual, Stateless Retraining
Any company that adopted ML less than 3 years ago and doesn’t have an ML platform team is in this stage.

Stage 2. Automated Retraining
For most companies, the retraining frequency is set based on gut feeling – e.g. “once a day seems about right” or “let’s kick off the retraining process each night when we have idle compute”.

  • Different models in your pipeline might require different retraining schedules.

  • You’ll need a model store to automatically version and store all the code/artifacts needed to reproduce a model (e.g. an S3 bucket, SageMaker, or MLflow).

  • You’ll need to write scripts to automate your workflow and configure your infrastructure to automatically sample your data, extract features, and process/annotate labels for retraining (a cron scheduler such as Airflow or Argo); a minimal sketch follows.
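
For instance, a nightly retraining pipeline in Airflow might look like the following. The three task functions are placeholders for your own data-sampling, feature-extraction, and training scripts, and the sketch assumes Airflow 2.x:

```python
# Hypothetical nightly retraining DAG (Airflow 2.x). The task bodies
# are placeholders for real sampling/feature/training scripts.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def sample_data():
    pass  # pull a fresh window of training data

def extract_features():
    pass  # compute/fetch features for the sampled window

def retrain_model():
    pass  # train, evaluate, and push artifacts to the model store

with DAG(
    dag_id="nightly_retraining",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",  # the "once a day seems about right" cadence
    catchup=False,
) as dag:
    sample = PythonOperator(task_id="sample_data", python_callable=sample_data)
    features = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="retrain_model", python_callable=retrain_model)

    sample >> features >> train
```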

Log and wait (feature reuse)?

Stage 3. Automated, Stateful Training
The main thing you need at this stage is a better model store.

  • Model lineage: you want to not just version your models but also track their lineage – which model fine-tunes on which model (see the sketch after this list).

  • Streaming feature reproducibility: you want to be able to time-travel – extract streaming features as they were in the past and recreate the training data for your models at any earlier point, in case something happens and you need to debug.
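
Here is a toy in-memory model-store record that captures both ideas – a parent pointer for lineage and a data-snapshot pointer for reproducibility. Every name is hypothetical; a real model store would persist this in a database:

```python
# Toy model-store records with lineage tracking: each entry points at
# the checkpoint it was fine-tuned from, so the chain can be walked back.
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ModelRecord:
    model_id: str
    parent_id: Optional[str]  # None for a from-scratch base model
    data_snapshot: str        # pointer used for time-travel reproducibility
    code_version: str

registry: Dict[str, ModelRecord] = {}

def register(record: ModelRecord) -> None:
    registry[record.model_id] = record

def lineage(model_id: str) -> List[str]:
    """Walk parent pointers back to the original base model."""
    chain, current = [], model_id
    while current is not None:
        chain.append(current)
        current = registry[current].parent_id
    return chain

register(ModelRecord("base-v1", None, "s3://features/2022-01-01", "git:abc123"))
register(ModelRecord("finetune-v2", "base-v1", "s3://features/2022-01-02", "git:def456"))
print(lineage("finetune-v2"))  # ['finetune-v2', 'base-v1']
```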

Stage 4. Continual Learning
Instead of updating your models based on a fixed schedule, continually update your model whenever data distributions shift and the model’s performance plummets.
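
As one concrete example of a drift-based trigger, a two-sample Kolmogorov–Smirnov test can compare a feature's training-time distribution against a recent production window. The feature, windows, and threshold here are invented:

```python
# Sketch of a drift-based trigger: kick off retraining when a recent
# window of a feature diverges from the reference (training) window.
import numpy as np
from scipy.stats import ks_2samp

def should_retrain(reference: np.ndarray, recent: np.ndarray,
                   p_threshold: float = 0.01) -> bool:
    stat, p_value = ks_2samp(reference, recent)
    return p_value < p_threshold  # small p => distributions likely differ

reference = np.random.normal(0.0, 1.0, 5000)  # training-time distribution
recent = np.random.normal(0.5, 1.0, 5000)     # shifted production data
if should_retrain(reference, recent):
    print("drift detected -- trigger a model update")
```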

The holy grail is when you combine continual learning with edge deployment.

You’ll need the following:

  • A mechanism to trigger model updates. This trigger can be time-based (e.g. every 5 minutes), performance-based (e.g. whenever model performance plummets), or drift-based (e.g. whenever data distributions shift).

  • Better ways to continually evaluate your models. The hard part is ensuring that the updated model is working properly. Because you’re updating your models to adapt to changing environments, a stationary test set no longer suffices. You might want to incorporate backtests, progressive evaluation, and tests in production, including shadow deployment (sketched after this list), A/B testing, canary analysis, and bandits.

  • An orchestrator to automatically spin up instances to update and evaluate your models without interrupting the existing prediction service.
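
A bare-bones sketch of the shadow-deployment idea mentioned above: the candidate model scores every request, but its predictions are only logged for offline comparison, never served. The models here are hypothetical callables:

```python
# Shadow deployment sketch: the live model serves users; the shadow
# (candidate) model's predictions are logged only, so a failure or a
# bad prediction from the candidate never affects user traffic.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def serve(features, live_model, shadow_model):
    """Return the live prediction; log the shadow prediction for comparison."""
    live_pred = live_model(features)  # this is what the user sees
    try:
        shadow_pred = shadow_model(features)  # logged only, never served
        log.info("live=%s shadow=%s features=%s", live_pred, shadow_pred, features)
    except Exception:
        log.exception("shadow model failed; user traffic unaffected")
    return live_pred

# Hypothetical models: the shadow is a candidate from stateful retraining.
def live(features):
    return int(sum(features) > 1.0)

def shadow(features):
    return int(sum(features) > 0.8)

print(serve([0.4, 0.5], live, shadow))
```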
