Julien Simon

Originally published at julsimon.Medium

Machine Learning Datastores Are Coming

They are. And it’s about damn time. Read on!


(Illustration credit: Creative Commons)

Relationalaurus Rex

Once upon a time, enterprise data was tabular, largely static, and low volume. Relational databases ruled the Earth. IT teams relied on them to ingest and index data that their business applications could query. Application workflows looked something like this:

  1. A business application receives data through its user interface, or Electronic Data Interchange (EDI).
  2. Said business application writes that data in a relational database, using a well-defined schema.
  3. Other applications retrieve that data with queries written in the ubiquitous Structured Query Language (SQL), either for transactional or analytical purposes.
  4. More often than not, applications post-process results with “business rules”, i.e. custom code applying particular business practices, and proceed with doing whatever it is they do.

Over time, we added caching, connection pools, object-relational mappers (ORM, urgh), and other “middleware” that would supposedly simplify our life. Mostly, that stuff ended up making database vendors richer. Still, life was reasonably good and predictable, and the IT world hummed along.

Out of the blue, the user-generated content asteroid hit. A 1,000-foot tsunami swept across IT platforms. Massive, relentless, unstructured, ever-changing. Marketers worldwide went crazy with new monetization ideas based on extracting insights and signals from that content.

However, clicks, likes, emails, tweets, product reviews, images, and videos broke all existing assumptions. Their diversity and volume made it clear that the traditional way of storing and processing data would quickly cease to function. IT teams rushed to design and implement solutions that could help them cope, using new tools like object storage, Extract-Transform-Load (ETL), NoSQL, MapReduce, and more.

This approach was reasonably successful for analytics on unstructured text data (say, web server logs containing product or banner clicks). Soon enough, we were back in charted waters, where we could write and run SQL-like queries on billions of rows to build nice dashboards and reports.

That was only a partial victory. Images, videos and speech didn’t easily fit in. And whatever the data modality, delivering real-time insights to business applications was insanely difficult.

Enter Machine Learning

Machine Learning (ML) lets us replace a big pile of data with a much smaller statistical model, making it much faster to get answers. It also grants us the gift of second sight: we can now gaze into the crystal ball and predict in milliseconds that our loyal customer Bob wants to buy a new barbecue grill, so let’s recommend him one.

Thanks to Deep Learning (DL), the uncanny child of neural networks and Graphics Processing Units (GPU), we can extend that predictive power to pretty much any form of unstructured data: images, videos, audio, speech, protein sequences, and more.

ML and DL are truly amazing, but how do they impact the way we design and build business applications? Quite a lot, as we now have to build models! A typical ML training workflow goes something like this:

  1. Ingest user and company data from different sources: relational databases, web server logs, social networks, anything goes!
  2. Clean it, organize it and dump it all in a “data lake” (ooooooh).
  3. Send your bright data scientists to sail on that lake in search of a new Eldorado, hoping that they come back with a full cargo of high-value data they can solve real-life business problems with.
  4. Train models on that data.
  5. Deploy these models.
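
To make steps 4 and 5 concrete, here's a minimal sketch with scikit-learn, assuming the data scientists came back from the lake with a flat CSV file (the file name and columns are invented):

```python
# Minimal sketch of steps 4-5: train and persist a model on curated data.
# The "curated_customers.csv" file and the "churned" label column are made up.
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("curated_customers.csv")            # data pulled back from the lake
X, y = df.drop(columns=["churned"]), df["churned"]   # features vs. label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

joblib.dump(model, "churn_model.joblib")              # hand off to deployment
```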

Of course, the next step is to let applications predict with these models, either in real-time or in batch mode. Let’s start with the former:

  1. An application receives data through a user interface or an API and logs it for further use (backtesting, training, etc.).
  2. It reads additional data from a low-latency datastore: user preferences, extra features required for prediction, etc.
  3. It pre-processes the data and sends it for prediction to one or more models.
  4. It post-processes results, writes them in a datastore, and displays (or sends) them to the caller.
  5. If available, it captures and stores user feedback for further use (“was this useful?”, “did this answer your question?”, etc.).
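
In application code, that real-time loop might look like the sketch below. The helper names, Redis keys, and feature columns are all made up; only the overall shape of the workflow comes from the steps above.

```python
# Hypothetical real-time prediction handler: log input, fetch extra features
# from a low-latency store, predict, store the result, capture feedback.
import json
import joblib
import redis

cache = redis.Redis(host="localhost", port=6379)
model = joblib.load("churn_model.joblib")              # model trained earlier

def handle_request(user_id: str, payload: dict) -> dict:
    cache.rpush("raw_events", json.dumps(payload))            # 1. log for later training
    prefs = json.loads(cache.get(f"user:{user_id}") or "{}")  # 2. low-latency features
    features = [[payload["basket_value"], prefs.get("visits", 0)]]   # 3. pre-process
    score = float(model.predict_proba(features)[0][1])               #    and predict
    cache.set(f"prediction:{user_id}", score)                 # 4. store and return result
    return {"user_id": user_id, "churn_score": score}

def record_feedback(user_id: str, answer: str) -> None:
    cache.rpush("feedback", json.dumps({"user": user_id, "answer": answer}))  # 5. keep for retraining
```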

Batch prediction is a bit different:

  1. A scheduler starts a batch processing job.
  2. The job reads lots of data from one or more datastores. It could be as simple as pulling rows from a single relational database or as complex as an ETL process loading and joining data from different datastores.
  3. The job predicts each item with one or more models.
  4. The job writes the results in a datastore.
  5. Business applications read results and use them in their workflows.
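
The batch version is the same idea with a scheduler in front and a bulk write at the end. Here's a hedged sketch with pandas and SQLAlchemy, with invented table and column names:

```python
# Hypothetical nightly scoring job, meant to be triggered by cron/Airflow (step 1).
import joblib
import pandas as pd
from sqlalchemy import create_engine

def main() -> None:
    engine = create_engine("postgresql://user:pwd@db/shop")
    df = pd.read_sql("SELECT user_id, basket_value, visits FROM activity", engine)  # 2. read

    model = joblib.load("churn_model.joblib")
    df["churn_score"] = model.predict_proba(df[["basket_value", "visits"]])[:, 1]   # 3. predict

    df[["user_id", "churn_score"]].to_sql(                                          # 4. write results
        "churn_scores", engine, if_exists="replace", index=False
    )
    # 5. downstream business applications simply query the churn_scores table

if __name__ == "__main__":
    main()
```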

That’s where we are today. This new way of building applications is spawning a new industry, which we decided to call MLOps: data science platforms, feature stores, training and deployment services, orchestration tools, etc. Indeed, from data preparation to training and predicting, workflows are getting increasingly complex, with many moving parts and way too much IT plumbing. Meanwhile, vendors make money, and Lamborghini dealerships rejoice. Same old, same old.

For all their greatness, ML and MLOps are already becoming too complicated. Can we pause for a minute, step back and consider how we could simplify things again?

Here are a few whacky (?) ideas on how adding ML capabilities to datastores could improve development experience and agility. By datastores, I mean any place we store data today: relational or NoSQL databases, data lakes, object stores, etc. All of these should have a role to play.

If anything below turns out to be a billion-dollar idea, please consider sending me a t-shirt. Or a Lamborghini. Much obliged!

1 — Data? We have lots! Which one do you need?

In the pre-ML world, relational databases were the single source of truth. You knew where to go to grab the original copy of your data. Transactions guaranteed integrity and traceability.

The situation is much fuzzier now. The same data lives in different places and formats. Data stored in a SQL table could be exported to a data lake for staging, then processed by an ETL system and stored in the data lake again, potentially in different formats to account for different downstream workflows. There, it could be picked up by an ML team and joined to a dataset. Feature selection, feature engineering, and data augmentation would take place. Maybe the data would then be pushed into a feature store. At training time, data would be shuffled, sampled, split, and stored on the training infrastructure.

After all this, good luck tracing a particular feature in your training dataset back to its original row in your relational database, which may have been updated or deleted since then… With an amazing data engineering team, you can pull it off and keep everything neatly organized and versioned in your high-gloss data lake. That still sounds like a ton of work, and the further training data lives from its source, the more likely it is that something will go wrong.

How about ML-native datastores that could automatically build engineered data in place, and keep track of versioning and lineage (inserts, updates, deletes)?

I’d love to be able to attach a Python processing script to a table (or equivalent) and let the datastore automatically build and manage the corresponding engineered data. I could configure when to run the code: periodically, when a certain percentage of the raw data has changed, or on demand. Then, I could access the engineered data in a view (or equivalent), alongside versioning and lineage metadata. To further simplify the data preparation process, the datastore could implement built-in transforms (say normalization, imputation, tokenization, etc.) that I could run automatically without writing any code.
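
Written down as code, the idea could look something like this. To be clear, the `ml_datastore` client below is entirely imaginary; it only exists to make the attach-a-transform idea concrete.

```python
# Purely hypothetical API: attach a transform to a table and let the datastore
# build, version, and refresh the engineered data in place.
import math
from ml_datastore import Datastore   # imaginary ML-native datastore client

ds = Datastore("postgresql://user:pwd@db/shop")

def engineer(df):
    """Feature engineering the datastore would run next to the raw data."""
    df["basket_log"] = df["basket_value"].clip(lower=1).apply(math.log)
    return df

ds.table("activity").attach_transform(
    engineer,
    output_view="activity_features",   # engineered data, queryable like a view
    trigger="10% rows changed",        # or "daily", or on demand
    track_lineage=True,                # versioning + lineage metadata for free
)
```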

While I was literally writing this post, FeatureByte launched :) You can read more on VentureBeat. Good timing!

Benefits:

  • Raw data and engineered data in the same place.
  • Built-in lineage and versioning.
  • Simpler ETL.
  • Less data movement.

2 — Want your predictions for here or to go?

Today, ML predictions require dedicated code and infrastructure. Typically, an application reads data and sends it to a prediction service, which runs more code to invoke a model and send results back to the application. This requires a bit of data movement, IT plumbing, and boilerplate code that we should try to eliminate.

I can see a lot of use cases where predictions could be run directly inside the datastore when data is added/updated and used to automatically populate new attributes/metadata. Here are some examples.

Product catalogs:

  • Translate product descriptions automatically to account for multi-lingual websites and applications.
  • Classify images to extract labels to improve search and quality assurance (does the picture show the correct object?)

Voice of the customer (support emails, online reviews, social media, etc.):

  • Extract entities to understand what product or service the customer is mentioning.
  • Analyze sentiment and emotion.

Knowledge bases:

  • Extract entities (company names, product names, etc.) and classify documents to improve search with reliable metadata.
  • And also: translation, summarization, etc.

Semantic search:

  • Extract features from text documents (I’m looking at you, vector databases).
  • Generate image-to-text descriptions.
  • Transcribe audio with speech-to-text.

The list goes on. You can probably do this today with User-Defined Functions (UDF) and external APIs, but that involves additional code, IT plumbing, and data movement. Meh.
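
For reference, here's what that "do it today" plumbing tends to look like on the application side: pull rows out, call a model (a stock Hugging Face sentiment pipeline in this made-up example), push the results back in. It works, but it's exactly the extra code and data movement I'd like to get rid of.

```python
# Enrichment done outside the datastore: read, predict, write back.
# The tables, columns, and connection string are invented.
import pandas as pd
from sqlalchemy import create_engine
from transformers import pipeline

engine = create_engine("postgresql://user:pwd@db/support")
reviews = pd.read_sql("SELECT review_id, body FROM reviews", engine)

classifier = pipeline("sentiment-analysis")                     # pre-trained model
results = classifier(reviews["body"].tolist(), truncation=True)
reviews["sentiment"] = [r["label"] for r in results]

reviews[["review_id", "sentiment"]].to_sql(
    "review_sentiment", engine, if_exists="replace", index=False
)
```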

Instead, I’d love to be able to attach metadata to a particular column/attribute/object describing the model to use, when to run prediction (synchronously, asynchronously, scheduled), and where to store results. The model would be fetched automatically from a model repository, loaded in a container, and run inside the datastore.

At prediction time, the datastore would automatically convert the data to the appropriate inference format and invoke the model. Predictions would be stored in the same place as the original data, and I could retrieve them using whatever query language the datastore supports. Why does it have to be more complicated than this?
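
Continuing with the imaginary `ml_datastore` client from earlier, the declaration could be as small as this (the `hf://` model URI scheme is also made up):

```python
# Hypothetical: bind a model to a column once, let the datastore do the rest.
from ml_datastore import Datastore   # imaginary client, as above

ds = Datastore("postgresql://user:pwd@db/catalog")

ds.table("products").column("description").attach_model(
    model="hf://Helsinki-NLP/opus-mt-en-fr",   # fetched from a model repository
    mode="async",                              # synchronous, asynchronous, or scheduled
    output_column="description_fr",            # predictions stored next to the data
)
```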

Benefits:

  • Raw data and predictions in the same place.
  • No data movement.
  • Less plumbing.
  • Simpler application code.

3 — You say tomato, I say tomato

ML algorithms expect training data in a well-defined technical format: CSV, libsvm, JSON, protobuf, TFRecord, Parquet, you name it. Most of the time, datastores cannot provide data in these formats. This impedance mismatch forces us to write conversion code, which we should eliminate.

I’d love to pull data in the exact format that an algorithm expects: datastores should provide seamless export options for the most popular ML formats. A good example is the little-known SageMaker Spark SDK, which automatically converts training data from a DataFrame to protobuf (the format that many SageMaker built-in algorithms expect) and uploads it to S3. Perfect.
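
Here's the kind of conversion glue I'd love datastores to absorb, sketched with scikit-learn's libsvm export (the file and column names are invented):

```python
# Export a curated DataFrame to the libsvm format that XGBoost (and SageMaker's
# built-in XGBoost) can train on directly.
import pandas as pd
from sklearn.datasets import dump_svmlight_file

df = pd.read_parquet("training_set.parquet")        # curated data from the lake
X = df.drop(columns=["label"]).to_numpy()
y = df["label"].to_numpy()

with open("training_set.libsvm", "wb") as f:        # ready for the training job
    dump_svmlight_file(X, y, f)
```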

For more flexibility, I should be able to run custom code for feature engineering, feature selection, and advanced formatting. The place to run this is in the datastore, and nowhere else. Snowflake’s Snowpark is an interesting step in that direction.

Benefits:

  • One-click/one-line export to the appropriate format.
  • Simpler ETL.
  • No data movement.

4 — Closing the loop

If a datastore is storing pre-processed data and can export it to the right training format, it’s a step away from being able to train models. So why shouldn’t it?

Many enterprise ML projects use traditional algorithms like linear regression and classification (scikit-learn, XGBoost, etc.). Most of the time, training will amount to minutes on a CPU. These jobs are screaming for commoditization.

I’d love to be able to run training in place:

  • Write a query to build the training set,
  • Export it to the appropriate training format,
  • Pick an algorithm and set hyperparameters, or better yet, use AutoML,
  • Train inside the datastore (and deploy in place!).
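
Sticking with the imaginary `ml_datastore` client, the whole loop could boil down to a single call:

```python
# Hypothetical train-in-place API: query, format, train, and deploy without
# the data ever leaving the datastore.
from ml_datastore import Datastore   # imaginary client, as above

ds = Datastore("postgresql://user:pwd@db/shop")

job = ds.train(
    query="SELECT basket_value, visits, churned FROM activity_features",
    label="churned",
    algorithm="xgboost",              # or algorithm="automl" to let the datastore pick
    hyperparameters={"max_depth": 6},
    deploy=True,                      # serve predictions in place once trained
)
print(job.metrics)
```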

Retraining could be configured automatically on a schedule or when a certain percentage of data has been added or updated.

Could we do the same for Deep Learning models? With pre-trained Transformers and few-shot learning, the prediction cost for unstructured data could be reasonable enough to run in place. For tabular data, we could also consider using a TabTransformer model pre-trained on a big pile of unlabeled tabular data and fine-tune it on a bit of our labeled data (this nice blog post explains the whole process). The compute cost would be higher, but who said we couldn’t have hardware accelerators in datastore systems?
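
On the unstructured side, here's a rough idea of what "pre-trained Transformer, no task-specific training" already looks like, using an off-the-shelf zero-shot classification pipeline from Hugging Face (the ticket text and labels are made up); the TabTransformer path for tabular data would need the fine-tuning step described in the linked post.

```python
# Zero-shot classification with a pre-trained model: no training job at all,
# so the only cost of running it "in place" is inference.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

ticket = "My grill arrived with a cracked lid, I want a replacement."
labels = ["delivery issue", "product defect", "billing question", "praise"]

print(classifier(ticket, candidate_labels=labels))   # scores per label, ready to store
```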

Benefits:

  • Simpler workflow.
  • Fresher models.
  • No data movement.

Conclusion

There you have it. Everybody is claiming to simplify ML, but it’s never been so complicated… Hopefully, we’re at the peak or close to it.

For everyone’s sake, we need to start simplifying, standardizing, and commoditizing workflows. My crystal ball tells me that datastores have a significant role to play. That next wave of innovation hasn’t really started yet.

Watch the skies. That asteroid is coming.
