Yann Barraud
A few (unordered) thoughts about data (1/2)

TL;DR

Data is the new oil.
Data is the new bacon.
Data is the new sun.

Data scientist, the sexiest job of the century

Many companies have started a data lab in the past few years. Most of them are not delivering tangible value. Why is that? How can we change this? Here are a few late-night thoughts on the subject.

Are we at the dawn of the "data era"?

Charlie Chaplin

A look back in the mirror

By the end of the 18th century, first in England and then across the rest of Europe, a movement called industrialisation took hold, harnessing new technologies such as steam engines, railways and, a bit later, the power loom. It came with major financial investment and drove a radical transformation not only of most jobs, but of entire societies.

By combining the loom with steam power, the textile industry changed drastically. But it did not happen suddenly and brutally. In England, it took almost a century for mechanical looms to become widespread, while in the U.S. it took 50 years to see the world's first mechanised cotton factory.

Interestingly, because those changes took so long, some historians now prefer to call this "industrial revolution" an "industrial era".

Is data bringing value?

Data boom

Many of you have already heard about a data lab being launched here, or a big company starting its own data factory there, or whatever they call it.

Many companies have started to build data lakes, to fuel this transformation and feed their data scientists with the required data. (We'll come back to the data lake pattern and technologies later.)

In parallel, we came to understand (maybe a little too late) that some companies, mainly digital and internet ones, were essentially fuelled by the data they collected while offering free services. Hence the motto "if it is free, then you are the product", which is basically true. Step by step, regulators started to look at this important topic and issued laws such as GDPR, LGPD, or whatever it is called locally. And at some point, resistance against those companies emerged, both at the personal and the company level, with sovereign cloud options such as Gaia-X, Bleu or S3ns being considered.

A few data freedom projects

e-mails
Tutanota
ProtonMail

Mobile OS
e.foundation

File storage
pCloud
NextCloud

... to name just a few

So, looking at the valuation of those companies, can there be any doubt that data brings value? Just as industrialisation brought value two centuries ago, data is bringing value today.

Stocks up

Then why are so many data initiatives failing?

The 'loom out of the factory' syndrome

My first take on those failures is what I call 'the loom out of the factory' syndrome. What does it mean?

Imagine you are a textile manufacturer in 1820 in England. You've heard about those new steam-powered looms, and you buy one. You then hire specialists, build a new building close to your existing factory, and get the machinery installed. Exciting, isn't it? Your new team gets trained, and your new experts are producing cotton fabric, experimenting with dozens of techniques. But nothing they produce can actually be sold, because it does not fit into your current processes, or does not meet your customers' and market requirements.

So, all in all, what you now have is a joyful, expensive team having fun with new industrial tools and delivering no value. And beside it, the production team, delivering value, but frustrated by the noise and glory the useless team gets.

Money money money

Still, rejoice: you can communicate about the brilliant new tools you are using. Maybe some investors will take an interest in your company. Kudos! Yet in the coming years your company will die, facing the industrial players that are really using the tools.

Do you see where I'm heading?

Facing the production challenge

The fact is, and many companies get it now, the issue with data projects is not building a data science model. If you listen to Lak, and I bet you should, the model itself is only a small part of the process: somewhere between 10 and 20% of the effort.

All the rest is about data collection and the professionalisation of the project.

So, if you concentrate your efforts on building the perfect model, with the best ROC curve, taking months to improve it by 0.001%, congratulations: you are having fun. And while you are spending money on those improvements, competitors might have productionised a far simpler model that simply does better than the previous solution, and be earning money from it. Which might then justify further investment in improving the model. You know the drill?

Harder better faster stronger

Putting a data science project into production is hard. And even harder if you don't consider it from the very beginning of the project. Putting the loom out of the factory won't help.

What can help, then?

The rise of the data science engineer

One of the tricks, or issues, whatever you call it, may lie in how you think about the data scientist's job. From my experience, which is not the absolute truth, hiring data scientists, putting them together with no constraints and letting them have fun with models might not be the right approach. It leads to a diva effect.

Diva

There was once a belief that if you drop a data scientist into a data lake, he will automagically create value.
Drop a data scientist into a data lake, and he will drown.

From my perspective, most companies don't need data scientists, but rather data science engineers.

Because, let's face it, what is your data scientists' bread and butter?

  1. Data cleaning
  2. Feature extraction
  3. Model experimentation and tuning

The models they use come from well-known toolkits, such as (non-exhaustive list):

  1. scikit-learn
  2. TensorFlow
  3. PyTorch
  4. Keras

So, all in all, maybe you don't need a data scientist with limited or no knowledge of software development. You might rather consider a good data engineer with a solid understanding of those frameworks and the ability to produce production-ready software.

Spoiler alert
Want some fun? For years, I've seen most projects do well with a smart random forest, then with GBM models like XGBoost. So, nothing so rocket-science that a smart data engineer can't handle it...
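
To make this concrete, here is a minimal sketch of the kind of baseline a data science engineer can own end to end, covering the three bread-and-butter tasks above in a single scikit-learn pipeline. The dataset, file name and column names are made up for the example.

```python
# A hypothetical churn baseline: cleaning, feature extraction and tuning
# wrapped in one scikit-learn pipeline that can ship to production as-is.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("customers.csv")  # hypothetical dataset
X, y = df.drop(columns=["churned"]), df["churned"]

preprocess = ColumnTransformer([
    # 1. Data cleaning: impute missing numeric values
    ("num", SimpleImputer(strategy="median"), ["age", "monthly_spend"]),
    # 2. Feature extraction: encode categorical columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country", "plan"]),
])

pipeline = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(random_state=42)),
])

# 3. Model experimentation and tuning: a small, honest grid, not months of it
search = GridSearchCV(pipeline, {"rf__n_estimators": [100, 300],
                                 "rf__max_depth": [None, 10]}, cv=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```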

Data science tools and AutoML

Complementary options to consider are the data science studios and automation platforms emerging on the market.

For example, you could consider using:

  1. Dataiku, which helps build and productionise data pipelines
  2. H2O Driverless AI, which automates feature engineering, model training, hyperparameter optimisation and model selection
  3. Saagie, a DataOps platform that eases deployment to production.

Not to mention the cloud providers' options, like AWS SageMaker, Google AI Hub or OVH MLServing.

They all address the issue in different ways, with pros and cons. Maybe you need them, maybe not. Up to you. But they are an option worth considering.

The ideal setup?

If I had the courage, I'd say that a good setup might look like the following:

  1. 90% of the staff being smart data engineers
  2. 10% PhD-like data scientists, building new models in an experimental way

Then discuss with your team whether you need tooling or not. Various things to consider:

  1. What is my goal? To ease production or to automate the process?
  2. Should I consider data science artefacts as regular software components and integrate them into my software tooling and processes? (See the sketch after this list.)
  3. Do I want to bring autonomy to the business to build data science projects?
  4. ...
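
On question 2, here is a hedged sketch of what treating a data science artefact as a regular software component could look like: the trained model is versioned and shipped like any other build output. The model, file name and version scheme are illustrative assumptions, not a prescribed workflow.

```python
# A sketch only: the trained model becomes a versioned artefact that flows
# through the same CI/CD tooling as the rest of your software.
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

MODEL_VERSION = "1.3.0"  # bumped by your CI pipeline, like any library release

X, y = load_iris(return_X_y=True)  # stand-in for your real training data
model = RandomForestClassifier(random_state=42).fit(X, y)

# The artefact can now be stored, reviewed and deployed like any other binary.
joblib.dump(model, f"model-{MODEL_VERSION}.joblib")
```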

Still, even with the perfect team, it remains an IT team and an IT project. So what's next?

Data driven? Data company? What does it mean?

What makes a company data-driven? Is it making your decisions based on data-backed facts?

Again, Google has an interesting approach here: they use data at each and every level where it makes sense, to improve or automate things. This is what it means to be a data company. Every product you release, internally or externally, can then benefit from data parts in its design.

Data parts? What do you mean? You got it right: data-oriented software pieces are part of the product you deliver. They are not the product itself. I love the example of Google Translate. If you have it installed, you'll see that the product does translation. Yeah, Captain Obvious is me. Well, this originally took the efforts of coders and some 500K lines of code. Then, because Google had collected enough data, they could reduce the core codebase to... 500 lines of code using a neural network. But that is not the whole story. They also built a feature to translate pictures: a model to detect text in a picture, then one to perform OCR, then the translation model, then a model to figure out the size of the translated text to replace the original in the picture.

And this is just one illustration. In the end, what's the idea? Basically: can you identify parts of your product where data can help? Don't think of data parts as products in themselves. They are just parts of your software design, with different requirements. Just Lego bricks in the end, as the sketch below illustrates.

Lego
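
Here is a hypothetical sketch of a picture-translation feature composed from four model bricks, in the spirit of the Google Translate example above. All the functions are dummy stand-ins; the real bricks would wrap actual models.

```python
# Data parts as Lego bricks: each model is a component with a narrow
# contract, composed like any other software piece. Internals are made up.
from dataclasses import dataclass

@dataclass
class Box:
    x: int
    y: int
    w: int
    h: int
    text: str = ""

def detect_text(image: bytes) -> list:
    # Brick 1 (dummy stand-in): find regions of the picture containing text.
    return [Box(0, 0, 100, 20)]

def ocr(image: bytes, box: Box) -> Box:
    # Brick 2 (dummy stand-in): read the characters inside a region.
    box.text = "bonjour"
    return box

def translate(text: str, target: str) -> str:
    # Brick 3 (dummy stand-in): the translation model itself.
    return {"bonjour": "hello"}.get(text, text)

def fit_text(image: bytes, box: Box, text: str) -> bytes:
    # Brick 4 (dummy stand-in): size and paint the translated text back in.
    return image

def translate_picture(image: bytes, target: str = "en") -> bytes:
    # The product feature is just the composition of the bricks.
    for box in detect_text(image):
        box = ocr(image, box)
        image = fit_text(image, box, translate(box.text, target))
    return image

print(translate_picture(b"raw image bytes"))
```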

Ask yourself some guiding questions:

  1. Is it something worth automating? (If you do it once a year, maybe not. If it is manageable by a human being, maybe not. If...)
  2. Do I have enough data available to learn from?
  3. Can I build a model to do this?
  4. ...

If the answer to the 1st is no, you are done. Congrats!

If the answer to the 2nd is no, then think about how to collect enough data to build your model; it should be part of your design process. Meanwhile, either build some regular software or postpone the delivery.

Here comes data governance

Obviously, if you choose at some point to build those data products, you'll meet someone in a corridor, hanging around with empty eyes, desperate, and if you ask him, he'll tell you he is looking for data...

Lost

The point is, either you let things run in an ad-hoc mode, or you ease the production of data products. If you want things to move a bit faster, you'll certainly go for the second.

In fact, I'd say the best option is to start ad hoc and build a data governance organisation in parallel.

But the basic idea is: provide a way for teams to know what data is available, where it sits, with the right associated metrics (freshness, ...) and lineage, so that they know what the data has been through (you could consider tools like Kensu or Sifflet for this). A minimal sketch of what such a catalog entry could carry follows.
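
This is a sketch of the idea only, not any particular tool's schema; all field names and the freshness rule are assumptions.

```python
# A minimal catalog entry: what the data is, where it sits, how fresh it is,
# and what it has been through.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class DatasetEntry:
    name: str                # what data is available
    location: str            # where it sits
    owner: str               # who to ask about it
    last_updated: datetime   # freshness metric
    lineage: list = field(default_factory=list)  # upstream datasets/jobs

    def is_fresh(self, max_age: timedelta = timedelta(days=1)) -> bool:
        return datetime.now() - self.last_updated <= max_age

orders = DatasetEntry(
    name="sales.orders",
    location="s3://datalake/sales/orders/",
    owner="sales-data-team",
    last_updated=datetime(2022, 5, 1, 6, 0),
    lineage=["crm.raw_orders", "etl.orders_daily"],
)
# The desperate person in the corridor can now self-serve.
print(orders.location, orders.is_fresh())
```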

Then, the next step one will certainly ask for is a shared data model. A word on this: in a big company, with various entities and different data models and dictionaries, building a common data model will be cumbersome. Go that way only if it is really necessary.

If you then have an easy way to know, understand and access the data, you are not in bad shape.

Which leads to another usual issue: data culture.

It is all about culture

In fact, if you want all this to flow, it needs to infuse every layer of your organization.

That's where acculturation is key, not only for execs, but for everyone in the company. Some companies are doing this rather well, but only thanks to a strong will to drastically change the company culture. Or, more precisely, to evolve it.

Whatever your business, data can be of use everywhere, from HR to the general ledger, up to your core business. It can transform your activity or make it simpler or more efficient. This is what being a data company means. The moment any actor in the value chain can identify where data can potentially help, you will be able to say you made it. And that's probably why it is a data era and not a data revolution.

Data science is not about predicting the future, it is about predicting how likely the past is to repeat
