How to Become a Data Scientist in 2026

#ai #datascience #machinelearning #deeplearning

How I got here

On principle, you will never catch me parading myself as some sort of expert data scientist. Technically, that's what I do in my day job, but I know I still have so much to learn because the field is broad, and truly becoming an expert requires a dangerously ambitious work ethic. I think of myself as a functional data scientist who learns more as I encounter new problems daily.

I'm writing this piece because in the last week or two, precisely three people have asked me questions related to transitioning into data science. As such, I thought to unify my thoughts around the topic so that I can refer anyone else who asks here--if anyone else ever asks.

This article assumes you're already familiar with some of what data science entails, such as data analysis, model training, prediction, etc., so I will not be doing a lecture series, just addressing some of the disconnects I have observed in conversations with people looking to transition into the field.

Initial Excitement

In 2026, it's easy to see what Claude or ChatGPT is doing and go "What sorcery is this? I must learn this trick!" and then reach out to the closest person you know who has ever mentioned anything about data or machine learning to find out how you can transition into AI. First of all, transitioning into "AI" is such a broad way to look at it. It is analogous to saying "I want to emigrate to Africa, show me how". But that's forgivable too.

To cut short your initial excitement, or maybe redirect it: playing with a locally hosted LLM or making API calls to the DeepSeek endpoint is not data science, machine learning or "AI". It's coding. And if you want to go down that route, you're better off focusing on software engineering. I say this because when you work with LLMs, the finished models to be specific, it's like using any other SaaS API out there. The difference is that you're interacting with a much less deterministic interface. But the rest of the work you do around it is pretty much a deterministic software engineering journey.

In the next section, I introduce you to what data science, machine learning, AI or whatever it's being called by the time you read this article is really about.

Theoretical Foundations

Among the most important things you need to understand in data science are statistics and probability. It helps to be able to actually run the calculations by hand, but it's not entirely necessary, thanks to the myriad pre-built toolboxes out there. However, you cannot escape understanding the concepts, because you need them to understand what the tools are doing and what the outcomes represent.

Ideally, the next step is to learn the mathematics behind simple models like linear and logistic regression. This is key, because even more complex systems like neural networks rely on conceptually simple units of calculation, encapsulated in more complex training loops, which are generally able to extract more relationships between the data points supplied and the target being solved for.
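To make "conceptually simple units of calculation" concrete, here is a minimal sketch of logistic regression written from scratch. The toy data, the function names and the learning rate are all made up for illustration; a real project would normally lean on a library instead.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Logistic regression: a weighted sum of the inputs, then a sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def train(rows, labels, lr=0.1, epochs=2000):
    """Fit weights with plain batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            # Gradient of the log loss with respect to the weighted sum z
            error = predict_proba(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

# Toy data (made up): hours studied -> passed (1) or failed (0)
hours = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
passed = [0, 0, 0, 1, 1, 1]
weights, bias = train(hours, passed)

low_effort = predict_proba(weights, bias, [0.5])
high_effort = predict_proba(weights, bias, [4.0])
```

The `predict_proba` function is essentially what a single neuron with a sigmoid activation computes; a neural network stacks many of these units and trains them with a more elaborate version of the same loop.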

And then, of course, you need to spend time understanding model evaluation. What sort of evaluation would accurately represent the performance of your model in the use case it is being developed for? As a quick example, take a fraud dataset where, realistically, fraud cases might represent only 1 to 5% of all records. A metric like naive accuracy has a high chance of telling you your model is 95 to 99% accurate, because by the nature of your dataset, a model that simply predicts every record as non-fraudulent will be right 95 to 99% of the time. In reality, it has missed exactly what we wanted it to catch in the first place. So in this case, you would want to explore other metrics such as F1, area under the curve (AUC) or the Matthews correlation coefficient (MCC), which account for model performance on each target class.
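The fraud example can be checked with a few lines of arithmetic. This is a sketch with made-up numbers (1,000 records, 30 of them fraudulent) and a "model" that labels everything as non-fraudulent; the helper functions are written by hand here only to keep the example self-contained.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 meaning fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, tn, fp, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def mcc(tp, tn, fp, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator is 0
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Made-up dataset: 970 legitimate records and 30 fraud cases (3%)
y_true = [0] * 970 + [1] * 30
# A "model" that predicts non-fraud for every record
y_pred = [0] * 1000

counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)   # 0.97
f1 = f1_score(*counts)    # 0.0
corr = mcc(*counts)       # 0.0
```

Accuracy looks excellent at 97%, while F1 and MCC both come out at zero because not a single fraud case was caught.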

Practice

When learning about data science and machine learning via courses, you will be excited about modelling. In practice, you will begin to understand why it is called data science. Especially in today's world where robust toolboxes exist, model building is probably just 10% of the data scientist's job.

In reality, you will have to deal with messy data, semi-structured data and data with issues upstream which might lead to missing values or inconsistent data representation.

It is your job to carry out data analysis to, first of all, understand what is wrong with the data, how the data might be misleading, and what hidden gems it holds. Together, this process is part exploratory data analysis (EDA) and part feature engineering.

EDA entails looking into the data to understand what it looks like. Which fields are numerical and which are categorical? Which categorical features are nominal (simply separate classes) and which are ordinal (rank-ordered by some criterion)?
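The nominal versus ordinal distinction matters because it changes how you encode the feature. Here is a small sketch; the column values and helper names are hypothetical.

```python
# Ordinal: the order carries meaning, so encode each level as its rank
EDUCATION_ORDER = ["primary", "secondary", "bachelor", "master"]

# Nominal: separate classes with no natural order
PAYMENT_METHODS = ["card", "cash", "transfer"]

def encode_ordinal(value, ordered_levels):
    """Map an ordinal category to its position in the ordering."""
    return ordered_levels.index(value)

def encode_one_hot(value, categories):
    """Map a nominal category to indicator columns so no order is implied."""
    return [1 if value == category else 0 for category in categories]

education_rank = encode_ordinal("bachelor", EDUCATION_ORDER)   # 2
payment_vector = encode_one_hot("cash", PAYMENT_METHODS)       # [0, 1, 0]
```

Encoding payment method as 0, 1, 2 would quietly tell the model that "transfer" is somehow greater than "card", which is why nominal features usually get indicator columns instead.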

Are there missing values? Can we afford to drop all records with missing values, or do we need to find a way to fill them? Based on what that feature represents, what would be the best way to fill missing values without distorting the true statistical properties of that feature?
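As a rough illustration of how the choice of fill value can distort a feature, the sketch below fills missing incomes with either the median or the mean. The income figures and the `impute` helper are invented for the example.

```python
import statistics

def impute(values, strategy="median"):
    """Replace None entries with the median or mean of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mean":
        fill = statistics.fmean(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Made-up incomes with two missing entries and one very high earner
incomes = [32_000, 35_000, None, 38_000, 41_000, None, 900_000]

median_filled = impute(incomes, "median")  # gaps become 38000
mean_filled = impute(incomes, "mean")      # gaps become 209200.0
```

With a skewed feature like income, the mean is dragged up by the single high earner, so the filled-in records end up looking nothing like a typical person in the dataset. The median is often the safer default here, but the right answer still depends on what the feature represents.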

Are there records which are outliers? What defines an outlier within the context of this problem? Should we drop all outliers or apply a mathematical transformation to that feature to minimise their effect?
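One common convention for flagging outliers (an assumption here, not the only definition) is the 1.5 x IQR rule, and a log transform is one way to dampen extreme values without dropping them. The transaction amounts below are made up.

```python
import math
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up transaction amounts with one extreme value
amounts = [12, 15, 18, 20, 22, 25, 30, 5000]

low, high = iqr_bounds(amounts)
outliers = [a for a in amounts if a < low or a > high]  # [5000]

# Alternative to dropping: compress the scale so 5000 no longer dominates
log_amounts = [math.log1p(a) for a in amounts]
```

Whether 5000 is an error to drop or exactly the signal you care about (in fraud detection it might well be) is a business question, not a statistical one.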

Are there features that can be combined via some mathematical representation to reflect a more contextual measure for the particular business case being solved? Should I transform date of birth into a numerical "age" field, or transform locations into cluster mappings by longitude and latitude so the model can account for the effect of where a record originates?
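Both transformations can be sketched in a few lines. The function names are hypothetical, and the grid bucketing is only a crude stand-in for proper clustering of coordinates.

```python
import math
from datetime import date

def age_in_years(date_of_birth: date, as_of: date) -> int:
    """Whole years between date_of_birth and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)

def grid_cell(latitude: float, longitude: float, cell_size_deg: float = 0.5):
    """Bucket coordinates into a coarse grid cell (not a real clustering algorithm)."""
    return (math.floor(latitude / cell_size_deg), math.floor(longitude / cell_size_deg))

age = age_in_years(date(1990, 6, 15), as_of=date(2026, 3, 1))  # 35
cell = grid_cell(6.52, 3.38)                                   # (13, 6)
```

Note that `as_of` is passed in explicitly rather than read from the system clock, so the same record produces the same age whenever the pipeline is re-run.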

Instead of telling you specifics, I have presented questions. This is a glimpse into the thoughts that run through the mind of a data scientist transforming data into a decision system.

The most important bit of all this is the requirement to understand business context. The same dataset with different business objectives will require entirely different thought processes and evaluation of outcomes.

The full picture

When you start your journey, everything will live in a Jupyter Notebook and your dataset will be a CSV file. But as time passes, you will realise you need more to build an actual machine learning system. At some point you will need to learn production-grade coding: how to write clean, reusable code, because the transformations applied to your training data must also be applied at inference time with only a few tweaks here and there, and how to build systems that are maintainable, resource-efficient and scalable.
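The point about reusing transformations at inference time is easiest to see with a scaler. The class below is a hypothetical minimal version written for illustration: it learns its parameters once from the training data and then applies exactly the same parameters to new records.

```python
import statistics

class SimpleScaler:
    """Hypothetical minimal scaler: fit once on training data, reuse everywhere."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        # Fall back to 1.0 so a constant feature does not cause division by zero
        self.std_ = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

# Training time: learn the parameters from the training data only
train_ages = [20, 30, 40, 50, 60]
scaler = SimpleScaler().fit(train_ages)
train_scaled = scaler.transform(train_ages)

# Inference time: reuse the fitted scaler, never refit on incoming records
new_scaled = scaler.transform([40])  # [0.0]
```

Refitting on the incoming record instead would silently change what the numbers mean to the model, which is exactly the kind of bug that copy-pasted notebook cells tend to produce.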

You might have to source your own data from APIs, do some web scraping, or connect to and pull data from a database. You might have to create a sandbox environment with Docker, run a series of iterations and track them all with MLflow, create an endpoint for accessing your model with FastAPI, deploy your model on a virtual machine on AWS, and track data drift and model performance with Evidently.

Depending on the kind of organisation you end up working for, you might work strictly within the confines of exploring the data and building models, or the entire stack from data sourcing up to model serving.

This entire process requires being open and willing to learn constantly. I started with Jupyter Notebooks; now I spend hours debugging on Linux to set up NVIDIA drivers so deep learning models can train and run more efficiently.

Notice how I haven't mentioned LLMs, yet we have said so much? Yeah, working with large language models suddenly becomes contextual. If you want to work with large language models as a data scientist, it involves training or fine-tuning models much like any other model, only with text data, and then evaluating the outputs, which are next-token predictions: do the probabilities the model assigns to the possible next tokens match what actually comes next in the text corpus? This is a simplification of what LLMs go through to become usable, but you get the point; we're back to good old statistics and probability.
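That comparison is usually measured with cross-entropy, often reported as perplexity. Here is a toy sketch with made-up tokens and probabilities; real models work over vocabularies of many thousands of tokens, but the arithmetic is the same.

```python
import math

def cross_entropy(predicted_probs, actual_next_tokens):
    """Average negative log-probability assigned to the token that actually came next."""
    losses = [
        -math.log(probs[token])
        for probs, token in zip(predicted_probs, actual_next_tokens)
    ]
    return sum(losses) / len(losses)

# Made-up predictions at two positions in a sentence
predicted = [
    {"cat": 0.7, "dog": 0.2, "car": 0.1},
    {"sat": 0.5, "ran": 0.4, "flew": 0.1},
]
actual = ["cat", "sat"]  # what the corpus actually contains

loss = cross_entropy(predicted, actual)
perplexity = math.exp(loss)

# A model that puts less probability on the right tokens scores worse
worse = [
    {"cat": 0.1, "dog": 0.2, "car": 0.7},
    {"sat": 0.1, "ran": 0.4, "flew": 0.5},
]
worse_loss = cross_entropy(worse, actual)
```

Lower loss means the model put more probability on what really came next, and training is largely the business of pushing that number down.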

If this is what you want, then yes, all I have said is in line with what you need to do to get into AI. If on the other hand you just want to play with locally hosted LLMs via Ollama or vLLM or perhaps the Gemini API, what you need to do is learn how to code, not become a data scientist.

I hope I have been able to shorten your journey based on your actual desired path.