A Beginner’s Guide to the Data Science Workflow

Artificial Intelligence (AI) is about building systems that mimic human intelligence.
Machine Learning (ML) is a subset of AI; it is an approach to achieving AI by building systems that can find patterns in a set of data.
Deep Learning (DL) is a subset of Machine Learning and is one of the techniques for implementing ML.

What then is Data Science? Data Science overlaps all three fields above (AI, ML, DL). It simply means analyzing data and then doing something useful with it.

Data science can seem intimidating at first, with all the tools, libraries, and buzzwords floating around. But at its core, it’s simply about using data to solve real-world problems. This is a walkthrough of the essential stages of the data science workflow: what they mean, why they matter, and how Python can help, based on what I have learned as a beginner navigating this exciting field.


💡 What is Data Science?

Data Science is the field of extracting meaningful insights from data using a combination of statistics, programming, and domain knowledge. Whether you are analyzing customer behavior, forecasting sales, or detecting anomalies in sensor readings, the goal is the same: To turn raw data into actionable information.

For beginners, it’s tempting to jump straight into tools like Pandas, Scikit-learn, or TensorFlow, but resist the urge. It’s essential to first understand the overall workflow that guides any data science project. Jumping straight into code can feel satisfying, but without a clear roadmap, you may spend hours cleaning the wrong variables or building models that don’t address the real problem. Learning the data science workflow helps you think like a problem-solver, not just a tool user.

Data Science Practical Guide

  1. Create a framework
  2. Match Data Science and Machine Learning tools
  3. Learn by doing

📌 A Data Science Workflow

✍️ 1. Problem definition

Understand the problem and define the questions you want to answer.
Question: What problem are we trying to solve?

  • Will a simple hand-coded, rule-based system work? If yes, machine learning may not be needed
  • Match the problem to the main types of Machine Learning

    • Supervised Learning: You have data with labels (each sample includes both the input features and the corresponding correct output), and the task can be classification or regression. An example is "Predict heart disease with health records"
    • Unsupervised Learning: You have data with no labels (only the input features, with no known or provided output labels), so you use patterns in the data to derive groupings or labels. An example is "Use customer purchases to determine which customers are similar to each other"
    • Reinforcement Learning: This involves having a computer program perform actions within a defined space, rewarding it when it does well and penalizing it when it does badly. An example is "An AI playing chess tries moves and learns from win/loss outcomes"
    • Transfer Learning: Used when your problem is similar to one that has already been solved. It is a technique where a model pretrained on one task is reused for a different but related task. An example is "Using a model trained on millions of general images to classify X-ray images after a little fine-tuning"

✍️ 2. Data Collection

Once the problem is clear, gather relevant data. You might collect data from CSVs, APIs, web scraping, or databases. After collection, understand its format and limitations.
Question: What type of data do we have available?

  • Structured Data: Data organized in a predefined format, such as rows and columns, making it easy to store, search, and analyze. It is often stored in relational databases (like MySQL, PostgreSQL) or spreadsheets (Excel, CSV), and it is easily analyzed with tools like SQL, pandas, and Excel.
  • Unstructured Data: Data that doesn’t follow a clear format. It can’t easily be stored in tables or rows and requires more processing to extract meaning or structure. It is typically stored in files, document repositories, or cloud storage. Examples are: text (emails, PDFs, social media posts), media (images, videos, audio), and logs (server logs, clickstreams).
  • Semi-structured Data: Falls in between; it is not as rigid as structured data but has some organization. Examples: JSON, XML, HTML

There is also another way to categorize data, which applies to both structured and unstructured data above:

  • Static Data: Data that doesn’t change over time.
  • Streaming Data: Data that is generated and updated continuously over time.
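
To make the data collection step concrete, here is a minimal sketch of loading structured data with pandas and taking a first look at it. The file name `heart_disease.csv` is just a placeholder for whatever dataset you collect:

```python
import pandas as pd

# Load a CSV file into a DataFrame (rows and columns)
df = pd.read_csv("heart_disease.csv")  # hypothetical file name

# First look at the data: size, column types, and a few rows
print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type of each column
print(df.head())  # first five rows
```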

✍️ 3. Success Criteria (Initial Evaluation)

Define what "success" looks like before you begin modeling. This helps guide decisions later. For example:

If we can reach 95% accuracy in predicting heart disease, we will proceed with deployment

Different types of evaluation metrics:

| Classification | Regression | Recommendation |
| --- | --- | --- |
| Accuracy | Mean Absolute Error (MAE) | Precision@K |
| Precision | Mean Squared Error (MSE) | Recall@K |
| Recall (Sensitivity) | Root Mean Squared Error (RMSE) | Mean Average Precision (MAP) |
| F1 Score | R-squared (R²) | Normalized Discounted Cumulative Gain (NDCG) |
| ROC-AUC Score | Adjusted R² | Hit Rate |
| Confusion Matrix | Mean Absolute Percentage Error (MAPE) | Coverage |
| Log Loss | | Diversity |
| Matthews Correlation Coefficient (MCC) | | |
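
As a small illustration, the classification metrics above can be computed with scikit-learn. This is just a sketch with toy labels, not results from a real model:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0]  # actual labels (toy example)
y_pred = [1, 0, 0, 1, 0]  # model predictions (toy example)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
```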

✍️ 4. Features

Features refer to the different forms of input within the data you collected, for example: age, gender, heart rate, etc. You identify the feature variables and the target variable (if available). Feature variables are used to predict the target variable.

Example of health record data:

| ID | Weight | Sex | Heart Rate | Chest Pain | Heart Disease |
| --- | --- | --- | --- | --- | --- |
| 1 | 120kg | M | 81 | 4 | Yes |
| 2 | 98kg | F | 75 | 2 | No |
| 3 | 110kg | M | 90 | 3 | Yes |
| 4 | 85kg | F | 65 | 1 | No |
| 5 | 105kg | M | 78 | 4 | Yes |

Question: What do we already know about the data?

Types of features:

  • Numerical features: Examples are Weight, Heart Rate, Chest Pain
  • Categorical features: Examples are Sex, Heart Disease
  • Derived features: These are features you create from the existing ones. Example: "Visits Per Year"

This stage involves:

4.1. Data Cleaning

Raw data is rarely clean. This step involves handling missing values, fixing errors, and removing duplicates. Tools often used are Pandas and NumPy
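
A minimal cleaning sketch with pandas might look like this. It assumes the hypothetical `heart_disease.csv` file and the column names from the sample table above:

```python
import pandas as pd

df = pd.read_csv("heart_disease.csv")  # hypothetical file name

# Remove exact duplicate rows
df = df.drop_duplicates()

# Fill missing values in a numeric column with the column median
df["Heart Rate"] = df["Heart Rate"].fillna(df["Heart Rate"].median())

# Drop rows where the target label itself is missing
df = df.dropna(subset=["Heart Disease"])
```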

4.2. Exploratory Data Analysis (EDA)

This is where you explore patterns, trends, and relationships in your features using visualizations and statistics. Tools often used are Pandas, Matplotlib and Seaborn.
Some EDA based on our data sample: Heart Disease Frequency per Chest Pain Type, Age versus Max Heart Rate for Heart Disease, Heart Disease Frequency according to Sex, etc
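
For example, "Heart Disease Frequency per Chest Pain Type" could be plotted roughly like this (again assuming the hypothetical `heart_disease.csv` and the sample column names):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("heart_disease.csv")  # hypothetical file name

# Count how often heart disease occurs for each chest pain type
pd.crosstab(df["Chest Pain"], df["Heart Disease"]).plot(kind="bar")

plt.title("Heart Disease Frequency per Chest Pain Type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Number of Patients")
plt.show()
```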

4.3. Feature Engineering and Encoding

At this stage, you can create new features or alter existing ones to make your model smarter.
Question: Feature coverage - how many samples have values for each feature? Ideally, every sample has the same set of features.

Feature encoding is the process of converting categorical (non-numeric) data into a numerical format so that machine learning models can understand and work with it.
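
A simple way to encode categorical columns is one-hot encoding with pandas. This sketch assumes the sample columns "Sex" and "Heart Disease" from earlier:

```python
import pandas as pd

df = pd.read_csv("heart_disease.csv")  # hypothetical file name

# One-hot encode the "Sex" column: each category becomes its own 0/1 column
df = pd.get_dummies(df, columns=["Sex"])

# Map a Yes/No target column to 1/0
df["Heart Disease"] = df["Heart Disease"].map({"Yes": 1, "No": 0})

print(df.head())
```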

✍️ 5. Model Building

At this stage, you choose one or more models, train them on your dataset, and make predictions. Some common tools used at this stage are scikit-learn, PyTorch, TensorFlow.
Question: Based on our problem and data, what model should we use?

Parts of Modeling

  • Choosing and training a model
  • Tuning a model
  • Model comparison

Data Splitting
The most important concept in machine learning is data splitting.

  • The training dataset which is 70 to 80% of the total data
  • The validation dataset which is 10 to 15% of the total data
  • The test dataset which is 10 to 15% of the total data

You train the model on the training dataset, tune the model on the validation dataset and test/compare the model on the test dataset.

The idea here is Generalization - the ability of a machine learning model to perform well on data it hasn't seen, based on what it learned from similar data it was trained on.
Simply put, it is like passing an exam based on the course material and practice questions you studied.
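
Here is a rough sketch of a 70/15/15 split with scikit-learn. It uses a generated toy dataset in place of real features (X) and target (y):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy dataset standing in for prepared features (X) and target (y)
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# First split off the test set (15% of the data)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

# Then split the remainder into training (~70%) and validation (~15%)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.176, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150
```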

5.1. Choosing and Training a Model

Start by selecting an appropriate algorithm based on your problem type and data. Train the model using the training dataset to help it learn patterns and relationships.
For example, CatBoost and Random Forest tend to work well on structured data.
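
A minimal training sketch with scikit-learn's RandomForestClassifier, again on a generated toy dataset rather than real health records:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for the prepared features and target
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Choose a model suited to structured/tabular data and train it
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on data the model has not seen during training
predictions = model.predict(X_test)
print("Test accuracy:", model.score(X_test, y_test))
```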

5.2. Tuning a Model

After initial training, adjust hyperparameters (like learning rate, tree depth, number of estimators, etc.; the available hyperparameters depend on the chosen algorithm) to improve performance. Techniques like Grid Search, Random Search, or Bayesian Optimization help find the best configuration. Tuning is done on the training or validation dataset.
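
As an illustration, Grid Search with cross-validation in scikit-learn might look like this. The parameter values are arbitrary examples, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the training set
X_train, y_train = make_classification(n_samples=1000, n_features=5, random_state=42)

# Hyperparameter values to try (these depend on the chosen algorithm)
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
}

# Grid search with 5-fold cross-validation on the training data
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV score  :", search.best_score_)
```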

5.3. Model Comparison

This is where you test the models on unseen data and compare their results. Testing is done on the test dataset.

✍️ 6. Model Evaluation

After the model has been trained, tuned, and tested, evaluate it using appropriate metrics on a validation or test dataset. Use metrics like accuracy, precision, recall, RMSE, etc, to assess performance.
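
Continuing the toy example from the modeling step, a confusion matrix and per-class metrics can be printed like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Toy data and model standing in for the real pipeline
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
predictions = model.predict(X_test)

# Confusion matrix plus precision, recall, and F1 for each class
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
```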

✍️ 7. Experiment

Most times, a model's first results aren't its last. You need to repeat steps 5 and 6 with other algorithms/models, and perhaps modify the inputs and outputs, to see if there is a better result. Compare the evaluation results against your goal and select the model that generalizes best on unseen data, not just the one that performs best on the training dataset.

✍️ 8. Deployment (Optional)

Package and serve the model in a real-world environment by integrating it into a usable product or service. Some tools are Flask, FastAPI, Streamlit, Docker, and Heroku.
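
As a rough sketch, a trained model saved with joblib (to a hypothetical model.pkl) could be served with FastAPI like this. The input fields are simplified stand-ins for real features:

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")  # hypothetical saved model file

class Patient(BaseModel):
    weight: float
    heart_rate: float
    chest_pain: int

@app.post("/predict")
def predict(patient: Patient):
    # Arrange the inputs in the same order the model was trained on
    features = [[patient.weight, patient.heart_rate, patient.chest_pain]]
    prediction = model.predict(features)[0]
    return {"heart_disease": bool(prediction)}
```

You would then run the app with a server such as uvicorn (for example, `uvicorn main:app --reload` if the file is named main.py).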


📌 Key Python Libraries Overview

  • Pandas: Used to explore, analyze, manipulate, and get data ready for machine learning. It reads data as DataFrames.

  • NumPy: NumPy stands for Numerical Python and is used for numerical computation. It forms the foundation for turning your DataFrame into arrays of numbers that a machine learning algorithm can then use to work out patterns.

  • Matplotlib/Seaborn: Used to turn data into visualizations known as plots

  • Scikit-learn: A Python machine learning library used to build, train, and evaluate ML models and to make predictions.


📌 Summary and What’s Next

In this post, we explored the foundations of Machine Learning - understanding problem types, choosing and evaluating models, and making sense of our data through EDA and metrics.

But theory is only half the story.

Up next, I will be putting this into practice in a real-world project:

Predicting treatment outcomes for mental health patients

Top comments (8)

Dotallio

Really clear breakdown, makes the whole process feel a lot less overwhelming. Super interested to see your steps for the mental health prediction project - what kind of data are you hoping to use for that?

Odinaka Joy

Thank you.
I am using a CSV dataset from Kaggle - kaggle.com/datasets/osmi/mental-he...

Tulsi Shukla

Ma'am, where did you learn all this? Did you take a bootcamp or watch YouTube videos?

Odinaka Joy

I am currently learning AI from the Zero To Mastery platform.
I have been learning for about 5 months now.

Tulsi Shukla

Is it paid or free, ma'am?

 
Tulsi Shukla

Also, ma'am, I want to gain some practical knowledge. Do you have any suggestions for where I can gain and sharpen my frontend and backend knowledge?

Odinaka Joy

It's paid.
If you are confident you can work on real-world projects, reach out to founders and offer your services (it can be free for a start - say 3 months). You can build up experience from there.

Tulsi Shukla

Ma'am, I am thinking of joining a community where I can build projects and see how people do it; then, after gaining confidence, I will apply for internships and go for Outreachy along with some open source contributions.
