<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MercyMburu</title>
    <description>The latest articles on DEV Community by MercyMburu (@mercykiria).</description>
    <link>https://dev.to/mercykiria</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1174310%2Fb1d6a837-19f7-4d19-924b-44df47ef0e29.jpg</url>
      <title>DEV Community: MercyMburu</title>
      <link>https://dev.to/mercykiria</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mercykiria"/>
    <language>en</language>
    <item>
      <title>5 analytical skills that every Data Analyst should have</title>
      <dc:creator>MercyMburu</dc:creator>
      <pubDate>Sat, 16 Aug 2025 01:25:08 +0000</pubDate>
      <link>https://dev.to/mercykiria/5-analytical-skills-that-evry-data-analyst-should-have-2e19</link>
      <guid>https://dev.to/mercykiria/5-analytical-skills-that-evry-data-analyst-should-have-2e19</guid>
      <description>&lt;p&gt;Skill1 :The Why and How skill&lt;br&gt;
This refers to the innate curiosity that comes naturally. For instance, remember asking questions back in high school or campus and wondering why you are the only one who raised your hand as other students jeered back at you? That is a valuable skill in Analytics. Simply because the more curious you are about the business problem, the more questions you'll ask the stakeholder. Clarifies goals and objectives and enhances your data analysis process.&lt;/p&gt;

&lt;p&gt;Skill 2: Comprehending data based on context&lt;br&gt;
It's all about understanding the full picture that the data presents.&lt;br&gt;
The ability to fully grasp the context of your data can make a huge difference in your analysis process.&lt;/p&gt;

&lt;p&gt;Skill 3: Breaking things down&lt;br&gt;
Every data analyst needs to develop a technical mindset, which means tackling each problem methodically and logically.&lt;br&gt;
This is essential because you can't just jump into analyzing right after identifying the business problem. To address the problem, start by breaking the process into smaller, manageable pieces.&lt;br&gt;
This includes gathering your data and then thoroughly cleaning it before you dive into the core analysis.&lt;/p&gt;

&lt;p&gt;Skill 4: Data Design&lt;br&gt;
It can be compared to working on a spreadsheet and arranging the data neatly to spot patterns and insights. That whole process is called data design. This skill, like many others, grows stronger with practice.&lt;br&gt;
Data Design is an extension of technical mindset.&lt;br&gt;
In this phase, you’re rearranging cells and organizing data, making it easier to discover various patterns.&lt;/p&gt;

&lt;p&gt;Skill 5: Data Strategy&lt;br&gt;
Data strategy involves managing not just the data itself but also the people, processes, and tools involved in data analysis.&lt;br&gt;
Think of it as a kind of resource allocation skill. It’s all about choosing the right tools and approaches for the specific business problem at hand.&lt;br&gt;
For example, if the business issue calls for a simple dashboard, Microsoft Excel will be the right tool. If it calls for something more comprehensive and interactive, like data modelling, then Power BI and Tableau are the go-to options.&lt;/p&gt;

&lt;p&gt;By harnessing these analytical skills, you’ll find that your understanding of the problem becomes far clearer, even before you begin the project.&lt;/p&gt;

&lt;p&gt;Therefore, taking a step back to focus on the big picture and applying analytical thinking, as outlined above, is essential for every data analyst.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Cross-validation in Machine Learning</title>
      <dc:creator>MercyMburu</dc:creator>
      <pubDate>Thu, 11 Apr 2024 11:36:06 +0000</pubDate>
      <link>https://dev.to/mercykiria/cross-validation-in-machine-learning-3jhe</link>
      <guid>https://dev.to/mercykiria/cross-validation-in-machine-learning-3jhe</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi38eb211rzqz5lv5c0cv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi38eb211rzqz5lv5c0cv.jpeg" alt="Cross-validation AI image" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
Want to learn how to use cross-validation for better measures of model performance? Want to understand this common term used when developing machine learning models? What does it mean? 💁‍♀️&lt;br&gt;
 You are in the right place.😊&lt;/p&gt;

&lt;p&gt;Machine learning is an iterative process; on that we all agree. By iterative, we mean that during development you'll often need to go back a step and make changes or reviews. You will face choices about what predictive variables to use, what types of models to use, what arguments to supply to those models, etc.&lt;/p&gt;

&lt;p&gt;Most of us, quite often, have made these choices in a data-driven way by measuring model quality with a validation set.&lt;br&gt;
But there are some drawbacks to this approach. To see them, imagine you have a dataset with 5000 rows. You will typically keep about 20% of it, or 1000 rows, as a validation set. But this leaves some random chance in determining model scores: a model might do well on one set of 1000 rows even if it would be inaccurate on a different 1000 rows.&lt;/p&gt;

&lt;p&gt;At an extreme you could imagine having only one row of data in the validation set. If you compare alternative models, which one makes the best predictions on a single data point will be mostly a matter of luck!&lt;/p&gt;

&lt;p&gt;In general, the larger the validation set, the less randomness there is in our measure of model quality and the more reliable it will be. Unfortunately, we can only get a large validation set by removing rows from the training data, and smaller training datasets mean worse models.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Cross-validation?
&lt;/h2&gt;

&lt;p&gt;In cross-validation, we run our modelling process on different subsets of the data to get multiple measures of model quality.&lt;br&gt;
For example, we could begin by dividing the dataset into 5 parts, each 20% of the full dataset. In this case, we say we have broken the data into 5 folds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5145zfu11dclwfqhnvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5145zfu11dclwfqhnvm.png" alt="Image description" width="800" height="205"&gt;&lt;/a&gt;&lt;br&gt;
Then we run one experiment for each fold.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In &lt;strong&gt;experiment 1&lt;/strong&gt;, we use the first fold as a validation set and everything else as training data. This gives us a measure of quality based on a 20% holdout set.&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;experiment 2&lt;/strong&gt;, we hold out the second fold and use everything else for training. This gives a second estimate of model quality.&lt;/li&gt;
&lt;li&gt;We repeat this process, using every fold once as the holdout set. Putting this together, 100% of the data is used as holdout at some point, and we end up with a measure of model quality that is based on all of the rows in the dataset (even if we don't use all rows simultaneously).&lt;/li&gt;
&lt;/ul&gt;
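&lt;p&gt;The fold-by-fold experiments above can be sketched with scikit-learn's KFold. This is a minimal illustration on a made-up 10-row array, not part of the original lesson:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import KFold

# tiny illustrative dataset: 10 rows, 2 features
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
holdout_rows = []
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # each experiment trains on 4 folds and validates on the remaining one
    print(f"experiment {i}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows")
    holdout_rows.extend(val_idx)

# every row serves as holdout exactly once across the 5 experiments
print(sorted(holdout_rows) == list(range(10)))  # True
```

&lt;p&gt;Each row lands in the holdout set exactly once, which is why the combined measure reflects all of the data.&lt;/p&gt;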
&lt;h2&gt;
  
  
  When should you use cross-validation?
&lt;/h2&gt;

&lt;p&gt;Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modelling decisions. It can, however, take longer to run because it estimates multiple models (one for each fold).&lt;/p&gt;

&lt;p&gt;So, given these tradeoffs, when should you use each approach?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For small datasets, where extra computational burden isn't a big deal, you should run cross-validation.&lt;/li&gt;
&lt;li&gt;For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have enough data that there's little need to reuse some of it for holdout.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's no simple threshold for what constitutes a large dataset vs a small dataset. But if your model takes only a couple of minutes to run, it's probably worth switching to cross-validation.&lt;/p&gt;

&lt;p&gt;Alternatively, you can run cross-validation and see whether the scores for the experiments seem close. If each experiment yields similar results, a single validation set is probably sufficient.&lt;/p&gt;
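&lt;p&gt;One rough way to check whether the fold scores "seem close" (a sketch, using the per-fold MAE values reported later in this article) is to compare their spread to their mean:&lt;/p&gt;

```python
import numpy as np

# per-fold MAE scores (the five values reported later in this article)
scores = np.array([301628.79, 303164.48, 287298.33, 236061.85, 260383.45])

# relative spread: a small value suggests a single validation set
# would have given nearly the same answer as cross-validation
spread = scores.std() / scores.mean()
print(f"relative spread: {spread:.1%}")
```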

&lt;p&gt;Let's see this with an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')
# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
# Select target
y = data.Price
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we define a pipeline that uses an imputer to fill in missing values and a random forest model to make predictions.&lt;br&gt;
While it's possible to do cross-validation without pipelines, it's quite difficult! Using a pipeline makes the code quite straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50, random_state=0))])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We obtain the cross-validation scores with the cross_val_score() function from scikit-learn. We set the number of folds with the cv parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import cross_val_score
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output: MAE scores:&lt;br&gt;
 [301628.7893587  303164.4782723  287298.331666   236061.84754543&lt;br&gt;
 260383.45111427]&lt;/p&gt;

&lt;p&gt;The scoring parameter chooses a measure of model quality to report : in this case, we choose negative mean absolute error(MAE).&lt;/p&gt;

&lt;p&gt;It is a little surprising that we specify negative MAE. Scikit-learn has a convention where all metrics are defined so that a higher number is better. Using negatives here keeps MAE consistent with that convention, though negative MAE is almost unheard of elsewhere.&lt;/p&gt;

&lt;p&gt;We typically want a single measure of model quality to compare alternative models. So we take the average across experiments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Average MAE score (across experiments):")
print(scores.mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output: Average MAE score (across experiments):&lt;br&gt;
277707.3795913405&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Using cross-validation yields a much better measure of model quality, with the added benefit of cleaning up our code: note that we no longer need to keep track of separate training and validation sets. So, especially for small datasets, it's a good improvement!&lt;/p&gt;

&lt;p&gt;This whole content was extracted from the Kaggle intermediate machine learning course, module : &lt;a href="https://www.kaggle.com/code/alexisbcook/cross-validation"&gt;Cross-validation.&lt;/a&gt;💌&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Mastering Python iteration: Loops and the magic of list comprehensions</title>
      <dc:creator>MercyMburu</dc:creator>
      <pubDate>Mon, 08 Apr 2024 09:11:24 +0000</pubDate>
      <link>https://dev.to/mercykiria/mastering-python-iteration-loops-and-the-magic-of-list-comprehensions-7fp</link>
      <guid>https://dev.to/mercykiria/mastering-python-iteration-loops-and-the-magic-of-list-comprehensions-7fp</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm254epjtf69mpbbowzbj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm254epjtf69mpbbowzbj.jpeg" alt="An image illustrating loops in python" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;&lt;h2&gt;What are loops?&lt;/h2&gt;&lt;/strong&gt; ➰&lt;br&gt;
Simply put, a loop is a way to repeatedly execute some code➿.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;planets= ['mercury','venus','earth','mars','jupyter','saturn','uranus','neptune']
for planet in planets:
print(planet,end='')

# The end='' overrides the \n behavior and instead adds a space ('') after each planet name

output: Mercury Venus Earth Mars Jupiter Saturn Uranus Neptune 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The for loop specifies:&lt;/p&gt;

&lt;p&gt;The variable name to use, in this case ‘planet’&lt;br&gt;
The set of values to loop over, in this case ‘planets’&lt;br&gt;
You use the word ‘in’ to link them together.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;NB: The object to the right of ‘in’ can be thought of as a group of things you can loop over.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We can also iterate over the elements of a tuple. The difference between lists and tuples is that tuple elements are put inside parentheses and tuples are immutable, i.e. their elements cannot be reassigned once created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;multiplicands=(2,2,2,3,3,5)
product = 1
for mult in multiplicands:
product=product*mult

product

Output:360
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Multiplicands holds the values to be multiplied together. ‘product = 1’ initializes the variable product that will store the final answer. ‘for mult in multiplicands’ is a for loop that iterates over each number, represented by the variable ‘mult’, in the multiplicands tuple.&lt;br&gt;
‘product = product * mult’ is the line executed on each iteration of the loop.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieves the current value of the product variable.&lt;/li&gt;
&lt;li&gt;Multiplies it with the current mult value from the tuple.&lt;/li&gt;
&lt;li&gt;Assigns the result back to the product variable, updating it with the new product.&lt;/li&gt;
&lt;li&gt;The code effectively multiplies together all the numbers in the multiplicands tuple, resulting in 2 x 2 x 2 x 3 x 3 x 5 = 360.
You can even loop through each character in a string:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s = 'steganograpHy is the practicE of conceaLing a file, message, image, or video within another fiLe, message, image, Or video.'
msg = ''
# print all the uppercase letters in s, one at a time

for char in s:
if char.isupper():
print(char)

output: HELLO
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Range()&lt;/strong&gt;&lt;br&gt;
A function that returns a sequence of numbers and is very useful in writing loops. For instance, if we want to repeat some action 5 times:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for i in range(5):
print('Doing important work.i=',i)
Output:
Doing important work. i = 0
Doing important work. i = 1
Doing important work. i = 2
Doing important work. i = 3
Doing important work. i = 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case the loop variable is named i and range(5) generates the sequence of numbers from 0 up to (but not including) 5. In each iteration, the value of i is printed after the string. So the printing is done 5 times, once for each number in the range.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;While loops&lt;/strong&gt;➰&lt;br&gt;
Iterate until some condition is met.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;i=0
while i&amp;lt;10:
print(i)
i+= 1 #increase the value of i by 1
output: 0 1 2 3 4 5 6 7 8 9 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;List Comprehensions&lt;/strong&gt;&lt;br&gt;
These are one of Python’s most beloved features. The easiest way to understand them is probably to look at examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;squares= [n**2 for n in range(10)]
squares

Output: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s how we would do the same thing without a list comprehension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;squares=[]
for n in range(10):
squares.append(n**2)
squares

Output:[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also add an ‘if’ condition&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;short_planets = [planet for planet in planets if len(planet) &amp;lt; 6]
short_planets

Output: ['Venus', 'Earth', 'Mars']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(If you’re familiar with SQL, you might think of this as being like a “WHERE” clause)&lt;/p&gt;

&lt;p&gt;Here’s an example for filtering with an ‘if’ condition and applying some transformations to the loop variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# str.upper() returns all-caps version of a string
loud_short_planets = [planet.upper() + '!' for planet in planets if len(planet) &amp;lt; 6]
loud_short_planets

Output:['VENUS!', 'EARTH!', 'MARS!']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;People usually write these on a single line, but you might find the structure clearer when split into 3 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
planet.upper()+'!'
for planet in planets
if len(planet)&amp;lt;6
]
Output: ['VENUS!', 'EARTH!', 'MARS!']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Continuing in the SQL analogy, you could think of these 3 lines as SELECT, FROM and WHERE.&lt;/p&gt;

&lt;p&gt;List comprehensions combined with functions like min, max and sum can lead to impressive one-line solutions for problems that would otherwise require several lines of code.&lt;/p&gt;

&lt;p&gt;For example compare the following two cells of code that do the same thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def count_negatives(nums):
"""Return the number of negative numbers in the given list.
&amp;gt;&amp;gt;&amp;gt; count_negatives([5, -1, -2, 0, 3])
    2
    """
n_negative=0
for num in nums:
if num&amp;lt;0:
n_negative=n_negative + 1
return n_negative
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a code using list comprehension instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def count_negatives(nums):
return len([num for num in nums if num&amp;lt;0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Much better, right? Especially in terms of lines of code.&lt;/p&gt;
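&lt;p&gt;As a further variation (not in the Kaggle lesson), Python booleans behave as integers (True == 1), so sum() can count the matches directly without building an intermediate list:&lt;/p&gt;

```python
def count_negatives(nums):
    # True counts as 1 and False as 0, so summing the comparison
    # results counts how many elements are negative
    return sum(num < 0 for num in nums)

print(count_negatives([5, -1, -2, 0, 3]))  # 2
```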

&lt;p&gt;Solving a problem with less code is always nice, but it’s worth keeping in mind the following lines from The Zen of Python:&lt;/p&gt;

&lt;p&gt;Readability counts&lt;br&gt;
Explicit is better than implicit&lt;br&gt;
So use tools that make compact readable programs. But when you have to choose, favor code that is easy for others to understand.&lt;/p&gt;

&lt;p&gt;A special thanks to the Kaggle community for their contributions to open-source data and code. This article benefitted greatly from the resources available on their platform: &lt;a href="https://www.kaggle.com/"&gt;https://www.kaggle.com/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thank you!💌&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>lists</category>
      <category>loops</category>
    </item>
    <item>
      <title>Analyzing criminal incident data from Seattle or San Francisco</title>
      <dc:creator>MercyMburu</dc:creator>
      <pubDate>Wed, 06 Mar 2024 00:29:42 +0000</pubDate>
      <link>https://dev.to/mercykiria/analyzing-criminal-incident-data-from-seattle-or-san-francisco-5e61</link>
      <guid>https://dev.to/mercykiria/analyzing-criminal-incident-data-from-seattle-or-san-francisco-5e61</guid>
      <description>&lt;p&gt;(This blog post has been written as one of the assignments of the Data Science MOOC being offered on Coursera by the University of Washington)&lt;br&gt;
We will compare and contrast the crime data gathered from Seattle and San Francisco in this brief essay. In summary, we will track how the crimes in the two cities change over the course of a year. In order to determine which parts of the cities are more likely to see criminal activity, we will also map out the crimes that have been reported in the area.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusions&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Over the course of a year, Seattle records significantly more crimes than San Francisco does. June, July, and August are the busiest months for crime in both cities.&lt;/li&gt;
&lt;li&gt;The northeastern part of San Francisco is where most crime events occur. In Seattle, most crimes happen in the core area of the city, with small urban pockets hosting the majority of crime incidences.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The yearly variations in the crime statistics in Seattle are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusxv0j8iunt0zvozvsew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusxv0j8iunt0zvozvsew.png" alt="Image description" width="627" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The yearly variations in the crime cases in San Francisco are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtxo6adeneehfd9xdw29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtxo6adeneehfd9xdw29.png" alt="Image description" width="638" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two charts above show that the months of June, July, and August are when crime peaks in both cities. In contrast to San Francisco, Seattle has a significantly higher number of crime cases, and from September to December, the city sees a sharp decline in the number of instances. San Francisco's data, however, does not indicate such a tendency. The fact that the dataset we utilized didn't have all of the information for San Francisco could be one explanation for the data gap.&lt;/p&gt;

&lt;p&gt;The following diagram illustrates the distribution of crime incidents across the areas of Seattle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl07vssu8jeqqemzzr5de.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl07vssu8jeqqemzzr5de.png" alt="Image description" width="464" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The visualization above shows that the majority of crimes are recorded from Seattle's center metropolitan neighborhoods. In addition, there is a concentration of crime cases reported in little clusters that are divided by areas with lower crime case densities. This pattern lends credence to the theory that Seattle's densely populated metropolitan regions have a higher distribution of crime cases.&lt;br&gt;
The following illustration displays the crime cases reported across the city of San Francisco:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctiiw90ao516yjtymyz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctiiw90ao516yjtymyz0.png" alt="Image description" width="464" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The visualization indicates that the northeastern area of San Francisco reports the highest number of crime incidents. It unequivocally demonstrates that this portion of the city sees more crime than the rest, despite there being other pockets of the city with moderately high reported crime rates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code for the exercise:&lt;/strong&gt;&lt;br&gt;
The following is the code that I used to generate these graphs. The code was written in R.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;require(googleVis)
require(ggmap)
sanfrancisco_dataset &amp;lt;- read.csv("sanfrancisco.csv")
seattle_dataset &amp;lt;- read.csv("seattle.csv")
seattle_dataset$Date.Reported &amp;lt;- as.Date(seattle_dataset$Date.Reported, "%Y/%m/%d")
seattle_crime_count_date &amp;lt;- aggregate(seattle_dataset$Offense.Type, by = list(seattle_dataset$Date.Reported), FUN = length)
names(seattle_crime_count_date) &amp;lt;- c("Date", "Count")
SeattleHeatMap &amp;lt;- gvisCalendar(seattle_crime_count_date, datevar = "Date", numvar = "Count", options = list(
  title = "Calendar heat map of Crime cases in Seattle over the year",
  calendar = "{cellSize:10,yearLabel:{fontSize:20, color:'#444444'},focusedCellColor:{stroke:'red'}}",
  width = 590, height = 320), chartid = "Calendar")
plot(SeattleHeatMap)
sanfrancisco_dataset$Date &amp;lt;- as.Date(sanfrancisco_dataset$Date, "%Y/%m/%d")
sanfrancisco_crime_count_date &amp;lt;- aggregate(sanfrancisco_dataset$IncidntNum, by = list(sanfrancisco_dataset$Date), FUN = length)
names(sanfrancisco_crime_count_date) &amp;lt;- c("Date", "Count")
SanFranciscoHeatMap &amp;lt;- gvisCalendar(sanfrancisco_crime_count_date, datevar = "Date", numvar = "Count", options = list(
  title = "Calendar heat map of Crime cases in San Francisco over the year",
  calendar = "{cellSize:10,yearLabel:{fontSize:20, color:'#444444'},focusedCellColor:{stroke:'red'}}",
  width = 590, height = 320), chartid = "Calendar")
plot(SanFranciscoHeatMap)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
    </item>
    <item>
      <title>Distributed Data and NoSQL</title>
      <dc:creator>MercyMburu</dc:creator>
      <pubDate>Mon, 12 Feb 2024 20:29:21 +0000</pubDate>
      <link>https://dev.to/mercykiria/distributed-data-and-nosql-4o3p</link>
      <guid>https://dev.to/mercykiria/distributed-data-and-nosql-4o3p</guid>
      <description>&lt;p&gt;Relational Databases are of huge help when it comes to analysis. Data is arranged in form of rows and columns and this in one or more tables or relations; Based on the relational model of data proposed by E.F. Codd in 1970.&lt;br&gt;
However with the rapid increasing growth of information like web transactions and machine-generated data, large-scale analytics(problems require finding meaningful patterns in data sets that are so large as to require leading-edge processing and storage capability), it is becoming hard to use just relational databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyw8my4x5s2cbtoej677.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyw8my4x5s2cbtoej677.jpg" alt="an relational database image from bing AI" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Components of a Relational Database System
&lt;/h2&gt;



&lt;ul&gt;
&lt;li&gt;Database management system software&lt;/li&gt;
&lt;li&gt;Physical servers on which the software is loaded&lt;/li&gt;
&lt;li&gt;Disks where the data items are stored&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Imagine a scenario where more data keeps coming in and ever more powerful servers are needed to process the growing databases. Eventually a limit will be reached.&lt;br&gt;
Alternatively, cloud-based distributed processing takes a large volume of data and breaks it into pieces. Voila! The solution to our storage problems, ladies and gentlemen!🙌&lt;br&gt;
These small pieces of data are distributed among many computers in different locations. Basically, each computer has its own task. However, after processing, the data still has to be stored somewhere🤔. Another type of database is needed.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;/em&gt;: Relational databases still allow this kind of distribution, but to really take advantage of it, or of unstructured data, a NoSQL database is sometimes needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;NoSQL and how it works&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;You should know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A NoSQL database stores and accesses data differently than relational ones.&lt;/li&gt;
&lt;li&gt;NoSQL is sometimes called non-relational because it doesn't organize data into tables conforming to a structured schema. Data is instead stored in an unstructured or semi-structured format, which makes database design simpler.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are 4 main types of NoSQL databases.&lt;br&gt;
&lt;strong&gt;&lt;u&gt;Key-value stores&lt;/u&gt;&lt;/strong&gt;&lt;br&gt;
This type stores just a key and its value, and each key is unique. For example, storing all the contents of a shopping cart for one session on an online retail website, or a session ID that identifies all the activity of a single user during one session on a website. It's still possible to do this in a relational database, but key-value stores are optimized to store billions of these keys and are very efficient at retrieving the data quickly. The values can also have a totally different structure from one key to another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg58n5spwgl5kpek5cx9p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg58n5spwgl5kpek5cx9p.jpg" alt="example of a key and value" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Document databases&lt;/u&gt;&lt;/strong&gt;&lt;br&gt;
These store documents in a machine-readable format such as JSON or XML. They aren't as efficient as relational databases at managing the relationships between documents, but they are efficient at reading and searching them.&lt;br&gt;
The structure of one document doesn't need to match that of the others, so elements can easily be added without changing any tables or schemas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcquh3txzwte2x0n4i1jz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcquh3txzwte2x0n4i1jz.png" alt="An example photo of a document database from Cisco Networking Essentials Data Analytics course" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Graph Databases&lt;/strong&gt;&lt;/u&gt;&lt;br&gt;
These databases store nodes that are connected to other nodes in a network. An example of a node can be a user on social media who is connected to all other friends. Another example is points on a map that are connected in real life, so it's easy to find routes between them.&lt;br&gt;
Graph databases are optimized to query through these networks and navigate through the connections a lot faster and more efficiently than a relational database can join tables and find those relationships. They are designed to store huge amounts of interrelated nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u9i9z6cpd5azwhbevlz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u9i9z6cpd5azwhbevlz.png" alt="Example of a graph node database system" width="800" height="727"&gt;&lt;/a&gt;&lt;br&gt;
The image above is a fragment of a movies graph database, where we can see movie nodes (purple), person nodes (orange) and the relationships between them (arrows).&lt;/p&gt;
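&lt;p&gt;The hop-by-hop traversal that graph databases optimize can be sketched in plain Python with an adjacency dict (the names below are invented for illustration; engines like Neo4j do the same kind of walk natively, at scale):&lt;/p&gt;

```python
# Hypothetical social graph: each person maps to the set of direct friends.
friends = {
    "Mercy": {"Amani", "Brian"},
    "Amani": {"Mercy", "Carol"},
    "Brian": {"Mercy"},
    "Carol": {"Amani"},
}

def friends_of_friends(graph, person):
    """Second-degree connections, excluding the person and their direct friends."""
    direct = graph[person]
    second = set()
    for friend in direct:
        second |= graph[friend]
    return second - direct - {person}

print(friends_of_friends(friends, "Mercy"))  # {'Carol'}
```

In a relational database the same question would need self-joins on a friendship table; here each hop is just a dictionary lookup, which is the intuition behind graph databases' speed on connected data.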

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Wide column stores&lt;/strong&gt;&lt;/u&gt;&lt;br&gt;
These store information in tables, but the difference is that the columns are not attributes; they are values. Imagine a huge table from a streaming platform with a row for each user and a column for each movie, where we record whether a user has watched that movie and some information about that viewing. A table like that would be impossible in a relational database or a flat file once we are talking about millions of rows and columns. Most of those “cells” would be empty, because a typical user has watched only a few of the videos. Wide column stores are designed to hold these huge tables of “sparse” information very efficiently and to retrieve it very quickly.&lt;/p&gt;
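&lt;p&gt;A toy sketch of that sparse "user × movie" table in Python (hypothetical data): each row stores only the columns that actually have a value, which is the core trick wide column stores rely on:&lt;/p&gt;

```python
# Sparse table as a dict of rows; empty cells simply don't exist.
watch_log = {
    "user_1": {"movie_42": {"watched": True, "minutes": 95}},
    "user_2": {"movie_7": {"watched": True, "minutes": 12},
               "movie_42": {"watched": False, "minutes": 3}},
}

# A user who watched 5 of a million movies stores 5 entries, not a million.
row = watch_log.get("user_1", {})
print(row.get("movie_42", {}).get("minutes"))  # 95
```

Missing cells cost nothing to store, and a lookup by (row key, column key) stays fast no matter how many columns the table could theoretically have.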

&lt;p&gt;&lt;strong&gt;Important points to note&lt;/strong&gt; &lt;br&gt;
Relational databases are known for reliability, correctness, and version control. NoSQL databases leave open many more possibilities for error, including retrieving data that is not from the most recent version of the database; this needs to be considered when deciding on a NoSQL solution.&lt;/p&gt;

&lt;p&gt;NoSQL is usually a good choice when there are large amounts of data that change frequently, or when working with flexible formats that don't fit into a relational database model. Common NoSQL database systems include MongoDB, Apache Cassandra, and Amazon DynamoDB.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The advantages and challenges of NoSQL databases as compared to relational databases are as follows:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Advantages:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Designed for large, unstructured datasets.&lt;/li&gt;
&lt;li&gt;Able to add new data that is structured differently than the data already in the database, which is not possible with flat files or RDBMS.&lt;/li&gt;
&lt;li&gt;Can scale quickly to support rapid data growth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Challenges:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is not possible to validate input fields against existing data the way SQL databases do.&lt;/li&gt;
&lt;li&gt;Temporal inconsistencies can allow different versions of the data to be confused.&lt;/li&gt;
&lt;li&gt;There is less application support for NoSQL.&lt;/li&gt;
&lt;li&gt;There is no standardization of the ways to query NoSQL databases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article has been written with 🩶.&lt;br&gt;
Please feel free to add any comments about the NoSQL approach, or literally any other random stuff🌝.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>nosql</category>
      <category>sql</category>
      <category>programming</category>
    </item>
    <item>
      <title>Exploratory Data Analysis using Data Visualization Techniques.</title>
      <dc:creator>MercyMburu</dc:creator>
      <pubDate>Fri, 13 Oct 2023 20:07:17 +0000</pubDate>
      <link>https://dev.to/mercykiria/exploratory-data-analysis-using-data-visualization-techniques-3i3g</link>
      <guid>https://dev.to/mercykiria/exploratory-data-analysis-using-data-visualization-techniques-3i3g</guid>
      <description>&lt;p&gt;Happy to be here again. In today's article, two keywords in the title are going to be defined. Exploratory Data Analysis and Data Visualization. With an understanding of these and a sample project for the purposes of description, everything will be understood. I have discovered that Exploratory Data Analysis is a step that cannot be skipped in any Data Science project, whether one likes it or not.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exploratory Data Analysis.
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;It is the process of investigating a dataset in order to come up with summaries/hypotheses based on our understanding of the data, discovering patterns, detecting outliers and gaining insights through various techniques. Data visualization is one of them.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Visualization
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;A graphical representation of the information and the data.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Importance of Data Visualization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In the cleaning process, it helps identify incorrect data or missing values.&lt;/li&gt;
&lt;li&gt;Results become clear, so they can be interpreted and acted on.&lt;/li&gt;
&lt;li&gt;It enables us to visualize things that cannot be observed directly, such as weather patterns and medical conditions, and also mathematical relationships, e.g. in financial analysis.&lt;/li&gt;
&lt;li&gt;It helps us construct and select variables: we can choose which to discard and which to use.&lt;/li&gt;
&lt;li&gt;It bridges the gap between technical and non-technical users by showing graphically what has been written in code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37vn4t3gzd5h5gfgnbv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37vn4t3gzd5h5gfgnbv8.png" alt="A pictorial description of how visualized Data is convinient." width="500" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Knowing the different types of analysis for data visualization is an important additional concept.&lt;br&gt;
&lt;strong&gt;Univariate Analysis:&lt;/strong&gt; We analyze the properties of only one feature.&lt;br&gt;
&lt;strong&gt;Bivariate Analysis:&lt;/strong&gt; We analyze and compare the properties of exactly two features.&lt;br&gt;
&lt;strong&gt;Multivariate Analysis:&lt;/strong&gt; We compare more than two variables.&lt;br&gt;
Let's get right to it.&lt;/p&gt;
&lt;h3&gt;
  
  
  Import the necessary libraries
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;import numpy as np&lt;br&gt;
import matplotlib.pyplot as plt&lt;br&gt;
import pandas as pd&lt;br&gt;
%matplotlib inline&lt;br&gt;
import seaborn as sns&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Reading the data
&lt;/h3&gt;

&lt;p&gt;Step two is reading the data, which is usually in CSV format, using the pandas library. &lt;a href="https://drive.google.com/file/d/1GWne7KfNmBLR6leU1yOcOg38bePW3VvW/view?usp=sharing"&gt;I used this dataset&lt;/a&gt;.&lt;br&gt;
&lt;code&gt;df = pd.read_csv('StudentsPerformance.csv')&lt;br&gt;
df.head()&lt;/code&gt; &lt;br&gt;
Our dataset looks like this after running df.head(), which outputs the first 5 rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj34hnwiwaevgti9s9gq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj34hnwiwaevgti9s9gq.png" alt="Ouput after loading the dataset using pandas." width="800" height="163"&gt;&lt;/a&gt;&lt;br&gt;
You can easily tell just by looking at the dataset that it contains data about different students at a school/college, and their scores in 3 subjects. &lt;/p&gt;
&lt;h3&gt;
  
  
  Describe the Data
&lt;/h3&gt;

&lt;p&gt;After loading the dataset, the next step is to summarize its main characteristics using &lt;code&gt;df.describe()&lt;/code&gt;. Consider it a way to get summary statistics like the mean, the maximum and minimum values, the 25th percentile, etc. of the different columns in a data frame.&lt;br&gt;
The output is something like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uwmvya8b22j2tls54zb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uwmvya8b22j2tls54zb.png" alt="output of dataset description" width="393" height="267"&gt;&lt;/a&gt;&lt;br&gt;
 Please also note: if you want to also include categorical features (features that are not represented by numbers) in your output, run &lt;code&gt;df.describe(include='all')&lt;/code&gt;.&lt;br&gt;
Now, in the output, count, unique and the most frequent value (top) have been filled in. See below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfjjdae0fx7jvex59bx6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfjjdae0fx7jvex59bx6.png" alt="Image description" width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Check for missing values
&lt;/h3&gt;

&lt;p&gt;In case of any missing entries, it is advisable to fill them: categorical features with the mode, and numerical features with the median or mean. Run &lt;code&gt;df.isnull().sum()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkuz0t1k07clrp8lwvy8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkuz0t1k07clrp8lwvy8.png" alt="Output for missing entries" width="287" height="162"&gt;&lt;/a&gt;&lt;br&gt;
Phwyuuks!! We don't have any missing values.&lt;br&gt;
We can now proceed to observe any underlying patterns, analyze the data and identify any outliers using visual representations. I loooovee this part. Let's do it!&lt;/p&gt;
&lt;h2&gt;
  
  
  Graphs
&lt;/h2&gt;

&lt;p&gt;Remember the three types of analysis we mentioned before? Let's look at them, starting with univariate analysis and a bar graph. Look at the distribution of the students across gender, race, lunch status and whether or not they have taken a test preparation course.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.subplot(221)
df['gender'].value_counts().plot(kind='bar', title='Gender of students', figsize=(16,9))
plt.subplot(222)

df['race/ethnicity'].value_counts().plot(kind='bar', title='Race/ethnicity of students')

plt.xticks(rotation=0)

plt.subplot(223)

df['lunch'].value_counts().plot(kind='bar', title='Lunch status of students')

plt.xticks(rotation=0)

plt.subplot(224)

df['test preparation course'].value_counts().plot(kind='bar', title='Test preparation course')

plt.xticks(rotation=0)

plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5smtqasutkuaas8dlp3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5smtqasutkuaas8dlp3v.png" alt="graph output" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can draw a lot of conclusions. For instance, &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are more girls than boys. &lt;/li&gt;
&lt;li&gt;The majority of students belong to race groups C and D.&lt;/li&gt;
&lt;li&gt;More than 60% of the students have a standard lunch.&lt;/li&gt;
&lt;li&gt;More than 60% of the students have not taken any test preparation course.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, let's continue univariate analysis with a boxplot. A boxplot helps us visualize the data in terms of quartiles, and numerical columns are visualized very well with boxplots. We use &lt;code&gt;df.boxplot()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnkpo4n4lggo2jouaoip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnkpo4n4lggo2jouaoip.png" alt="Output boxplot" width="536" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb71tespsivawjf4v7c1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb71tespsivawjf4v7c1.png" alt="Boxplot" width="300" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The horizontal green line in the middle represents the median of the data. &lt;/li&gt;
&lt;li&gt;The hollow circles near the tails represent outliers in the dataset.&lt;/li&gt;
&lt;li&gt;The middle portion represents the inter-quartile range (IQR).
From those points, we conclude that box plots show the distribution of the data: how far the values are dispersed or spread around the middle. So let's plot some distribution plots to see. We'll start with the math score:
&lt;code&gt;sns.distplot(df['math score'])&lt;/code&gt; (note: recent seaborn versions have removed the deprecated &lt;code&gt;distplot&lt;/code&gt;; &lt;code&gt;sns.histplot(df['math score'], kde=True)&lt;/code&gt; is the modern equivalent)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F740nwligef3g16l0x2gz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F740nwligef3g16l0x2gz.png" alt="Mathscore distribution plot" width="579" height="389"&gt;&lt;/a&gt;&lt;br&gt;
Well, the tip of the curve is at around 65 marks, the mean math score of the students in the dataset. We can do the same for the reading score and the writing score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2zychryt6ht6a3f745b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2zychryt6ht6a3f745b.png" alt="Reading score distribution plot" width="537" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For our reading score curve, it's not a perfect bell curve. We conclude that the mean of the reading score is at around 72 marks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno4teim5ywcikfdzem69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno4teim5ywcikfdzem69.png" alt="Writing score distribution plot" width="543" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For our writing score, it's also not a perfect bell curve. The mean of the writing score is at around 70 marks.
So far so good, right?
One more thing, let's look at the correlation between the three scores by use of a heatmap. Correlation basically means looking at the linear relationship between variables. If one variable changes, how does that affect the other?
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;corr = df.corr()
sns.heatmap(corr, annot=True, square=True)
plt.yticks(rotation=0)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s8flr4jdiu5jbi1tss0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s8flr4jdiu5jbi1tss0.png" alt="Heatmap or the three variables" width="544" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The 3 scores are highly correlated.&lt;/li&gt;
&lt;li&gt;Reading score has a correlation coefficient of 0.95 with the writing score. Math score has a correlation coefficient of 0.82 with the reading score.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bivariate analysis:&lt;/strong&gt; Understand the relationship between two variables on different subsets of the dataset. We can try to understand the relationship between the math score and the writing score of students of different genders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.relplot(x='math score', y='writing score', hue='gender', data=df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flz26vdmoa4xnpgc7mefe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flz26vdmoa4xnpgc7mefe.png" alt="Scatter plot of math score vs writing score by gender" width="553" height="445"&gt;&lt;/a&gt;&lt;br&gt;
The graph shows a clear difference in scores between male and female students. For a given math score, female students tend to have a higher writing score than male students; for a given writing score, male students tend to have a higher math score than female students.&lt;br&gt;
Finally, let’s look at the impact of the test preparation course on students’ performance using a horizontal bar graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2vm2tbl5g0x1o5mxhce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2vm2tbl5g0x1o5mxhce.png" alt="Image description" width="800" height="391"&gt;&lt;/a&gt;&lt;br&gt;
It is very evident that students who completed the test preparation course performed better than those who didn't.&lt;br&gt;
That's the end Guys. &lt;br&gt;
Thank you for following through. &lt;br&gt;
YOU CAN DO IT!&lt;/p&gt;

</description>
      <category>visualization</category>
      <category>datascience</category>
      <category>luxdev</category>
      <category>luxdatanerds</category>
    </item>
    <item>
      <title>Data Science for Beginners: 2023–2024 Complete Roadmap.</title>
      <dc:creator>MercyMburu</dc:creator>
      <pubDate>Mon, 02 Oct 2023 08:20:25 +0000</pubDate>
      <link>https://dev.to/mercykiria/data-science-for-beginners-2023-2024-complete-roadmap-3ge7</link>
      <guid>https://dev.to/mercykiria/data-science-for-beginners-2023-2024-complete-roadmap-3ge7</guid>
      <description>&lt;p&gt;In a recent EastAfrican datascience bootcamp, a speaker asked attendants in a google meeting, what had initially piqued their interest in data science. One lady raised their hand and answered, ”Uhm..because data is the new gold.” Was she right? Yes! A huge yes. &lt;br&gt;
Data Science is such a driving force behind a majority of innovations in the world today due to its ability to extract valuable insights from datasets and make informed decisions that aid in problem solving. Furthermore, according to a Business Havard Review, the role ‘Data Scientist’ is the ‘The Sexiest Job in the 21st century’.&lt;br&gt;
It is also a field that continues to evolve everyday and one must keep up with the latest news and trends. That is why a roadmap is important.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktvbx0fxbei7j26p8yi5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktvbx0fxbei7j26p8yi5.jpg" alt="An image showing various terms in data science"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Data Science?
&lt;/h2&gt;

&lt;p&gt;It is the science of analyzing raw data using statistics and machine learning techniques with the purpose of drawing conclusions about that information.&lt;/p&gt;

&lt;p&gt;Data scientists come from various educational and work experience backgrounds, but most should be proficient in:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Domain Knowledge:- Say, for instance, you want to work in the banking sector; knowing about finance, insurance, credit risks, etc. will be important when drawing conclusions and tackling problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mathematical Skills:- Linear Algebra, Calculus, Probability and Statistics help us understand algorithms and perform data analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Communication skills:- This includes both written and verbal communication; data science involves communicating project findings in some form.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Roadmap
&lt;/h2&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknu3b2lmqusow0whcsn1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknu3b2lmqusow0whcsn1.png" alt="A graphical representation of the roadmap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Mathematics
&lt;/h2&gt;

&lt;p&gt;In data science, statistics is essential. It provides one with mathematical ideas necessary for conducting hypothesis testing and data analysis, as well as for making decisions based on the given data. Some significant applications of statistics in data science are listed below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data scientists can quickly summarize and define a dataset’s characteristics by using descriptive statistics. Included in this are statistics like mean (average), median (middle value), mode (most frequent value), variance (spread), and standard deviation (average deviation from the mean). For a basic understanding of the data, descriptive statistics are helpful.&lt;/li&gt;
&lt;li&gt;Making assumptions or projections about a population based on a sample of data is known as inferential statistics. Regression analysis, confidence intervals, and hypothesis testing are some of the methods used in this.&lt;/li&gt;
&lt;/ul&gt;
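&lt;p&gt;As a quick illustration of the descriptive statistics above, Python's built-in &lt;code&gt;statistics&lt;/code&gt; module can compute them on a small sample of scores (the numbers here are made up):&lt;/p&gt;

```python
import statistics

# Hypothetical exam scores for illustration.
scores = [55, 61, 61, 70, 74, 82, 95]

print(statistics.mean(scores))    # average
print(statistics.median(scores))  # middle value -> 70
print(statistics.mode(scores))    # most frequent value -> 61
print(statistics.pstdev(scores))  # spread: deviation from the mean
```

These are exactly the quantities a data scientist reaches for first to get a basic feel for a dataset before any inferential work.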

&lt;p&gt;Probability theory is crucial in data science for modeling uncertainty and randomness. It’s used in various applications, such as Bayesian statistics for machine learning. Other useful areas: Linear Algebra and Vector Calculus.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Programming Skills
&lt;/h2&gt;

&lt;p&gt;The most recommended languages are R and Python. While R is not mandatory, it is valuable for statisticians.&lt;/p&gt;

&lt;p&gt;In Python, familiarize yourself with numpy, pandas and matplotlib for data manipulation, visualization and analysis. Object-oriented and procedure-oriented Python exercises will improve your mastery. Python also offers several built-in data structures, like lists and tuples, that let you store and manipulate data efficiently.&lt;/p&gt;
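&lt;p&gt;A tiny taste of those libraries working together (the student data below is invented purely for illustration):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# numpy holds the raw numbers; pandas wraps them with labels.
scores = np.array([70, 85, 90])
df = pd.DataFrame({"student": ["A", "B", "C"], "score": scores})

# pandas builds on numpy: vectorized comparisons, no explicit loops.
df["passed"] = df["score"].ge(80)
print(int(df["passed"].sum()))  # 2 students scored 80 or above
```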

&lt;h2&gt;
  
  
  3. Structured Query Language(SQL)
&lt;/h2&gt;

&lt;p&gt;It plays a crucial role in data science, particularly in the context of working with structured data stored in relational databases. It is useful in :&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data Cleaning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Retrieval&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Aggregation&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A good introduction can be found &lt;a href="https://www.kaggle.com/learn/intro-to-sql" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Data Visualization
&lt;/h2&gt;

&lt;p&gt;Understand the principles of data visualization and storytelling with data. Tools like Matplotlib, Seaborn, and Tableau can help you create compelling visuals.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.  Machine learning concepts
&lt;/h2&gt;

&lt;p&gt;One at least needs to understand basic algorithms of Supervised and Unsupervised Learning. There are multiple libraries available in Python and R for implementing these algorithms. &lt;a href="https://www.kaggle.com/learn" rel="noopener noreferrer"&gt;Kaggle Machine learning&lt;/a&gt; courses are very helpful here.&lt;/p&gt;

&lt;h2&gt;
  
  
Practice, practice, practice!
&lt;/h2&gt;

&lt;p&gt;In the process of doing this, create repositories for your work and track your progress. Collaborate on projects, network with like-minded people, share your insights on social media and engage in discussions.&lt;/p&gt;

&lt;p&gt;Don't forget to refer to the graphical roadmap above.&lt;br&gt;
All the best!&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
