<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Isaac</title>
    <description>The latest articles on DEV Community by Isaac (@130).</description>
    <link>https://dev.to/130</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1181318%2F3ce1bc8b-159f-4b0e-9498-bd4a2fc37147.jpg</url>
      <title>DEV Community: Isaac</title>
      <link>https://dev.to/130</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/130"/>
    <language>en</language>
    <item>
      <title>DATA SCIENCE 2024 ROADMAP</title>
      <dc:creator>Isaac</dc:creator>
      <pubDate>Tue, 06 Aug 2024 00:29:36 +0000</pubDate>
      <link>https://dev.to/130/data-science-2024-roadmap-331p</link>
      <guid>https://dev.to/130/data-science-2024-roadmap-331p</guid>
      <description>&lt;p&gt;In the current growing technology industry, organizations are generating and storing more and more data and are looking to hire professionals to derive valuable insights from said data to help drive business decisions. Here, data science plays a big role and has actually been considered “the sexiest job of the 21st century” according to Harvard Business Review. Whether you're a recent graduate, a career changer, or simply curious about the world of data, this roadmap is designed to guide you through your journey to achieve a desired objective or goal within the timeframe of a year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Understand the Basics&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A. What is Data Science?&lt;/strong&gt;&lt;br&gt;
Before diving into data science, it's crucial to understand its essence. Data science is the practice of extracting meaningful insights and knowledge from data using various techniques, including statistical analysis, machine learning, and data visualization. In brief, data science involves:&lt;br&gt;
• Statistics, computer science, mathematics&lt;br&gt;
• Data cleaning and formatting&lt;br&gt;
• Data visualization&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Learning the Fundamentals ( 3 – 5 months)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A. Mathematics and Statistics ( 1 – 2 months)&lt;/strong&gt;&lt;br&gt;
Linear algebra and calculus are very important, as they help in understanding the machine learning algorithms that play a central role in data science. Statistics is equally significant, as it underpins data analysis: descriptive statistics is a powerful way to summarize data, while inferential statistics is used for hypothesis testing.&lt;/p&gt;
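&lt;p&gt;The two kinds of statistics can be illustrated in a few lines of Python using only the standard library (the sales figures below are invented for the example):&lt;/p&gt;

```python
import math
import statistics

# Descriptive statistics: summarize an invented sample of daily sales
sales = [12, 15, 11, 14, 30, 13, 12, 16, 14, 15]
mean = statistics.mean(sales)    # central tendency
stdev = statistics.stdev(sales)  # spread

# Inferential statistics: a one-sample t statistic testing whether
# the true mean differs from a hypothesized value of 12
t_stat = (mean - 12) / (stdev / math.sqrt(len(sales)))
print(mean, round(t_stat, 2))
```

&lt;p&gt;Descriptive numbers such as the mean simply describe the sample; the t statistic is the starting point for deciding whether the difference from 12 is likely to be real.&lt;/p&gt;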

&lt;p&gt;&lt;strong&gt;B. Programming Skills (2 – 3 months)&lt;/strong&gt;&lt;br&gt;
If you are a beginner, learning Python is strongly recommended for data science. Python is a favorite among data scientists for its simplified syntax. One can also access a lot of open-source libraries, including NumPy, pandas, and scikit-learn for the implementation of various data science tasks.&lt;/p&gt;
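&lt;p&gt;As a tiny, invented taste of what two of those libraries feel like:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# NumPy: fast vectorized math on arrays
prices = np.array([9.99, 14.50, 3.25])
print(prices.mean())

# pandas: labeled, tabular data with SQL-like filtering
df = pd.DataFrame({"product": ["A", "B", "C"], "price": prices})
cheap = df[df["price"].lt(10)]   # rows where price is below 10
print(cheap["product"].tolist())
```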

&lt;p&gt;&lt;strong&gt;3. Data Manipulation and Analysis (2 – 3 months)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A. Data Collection and Wrangling (1 month)&lt;/strong&gt;&lt;br&gt;
Data collection is the process of gathering relevant data from a variety of sources, while data wrangling is the process of preparing and transforming that data into a format that is easier to analyze.&lt;/p&gt;
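&lt;p&gt;The wrangling step above can be sketched with pandas; the toy "raw" table here stands in for data collected from a real source:&lt;/p&gt;

```python
import pandas as pd

# An invented raw extract with typical problems: a duplicate row,
# a missing value, and numbers stored as strings
raw = pd.DataFrame({
    "customer": ["Ann", "Ben", "Ben", "Cara"],
    "amount": ["100", "250", "250", None],
})

clean = (raw
         .drop_duplicates()                 # remove the repeated Ben row
         .dropna(subset=["amount"])         # drop rows missing the amount
         .assign(amount=lambda d: d["amount"].astype(int)))  # fix the dtype
print(clean)
```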

&lt;p&gt;B. &lt;strong&gt;Exploratory Data Analysis (EDA) ( 1 – 2 months)&lt;/strong&gt;&lt;br&gt;
Master the art of EDA to gain insights from your data. EDA involves exploring the data with summary statistics such as the mean and median, forming hypotheses, and performing analyses. Data visualization libraries like Matplotlib and Seaborn will be your best friends at this stage: visual methods such as histograms, bar charts and pie charts help identify trends and patterns within the data.&lt;/p&gt;
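&lt;p&gt;A minimal histogram with Matplotlib, using randomly generated ages in place of a real dataset (the Agg backend renders to a file without needing a display):&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
ages = rng.normal(35, 10, 500)   # invented sample of customer ages

fig, ax = plt.subplots()
counts, bins, patches = ax.hist(ages, bins=20)
ax.set_xlabel("Age")
ax.set_ylabel("Number of customers")
ax.set_title("Distribution of customer ages")
fig.savefig("ages_hist.png")
```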

&lt;p&gt;&lt;strong&gt;4. Machine Learning and AI (3 – 4 months)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A. Introduction to Machine Learning&lt;/strong&gt;&lt;br&gt;
Understand the core concepts of machine learning: supervised learning, which covers regression and classification problems, and unsupervised learning, whose applications include clustering and dimensionality reduction.&lt;/p&gt;
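&lt;p&gt;A minimal unsupervised-learning sketch: K-Means clustering with scikit-learn, on two obviously separated groups of invented points:&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of invented 2-D points
pts = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

# We choose k=2 clusters; KMeans assigns each point a cluster label
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
print(km.labels_)
```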

&lt;p&gt;&lt;strong&gt;B. Model Building&lt;/strong&gt;&lt;br&gt;
Learn to build, train, and evaluate machine learning models. Scikit-learn provides an extensive toolkit for this purpose.&lt;/p&gt;
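&lt;p&gt;The basic build / train / evaluate loop in scikit-learn, shown on the Iris dataset that ships with the library:&lt;/p&gt;

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)   # build
model.fit(X_train, y_train)                 # train
acc = accuracy_score(y_test, model.predict(X_test))  # evaluate
print(acc)
```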

&lt;p&gt;&lt;strong&gt;C. Deep Learning (Optional)&lt;/strong&gt;&lt;br&gt;
If you're interested in more advanced techniques, consider delving into deep learning using libraries like TensorFlow or PyTorch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Data Engineering (2 – 3 months)&lt;/strong&gt;&lt;br&gt;
Data engineering is the discipline of building data infrastructure: designing, building and maintaining ETL data pipelines. It is not mandatory for data scientists to learn, but a good understanding of it is a big plus in the job market.&lt;/p&gt;
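&lt;p&gt;A toy end-to-end ETL sketch using only the Python standard library; the CSV "feed" is inlined and invented for the example:&lt;/p&gt;

```python
import csv
import io
import sqlite3

# Extract: read a raw CSV feed
raw_csv = "order_id,amount\n1,100\n2,not_a_number\n3,250\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: keep only rows with a valid numeric amount
clean = [(int(r["order_id"]), float(r["amount"]))
         for r in rows if r["amount"].replace(".", "").isdigit()]

# Load: write the cleaned rows into a database table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", clean)
total = con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```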

&lt;p&gt;&lt;strong&gt;Points to Remember&lt;/strong&gt;&lt;br&gt;
• &lt;strong&gt;No Degree Requirement&lt;/strong&gt;: While a degree in computer science can be beneficial, it's not mandatory for a career in data science. What matters most are the skills you acquire and master.&lt;br&gt;
• &lt;strong&gt;Domain Expertise&lt;/strong&gt;: Having expertise in a specific domain or industry can be an advantage as it enables you to leverage data effectively for solving domain-specific problems.&lt;br&gt;
• &lt;strong&gt;Communication Skills&lt;/strong&gt;: Good verbal and written communication skills are essential for collaborating with various stakeholders and effectively conveying your data findings and recommendations.&lt;br&gt;
• &lt;strong&gt;Focus on Fundamentals&lt;/strong&gt;: Data science is vast, so it's important to start by understanding the basics before delving into advanced concepts. Building a strong foundation is key.&lt;br&gt;
• &lt;strong&gt;Practical Applications&lt;/strong&gt;: Practical skills gained through working on real-world projects are highly valued by organizations. Practical application of knowledge is often more important than theoretical knowledge alone.&lt;br&gt;
• &lt;strong&gt;Track Your Progress&lt;/strong&gt;: Monitoring your learning progress is crucial. Assignments and assessments can help you gauge whether you are grasping concepts effectively and moving in the right direction.&lt;br&gt;
• &lt;strong&gt;Stay Updated&lt;/strong&gt;: Data science is an evolving field. Keeping up with the latest research and developments will help you remain competitive and stand out in your career.&lt;br&gt;
These points provide valuable guidance for individuals looking to embark on a data science journey or advance their existing data science skills. They emphasize the importance of a balanced approach that combines theoretical knowledge with practical experience and ongoing learning.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>DATA ENGINEERING FOR BEGINNERS -A STEP BY STEP GUIDE</title>
      <dc:creator>Isaac</dc:creator>
      <pubDate>Sun, 05 Nov 2023 12:37:01 +0000</pubDate>
      <link>https://dev.to/130/data-engineering-for-beginners-a-step-by-step-guide-1c3a</link>
      <guid>https://dev.to/130/data-engineering-for-beginners-a-step-by-step-guide-1c3a</guid>
      <description>&lt;p&gt;What is data engineering? This is probably the first question that comes to your mind after reading this title. Data engineering is simply the acquisition, storage, transformation and management of data. This ranges from simple data analysis to writing complex data processing models and algorithms to manage Big Data. Data engineers are the most technical profile in the field of data science, serving as a critical bridge between software and application developers. They are responsible for the first step of the traditional data science workflow, namely the process of collecting and storing data. They ensure that large amounts of data collected from various sources become resources that can be accessed by other data science professionals, such as data analysts and data scientists. The most important skills of a data engineer include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Be proficient in programming languages such as Python and Scala&lt;/li&gt;
&lt;li&gt;Learn automation and scripting&lt;/li&gt;
&lt;li&gt;Understand database management and develop your SQL skills&lt;/li&gt;
&lt;li&gt;Master data processing systems&lt;/li&gt;
&lt;li&gt;Learn how to organize your workflow&lt;/li&gt;
&lt;li&gt;Expand your knowledge of cloud computing and platforms such as AWS&lt;/li&gt;
&lt;li&gt;Increase your knowledge of infrastructure tools such as Docker and Kubernetes&lt;/li&gt;
&lt;/ol&gt;
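&lt;p&gt;The SQL point in practice: a join plus an aggregation, runnable against an in-memory SQLite database with invented rows:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER, name TEXT);
CREATE TABLE orders    (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Ben');
INSERT INTO orders VALUES (1, 100), (1, 50), (2, 70);
""")

# Join + aggregation: total spend per customer, highest first
rows = con.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY total DESC
""").fetchall()
print(rows)
```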

&lt;p&gt;Programming skills in Python, R and/or Scala together with database management are the core skills of any data engineer, as you will use them for roughly 75% of your time, so it is very important to learn them well. According to many studies, data engineering is a growing field that holds great promise for those willing to enter it. The chart below summarizes recent findings on career growth in tech.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzm7w2g2y7lryb9o66k8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzm7w2g2y7lryb9o66k8.png" alt="Career growths in TEC" width="768" height="462"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Data engineering is an emerging profession, and it is not always easy for recruiters to find the right candidates. Competition for this hard-to-find talent is high among companies, resulting in some of the highest salaries among data science positions. Based on several career portals, the average salary for data engineers in the United States is $114,564. However, these numbers differ depending on where in the country you work: according to Glassdoor, for example, the average salary for a data engineer in New York is $120,637, while in Chicago it is $113,610 and in Houston it is $94,416.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>TIME SERIES ANALYSIS</title>
      <dc:creator>Isaac</dc:creator>
      <pubDate>Wed, 01 Nov 2023 10:10:43 +0000</pubDate>
      <link>https://dev.to/130/time-series-analysis-4138</link>
      <guid>https://dev.to/130/time-series-analysis-4138</guid>
      <description>&lt;p&gt;A time series analysis is a method of data analysis that involves the observation of a variable's trend over time. Using this data organizations are able to identify how a certain product has been doing over time as well as identify the trend of its rise and falls over time and be able to answer the "why" question. This could help know at what season a product has the best sales so at to ensure they have it stock by the next season sales are expected to rise. This is important important as it  helps the company do timely re-stocking and hence they are able to maximize sales and profits.&lt;br&gt;
Additionally, a time-series analysis helps an organization detect when a commodity was at its lowest and this can help them understand why it was so by reviewing the timelines and see circumstances that might have led to that. This makes a time-series analysis an important tool for business forecasting.&lt;/p&gt;

&lt;p&gt;The best way to get the most out of a time-series analysis is to visualize it, e.g. with a scatter plot, a line chart or any other appropriate data visualization tool at your disposal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps to a Time-Series Analysis
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Data Collection&lt;/strong&gt;&lt;br&gt;
Gather all the relevant information you need for your analysis, be it sales records, website logs, order histories or date and time stamps; any information relevant to your analysis should be available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data Cleaning and Preprocessing&lt;/strong&gt;&lt;br&gt;
This is the most important stage, as it prepares the data you will use in your analysis. Pre-processing involves filling null values, removing duplicate values, converting data types, dealing with outliers and standardizing the data into the best workable format. This prevents your analysis from producing wrong results, per the rule of &lt;strong&gt;&lt;em&gt;Garbage In, Garbage Out&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
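&lt;p&gt;Typical pre-processing steps for a time series in pandas; the invented raw table below has a duplicate date, a missing value and a missing day:&lt;/p&gt;

```python
import pandas as pd

raw = pd.DataFrame({
    "date":  ["2023-01-01", "2023-01-02", "2023-01-02", "2023-01-04"],
    "sales": [100, None, 120, 130],
})

ts = (raw
      .assign(date=pd.to_datetime(raw["date"]))        # type conversion
      .drop_duplicates(subset=["date"], keep="last")   # duplicate removal
      .set_index("date")
      .asfreq("D")     # expose the missing day as a gap
      .ffill())        # fill gaps from the last known value
print(ts)
```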

&lt;p&gt;&lt;strong&gt;3. Data Visualization&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;After you data is ready, use data visualizing tools to plot your data to visuals that will help you get actionable insights from your data. This will enable you notice major trends, outliers as well as patterns that are important in answering your questions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Time Series Decomposition&lt;/strong&gt;&lt;br&gt;
A time series has 3 components: Trend, Seasonality and Residual (noise).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trend=&amp;gt;&lt;/strong&gt; the long-term increase or decrease in the data. &lt;strong&gt;Seasonal=&amp;gt;&lt;/strong&gt; patterns that repeat after a specific amount of time. &lt;strong&gt;Residual=&amp;gt;&lt;/strong&gt; what remains after the trend and seasonal components are removed.&lt;/p&gt;
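&lt;p&gt;A hand-rolled decomposition sketch in pandas, on a synthetic daily series with a linear trend and a weekly season (for real work, a library such as statsmodels provides seasonal_decompose):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic series: upward trend + weekly seasonality + noise
idx = pd.date_range("2023-01-01", periods=70, freq="D")
trend = np.linspace(100, 170, 70)
season = np.tile([0, 2, 4, 6, 4, 2, 0], 10)
rng = np.random.default_rng(1)
y = pd.Series(trend + season + rng.normal(0, 0.5, 70), index=idx)

# Trend: centred moving average over one full season (7 days)
est_trend = y.rolling(7, center=True).mean()
# Seasonal: average detrended value for each day of the week
detrended = y - est_trend
est_season = detrended.groupby(idx.dayofweek).mean()
# Residual: what is left after removing trend and season
resid = detrended - est_season.loc[idx.dayofweek].to_numpy()
print(est_season.round(1))
```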

&lt;p&gt;&lt;strong&gt;5. Model Selection&lt;/strong&gt;&lt;br&gt;
Choose an appropriate Time-Series model to help model your data. The major models include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Autoregressive Integrated Moving Average (ARIMA)&lt;/strong&gt;: For data that is stationary, or can be made stationary by differencing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seasonal-Trend decomposition using LOESS (STL)&lt;/strong&gt;: For data with seasonality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exponential Smoothing (ETS)&lt;/strong&gt;: Another method for forecasting time series data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model's forecasts should then be evaluated, for example against held-out data, to determine their accuracy.&lt;/p&gt;
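&lt;p&gt;To make the evaluation step concrete, here is simple exponential smoothing (the simplest ETS variant) written out by hand and scored with mean absolute error on a held-out tail; the series is invented:&lt;/p&gt;

```python
# Simple exponential smoothing: the forecast is a weighted level that
# gives recent observations more influence (weight alpha)
def ses_forecast(history, alpha=0.5):
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level  # flat one-step-ahead forecast

series = [20, 22, 21, 23, 22, 24, 23, 25]
train, test = series[:6], series[6:]

forecast = ses_forecast(train)
# Mean absolute error against the held-out observations
mae = sum(abs(actual - forecast) for actual in test) / len(test)
print(forecast, mae)
```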

</description>
    </item>
    <item>
      <title>DIVE INTO RFM DATA ANALYSIS</title>
      <dc:creator>Isaac</dc:creator>
      <pubDate>Wed, 11 Oct 2023 06:02:50 +0000</pubDate>
      <link>https://dev.to/130/dive-into-rfm-data-analysis-18ak</link>
      <guid>https://dev.to/130/dive-into-rfm-data-analysis-18ak</guid>
      <description>&lt;p&gt;RFM is an acronym  for &lt;strong&gt;R&lt;/strong&gt;ecency &lt;strong&gt;F&lt;/strong&gt;requency &lt;strong&gt;M&lt;/strong&gt;onetary Data Analysis. It is a technique used by most data scientist in the e-commerce industry to rank customers based on how lately they bought goods from their store, how frequently they make purchases and how much in total they spend &lt;/p&gt;

&lt;p&gt;RFM can be used to detect customers' usage behavior and patterns, such as the number of customers who are frequent buyers or high spenders. This helps the company come up with custom marketing strategies for each target group, with the aim of accelerating sales and getting the most out of each group. The analysis is thus very important for shaping a company's marketing strategies.&lt;/p&gt;

&lt;p&gt;In this article we will dive into a sample case scenario to explore some uses of RFM and how to leverage it in your industry. &lt;strong&gt;This is a case study for learning purposes, not a production-grade RFM model.&lt;/strong&gt;&lt;br&gt;
In an attempt to solve &lt;a href="https://statso.io/rfm-analysis-case-study/"&gt;this&lt;/a&gt; challenge I came up with the following model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

#Read csv File with raw data

myData= pd.read_csv("rfm_data.csv")

#Remove duplicates by making a copy of original Data
data = myData.copy(deep=True).drop_duplicates(subset=["CustomerID"],keep="last")


#Calculate frequency of each CustomerID
frequencySeries =myData.CustomerID.value_counts().reset_index(name="Frequency")
print(frequencySeries.loc[frequencySeries["CustomerID"]==8317])



#Appending frequency column to DataFrame
frequencySeries.rename({"index":"CustomerID"},inplace=True)
data = data.merge(frequencySeries,on="CustomerID",how="left")


#Total Money spent per user
moneySpent = myData.groupby("CustomerID")['TransactionAmount'].sum().reset_index(name="Total Spent")
data= data.merge(moneySpent,on="CustomerID",how="left")




recencyScore = [5,4,3,2,1,]
frequencyScore = [1,2,3,4,5]
monetaryScore = [1,2,3,4,5]

#Grading users 
data["recencyScore"] =pd.cut(data["recency"],bins=5,labels=recencyScore).astype(int)

data["FrequencyScore"] = pd.cut(data["Frequency"],bins=5,labels=frequencyScore).astype(int)

data["Monetary Score"] = pd.cut(data["Total Spent"],bins=5,labels=monetaryScore).astype(int)

data['totalScore'] = data["FrequencyScore"] + data["Monetary Score"] + data["recencyScore"]

#Ranking users based on ttal RFM score

myLabels = ["Beginner","Intermediate","PRO"]
data["Rank"] = pd.cut(data["totalScore"],bins=3,labels=myLabels)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I then used &lt;em&gt;matplotlib&lt;/em&gt; to show the various statistics of users:&lt;br&gt;
 Beginner - users with a low RFM score&lt;br&gt;
 Intermediate - users with an average RFM score&lt;br&gt;
 PRO - users with a high RFM score&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Show the Number of users per ranking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;counts = data["Rank"].value_counts().reindex(myLabels)&lt;br&gt;
plt.pie(counts, labels=myLabels, autopct='%1.1f%%')&lt;br&gt;
plt.axis('equal')&lt;br&gt;
plt.legend(labels=myLabels)&lt;br&gt;
plt.show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4v4eR51o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e56j012ufnz9ra5w45qz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4v4eR51o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e56j012ufnz9ra5w45qz.png" alt="Pie chart to show users distribution" width="701" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Show user distribution across major cities&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def getType(myType,myCity) :
    return len(data.loc[(data["Rank"]==myType) &amp;amp; (data["Location"]==myCity)])

myCityStats={}
for i in myLabels:
    for j in locationLabels:
        myCityStats[i]= [getType(i,j) for j in locationLabels]

print(myCityStats)

x=np.arange(len(locationLabels))
width = 0.3
multiplier = 0

fig,ax = plt.subplots(layout='constrained')

for att,val in myCityStats.items():
    offset = width*multiplier
    bars = ax.bar(x+offset,val,width,label=att)
    ax.bar_label(bars,padding=3)
    multiplier+=1


# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Length (mm)')
ax.set_title('Number of users in a city per RFM Rank')
ax.set_xticks(x + width, locationLabels)
ax.legend(loc='upper left', ncols=3)
ax.set_ylim(0, 250)

plt.show()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Group bar charts for Rank distribution in Major Cities&lt;/em&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--edCehK8M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vrv0q7ryqozjo1rmoxf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--edCehK8M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vrv0q7ryqozjo1rmoxf5.png" alt="Group bar charts for Rank distribution in Major Cities" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are just a few visualizations; many more can be built depending on the questions your organization needs answered.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>dataanalysis</category>
      <category>python</category>
      <category>pandas</category>
    </item>
    <item>
      <title>DATA EXPLORATORY ANALYSIS</title>
      <dc:creator>Isaac</dc:creator>
      <pubDate>Tue, 10 Oct 2023 08:39:40 +0000</pubDate>
      <link>https://dev.to/130/data-exploratory-analysis-54ol</link>
      <guid>https://dev.to/130/data-exploratory-analysis-54ol</guid>
      <description>&lt;p&gt;Just as the name might suggest ,Data exploratory analysis is a method of data analysis that involves the highlighting of key insights from a dataset and incorporating data visualizing tools to display them.&lt;/p&gt;

&lt;p&gt;Most datasets you come across will need some work before you can actually use them for the analysis itself. This mainly includes removing rows or columns with missing values, removing duplicate values and other necessary data cleaning practices. First we will explore the pre-EDA activities and then the post-EDA activities.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;PRE-EDA Activities&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Data Cleaning&lt;/strong&gt;&lt;br&gt;
   Raw data is exactly that: 'raw', and it needs to be 'cooked' before it is actually ready for consumption :) . Data analysis is often said to be 80% data cleaning and 20% actual analysis, which makes this the most crucial step; it needs lots of attention to make sure the data being used is free from anomalies. As stated earlier, data cleaning mainly involves the removal or correction of inaccurate records. The major data cleaning practices include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parsing&lt;/strong&gt;-&amp;gt; Converting data to the required format acceptable to the application in use&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Duplicate Elimination&lt;/strong&gt; -&amp;gt; Removal of repeating entries of a key in the dataset&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Removal or Updating of missing values&lt;/strong&gt; -&amp;gt; Some entries might be blank and may need to be removed or filled systematically to keep the dataset as true a representation of what was collected as possible. If too many values are missing you may consider removing those records, but if the number is small, fill them using various methods, e.g. the median or the mode&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Transformation&lt;/strong&gt; -&amp;gt; This involves reformatting how the data is represented into a form that can be used by the analytics tool.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
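&lt;p&gt;The practices above in a few lines of pandas, on a toy table invented for illustration:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "joined": ["2023-01-05", "2023-02-10", "2023-02-10", "2023-03-01"],
    "age":    [25, None, None, 41],
})

df = df.drop_duplicates()                         # duplicate elimination
df["joined"] = pd.to_datetime(df["joined"])       # parsing to a proper dtype
df["age"] = df["age"].fillna(df["age"].median())  # fill missing with the median
print(df)
```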

&lt;p&gt;Data cleaning is mainly done using major programming languages like Python (with NumPy and pandas), as well as tools such as MS Excel, Tableau and many more.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;POST-EDA Activities&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;EDA itself can be graphical or non-graphical. Non-graphical EDA may be a table summarizing, say, measures of central tendency such as the mean, mode and median, or the quartiles of the data. Graphical EDA uses graphs, pie charts, histograms, line graphs, funnels, scatter plots or any other graphical representation of the data.&lt;/p&gt;
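&lt;p&gt;Non-graphical EDA can be as short as one call to pandas' describe(), which reports the count, mean, spread and quartiles of a column (toy numbers for illustration):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"sales": [10, 12, 12, 15, 18, 40]})

# Count, mean, std, min, quartiles and max in one call
summary = df["sales"].describe()
print(summary)
print(df["sales"].mode().tolist())   # the mode, separately
```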

&lt;p&gt;&lt;strong&gt;NON-GRAPHICAL EDA&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AOBRThAs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/on93ou330nlm36spg3yz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AOBRThAs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/on93ou330nlm36spg3yz.png" alt="Non-Graphical EDA" width="292" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graphical EDA&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mpc-MhBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zknq0zdautlk696pv8zm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mpc-MhBp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zknq0zdautlk696pv8zm.png" alt="Graphical EDA" width="235" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
