A NOTE TO REMEMBER

#beginners #datascience #machinelearning #python

Hello Everyone

I hope you are all doing well during this lockdown. The title may leave you wondering which note you are supposed to remember. I recently started exploring the data field (ML, DS, DL, and so on), and it is pretty cool when the output of your code turns out to be a prediction about the future.

So what is the first thing that comes to mind when we hear about this field?

We tend to think it is all about learning a handful of algorithms, then two to four Python libraries for cleaning the data, and then we are done!

The most important thing I want to discuss here, though, is the backbone of this field: data.

The algorithms fall into three broad categories:

supervised (you know the past relation, i.e. the data is labeled)
unsupervised (no past relation is known to you, so you form groups out of the data yourself)
reinforcement (the model is rewarded for success and penalized for failure)

After getting familiar with these, you learn how to apply them to data to predict future outcomes.
Basically, we have two types of data:

structured data
unstructured data

Structured data here means there is little or no data cleaning to do (the part where terms you have heard, like visualization and wrangling, come in). You just import it, run train_test_split, and fit the model.
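
As a rough sketch of that import, split, and fit flow: the table below is made up for illustration (the column names and values are my own assumptions), and LogisticRegression is just one model picked as an example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical, already-clean table: two numeric features and a binary label.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "hours_slept":   [8, 7, 8, 6, 7, 6, 5, 6, 5, 4],
    "passed":        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
})

X = df[["hours_studied", "hours_slept"]]
y = df["passed"]

# Hold out 30% of the rows to check the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

With real data the only things that change are where the DataFrame comes from (for example pd.read_csv) and which columns go into X and y.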

Now let's get our hands dirty with unstructured data, because what I have learned in these months is that we will almost always face unstructured data.

I am going to use the following libraries for this purpose:
Step 1: importing the libraries:

NumPy - import numpy as np (for data preprocessing)
pandas - import pandas as pd (for data cleaning)
matplotlib - import matplotlib.pyplot as plt (for data visualization)
seaborn - import seaborn as sns (for data visualization)

Matplotlib is a Python library for creating 2D graphs and plots from Python scripts. But I think that if you are handling a larger dataset with a lot of non-linearity, seaborn should be your main weapon.

Step 2: where and how to use different seaborn plots:

plot name	where to use	how to use
heatmap	to get an overview of the data and the relations (e.g. correlations) between columns	sns.heatmap(data.corr())
barplot	to compare a numeric value between categories	sns.barplot(x='category_column', y='value_column', data=data)
countplot	like barplot, but shows how often each label occurs	sns.countplot(x='column', data=data)
distplot	to see the distribution of the data (distplot is deprecated in recent seaborn versions; use histplot or displot instead)	sns.histplot(data['column'])
box-plot	shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable	sns.boxplot(data=data)

Once you have visualized the data and understand the relations between the different parameters, you are ready to clean it.

However, these are the problems we run into while handling unstructured data:

categorical columns
null values (NaN)
biased columns
no values (blank)
outliers

Starting from the end: outliers are like the cardamom in your biriyani. They are the ones that will lower the accuracy of your model.
How do we deal with them?

1) Univariate method: looks for data points with extreme values on a single variable.
2) Multivariate method: looks for unusual combinations of values across all the variables.
3) Minkowski error: a loss function that reduces the contribution of potential outliers during training.
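
A minimal sketch of two of these ideas. For the univariate method it uses the common 1.5 x IQR rule, which is my choice of rule since the list above does not name one. The minkowski_error helper is a hypothetical function, and the exponent 1.5 is an assumed value; it only shows how such a loss grows more slowly than squared error.

```python
import numpy as np
import pandas as pd

# Univariate check on one column: flag values far outside the middle 50%.
values = pd.Series([12, 14, 13, 15, 14, 13, 98, 12, 15, 14])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]   # only the value 98 is flagged here
cleaned = values[~is_outlier]


def minkowski_error(y_true, y_pred, p=1.5):
    """Sum of absolute errors raised to the power p (p=2 gives squared error)."""
    return np.sum(np.abs(y_true - y_pred) ** p)


y_true = np.array([0.0, 0.0])
y_pred = np.array([1.0, 4.0])   # the second prediction is far off
print(minkowski_error(y_true, y_pred, p=2))    # squared error
print(minkowski_error(y_true, y_pred, p=1.5))  # the large error counts for less
```
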
Blank values, i.e. missing values: sometimes data is missing in some columns, but the output depends on those columns, so you have to fill the gaps sensibly, sometimes with the most frequent value and sometimes with the average.
data.fillna(value) - fill with a fixed value
data.bfill() / data.ffill() - backward/forward filling (older code writes data.fillna(method='bfill') or data.fillna(method='ffill'), which recent pandas versions deprecate)
data.fillna(data.mean(numeric_only=True)) - fill with the average value
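
Here is a short sketch of those filling strategies on a made-up DataFrame (the age and city columns are assumptions for illustration). The last line fills a text column with its most frequent value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25.0, np.nan, 31.0, 40.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

fixed = df["age"].fillna(0)                      # fixed value
backward = df["age"].bfill()                     # copy the next valid value
forward = df["age"].ffill()                      # copy the previous valid value
mean_filled = df["age"].fillna(df["age"].mean()) # column average
mode_filled = df["city"].fillna(df["city"].mode()[0])  # most frequent value
```
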
Now, what is a biased column? Suppose your data has a gender column that is required for the prediction, but the male:female ratio is 95:5. This is called a biased (imbalanced) column. Try to keep the values reasonably balanced, otherwise the model will tend to predict according to the dominant value.
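
A quick way to spot such a column is to look at the share of each value. This is only a sketch: the Series below recreates the 95:5 example, and the 0.9 cutoff is an arbitrary threshold I picked for illustration.

```python
import pandas as pd

# Recreate the 95:5 example from the text.
gender = pd.Series(["male"] * 95 + ["female"] * 5, name="gender")

# Fraction of rows taken by each value.
shares = gender.value_counts(normalize=True)
print(shares)

# Hypothetical rule of thumb: flag the column if one value dominates.
majority_share = shares.max()
is_imbalanced = majority_share > 0.9
```
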
The traditional way to deal with null values is to drop them with data.dropna(). But if the column is required for your prediction, instead of dropping the rows try to fill the gaps with another value, as described in the blank values case.
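
The sketch below, on made-up data, shows why a blanket dropna() can be costly and how the subset parameter limits the drop to the columns you actually need.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25.0, np.nan, 31.0, 40.0],
    "income": [50.0, 62.0, np.nan, 58.0],
    "note":   [None, "ok", "ok", None],
})

# Dropping every row with any missing value leaves nothing here,
# because each row has a gap in at least one column.
all_complete = df.dropna()

# Only require the columns needed for the prediction.
needed = df.dropna(subset=["age", "income"])
```
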
Last but not least, how to handle categorical columns:
a) Creating dummies:
An easy and fast way to handle categorical column values (PS: not useful when there are many categories).
pd.get_dummies(data)
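
For example, on a small made-up DataFrame (the color and price columns are assumptions), get_dummies replaces the categorical column with one indicator column per category:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [10, 12, 9, 11],
})

# One new column per category; the original "color" column is removed.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())
```
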
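b) When the categorical variables are ordinal (the labels have a natural order), the easiest approach is to replace each label with a number (not suitable for nominal variables with more than two values). For a two-value column, a plain 0/1 replacement looks like this:
data.replace('man', 0, inplace=True)
data.replace('woman', 1, inplace=True)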
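
For a column that really is ordinal, an explicit mapping keeps the order visible. A minimal sketch, assuming a hypothetical size column:

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# The numbers follow the natural order of the labels.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_order)
```

Any label missing from the mapping becomes NaN, which makes typos in the data easy to notice.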
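c) One-hot encoding: suitable for a small number of categories; it converts each category into its own column of 1s and 0s.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
column_transformer = ColumnTransformer([('encoder', OneHotEncoder(), categorical_columns)], remainder='passthrough')  # categorical_columns: indices of the columns to encode, e.g. [0]
data = np.array(column_transformer.fit_transform(data), dtype=str)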
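
A runnable sketch of the same idea on a made-up DataFrame (the city and age columns are assumptions). I pass sparse_threshold=0 so the result is always a dense array, and convert to float instead of str because every output column is numeric here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "age":  [25, 32, 47, 51],
})

# Encode column 0 (city) and pass the remaining columns through unchanged.
column_transformer = ColumnTransformer(
    [("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

encoded = np.asarray(column_transformer.fit_transform(df), dtype=float)
# Columns: Delhi, Mumbai, Pune (sorted categories), then age.
print(encoded)
```
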
d) LabelEncoder: converts any number of categories into integer codes. Note that scikit-learn documents it for encoding target labels (y) rather than input features.
from sklearn.preprocessing import LabelEncoder
LabelEncoder().fit_transform(data['column'])
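
A small sketch with made-up labels. Keeping the fitted encoder around lets you turn the codes back into the original labels later:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["cat", "dog", "bird", "dog", "cat"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# Classes are stored in sorted order, so bird=0, cat=1, dog=2.
print(encoder.classes_)
print(codes)

# Map the integer codes back to the original labels.
original = encoder.inverse_transform(codes)
```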

That's all. I hope this helps you a lot with data preprocessing, which in ML terms is part of what we call feature engineering.

For examples, you can check my GitHub repo: https://github.com/Ashishkumarpanda

I am just a beginner, so do comment with any other methods if I missed something. Thank you :)

DEV Community