Data cleaning is a critical step in data analysis, and it is often said to take up to 80% of the time spent on a project.
Data cleaning is not merely a chore but a craft that shapes the destiny of your analysis.
What follows is a brief exploration of data cleaning techniques that transform raw data into a valuable asset.
"Data is a precious thing and will last longer than the systems themselves."
Common data-cleaning techniques include:
Outlier detection - The process of identifying and managing values that are unusually high or low compared to the rest of the data. Outliers can have a substantial impact on statistical analyses and machine learning models.
Data profiling - The process of analyzing and examining the content, structure, and quality of a dataset.
It helps ensure that subsequent analyses are based on a solid understanding of the data and that potential issues are addressed early in the process.
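As a rough sketch of both ideas, the snippet below profiles a small DataFrame and then flags outliers with the 1.5 * IQR rule. The data, the column names, and the choice of the IQR rule are assumptions made for illustration; other outlier rules (z-scores, domain limits) may suit your data better.

```python
import pandas as pd

# Hypothetical workout data; 450 is planted as an obvious outlier.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45, 450, 60, 30],
    "Calories": [409.1, 479.0, 340.0, None, 282.4, 300.0, 195.1],
})

# Profiling: column types, missing values, and summary statistics.
df.info()
print(df.isna().sum())
print(df.describe())

# Outlier detection on one column with the 1.5 * IQR rule.
q1 = df["Duration"].quantile(0.25)
q3 = df["Duration"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose Duration falls outside the allowed range.
outliers = df[(df["Duration"] < lower) | (df["Duration"] > upper)]
print(outliers)
```

Whether a flagged row should be removed, capped, or kept depends on the dataset, so treat the flag as a prompt to look closer rather than as a verdict.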
In the era of big data, where the volume and diversity of information continue to grow, the importance of sound data-cleaning practices becomes increasingly evident. By adopting and adapting these techniques to the unique characteristics of each dataset, analysts and data scientists pave the way for more trustworthy, insightful, and impactful results.
Data cleaning involves fixing or removing bad data: empty cells, data in the wrong format, wrong data, and duplicates.
Cleaning Empty Cells
Empty cells can potentially give you wrong results when you analyze data.
One way to deal with empty cells is to remove rows that contain empty cells.
By default, the dropna() method returns a new DataFrame, and will not change the original.
With inplace=True, dropna() does NOT return a new DataFrame; instead, it removes all rows containing NULL values from the original DataFrame.
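The difference between the two calls is easier to see on a tiny example. This is a minimal sketch with made-up data and column names:

```python
import pandas as pd

# Hypothetical data with two rows that contain an empty cell.
df = pd.DataFrame({
    "Duration": [60, 45, None, 30],
    "Calories": [409.1, None, 300.0, 195.1],
})

# Default behaviour: a new DataFrame comes back and df is left untouched.
new_df = df.dropna()
rows_before = len(df)
rows_in_copy = len(new_df)
print(rows_before, rows_in_copy)

# inplace=True: df itself is modified and the call returns None.
result = df.dropna(inplace=True)
print(result, len(df))
```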
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
df.fillna(130, inplace=True)
This replaces the empty cells in the whole DataFrame, but you can also replace values in specific columns only.
- To replace empty values in only one column, select that column by name and assign the result back to it.
df["Calories"] = df["Calories"].fillna(130)
- A common way to replace empty cells is to use the mean, median, or mode of the column. For the mean, the code is:
x = df["Calories"].mean()
df["Calories"] = df["Calories"].fillna(x)
For the median:
x = df["Calories"].median()
df["Calories"] = df["Calories"].fillna(x)
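Here is a small sketch that compares the three fill values side by side, including the mode. The data is invented for illustration. Note that mode() returns a Series (there can be more than one most-frequent value), so the example takes the first entry.

```python
import pandas as pd

# Hypothetical column with two empty cells.
df = pd.DataFrame({"Calories": [300.0, None, 300.0, 420.0, None, 240.0]})

mean_value = df["Calories"].mean()
median_value = df["Calories"].median()
mode_value = df["Calories"].mode()[0]  # mode() returns a Series; take the first value

# Each call returns a new Series, so the original column is not changed.
by_mean = df["Calories"].fillna(mean_value)
by_median = df["Calories"].fillna(median_value)
by_mode = df["Calories"].fillna(mode_value)

print(mean_value, median_value, mode_value)
print(by_mean.tolist())
```

The mean is sensitive to extreme values, so the median is often the safer choice when the column has outliers.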
Cleaning Data of Wrong Format
Cells with data in the wrong format can make it difficult, or even impossible, to analyze data.
To fix this, you can either remove the affected rows or convert all cells in the column to the same format.
pandas has a function for converting values to datetimes: to_datetime().
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
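If some cells cannot be parsed as dates at all, one option is to let pandas mark them as missing and then drop those rows. The sketch below assumes made-up data and uses errors="coerce", which turns unparseable values into NaT (the datetime equivalent of an empty cell):

```python
import pandas as pd

# Hypothetical data: one unparseable date and one empty date.
df = pd.DataFrame({
    "Date": ["2020-12-01", "2020-12-02", "not a date", None],
    "Duration": [60, 45, 30, 60],
})

# Unparseable values become NaT instead of raising an error.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Remove the rows where the date could not be converted.
df = df.dropna(subset=["Date"])
print(df.to_string())
```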
Replacing Values
For small datasets, you can use loc to replace a wrong or missing value in a single cell.
df.loc[7, 'Duration'] = 45
Here, 7 is the row index and Duration is the column name.
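Fixing cells one by one does not scale to larger datasets. One possible approach is to pass a condition to loc so that every matching row is updated at once. The data and the 120-minute limit below are assumptions made for illustration only:

```python
import pandas as pd

# Hypothetical data: 450 and 200 look like entry errors.
df = pd.DataFrame({"Duration": [60, 45, 450, 30, 200]})

# Assumed rule: cap any Duration above 120 at 120.
df.loc[df["Duration"] > 120, "Duration"] = 120
print(df["Duration"].tolist())
```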
Removing Duplicates
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row: True if the row repeats an earlier row, and False otherwise.
df = pd.read_csv
print(df.duplicated())
To remove duplicates, use the drop_duplicates() method.
Note: Passing inplace=True makes sure that the method does NOT return a new DataFrame; instead, it removes all duplicates from the original DataFrame.
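Putting discovery and removal together, a minimal sketch with invented data (the third row repeats the first) could look like this:

```python
import pandas as pd

# Hypothetical data: the row at index 2 duplicates the row at index 0.
df = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Pulse": [110, 117, 110, 103],
})

# True marks a row that repeats an earlier row.
flags = df.duplicated()
print(flags.tolist())

# Default behaviour: a new DataFrame without the duplicates.
deduped = df.drop_duplicates()

# inplace=True: remove the duplicates from df itself.
df.drop_duplicates(inplace=True)
print(len(deduped), len(df))
```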
Data cleaning also involves pre-processing steps such as renaming columns, converting the date column to a datetime type, and sorting the data in ascending order of date.
Renaming the columns:
df =df.rename(columns=
Converting the date column to datetime:
df['Date'] = pd.to_datetime(df['Date'])
Sorting the dataset in ascending order of date:
df = df.sort_values(by='Date')
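The three steps can be chained on a small example. The original column names ("dt", "dur") and the rename mapping below are hypothetical placeholders; substitute the names from your own dataset.

```python
import pandas as pd

# Hypothetical raw data with terse column names and unsorted dates.
df = pd.DataFrame({
    "dt": ["2020-12-03", "2020-12-01", "2020-12-02"],
    "dur": [30, 60, 45],
})

# Rename columns with an old-name -> new-name mapping (assumed names).
df = df.rename(columns={"dt": "Date", "dur": "Duration"})

# Convert the date column from strings to datetimes.
df["Date"] = pd.to_datetime(df["Date"])

# Sort by date (ascending is the default) and renumber the rows.
df = df.sort_values(by="Date").reset_index(drop=True)
print(df.to_string())
```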