DEV Community

Cover image for Data Cleaning Techniques.
Kaira Kelvin.
Kaira Kelvin.

Posted on • Updated on

Data Cleaning Techniques.

Data cleaning is a critical step in data analysis, and it can take up to 80% of the time spent on a project."
Data cleaning is not merely a chore but a craft that shapes the destiny of your analysis.
A brief exploration of data cleaning techniques that transform raw data into a valuable asset: include

Image description

Data is a precious thing and will last longer than the systems themselves."
More data-cleaning techniques include;

  • Outlier Detection- The process of identifying and managing values that significantly deviate from the rest of the data. Outliers are values that are unusually high or low compared to the rest of the data and can have a substantial impact on statistical analyses and machine learning models.

  • Data profiling - It is the process of analyzing and examining the content, structure, and quality of a dataset.
    It helps ensure that subsequent analyses are based on a solid understanding of the data and that potential issues are addressed early in the process.

In the era of big data, where the volume and diversity of information continue to grow, the importance of sound data-cleaning practices becomes increasingly evident. By adopting and adapting these techniques to the unique characteristics of each dataset, analysts and data scientists pave the way for more trustworthy, insightful, and impactful results.

Data Cleaning involves removing bad data, data that
has empty cells, data in the wrong format, wrong data, and duplicates.

Cleaning Empty Cells.

Empty cells can potentially give you wrong results when you analyze data.
One way to deal with empty cells is to remove rows that contain empty cells.
By default, the dropna() method returns a new DataFrame, and will not change the original.
Now, the dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows containing NULL values from the original DataFrame.

Replace Empty Values

Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.

df.fillna (130,inplace= True)
Enter fullscreen mode Exit fullscreen mode

This code remember it replaces all the empty cells in the whole data, but still, u can replace only specified columns

  1. To only replace empty values for one column, specify the column name for the DataFrame.
df["calories"].fillna(130,inplace = True)
Enter fullscreen mode Exit fullscreen mode
  1. A common way to replace empty cells, is to calculate the mean, median, or mode value of the column. for mean the code will be
x = df["Calories"].mean()
df["Calories"].fillna(x, inplace = True)
Enter fullscreen mode Exit fullscreen mode
x=df["Calories"].median()
df["Calories"].fillna(x, inplace =True)

Enter fullscreen mode Exit fullscreen mode

Cleaning Data of Wrong Format.

Cells with data in the wrong format can make it difficult, or even impossible, to analyze data.
To fix the cells u have to remove the rows or convert all cells in the columns into the same format.
pandas have a method to convert datetime. to_datetime()

df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
Enter fullscreen mode Exit fullscreen mode

Replacing Values

U can use loc to replace missing values for small data

df.loc[7, 'Duration'] = 45 
Enter fullscreen mode Exit fullscreen mode

where 7 is the row index and the Duration is the Column name.

Removing Duplicates
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
To discover duplicates, we can use the duplicated() method.
The duplicated method return a boolean values for each row.

df = pd.read_csv
print(df.duplicated())
Enter fullscreen mode Exit fullscreen mode

To remove duplicates, use the drop_duplicates() method.
Note The (inplace = True) will make sure that the method does
NOT return a new DataFrame, but it will remove all
duplicates from the original DataFrame.
Data cleaning also involves data pre-processing such as renaming the column, converting it to datetime variable and sorting the data in ascending order of date.

Renaming the data columns include;
df =df.rename(columns=

converting the date column to datetime,
df['Date'] =pd.to_datetime(df['Date'])

sorting the dataset in ascending order of date
df =df.sort_values(by = 'Date")

Top comments (0)