DEV Community

Ben Goulding


How do you actually clean data in Excel?


What is data cleaning?

Data cleaning is the process of identifying and correcting inaccuracies and inconsistencies in data. Excel is a powerful tool for data cleaning because it offers many features for working with data, such as filtering, sorting, and conditional formatting.
When cleaning data in Excel, you look for such errors and correct them. There are many ways to do this, so it is important to find the method that works best for you and your dataset.

Errors can include invalid values, missing values, and duplicates. When cleaning data in Excel, be sure to check all cells that contain information relevant to your analysis.
Be patient while cleaning data in Excel; sometimes it may take several rounds of corrections before the dataset is accurate and consistent.


What do you need to look out for when cleaning data in Excel?

Start by identifying the source of your data. This will help you determine what type of cleaning is necessary.
Once you know the source of your data, open it in Excel and take a look at the overall structure. Is it organised in a way that makes sense? If not, you may need to reorganise it before proceeding with cleaning.

Next, start looking for any obvious errors or inconsistencies. These could be things like incorrect values, missing data, or formatting issues.

Once you've identified potential errors, decide how you want to handle them. Do you want to delete them, correct them, or just flag them for further review?

Finally, save your cleaned data set in a new file so you don't accidentally overwrite your original data set.


How do you deal with empty values?

If you want to replace all empty cells in a range with zeroes, select the range, choose Home > Find & Select > Go To Special, pick Blanks, and click OK. With the blank cells selected, type 0 and press Ctrl+Enter to write it into all of them at once.

To fill empty cells with a specific text string instead, select the range, use Home > Find & Select > Go To Special > Blanks to select the empty cells, type the text you want (for example, N/A), and press Ctrl+Enter.

To flag cells that contain only zeroes or nothing at all, use the IF function in a helper column. For example, =IF(OR(ISBLANK(A2), A2=0), 0, 1) returns 1 when A2 holds a non-zero value and 0 when A2 is empty or zero. Fill the formula down to check the rest of the column.

Keep in mind that a cell can look empty while still containing spaces or an empty string returned by a formula. ISBLANK returns TRUE only for truly empty cells, so in conditional statements such as IF, a test like LEN(TRIM(A2))=0 catches these cases as well.
To delete blank rows or columns from an Excel worksheet, select those rows or columns using either your mouse or keyboard shortcuts, and then choose Home > Delete > Delete Sheet Rows or Home > Delete > Delete Sheet Columns.
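
For readers who prefer to check the logic outside Excel, here is a minimal Python sketch of the same two ideas: replacing blanks with a default value and flagging cells that are blank or zero. The sample column and the helper names `is_blank` and `fill_blanks` are made up for illustration, and `None`, empty strings, and whitespace-only strings are assumed to stand in for empty cells.

```python
# Hypothetical sample column; None, "" and "  " stand in for empty-looking cells.
raw_column = [12, None, 7, "", 0, "  ", 3]


def is_blank(value):
    """Treat None, empty strings, and whitespace-only strings as blank."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def fill_blanks(values, replacement=0):
    """Return a new list with every blank entry replaced by `replacement`."""
    return [replacement if is_blank(v) else v for v in values]


# Replace blanks with zero (the original list is left untouched).
filled = fill_blanks(raw_column)

# Flag each cell: 1 for a non-zero value, 0 for blank or zero.
flags = [0 if is_blank(v) or v == 0 else 1 for v in raw_column]

print(filled)
print(flags)
```

Passing a different `replacement`, such as `"N/A"`, gives the text-fill variant. Building a new list rather than editing in place mirrors the advice to keep the original data intact.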


How do you deal with duplicate records?

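To clean data in Excel, you first need to identify which cells contain duplicate records. You can do this by using the COUNTIF function to count the number of times a value appears in a range of cells. For example, entering =COUNTIF(A:A, A2) in a helper column and filling it down shows how many times each value in column A appears; any count above 1 marks a duplicate. Pressing the Delete key only clears the cell contents, so to remove a duplicate record, select its row, right-click, and choose Delete.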

You can also use the Data > Remove Duplicates command in Excel to remove duplicate values from your data set. Be sure to check for duplicates regularly, as they can introduce errors into your data analysis.
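
The counting and de-duplication steps can also be sketched in Python. This is an illustration under assumptions rather than a reproduction of Excel's behaviour: the sample values and the `remove_duplicates` helper are hypothetical, and the comparison here is exact (case-sensitive), whereas COUNTIF ignores case.

```python
from collections import Counter

# Hypothetical sample column of record keys.
emails = [
    "a@example.com",
    "b@example.com",
    "a@example.com",
    "c@example.com",
    "b@example.com",
]

# Count occurrences of each value, similar in spirit to COUNTIF per cell.
counts = Counter(emails)

# True where the value appears more than once in the column.
duplicate_flags = [counts[e] > 1 for e in emails]


def remove_duplicates(values):
    """Keep the first occurrence of each value and preserve the order."""
    seen = set()
    unique = []
    for v in values:
        if v not in seen:
            seen.add(v)
            unique.append(v)
    return unique


deduped = remove_duplicates(emails)
print(duplicate_flags)
print(deduped)
```

Keeping the first occurrence and dropping later ones is a design choice; if the later record is the more up-to-date one, the list could be reversed before de-duplicating.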

How do you deal with outlier values?

First, you need to identify where your outlier values are. The Excel functions MIN and MAX return the lowest and highest values in a given range of cells, and MEDIAN returns the middle value. Values that sit far from the median are candidates for outliers.
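
As a rough sketch of this identification step, the Python below computes the minimum, maximum, and median, and then flags values using the common 1.5 x IQR rule. That rule is not part of the original text; it is one widely used convention, swapped in here to turn "far from the median" into something concrete. The sample data is made up.

```python
from statistics import median, quantiles

# Hypothetical sample measurements with one suspicious value.
values = [10, 12, 11, 13, 12, 95, 11, 10]

lowest = min(values)
highest = max(values)
middle = median(values)

# Quartiles (default "exclusive" method) and the interquartile range.
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(lowest, highest, middle)
print(outliers)
```

A flagged value is only a candidate: whether it is an error or a genuine extreme still needs a judgment call based on where the data came from.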

Once you've identified the outliers, you need to decide how to deal with them. There are a few different ways to do this, including:

a) Deleting them: If the outliers are clearly errors, such as data-entry mistakes, and make up only a small portion of your data set, then deleting them may be the best course of action.

b) Transforming them: Sometimes it's easier to transform an outlier value into something more representative. For example, if an outlier is extremely high or low compared to the rest of your data set, you could cap it at a chosen upper or lower limit before continuing with your analysis.

c) Ignoring them: Another option is to leave the outliers in place. This is reasonable when they are genuine values and your analysis relies on measures that are robust to them, such as the median. Bear in mind that outliers can strongly affect other statistics, such as the mean.

There are a few different methods for adjusting outliers, depending on their severity:

a) Capping: If an outlier is severely off-base (e.g., it's much higher than all other values), you could cap it at a chosen limit. This will make it more comparable to other values in the data set while still showing that it was a high value.

b) Multiplying by a known value: If a value looks like an outlier only because it was recorded on a different scale (for example, a proportion such as 0.25 among percentages such as 25), you can multiply it by the known conversion factor, such as 100. This will ensure that all values remain consistent within your data set regardless of how they were originally recorded.

c) Replacing the outlier with a resembling value: Occasionally one instance of an outlier can be replaced by a similar but less extreme value without greatly affecting the overall accuracy or validity of your dataset. For example, if cell G2 contains an extremely high value compared to all others (e.g., 10x higher), it could be replaced with the median of the column or with the nearest acceptable value. Make a note of any such replacement, since it changes the original data. (For more information on replacing outliers, see our blog post here.)
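
Two of these adjustments, capping and replacing with a typical value, can be sketched in Python as follows. The limits `LOWER` and `UPPER`, the sample data, and the helper names are all hypothetical; in practice the limits would come from your own knowledge of the data or from whatever rule you used to identify the outliers.

```python
from statistics import median

# Hypothetical sample measurements and hand-picked acceptable limits.
values = [10, 12, 11, 13, 12, 95, 11, 10]
LOWER, UPPER = 5, 20


def cap(data, lower, upper):
    """Pull values outside [lower, upper] back to the nearest limit."""
    return [min(max(v, lower), upper) for v in data]


def replace_with_median(data, lower, upper):
    """Replace out-of-range values with the median of the in-range values."""
    inliers = [v for v in data if lower <= v <= upper]
    typical = median(inliers)
    return [v if lower <= v <= upper else typical for v in data]


capped = cap(values, LOWER, UPPER)
replaced = replace_with_median(values, LOWER, UPPER)
print(capped)
print(replaced)
```

Both helpers return new lists, so the untouched original remains available for comparison or for undoing the change.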


How do you ensure consistency across variables?

The first step is to ensure that all of the data is entered in a consistent format. This means, for example, that all dates are entered in the same format (e.g. mm/dd/yyyy or dd/mm/yyyy) and that all numerical values use the same decimal separator (e.g. a period or a comma).
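
To make the idea of a consistent format concrete, here is a small Python sketch that normalises dates and decimal separators. The helper names and the list of accepted input formats are assumptions for illustration. Note that mm/dd/yyyy and dd/mm/yyyy cannot be told apart automatically for dates such as 03/04/2024, so you need to know which convention your source uses.

```python
from datetime import datetime


def normalize_date(text, input_formats=("%m/%d/%Y", "%Y-%m-%d")):
    """Try each assumed input format; return yyyy-mm-dd, or None if none match."""
    for fmt in input_formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_decimal(text):
    """Convert '3,14' or '3.14' to a float; assumes no thousands separators."""
    return float(text.strip().replace(",", "."))


print(normalize_date("03/25/2024"))
print(normalize_date("25.03.2024"))
print(normalize_decimal("3,14"))
```

Returning `None` for unrecognised dates, rather than guessing, leaves those cells flagged for manual review.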

Once you have ensured that the data is entered in a consistent format, you can then start to look for any obvious errors or outliers. These can be anything from typos to incorrect values that don't make sense in context.

Once you have identified any errors or outliers, you need to decide how to deal with them. This will often involve making judgment calls on whether to correct the errors or simply remove them from the dataset altogether.

After dealing with errors and outliers, you should take a look at the overall distribution of your data variables. This will help you determine where further cleaning may be required.


Conclusion

In conclusion, data cleaning in Excel is not as difficult as it may seem at first. With a little practice, you'll be able to clean your data quickly and efficiently.

There are a few things to keep in mind when cleaning data in Excel: make sure to check for errors, duplicate values, and missing values.

Data cleansing is an important step in any data analysis process, so take the time to learn how to do it properly.
