DEV Community

Cover image for Data Cleaning - Sort Values
Mage
Mage

Posted on

Data Cleaning - Sort Values

TLDR

Learn how to sort out our mess– I mean– row data in one or more columns by ascending and descending order.

Glossary

  • Best time to sort?

  • Dataset

  • Sort by…

  • “Abra” - kadabra! ✨🔮

Best time to sort?

By definition, sorting seems obvious– given text or numerical data, if you display text data in alphabetical order, you’re sorting by ascending values. Conversely, if you display the biggest to smallest numbers, you’re sorting by descending values.

Rather than asking what sorting is, the worthier question is whether your data pre-processing is optimized by sorting specific columns. Thus, what kinds of insights are you searching for in your data? In which situations, with what type of data, does sorting columns benefit your analysis?

Image descriptionFanart of Psyduck

Below are scenarios described by our co-founders where it was immensely useful to leverage sorting:

  • Viewing a user’s first and last page visit: when given user traffic data on a website (like our Mage blog) we can sort by timestamp per user. The first page tells us which page brought them to the site. The last page would say what made them drop off, and hopefully it’s not one of my articles 😛

  • Discern what kind of content users like to consume in the morning: As a newsfeed app, if we sort by timestamp per user, we can gather a list of articles they ❤️ to store in a column. With this data, we can train a model that decides what to show the user in the morning and keep them engaged with the app.

As shown in these examples, sort is rarely used on its own, and it’s most powerful when used with other operations, like to group by user.

Dataset

For this lesson, we’ll be using ThiagoAZen’s Pokemon dataset to visualize the functionality of sort.

However, since only three features are relevant to us, we will drop all but the columns Name, Type1, and Attack. For additional data cleaning, we’ll also remove the duplicates.

1 data = df[["Name", "Type1", "Attack"]]
2 pokedex= data.drop_duplicates(subset=["Name"], 
3 keep="first")
4 pokedex
Enter fullscreen mode Exit fullscreen mode

Image description

That’s a lot of Pokemon to work with! Thankfully, we have some Pokemon who also know sorting algorithms. Maybe he can come with us and help!

Image descriptionSquirtle doing a bubble sort

Sort by

As the name suggests, sorting by ascending values means that values in the column at the beginning are smaller than the values at the end. For alphabetical text entries, this means “A” is the first (ASCII value: 65) and “Z” is the last value (ASCII: 90). For ASCII value look-up, you can refer to this.

Thus, if we wanted the Pokedex in alphabetical order, we sort the “Name” column using:

1 pokedex.sort_values(by=["Name"])
Enter fullscreen mode Exit fullscreen mode

Image description

Now that it’s sorted, in the “Name” column, “Abra” has a smaller value than “Absol,” so “Abra” is placed before.

Image descriptionPokemon images from Bulbapedia

For string comparisons between two strings, what determines which word is positioned first is by the ASCII values of the first non-matching character. When comparing “Abra” with “Absol,” the ASCII values of the first two characters are the same, so we are not yet able to determine which Pokemon appears first in the list.

Thus the only comparison that matters is between the third character of each word– “R” and “S.” Since the ASCII value of “R,” 114, is smaller than S’ 115, “Abra” gets to be positioned before “Absol.”

Sorting by descending values means that the largest values appear at the beginning of the set.

If we sort by descending “Attack”, this means the strongest Pokemon are at the top of the list, so we should be sure to catch ‘em 😉

1 pokedex.sort_values(by=["Attack"], ascending = False)
Enter fullscreen mode Exit fullscreen mode

Image description

Since it’s descending, the Pokemon with the higher attack is placed first. This is the opposite behavior from the above section. Therefore, since the Pokemon Zacian has an attack of 170, which is larger than Rampardos’ 165, Zacian is sorted into a higher position.

Image descriptionImages from Bulbapedia

Arguably, sort is even more useful when we sort by multiple columns. If we first sort by descending value on “Type1,” all the Pokemon are sorted into groups where all the Water-type Pokemon are kept together:

Image description

Thus, when we sort again by “Attack,” the strongest Pokemon appear at the top of a group of types.

1 pokedex.sort_values(by=["Type1", 'Attack'], ascending = False)
Enter fullscreen mode Exit fullscreen mode

Image description

This is useful as we can now find the largest and smallest values, not only across the entire dataset, but within groups as well. Now, you have the knowledge to look out for and capture the strongest Pokemon of each type!

“Abra” - kadabra! ✨🔮

Data cleaning is the most tedious yet important aspect of data analysis. Let Mage handle all the fuss of sorting so you can re-allocate that energy to derive value from your data (or rest because you’ve been working hard).

Image descriptionSource: Blue Bandana Jake

There’s no code involved, but we need you to specify which column needs sorting, and which order. Go to Edit Data > Sort and enter the column and the order you’ll be sorting by.

Since I wanted to match the example above, I sorted by the two columns, “type1” and “attack,” in descending order.

Image description

Abrakadabra! Your data is all sorted with just a couple clicks and no code!

Want to learn more about machine learning (ML)? Visit Mage Academy! ✨🔮

Top comments (0)