DEV Community: Ruth

Pandas: Importing Data, Indexing, Comparisons and Selectors (featuring adoptable dog data)

Ruth — Sun, 14 Nov 2021 16:44:48 +0000

For the second Python Pandas guide we will be reviewing how to import data, as well as taking a deeper dive into indexing, comparison statements and selecting subsets of data. For this we will be using a dataset with information about adoptable dogs from Kaggle. Kaggle has loads of datasets that you can easily download and use for projects.

Importing data from a csv

To start we need to import our downloaded data into a DataFrame. It's pretty simple to load the data from our downloads folder when working locally using the pd.read_csv function.

import pandas as pd

dogs = pd.read_csv(r'Downloads/ShelterDogs.csv')
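If you don't have the file downloaded yet, here is a minimal sketch that reads a tiny in-memory CSV instead. The column names and rows are made up for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# A tiny stand-in for ShelterDogs.csv; these rows are invented for illustration
csv_text = """name,age,sex,breed
Rex,3.5,male,Unknown Mix
Bella,1.0,female,Labrador Retriever Mix
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.head())
```

Swapping io.StringIO(csv_text) for a file path gives the same kind of DataFrame from a real file.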

Once we have imported the data we want to inspect the DataFrame and make sure it contains all of the information that we need and in the correct format. We can use .head() to view the first 5 rows.

dogs.head()

In order to find out more about the data types of each column, we can use .info(), which displays the column name, the count of non-null values in that column and the data type of each column.

dogs.info()

Data Types

int - whole numbers
float - decimal numbers
object - usually a string (text)

Changing data types

There may be an instance where we want to change a data type, in order to perform some operations, or to make them easier to view.

To change a string to a number you can use the Pandas function pd.to_numeric, and to change a number into a string you can use .astype(str).

Within our DataFrame we can also change the date column from a string to an actual date format, using the pd.to_datetime function.

dogs['date_found'] = pd.to_datetime(dogs['date_found'])
dogs
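To see the three conversions side by side, here is a small sketch on a made-up DataFrame. The values are invented, and errors="coerce" is my own choice here so that text that can't be parsed becomes NaN instead of raising an error.

```python
import pandas as pd

# Invented example values; the real dataset's columns may differ
df = pd.DataFrame({
    "age": ["2", "4.5", "unknown"],
    "date_found": ["2019-12-10", "2019-12-01", "2019-11-25"],
    "ID": [23807, 533, 23793],
})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Numbers to strings
df["ID"] = df["ID"].astype(str)

# Strings to datetime values
df["date_found"] = pd.to_datetime(df["date_found"])

print(df.dtypes)
```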

Finally, .shape will show us the number of rows and then the number of columns.

dogs.shape
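As .shape is a tuple, it can be unpacked into two variables. A quick sketch with a made-up DataFrame:

```python
import pandas as pd

# Invented data: 3 rows and 2 columns
small = pd.DataFrame({"name": ["Rex", "Bella", "Milo"], "age": [3, 1, 7]})

# .shape is an attribute (no brackets) holding (rows, columns)
rows, columns = small.shape
print(rows, columns)
```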

Selecting multiple columns

To select two or more columns from a DataFrame, we can use a list of the column names between a double set of square brackets. This will provide us a subset of the data, and create a new DataFrame containing only this information, leaving our original DataFrame untouched.

type_breed = dogs[['name', 'sex', 'breed']]
type_breed.head()

Selecting one row using .iloc

We can use .iloc, which is integer based, to select and display a single row by specifying its positional index. Because counting starts at 0, dogs.iloc[1] returns the second row.

dogs.iloc[1]

iloc indexing

In some cases we might want to select only certain rows. We can do this with .iloc, which selects rows by their integer position and lets us pick several rows at once. This is similar to how we slice a list, using the : operator and square brackets.

dogs.iloc[3:7]

This will display the rows at positions 3 to 6, as the end of the slice is not included.

The : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values.

iloc[:10] would select the first 10 rows, i.e. positions 0 to 9; the row at position 10 is not included.

dogs.iloc[:10]

We can also use negative indexing, for example iloc[-2:] will display the last 2 rows.

dogs.iloc[-2:]
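The slicing rules are easier to check on a tiny table. This sketch uses a made-up one-column DataFrame of letters, and also shows how a second position after the comma selects columns instead of rows.

```python
import pandas as pd

# Eight invented rows, at positions 0 to 7
letters = pd.DataFrame({"letter": list("abcdefgh")})

single_row = letters.iloc[1]      # the row at position 1, returned as a Series
middle = letters.iloc[3:7]        # positions 3, 4, 5 and 6
first_three = letters.iloc[:3]    # positions 0, 1 and 2
last_two = letters.iloc[-2:]      # the final two rows
last_col = letters.iloc[:, -1]    # all rows, last column only
```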

Renaming columns

When we get data from different sources, we may need to rename the columns so they are easier to read or call from.

There are a couple of methods to do this, depending on which column names we want to change.

.columns will allow us to change all of the columns at once. However, it’s important to get the ordering right to avoid mislabeling them.
.rename will allow us to change individual columns. You can pass a single column to change or multiple columns, using a dictionary with the original name and the new name.

dogs.rename(columns={'posted': 'date_posted',
                     'breed': 'dog_breed',
                     'adoptable_from': 'date_adoptable',
                     'coat': 'coat_type'},
            inplace=True)

dogs.head()
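Here is a small sketch of both renaming approaches on a made-up two-column DataFrame. Without inplace=True, .rename leaves the original untouched and returns a renamed copy.

```python
import pandas as pd

# Invented one-row example with two of the original column names
pets = pd.DataFrame({"posted": ["2019-12-10"], "breed": ["Unknown Mix"]})

# .rename returns a new DataFrame unless inplace=True is passed
renamed = pets.rename(columns={"posted": "date_posted"})

# Assigning to .columns replaces every name, so the order must match
relabelled = pets.copy()
relabelled.columns = ["date_posted", "dog_breed"]
```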

Selecting with comparison statements

We can also select a subset of data using comparison statements.

Operator	Purpose
==	Equal to
!=	Not Equal
>	Greater than
<	Less than
>=	Greater than or Equal to
<=	Less than or Equal to

Equal to

For example, if we wanted to only display female dogs from the dataset, we would use the double equal operator and define the column that we are selecting from.

female = dogs[dogs.sex == 'female']
female

Not Equal to

If we wanted to exclude all dogs which are an unknown mix breed, we could use the not equal to operator to select all rows except ones that mention Unknown Mix.

breeds = dogs[dogs.dog_breed != 'Unknown Mix']
breeds

Greater than

As we have numbers in our dataset we can use the greater than operator to select any rows which contain a number higher than the one we give it. We just need to specify which column we are selecting the data from, in this case, it's the age column.

over_two = dogs[dogs.age > 2]
over_two

Less than

Similarly, we can use the less than operator to select all of the dogs which have an age less than 4.

under_four = dogs[dogs.age < 4]
under_four

Multiple statements

It is also possible to use multiple comparisons in a single selection. If we want the final dataset to contain only rows that meet both conditions, we join them with the ampersand, &, to signify an and rule. Each condition needs its own set of brackets.

likes_people = dogs[(dogs.likes_people == 'yes') & (dogs.likes_children == 'yes')]
likes_people

In addition, we can use the pipe, |, to define an or rule, where we want our final dataset to display either of the data points we have selected.

cats_female = dogs[(dogs.get_along_females == 'yes') | (dogs.get_along_cats == 'yes')]
cats_female

Mixed statements

We can also mix different comparison operators within one statement. For example, if we want any dogs which are neutered and aged 4 or older, we can do this:

suitable = dogs[(dogs.neutered == 'yes') & (dogs.age >= 4)]
suitable
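It can help to see what the comparisons produce on their own. In this sketch (with invented dogs) each comparison is stored as a boolean Series first and then combined. The ~ operator, which flips True and False, is an extra that isn't covered above.

```python
import pandas as pd

# Invented example rows
sample = pd.DataFrame({
    "name": ["Rex", "Bella", "Milo", "Luna"],
    "age": [5, 1, 4, 8],
    "neutered": ["yes", "yes", "no", "yes"],
})

# Each comparison produces a boolean Series, one True/False per row
is_neutered = sample["neutered"] == "yes"
is_four_plus = sample["age"] >= 4

both = sample[is_neutered & is_four_plus]    # and: both must be True
either = sample[is_neutered | is_four_plus]  # or: at least one is True
not_neutered = sample[~is_neutered]          # not: flips True and False
```

When the comparisons are written inline, the brackets around each one matter because & and | are evaluated before == and >= in Python.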

Built in selectors

Pandas also comes with some conditional selectors that are built in and can be used in a similar way to logical statements.

isin selects rows where the value matches one of the items in the list you provide
isnull selects rows where the value in the chosen column is null (displayed as NaN)
notnull selects rows that have a value, i.e. everything that is not null

isin

breed = dogs[(dogs.dog_breed.isin(['Staffordshire Terrier Mix', 'Labrador Retriever Mix', 'German Shepherd Dog Mix']))]
breed

Once we have selected a subset of data, the rows keep their original index labels, so the index will have gaps. If we want a clean index that runs from 0 again, we can reset it using .reset_index(). By default this keeps the old index as a new column called index.

breed = dogs[(dogs.dog_breed.isin(['Staffordshire Terrier Mix', 'Labrador Retriever Mix', 'German Shepherd Dog Mix']))].reset_index()
breed
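A small sketch of what happens to the index, using invented breeds. The drop=True argument is an option of .reset_index() that throws the old index away instead of keeping it as a column.

```python
import pandas as pd

# Invented example rows
sample = pd.DataFrame({"dog_breed": ["Unknown Mix", "Labrador Retriever Mix",
                                     "Unknown Mix", "German Shepherd Dog Mix"]})
wanted = ["Labrador Retriever Mix", "German Shepherd Dog Mix"]

subset = sample[sample["dog_breed"].isin(wanted)]
print(list(subset.index))  # original labels are kept: [1, 3]

kept = subset.reset_index()              # old index saved in a column named "index"
dropped = subset.reset_index(drop=True)  # old index discarded
```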

notnull

neutered_known = dogs.loc[dogs.neutered.notnull()]
neutered_known

isnull

neutered_unknown = dogs.loc[dogs.neutered.isnull()]
neutered_unknown
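Before filtering, it can be handy to know how many values are missing in each column. A minimal sketch with invented rows, where None stands in for a missing value:

```python
import pandas as pd

# Invented example rows; None is treated as a null value
sample = pd.DataFrame({"name": ["Rex", "Bella", "Milo"],
                       "neutered": ["yes", None, "no"]})

known = sample.loc[sample["neutered"].notnull()]
unknown = sample.loc[sample["neutered"].isnull()]

# isnull on the whole DataFrame, then sum, counts the nulls per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```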

Dealing with null data

It's likely that you will come across null data when dealing with big datasets, and this can cause issues when doing any selecting or mathematical operations. It can also sometimes make the tables look messy, hard to review and can skew final results.

There is an easy way to replace null values using the Pandas method .fillna(). Within the brackets you will need to add an argument that states the replacement number or text. This will then replace all of the NaN fields. Note that .fillna() returns a new DataFrame, so assign the result to a variable if you want to keep it.

dogs.fillna("not available")
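One replacement value may not suit every column, so here is a sketch that passes a dictionary to give each column its own value. The data and the replacement values are invented for illustration.

```python
import pandas as pd

# Invented example with one missing value in each column
sample = pd.DataFrame({"neutered": ["yes", None], "age": [2.0, None]})

# fillna returns a new DataFrame, so the result is stored in a variable
# A dictionary gives each column its own replacement value
filled = sample.fillna({"neutered": "not available", "age": 0})
print(filled)
```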

I hope this has been a helpful (and cute) way to understand more of the Pandas library in Python. I'm looking forward to making a few more posts in this series :)

A notebook to download and play around with can be found here.

Pandas: Creating, Modifying and Inspecting DataFrames (featuring data from Squid Game)

Ruth — Sun, 31 Oct 2021 19:07:06 +0000

Inspired by @codechips' SQL Squid Game guide, I thought it would be fun to use some Squid Game data to write a guide on the basics of the Pandas library in Python including creating a DataFrame, modifying rows and columns and inspecting the data.

Importing Pandas

The first step is to import the Pandas library and alias it as pd. This means we don't have to type the full word when we run a Pandas function, we just need to type pd.

import pandas as pd

The main data structure of Pandas is a DataFrame, which is similar to an Excel spreadsheet, where we can store data. Once the data is within a DataFrame, there are several ways it can be used, for example for data analysis or visualisation.

Introducing DataFrames

DataFrames store data in rows and columns. Each column has a name, which is a string, and each row has an index, which is typically an integer. Like lists in Python, DataFrames also use 0 indexing, which means the first row is at index 0 instead of index 1. However, you can set the index to include extra information about what the row contains if you want.

DataFrames can contain many different data types, including strings, integers and floats.

You can create a DataFrame by loading data from a csv file, but you can also create one by typing values into lists and using a dictionary to turn them into a DataFrame. You can use multiple different lists, containing different data types, but the number of values in each list must be the same.

data = {'Name': ['Oh Il-Nam', 'Kang Sae-byeok', 'Jang Deok-su', 'Abdul Ali', 'Han Mi-nyeo', 'Cho Sang-woo', 'Ji-yeong'],
        'Number': [1, 67, 101, 199, 212, 218, 240]}

# The index labels match each player's number
df = pd.DataFrame(data, index=["player1", "player67", "player101", "player199", "player212", "player218",
                               "player240"])

As you can see above we have one list with the players' names, typed as strings, and another with their numbers, which are integers.

We then transform this dictionary of lists to a DataFrame using the pd.DataFrame() command. We also set the index values, by passing the string names as a list argument.

Inspecting a DataFrame

Once the data has been added, we want to make sure it looks correct and contains everything we want. The best method to do this is to use df.head(). This will show the first 5 rows of data. However, if you would like to see more data, e.g. 10 rows, you can pass this as an argument within the brackets.

df.head(10)

If we want to only view the data from one row, we can use the pandas-specific access method .loc. This is a label based method, meaning we have to specify the index label of the row we want to view. Labels are often strings, as they are here, but they can also be other types such as integers. Here we pass in the index name that we specified when creating the DataFrame.

print(df.loc["player67"])

The other access method is .iloc, which is integer based: we specify the positional index of the row we want to view. We would use this one if we hadn't changed the index, or if we wanted to pick a row by its position regardless of its label.
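To compare the two access methods, here is a sketch using a cut-down, two-player version of the DataFrame. The players variable name is my own, to avoid overwriting df.

```python
import pandas as pd

# A two-row version of the players table, with string index labels
players = pd.DataFrame(
    {"Name": ["Oh Il-Nam", "Kang Sae-byeok"], "Number": [1, 67]},
    index=["player1", "player67"],
)

by_label = players.loc["player67"]  # look the row up by its index label
by_position = players.iloc[1]       # look the same row up by its position

# .loc can also take a column label to return a single value
first_number = players.loc["player1", "Number"]
```

Both lookups return the same row here, because player67 is the label of the row at position 1.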

We can also get a count of how many rows are within the DataFrame, by using the count() function. This will show the number of non-null values in each column.

df.count()

If we only want a single number for the row count, we can pass in the name of the column we want to count within square brackets before the count function. Storing this within a variable will also enable us to use this number in the future.

player_count = df['Name'].count()
player_count
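Since count() skips missing values, it can differ from the real number of rows. A quick sketch with one invented missing name, using Python's built-in len() for the total:

```python
import pandas as pd

# Invented example where one name is missing
players = pd.DataFrame({"Name": ["Oh Il-Nam", None, "Abdul Ali"],
                        "Number": [1, 67, 199]})

non_missing_names = players["Name"].count()  # the missing name is skipped
total_rows = len(players)                    # every row is counted

print(non_missing_names, total_rows)
```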

Adding a new row

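There are a couple of different ways to add new rows to a DataFrame. If we have multiple rows within another DataFrame, we can use the .append() method to add several rows to the end of our existing DataFrame. Note that .append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions use pd.concat() instead.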
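On recent versions of pandas, joining two DataFrames is done with pd.concat(). A minimal sketch, where the variable names and the second new player are my own additions for illustration:

```python
import pandas as pd

existing = pd.DataFrame({"Name": ["Oh Il-Nam"], "Number": [1]},
                        index=["player1"])

# Rows to add; these two players are just example data
new_players = pd.DataFrame(
    {"Name": ["Seong Gi-hun", "Hwang Jun-ho"], "Number": [456, 29]},
    index=["player456", "player29"],
)

# concat stacks the DataFrames in a list on top of each other and returns a new one
combined = pd.concat([existing, new_players])
print(combined)
```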

However, if we have just 1 row to add, i.e only 1 new player to add, we can again use the .loc method to add the row to the end of our original DataFrame. As we are defining the name of our indexes, we need to pass this in before the new values that will be within the row. The number of values contained must match the number of columns we have and be in the correct order.

df.loc['player456'] = ['Seong Gi-hun', 456]

Now if we print the head() of the new DataFrame, we will see our new row added in 🧑‍🦰

Adding a new column

In addition to adding a new row, we can also add a new column. There are several ways you can define the value that will be contained within this column for each row, including basing it on what is within other columns, or using a conditional statement or lambda function. However, in this case we want the value to be the same for every row, to show that every player is currently playing the game.

For this we just need to add the name of our new column and assign it the value that we want to add.

df['Status'] = "Playing"
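For comparison, here is a sketch of the other approaches mentioned: a column built from a condition and one built with a lambda function. The HighNumber and Label columns are invented purely for illustration.

```python
import pandas as pd

players = pd.DataFrame({"Name": ["Oh Il-Nam", "Kang Sae-byeok", "Seong Gi-hun"],
                        "Number": [1, 67, 456]})

# A single value is repeated for every row
players["Status"] = "Playing"

# A condition on another column gives True/False per row
players["HighNumber"] = players["Number"] > 100

# apply runs the lambda function once for each value in the column
players["Label"] = players["Number"].apply(lambda n: f"player{n}")
```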

Changing values

Now we have a column with the status of all players that are playing the game. But what happens when we start to have eliminations? We need to update the value in that column 😬

Again, we will use the .loc method to define the row that we will be amending and the column that we will be changing the value of.

In this case the selection will be based on the player's number, so we pass in the Number column and use the double equals operator to make sure only the row with that number is changed. Next we pass in the name of the column that we are amending, before assigning the new value of Eliminated 😢

df.loc[df.Number == 1, "Status"] = "Eliminated"

Deleting a row

Once players start being eliminated, they will be removed from the game, therefore we will want to ensure we have an updated DataFrame that reflects this and only contains those who are currently playing.

Because we want to keep track of all of the players in the original DataFrame, we don't want to delete anything from this one. However, we can use the data from our original DataFrame to make another one while leaving the original untouched. We can do this by defining the name of our new DataFrame and applying some logic to pull data from our original one.

In this case, .drop() allows us to drop (aka delete, but I guess in the instance of game 5, literally drop) the eliminated players. As we have changed their status to Eliminated, we can use this value to delete them all.

round5 = df.drop(df.index[df['Status'] == "Eliminated"])

It is also possible to drop rows based on their index label, but then each label has to be listed explicitly, rather than selecting them all with one condition.

Now we can view the new DataFrame, and see our remaining players for the game.

round5.head()
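The same result can also be reached by keeping the rows we want instead of dropping the ones we don't. This sketch, on a three-player version of the data, shows both routes plus dropping by a list of index labels.

```python
import pandas as pd

players = pd.DataFrame(
    {"Name": ["Oh Il-Nam", "Kang Sae-byeok", "Abdul Ali"],
     "Status": ["Eliminated", "Playing", "Playing"]},
    index=["player1", "player67", "player199"],
)

# Drop the rows whose index labels match the condition
dropped = players.drop(players.index[players["Status"] == "Eliminated"])

# Or keep only the rows that are not eliminated
filtered = players[players["Status"] != "Eliminated"]

# drop also accepts a list of index labels
by_label = players.drop(["player1", "player199"])
```

In all three cases the original players DataFrame is left unchanged.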

Using the data

The final thing we can do is use the count function mentioned earlier, this time combined with a conditional statement, to extract a count of the rows that match a condition. For example, we can count the players who have a status of Playing, and then those who have a status of Eliminated.

eliminated = df[df['Status'] == "Eliminated"]['Name'].count()
playing = df[df['Status'] == "Playing"]['Name'].count()

Assigning these values to variables means we can also use these counts to create simple sentences using f-strings. The variables are not updated automatically, so we need to rerun the count lines after every game when players are eliminated.

print(f"there are currently {playing} players playing")
print(f"there are currently {eliminated} players eliminated")

And not to forget a wonderful maths function, we can use the number of eliminated players and take this from the total number of players who started to get a count for the current players in the game.

total_players = 8
current = total_players - eliminated
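As an alternative sketch, value_counts() returns the count for every status in one go, and len() avoids typing the total by hand. The three-row data here is invented for illustration.

```python
import pandas as pd

players = pd.DataFrame({"Status": ["Eliminated", "Playing", "Playing"]})

# One count per distinct value in the column
counts = players["Status"].value_counts()

# .get returns 0 if a status does not appear at all
playing = counts.get("Playing", 0)
eliminated = counts.get("Eliminated", 0)

# The total comes from the DataFrame itself instead of a hard-coded number
total_players = len(players)
current = total_players - eliminated

print(f"there are currently {current} players playing")
```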

I hope this has been a helpful (and fun) way to understand the basics of the Pandas library in Python. I'm also hoping to make more posts that go further in detail covering logical statements, merging and visualisations in Pandas.

A quick cheatsheet of these functions can be found here and a notebook to download and play around with can be found here.