DEV Community

Tankala Ashok
Tankala Ashok

Posted on • Originally published at blog.tanka.la on

My Pandas Cheat Sheet for Data Science in Python

Recently I worked most of the time on the data creation part for our model creation at that time I used pandas a lot. I used some functions a lot to complete my task. Let me share them with you I hope these will help you in your Data preparation journey.

You came here to read this article means you already know a bit about pandas and why so many people use pandas for working on Data. So I think I don’t need to give any intro about pandas. But if you ask me to tell me your opinion on pandas in one sentence then anytime I will say this

"pandas" is a great software library/package for data manipulation and analysis.

Don’t forget to install the pandas package. To install the pandas package run below command

pip3 install pandas
Enter fullscreen mode Exit fullscreen mode

You need to import pandas to use so let’s import pandas first

import pandas as pd
Enter fullscreen mode Exit fullscreen mode

We added as pd to import statement of pandas because of that you can use pandas superpowers by simply typing pd instead of typing pandas.

Reading JSON data

Most of the time we don’t have all the data we want. So we do scrapping and store data mostly in JSON. You can load it easily using pandas like this

df = pd.read_json('data.json')
Enter fullscreen mode Exit fullscreen mode

Here data.json looks like this

[
  {
    "_id": "5e91936d617c639d56d6e40a",
    "index": 0,
    "guid": "7fe17eae-c86f-4b99-896a-018636d54f59",
    "isActive": true,
    "balance": "$1,814.13",
    "picture": "http://placehold.it/32x32",
    "age": 33,
    "eyeColor": "blue",
    "name": "Britney Cote",
    "gender": "female",
    "company": "PANZENT",
    "email": "britneycote@panzent.com",
    "phone": "+1 (907) 524-3223",
    "address": "266 Church Lane, Swartzville, Massachusetts, 7719",
    "about": "Dolore aliquip sint ut proident. Ullamco sit eiusmod laborum ullamco sint adipisicing culpa eu laborum adipisicing nisi consequat non culpa. Dolor est fugiat ad dolore minim irure quis. Minim nulla sit ut elit eiusmod dolor.\r\n",
    "registered": "2018-07-24T02:51:54 -06:-30",
    "latitude": 35.545005,
    "longitude": -10.315414,
    "tags": [
      "enim",
      "dolor",
      "fugiat",
      "velit",
      "excepteur",
      "culpa",
      "nisi"
    ],
    "friends": 127,
    "greeting": "Hello, Britney Cote! You have 5 unread messages.",
    "favoriteFruit": "apple"
  },
......
]
Enter fullscreen mode Exit fullscreen mode

Some times we don’t store the JSON data in the proper JSON format. Let’s say we stored like this

{"_id": "5e91991ba38b52e0fb47bb7f","index": 0,"guid": "554903ee-525d-47cb-b09d-bf3cc15aec3d"}
{"_id": "5e91991bfc6a941b23db09b0","index": 1,"guid": "9bdd26f6-d14c-4ab8-bcd3-d42b27599fbd"}
{"_id": "5e91991bfc4766a4b3ea78ea","index": 2,"guid": "a3fee0e1-7b15-4f8b-8901-eaef3ea466c8"}
Enter fullscreen mode Exit fullscreen mode

Then below code will save you

df=pd.read_json('data_not_in_proper_format.json', lines=True)
Enter fullscreen mode Exit fullscreen mode

Here it takes line by line and treats it as one JSON object and loads.

Data Display

If you want to see the data loaded then run below code

df.head()
Enter fullscreen mode Exit fullscreen mode
index _id index guid isActive balance picture age eyeColor name gender phone address about registered latitude longitude tags friends greeting favoriteFruit
0 5e91936d617c639d56d6e40a 0 7fe17eae-c86f-4b99-896a-018636d54f59 True $1,814.13 http://placehold.it/32×32 33 blue Britney Cote female +1 (907) 524-3223 266 Church Lane, Swartzville, Massachusetts, 7719 Dolore aliquip sint ut proident. Ullamco sit e… 2018-07-24T02:51:54 -06:-30 35.545005 -10.315414 [enim, dolor, fugiat, velit, excepteur, culpa,… 127 Hello, Britney Cote! You have 5 unread messages. apple
1 5e91936d0326151699779694 1 b8c92f7e-ca91-4b07-9e49-956ce1f6bebf False $3,610.18 http://placehold.it/32×32 35 brown Miranda Mathis male +1 (952) 533-3545 271 Underhill Avenue, Watchtower, Palau, 7558 Ut ea dolor cupidatat laboris nulla in anim ex… 2014-03-27T08:55:19 -06:-30 16.061777 62.760309 [occaecat, veniam, elit, consequat, ea, eiusmo… 235 Hello, Miranda Mathis! You have 3 unread messa… apple
2 5e91936dac51832738a8d241 2 71627ba1-c0bf-4582-9e16-11a5fa2b7889 False $2,358.30 http://placehold.it/32×32 35 blue Carly Rodgers female +1 (974) 452-3699 116 Forbell Street, Kilbourne, American Samoa,… Culpa tempor voluptate excepteur adipisicing i… 2017-07-26T02:04:48 -06:-30 10.495423 139.383287 [est, sunt, sunt, laborum, sit, pariatur, nisi] 298 Hello, Carly Rodgers! You have 6 unread messages. apple
3 5e91936dc74a890c6f19d933 3 f24ce2af-1e61-47e4-93bb-d8024eb8e581 True $2,457.23 http://placehold.it/32×32 24 green Patsy Mckee female +1 (800) 570-2994 275 Railroad Avenue, Cloverdale, Vermont, 7149 Non duis incididunt laboris tempor aute aliqui… 2015-03-05T07:38:38 -06:-30 89.744027 -120.626924 [veniam, sit, nulla, duis, elit, sunt, consect… 191 Hello, Patsy Mckee! You have 5 unread messages. apple
4 5e91936df9d5b29dec2b1b63 4 013e2904-9e7c-43fe-8bc4-2afc9b8f270c False $2,466.73 http://placehold.it/32×32 20 blue Waters Farley male +1 (921) 580-2595 148 Village Road, Dyckesville, Minnesota, 2327 Aute nulla aute proident ullamco in laborum. A… 2017-01-05T01:40:22 -06:-30 -78.364669 63.065719 [laborum, ipsum, nulla, anim, laborum, minim, ex] 253 Hello, Waters Farley! You have 6 unread messages. apple

Here it is displaying the first 5 rows data. If you want to see only one row data then you need to run this

df.head(1)
Enter fullscreen mode Exit fullscreen mode

If you want to see the last 2 rows data then

df.tail(2)
Enter fullscreen mode Exit fullscreen mode

If there are so many columns like above some of the columns won’t be shown there you will see “…” only. To see all columns below code will come handy

pd.set_option('display.max_columns', None)
Enter fullscreen mode Exit fullscreen mode

At the time of display just like columns displaying many rows also have restrictions If you want to see all rows then below code is here to save you

pd.set_option('display.max_rows', None)
Enter fullscreen mode Exit fullscreen mode

Reading data from MySQL

If you need to read data from the MySQL database and load it in the dataframe then

import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="root",
  passwd="root",
  database="test"
)
Enter fullscreen mode Exit fullscreen mode

First, you need to create a MySQL connection for that we will provide database configuration details.

I use mysql-connector-python package usually. To create a MySQL connection you need to run above code to run it first you need to install this package for that you need to run below command

pip3 install mysql-connector-python
Enter fullscreen mode Exit fullscreen mode

Once the connection is created you just need to provide query rest will be handled by pandas for you

df = pd.read_sql("SELECT * FROM table_name", mydb)
Enter fullscreen mode Exit fullscreen mode

Reading CSV Data

To read data from CSV file

df=pd.read_csv('data.csv', sep=';', header=0)
Enter fullscreen mode Exit fullscreen mode

By default it takes Comma as separator if you want pandas to consider some other separator then you need to mention like this sep=’;’ which you can find in above code.

header = 0 means you have the names of columns in the first row in the file

Writing Data to CSV file

If you want to write the data we have in the data frame to a CSV file then

df.to_csv('data.csv', sep=';', index=False)
Enter fullscreen mode Exit fullscreen mode

By default whenever pandas write data from data frame to CSV file they put row index too. To avoid this index=False is required.

Understanding the Data

To understand what columns have and their types and which have values in every record and which doesn’t

df.info()
Enter fullscreen mode Exit fullscreen mode

You will get output like this

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 22 columns):
 # Column Non-Null Count Dtype  
--- ------ -------------- -----  
 0 _id 19 non-null object 
 1 index 19 non-null int64  
 2 guid 19 non-null object 
 3 isActive 19 non-null bool   
 4 balance 19 non-null object 
 5 picture 19 non-null object 
 6 age 19 non-null int64  
 7 eyeColor 16 non-null object 
 8 name 19 non-null object 
 9 gender 19 non-null object 
 10 company 19 non-null object 
 11 email 19 non-null object 
 12 phone 19 non-null object 
 13 address 19 non-null object 
 14 registered 19 non-null object 
 15 latitude 19 non-null float64
 16 longitude 19 non-null float64
 17 tags 19 non-null object 
 18 friends 19 non-null int64  
 19 greeting 19 non-null object 
 20 favoriteFruit 19 non-null object 
 21 about 17 non-null object 
dtypes: bool(1), float64(2), int64(3), object(16)
memory usage: 3.3+ KB
Enter fullscreen mode Exit fullscreen mode

You want to have only column names nothing other than that then this small code is there to help you

for column in df.columns:
    print(column)
Enter fullscreen mode Exit fullscreen mode

To know unique values in a column

df['eyeColor'].unique()
Enter fullscreen mode Exit fullscreen mode

Output:

array(['blue', nan, 'green', 'brown'], dtype=object)
Enter fullscreen mode Exit fullscreen mode

To get counts of unique values of a column (Like Group By)

df['eyeColor'].value_counts()
Enter fullscreen mode Exit fullscreen mode

Output:

blue 10
green 3
brown 3
Name: eyeColor, dtype: int64
Enter fullscreen mode Exit fullscreen mode

We have 19 rows but if we total above counts then we will get 16 because by default it doesn’t count None values. To get None(or NaN) count also you need to add dropna=False

df['eyeColor'].value_counts(dropna=False)
Enter fullscreen mode Exit fullscreen mode

Output:

blue 10
green 3
brown 3
NaN 3
Name: eyeColor, dtype: int64
Enter fullscreen mode Exit fullscreen mode

To drop unnecessary columns

df = df.drop(['index','guid','picture','email','phone','latitude','longitude','address','registered','tags'], axis='columns')
Enter fullscreen mode Exit fullscreen mode

Whenever we handle operation on column level then axis=’columns’ or axis=1 need to be added

Now the data frame looks like this

index _id isActive balance age eyeColor name gender company friends greeting favoriteFruit about
0 5e91936d617c639d56d6e40a True $1,814.13 33 blue Britney Cote female PANZENT 127 Hello, Britney Cote! You have 5 unread messages. apple NaN
1 5e91936d0326151699779694 False $3,610.18 35 NaN Miranda Mathis male HIVEDOM 235 Hello, Miranda Mathis! You have 3 unread messa… apple Ut ea dolor cupidatat laboris nulla in anim ex…
2 5e91936dac51832738a8d241 False $2,358.30 35 blue Carly Rodgers female BUZZNESS 298 Hello, Carly Rodgers! You have 6 unread messages. apple Culpa tempor voluptate excepteur adipisicing i…
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green Patsy Mckee female VISALIA 191 Hello, Patsy Mckee! You have 5 unread messages. apple Non duis incididunt laboris tempor aute aliqui…
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue Waters Farley male RODEOLOGY 253 Hello, Waters Farley! You have 6 unread messages. apple Aute nulla aute proident ullamco in laborum. A…

Data imputation

Data imputation plays a major role to prepare data for the model. These code snippets helped me a lot in data imputation journey

To change the value of a column based on condition

df.loc[df['_id']=='5e91936d617c639d56d6e40a',['isActive']] = False
Enter fullscreen mode Exit fullscreen mode

Output:

index _id isActive balance age eyeColor name gender company friends greeting favoriteFruit about
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue Britney Cote female PANZENT 127 Hello, Britney Cote! You have 5 unread messages. apple NaN
1 5e91936d0326151699779694 False $3,610.18 35 NaN Miranda Mathis male HIVEDOM 235 Hello, Miranda Mathis! You have 3 unread messa… apple Ut ea dolor cupidatat laboris nulla in anim ex…
2 5e91936dac51832738a8d241 False $2,358.30 35 blue Carly Rodgers female BUZZNESS 298 Hello, Carly Rodgers! You have 6 unread messages. apple Culpa tempor voluptate excepteur adipisicing i…
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green Patsy Mckee female VISALIA 191 Hello, Patsy Mckee! You have 5 unread messages. apple Non duis incididunt laboris tempor aute aliqui…
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue Waters Farley male RODEOLOGY 253 Hello, Waters Farley! You have 6 unread messages. apple Aute nulla aute proident ullamco in laborum. A…
You can observe here isActive column value of the row with id 5e91936d617c639d56d6e40a changed to False from True

Sometimes we get None values there we usually fill with data which is kind of normal. For example, most of the people have brown coloured eyes and some records have None in the eyeColor column. To fill it

df['eyeColor'].fillna('brown', inplace=True)
Enter fullscreen mode Exit fullscreen mode

Output:

index _id isActive balance age eyeColor name gender company friends greeting favoriteFruit about
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue Britney Cote female PANZENT 127 Hello, Britney Cote! You have 5 unread messages. apple NaN
1 5e91936d0326151699779694 False $3,610.18 35 brown Miranda Mathis male HIVEDOM 235 Hello, Miranda Mathis! You have 3 unread messa… apple Ut ea dolor cupidatat laboris nulla in anim ex…
2 5e91936dac51832738a8d241 False $2,358.30 35 blue Carly Rodgers female BUZZNESS 298 Hello, Carly Rodgers! You have 6 unread messages. apple Culpa tempor voluptate excepteur adipisicing i…
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green Patsy Mckee female VISALIA 191 Hello, Patsy Mckee! You have 5 unread messages. apple Non duis incididunt laboris tempor aute aliqui…
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue Waters Farley male RODEOLOGY 253 Hello, Waters Farley! You have 6 unread messages. apple Aute nulla aute proident ullamco in laborum. A…
You can see 2nd row previous value is NaN. Now it got filled by brown

We used inplace=True extra here. Usually, whatever operations we do in data frame it returns a new data frame. Instead of that if you want to do change in current data frame only then we need to use inplace=True

We can do several operations specific to data types. For example, If you want to convert values to uppercase of a column which has string values then pretty simple. Code for this is

df["gender"] = df["gender"].str.upper()
Enter fullscreen mode Exit fullscreen mode

Output:

index _id isActive balance age eyeColor name gender company friends greeting favoriteFruit about
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue Britney Cote FEMALE PANZENT 127 Hello, Britney Cote! You have 5 unread messages. apple NaN
1 5e91936d0326151699779694 False $3,610.18 35 brown Miranda Mathis MALE HIVEDOM 235 Hello, Miranda Mathis! You have 3 unread messa… apple Ut ea dolor cupidatat laboris nulla in anim ex…
2 5e91936dac51832738a8d241 False $2,358.30 35 blue Carly Rodgers FEMALE BUZZNESS 298 Hello, Carly Rodgers! You have 6 unread messages. apple Culpa tempor voluptate excepteur adipisicing i…
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green Patsy Mckee FEMALE VISALIA 191 Hello, Patsy Mckee! You have 5 unread messages. apple Non duis incididunt laboris tempor aute aliqui…
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue Waters Farley MALE RODEOLOGY 253 Hello, Waters Farley! You have 6 unread messages. apple Aute nulla aute proident ullamco in laborum. A…
You can observe now gender values converted to uppercase

We can consider a data frame as a matrix so we can do matrix operations too. For example, let’s say we want to create a new column where we want to calculate friends count with age. To do this

df['friends_to_age']=df['friends']/df['age']
Enter fullscreen mode Exit fullscreen mode

Output:

index _id isActive balance age eyeColor name gender company friends greeting favoriteFruit about friends_to_age
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue Britney Cote FEMALE PANZENT 127 Hello, Britney Cote! You have 5 unread messages. apple NaN 3.848485
1 5e91936d0326151699779694 False $3,610.18 35 brown Miranda Mathis MALE HIVEDOM 235 Hello, Miranda Mathis! You have 3 unread messa… apple Ut ea dolor cupidatat laboris nulla in anim ex… 6.714286
2 5e91936dac51832738a8d241 False $2,358.30 35 blue Carly Rodgers FEMALE BUZZNESS 298 Hello, Carly Rodgers! You have 6 unread messages. apple Culpa tempor voluptate excepteur adipisicing i… 8.514286
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green Patsy Mckee FEMALE VISALIA 191 Hello, Patsy Mckee! You have 5 unread messages. apple Non duis incididunt laboris tempor aute aliqui… 7.958333
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue Waters Farley MALE RODEOLOGY 253 Hello, Waters Farley! You have 6 unread messages. apple Aute nulla aute proident ullamco in laborum. A… 12.650000

You can do all sorts of things on values and write lambda functions too to extract data from values present in a column. For example, If you read the greeting message it contains the number of unread messages count. I don’t think greeting message has any significance except the unread messages count so we can extract and put it in a new column with name unread_messages_count

Output:

index _id isActive balance age eyeColor name gender company friends greeting favoriteFruit about friends_to_age unread_messages_count
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue Britney Cote FEMALE PANZENT 127 Hello, Britney Cote! You have 5 unread messages. apple NaN 3.848485 5
1 5e91936d0326151699779694 False $3,610.18 35 brown Miranda Mathis MALE HIVEDOM 235 Hello, Miranda Mathis! You have 3 unread messa… apple Ut ea dolor cupidatat laboris nulla in anim ex… 6.714286 3
2 5e91936dac51832738a8d241 False $2,358.30 35 blue Carly Rodgers FEMALE BUZZNESS 298 Hello, Carly Rodgers! You have 6 unread messages. apple Culpa tempor voluptate excepteur adipisicing i… 8.514286 6
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green Patsy Mckee FEMALE VISALIA 191 Hello, Patsy Mckee! You have 5 unread messages. apple Non duis incididunt laboris tempor aute aliqui… 7.958333 5
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue Waters Farley MALE RODEOLOGY 253 Hello, Waters Farley! You have 6 unread messages. apple Aute nulla aute proident ullamco in laborum. A… 12.650000 6

Let’s say you want to extract what you want from a column value and put it in different columns that also possible. For example, Let’s split name into first name and last name

def extract_first_last_name(row):
    name = row['name'].split()
    row['first_name'] = name[0]
    row['last_name'] = name[1]
    return row

df = df.apply(lambda row: extract_first_last_name(row), axis='columns')
Enter fullscreen mode Exit fullscreen mode

Output:

index _id isActive balance age eyeColor name gender company friends greeting favoriteFruit about friends_to_age unread_messages_count first_name last_name
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue Britney Cote FEMALE PANZENT 127 Hello, Britney Cote! You have 5 unread messages. apple NaN 3.848485 5 Britney Cote
1 5e91936d0326151699779694 False $3,610.18 35 brown Miranda Mathis MALE HIVEDOM 235 Hello, Miranda Mathis! You have 3 unread messa… apple Ut ea dolor cupidatat laboris nulla in anim ex… 6.714286 3 Miranda Mathis
2 5e91936dac51832738a8d241 False $2,358.30 35 blue Carly Rodgers FEMALE BUZZNESS 298 Hello, Carly Rodgers! You have 6 unread messages. apple Culpa tempor voluptate excepteur adipisicing i… 8.514286 6 Carly Rodgers
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green Patsy Mckee FEMALE VISALIA 191 Hello, Patsy Mckee! You have 5 unread messages. apple Non duis incididunt laboris tempor aute aliqui… 7.958333 5 Patsy Mckee
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue Waters Farley MALE RODEOLOGY 253 Hello, Waters Farley! You have 6 unread messages. apple Aute nulla aute proident ullamco in laborum. A… 12.650000 6 Waters Farley

We did whatever data imputation we need to do and for some columns, we can’t do anything and they don’t have values in some rows and there is no point of them to keep because it’s going to give anyway trouble at the time of model training so we want to drop those columns and it’s data then

df = df.dropna(axis='columns')
Enter fullscreen mode Exit fullscreen mode

Output:

index _id isActive balance age eyeColor name gender company friends greeting favoriteFruit friends_to_age unread_messages_count first_name last_name
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue Britney Cote FEMALE PANZENT 127 Hello, Britney Cote! You have 5 unread messages. apple 3.848485 5 Britney Cote
1 5e91936d0326151699779694 False $3,610.18 35 brown Miranda Mathis MALE HIVEDOM 235 Hello, Miranda Mathis! You have 3 unread messa… apple 6.714286 3 Miranda Mathis
2 5e91936dac51832738a8d241 False $2,358.30 35 blue Carly Rodgers FEMALE BUZZNESS 298 Hello, Carly Rodgers! You have 6 unread messages. apple 8.514286 6 Carly Rodgers
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green Patsy Mckee FEMALE VISALIA 191 Hello, Patsy Mckee! You have 5 unread messages. apple 7.958333 5 Patsy Mckee
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue Waters Farley MALE RODEOLOGY 253 Hello, Waters Farley! You have 6 unread messages. apple 12.650000 6 Waters Farley
You can observe here about column is gone.

Let’s drop some unnecessary columns

df = df.drop(['name','greeting'], axis='columns')
Enter fullscreen mode Exit fullscreen mode

Dropping duplicate rows

To drop duplicate rows

unique_df = df.drop_duplicates()
Enter fullscreen mode Exit fullscreen mode

If you want see what unique_df have

unique_df.head(10)
Enter fullscreen mode Exit fullscreen mode
index _id isActive balance age eyeColor gender company friends favoriteFruit friends_to_age unread_messages_count first_name last_name
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue FEMALE PANZENT 127 apple 3.848485 5 Britney Cote
1 5e91936d0326151699779694 False $3,610.18 35 brown MALE HIVEDOM 235 apple 6.714286 3 Miranda Mathis
2 5e91936dac51832738a8d241 False $2,358.30 35 blue FEMALE BUZZNESS 298 apple 8.514286 6 Carly Rodgers
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green FEMALE VISALIA 191 apple 7.958333 5 Patsy Mckee
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue MALE RODEOLOGY 253 apple 12.650000 6 Waters Farley
5 5e91936dbf8fbe6a1b99f888 False $2,069.59 39 brown FEMALE QUAILCOM 123 banana 3.153846 7 Kelly Monroe
6 5e91936d4eed9784f3515a46 False $1,472.59 24 blue FEMALE ENTROPIX 167 apple 6.958333 8 Chelsea Gregory
unique_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns (total 13 columns):
 # Column Non-Null Count Dtype  
--- ------ -------------- -----  
 0 _id 7 non-null object 
 1 isActive 7 non-null bool   
 2 balance 7 non-null object 
 3 age 7 non-null int64  
 4 eyeColor 7 non-null object 
 5 gender 7 non-null object 
 6 company 7 non-null object 
 7 friends 7 non-null int64  
 8 favoriteFruit 7 non-null object 
 9 friends_to_age 7 non-null float64
 10 unread_messages_count 7 non-null object 
 11 first_name 7 non-null object 
 12 last_name 7 non-null object 
dtypes: bool(1), float64(1), int64(2), object(9)
memory usage: 735.0+ bytes
Enter fullscreen mode Exit fullscreen mode

Now you can observe there are no duplicate rows

Renaming column names

unique_df = unique_df.rename(columns={"unread_messages_count": "unread_count","friends_to_age":"friends_by_age"})
Enter fullscreen mode Exit fullscreen mode

Here we renamed column names from unread_messages_count to unread_count and from friends_to_age to friends_by_age

unique_df.head()
Enter fullscreen mode Exit fullscreen mode
index _id isActive balance age eyeColor gender company friends favoriteFruit friends_by_age unread_count first_name last_name
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue FEMALE PANZENT 127 apple 3.848485 5 Britney Cote
1 5e91936d0326151699779694 False $3,610.18 35 brown MALE HIVEDOM 235 apple 6.714286 3 Miranda Mathis
2 5e91936dac51832738a8d241 False $2,358.30 35 blue FEMALE BUZZNESS 298 apple 8.514286 6 Carly Rodgers
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green FEMALE VISALIA 191 apple 7.958333 5 Patsy Mckee
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue MALE RODEOLOGY 253 apple 12.650000 6 Waters Farley

Data Filtering and sorting

We can do all sorts of conditions and filter the data what we need too. For example, you want to see the rows which users have age more than 30

unique_df.loc[unique_df['age'] > 30]
Enter fullscreen mode Exit fullscreen mode

Output:

index _id isActive balance age eyeColor gender company friends favoriteFruit friends_by_age unread_count first_name last_name
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue FEMALE PANZENT 127 apple 3.848485 5 Britney Cote
1 5e91936d0326151699779694 False $3,610.18 35 brown MALE HIVEDOM 235 apple 6.714286 3 Miranda Mathis
2 5e91936dac51832738a8d241 False $2,358.30 35 blue FEMALE BUZZNESS 298 apple 8.514286 6 Carly Rodgers
5 5e91936dbf8fbe6a1b99f888 False $2,069.59 39 brown FEMALE QUAILCOM 123 banana 3.153846 7 Kelly Monroe

We can put multiple conditions on a single column too

unique_df.loc[unique_df['eyeColor'].isin(['brown','green'])]
Enter fullscreen mode Exit fullscreen mode

Output:

index _id isActive balance age eyeColor gender company friends favoriteFruit friends_by_age unread_count first_name last_name
1 5e91936d0326151699779694 False $3,610.18 35 brown MALE HIVEDOM 235 apple 6.714286 3 Miranda Mathis
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green FEMALE VISALIA 191 apple 7.958333 5 Patsy Mckee
5 5e91936dbf8fbe6a1b99f888 False $2,069.59 39 brown FEMALE QUAILCOM 123 banana 3.153846 7 Kelly Monroe

Even we can use predicate functions related to data types like this

unique_df.loc[unique_df['eyeColor'].str.contains('b')]
Enter fullscreen mode Exit fullscreen mode

Output:

index _id isActive balance age eyeColor gender company friends favoriteFruit friends_by_age unread_count first_name last_name
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue FEMALE PANZENT 127 apple 3.848485 5 Britney Cote
1 5e91936d0326151699779694 False $3,610.18 35 brown MALE HIVEDOM 235 apple 6.714286 3 Miranda Mathis
2 5e91936dac51832738a8d241 False $2,358.30 35 blue FEMALE BUZZNESS 298 apple 8.514286 6 Carly Rodgers
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue MALE RODEOLOGY 253 apple 12.650000 6 Waters Farley
5 5e91936dbf8fbe6a1b99f888 False $2,069.59 39 brown FEMALE QUAILCOM 123 banana 3.153846 7 Kelly Monroe
6 5e91936d4eed9784f3515a46 False $1,472.59 24 blue FEMALE ENTROPIX 167 apple 6.958333 8 Chelsea Gregory
4th row is gone because it have green as a value in eyeColor column

We can sort the data based on values of a column

unique_df.sort_values(by=['first_name'])
Enter fullscreen mode Exit fullscreen mode

Output:

index _id isActive balance age eyeColor gender company friends favoriteFruit friends_by_age unread_count first_name last_name
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue FEMALE PANZENT 127 apple 3.848485 5 Britney Cote
2 5e91936dac51832738a8d241 False $2,358.30 35 blue FEMALE BUZZNESS 298 apple 8.514286 6 Carly Rodgers
6 5e91936d4eed9784f3515a46 False $1,472.59 24 blue FEMALE ENTROPIX 167 apple 6.958333 8 Chelsea Gregory
5 5e91936dbf8fbe6a1b99f888 False $2,069.59 39 brown FEMALE QUAILCOM 123 banana 3.153846 7 Kelly Monroe
1 5e91936d0326151699779694 False $3,610.18 35 brown MALE HIVEDOM 235 apple 6.714286 3 Miranda Mathis
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green FEMALE VISALIA 191 apple 7.958333 5 Patsy Mckee
4 5e91936df9d5b29dec2b1b63 False $2,466.73 20 blue MALE RODEOLOGY 253 apple 12.650000 6 Waters Farley
Sorted based on(order by) first name

Creating unique data set with specific fields

Some times we need to see specific columns data only and those too unique ones so that we can do these stuff using group by operations.We have duplicate values in df and let’s say we want to know age and balance per user

df.groupby('_id').agg({'age':'min', 'balance':'min'})
Enter fullscreen mode Exit fullscreen mode

Output:

age balance
_id
--- --- ---
5e91936d0326151699779694 35 $3,610.18
5e91936d4eed9784f3515a46 24 $1,472.59
5e91936d617c639d56d6e40a 33 $1,814.13
5e91936dac51832738a8d241 35 $2,358.30
5e91936dbf8fbe6a1b99f888 39 $2,069.59
5e91936dc74a890c6f19d933 24 $2,457.23
5e91936df9d5b29dec2b1b63 20 $2,466.73

You can observe that age and balance in the header at a different level. To solve this problem reset_index function is there to help you

df.groupby('_id').agg({'age':'min', 'balance':'min'}).reset_index()
Enter fullscreen mode Exit fullscreen mode
index _id age balance
0 5e91936d0326151699779694 35 $3,610.18
1 5e91936d4eed9784f3515a46 24 $1,472.59
2 5e91936d617c639d56d6e40a 33 $1,814.13
3 5e91936dac51832738a8d241 35 $2,358.30
4 5e91936dbf8fbe6a1b99f888 39 $2,069.59
5 5e91936dc74a890c6f19d933 24 $2,457.23
6 5e91936df9d5b29dec2b1b63 20 $2,466.73

Merging 2 data frames

Some times we have data in a different table or different CSV. So we put that in a different data frame, to merge this to another data frame we need to use merge function.

To explain how we can do let’s create a data frame. It has _id and favourite food.

food_dict = [{"_id":"5e91936d0326151699779694", "favouriteFood":"Pasta"},
{"_id":"5e91936d4eed9784f3515a46", "favouriteFood":"Hamburgers"},
{"_id":"5e91936d617c639d56d6e40a", "favouriteFood":"Pizza"},
{"_id":"5e91936dac51832738a8d241", "favouriteFood":"Vegetables"},
{"_id":"5e91936dbf8fbe6a1b99f888", "favouriteFood":"Chicken"},
{"_id":"5e91936dc74a890c6f19d933", "favouriteFood":"Pasta"}
]

df_food = pd.DataFrame(food_dict)

df_food.head()
Enter fullscreen mode Exit fullscreen mode
index _id favouriteFood
0 5e91936d0326151699779694 Pasta
1 5e91936d4eed9784f3515a46 Hamburgers
2 5e91936d617c639d56d6e40a Pizza
3 5e91936dac51832738a8d241 Vegetables
4 5e91936dbf8fbe6a1b99f888 Chicken

To merge df_food and unique_df let’s do an inner join

unique_df = unique_df.merge(df_food, on="_id", how = 'inner')

unique_df.head(10)
Enter fullscreen mode Exit fullscreen mode
index _id isActive balance age eyeColor gender company friends favoriteFruit friends_by_age unread_count first_name last_name favouriteFood
0 5e91936d617c639d56d6e40a False $1,814.13 33 blue FEMALE PANZENT 127 apple 3.848485 5 Britney Cote Pizza
1 5e91936d0326151699779694 False $3,610.18 35 brown MALE HIVEDOM 235 apple 6.714286 3 Miranda Mathis Pasta
2 5e91936dac51832738a8d241 False $2,358.30 35 blue FEMALE BUZZNESS 298 apple 8.514286 6 Carly Rodgers Vegetables
3 5e91936dc74a890c6f19d933 True $2,457.23 24 green FEMALE VISALIA 191 apple 7.958333 5 Patsy Mckee Pasta
4 5e91936dbf8fbe6a1b99f888 False $2,069.59 39 brown FEMALE QUAILCOM 123 banana 3.153846 7 Kelly Monroe Chicken
5 5e91936d4eed9784f3515a46 False $1,472.59 24 blue FEMALE ENTROPIX 167 apple 6.958333 8 Chelsea Gregory Hamburgers

These are the ones helped me in my project. I hope the above code snippets and examples help you too. This is my cheat sheet. I am putting all these code snippets in a notebook and putting into my GitHub Repository. You can find it here.

Peace. Happy Coding.

Top comments (0)