Tankala Ashok

Posted on Apr 19, 2020 • Originally published at blog.tanka.la on Apr 19, 2020

My Pandas Cheat Sheet for Data Science in Python

#datascience #pandas #pandascheatsheet #pandasreadcsv

Recently I worked most of the time on the data creation part for our model creation at that time I used pandas a lot. I used some functions a lot to complete my task. Let me share them with you I hope these will help you in your Data preparation journey.

You came here to read this article means you already know a bit about pandas and why so many people use pandas for working on Data. So I think I don’t need to give any intro about pandas. But if you ask me to tell me your opinion on pandas in one sentence then anytime I will say this

"pandas" is a great software library/package for data manipulation and analysis.

Don’t forget to install the pandas package. To install the pandas package run below command

pip3 install pandas

You need to import pandas to use so let’s import pandas first

import pandas as pd

We added as pd to import statement of pandas because of that you can use pandas superpowers by simply typing pd instead of typing pandas.

Reading JSON data

Most of the time we don’t have all the data we want. So we do scrapping and store data mostly in JSON. You can load it easily using pandas like this

df = pd.read_json('data.json')

Here data.json looks like this

[
  {
    "_id": "5e91936d617c639d56d6e40a",
    "index": 0,
    "guid": "7fe17eae-c86f-4b99-896a-018636d54f59",
    "isActive": true,
    "balance": "$1,814.13",
    "picture": "http://placehold.it/32x32",
    "age": 33,
    "eyeColor": "blue",
    "name": "Britney Cote",
    "gender": "female",
    "company": "PANZENT",
    "email": "britneycote@panzent.com",
    "phone": "+1 (907) 524-3223",
    "address": "266 Church Lane, Swartzville, Massachusetts, 7719",
    "about": "Dolore aliquip sint ut proident. Ullamco sit eiusmod laborum ullamco sint adipisicing culpa eu laborum adipisicing nisi consequat non culpa. Dolor est fugiat ad dolore minim irure quis. Minim nulla sit ut elit eiusmod dolor.\r\n",
    "registered": "2018-07-24T02:51:54 -06:-30",
    "latitude": 35.545005,
    "longitude": -10.315414,
    "tags": [
      "enim",
      "dolor",
      "fugiat",
      "velit",
      "excepteur",
      "culpa",
      "nisi"
    ],
    "friends": 127,
    "greeting": "Hello, Britney Cote! You have 5 unread messages.",
    "favoriteFruit": "apple"
  },
......
]

Some times we don’t store the JSON data in the proper JSON format. Let’s say we stored like this

{"_id": "5e91991ba38b52e0fb47bb7f","index": 0,"guid": "554903ee-525d-47cb-b09d-bf3cc15aec3d"}
{"_id": "5e91991bfc6a941b23db09b0","index": 1,"guid": "9bdd26f6-d14c-4ab8-bcd3-d42b27599fbd"}
{"_id": "5e91991bfc4766a4b3ea78ea","index": 2,"guid": "a3fee0e1-7b15-4f8b-8901-eaef3ea466c8"}

Then below code will save you

df=pd.read_json('data_not_in_proper_format.json', lines=True)

Here it takes line by line and treats it as one JSON object and loads.

Data Display

If you want to see the data loaded then run below code

df.head()

index	_id	index	guid	isActive	balance	picture	age	eyeColor	name	gender	…	phone	address	about	registered	latitude	longitude	tags	friends	greeting	favoriteFruit
0	5e91936d617c639d56d6e40a	0	7fe17eae-c86f-4b99-896a-018636d54f59	True	$1,814.13	http://placehold.it/32×32	33	blue	Britney Cote	female	…	+1 (907) 524-3223	266 Church Lane, Swartzville, Massachusetts, 7719	Dolore aliquip sint ut proident. Ullamco sit e…	2018-07-24T02:51:54 -06:-30	35.545005	-10.315414	[enim, dolor, fugiat, velit, excepteur, culpa,…	127	Hello, Britney Cote! You have 5 unread messages.	apple
1	5e91936d0326151699779694	1	b8c92f7e-ca91-4b07-9e49-956ce1f6bebf	False	$3,610.18	http://placehold.it/32×32	35	brown	Miranda Mathis	male	…	+1 (952) 533-3545	271 Underhill Avenue, Watchtower, Palau, 7558	Ut ea dolor cupidatat laboris nulla in anim ex…	2014-03-27T08:55:19 -06:-30	16.061777	62.760309	[occaecat, veniam, elit, consequat, ea, eiusmo…	235	Hello, Miranda Mathis! You have 3 unread messa…	apple
2	5e91936dac51832738a8d241	2	71627ba1-c0bf-4582-9e16-11a5fa2b7889	False	$2,358.30	http://placehold.it/32×32	35	blue	Carly Rodgers	female	…	+1 (974) 452-3699	116 Forbell Street, Kilbourne, American Samoa,…	Culpa tempor voluptate excepteur adipisicing i…	2017-07-26T02:04:48 -06:-30	10.495423	139.383287	[est, sunt, sunt, laborum, sit, pariatur, nisi]	298	Hello, Carly Rodgers! You have 6 unread messages.	apple
3	5e91936dc74a890c6f19d933	3	f24ce2af-1e61-47e4-93bb-d8024eb8e581	True	$2,457.23	http://placehold.it/32×32	24	green	Patsy Mckee	female	…	+1 (800) 570-2994	275 Railroad Avenue, Cloverdale, Vermont, 7149	Non duis incididunt laboris tempor aute aliqui…	2015-03-05T07:38:38 -06:-30	89.744027	-120.626924	[veniam, sit, nulla, duis, elit, sunt, consect…	191	Hello, Patsy Mckee! You have 5 unread messages.	apple
4	5e91936df9d5b29dec2b1b63	4	013e2904-9e7c-43fe-8bc4-2afc9b8f270c	False	$2,466.73	http://placehold.it/32×32	20	blue	Waters Farley	male	…	+1 (921) 580-2595	148 Village Road, Dyckesville, Minnesota, 2327	Aute nulla aute proident ullamco in laborum. A…	2017-01-05T01:40:22 -06:-30	-78.364669	63.065719	[laborum, ipsum, nulla, anim, laborum, minim, ex]	253	Hello, Waters Farley! You have 6 unread messages.	apple

Here it is displaying the first 5 rows data. If you want to see only one row data then you need to run this

df.head(1)

If you want to see the last 2 rows data then

df.tail(2)

If there are so many columns like above some of the columns won’t be shown there you will see “…” only. To see all columns below code will come handy

pd.set_option('display.max_columns', None)

At the time of display just like columns displaying many rows also have restrictions If you want to see all rows then below code is here to save you

pd.set_option('display.max_rows', None)

Reading data from MySQL

If you need to read data from the MySQL database and load it in the dataframe then

import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="root",
  passwd="root",
  database="test"
)

First, you need to create a MySQL connection for that we will provide database configuration details.

I use mysql-connector-python package usually. To create a MySQL connection you need to run above code to run it first you need to install this package for that you need to run below command

pip3 install mysql-connector-python

Once the connection is created you just need to provide query rest will be handled by pandas for you

df = pd.read_sql("SELECT * FROM table_name", mydb)

Reading CSV Data

To read data from CSV file

df=pd.read_csv('data.csv', sep=';', header=0)

By default it takes Comma as separator if you want pandas to consider some other separator then you need to mention like this sep=’;’ which you can find in above code.

header = 0 means you have the names of columns in the first row in the file

Writing Data to CSV file

If you want to write the data we have in the data frame to a CSV file then

df.to_csv('data.csv', sep=';', index=False)

By default whenever pandas write data from data frame to CSV file they put row index too. To avoid this index=False is required.

Understanding the Data

To understand what columns have and their types and which have values in every record and which doesn’t

df.info()

You will get output like this

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 22 columns):
 # Column Non-Null Count Dtype  
--- ------ -------------- -----  
 0 _id 19 non-null object 
 1 index 19 non-null int64  
 2 guid 19 non-null object 
 3 isActive 19 non-null bool   
 4 balance 19 non-null object 
 5 picture 19 non-null object 
 6 age 19 non-null int64  
 7 eyeColor 16 non-null object 
 8 name 19 non-null object 
 9 gender 19 non-null object 
 10 company 19 non-null object 
 11 email 19 non-null object 
 12 phone 19 non-null object 
 13 address 19 non-null object 
 14 registered 19 non-null object 
 15 latitude 19 non-null float64
 16 longitude 19 non-null float64
 17 tags 19 non-null object 
 18 friends 19 non-null int64  
 19 greeting 19 non-null object 
 20 favoriteFruit 19 non-null object 
 21 about 17 non-null object 
dtypes: bool(1), float64(2), int64(3), object(16)
memory usage: 3.3+ KB

You want to have only column names nothing other than that then this small code is there to help you

for column in df.columns:
    print(column)

To know unique values in a column

df['eyeColor'].unique()

Output:

array(['blue', nan, 'green', 'brown'], dtype=object)

To get counts of unique values of a column (Like Group By)

df['eyeColor'].value_counts()

Output:

blue 10
green 3
brown 3
Name: eyeColor, dtype: int64

We have 19 rows but if we total above counts then we will get 16 because by default it doesn’t count None values. To get None(or NaN) count also you need to add dropna=False

df['eyeColor'].value_counts(dropna=False)

Output:

blue 10
green 3
brown 3
NaN 3
Name: eyeColor, dtype: int64

To drop unnecessary columns

df = df.drop(['index','guid','picture','email','phone','latitude','longitude','address','registered','tags'], axis='columns')

Whenever we handle operation on column level then axis=’columns’ or axis=1 need to be added

Now the data frame looks like this

index	_id	isActive	balance	age	eyeColor	name	gender	company	friends	greeting	favoriteFruit	about
0	5e91936d617c639d56d6e40a	True	$1,814.13	33	blue	Britney Cote	female	PANZENT	127	Hello, Britney Cote! You have 5 unread messages.	apple	NaN
1	5e91936d0326151699779694	False	$3,610.18	35	NaN	Miranda Mathis	male	HIVEDOM	235	Hello, Miranda Mathis! You have 3 unread messa…	apple	Ut ea dolor cupidatat laboris nulla in anim ex…
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	Carly Rodgers	female	BUZZNESS	298	Hello, Carly Rodgers! You have 6 unread messages.	apple	Culpa tempor voluptate excepteur adipisicing i…
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	Patsy Mckee	female	VISALIA	191	Hello, Patsy Mckee! You have 5 unread messages.	apple	Non duis incididunt laboris tempor aute aliqui…
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	Waters Farley	male	RODEOLOGY	253	Hello, Waters Farley! You have 6 unread messages.	apple	Aute nulla aute proident ullamco in laborum. A…

Data imputation

Data imputation plays a major role to prepare data for the model. These code snippets helped me a lot in data imputation journey

To change the value of a column based on condition

df.loc[df['_id']=='5e91936d617c639d56d6e40a',['isActive']] = False

Output:

index	_id	isActive	balance	age	eyeColor	name	gender	company	friends	greeting	favoriteFruit	about
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	Britney Cote	female	PANZENT	127	Hello, Britney Cote! You have 5 unread messages.	apple	NaN
1	5e91936d0326151699779694	False	$3,610.18	35	NaN	Miranda Mathis	male	HIVEDOM	235	Hello, Miranda Mathis! You have 3 unread messa…	apple	Ut ea dolor cupidatat laboris nulla in anim ex…
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	Carly Rodgers	female	BUZZNESS	298	Hello, Carly Rodgers! You have 6 unread messages.	apple	Culpa tempor voluptate excepteur adipisicing i…
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	Patsy Mckee	female	VISALIA	191	Hello, Patsy Mckee! You have 5 unread messages.	apple	Non duis incididunt laboris tempor aute aliqui…
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	Waters Farley	male	RODEOLOGY	253	Hello, Waters Farley! You have 6 unread messages.	apple	Aute nulla aute proident ullamco in laborum. A…

You can observe here isActive column value of the row with id 5e91936d617c639d56d6e40a changed to False from True

Sometimes we get None values there we usually fill with data which is kind of normal. For example, most of the people have brown coloured eyes and some records have None in the eyeColor column. To fill it

df['eyeColor'].fillna('brown', inplace=True)

Output:

index	_id	isActive	balance	age	eyeColor	name	gender	company	friends	greeting	favoriteFruit	about
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	Britney Cote	female	PANZENT	127	Hello, Britney Cote! You have 5 unread messages.	apple	NaN
1	5e91936d0326151699779694	False	$3,610.18	35	brown	Miranda Mathis	male	HIVEDOM	235	Hello, Miranda Mathis! You have 3 unread messa…	apple	Ut ea dolor cupidatat laboris nulla in anim ex…
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	Carly Rodgers	female	BUZZNESS	298	Hello, Carly Rodgers! You have 6 unread messages.	apple	Culpa tempor voluptate excepteur adipisicing i…
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	Patsy Mckee	female	VISALIA	191	Hello, Patsy Mckee! You have 5 unread messages.	apple	Non duis incididunt laboris tempor aute aliqui…
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	Waters Farley	male	RODEOLOGY	253	Hello, Waters Farley! You have 6 unread messages.	apple	Aute nulla aute proident ullamco in laborum. A…

You can see 2nd row previous value is NaN. Now it got filled by brown

We used inplace=True extra here. Usually, whatever operations we do in data frame it returns a new data frame. Instead of that if you want to do change in current data frame only then we need to use inplace=True

We can do several operations specific to data types. For example, If you want to convert values to uppercase of a column which has string values then pretty simple. Code for this is

df["gender"] = df["gender"].str.upper()

Output:

index	_id	isActive	balance	age	eyeColor	name	gender	company	friends	greeting	favoriteFruit	about
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	Britney Cote	FEMALE	PANZENT	127	Hello, Britney Cote! You have 5 unread messages.	apple	NaN
1	5e91936d0326151699779694	False	$3,610.18	35	brown	Miranda Mathis	MALE	HIVEDOM	235	Hello, Miranda Mathis! You have 3 unread messa…	apple	Ut ea dolor cupidatat laboris nulla in anim ex…
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	Carly Rodgers	FEMALE	BUZZNESS	298	Hello, Carly Rodgers! You have 6 unread messages.	apple	Culpa tempor voluptate excepteur adipisicing i…
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	Patsy Mckee	FEMALE	VISALIA	191	Hello, Patsy Mckee! You have 5 unread messages.	apple	Non duis incididunt laboris tempor aute aliqui…
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	Waters Farley	MALE	RODEOLOGY	253	Hello, Waters Farley! You have 6 unread messages.	apple	Aute nulla aute proident ullamco in laborum. A…

You can observe now gender values converted to uppercase

We can consider a data frame as a matrix so we can do matrix operations too. For example, let’s say we want to create a new column where we want to calculate friends count with age. To do this

df['friends_to_age']=df['friends']/df['age']

Output:

index	_id	isActive	balance	age	eyeColor	name	gender	company	friends	greeting	favoriteFruit	about	friends_to_age
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	Britney Cote	FEMALE	PANZENT	127	Hello, Britney Cote! You have 5 unread messages.	apple	NaN	3.848485
1	5e91936d0326151699779694	False	$3,610.18	35	brown	Miranda Mathis	MALE	HIVEDOM	235	Hello, Miranda Mathis! You have 3 unread messa…	apple	Ut ea dolor cupidatat laboris nulla in anim ex…	6.714286
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	Carly Rodgers	FEMALE	BUZZNESS	298	Hello, Carly Rodgers! You have 6 unread messages.	apple	Culpa tempor voluptate excepteur adipisicing i…	8.514286
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	Patsy Mckee	FEMALE	VISALIA	191	Hello, Patsy Mckee! You have 5 unread messages.	apple	Non duis incididunt laboris tempor aute aliqui…	7.958333
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	Waters Farley	MALE	RODEOLOGY	253	Hello, Waters Farley! You have 6 unread messages.	apple	Aute nulla aute proident ullamco in laborum. A…	12.650000

You can do all sorts of things on values and write lambda functions too to extract data from values present in a column. For example, If you read the greeting message it contains the number of unread messages count. I don’t think greeting message has any significance except the unread messages count so we can extract and put it in a new column with name unread_messages_count

Output:

index	_id	isActive	balance	age	eyeColor	name	gender	company	friends	greeting	favoriteFruit	about	friends_to_age	unread_messages_count
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	Britney Cote	FEMALE	PANZENT	127	Hello, Britney Cote! You have 5 unread messages.	apple	NaN	3.848485	5
1	5e91936d0326151699779694	False	$3,610.18	35	brown	Miranda Mathis	MALE	HIVEDOM	235	Hello, Miranda Mathis! You have 3 unread messa…	apple	Ut ea dolor cupidatat laboris nulla in anim ex…	6.714286	3
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	Carly Rodgers	FEMALE	BUZZNESS	298	Hello, Carly Rodgers! You have 6 unread messages.	apple	Culpa tempor voluptate excepteur adipisicing i…	8.514286	6
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	Patsy Mckee	FEMALE	VISALIA	191	Hello, Patsy Mckee! You have 5 unread messages.	apple	Non duis incididunt laboris tempor aute aliqui…	7.958333	5
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	Waters Farley	MALE	RODEOLOGY	253	Hello, Waters Farley! You have 6 unread messages.	apple	Aute nulla aute proident ullamco in laborum. A…	12.650000	6

Let’s say you want to extract what you want from a column value and put it in different columns that also possible. For example, Let’s split name into first name and last name

def extract_first_last_name(row):
    name = row['name'].split()
    row['first_name'] = name[0]
    row['last_name'] = name[1]
    return row

df = df.apply(lambda row: extract_first_last_name(row), axis='columns')

Output:

index	_id	isActive	balance	age	eyeColor	name	gender	company	friends	greeting	favoriteFruit	about	friends_to_age	unread_messages_count	first_name	last_name
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	Britney Cote	FEMALE	PANZENT	127	Hello, Britney Cote! You have 5 unread messages.	apple	NaN	3.848485	5	Britney	Cote
1	5e91936d0326151699779694	False	$3,610.18	35	brown	Miranda Mathis	MALE	HIVEDOM	235	Hello, Miranda Mathis! You have 3 unread messa…	apple	Ut ea dolor cupidatat laboris nulla in anim ex…	6.714286	3	Miranda	Mathis
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	Carly Rodgers	FEMALE	BUZZNESS	298	Hello, Carly Rodgers! You have 6 unread messages.	apple	Culpa tempor voluptate excepteur adipisicing i…	8.514286	6	Carly	Rodgers
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	Patsy Mckee	FEMALE	VISALIA	191	Hello, Patsy Mckee! You have 5 unread messages.	apple	Non duis incididunt laboris tempor aute aliqui…	7.958333	5	Patsy	Mckee
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	Waters Farley	MALE	RODEOLOGY	253	Hello, Waters Farley! You have 6 unread messages.	apple	Aute nulla aute proident ullamco in laborum. A…	12.650000	6	Waters	Farley

We did whatever data imputation we need to do and for some columns, we can’t do anything and they don’t have values in some rows and there is no point of them to keep because it’s going to give anyway trouble at the time of model training so we want to drop those columns and it’s data then

df = df.dropna(axis='columns')

Output:

index	_id	isActive	balance	age	eyeColor	name	gender	company	friends	greeting	favoriteFruit	friends_to_age	unread_messages_count	first_name	last_name
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	Britney Cote	FEMALE	PANZENT	127	Hello, Britney Cote! You have 5 unread messages.	apple	3.848485	5	Britney	Cote
1	5e91936d0326151699779694	False	$3,610.18	35	brown	Miranda Mathis	MALE	HIVEDOM	235	Hello, Miranda Mathis! You have 3 unread messa…	apple	6.714286	3	Miranda	Mathis
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	Carly Rodgers	FEMALE	BUZZNESS	298	Hello, Carly Rodgers! You have 6 unread messages.	apple	8.514286	6	Carly	Rodgers
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	Patsy Mckee	FEMALE	VISALIA	191	Hello, Patsy Mckee! You have 5 unread messages.	apple	7.958333	5	Patsy	Mckee
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	Waters Farley	MALE	RODEOLOGY	253	Hello, Waters Farley! You have 6 unread messages.	apple	12.650000	6	Waters	Farley

You can observe here about column is gone.

Let’s drop some unnecessary columns

df = df.drop(['name','greeting'], axis='columns')

Dropping duplicate rows

To drop duplicate rows

unique_df = df.drop_duplicates()

If you want see what unique_df have

unique_df.head(10)

index	_id	isActive	balance	age	eyeColor	gender	company	friends	favoriteFruit	friends_to_age	unread_messages_count	first_name	last_name
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	FEMALE	PANZENT	127	apple	3.848485	5	Britney	Cote
1	5e91936d0326151699779694	False	$3,610.18	35	brown	MALE	HIVEDOM	235	apple	6.714286	3	Miranda	Mathis
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	FEMALE	BUZZNESS	298	apple	8.514286	6	Carly	Rodgers
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	FEMALE	VISALIA	191	apple	7.958333	5	Patsy	Mckee
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	MALE	RODEOLOGY	253	apple	12.650000	6	Waters	Farley
5	5e91936dbf8fbe6a1b99f888	False	$2,069.59	39	brown	FEMALE	QUAILCOM	123	banana	3.153846	7	Kelly	Monroe
6	5e91936d4eed9784f3515a46	False	$1,472.59	24	blue	FEMALE	ENTROPIX	167	apple	6.958333	8	Chelsea	Gregory

unique_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns (total 13 columns):
 # Column Non-Null Count Dtype  
--- ------ -------------- -----  
 0 _id 7 non-null object 
 1 isActive 7 non-null bool   
 2 balance 7 non-null object 
 3 age 7 non-null int64  
 4 eyeColor 7 non-null object 
 5 gender 7 non-null object 
 6 company 7 non-null object 
 7 friends 7 non-null int64  
 8 favoriteFruit 7 non-null object 
 9 friends_to_age 7 non-null float64
 10 unread_messages_count 7 non-null object 
 11 first_name 7 non-null object 
 12 last_name 7 non-null object 
dtypes: bool(1), float64(1), int64(2), object(9)
memory usage: 735.0+ bytes

Now you can observe there are no duplicate rows

Renaming column names

unique_df = unique_df.rename(columns={"unread_messages_count": "unread_count","friends_to_age":"friends_by_age"})

Here we renamed column names from unread_messages_count to unread_count and from friends_to_age to friends_by_age

unique_df.head()

index	_id	isActive	balance	age	eyeColor	gender	company	friends	favoriteFruit	friends_by_age	unread_count	first_name	last_name
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	FEMALE	PANZENT	127	apple	3.848485	5	Britney	Cote
1	5e91936d0326151699779694	False	$3,610.18	35	brown	MALE	HIVEDOM	235	apple	6.714286	3	Miranda	Mathis
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	FEMALE	BUZZNESS	298	apple	8.514286	6	Carly	Rodgers
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	FEMALE	VISALIA	191	apple	7.958333	5	Patsy	Mckee
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	MALE	RODEOLOGY	253	apple	12.650000	6	Waters	Farley

Data Filtering and sorting

We can do all sorts of conditions and filter the data what we need too. For example, you want to see the rows which users have age more than 30

unique_df.loc[unique_df['age'] > 30]

Output:

index	_id	isActive	balance	age	eyeColor	gender	company	friends	favoriteFruit	friends_by_age	unread_count	first_name	last_name
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	FEMALE	PANZENT	127	apple	3.848485	5	Britney	Cote
1	5e91936d0326151699779694	False	$3,610.18	35	brown	MALE	HIVEDOM	235	apple	6.714286	3	Miranda	Mathis
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	FEMALE	BUZZNESS	298	apple	8.514286	6	Carly	Rodgers
5	5e91936dbf8fbe6a1b99f888	False	$2,069.59	39	brown	FEMALE	QUAILCOM	123	banana	3.153846	7	Kelly	Monroe

We can put multiple conditions on a single column too

unique_df.loc[unique_df['eyeColor'].isin(['brown','green'])]

Output:

index	_id	isActive	balance	age	eyeColor	gender	company	friends	favoriteFruit	friends_by_age	unread_count	first_name	last_name
1	5e91936d0326151699779694	False	$3,610.18	35	brown	MALE	HIVEDOM	235	apple	6.714286	3	Miranda	Mathis
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	FEMALE	VISALIA	191	apple	7.958333	5	Patsy	Mckee
5	5e91936dbf8fbe6a1b99f888	False	$2,069.59	39	brown	FEMALE	QUAILCOM	123	banana	3.153846	7	Kelly	Monroe

Even we can use predicate functions related to data types like this

unique_df.loc[unique_df['eyeColor'].str.contains('b')]

Output:

index	_id	isActive	balance	age	eyeColor	gender	company	friends	favoriteFruit	friends_by_age	unread_count	first_name	last_name
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	FEMALE	PANZENT	127	apple	3.848485	5	Britney	Cote
1	5e91936d0326151699779694	False	$3,610.18	35	brown	MALE	HIVEDOM	235	apple	6.714286	3	Miranda	Mathis
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	FEMALE	BUZZNESS	298	apple	8.514286	6	Carly	Rodgers
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	MALE	RODEOLOGY	253	apple	12.650000	6	Waters	Farley
5	5e91936dbf8fbe6a1b99f888	False	$2,069.59	39	brown	FEMALE	QUAILCOM	123	banana	3.153846	7	Kelly	Monroe
6	5e91936d4eed9784f3515a46	False	$1,472.59	24	blue	FEMALE	ENTROPIX	167	apple	6.958333	8	Chelsea	Gregory

4th row is gone because it have green as a value in eyeColor column

We can sort the data based on values of a column

unique_df.sort_values(by=['first_name'])

Output:

index	_id	isActive	balance	age	eyeColor	gender	company	friends	favoriteFruit	friends_by_age	unread_count	first_name	last_name
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	FEMALE	PANZENT	127	apple	3.848485	5	Britney	Cote
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	FEMALE	BUZZNESS	298	apple	8.514286	6	Carly	Rodgers
6	5e91936d4eed9784f3515a46	False	$1,472.59	24	blue	FEMALE	ENTROPIX	167	apple	6.958333	8	Chelsea	Gregory
5	5e91936dbf8fbe6a1b99f888	False	$2,069.59	39	brown	FEMALE	QUAILCOM	123	banana	3.153846	7	Kelly	Monroe
1	5e91936d0326151699779694	False	$3,610.18	35	brown	MALE	HIVEDOM	235	apple	6.714286	3	Miranda	Mathis
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	FEMALE	VISALIA	191	apple	7.958333	5	Patsy	Mckee
4	5e91936df9d5b29dec2b1b63	False	$2,466.73	20	blue	MALE	RODEOLOGY	253	apple	12.650000	6	Waters	Farley

Sorted based on(order by) first name

Creating unique data set with specific fields

Some times we need to see specific columns data only and those too unique ones so that we can do these stuff using group by operations.We have duplicate values in df and let’s say we want to know age and balance per user

df.groupby('_id').agg({'age':'min', 'balance':'min'})

Output:

	age	balance
_id
---	---	---
5e91936d0326151699779694	35	$3,610.18
5e91936d4eed9784f3515a46	24	$1,472.59
5e91936d617c639d56d6e40a	33	$1,814.13
5e91936dac51832738a8d241	35	$2,358.30
5e91936dbf8fbe6a1b99f888	39	$2,069.59
5e91936dc74a890c6f19d933	24	$2,457.23
5e91936df9d5b29dec2b1b63	20	$2,466.73

You can observe that age and balance in the header at a different level. To solve this problem reset_index function is there to help you

df.groupby('_id').agg({'age':'min', 'balance':'min'}).reset_index()

index	_id	age	balance
0	5e91936d0326151699779694	35	$3,610.18
1	5e91936d4eed9784f3515a46	24	$1,472.59
2	5e91936d617c639d56d6e40a	33	$1,814.13
3	5e91936dac51832738a8d241	35	$2,358.30
4	5e91936dbf8fbe6a1b99f888	39	$2,069.59
5	5e91936dc74a890c6f19d933	24	$2,457.23
6	5e91936df9d5b29dec2b1b63	20	$2,466.73

Merging 2 data frames

Some times we have data in a different table or different CSV. So we put that in a different data frame, to merge this to another data frame we need to use merge function.

To explain how we can do let’s create a data frame. It has _id and favourite food.

food_dict = [{"_id":"5e91936d0326151699779694", "favouriteFood":"Pasta"},
{"_id":"5e91936d4eed9784f3515a46", "favouriteFood":"Hamburgers"},
{"_id":"5e91936d617c639d56d6e40a", "favouriteFood":"Pizza"},
{"_id":"5e91936dac51832738a8d241", "favouriteFood":"Vegetables"},
{"_id":"5e91936dbf8fbe6a1b99f888", "favouriteFood":"Chicken"},
{"_id":"5e91936dc74a890c6f19d933", "favouriteFood":"Pasta"}
]

df_food = pd.DataFrame(food_dict)

df_food.head()

index	_id	favouriteFood
0	5e91936d0326151699779694	Pasta
1	5e91936d4eed9784f3515a46	Hamburgers
2	5e91936d617c639d56d6e40a	Pizza
3	5e91936dac51832738a8d241	Vegetables
4	5e91936dbf8fbe6a1b99f888	Chicken

To merge df_food and unique_df let’s do an inner join

unique_df = unique_df.merge(df_food, on="_id", how = 'inner')

unique_df.head(10)

index	_id	isActive	balance	age	eyeColor	gender	company	friends	favoriteFruit	friends_by_age	unread_count	first_name	last_name	favouriteFood
0	5e91936d617c639d56d6e40a	False	$1,814.13	33	blue	FEMALE	PANZENT	127	apple	3.848485	5	Britney	Cote	Pizza
1	5e91936d0326151699779694	False	$3,610.18	35	brown	MALE	HIVEDOM	235	apple	6.714286	3	Miranda	Mathis	Pasta
2	5e91936dac51832738a8d241	False	$2,358.30	35	blue	FEMALE	BUZZNESS	298	apple	8.514286	6	Carly	Rodgers	Vegetables
3	5e91936dc74a890c6f19d933	True	$2,457.23	24	green	FEMALE	VISALIA	191	apple	7.958333	5	Patsy	Mckee	Pasta
4	5e91936dbf8fbe6a1b99f888	False	$2,069.59	39	brown	FEMALE	QUAILCOM	123	banana	3.153846	7	Kelly	Monroe	Chicken
5	5e91936d4eed9784f3515a46	False	$1,472.59	24	blue	FEMALE	ENTROPIX	167	apple	6.958333	8	Chelsea	Gregory	Hamburgers

These are the ones helped me in my project. I hope the above code snippets and examples help you too. This is my cheat sheet. I am putting all these code snippets in a notebook and putting into my GitHub Repository. You can find it here.

Peace. Happy Coding.

DEV Community

My Pandas Cheat Sheet for Data Science in Python

Reading JSON data

Data Display

Reading data from MySQL

Reading CSV Data

Writing Data to CSV file

Understanding the Data

Data imputation

Dropping duplicate rows

Renaming column names

Data Filtering and sorting

Creating unique data set with specific fields

Merging 2 data frames

Top comments (0)