
Python Geeks

Originally published at pythongeeks.net

Python Pandas for Beginners – A Complete Guide (Part 2)


In this article, we continue learning Python Pandas. We will learn how to read a DataFrame from a file and cover the most important Pandas operations for beginners.

In the previous post of the series, we covered the basic concepts in Pandas, such as what Pandas is, Series, and DataFrame. If you don't remember, click here to go back to Part 1.

Python Pandas for Beginners (Photo by The Washington Post)

How to read and write a DataFrame

It is quite simple to read data with the support of the Pandas library. We can load data from various sources such as CSV, JSON, or Excel files. For learning purposes, we will try loading data from each of these sources.

JSON file

Assume we have a JSON file containing product inventory data, as below.

{
  "CAN": {
    "headphone": 1,
    "laptop": 5
  },
  "SGN": {
    "headphone": 3,
    "laptop": 8
  },
  "SIN": {
    "headphone": 0,
    "laptop": 5
  }
}

Then we can read the JSON file easily with the read_json function of Pandas.

import pandas as pd
data = pd.read_json("data.json")
print(data)

# Result:
#            CAN  SGN  SIN
# headphone    1    3    0
# laptop       5    8    5
#

Writing data to a JSON file:

data.to_json("exported_data.json")
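As a side note (our addition, not from the original article): if you call to_json without a file path, it returns the JSON string instead of writing a file, which makes a quick round-trip check easy. A minimal sketch, assuming a recent pandas where read_json takes a file-like object:

```python
import io
import pandas as pd

# Small inventory frame mirroring the article's data
data = pd.DataFrame({"CAN": [1, 5], "SGN": [3, 8], "SIN": [0, 5]},
                    index=["headphone", "laptop"])

# Without a path, to_json returns the JSON string instead of writing a file
json_str = data.to_json()

# Wrap the string in StringIO; passing raw JSON strings to read_json
# directly is deprecated in newer pandas versions
restored = pd.read_json(io.StringIO(json_str))
```

The round trip preserves both the index labels and the integer values.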

CSV file

We use the same demo data as in the previous example, but this time in CSV format. The data in the file looks like this:

,CAN,SGN,SIN
headphone,1,3,0
laptop,5,8,5

Loading data into a DataFrame using read_csv:

import pandas as pd
data = pd.read_csv("data.csv")
print(data)

# Result:
#     Unnamed: 0  CAN  SGN  SIN
#  0  headphone    1    3    0
#  1     laptop    5    8    5
#

In the above code, read_csv generates an index column by default. However, we want the first column (headphone, laptop) to be the index column, so we pass the index_col parameter to read_csv to tell it which column to use as the index.

import pandas as pd
data = pd.read_csv("data.csv", index_col=0)
print(data)

# Result:
#              CAN  SGN  SIN
#  headphone    1    3    0
#  laptop       5    8    5
#

Writing a DataFrame to a CSV file:

data.to_csv("exported_data.csv")
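A quick sketch (our addition) of why index_col=0 is needed when reading the data back: to_csv writes the index as the first, unnamed column, and to_csv with no path returns the CSV text, so we can verify the round trip in memory:

```python
import io
import pandas as pd

data = pd.DataFrame({"CAN": [1, 5], "SGN": [3, 8], "SIN": [0, 5]},
                    index=["headphone", "laptop"])

# to_csv with no path returns the CSV text; the index becomes the first,
# unnamed column, which is why read_csv needs index_col=0 to restore it
csv_str = data.to_csv()
restored = pd.read_csv(io.StringIO(csv_str), index_col=0)
```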

Excel file

Reading data from an Excel file using the read_excel function:

import pandas as pd
data = pd.read_excel("data.xlsx", index_col=0)
print(data)

# Result:
#              CAN  SGN  SIN
#  headphone    1    3    0
#  laptop       5    8    5
#

Writing a DataFrame to an Excel file:

data.to_excel('exported_data.xlsx', sheet_name='Sheet1')

Important Pandas Operators

Before learning the Pandas operators, we will create a DataFrame to use in the examples throughout this section.

import pandas as pd
import numpy as np

dates = pd.date_range('20190101', periods=50)
data = pd.DataFrame(np.random.randn(50, 4), index=dates, columns=list('ABCD'))

Viewing your data

Now we have a DataFrame with 50 rows, which is too big to print in full. In real-life situations, the number of rows will often be many times larger. Therefore, the first thing we should know is how to print out a few rows to keep as a visual reference.

We will use .head() to display the first 5 rows of the data:

data.head()

# Result:
#                   A         B         C         D
#  2019-01-01  -1.005372  0.142613 -0.181516  1.036709
#  2019-01-02   0.790087  0.294033 -0.602744  1.035578
#  2019-01-03  -1.703856  0.126258  1.080593 -0.421066
#  2019-01-04  -0.558818 -2.923537 -1.721127 -0.275644
#  2019-01-05  -1.408334 -0.860980  0.052589  1.104063

What if we would like to print the first 10 rows instead of 5? In this case, we can pass a number to .head() to define how many rows to print. For example:

data.head(10)

# Result:
#                   A         B         C         D
#  2019-01-01  -1.005372  0.142613 -0.181516  1.036709
#  2019-01-02   0.790087  0.294033 -0.602744  1.035578
#  2019-01-03  -1.703856  0.126258  1.080593 -0.421066
#  2019-01-04  -0.558818 -2.923537 -1.721127 -0.275644
#  2019-01-05  -1.408334 -0.860980  0.052589  1.104063
#  2019-01-06   0.418509  0.607834  0.017931  0.748909
#  2019-01-07   0.336740 -0.406930  1.420234 -1.702862
#  2019-01-08   0.739592  1.051292 -0.757623 -1.156324
#  2019-01-09  -0.225760  0.416810  0.128996 -1.450296
#  2019-01-10   0.527811  1.771893 -0.551995  1.101953

We can use .tail() to get the last 5 rows of the data. As with .head(), we can pass a number to determine how many rows will be printed.

data.tail(7)

# Result:
#                  A         B         C         D
#  2019-02-13  0.834528  2.912336 -0.957908  0.758701
#  2019-02-14 -0.866577 -0.886605 -0.339376  1.296223
#  2019-02-15  0.324452 -1.030220  0.854473  1.471936
#  2019-02-16  2.657040 -1.169546 -1.746896 -0.745877
#  2019-02-17  1.494073 -0.709933 -0.086347 -0.512125
#  2019-02-18 -1.455421 -0.370378  1.475331 -0.867604
#  2019-02-19 -1.542814  0.355690 -0.705522  0.069457

Showing data info

.info() shows summary information about your data:

data.info()

# Result
#  <class 'pandas.core.frame.DataFrame'>
#  DatetimeIndex: 50 entries, 2019-01-01 to 2019-02-19
#  Freq: D
#  Data columns (total 4 columns):
#  A    50 non-null float64
#  B    50 non-null float64
#  C    50 non-null float64
#  D    50 non-null float64
#  dtypes: float64(4)
#  memory usage: 2.0 KB

DataFrame selection, addition, and deletion

So far, we have covered the Pandas data structures and how to create, load, and write a DataFrame. In this section, we will learn the selection, addition, and deletion methods that you'll need to use constantly.

By Column

The simplest way to select a column of data is by using brackets. For example:

data['A'].head()

# Result
#  2019-01-01   -1.005372
#  2019-01-02    0.790087
#  2019-01-03   -1.703856
#  2019-01-04   -0.558818
#  2019-01-05   -1.408334
#  Freq: D, Name: A, dtype: float64

We can select multiple columns by passing a list inside the brackets.

data[['A', 'D']].head()

# Result
#                   A         D
#  2019-01-01  -1.005372  1.036709
#  2019-01-02   0.790087  1.035578
#  2019-01-03  -1.703856 -0.421066
#  2019-01-04  -0.558818 -0.275644
#  2019-01-05  -1.408334  1.104063
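One detail worth calling out here (our addition, not from the original article): single brackets return a Series, while double brackets return a DataFrame, even when only one column is named. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Same shape of data as in the article, just fewer rows
dates = pd.date_range("20190101", periods=5)
data = pd.DataFrame(np.random.randn(5, 4), index=dates, columns=list("ABCD"))

col_series = data["A"]     # one-dimensional Series
col_frame = data[["A"]]    # one-column DataFrame
```

This matters when you pass the result to code that expects a two-dimensional frame.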

By Row

Another option for indexing data is to use .loc and .iloc:

  • .loc : location by label (index name)
  • .iloc : location by integer position

If we pass only one parameter to .loc or .iloc, it is interpreted as a row selection by default; a second parameter can select columns. Now let's get the row labeled "2019-02-19".

data.loc['2019-02-19']

# Result
#  A   -1.542814
#  B    0.355690
#  C   -0.705522
#  D    0.069457
#  Name: 2019-02-19 00:00:00, dtype: float64

Now we will select specific columns within that row:

data.loc['2019-02-19', ['A', 'B']]

# Result
#  A   -1.542814
#  B    0.355690
#  Name: 2019-02-19 00:00:00, dtype: float64

Because the row labels are timestamps, we can also select rows by time range.

data.loc['2019-02-02':'2019-02-05']

# Result
#                   A         B         C         D
#  2019-02-02   1.343458 -0.940805 -0.671073 -0.102026
#  2019-02-03   0.815890  1.190464 -0.479341 -0.105754
#  2019-02-04  -0.256803  0.595940  0.279049  2.010431
#  2019-02-05   0.311328 -0.111790 -0.515169 -0.014552

Similar to .loc, we can use .iloc to select one or more specific positions. Below is an example of selecting multiple index positions:

data.iloc[[1, 3], :]

# Result
#                  A         B         C         D
#  2019-01-02   0.790087  0.294033 -0.602744  1.035578
#  2019-01-04  -0.558818 -2.923537 -1.721127 -0.275644

Or selecting a slice of index:

data.iloc[10:13]

# Result
#                   A         B         C         D
#  2019-01-11   0.530191 -1.301325 -0.050186  0.267728
#  2019-01-12  -0.652778 -0.611594  0.291840 -0.935883
#  2019-01-13   1.602400 -0.137409  1.002766  0.820419

Be aware that although we asked for rows [10:13], what we get is data from the 10th row to the 12th row; the 13th row is not in the result. Slicing with .iloc follows the same rule as slicing Python lists: the element at the stop index is not included.
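To make the contrast concrete, here is a sketch (our addition, using the same kind of randomly generated frame as above) showing that label slicing with .loc includes both endpoints, while positional slicing with .iloc excludes the stop:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190101", periods=50)
data = pd.DataFrame(np.random.randn(50, 4), index=dates, columns=list("ABCD"))

# .loc label slice: both endpoints included -> 3 rows (Jan 11, 12, 13)
by_label = data.loc["2019-01-11":"2019-01-13"]

# .iloc positional slice: stop excluded -> rows 10, 11, 12 (the same 3 dates)
by_position = data.iloc[10:13]
```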

Selection by callable

.loc, .iloc, and [] indexing can also accept a callable. The callable must be a function with one argument. In practice, Python developers usually use a lambda. You can refer to our article on the lambda function for more detail (Python Lambda Function).

data.loc[lambda row: row.D > 0]

# Result:
#                   A         B         C         D
#  2019-01-01  -0.862970  1.312603 -0.427154  1.222134
#  2019-01-03   0.173703 -0.258924 -0.014464  0.320602
#  2019-01-04   0.252102 -1.914325  0.649628  0.885115
#  2019-01-06  -0.461437  0.063366  0.569461  0.427437
#  2019-01-08  -0.956210  0.067605 -2.273661  0.833149
#  2019-01-09   0.340620 -1.976085 -1.144538  0.425282
#  2019-01-11   0.530191 -1.301325 -0.050186  0.267728
#  2019-01-13   1.602400 -0.137409  1.002766  0.820419
#  2019-01-15  -1.083804  1.790858 -0.301093  0.674832
#  2019-01-16  -0.143779  2.104730 -1.228123  1.568482
#  2019-01-22  -0.821942  1.381137 -1.812166  0.785756
#  2019-01-23  -1.404428  0.563167  2.138703  2.387186
#  2019-01-28   0.271000 -0.633658 -0.839952  0.816997
#  2019-01-29  -0.259307  1.638976 -0.648043  0.715521
#  2019-02-01   0.109348 -0.624031 -0.283261  0.138264
#  2019-02-03  -1.319024 -0.131162  2.011317  1.312116
#  2019-02-04   0.910435 -0.636221 -0.857388  0.456990
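For comparison (our addition, not from the article): the same selection can also be written as a plain boolean mask, which is the more common idiom for a one-off filter:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190101", periods=50)
data = pd.DataFrame(np.random.randn(50, 4), index=dates, columns=list("ABCD"))

selected_callable = data.loc[lambda row: row.D > 0]
selected_mask = data[data.D > 0]  # equivalent boolean-mask form
```

The callable form is mainly useful in method chains, where the intermediate DataFrame has no name to refer to.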

Handling duplicated data

Handling duplicated data is a common task when working with a large data set. We could iterate over each row and check whether it duplicates another, but that would be slow and complicated.

Fortunately, the DataFrame class provides functions to handle duplicated data easily. We can find duplicates across whole rows or across selected columns. Now let's figure out how it works.

  • duplicated : returns a Boolean Series with True at the position of each duplicated row.
  • drop_duplicates : removes the duplicate rows.

By default, the first observed row is considered unique. However, each method has a keep parameter to specify which row is kept:

  • keep='first': the default; mark rows as duplicated except for the first observed.
  • keep='last': mark rows as duplicated except for the last observed.
  • keep=False: mark all duplicated rows.

Before doing the examples, we're going to create the data below.

import pandas as pd

employees = [('Mark', 34, 'Toronto'),
            ('Kana', 30, 'Delhi'),
            ('Tam', 26, 'Ha Noi'),
            ('Kana', 30, 'Delhi'),
            ('Kana', 30, 'Delhi'),
            ('Kana', 30, 'Delhi'),
            ('Hashima', 40, 'London'),
            ('Rook', 30, 'Delhi')
            ]

data = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])
print(data)

#        Name  Age     City
#  0     Mark   34  Toronto
#  1     Kana   30    Delhi
#  2      Tam   26   Ha Noi
#  3     Kana   30    Delhi
#  4     Kana   30    Delhi
#  5     Kana   30    Delhi
#  6  Hashima   40   London
#  7     Rook   30    Delhi

By Row

Finding the duplicated rows:

# The same with data.duplicated(keep='first')
data.duplicated()

# Result
#  0    False
#  1    False
#  2    False
#  3     True
#  4     True
#  5     True
#  6    False
#  7    False
#  dtype: bool

The result is a Series in which each row is marked as duplicated or not. If we use this result to select from the data, we get the duplicated rows themselves.

data[data.duplicated()]

# Result
#     Name  Age   City
#  3  Kana   30  Delhi
#  4  Kana   30  Delhi
#  5  Kana   30  Delhi

Dropping the duplicated rows, keeping the first observed row by default:

data.drop_duplicates()

# Result
#        Name  Age     City
#  0     Mark   34   Toronto
#  1     Kana   30    Delhi
#  2      Tam   26   Ha Noi
#  6  Hashima   40   London
#  7     Rook   30    Delhi

Now let's see the difference with keep='last':

data.drop_duplicates(keep='last')

# Result
#        Name  Age     City
#  0     Mark   34  Toronto
#  2      Tam   26   Ha Noi
#  5     Kana   30    Delhi
#  6  Hashima   40   London
#  7     Rook   30    Delhi

And now with keep=False:

data.drop_duplicates(keep=False)

# Result
#        Name  Age     City
#  0     Mark   34  Toronto
#  2      Tam   26   Ha Noi
#  6  Hashima   40   London
#  7     Rook   30    Delhi

By Column

We can select specific columns to determine whether rows are duplicated:

data[data.duplicated(['Age', 'City'])]

# Result
#     Name  Age   City
#  3  Kana   30  Delhi
#  4  Kana   30  Delhi
#  5  Kana   30  Delhi
#  7  Rook   30  Delhi

Other examples of drop_duplicates with selected columns are available in our GitHub. Check it out for more detail if you need it.
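As a quick sketch of what dropping duplicates by selected columns looks like (our illustration, not the GitHub version): drop_duplicates accepts a subset parameter that restricts the duplicate check to the given columns:

```python
import pandas as pd

employees = [("Mark", 34, "Toronto"), ("Kana", 30, "Delhi"),
             ("Tam", 26, "Ha Noi"), ("Kana", 30, "Delhi"),
             ("Hashima", 40, "London"), ("Rook", 30, "Delhi")]
data = pd.DataFrame(employees, columns=["Name", "Age", "City"])

# 'Rook' is dropped too: his (Age, City) pair duplicates 'Kana',
# even though the full rows differ
deduped = data.drop_duplicates(subset=["Age", "City"])
```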

Applying Function

Now, for example, suppose we would like to get the employees who are over 30 years old. Don't iterate over each employee checking their age; instead, take advantage of the Pandas apply() function.

data[data['Age'].apply(lambda x: x > 30)]

# Result
#        Name  Age     City
#  0     Mark   34  Toronto
#  6  Hashima   40   London
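Worth noting (our addition): for a simple comparison like this, a vectorized boolean mask gives the same result as apply() and is usually faster, since it avoids calling a Python function per row:

```python
import pandas as pd

employees = [("Mark", 34, "Toronto"), ("Kana", 30, "Delhi"),
             ("Hashima", 40, "London")]
data = pd.DataFrame(employees, columns=["Name", "Age", "City"])

over_30_apply = data[data["Age"].apply(lambda x: x > 30)]
over_30_mask = data[data["Age"] > 30]  # vectorized, no Python-level loop
```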

Summary

Exploring, cleaning, transforming, and visualizing data with Pandas in Python is an essential skill in data science. This series has covered only very basic knowledge of Python Pandas for beginners. Find the full examples from this series in our GitHub project.

If you found useful information in this series, please share it with others, leave a comment, and subscribe to get notified of future posts. We really appreciate you taking the time to read the post and share it with other Python Geeks.


