DEV Community

Cover image for Pandas Pocket Guide.
Kaira Kelvin.
Kaira Kelvin.

Posted on

Pandas Pocket Guide.

Python's Pandas library has become an essential tool for data analysis, enabling users to handle complex data sets with ease. As Wes McKinney, the creator of Pandas said in his book Python for Data Analysis.

Pandas is the swiss Army knife of data manipulation tools in python.

Pandas provides the ability to import, clean, manipulate and visualize data making it indispensable for data scientists, analysts, and even beginners in programming.

Why Pandas?

Pandas is built on top of NumPy, making it both powerful and flexible. It introduces intuitive data structures like:

  • Series: A one-dimensional array-like structure.

  • DataFrame: A two-dimensional table-like structure, similar to a spreadsheet or SQL table.
    Dataframe denoted as df is a two-dimensional table consisting of rows and columns.In particular, they include;columns-column names,index-row names,dtypes-dtype associated with each column,shape-number of frows and columns.
    Technically, it is a collection of series-connected or stitched together with a common index.

Creating our first DataFrame.

  • pd.read_csv(‘import_filename.csv', header=1): Reads data from a CSV file directly into a pandas DataFrame

  • my_df.to_csv(‘export_filename'): Directly exports the DataFrame to a CSV file to your workstation

  • my_df.shape: Provides the number of rows and columns of your DataFrame.keep in mind.shape is used without parentheses. The correct usage is to access it as an attribute, not as a method.
    Remember the number of rows come first then followed by
    columns eg in this data from kaggle ("netflix_titles.csv") has
    8807 rows,12columns.

  • my_df.info(): Provides metadata about your DataFrame, including data types for each column.

  • df.dtypes used to show the data types in the pandas data frame.

  • df.set_index('Country',inplace=True) used to set the column country on the data frame.

  • my_df.describe(): Includes statistical details with a column that includes the count, mean, standard deviation (std), minimum, maximum, and percentiles (25th, 50th, and 75th) for any numeric column.

  • column_headers = data.columns -then print (column_headers)

  • len (df) or len(df.index) - gets the number of rows in a data frame.

  • df.replace('a':{'b':'c),'d')

  • Use df[df.isnull().any(axis=1)] to filter rows with any blanks.

  • Use df[df.isnull().all(axis=1)] to filter rows where all values are blank.

Viewing data.

To view a small sample of a series or dataframe object use the head() and tail() methods.

  • my_df.head(2): Displays the first two records from the DataFrame.

  • my_df.tail(2): Displays the last two records from the DataFrame

  • my_df.sort_index(1): Sorts by the labels along an axis—in this example, by the column label headers alphabetically from left to right.

-my_df.isnull(): Displays a list of all rows with a True/False indicator if any of the values by column are null.
df.isnull().any()

To get the top values in a dataset or series.

nlargest() - method returns a specified number of rows ,starting at the top after sorting the dataframe by the highest value for a specified column.

Syntax dataframe.nlargest(n,columns,keep)

  • n- Required a number specifying the number of rows to return.
  • columns- optional, A string (Column label) OR a list of column labels, specifying the columns to order by
  • keep- Optional default 'last' specifying what to do with duplicate rows. 'all', 'first', 'last'
df_food.nlargest(n=5,columns='Vitamin C')
Enter fullscreen mode Exit fullscreen mode

more lines to keep note of

-value_counts()
This method is useful for understanding the distribution of values in a dataset, especially in categorical data, where you want to know how many times each category appears.
It shows the counts of each unique value in the Series. The index represents the unique values, and the corresponding values represent the counts of each unique value in the original Series.

-df_test = df[['drive-wheels',' body-style', 'price']] - this is how to call three columns in a dataframe.

-str.isdigit -Returns a boolean series indicating which strings contain only the digits 0-9

df['colname'].str.isdigit()
Enter fullscreen mode Exit fullscreen mode
df.sort_index 
Enter fullscreen mode Exit fullscreen mode

Reorders the rows of a data frame based on the values in its index in ascending order.

Returns a series with the unique(i.e,distinct )values in s,including NaN (if it occurs in s)

s.unique
Enter fullscreen mode Exit fullscreen mode

How to read or extract columns from a dataset.

To display or output columns in a dataframe in pandas use this ways:

  • Basic ways.
  • Using loc[]
  • Using iloc[]
  • Using .ix

Basic ways

Cal_Wat=df_food[['Calories','Water']]
Enter fullscreen mode Exit fullscreen mode

Using index to select the columns for example

df_food[df_food.columns[1:4]]
Enter fullscreen mode Exit fullscreen mode

Pandas is a very useful library for manipulating mathematical data and is used in the field of machine learning. It comprises many methods for its proper functioning. loc() **and **iloc() are one of those methods.

loc()function.

The loc() function is a label-based data-selecting method which means that we have to pass the name of the row or column that we want to select.
Operations that can be performed using the loc() method include;

1. Selecting Data According to Some Conditions

The code uses the loc function to select and display rows with the same condition such as book price of novels and satisfy a certain condtion such as a rating above 4.6. for example

display(df.loc[(df['book price]==16.99) & (df['rating'] >4.6)])
Enter fullscreen mode Exit fullscreen mode
display(crime.loc[(crime['Crm Cd Desc'] == 'VEHICLE - STOLEN') & (crime['AREA NAME'] == 'Central')])
Enter fullscreen mode Exit fullscreen mode

remember

  • Use single quotes (') around the column names if the column names contain spaces or special characters.
  • Use & for the logical AND operation.

2. Selecting a Range of Rows From the DataFrame.

The code utilizes the loc function to extract and display rows with indices ranging from 3 to 8(inclusive) from the DataFrame.

display(df.loc[3:8])
Enter fullscreen mode Exit fullscreen mode
  1. Updating the Value of Any Column. The code uses the loc function to update

Pandas DataFrame.isin()

The **isin **function is a powerful tool for filtering and selecting data within a data frame based on specified conditions.
It allows you to create boolean masks to identify rows where the values in one or more columns match certain criteria.
eg

Display = crime["AREA NAME"].isin(['Devonshire']) & crime["Status Desc"].isin(['Adult Other'])
print(Display)
Enter fullscreen mode Exit fullscreen mode

this will return

  • The code uses the isin function to check if each element in the specified columns is contained in the provided lists.

  • It creates a boolean Series (Display) where each element corresponds to whether both conditions are true for the corresponding row.

Python iloc() function.

The iloc() function is an indexed-based selecting method which means that we have to pass an integer index in the method to select a specific row/column.DataFrames based on their position or index.

  1. Selecting Rows Using Integer Indices. The code employs the iloc function to extract and display specific rows with indices 0, 2, 4, and 7 from the DataFrame, showcasing the information about the rows selected.
 display(data.iloc[[0, 2, 4, 7]])
Enter fullscreen mode Exit fullscreen mode
  1. Selecting a Range of Columns and Rows Simultaneously. The code utilizes the iloc function to extract and display a subset of the DataFrame, including rows 1 to 4 and columns 2 to 4.
display(data.iloc[1: 5, 2: 5])
Enter fullscreen mode Exit fullscreen mode

df.dtypes - attribute of a DataFrame which returns a Series with the data type of each column.
int64 - represents a 64-bit integer.
object - typically represents strings or mixed types (i.e., columns with different types of data).
float64 - represents a 64-bit floating-point number.

To convert the data types of columns using the keyword 'astype'

df['column1']=df['column1'].astype(float)
Enter fullscreen mode Exit fullscreen mode

when converting object to float u can first clean the data.

Dataframe. info() **-useful during exploratory analysis offering a quick and informative overview of the dataset. it helps gain insights.
**Functions:

It lists all columns with their data types and the number of non-null values in each column.

df.info(verbose = False) 
Enter fullscreen mode Exit fullscreen mode

it utilizes the panda's library to handle tabular data, it prints a summary of the dataset.

df.info(verbose = True, null_counts = False)
Enter fullscreen mode Exit fullscreen mode

it excludes all the null counts.

df.size, df.shape and df.ndim attributes.
Pandas size, shape, and ndim attributes are used to return the size, shape, and dimensions of data frames and series.

data.size - calculates and prints the total number of elements (size) in the DataFrame, which is the product of the number of rows and the number of columns. It includes both non-null and null values in the count.

df. Shape- calculates and prints the shape of the dataframe which includes the number of rows and columns.

df.ndim- attribute of a data frame or series represents the number of dimensions of the underlying data. eg (rows and columns).

Dataframe.isnull() method.
** df.isnull() ** - function detect missing values in the given dataset.It return a boolean same sized object indicating if the values are NA. Missing values get mapped to TRUE and non-missing values get mapped to FALSE .

To get the total of missing values per column.
df_food.isnull().sum() or df_food.isna().sum()

     **
Enter fullscreen mode Exit fullscreen mode
dataframe.sum()
Enter fullscreen mode Exit fullscreen mode

method** - used to calculate the sum of values in a DataFrame or Series

DataFrame merge ()Method .

We can join,merge and concat data frame using different methods. In DataFrame df. merge(),df. join(), and df. concat() methods help in joining merging and concating different dataframes.
The merge () method updates the content of two DataFrame by merging them using the specified methods().

Merging DataFrame.

Pandas have options for high-performance in-memory merging and joining. When we need to combine very large data frames, joins serve as a powerful way to perform these operations swiftly. Joins can only be done on two data frame at a time, denoted as left and right tables. The key is the common column that the column avoid unintended duplication of row values. Pandas provide a single function merge () as the entry point for all standard database join operations between DataFrame objects.

Code 1: Merging a data frame with one unique key combination,
res =pd. merge(df,df1,on='key')
Code 2: Merging dataframe using multiple join keys.
res=pd.merge(df,df1,on=['key','key1']

Merging data frame using how in an argument:
We use how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or right tables, the values in the joined table will be NA, Here is a summary of the how operations and their SQL equivalent names:

Image description

Now we set how = 'left' to use keys from the left frame only.

res = pd.merge(df, df1, how='left', on=['key', 'key1'])

Now we set how = 'right' to use keys from the right frame only.

Now we set how = 'outer' to get a union of keys from data frames.

Joining DataFrame.

To join the data frame, we use. join() function. It is used for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

idxmax() and idxmin() method
This method is used to get the index **of the first occurrence of **maximum over the requested axis.Null values are excluded

df.idxmax(axis=0,skipna=True)
Enter fullscreen mode Exit fullscreen mode

axis '0' or index represents the row wise and '1' or column represents the column wise it represents which axis is to use.
Skipna- it represents the bool(True or False and the default is True.

idxmin() Method.

Method is used to get the index of the first occurrence of minimum over the requested axis.

groupby().

The groupby() function is a powerful tool in python,as the name suggest it is used for grouping data according to the categories and applying a function to the categories.

when converting strings to numeric we use the following code;

 pd.to_numeric
Enter fullscreen mode Exit fullscreen mode

for instance:

 Mpesa['Withdrawn'] = pd.to_numeric(Mpesa['Withdrawn'])

Enter fullscreen mode Exit fullscreen mode

Removing the negative sign in pandas use .

abs()
Enter fullscreen mode Exit fullscreen mode

eg Mpesa['Withdrawn']=Mpesa['Withdrawn'].abs()

Reading from a specific webpage, u pass the following line of code
"url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Top comments (0)