Welcome to the second article in the "7 Days of Pandas" series, where we cover pandas, the Python library for data manipulation.
In the first article of the series, we looked at how to read and write CSV files with Pandas. In this tutorial, we will look at some of the most common operations that we perform on a dataframe in Pandas.
Pandas provides a range of functions and methods for manipulating and transforming data in a variety of formats. In this tutorial, we will cover the following topics:
- Selecting rows and columns
- Filtering data
- Sorting data
- Adding and deleting columns
Before we begin, let's import pandas and read in a sample data file. We will use the pandas.read_csv() function to read a CSV file into a DataFrame object.
We'll assume that a CSV file named "sample_data.csv" exists in the current working directory.
import pandas as pd
df = pd.read_csv("sample_data.csv")
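If you don't have a CSV file at hand, a minimal sketch like the one below builds a small DataFrame in memory instead. The values are made up for illustration, and the column names "column_name" and "other_column" are just the placeholder names used in the examples of this tutorial.

```python
import pandas as pd

# Hypothetical sample data standing in for "sample_data.csv".
# The column names are the placeholders used throughout this tutorial.
df = pd.DataFrame(
    {
        "column_name": [3, 8, 12, 5],
        "other_column": [7, 15, 4, 9],
    }
)

print(df.head())  # show the first rows
print(df.shape)   # (number of rows, number of columns) -> (4, 2)
```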
Now that we have a DataFrame, let's dive into the first topic: selecting rows and columns.
Selecting Rows and Columns
There are several ways to select specific rows and columns from a pandas DataFrame. One way is to use the loc attribute, which selects rows and columns by their labels. If the DataFrame has the default integer index (0, 1, 2, ...), as it does when read_csv() is called without an index column, the first row has the label 0 and you can select it with the following code:
# select the row labeled 0 (the first row with the default index)
df.loc[0]
To select a specific column, you can pass the column name as a string:
# select column by its name
df.loc[:, "column_name"]
You can also use the iloc attribute to select rows and columns based on their integer positions, counting from 0. For example, to select the first row using iloc, you can use the following code:
# select the first row
df.iloc[0]
To select a specific column, you can pass the column index as an integer:
# select column by column index
df.iloc[:, 0]
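The difference between loc and iloc is easiest to see when the index labels are not 0, 1, 2, and so on. The sketch below uses a made-up DataFrame with the labels 10, 20 and 30 (an assumption for illustration) to show that loc looks up labels while iloc looks up positions.

```python
import pandas as pd

# Hypothetical data with a non-default index, so labels and positions differ.
df = pd.DataFrame(
    {"column_name": [3, 8, 12], "other_column": [7, 15, 4]},
    index=[10, 20, 30],
)

first_by_position = df.iloc[0]  # first row, whatever its label is
row_by_label = df.loc[10]       # row whose index label is 10
# df.loc[0] would raise a KeyError here, because no row is labeled 0.

# loc slices include the end label; iloc slices exclude the end position.
by_label = df.loc[10:20, ["column_name"]]  # labels 10 and 20
by_position = df.iloc[0:2, 0:1]            # positions 0 and 1, first column

print(by_label.equals(by_position))
```

Passing a list of column names (or a slice of positions) returns a DataFrame, while passing a single name or position returns a Series.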
Filtering Data
In addition to selecting rows and columns, you can also use pandas to filter your data based on specific conditions.
You can use boolean indexing to filter a DataFrame based on the values in one or more columns. The idea is to write a boolean expression that evaluates to a boolean Series (a mask) with one True or False value per row, and then use that mask to keep only the rows marked True.
To do this, you pass a boolean expression to the DataFrame's indexing operator, []. For example, to filter the DataFrame to only include rows where the value in the "column_name" column is greater than 5, you can use the following code:
# filter dataframe
df[df["column_name"] > 5]
You can also filter the dataframe on multiple conditions by using the logical operators & (and) and | (or). Wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and <. For example, to filter the DataFrame to only include rows where the value in the "column_name" column is greater than 5 and the value in the "other_column" column is less than 10, you can use the following code:
# filter dataframe on multiple conditions
df[(df["column_name"] > 5) & (df["other_column"] < 10)]
Alternatively, you can use the DataFrame query() method, which takes the filter condition as a string.
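As a rough sketch of how query() compares with boolean indexing, the example below applies the same two conditions both ways. The sample values and the threshold variable are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data using the placeholder column names from above.
df = pd.DataFrame(
    {"column_name": [3, 8, 12, 5], "other_column": [7, 15, 4, 9]}
)

# Boolean indexing with two conditions combined by &.
mask_result = df[(df["column_name"] > 5) & (df["other_column"] < 10)]

# The same filter written as a query string.
query_result = df.query("column_name > 5 and other_column < 10")

# Local Python variables can be referenced with an @ prefix.
threshold = 5
variable_result = df.query("column_name > @threshold and other_column < 10")

print(query_result)
```

Column names that contain spaces need to be wrapped in backticks inside the query string, so boolean indexing is sometimes the simpler choice for such columns.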
Sorting Data
To sort a pandas DataFrame, you can use the DataFrame sort_values() method. This method lets you specify one or more columns to sort by, as well as the sort order (ascending or descending). By default it returns a new, sorted DataFrame and leaves the original unchanged.
For example, to sort the DataFrame by the "column_name" column in ascending order, you can use the following code:
# sort dataframe by "column_name" in ascending order
df.sort_values("column_name")
To sort in descending order, you can set the ascending parameter to False:
# sort dataframe by "column_name" in descending order
df.sort_values("column_name", ascending=False)
You can also sort by multiple columns by passing a list of column names:
# sort dataframe by multiple columns
df.sort_values(["column_name_1", "column_name_2"])
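When sorting by several columns, the ascending parameter also accepts a list with one value per column. The sketch below, using made-up data, sorts the first column in ascending order and breaks ties with the second column in descending order.

```python
import pandas as pd

# Hypothetical sample data with repeated values in the first column.
df = pd.DataFrame(
    {"column_name_1": ["b", "a", "b", "a"], "column_name_2": [1, 2, 3, 4]}
)

# One sort direction per column: ascending for the first, descending for the second.
sorted_df = df.sort_values(
    ["column_name_1", "column_name_2"], ascending=[True, False]
)

# The sorted rows keep their original index labels; reset them if a clean
# 0, 1, 2, ... index is preferred.
sorted_df = sorted_df.reset_index(drop=True)

print(sorted_df)
```

Because sort_values() returns a new DataFrame by default, df keeps its original row order unless the result is assigned back to it.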
Adding and Deleting Columns
To add a new column to a pandas DataFrame, you can simply assign a value to a column name that doesn't exist yet. For example, to add a new column called "new_column" with a default value of 0 for all rows, you can use the following code:
# create a new column with all values as 0
df["new_column"] = 0
You can also assign a different value to each row using a list or another Series object.
There are other ways to add a column as well, such as the assign() and insert() methods.
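The sketch below illustrates these options on a small made-up DataFrame. All of the new column names ("from_list", "from_series", "doubled", "tripled", "row_id") are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"column_name": [3, 8, 12]})

# A list must contain exactly one value per row.
df["from_list"] = [10, 20, 30]

# A Series is aligned on index labels, not on position;
# rows without a matching label (here row 1) get NaN.
df["from_series"] = pd.Series([1.5, 2.5], index=[0, 2])

# A column can also be computed from existing columns.
df["doubled"] = df["column_name"] * 2

# assign() returns a new DataFrame and leaves df unchanged.
df_with_tripled = df.assign(tripled=df["column_name"] * 3)

# insert() adds the column in place at a given position (0 = first column).
df.insert(0, "row_id", list(range(len(df))))

print(df)
```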
To delete a column from a DataFrame, you can use the drop() method, passing the column name and setting the axis parameter to 1 (columns). For example, to delete the "new_column" from the DataFrame, you can use the following code:
# remove the column "new_column" from the dataframe
df = df.drop("new_column", axis=1)
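drop() also accepts a columns keyword, which some readers find clearer than axis=1. The sketch below shows it on made-up data; the "extra" and "not_a_column" names are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"column_name": [3, 8], "new_column": [0, 0], "extra": [1, 2]})

# Equivalent to df.drop("new_column", axis=1).
without_new = df.drop(columns="new_column")

# Several columns can be dropped at once by passing a list.
without_two = df.drop(columns=["new_column", "extra"])

# By default, dropping a missing column raises a KeyError;
# errors="ignore" skips labels that don't exist.
unchanged = df.drop(columns=["not_a_column"], errors="ignore")

print(without_two.columns.tolist())
```

Like sort_values(), drop() returns a new DataFrame by default, which is why the result is assigned to a variable.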
That concludes this tutorial on basic data manipulation with pandas. We hope that you found it useful.
In the coming articles, we'll look at other useful operations in Pandas.