In the previous article, we looked at what DataFrames and Series are in pandas. We also explored how to create a pandas DataFrame (referred to as df
here) from a list and a dictionary. I hope you've also researched the key differences between a DataFrame and a Series in pandas. In this article, we will build on that knowledge.
To understand better, in data science, data can be sourced from various sources, and each source may store the data in different formats, such as comma-separated values (CSV), Excel sheets, SQL files, and more. It's up to you to know what tools to use. Now, let's dive into the available functions in pandas.
Reading and Writing CSV Files
Pandas provides read_csv()
and to_csv()
functions to work with CSV files.
Reading CSV Files
read_csv()
is used to read data from CSV files. For example:
import pandas as pd
csvdf = pd.read_csv("file_path.csv")
By default, read_csv()
assumes the first row in your file is the header row. If this is not the case, you can use the header
parameter to specify the row number to use as the header:
csvdf = pd.read_csv("file_path.csv", header=3)
Note that all rows above the specified header will be ignored when creating the DataFrame.
Reading CSV Files with Multiple Headers
If your CSV file has multiple header rows, you can specify them using a list with the header
parameter:
csvdf = pd.read_csv("file_path.csv", header=[0, 2])
Reading CSV Files Without Headers
If your CSV file doesn't include headers, but you have the column names separately, you can use the names
parameter to provide a list of column names:
csvdf = pd.read_csv("file_path.csv", names=["first_name", "last_name", "gender"])
You can check the set headers using csvdf.columns
.
Adding a Prefix to Column Names
If you have many columns and labeling them is cumbersome, you can add a prefix to the columns using the prefix
parameter:
csvdf = pd.read_csv("file_path.csv", header=None, prefix="Col_")
Setting the Index Column
By default, DataFrames are indexed from 0 to n-1. If you want to use a specific column as the index, you can do so with the index_col
parameter:
csvdf = pd.read_csv("file_path.csv", index_col='company')
# Alternate approach using column index
csvdf = pd.read_csv("file_path.csv", index_col=0)
If it's a multi-index DataFrame, use a list to indicate the index, similar to how we did for the headers.
Reading CSV Files with Defined Columns and Rows
If you don't need all the columns and rows in the provided file, you can filter them using the usecols
and nrows
parameters. For example:
csvdf = pd.read_csv("file_path.csv", index_col=0, usecols=['names', 'gender'], nrows=10)
This means you'll get a DataFrame with only the 'names' and 'gender' columns and only 10 rows of the available rows.
Skipping Rows
If you need to skip specific rows, like those with even row numbers, you can use the skiprows
functionality:
csvdf = pd.read_csv("file_path.csv", index_col=0, skiprows=lambda x: x % 2 == 0)
This uses a lambda function to skip rows where the row number is even.
Writing to CSV
To save a DataFrame to a CSV file, use the to_csv()
method. By default, it also includes the index in the file. You can disable this by setting the index
parameter to `False:
python
csvdf.to_csv("file_name.csv", index=False)
Most of these functionalities ie read_json() and read_excel() are similar _(and I will not be handling that at least for now)_when writing to CSV as when reading from CSV. Note that this is not an exhaustive guide on read_csv
, so I encourage you to explore further and practice to better understand. You can also find free courses on platforms like Simplilearn to boost your skills and resume.
Top comments (3)
Good stuff! I've just started learning pandas myself. Please keep posting these and continue getting more in depth.
I will, and thank you
Interesting, please keep posting