In this article we will look at the several ways pandas lets us read and write files. Pandas is a powerful Python library for data manipulation: it simplifies representing data, handling large data sets, and working with data in a flexible, customizable way.
Installing pandas:
pip install pandas
import pandas as pd
## Reading data from files in pandas
There are several methods provided by pandas to read data from files.
1. read_csv
df = pd.read_csv('data.csv')
The read_csv() method reads a CSV (comma-separated values) file and creates a DataFrame object from it.
Parameters:
- delimiter - This is commonly used when your CSV uses a delimiter other than a comma. Common delimiters in CSVs are the comma, tab, space, colon, semicolon, pipe, etc. The example below reads a CSV that uses a semicolon as its delimiter:
df = pd.read_csv('data.csv', delimiter=';')
- na_values - This parameter specifies the values that should be treated as missing values in the resulting DataFrame. By default, pandas treats values such as NaN, NA, N/A, null, and empty strings as missing. If your data marks missing entries with a string like unknown, you can list that string in the na_values parameter:
df = pd.read_csv('data.csv', na_values=['unknown'])
Rows containing these missing values can then be dropped with the dropna() method:
df = pd.read_csv('data.csv', na_values=['unknown'])
df = df.dropna()
- skiprows - If you need to skip certain rows in your CSV file, use the skiprows parameter to indicate the number of rows to skip. For instance, if the first three rows of your CSV file contain metadata or other irrelevant information, you can skip that portion with:
df = pd.read_csv('my_data.csv', skiprows=3)
These are just a few of the common parameters of the read_csv() method; you can look up others like header and usecols here. A combined sketch of the three parameters above follows.
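Here is a minimal, self-contained sketch that combines the three parameters above (the file contents, column names, and the 'unknown' marker are all made up for illustration):

import io
import pandas as pd

# stand-in for a real CSV file: two metadata lines, a semicolon
# delimiter, and the string 'unknown' marking a missing value
raw = io.StringIO(
    "exported by: some tool\n"
    "export date: 2023-01-01\n"
    "Make;Model;Year\n"
    "Audi;Rs6;2005\n"
    "Toyota;unknown;2015\n"
)

df = pd.read_csv(raw, delimiter=';', na_values=['unknown'], skiprows=2)
print(df)           # 'unknown' now shows up as NaN
print(df.dropna())  # and dropna() removes that row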
2. read_excel
The read_excel() method is used to read Excel files (.xls and .xlsx) and returns a pandas DataFrame.
df = pd.read_excel('mydata.xlsx')
df = pd.read_excel('mydata.xls')
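Note that in recent pandas versions, reading .xlsx files requires the openpyxl package (and legacy .xls files require xlrd); you can also select the engine explicitly:

df = pd.read_excel('mydata.xlsx', engine='openpyxl')  # pip install openpyxl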
Parameters:
- sheet_name - This parameter specifies the name or index of the sheet to read from the Excel file. If the file has multiple sheets, you can use this parameter to select a specific sheet; by default, the first sheet is read. Let's say my Excel file has three sheets named Audi, Toyota, and Volkswagen:
df = pd.read_excel('mydata.xlsx', sheet_name='Audi')
df = pd.read_excel('mydata.xlsx', sheet_name='Toyota')
df = pd.read_excel('mydata.xlsx', sheet_name='Volkswagen')
Unlike read_csv(), read_excel() has no delimiter parameter because the data is not separated by a delimiter. Instead, read_excel() uses the sheet name or index, a range of cells, or column and row labels to select the data to be read.
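If you want every sheet at once, passing sheet_name=None returns a dictionary that maps each sheet name to its DataFrame. A quick sketch, reusing the sheet names from the example above:

all_sheets = pd.read_excel('mydata.xlsx', sheet_name=None)  # dict of DataFrames, one per sheet
audi_df = all_sheets['Audi']  # look up a single sheet by name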
You can check out other parameters related to read_excel here.
3. read_json
read_json() is a pandas method that reads data from a JSON file into a DataFrame. It is often used to read a nested JSON file and extract the essential data from it.
df = pd.read_json('filename.json')
Parameters:
- orient - This parameter specifies the orientation of the JSON file. The default value is columns, which assumes that the JSON file has a column-based structure. Other options include index, which assumes a row-based structure, and records, which assumes the JSON file is a list of records. Suppose we have this as our JSON file:
{
"Make": ["Audi", "Mercedes Benz", "Volkswagen"],
"Model": ["Rs6","S560","Touareg"],
"Engine Size": [4.2,3.0,4.2],
"Fuel": ["Diesel", "Diesel", "CNG"],
"Year Of Manufacture": [2005, 2018, 2015]
}
df = pd.read_json("data.json", orient="columns")
Output:
            Make    Model  Engine Size    Fuel  Year Of Manufacture
0           Audi      Rs6          4.2  Diesel                 2005
1  Mercedes Benz     S560          3.0  Diesel                 2018
2     Volkswagen  Touareg          4.2     CNG                 2015
In this example, we set the orient option to 'columns' to tell pandas that the data in the JSON file is ordered by columns. As a result, read_json() produced a DataFrame with the columns Make, Model, Engine Size, Fuel, and Year Of Manufacture, filled with the corresponding data from the JSON file.
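For comparison, here is a small sketch of the records orientation, where the JSON is a list of row objects (the data is shortened for illustration):

import io
import pandas as pd

records = io.StringIO('[{"Make": "Audi", "Model": "Rs6"}, {"Make": "Volkswagen", "Model": "Touareg"}]')
df = pd.read_json(records, orient='records')
print(df)  # one row per record, columns Make and Model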
There are other parameters like typ, which specifies the type of object to create from the JSON file; you can read more about read_json here.
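As a quick sketch of typ, passing typ='series' builds a Series instead of a DataFrame from a flat JSON object:

import io
import pandas as pd

s = pd.read_json(io.StringIO('{"Audi": 4.2, "Mercedes Benz": 3.0}'), typ='series')
print(s)  # a Series indexed by the JSON keys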
4. read_sql
This pandas method allows you to read data from a SQL database into a DataFrame. I'm going to use SQLite to illustrate the examples.
import pandas as pd
import sqlite3
conn = sqlite3.connect('data.db')
sql_query = 'SELECT * FROM some_table'  # placeholder for any valid SQL query string
df = pd.read_sql(sql_query, conn)
Parameters:
- sql_query: A string containing the SQL query that you want to execute. This can be a SELECT statement, a JOIN, or any other valid SQL command that returns data.
- con: A database connection object. This is an instance of any Python library that provides a connection to a database.
- params: A list of parameters to pass to the SQL query. This is useful for executing parameterized queries that contain placeholders for user input.
import pandas as pd
import sqlite3
conn = sqlite3.connect('orders.db')
orderID = 233
df = pd.read_sql('SELECT * FROM orders WHERE orderID = ?', conn, params=[orderID])
In this example, we define a variable orderID that contains the order ID we want to retrieve from the orders table. We then define a SQL query that uses a parameter placeholder (?) in place of the orderID, and pass the query to the read_sql() method along with the conn object that represents our connection to the SQLite database.
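Here is a fully self-contained version of this example using an in-memory SQLite database (the orders table and its columns are invented for illustration):

import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (orderID INTEGER, item TEXT)')
conn.execute("INSERT INTO orders VALUES (233, 'keyboard'), (234, 'mouse')")
conn.commit()

# the ? placeholder keeps user input out of the SQL string itself
df = pd.read_sql('SELECT * FROM orders WHERE orderID = ?', conn, params=[233])
print(df)  # a single row with orderID 233
conn.close()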
There are other parameters you can use in the read_sql method; read more about them here.
Those are the common methods used to read data in pandas; there are others like read_html, read_pickle, read_table, etc.
## Writing to files using pandas
Pandas offers various methods to export data from a DataFrame to a file. The supported file types include Excel workbooks, CSV files, SQL databases, JSON files, and others.
1. to_excel
This method allows you to write a DataFrame to an Excel file:
df.to_excel('data.xlsx', index=False)
Parameters:
- sheet_name: The name of the sheet to write the DataFrame to.
- index: A Boolean value that indicates whether to include the DataFrame's index in the output file.
- excel_writer: The name of the Excel file to write to, or an instance of ExcelWriter that allows you to write to multiple sheets in the same file.
writer = pd.ExcelWriter('data.xlsx', engine='xlsxwriter')
orders_df.to_excel(writer, sheet_name='Orders', index=False)
writer.close()  # closing the writer saves the file to disk
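The same idea extends to several sheets in one workbook; using ExcelWriter as a context manager also saves and closes the file automatically (orders_df and customers_df are placeholder DataFrames):

with pd.ExcelWriter('data.xlsx', engine='xlsxwriter') as writer:
    orders_df.to_excel(writer, sheet_name='Orders', index=False)
    customers_df.to_excel(writer, sheet_name='Customers', index=False)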
You can read more about the to_excel parameters here.
2. to_csv
This method is used to write a pandas DataFrame to a CSV file.
Parameters:
- sep: The delimiter that will be used between fields in the output file. The default is a comma (','); you can also use others like ';', '|', or ' '.
- encoding: The character encoding to use for the output file. The default is 'utf-8'; other encodings include ascii, utf_32, etc. You can find the full list here.
- mode: The mode to open the file in. The default is 'w', which overwrites any existing file; you can also use 'a', which appends data to the file on every write.
df.to_csv('data.csv', sep=';', index=True, header=True, encoding='utf-8', mode='w')
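For example, appending new rows to an existing CSV with mode='a' (new_rows is a placeholder DataFrame; header=False avoids repeating the header line on each append):

df.to_csv('data.csv', index=False)  # first write creates the file with a header
new_rows.to_csv('data.csv', mode='a', index=False, header=False)  # later writes append rows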
Pandas also provides other methods to get data out of a DataFrame, like to_json, to_sql, to_dict, to_numpy, and many more. You can check the rest of the methods in the pandas documentation.
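For instance, to_json mirrors read_json and accepts the same orient options; a quick sketch:

import pandas as pd

df = pd.DataFrame({'Make': ['Audi', 'Toyota'], 'Model': ['Rs6', 'Corolla']})
df.to_json('cars.json', orient='records')  # write the DataFrame as a list of records
print(df.to_json(orient='columns'))        # or get the JSON back as a string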
## Methods used to manipulate data
- head(): Returns the first n rows of the DataFrame. By default, it returns the first 5 rows.
- tail(): Returns the last n rows of the DataFrame. By default, it returns the last 5 rows.
- dropna(): Removes any rows that contain missing values.
- fillna(value): Replaces missing values with the specified value.
- groupby(column): Groups the DataFrame by the specified column and returns a GroupBy object. You can then use this object to perform various aggregation functions on the groups.
There are other DataFrame methods like sort_values(), pivot(), merge(), and more; take a look at them here. A short sketch of sort_values() and merge() follows the example below.
df = pd.read_excel('data.xlsx', sheet_name='Audi', header=0, usecols=[0, 1, 3], skiprows=2, nrows=100)
# print the first 5 rows
print(df.head())
# print summary statistics for each column
print(df.describe())
# remove any rows with missing values
df = df.dropna()
# fill missing values with 0
df = df.fillna(0)
# group by category and calculate the mean of the numeric columns
grouped_df = df.groupby('category').mean(numeric_only=True)
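And a short sketch of the sort_values() and merge() methods mentioned above (the column names and the orders_df/customers_df DataFrames are illustrative):

sorted_df = df.sort_values('Engine Size', ascending=False)  # sort rows by a column, largest first
merged_df = pd.merge(orders_df, customers_df, on='customerID')  # join two DataFrames on a shared key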
This is just the beginning of the many impressive capabilities of this library. It offers numerous distinctive features that simplify manipulating, analyzing, and visualizing data for data analysts, data scientists, and anyone else working with data.
"Data is like garbage. You'd better know what you are going to do with it before you collect it." - Mark Twain