Using Pandas to Work with Large Excel Spreadsheets

Excel has become a mainstay of the finance industry, with spreadsheets being the de facto tool for analyzing financial data, in particular time series data. However, a recent trend towards higher-resolution data (e.g. 1-minute trading intervals as opposed to daily data) has exposed a major weakness in Excel: it has a hard limit of 1,048,576 rows, and in practice performance degrades dramatically on most systems once the row count exceeds 500k.

A problem we often encounter is how to break large files of time-series data into smaller Excel files that can be worked with. Fortunately, Pandas is ideally suited for this, and in this tutorial I will outline how we use Pandas to generate usable Excel files from large time-series data files.

Working in a Jupyter notebook, we start by importing Pandas:

import pandas as pd 

We then load the data from a CSV file using read_csv. For the purposes of this demo we will use the data provided by FirstRate Data, which provides large high-frequency data file samples. In this walkthrough we will use the AAPL (Apple) stock price data file.

cols = ["TimeStamp", "open", "high", "low", "close", "volume"]
df = pd.read_csv("https://frd001.s3-us-east-2.amazonaws.com/AAPL_FirstRateDatacom1.zip",
                 names=cols,
                 parse_dates=["TimeStamp"],
                 index_col=["TimeStamp"])

There are a few things to note here. Firstly, we need to ascertain whether the data has a header row containing the column names; if so, we can include header=0 in the read_csv arguments. If not, we can supply them by passing a list of column names (i.e. cols in the sample above) to the names parameter.
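For example, if the file did include a header row, the load would look like this (assuming the header names the timestamp column TimeStamp):

df = pd.read_csv("https://frd001.s3-us-east-2.amazonaws.com/AAPL_FirstRateDatacom1.zip",
                 header=0,
                 parse_dates=["TimeStamp"],
                 index_col=["TimeStamp"])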

By default, read_csv will read a timestamp such as 2019-01-02 04:01:00 as a string, so it needs to be converted to a Timestamp object using parse_dates. Finally, the TimeStamp column needs to be set as the index of the dataframe (otherwise the default integer index will be used).

To check the dataframe we can run

df.head()

Which should give us a familiar looking OHLCV (open, high, low, close, volume) format dataframe:

AAPL Dataframe

There are two common ways large time-series files are broken into smaller Excel files. The first is to keep the same data frequency (in this case 1-minute intervals) and filter by dates, in which case we can simply use

filtered_df = df.loc['2019-05-01':'2019-10-01']
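Note that partial string indexing on a DatetimeIndex also works at coarser granularity, so, for example, a single month can be selected without spelling out its start and end dates:

filtered_df = df.loc['2019-05']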

The other method is to aggregate the data into longer time intervals; in this example we will aggregate the 1-minute data into 1-hour data. This is accomplished by first using the resample method to select the timeframe (in this case 1H for 1-hour) and then using the agg method to aggregate the data. Note that each column has its own aggregation rule, so a dictionary mapping each column to its aggregation method is passed into agg.

filtered_df = df.resample("1H").agg({'open': 'first', 'close': 'last', 'high': 'max', 'low': 'min', 'volume': 'sum'})
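The same pattern works for any frequency string Pandas understands; for example, swapping 1H for 1D would produce daily bars from the same 1-minute data:

daily_df = df.resample("1D").agg({'open': 'first', 'close': 'last', 'high': 'max', 'low': 'min', 'volume': 'sum'})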

Also, note that some older code samples call the resample method with a 'how' parameter, such as

how='ohlc'

This parameter has been deprecated by Pandas and is no longer available.

To filter out non-trading periods such as weekends and holidays, we can filter the dataframe for rows where the open is above zero. This works because resample creates a row for every interval, including non-trading ones, and those rows contain only NaN values, which fail the comparison.

filtered_df = filtered_df[filtered_df.open > 0]
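An equivalent approach is to drop those all-NaN rows directly:

filtered_df = filtered_df.dropna(subset=['open'])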

Finally, we can save the filtered dataframe as an Excel file. Note that we keep the default index=True so that the TimeStamp index is written out as the first column:

filtered_df.to_excel(r'path\file_name.xlsx')

If the error ModuleNotFoundError: No module named 'openpyxl' is encountered, you will need to install openpyxl, as Pandas relies on it to write xlsx files:

pip install openpyxl
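Since the goal is to break one large file into several workable Excel files, the write step can also be wrapped in a loop. A minimal sketch (the output filename pattern is just an example) writes one workbook per calendar month:

for month_end, month_df in filtered_df.groupby(pd.Grouper(freq="M")):
    # skip months with no trading data, write the rest to their own file
    if not month_df.empty:
        month_df.to_excel(f"AAPL_{month_end:%Y_%m}.xlsx")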

In a later tutorial we will look at more complex aggregation scenarios, such as aggregating tick (i.e. trade-by-trade) data into OHLCV bars.
