<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ryantjo</title>
    <description>The latest articles on DEV Community by ryantjo (@ryantjo).</description>
    <link>https://dev.to/ryantjo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F400813%2Fa85b7bc1-9ea9-4016-ac12-b5cf49d336ed.png</url>
      <title>DEV Community: ryantjo</title>
      <link>https://dev.to/ryantjo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ryantjo"/>
    <language>en</language>
    <item>
      <title>Resampling Market Tick Data</title>
      <dc:creator>ryantjo</dc:creator>
      <pubDate>Sat, 27 Feb 2021 10:18:11 +0000</pubDate>
      <link>https://dev.to/ryantjo/resampling-market-tick-data-5020</link>
      <guid>https://dev.to/ryantjo/resampling-market-tick-data-5020</guid>
      <description>&lt;h3&gt;
  
  
  Tick Data
&lt;/h3&gt;

&lt;p&gt;Tick data is the stream of individual trades executed on an exchange (usually a stock exchange), with each 'tick' representing a single trade.&lt;/p&gt;

&lt;p&gt;Typically each tick contains a timestamp, trade price, volume and the exchange the trade was executed on. For example, below is a series of ticks for Apple AAPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-02-01 04:00:02:533,133.65,1,ARCX
2021-02-01 04:00:02:533,133.7,4,ARCX
2021-02-01 04:00:03:713,133.71,50,XNGS
2021-02-01 04:00:03:713,134,50,XNGS
2021-02-01 04:00:03:713,133.7,50,ARCX
2021-02-01 04:00:03:932,134,200,XNGS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(note that the timestamps include milliseconds after the final colon)&lt;/p&gt;

&lt;h3&gt;
  
  
  Resampling Tick Data
&lt;/h3&gt;

&lt;p&gt;Tick data is the highest-resolution form of market data and can give deep insight into a market’s microstructure over very short timeframes. However, the sheer volume of the data makes it unwieldy for analysis over longer timeframes (such as a week or more). For those timeframes, intraday bars (or ‘candles’) are the preferred data format. &lt;/p&gt;

&lt;p&gt;A bar is a single data point for a timeframe, containing the open, high, low and close prices plus the volume. For example, below is a series of 1-minute bars for Apple AAPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-01-04 09:30:00,133.52,133.612,132.95,133.15,2328651
2021-01-04 09:31:00,133.13,133.45,133.08,133.335,486524
2021-01-04 09:32:00,133.345,133.36,132.99,133.11,471947
2021-01-04 09:33:00,133.11,133.15,132.71,132.746,477518
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(format: timestamp, open, high, low, close, volume)&lt;/p&gt;

&lt;p&gt;Therefore, a common requirement is to resample tick data into intraday bars. Fortunately, the Pandas Python library has built-in functions that perform this task very efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Worked Example
&lt;/h3&gt;

&lt;p&gt;We start with a tick dataset for AAPL; a free sample can be downloaded from &lt;a href="https://tickhistory.com/free-tick-data"&gt;TickHistory&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you do not already have Python and Pandas installed, a simple solution is to install &lt;a href="https://www.anaconda.com/products/individual#Downloads"&gt;Anaconda&lt;/a&gt; and then use Anaconda to install Pandas. &lt;/p&gt;

&lt;p&gt;Once in Python, import the Pandas package&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, load the data into a dataframe&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_df = pd.read_csv('AAPL_2020_10.txt', names=['timestamp', 'trade_price', 'volume', 'exchange'], index_col=0, parse_dates=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This statement reads the CSV-formatted file (note that it can also read directly from a zip file), names the columns, parses the timestamps into datetime objects, and sets the index to the timestamp column. &lt;/p&gt;
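&lt;p&gt;If the colon before the milliseconds (as in the sample timestamps above) is not inferred automatically by your Pandas version, an explicit format string can be supplied. A minimal sketch (the format string is an assumption based on the sample ticks shown earlier):&lt;/p&gt;

```python
import pandas as pd

# Sketch: explicitly parse a timestamp such as '2021-02-01 04:00:02:533',
# where a colon (rather than the more common dot) separates the milliseconds.
# The format string is assumed from the sample ticks shown earlier.
ts = pd.to_datetime("2021-02-01 04:00:02:533", format="%Y-%m-%d %H:%M:%S:%f")
```

&lt;p&gt;The same conversion can be applied to the whole timestamp column with &lt;code&gt;pd.to_datetime&lt;/code&gt; after loading, before setting it as the index.&lt;/p&gt;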

&lt;p&gt;Once the data has been loaded, we can quickly review the dataframe to ensure it has correctly loaded using the &lt;code&gt;head()&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timestamp           trade_price volume exchange
2021-02-01 04:00:02:533 133.65  1   ARCX
2021-02-01 04:00:02:533 133.7       4   ARCX
2021-02-01 04:00:03:713 133.71  50  XNGS
2021-02-01 04:00:03:713 134     50  XNGS
2021-02-01 04:00:03:713 133.7       50  ARCX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To resample the data, we will use the Pandas &lt;code&gt;resample()&lt;/code&gt; function, aggregating with 'first', 'max', 'min', 'last' and 'sum' for the open, high, low, close and volume datapoints respectively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_1hour_open_df = aapl_df.resample("1H").agg({'trade_price': 'first’}) 
aapl_1hour_high_df = aapl_df.resample("1H").agg({'trade_price': 'high’}) 
aapl_1hour_low_df = aapl_df.resample("1H").agg({'trade_price': 'low’}) 
aapl_1hour_close_df = aapl_df.resample("1H").agg({'trade_price': 'last’}) 
aapl_1hour_volume_df = aapl_df.resample("1H").agg({'volume': 'sum’}) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
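&lt;p&gt;As an aside, the five separate calls can also be collapsed into a single pass using named aggregation. A minimal sketch, with a small inline tick sample standing in for the AAPL dataframe:&lt;/p&gt;

```python
import pandas as pd

# Small inline tick sample standing in for the AAPL dataframe loaded above.
ticks = pd.DataFrame(
    {"trade_price": [133.65, 133.70, 133.71, 134.00],
     "volume": [1, 4, 50, 200]},
    index=pd.to_datetime([
        "2021-02-01 04:00:02.533", "2021-02-01 04:00:02.533",
        "2021-02-01 04:00:03.713", "2021-02-01 05:00:03.932"]),
)

# One resample pass producing all five OHLCV columns via named aggregation.
bars = ticks.resample("1h").agg(
    open=("trade_price", "first"),
    high=("trade_price", "max"),
    low=("trade_price", "min"),
    close=("trade_price", "last"),
    volume=("volume", "sum"),
)
```

&lt;p&gt;This yields the same bars as the five-call approach without the intermediate dataframes.&lt;/p&gt;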



&lt;p&gt;Now we have a separate dataframe for each of the open, high, low, close, volume datapoints. We now need to combine these into a single dataframe using the Pandas &lt;code&gt;concat()&lt;/code&gt; function. Because each input dataframe keeps its original column name, passing &lt;code&gt;keys&lt;/code&gt; produces a two-level column index, so we also flatten the columns back to single names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_1hour_df = pd.concat([aapl_1hour_open_df, aapl_1hour_high_df, aapl_1hour_low_df, aapl_1hour_close_df, aapl_1hour_volume_df], axis=1, keys=['open', 'high', 'low', 'close', 'volume'])
aapl_1hour_df.columns = aapl_1hour_df.columns.droplevel(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we need to remove the zero-volume bars, as the resample function generates a bar for every interval in the 24-hour day, not just the trading hours. This can be done by filtering for volumes above 0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_1hour_df = aapl_1hour_df[aapl_1hour_df.volume &amp;gt; 0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resampling from ticks to 1-hour bars is now complete and the file can be created using the Pandas &lt;code&gt;to_csv()&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_1hour_df.to_csv('file_path')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>pandas</category>
    </item>
    <item>
      <title>Using Pandas to Work with Large Excel Spreadsheets</title>
      <dc:creator>ryantjo</dc:creator>
      <pubDate>Thu, 04 Jun 2020 06:19:01 +0000</pubDate>
      <link>https://dev.to/ryantjo/using-pandas-to-work-with-large-excel-spreadsheets-55ap</link>
      <guid>https://dev.to/ryantjo/using-pandas-to-work-with-large-excel-spreadsheets-55ap</guid>
      <description>&lt;p&gt;Excel has become a mainstay of the finance industry with spreadsheets being the defacto tool for analyzing financial data and in particular time series data. However, a recent trend of using higher resolution data (eg 1-minute trading intervals as opposed to daily data) has exposed a major weakness in Excel - it has a limit of 1 million rows, but in reality performance degrades dramatically on most systems when the row count goes over 500k.&lt;/p&gt;

&lt;p&gt;A common problem we often encounter is how to break large files of time-series data into smaller Excel files that can be worked with. Fortunately, Pandas is ideally suited for this, and in this tutorial I will outline how we use Pandas to generate usable Excel files from large time-series data files. &lt;/p&gt;

&lt;p&gt;Working in a Jupyter notebook, we start by importing Pandas&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We then load the data from a csv file using read_csv. For the purposes of this demo we will use the data provided by  &lt;a href="https://firstratedata.com"&gt;FirstRate Data&lt;/a&gt;&lt;br&gt;
which provides large high-frequency data file samples. In this walkthrough we will use the AAPL (Apple) stock price datafile.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"TimeStamp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"open"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"close"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"volume"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://frd001.s3-us-east-2.amazonaws.com/AAPL_FirstRateDatacom1.zip"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 
                 &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                 &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"TimeStamp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                 &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"TimeStamp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There are a few things to note here. Firstly, we need to ascertain whether the data has a header row containing the column names; if so, we can include header=0 in the read_csv arguments. If not, we can add them by passing a list of column names (i.e. cols in the above sample) to the names parameter.&lt;/p&gt;

&lt;p&gt;By default, read_csv will read a timestamp such as 2019-01-02 04:01:00 as a string, so it needs to be converted to a Timestamp object using parse_dates. Finally, the TimeStamp column needs to be set as the index of the dataframe (otherwise the default integer index will be used). &lt;/p&gt;
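&lt;p&gt;As a quick illustration of why the parsed index matters (using a small inline sample rather than the article’s datafile), date-based selection with loc only works on a DatetimeIndex:&lt;/p&gt;

```python
import pandas as pd

# Small inline sample with a parsed DatetimeIndex.
idx = pd.to_datetime(["2019-01-02 04:01:00",
                      "2019-01-03 09:30:00",
                      "2019-02-01 09:30:00"])
df = pd.DataFrame({"close": [10.0, 11.0, 12.0]}, index=idx)

# Partial-string indexing: a month string selects every row in that month.
january = df.loc["2019-01"]
```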

&lt;p&gt;To check the dataframe we can run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Which should give us a familiar looking OHLCV (open, high, low, close, volume) format dataframe:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zO-OgKxv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b46h2nx9pkm65j5xr66e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zO-OgKxv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b46h2nx9pkm65j5xr66e.png" alt="AAPL Dataframe"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two common ways to break large timeseries files into smaller Excel files. The first is to keep the same data frequency (in this case 1-minute intervals) and filter by dates, in which case we can simply use&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'2019-05-01'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'2019-10-01'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
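&lt;p&gt;Building on the date filter, the split can be automated by grouping the rows by calendar month. A sketch with an inline two-month sample (the file name in the comment is illustrative):&lt;/p&gt;

```python
import pandas as pd
import numpy as np

# Inline two-month sample standing in for the AAPL dataframe.
idx = pd.date_range("2019-01-31 09:30", "2019-02-28 16:00", freq="D")
df = pd.DataFrame({"close": np.arange(len(idx), dtype=float)}, index=idx)

# Group rows by the calendar month of the index.
monthly = {str(period): piece
           for period, piece in df.groupby(df.index.to_period("M"))}

# Each piece could then be written out, e.g.:
# monthly["2019-02"].to_excel(r"path\AAPL_2019-02.xlsx")
```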



&lt;p&gt;The other method is to aggregate the data into longer time intervals; in this example we will aggregate 1-minute data into 1-hour data. This is accomplished by first using the resample method to select the timeframe (in this case 1H for 1-hour) and then using the agg method to aggregate the data. Note that each column has a different aggregation rule, so a dictionary of column/aggregation-function pairs is passed into the agg method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"1H"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'open'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'first'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'close'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'last'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'high'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'max'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'low'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'min'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'volume'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'sum'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Also, note that some older code samples pass a ‘how’ parameter to the resample method, such as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="n"&gt;ohlc&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This parameter has been deprecated by Pandas and is no longer available; the agg method shown above should be used instead. &lt;/p&gt;

&lt;p&gt;To filter out the non-trading days such as weekends and holidays, we can filter the dataframe for rows where the open is above zero.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;open&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Finally, we can save the filtered dataframe as an Excel file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_excel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;r'path\file_name.xlsx'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If the error ModuleNotFoundError: No module named ‘openpyxl’ is encountered, you will need to install openpyxl, which Pandas relies on to write xlsx files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;openpyxl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In a later tutorial we will look at more complex aggregation scenarios such as aggregating tick (ie trade-by-trade) data into OHLCV bars.&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>tutorial</category>
      <category>python</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
