<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ryantjo</title>
    <description>The latest articles on DEV Community by ryantjo (@ryantjo).</description>
    <link>https://dev.to/ryantjo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F400813%2Fa85b7bc1-9ea9-4016-ac12-b5cf49d336ed.png</url>
      <title>DEV Community: ryantjo</title>
      <link>https://dev.to/ryantjo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ryantjo"/>
    <language>en</language>
    <item>
      <title>Resampling Market Tick Data</title>
      <dc:creator>ryantjo</dc:creator>
      <pubDate>Sat, 27 Feb 2021 10:18:11 +0000</pubDate>
      <link>https://dev.to/ryantjo/resampling-market-tick-data-5020</link>
      <guid>https://dev.to/ryantjo/resampling-market-tick-data-5020</guid>
      <description>&lt;h3&gt;
  
  
  Tick Data
&lt;/h3&gt;

&lt;p&gt;Tick data is the stream of individual trades executed on an exchange (usually a stock exchange), with each 'tick' representing a single trade.&lt;/p&gt;

&lt;p&gt;Typically each tick contains a timestamp, trade price, volume and the exchange the trade was executed on. For example, below is a series of ticks for Apple AAPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-02-01 04:00:02:533,133.65,1,ARCX
2021-02-01 04:00:02:533,133.7,4,ARCX
2021-02-01 04:00:03:713,133.71,50,XNGS
2021-02-01 04:00:03:713,134,50,XNGS
2021-02-01 04:00:03:713,133.7,50,ARCX
2021-02-01 04:00:03:932,134,200,XNGS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(note that the timestamps include milliseconds after the final colon)&lt;/p&gt;

&lt;h3&gt;
  
  
  Resampling Tick Data
&lt;/h3&gt;

&lt;p&gt;Tick data is the highest-resolution form of market data and can give deep insight into a market’s microstructure over very short timeframes. However, the sheer volume of the data makes it unwieldy for analysis over longer timeframes (such as a week or more). For those timeframes, intraday bars (or ‘candles’) are the preferred data format. &lt;/p&gt;

&lt;p&gt;A bar is a single data point for a timeframe, containing the open, high, low and close prices plus the volume. For example, below is a series of 1-minute bars for Apple AAPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-01-04 09:30:00,133.52,133.612,132.95,133.15,2328651
2021-01-04 09:31:00,133.13,133.45,133.08,133.335,486524
2021-01-04 09:32:00,133.345,133.36,132.99,133.11,471947
2021-01-04 09:33:00,133.11,133.15,132.71,132.746,477518
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(format: timestamp, open, high, low, close, volume)&lt;/p&gt;

&lt;p&gt;Therefore, a common requirement is to resample tick data into intraday bars. Fortunately, the Pandas Python library has built-in functions that perform this task very efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Worked Example
&lt;/h3&gt;

&lt;p&gt;We start with a tick dataset for AAPL; a free sample can be downloaded from &lt;a href="https://tickhistory.com/free-tick-data"&gt;TickHistory&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you do not already have Python and Pandas installed, a simple solution is to install &lt;a href="https://www.anaconda.com/products/individual#Downloads"&gt;Anaconda&lt;/a&gt; and then use Anaconda to install Pandas. &lt;/p&gt;

&lt;p&gt;Once in Python, import the Pandas package&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, load the data into a dataframe&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_df = pd.read_csv('AAPL_2020_10.txt', names=['timestamp', 'trade_price', 'volume', 'exchange'], index_col=0, parse_dates=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This statement reads the CSV-formatted file (note that it can also read directly from a zip file), names the columns, parses the timestamps into datetime objects, and sets the index to the timestamp column. &lt;/p&gt;
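&lt;p&gt;If the colon before the milliseconds (as in the sample timestamps above) is not inferred automatically by your Pandas version, an explicit format string can be supplied. A minimal sketch (the format string is an assumption based on the sample ticks shown earlier):&lt;/p&gt;

```python
import pandas as pd

# Sketch: explicitly parse a timestamp such as '2021-02-01 04:00:02:533',
# where a colon (rather than the more common dot) separates the milliseconds.
# The format string is assumed from the sample ticks shown earlier.
ts = pd.to_datetime("2021-02-01 04:00:02:533", format="%Y-%m-%d %H:%M:%S:%f")
```

&lt;p&gt;The same conversion can be applied to the whole timestamp column with &lt;code&gt;pd.to_datetime&lt;/code&gt; after loading, before setting it as the index.&lt;/p&gt;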

&lt;p&gt;Once the data has been loaded, we can quickly review the dataframe to ensure it has correctly loaded using the &lt;code&gt;head()&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timestamp           trade_price volume exchange
2021-02-01 04:00:02:533 133.65  1   ARCX
2021-02-01 04:00:02:533 133.7       4   ARCX
2021-02-01 04:00:03:713 133.71  50  XNGS
2021-02-01 04:00:03:713 134     50  XNGS
2021-02-01 04:00:03:713 133.7       50  ARCX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To resample the data, we will use the Pandas &lt;code&gt;resample()&lt;/code&gt; function, aggregating with 'first', 'max', 'min', 'last' and 'sum' for the open, high, low, close and volume datapoints respectively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_1hour_open_df = aapl_df.resample("1H").agg({'trade_price': 'first’}) 
aapl_1hour_high_df = aapl_df.resample("1H").agg({'trade_price': 'high’}) 
aapl_1hour_low_df = aapl_df.resample("1H").agg({'trade_price': 'low’}) 
aapl_1hour_close_df = aapl_df.resample("1H").agg({'trade_price': 'last’}) 
aapl_1hour_volume_df = aapl_df.resample("1H").agg({'volume': 'sum’}) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
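&lt;p&gt;As an aside, the five separate calls can also be collapsed into a single pass using named aggregation. A minimal sketch, with a small inline tick sample standing in for the AAPL dataframe:&lt;/p&gt;

```python
import pandas as pd

# Small inline tick sample standing in for the AAPL dataframe loaded above.
ticks = pd.DataFrame(
    {"trade_price": [133.65, 133.70, 133.71, 134.00],
     "volume": [1, 4, 50, 200]},
    index=pd.to_datetime([
        "2021-02-01 04:00:02.533", "2021-02-01 04:00:02.533",
        "2021-02-01 04:00:03.713", "2021-02-01 05:00:03.932"]),
)

# One resample pass producing all five OHLCV columns via named aggregation.
bars = ticks.resample("1h").agg(
    open=("trade_price", "first"),
    high=("trade_price", "max"),
    low=("trade_price", "min"),
    close=("trade_price", "last"),
    volume=("volume", "sum"),
)
```

&lt;p&gt;This yields the same bars as the five-call approach without the intermediate dataframes.&lt;/p&gt;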



&lt;p&gt;Now we have a separate dataframe for each of the open, high, low, close, volume datapoints. We now need to combine these into a single dataframe using the Pandas &lt;code&gt;concat()&lt;/code&gt; function. Because each input dataframe keeps its original column name, passing &lt;code&gt;keys&lt;/code&gt; produces a two-level column index, so we also flatten the columns back to single names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_1hour_df = pd.concat([aapl_1hour_open_df, aapl_1hour_high_df, aapl_1hour_low_df, aapl_1hour_close_df, aapl_1hour_volume_df], axis=1, keys=['open', 'high', 'low', 'close', 'volume'])
aapl_1hour_df.columns = aapl_1hour_df.columns.droplevel(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we need to remove the zero-volume bars, as the resample function generates a bar for every interval in the 24-hour day, not just the trading hours. This can be done by filtering for volumes above 0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_1hour_df = aapl_1hour_df[aapl_1hour_df.volume &amp;gt; 0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resampling from ticks to 1-hour bars is now complete and the file can be created using the Pandas &lt;code&gt;to_csv()&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aapl_1hour_df.to_csv('file_path')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>pandas</category>
    </item>
    <item>
      <title>Using Pandas to Work with Large Excel Spreadsheets</title>
      <dc:creator>ryantjo</dc:creator>
      <pubDate>Thu, 04 Jun 2020 06:19:01 +0000</pubDate>
      <link>https://dev.to/ryantjo/using-pandas-to-work-with-large-excel-spreadsheets-55ap</link>
      <guid>https://dev.to/ryantjo/using-pandas-to-work-with-large-excel-spreadsheets-55ap</guid>
      <description>&lt;p&gt;Excel has become a mainstay of the finance industry with spreadsheets being the defacto tool for analyzing financial data and in particular time series data. However, a recent trend of using higher resolution data (eg 1-minute trading intervals as opposed to daily data) has exposed a major weakness in Excel - it has a limit of 1 million rows, but in reality performance degrades dramatically on most systems when the row count goes over 500k.&lt;/p&gt;

&lt;p&gt;A common problem we often encounter is how to break large files of time-series data into smaller Excel files that can be worked with. Fortunately, Pandas is ideally suited for this, and in this tutorial I will outline how we use Pandas to generate usable Excel files from large time-series data files. &lt;/p&gt;

&lt;p&gt;Working in a Jupyter notebook, we start by importing Pandas&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We then load the data from a csv file using read_csv. For the purposes of this demo we will use the data provided by  &lt;a href="https://firstratedata.com"&gt;FirstRate Data&lt;/a&gt;&lt;br&gt;
which provides large high-frequency data file samples. In this walkthrough we will use the AAPL (Apple) stock price datafile.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"TimeStamp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"open"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"close"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"volume"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://frd001.s3-us-east-2.amazonaws.com/AAPL_FirstRateDatacom1.zip"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 
                 &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                 &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"TimeStamp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                 &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"TimeStamp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There are a few things to note here. Firstly, we need to ascertain whether the data has a header row containing the column names; if so, we can include header=0 in the read_csv arguments. If not, we can add them by passing a list of column names (i.e. cols in the above sample) to the names parameter.&lt;/p&gt;

&lt;p&gt;By default, read_csv will read a timestamp such as 2019-01-02 04:01:00 as a string, so it needs to be converted to a Timestamp object using parse_dates. Finally, the TimeStamp column needs to be set as the index of the dataframe (otherwise the default integer index will be used). &lt;/p&gt;
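&lt;p&gt;As a quick illustration of why the parsed index matters (using a small inline sample rather than the article’s datafile), date-based selection with loc only works on a DatetimeIndex:&lt;/p&gt;

```python
import pandas as pd

# Small inline sample with a parsed DatetimeIndex.
idx = pd.to_datetime(["2019-01-02 04:01:00",
                      "2019-01-03 09:30:00",
                      "2019-02-01 09:30:00"])
df = pd.DataFrame({"close": [10.0, 11.0, 12.0]}, index=idx)

# Partial-string indexing: a month string selects every row in that month.
january = df.loc["2019-01"]
```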

&lt;p&gt;To check the dataframe we can run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Which should give us a familiar looking OHLCV (open, high, low, close, volume) format dataframe:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zO-OgKxv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b46h2nx9pkm65j5xr66e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zO-OgKxv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b46h2nx9pkm65j5xr66e.png" alt="AAPL Dataframe"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two common ways to break large timeseries files into smaller Excel files. The first is to keep the same data frequency (in this case 1-minute intervals) and filter by dates, in which case we can simply use&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'2019-05-01'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'2019-10-01'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
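&lt;p&gt;Building on the date filter, the split can be automated by grouping the rows by calendar month. A sketch with an inline two-month sample (the file name in the comment is illustrative):&lt;/p&gt;

```python
import pandas as pd
import numpy as np

# Inline two-month sample standing in for the AAPL dataframe.
idx = pd.date_range("2019-01-31 09:30", "2019-02-28 16:00", freq="D")
df = pd.DataFrame({"close": np.arange(len(idx), dtype=float)}, index=idx)

# Group rows by the calendar month of the index.
monthly = {str(period): piece
           for period, piece in df.groupby(df.index.to_period("M"))}

# Each piece could then be written out, e.g.:
# monthly["2019-02"].to_excel(r"path\AAPL_2019-02.xlsx")
```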



&lt;p&gt;The other method is to aggregate the data into longer time intervals; in this example we will aggregate 1-minute data into 1-hour data. This is accomplished by first using the resample method to select the timeframe (in this case 1H for 1-hour) and then using the agg method to aggregate the data. Note that each column has a different aggregation rule, so a dictionary of column/aggregation-function pairs is passed into the agg method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"1H"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'open'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'first'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'close'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'last'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'high'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'max'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'low'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'min'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'volume'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'sum'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Also, note that some older code samples pass a ‘how’ parameter to the resample method, such as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="n"&gt;ohlc&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This parameter has been deprecated by Pandas and is no longer available; the agg method shown above should be used instead. &lt;/p&gt;

&lt;p&gt;To filter out the non-trading days such as weekends and holidays, we can filter the dataframe for rows where the open is above zero.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;open&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Finally, we can save the filtered dataframe as an Excel file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_excel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;r'path\file_name.xlsx'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If the error ModuleNotFoundError: No module named ‘openpyxl’ is encountered, you will need to install openpyxl, which Pandas relies on to write xlsx files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;openpyxl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In a later tutorial we will look at more complex aggregation scenarios such as aggregating tick (ie trade-by-trade) data into OHLCV bars.&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>tutorial</category>
      <category>python</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
