Hi guys. Today I would like to talk about implementing Object-oriented (OO) Design to build a Business Intelligence module with Python and Pandas.
Pandas is widely used for data analytics, it is easy to use, one can easily build a script of hundreds of lines of codes using pandas to handle complex data. However, having worked with many data analysts, I found that most of the scripts produced by data analysts should fall into the category of "procedural code", which is not reusable. Changing the context a bit would often result in a complete re-write of the code. It is fine if you are in the "research" area where your code is used for "exploration" or "experimental" purposes, but if you are to write code that contributes to a larger system or code that will be run repeatedly, you need to implement some architecture.
My Background
I work in finance where I need to handle data with complex relationship daily. The raw data that I received is usually too "raw" such that extracting any useful information from it often requires a lot of steps. Code could easily become impossible to maintain due to the mismatch of raw data structure and business logic.
The solution
Soon I realized that I needed a OO layer on top of the raw data to handle all the complexity. The reason I use the term OO layer here is that OO is essentially an abstraction layer. It allows you to focus on business logic only, instead of having to worry about how to implement business from raw data. Architecture is all about reducing your cognitive burden when you try to edit a codebase.
The Data
We will try to build a OO model for some trade data:
Account, | Stock_Code, | BuySell, | Date, | Quantity, | Unit_Price |
---|---|---|---|---|---|
1001 | 001 | B | 20190105 | 8 | 2 |
1001 | 002 | B | 20190105 | 4 | 3 |
1001 | 002 | S | 20190106 | 3 | 1 |
1001 | 003 | B | 20190106 | 6 | 5 |
1001 | 003 | S | 20190106 | 4 | 6 |
1001 | 001 | S | 20190107 | 6 | 3 |
1002 | 001 | B | 20190105 | 6 | 2 |
1002 | 002 | B | 20190105 | 8 | 3 |
1002 | 002 | S | 20190106 | 3 | 1 |
1002 | 003 | B | 20190106 | 5 | 5 |
1002 | 003 | S | 20190106 | 4 | 6 |
1002 | 001 | S | 20190107 | 6 | 3 |
some account information:
Account, | Name |
---|---|
1001 | David |
1002 | Tom |
and some closing price data:
Stock_Code, | Closing_Price, | Data |
---|---|---|
001 | 2 | 20190105 |
001 | 3 | 20190106 |
001 | 2 | 20190107 |
002 | 2 | 20190105 |
002 | 3 | 20190106 |
002 | 5 | 20190107 |
003 | 5 | 20190105 |
003 | 6 | 20190106 |
003 | 7 | 20190107 |
First, we will have a "Data Layer" to handle the i/o of source data. since we don't have the actual data source, we will
mock it up:
import pandas as pd
from io import StringIO
class Data():
def trade_df(self):
# This method should handle the import of source data
data = StringIO()
s = """
Account|Stock_Code|BuySell|Date|Quantity|Unit_Price
1001|001|B|20190105|8|2
1001|002|B|20190105|4|3
1001|002|S|20190106|3|1
1001|003|B|20190106|6|5
1001|003|S|20190106|4|6
1001|001|S|20190107|6|3
1002|001|B|20190105|6|2
1002|002|B|20190105|8|3
1002|002|S|20190106|3|1
1002|003|B|20190106|5|5
1002|003|S|20190106|4|6
1002|001|S|20190107|6|3
"""
data.write(s.replace(' ',''))
data.seek(0)
df = pd.read_csv(data,sep='|',dtype={'Account':str,
'Date':str,
'Stock_Code':str,
"Date":str})
df.Date = pd.to_datetime(df.Date,format='%Y%m%d')
return df
def account_df(self):
data = StringIO()
s = """
Account|Name
1001|David
1002|Tom
"""
data.write(s.replace(' ',''))
data.seek(0)
df = pd.read_csv(data, sep='|', dtype={'Account': str,
'Name': str})
return df
def stock_prices(self):
data = StringIO()
s = """
Stock_Code|Closing_Price|Date
001|2|20190105
001|3|20190106
001|2|20190107
002|2|20190105
002|3|20190106
002|5|20190107
003|5|20190105
003|6|20190106
003|7|20190107
"""
data.write(s.replace(' ',''))
data.seek(0)
df = pd.read_csv(data, sep='|', dtype={'Stock_Code': str,
'Date': str})
df.Date = pd.to_datetime(df.Date,format='%Y%m%d')
return df
The Data
object should handle the import and validation of the data only, and it should not implement and business logic.
Then on top of the Data
object, we will build our first OO layer:
class Book():
def __init__(self):
self._data = Data()
@property
def trade_book_df(self):
if not hasattr(self,'_trade_book_df'):
df = self._data.trade_df().join(self._data.account_df().set_index('Account'),on='Account')
df['Trade_Amount'] = df.Quantity * df.Unit_Price
self._trade_book_df = df
return self._trade_book_df
def stock_prices(self,date=None):
df = self._data.stock_prices()
if date:
df = df[df.Date == date]
return df
@property
def holdings(self):
return Holdings(self)
@property
def accounts(self):
return Accounts(self.holdings)
Book().trade_book_df
| | Account | Stock_Code | BuySell | Date | Quantity | Unit_Price | Name | Trade_Amount |
|---:|----------:|-------------:|:----------|:--------------------|-----------:|-------------:|:-------|---------------:|
| 0 | 1001 | 001 | B | 2019-01-05 00:00:00 | 8 | 2 | David | 16 |
| 1 | 1001 | 002 | B | 2019-01-05 00:00:00 | 4 | 3 | David | 12 |
| 2 | 1001 | 002 | S | 2019-01-06 00:00:00 | 3 | 1 | David | 3 |
| 3 | 1001 | 003 | B | 2019-01-06 00:00:00 | 6 | 5 | David | 30 |
| 4 | 1001 | 003 | S | 2019-01-06 00:00:00 | 4 | 6 | David | 24 |
| 5 | 1001 | 001 | S | 2019-01-07 00:00:00 | 6 | 3 | David | 18 |
| 6 | 1002 | 001 | B | 2019-01-05 00:00:00 | 6 | 2 | Tom | 12 |
| 7 | 1002 | 002 | B | 2019-01-05 00:00:00 | 8 | 3 | Tom | 24 |
| 8 | 1002 | 002 | S | 2019-01-06 00:00:00 | 3 | 1 | Tom | 3 |
| 9 | 1002 | 003 | B | 2019-01-06 00:00:00 | 5 | 5 | Tom | 25 |
| 10 | 1002 | 003 | S | 2019-01-06 00:00:00 | 4 | 6 | Tom | 24 |
| 11 | 1002 | 001 | S | 2019-01-07 00:00:00 | 6 | 3 | Tom | 18 |
- The
Book
object has a_data
attribute which owns theData
object. - The
trade_book_df
property implemented two business logic:- The joining of
trade_df
andaccount_df
. - The definition of
Trade_Amount
->Quantity
*Unit_Price
.
- The joining of
- The
stock_prices
method extracts stock prices of a specific date. - It provides the gateways to the
Holdings
andAccounts
context which we are going to talk about.
Contexts
"Context" is the dimension you want to view the data, all the views with the same dimension should be encapsulated
in one object.
For example, to view the data in the "Holding" context, we need to:
- aggregate the trade data up to a certain point of time.
- multiply holding quantity and stock's closing price to obtain the market value.
class Holdings():
def __init__(self,book):
self._book = book
def holdings_of(self,date):
trades_df = self._book.trade_book_df
date_hld_df = trades_df[trades_df.Date <= date]
date_hld_df['qnt_change'] = date_hld_df['BuySell'].map({'B':1,'S':-1}) * date_hld_df.Quantity
hld_df = date_hld_df.groupby(['Account','Stock_Code'],as_index=False)\
.agg({'qnt_change':sum})\
.rename(columns={'qnt_change':'Holdings'})
hld_df = hld_df.join(self._book.stock_prices(date).set_index('Stock_Code'),on='Stock_Code')
hld_df['Market_Value'] = hld_df.Closing_Price * hld_df.Holdings
return hld_df
from datetime import date
Book().holdings.holdings_of(date(2019,1,6))
| | Account | Stock_Code | Holdings | Closing_Price | Date | Market_Value |
|---:|----------:|-------------:|-----------:|----------------:|:--------------------|---------------:|
| 0 | 1001 | 001 | 8 | 3 | 2019-01-06 00:00:00 | 24 |
| 1 | 1001 | 002 | 1 | 3 | 2019-01-06 00:00:00 | 3 |
| 2 | 1001 | 003 | 2 | 6 | 2019-01-06 00:00:00 | 12 |
| 3 | 1002 | 001 | 6 | 3 | 2019-01-06 00:00:00 | 18 |
| 4 | 1002 | 002 | 5 | 3 | 2019-01-06 00:00:00 | 15 |
| 5 | 1002 | 003 | 1 | 6 | 2019-01-06 00:00:00 | 6 |
You can then build context on top of context. For example you can have an Accounts
context built on top of the
Holdings
context:
class Accounts():
def __init__(self,holdings):
self._holdings = holdings
def account_value(self,date):
df = self._holdings.holdings_of(date)
return df.groupby('Account',as_index=False).agg({'Market_Value':sum,
'Date':'first'})
Book().accounts.account_value(date(2019,1,6))
| | Account | Market_Value | Date |
|---:|----------:|---------------:|:--------------------|
| 0 | 1001 | 39 | 2019-01-06 00:00:00 |
| 1 | 1002 | 39 | 2019-01-06 00:00:00 |
Summary
The above is a simple example of how OO Design can be applied to your pandas script. The architecture allows you to scale up easily as the number of data source increase. This has been just a very brief introduction, there are many more modeling techniques you can use to manage complex data. Finally, I want to stress that these techniques are becoming more and more important.
Traditionally, this kind of problem can be handled by the database and relational model, but given the growing complexity and scatteredness of the data, you are more likely to encounter situations where your database cannot handle everything and you will need some supplement from your code,
especially when you need data from multiple data source.
Top comments (0)