Haseeb Mohammed

Posted on Feb 11, 2020

Making $$$ with ML

#machinelearning

You've got $1000 to burn. You've decided you want to invest in the stock market, specifically Tesla.
Let's see if we can use Machine Learning to optimize our returns.

Download TSLA.csv from here: https://www.kaggle.com/timoboz/tesla-stock-data-from-2010-to-2020/data

Let's get started.

Let's first do some Exploratory Data Analysis (EDA) on the file we've got.

This file is a comma-separated values (CSV) file with 7 columns.

The columns are:

Date
Opening price
Highest price that day
Lowest price that day
Closing price
Adjusted closing price, taking splits etc. into account
Trading volume

# Importing pandas. "pandas is a fast, powerful, flexible and easy to use open source data
# analysis and manipulation tool, built on top of the Python programming language."
import pandas as pd
pd.options.display.max_rows = 30
# Read in the CSV, save it to a pandas dataframe variable called 'tsla_data'.
tsla_data = pd.read_csv("TSLA.csv")

# .head() gives us the first 5 rows of the data frame.
# You can also pass .head() a parameter to return any number of rows. Like .head(10) for 10 rows.
tsla_data.head()

	Date	Open	High	Low	Close	Adj Close	Volume
0	2010-06-29	19.000000	25.00	17.540001	23.889999	23.889999	18766300
1	2010-06-30	25.790001	30.42	23.299999	23.830000	23.830000	17187100
2	2010-07-01	25.000000	25.92	20.270000	21.959999	21.959999	8218800
3	2010-07-02	23.000000	23.10	18.709999	19.200001	19.200001	5139800
4	2010-07-06	20.000000	20.00	15.830000	16.110001	16.110001	6866900

# .shape tells us the number of rows, and the number of columns.
# This dataset has 2416 rows, and 7 columns.
# The NYSE and NASDAQ average about 253 trading days a year.
# This is from 365.25 (days on average per year) * 5/7 (proportion work days per week)
# - 6 (weekday holidays) - 3*5/7 (fixed date holidays) = 252.75 ≈ 253.
# 10 * 253 = 2530. The file only starts in late June 2010, so 2416 rows is about
# what we'd expect. Let's assume it's not missing any days.
tsla_data.shape

(2416, 7)
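The trading-day estimate in the comment above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are labels I made up for the numbers in that comment, and the "years covered" figure is a rough sanity check, not something read from the file.

```python
# Back-of-the-envelope estimate of trading days per year,
# following the arithmetic in the comment above.
DAYS_PER_YEAR = 365.25      # average calendar days, leap years included
WEEKDAY_SHARE = 5 / 7       # proportion of days that fall Monday to Friday
WEEKDAY_HOLIDAYS = 6        # holidays that always land on a weekday
FIXED_DATE_HOLIDAYS = 3     # fixed-date holidays, on a weekday 5/7 of the time

trading_days = (
    DAYS_PER_YEAR * WEEKDAY_SHARE
    - WEEKDAY_HOLIDAYS
    - FIXED_DATE_HOLIDAYS * WEEKDAY_SHARE
)

rows_in_file = 2416
years_covered = rows_in_file / trading_days

print(round(trading_days, 2), round(years_covered, 2))
```

So 2416 rows works out to a bit over nine and a half years of trading days, which lines up with a file that starts partway through 2010.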

This is roughly 10 years of data, with information about the stock starting from 2010.

Let's make some assumptions for the sake of time. We're not hedge fund managers yet.

Assumptions

We can only place one order a day (buy or sell), for the entire amount held.
If we place an order, we assume it will go through at that price.
We start with $1000

We're going to track a few key pieces of information.

Money in wallet
Number of stocks held

Let's start with just 2010, to see how much money we would have made if we started with $1000 on the first day of this file.

# We're going to pull just the 2010 data. I like sticking this in a variable, as a list,
# because we'll likely do this again, and for multiple years.
years_to_pull = [2010]

# Let's tell pandas to treat the 'Date' column as a date.
tsla_data['Date'] = pd.to_datetime(tsla_data['Date'])

# Let's make a function for re-use.
# .copy() gives us an independent frame, so adding columns later doesn't touch tsla_data.
def pull_data_by_year(tsla_data, years_to_pull):
  tsla_data_by_year = tsla_data[tsla_data['Date'].dt.year.isin(years_to_pull)].copy()
  return tsla_data_by_year

tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)
tsla_data_by_year.shape

(130, 7)

# Sort by date ASC
tsla_data_by_year = tsla_data_by_year.sort_values(by = 'Date')

Let's add a couple columns to help us with the data. I want to see tomorrow's adjusted close, and I want to know if it's higher than today's adjusted close.

# .shift(-1) brings the next row into the equation, so that we can add a column that  
# shows tomorrow's adjusted close.
tsla_data_by_year["Adj Close Tomorrow"] = tsla_data_by_year["Adj Close"].shift(-1)
# This adds another column as a bool to quickly show if the stock goes up or down tomorrow.
tsla_data_by_year["Stock Goes Up Tomorrow"] = tsla_data_by_year["Adj Close"] < tsla_data_by_year["Adj Close Tomorrow"]
# Let's look at the first 10 rows to see if this looks correct.
tsla_data_by_year.head(10)

	Date	Open	High	Low	Close	Adj Close	Volume	Adj Close Tomorrow	Stock Goes Up Tomorrow
0	2010-06-29	19.000000	25.000000	17.540001	23.889999	23.889999	18766300	23.830000	False
1	2010-06-30	25.790001	30.420000	23.299999	23.830000	23.830000	17187100	21.959999	False
2	2010-07-01	25.000000	25.920000	20.270000	21.959999	21.959999	8218800	19.200001	False
3	2010-07-02	23.000000	23.100000	18.709999	19.200001	19.200001	5139800	16.110001	False
4	2010-07-06	20.000000	20.000000	15.830000	16.110001	16.110001	6866900	15.800000	False
5	2010-07-07	16.400000	16.629999	14.980000	15.800000	15.800000	6921700	17.459999	True
6	2010-07-08	16.139999	17.520000	15.570000	17.459999	17.459999	7711400	17.400000	False
7	2010-07-09	17.580000	17.900000	16.549999	17.400000	17.400000	4050600	17.049999	False
8	2010-07-12	17.950001	18.070000	17.000000	17.049999	17.049999	2202500	18.139999	True
9	2010-07-13	17.389999	18.639999	16.900000	18.139999	18.139999	2680100	19.840000	True
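To see what .shift(-1) does in isolation, here is a tiny sketch on made-up prices (the numbers are invented for illustration and are not from TSLA.csv). It also shows what happens on the last row, which has no "tomorrow".

```python
import pandas as pd

# Toy prices, made up for illustration.
toy = pd.DataFrame({"Adj Close": [10.0, 12.0, 11.0, 13.0]})

# shift(-1) moves every value up one row, so each row sees the next row's price.
toy["Adj Close Tomorrow"] = toy["Adj Close"].shift(-1)

# The last row has no next row, so it holds NaN. Any comparison with NaN is False.
toy["Stock Goes Up Tomorrow"] = toy["Adj Close"] < toy["Adj Close Tomorrow"]

print(toy)
```

That NaN is why the final row of the year always reads False for 'Stock Goes Up Tomorrow'.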

Since we're looking at historical data, we can follow the rule "buy low, sell high" with perfect hindsight.

To start, pick the first day whose following day's Adj Close is higher, and buy $1000 worth of shares on that day.

On any given day we'll take one of 3 positions.

Buy
Sell
Hold

In pseudocode:

haveNoStock && !goesUpTomorrow = hold

haveNoStock && goesUpTomorrow = buy

haveStock && !goesUpTomorrow = sell

haveStock && goesUpTomorrow = hold
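The four rules above translate to Python fairly directly. This is a minimal sketch; decide_position is a name I picked for illustration and is not used elsewhere in the post.

```python
def decide_position(have_stock: bool, goes_up_tomorrow: bool) -> str:
    """Return 'Buy', 'Sell', or 'Hold' for one trading day."""
    if not have_stock and goes_up_tomorrow:
        return "Buy"    # we're in cash and the price rises tomorrow
    if have_stock and not goes_up_tomorrow:
        return "Sell"   # we're holding and the price won't rise tomorrow
    return "Hold"       # the other two cases: nothing to do


# Print the full decision table.
for have_stock in (False, True):
    for goes_up in (False, True):
        print(have_stock, goes_up, decide_position(have_stock, goes_up))
```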

# Setting some default values of the new columns.
# Position can be Hold/Sell/Buy
tsla_data_by_year['Position'] = 'Hold'
# Use floats (0.0) because these columns will hold fractional shares and cents.
tsla_data_by_year['Number Of Stocks Held'] = 0.0
tsla_data_by_year['Money In Wallet'] = 0.0
# .at says at row 0, column 'Money In Wallet', save $1000
tsla_data_by_year.at[0, 'Money In Wallet'] = 1000.0
tsla_data_by_year.head()

	Date	Open	High	Low	Close	Adj Close	Volume	Adj Close Tomorrow	Stock Goes Up Tomorrow	Position	Money In Wallet
0	2010-06-29	19.000000	25.00	17.540001	23.889999	23.889999	18766300	23.830000	False	Hold	1000
1	2010-06-30	25.790001	30.42	23.299999	23.830000	23.830000	17187100	21.959999	False	Hold	0
2	2010-07-01	25.000000	25.92	20.270000	21.959999	21.959999	8218800	19.200001	False	Hold	0
3	2010-07-02	23.000000	23.10	18.709999	19.200001	19.200001	5139800	16.110001	False	Hold	0
4	2010-07-06	20.000000	20.00	15.830000	16.110001	16.110001	6866900	15.800000	False	Hold	0

# Here's my code for determining if I should buy/sell/hold.
# We'll put this in a function down the line.
previousRow = None
for index, row in tsla_data_by_year.iterrows():
  if index > 0:
    # Carry yesterday's wallet and holdings forward.
    row['Money In Wallet'] = previousRow['Money In Wallet']
    row['Number Of Stocks Held'] = previousRow['Number Of Stocks Held']
  if row['Number Of Stocks Held'] == 0 and not row['Stock Goes Up Tomorrow']:
    row['Position'] = 'Hold'
  elif row['Number Of Stocks Held'] == 0 and row['Stock Goes Up Tomorrow']:
    # Spend the whole wallet on (fractional) shares at today's adjusted close.
    row['Position'] = 'Buy'
    row['Number Of Stocks Held'] = row['Money In Wallet'] / row['Adj Close']
    row['Money In Wallet'] -= row['Number Of Stocks Held'] * row['Adj Close']
  elif row['Number Of Stocks Held'] > 0 and not row['Stock Goes Up Tomorrow']:
    # Sell everything at today's adjusted close.
    row['Position'] = 'Sell'
    row['Money In Wallet'] += row['Number Of Stocks Held'] * row['Adj Close']
    row['Number Of Stocks Held'] = 0
  elif row['Number Of Stocks Held'] > 0 and row['Stock Goes Up Tomorrow']:
    row['Position'] = 'Hold'
  previousRow = row
  # iterrows() hands us a copy of the row, so write the changed values back.
  for column in ('Position', 'Number Of Stocks Held', 'Money In Wallet'):
    tsla_data_by_year.at[index, column] = row[column]

# Round each number to 2 decimal places.
tsla_data_by_year = tsla_data_by_year.round(2)
# Let's look at the last row to see how much money or stock we have at the end of the year.
tsla_data_by_year.tail(1)

	Date	Open	High	Low	Close	Adj Close	Volume	Adj Close Tomorrow	Stock Goes Up Tomorrow	Position	Number Of Stocks Held	Money In Wallet
129	2010-12-31	26.57	27.25	26.5	26.63	26.63	1417900	NaN	False	Sell	0.0	8645.73

At the end of 2010 we would have $8,645.73 if we knew the future.
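There is a shortcut for sanity-checking a number like this. Under the assumptions above (all-in, fractional shares, no fees), knowing the future means our money grows by tomorrow's price divided by today's price on every day the price rises, and stays flat otherwise. Here is a small sketch on made-up prices; perfect_foresight_wealth is a hypothetical helper, not part of the code in this post.

```python
def perfect_foresight_wealth(prices, starting_cash=1000.0):
    """Final wealth if we hold the stock on exactly the days before a rise."""
    wealth = starting_cash
    for today, tomorrow in zip(prices, prices[1:]):
        if tomorrow > today:
            # We are holding over this step, so wealth scales with the price.
            wealth *= tomorrow / today
    return wealth


# Toy prices, made up for illustration: up, down, up.
toy_prices = [10.0, 12.0, 11.0, 13.0]
print(round(perfect_foresight_wealth(toy_prices), 2))
```

Feeding it the year's 'Adj Close' values as a list should land close to the loop's final wallet, up to rounding.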

Now let's apply some ML to this. The key thing to remember is that this is a science, and this is an experiment. We need to follow the scientific method.

Our hypothesis: we can predict whether the stock will go up or down tomorrow.

Predicting one of two possible outcomes is called Binary Classification.

The goal is, for a given row, to predict the column that tells us if the stock will go up tomorrow. But all we have are today's highs, lows, and prices. That doesn't give us much information, so we need a way to capture the trend in historical prices. For this we use technical indicators.

## Add Bollinger Bands
## To learn more about Bollinger Bands: https://www.investopedia.com/terms/b/bollingerbands.asp
import matplotlib.pyplot as plt

tsla_data_by_year['30 Day MA'] = tsla_data_by_year['Adj Close'].rolling(window=30).mean()
tsla_data_by_year['30 Day STD'] = tsla_data_by_year['Adj Close'].rolling(window=30).std()

tsla_data_by_year['Upper Band'] = tsla_data_by_year['30 Day MA'] + (tsla_data_by_year['30 Day STD'] * 2)
tsla_data_by_year['Lower Band'] = tsla_data_by_year['30 Day MA'] - (tsla_data_by_year['30 Day STD'] * 2)

# Simple 30 Day Bollinger Band for Tesla
tsla_data_by_year[['Adj Close', '30 Day MA', 'Upper Band', 'Lower Band']].plot(figsize=(12,6))
plt.title('30 Day Bollinger Band for Tesla')
plt.ylabel('Price (USD)')
plt.show()

# This plot will show us the adjusted close, the rolling average, and the upper
# and lower bands of the TSLA stock.

# Since we used a 30 day moving average, the first 29 rows do not have
# Bollinger Bands information, and the last row has no 'Adj Close Tomorrow'.
# We use dropna() to drop those rows.
tsla_data_by_year = tsla_data_by_year.dropna()
tsla_data_by_year.head()

	Date	Open	High	Low	Close	Adj Close	Volume	Adj Close Tomorrow	Stock Goes Up Tomorrow	Position	Number Of Stocks Held	Money In Wallet	30 Day MA	30 Day STD	Upper Band	Lower Band
29	2010-08-10	19.65	19.65	18.82	19.03	19.03	1281300	17.90	False	Hold	0.00	1660.38	20.041333	1.937226	23.915786	16.166880
30	2010-08-11	18.69	18.88	17.85	17.90	17.90	797600	17.60	False	Hold	0.00	1660.38	19.841667	1.832744	23.507156	16.176178
31	2010-08-12	17.80	17.90	17.39	17.60	17.60	691000	18.32	True	Buy	94.34	0.00	19.634000	1.714383	23.062765	16.205235
32	2010-08-13	18.18	18.45	17.66	18.32	18.32	634000	18.78	True	Hold	94.34	0.00	19.512667	1.672380	22.857427	16.167907
33	2010-08-16	18.45	18.80	18.26	18.78	18.78	485800	19.15	True	Hold	94.34	0.00	19.498667	1.676840	22.852346	16.144987
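To make the rolling window and the dropped rows easier to see, here is the same band calculation on five made-up prices with a window of 3 instead of 30. The numbers are toy values chosen so the result is easy to check by hand.

```python
import pandas as pd

# Toy closing prices, made up for illustration.
toy_close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])
window = 3

# The first (window - 1) rows don't have enough history, so they come out as NaN.
moving_average = toy_close.rolling(window=window).mean()
moving_std = toy_close.rolling(window=window).std()

# Bands sit two standard deviations above and below the moving average.
upper_band = moving_average + 2 * moving_std
lower_band = moving_average - 2 * moving_std

print(pd.DataFrame({
    "close": toy_close,
    "ma": moving_average,
    "upper": upper_band,
    "lower": lower_band,
}))
```

With a window of 3 the first 2 rows are empty, which is the same reason the 30 day version leaves the first 29 rows empty.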

# Some fantastical python. 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
import numpy as np
from matplotlib import pyplot, dates

# Here we are saying we want to predict the column 'Stock Goes Up Tomorrow' by 
#  storing the column name in a variable.
predict = 'Stock Goes Up Tomorrow'
X = tsla_data_by_year
# Treat the date as a number, e.g. 2010-08-10 becomes 20100810.
X['Date'] = X['Date'].dt.strftime('%Y%m%d').astype(int)

# For each column, apply a LabelEncoder. The model needs numerical values,
# so text columns have to be encoded as integers.
# With columns like 'Position', the LabelEncoder maps each category to an integer,
# for example 0 = Buy, 1 = Hold, 2 = Sell
# (this is an example, the LabelEncoder will determine
# the numerical values of the categories at runtime.)
for column in X.columns:
  if column != 'Date':
    if X[column].dtype == object:
        le = LabelEncoder()
        X[column] = le.fit_transform(X[column])

# Set the y dataset to just the single column we want to predict.
y = tsla_data_by_year[predict]

# Set the X dataset (what we will use to predict), to all the columns mentioned.
X = tsla_data_by_year[['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', '30 Day MA', '30 Day STD', 'Upper Band', 'Lower Band']]

# This is used to stratify. Learn more here: https://en.wikipedia.org/wiki/Stratified_sampling
targets = tsla_data_by_year[predict]

# This splits the dataset into training and testing. 60% of the data will be  
# used to train, 40% will be used to test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=101, stratify=targets)

X_test.head()

	Date	Open	High	Low	Close	Adj Close	Volume	30 Day MA	30 Day STD	Upper Band	Lower Band
65	20100930	22.00	22.15	20.19	20.41	20.41	2195800	20.382333	0.786847	21.956028	18.808639
106	20101129	35.41	35.95	33.33	34.33	34.33	1145600	26.106333	5.340324	36.786982	15.425684
57	20100920	20.67	21.35	20.16	21.06	21.06	947500	19.866667	1.058318	21.983302	17.750031
99	20101117	30.20	30.75	28.61	29.49	29.49	750000	23.079667	3.586195	30.252057	15.907276
90	20101104	22.60	25.33	22.15	24.90	24.90	1874000	20.957667	0.907603	22.772872	19.142461
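If the label encoding step is unclear, here is a plain-Python sketch of the idea. It assumes codes are handed out in sorted label order, which is how scikit-learn's LabelEncoder behaves to my understanding. The positions list is made up for illustration.

```python
# Toy 'Position' values, made up for illustration.
positions = ["Hold", "Buy", "Hold", "Sell", "Hold"]

# Collect the distinct labels in sorted order, then number them from 0.
classes = sorted(set(positions))
code_for = {label: code for code, label in enumerate(classes)}

# Replace each label with its integer code.
encoded = [code_for[label] for label in positions]

print(classes)
print(encoded)
```

The integers carry no meaning of their own. They are just a stand-in so that text columns can be stored as numbers.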

from sklearn import ensemble

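# Hyperparameters for the GradientBoostingRegressor algorithm. More on this much later.
# Least squares is the default loss, so it is not set explicitly here
# (older scikit-learn versions spell it 'ls', newer ones 'squared_error').
params = {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 3,
          'learning_rate': 0.01}
clf = ensemble.GradientBoostingRegressor(**params)

# Fit the model with the training data. It's a regressor, so it outputs a score
# that we threshold at .5 to get a True/False prediction.
clf.fit(X_train, y_train)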

# Use the trained model to predict the testing dataset.
y_pred_original = clf.predict(X_test)

from sklearn.metrics import (confusion_matrix, precision_score, recall_score, f1_score, classification_report)
y_pred = y_pred_original > .5
y_pred = pd.DataFrame(y_pred)
y_pred = y_pred * 1
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# The confusion matrix shows us True/False Positives and True/False Negatives.
# This dataset is really too small to get a reliable score,
# but so far it looks like we're close to 50% accurate.

[[12  7]
 [14  7]]
              precision    recall  f1-score   support

       False       0.46      0.63      0.53        19
        True       0.50      0.33      0.40        21

    accuracy                           0.48        40
   macro avg       0.48      0.48      0.47        40
weighted avg       0.48      0.47      0.46        40
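The numbers in the report can be rebuilt by hand from the confusion matrix. This sketch assumes scikit-learn's layout, where rows are the actual values and columns are the predicted values. The variable names are my own.

```python
# Confusion matrix from the output above.
true_negatives, false_positives = 12, 7    # days that actually went down (or flat)
false_negatives, true_positives = 14, 7    # days that actually went up

total = true_negatives + false_positives + false_negatives + true_positives

# Share of all days we called correctly.
accuracy = (true_negatives + true_positives) / total
# Of the days we predicted "up", how many really went up.
precision_true = true_positives / (true_positives + false_positives)
# Of the days that really went up, how many we caught.
recall_true = true_positives / (true_positives + false_negatives)
# Baseline: always predict "up".
always_up_accuracy = (false_negatives + true_positives) / total

print(f"accuracy={accuracy:.3f} precision={precision_true:.3f} "
      f"recall={recall_true:.3f} always_up={always_up_accuracy:.3f}")
```

Always guessing "up" would have scored higher on this test set than the model did.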

We're slightly worse than a coin flip! Let's see how it does when we trade on these predictions.

predictions = clf.predict(X)
# The predictions come back as continuous scores (roughly 0.00 to 1.00), but we need them
# as true/false to work with our algorithm to calculate $$. Here I compare to .5 (threshold)
# to determine if the prediction is true or false.
# You can manually adjust the threshold to trade off true positives against true negatives,
# which is useful when you're trying to improve one particular metric.
predictions = predictions > .5
X['Stock Goes Up Tomorrow'] = predictions
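Here is a small sketch of how moving the threshold changes the calls. The scores are made up for illustration and apply_threshold is a hypothetical helper.

```python
# Toy model scores, made up for illustration.
scores = [0.42, 0.48, 0.51, 0.55, 0.61]


def apply_threshold(scores, threshold):
    """Turn continuous scores into True/False 'goes up' calls."""
    return [score > threshold for score in scores]


# A lower threshold says "up" more often, a higher one less often.
for threshold in (0.45, 0.5, 0.55):
    calls = apply_threshold(scores, threshold)
    print(threshold, calls, "up calls:", sum(calls))
```

A higher threshold means fewer buys, each made with more conviction from the model. A lower one means more buys and more false alarms.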

# Same code as above, wrapped in a function. Use a dataset to determine
# how much money we'll have made with our trades.
def howMuchMoneyDidWeMake(X):
  if 'Money In Wallet' not in X:
    X['Position'] = 'Hold'
    X['Number Of Stocks Held'] = 0.0
    X['Money In Wallet'] = 0.0
    # Assumes the index was reset, so the first row has label 0.
    X.at[0, 'Money In Wallet'] = 1000.0

  previousRow = None
  for index, row in X.iterrows():
    if index > 0:
      # Carry yesterday's wallet and holdings forward.
      row['Money In Wallet'] = previousRow['Money In Wallet']
      row['Number Of Stocks Held'] = previousRow['Number Of Stocks Held']
    if row['Number Of Stocks Held'] == 0 and not row['Stock Goes Up Tomorrow']:
      row['Position'] = 'Hold'
    elif row['Number Of Stocks Held'] == 0 and row['Stock Goes Up Tomorrow']:
      row['Position'] = 'Buy'
      row['Number Of Stocks Held'] = row['Money In Wallet'] / row['Adj Close']
      row['Money In Wallet'] -= row['Number Of Stocks Held'] * row['Adj Close']
    elif row['Number Of Stocks Held'] > 0 and not row['Stock Goes Up Tomorrow']:
      row['Position'] = 'Sell'
      row['Money In Wallet'] += row['Number Of Stocks Held'] * row['Adj Close']
      row['Number Of Stocks Held'] = 0
    elif row['Number Of Stocks Held'] > 0 and row['Stock Goes Up Tomorrow']:
      row['Position'] = 'Hold'
    previousRow = row
    # iterrows() hands us a copy of the row, so write the changed values back.
    for column in ('Position', 'Number Of Stocks Held', 'Money In Wallet'):
      X.at[index, column] = row[column]

  # Round once at the end. The running totals above stay unrounded.
  return X.round(2)

X = X.reset_index()
X = howMuchMoneyDidWeMake(X)
X.tail(1)

	index	Date	Open	High	Low	Close	Adj Close	Volume	30 Day MA	30 Day STD	Upper Band	Lower Band	Stock Goes Up Tomorrow	Position	Number Of Stocks Held	Money In Wallet
99	128	20101230	27.7	27.9	26.38	26.5	26.5	2041100	31.28	2.63	36.55	26.02	False	Hold	0.0	2917.96

Not bad! We almost tripled our money. We end up with $2,917.96.

But there is a rookie mistake here. We're measuring our success with the same data we used to train the algorithm. We need another dataset to test this against. Let's pull the next year and try again.

# Let's pull 2011 data.
years_to_pull = [2011]
tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)

def addBollingerBands(df):
  df['30 Day MA'] = df['Adj Close'].rolling(window=30).mean()
  df['30 Day STD'] = df['Adj Close'].rolling(window=30).std() 
  df['Upper Band'] = df['30 Day MA'] + (df['30 Day STD'] * 2)
  df['Lower Band'] = df['30 Day MA'] - (df['30 Day STD'] * 2)
  df = df.dropna()
  return df

tsla_data_by_year = addBollingerBands(tsla_data_by_year)
tsla_data_by_year.head()

	Date	Open	High	Low	Close	Adj Close	Volume	30 Day MA	30 Day STD	Upper Band	Lower Band
159	2011-02-14	23.639999	24.139999	23.049999	23.08	23.08	1283100	24.937333	1.720240	28.377814	21.496853
160	2011-02-15	23.010000	23.170000	22.559999	22.84	22.84	953700	24.811333	1.731142	28.273617	21.349049
161	2011-02-16	23.100000	24.969999	23.070000	24.73	24.73	4115100	24.746667	1.695178	28.137023	21.356310
162	2011-02-17	24.629999	25.490000	23.549999	23.60	23.60	2618400	24.639000	1.660516	27.960031	21.317969
163	2011-02-18	23.330000	23.490000	22.959999	23.18	23.18	2370700	24.482333	1.563047	27.608427	21.356240

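# Function to add the predicted column to a dataset using a trained model.
# df must contain exactly the feature columns the model was trained on, in the same order.
def addPredictedColumn(df, clf):
  # Work on a copy so the caller's frame is left alone.
  df = df.copy()
  # Treat the date as a number, e.g. 2011-02-14 becomes 20110214.
  df['Date'] = df['Date'].dt.strftime('%Y%m%d').astype(int)
  # df["Adj Close Tomorrow"] = df["Adj Close"].shift(-1)

  df = df.dropna()

  # Encode any text columns as integers.
  for column in df.columns:
    if column != 'Date':
      if df[column].dtype == object:
          le = LabelEncoder()
          df[column] = le.fit_transform(df[column])

  # The model returns a score. Anything above .5 counts as "goes up".
  predictions = clf.predict(df)
  predictions = predictions > .5
  df['Stock Goes Up Tomorrow'] = predictions
  return df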

tsla_data_by_year = addPredictedColumn(tsla_data_by_year, clf)
tsla_data_by_year = tsla_data_by_year.reset_index()
tsla_data_by_year = howMuchMoneyDidWeMake(tsla_data_by_year)
tsla_data_by_year.tail(1)

	index	Date	Open	High	Low	Close	Adj Close	Volume	30 Day MA	30 Day STD	Upper Band	Lower Band	Stock Goes Up Tomorrow	Position	Number Of Stocks Held	Money In Wallet
222	381	20111230	28.49	28.98	28.25	28.56	28.56	339800	30.66	2.33	35.33	25.99	False	Hold	0.0	1652.3

If we run a model trained on 2010 against data from 2011, we end up with a total of $1,652.30. That's pretty bad. I think the issue is that we're not taking into account new data about TSLA in 2011; we're simply using 2010 to predict 2011. That won't do.

What if we re-train the model every 30 days in 2011? That way the classifier 'resets' every 30 days to pick up any new patterns.
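One way to picture a schedule like this is as a series of row ranges: train on everything seen so far, then predict the next 30 rows, then repeat. The sketch below only prints the ranges. walk_forward_windows is a hypothetical helper that illustrates the idea, and it is not the exact bookkeeping used in this post.

```python
def walk_forward_windows(n_rows, window=30):
    """Yield ((train_start, train_end), (test_start, test_end)) row ranges.

    End positions are exclusive, like Python slices. Each step trains on all
    rows before the window and predicts the rows inside it.
    """
    for test_start in range(window, n_rows, window):
        test_end = min(test_start + window, n_rows)
        yield (0, test_start), (test_start, test_end)


# Example with 100 rows: the last window is shorter than 30.
for train_range, test_range in walk_forward_windows(n_rows=100, window=30):
    print("train on rows", train_range, "then predict rows", test_range)
```

The point of this layout is that each prediction window only ever sees rows that come before it.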

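# Start the training pool from the 2011 frame we just scored.
# Note that it already holds every 2011 row, including dates after the windows we predict below.
existing_df = tsla_data_by_year
years_to_pull = [2011]
# Rebuild a clean 2011 frame with Bollinger Bands to trade on.
tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)
tsla_data_by_year = addBollingerBands(tsla_data_by_year)
tsla_data_by_year = tsla_data_by_year.reset_index()
# Trade the first 30 rows with the model trained on 2010.
working_df = tsla_data_by_year[:30]
working_df = working_df.drop(columns=['index'])
working_df = addPredictedColumn(working_df, clf)
working_df = howMuchMoneyDidWeMake(working_df)

# all_the_money collects the simulated trades from every window.
all_the_money = pd.concat([working_df], sort=True)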

for i in range(1, 8):
  # Keep the last simulated row so the wallet and holdings carry into the next window.
  new_first_row = working_df[-1:]
  # Encode any text columns (such as 'Position') as integers.
  for column in working_df.columns:
    if column != 'Date':
      if working_df[column].dtype == object:
          le = LabelEncoder()
          working_df[column] = le.fit_transform(working_df[column])
  # Add the window we just traded to the training pool.
  existing_df = pd.concat([existing_df, working_df], sort=True)
  # Move on to the next 30-row window.
  working_df = tsla_data_by_year[30*i:30*i+30]
  # Rebuild the target column from the actual prices.
  existing_df["Adj Close Tomorrow"] = existing_df["Adj Close"].shift(-1)
  existing_df = existing_df.dropna()
  existing_df["Stock Goes Up Tomorrow"] = existing_df["Adj Close"] < existing_df["Adj Close Tomorrow"]
  predict = 'Stock Goes Up Tomorrow'
  X = existing_df
  for column in X.columns:
    if column != 'Date':
      if X[column].dtype == object:
          le = LabelEncoder()
          X[column] = le.fit_transform(X[column])

  y = existing_df[predict]
  X = existing_df[['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
        '30 Day MA', '30 Day STD', 'Upper Band', 'Lower Band']]
  targets = existing_df[predict]
  # Least squares is the default loss, so it is not set explicitly here.
  params = {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 3,
            'learning_rate': 0.01}
  # Re-train from scratch on the training pool.
  clf = ensemble.GradientBoostingRegressor(**params)
  clf.fit(X, y)
  # Predict the new window, then simulate trades starting from the carried-over row.
  working_df = working_df.drop(columns=['index'])
  working_df = addPredictedColumn(working_df, clf)
  working_df = pd.concat([new_first_row, working_df], sort=True)
  working_df = working_df.reset_index(drop=True)
  working_df = howMuchMoneyDidWeMake(working_df)
  all_the_money = pd.concat([all_the_money, working_df], sort=True)

all_the_money.tail(1)