DEV Community

Haseeb Mohammed
Haseeb Mohammed

Posted on

Making $$$ with ML

You've got $1000 to burn. You've decided you want to invest in the stock market, specifically Tesla.
Let's see if we can use Machine Learning to optimize our returns.

Download TSLA.csv from here: https://www.kaggle.com/timoboz/tesla-stock-data-from-2010-to-2020/data

Let's get started.

Let's first do some Exploratory Data Analysis (EDA) on the file we've got.

This file is a comma-separated values (CSV) file with 7 columns.

The columns are:

  • Date
  • Opening price
  • Highest price that day
  • Lowest price that day
  • Closing price
  • Adjusted closing price, taking splits etc into account
  • Trading volume
# Importing pandas. "pandas is a fast, powerful, flexible and easy to use open source data
# analysis and manipulation tool, built on top of the Python programming language."
import pandas as pd                                                                                         
pd.options.display.max_rows = 30
# Read in the CSV, save it to a pandas dataframe variable called 'tsla_data'.
tsla_data = pd.read_csv("TSLA.csv");
# .head() gives us the first 5 rows of the data frame.
# You can also pass .head() a parameter to return any number of rows. Like .head(10) for 10 rows.
tsla_data.head()
Date Open High Low Close Adj Close Volume
0 2010-06-29 19.000000 25.00 17.540001 23.889999 23.889999 18766300
1 2010-06-30 25.790001 30.42 23.299999 23.830000 23.830000 17187100
2 2010-07-01 25.000000 25.92 20.270000 21.959999 21.959999 8218800
3 2010-07-02 23.000000 23.10 18.709999 19.200001 19.200001 5139800
4 2010-07-06 20.000000 20.00 15.830000 16.110001 16.110001 6866900
# .shape tells us the number of rows, and the number of columns.
# This dataset has 2416 rows, and 7 columns.
# The NYSE and NASDAQ average about 253 trading days a year. 
# This is from 365.25 (days on average per year) * 5/7 (proportion work days per week) 
# - 6 (weekday holidays) - 3*5/7 (fixed date holidays) = 252.75 ≈ 253.
# 10 * 253 = 2530, this dataset is pretty close. Let's assume it's not missing any days.
tsla_data.shape
(2416, 7)

This is 10 years of data, with information about the stock starting from 2010.

Let's make some assumptions for the sake of time, we're not hedge fund managers yet.

Assumptions

  • We can only place one order a day (buy or sell), for the entire amount held.
  • If we place an order, we assume it will go through at that price.
  • We start with $1000

We're going to track a few key pieces of information.

  • Money in wallet
  • Number of stocks held

Let's start with just 2010, to see how much money we would have made if we started with $1000 on the first day of this file.

# We're going to just pull the 2010 data. I like sticking this in variable, and array,  
# because we'll likely do this again, and by multiple years.
years_to_pull = [2010]

# Let's tell pandas to treat the 'Date' column as a date.
tsla_data['Date'] = pd.to_datetime(tsla_data['Date'])

# Let's make a function for re-use
def pull_data_by_year(tsla_data, years_to_pull):
  tsla_data_by_year = tsla_data[tsla_data['Date'].dt.year.isin(years_to_pull)]
  return tsla_data_by_year

tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)
tsla_data_by_year.shape
(130, 7)
# Sort by date ASC
tsla_data_by_year = tsla_data_by_year.sort_values(by = 'Date')

Let's add a couple columns to help us with the data. I want to see tomorrow's adjusted close, and I want to know if it's higher than today's adjusted close.

# .shift(-1) brings the next row into the equation, so that we can add a column that  
# shows tomorrow's adjusted close.
tsla_data_by_year["Adj Close Tomorrow"] = tsla_data_by_year["Adj Close"].shift(-1)
# This adds another column as a bool to quickly show if the stock goes up or down tomorrow.
tsla_data_by_year["Stock Goes Up Tomorrow"] = tsla_data_by_year["Adj Close"] < tsla_data_by_year["Adj Close Tomorrow"]
# Let's look at the first 10 rows to see if this looks correct.
tsla_data_by_year.head(10)
Date Open High Low Close Adj Close Volume Adj Close Tomorrow Stock Goes Up Tomorrow
0 2010-06-29 19.000000 25.000000 17.540001 23.889999 23.889999 18766300 23.830000 False
1 2010-06-30 25.790001 30.420000 23.299999 23.830000 23.830000 17187100 21.959999 False
2 2010-07-01 25.000000 25.920000 20.270000 21.959999 21.959999 8218800 19.200001 False
3 2010-07-02 23.000000 23.100000 18.709999 19.200001 19.200001 5139800 16.110001 False
4 2010-07-06 20.000000 20.000000 15.830000 16.110001 16.110001 6866900 15.800000 False
5 2010-07-07 16.400000 16.629999 14.980000 15.800000 15.800000 6921700 17.459999 True
6 2010-07-08 16.139999 17.520000 15.570000 17.459999 17.459999 7711400 17.400000 False
7 2010-07-09 17.580000 17.900000 16.549999 17.400000 17.400000 4050600 17.049999 False
8 2010-07-12 17.950001 18.070000 17.000000 17.049999 17.049999 2202500 18.139999 True
9 2010-07-13 17.389999 18.639999 16.900000 18.139999 18.139999 2680100 19.840000 True

Following the rule buy low sell high, and we're looking at historical data, we can say the following.

To start pick the first day whose following day's Adj Close price goes up, and buy $1000 worth of shares on that day.

We'll have 3 positions.

Buy
Sell
Hold

In code:

haveNoStock && !goesUpTomorrow = hold

haveNoStock && goesUpTomorrow = buy

haveStock && !goesUpTomorrow = sell

haveStock && goesUpTomorrow = hold

# Setting some default values of the new columns. 
# Position can be Hold/Sell/Buy
tsla_data_by_year['Position'] = 'Hold'
tsla_data_by_year['Number Of Stocks Held'] = 0
tsla_data_by_year['Money In Wallet'] = 0
# .at says at row 0, column 'Money in Wallet', save $1000
tsla_data_by_year.at[0, 'Money In Wallet'] = 1000
tsla_data_by_year.head()
Date Open High Low Close Adj Close Volume Adj Close Tomorrow Stock Goes Up Tomorrow Position Number Of Stocks Held Money In Wallet
0 2010-06-29 19.000000 25.00 17.540001 23.889999 23.889999 18766300 23.830000 False Hold 0 1000
1 2010-06-30 25.790001 30.42 23.299999 23.830000 23.830000 17187100 21.959999 False Hold 0 0
2 2010-07-01 25.000000 25.92 20.270000 21.959999 21.959999 8218800 19.200001 False Hold 0 0
3 2010-07-02 23.000000 23.10 18.709999 19.200001 19.200001 5139800 16.110001 False Hold 0 0
4 2010-07-06 20.000000 20.00 15.830000 16.110001 16.110001 6866900 15.800000 False Hold 0 0
# Here's my code for determining if I should buy/sell/hold. 
# We'll put this in a function down the line.
previousRow = ''
for index, row in tsla_data_by_year.iterrows():
  if(index > 0):
    row['Money In Wallet'] = previousRow['Money In Wallet']
    row['Number Of Stocks Held'] = previousRow['Number Of Stocks Held']
  if(row['Number Of Stocks Held'] == 0 and not row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Hold'
    # print(1)
  elif(row['Number Of Stocks Held'] == 0 and row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Buy'
    row['Number Of Stocks Held'] = row['Money In Wallet'] / row['Adj Close']
    row['Money In Wallet'] -= row['Number Of Stocks Held'] * row['Adj Close']
    # print(2)
  elif(row['Number Of Stocks Held'] > 0 and not row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Sell'
    row['Money In Wallet'] += row['Number Of Stocks Held'] * row['Adj Close']
    row['Number Of Stocks Held'] = 0
    # print(3)
  elif(row['Number Of Stocks Held'] > 0 and row['Stock Goes Up Tomorrow']):
    row['Position'] = 'Hold'
    # print(4)
  previousRow = row
  tsla_data_by_year.at[index] = row
# Round each number to 2 decimal places.
tsla_data_by_year = tsla_data_by_year.round(2)
# Let's look at the last row to see how much money or stock we have at the end of the year.
tsla_data_by_year.tail(1)
Date Open High Low Close Adj Close Volume Adj Close Tomorrow Stock Goes Up Tomorrow Position Number Of Stocks Held Money In Wallet
129 2010-12-31 26.57 27.25 26.5 26.63 26.63 1417900 NaN False Sell 0.0 8645.73

As the end of 2010 we would have $8,645 if we knew the future.

Now let's apply some ML to this. The key here to remember is this is a science, this is an experiment. We need to follow the scientific method.

Our hypothesis: We will attempt to predict if the stock will go up or down tomorrow.

When we are trying to predict two possible outcomes it is called Binary Classification.

The goal is for a given row we can predict the column that tells us if the stock will go up tomorrow, but all we have are today's highs/lows and prices. That doesn't give us much information, we need to try and trend out historical prices. For this we use technical indicators.

## Add bollinger bands
## To learn more about bollinger bands: https://www.investopedia.com/terms/b/bollingerbands.asp
import matplotlib.pyplot as plt

tsla_data_by_year['30 Day MA'] = tsla_data_by_year['Adj Close'].rolling(window=30).mean()
tsla_data_by_year['30 Day STD'] = tsla_data_by_year['Adj Close'].rolling(window=30).std() 

tsla_data_by_year['Upper Band'] = tsla_data_by_year['30 Day MA'] + (tsla_data_by_year['30 Day STD'] * 2)
tsla_data_by_year['Lower Band'] = tsla_data_by_year['30 Day MA'] - (tsla_data_by_year['30 Day STD'] * 2)

# Simple 30 Day Bollinger Band for Tesla
tsla_data_by_year[['Adj Close', '30 Day MA', 'Upper Band', 'Lower Band']].plot(figsize=(12,6))
plt.title('30 Day Bollinger Band for Tesla')
plt.ylabel('Price (USD)')
plt.show();

# This plot will show us the adjusted close, the rolling average, and the upper  
# and lower bands of the TSLA stock.

png

# Since we used a 30 day moving average, the starting 30 days do not have  
# bollinger bands information.
# We use dropna() to drop the nulls.
tsla_data_by_year = tsla_data_by_year.dropna()
tsla_data_by_year.head()
Date Open High Low Close Adj Close Volume Adj Close Tomorrow Stock Goes Up Tomorrow Position Number Of Stocks Held Money In Wallet 30 Day MA 30 Day STD Upper Band Lower Band
29 2010-08-10 19.65 19.65 18.82 19.03 19.03 1281300 17.90 False Hold 0.00 1660.38 20.041333 1.937226 23.915786 16.166880
30 2010-08-11 18.69 18.88 17.85 17.90 17.90 797600 17.60 False Hold 0.00 1660.38 19.841667 1.832744 23.507156 16.176178
31 2010-08-12 17.80 17.90 17.39 17.60 17.60 691000 18.32 True Buy 94.34 0.00 19.634000 1.714383 23.062765 16.205235
32 2010-08-13 18.18 18.45 17.66 18.32 18.32 634000 18.78 True Hold 94.34 0.00 19.512667 1.672380 22.857427 16.167907
33 2010-08-16 18.45 18.80 18.26 18.78 18.78 485800 19.15 True Hold 94.34 0.00 19.498667 1.676840 22.852346 16.144987
# Some fantastical python. 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
import numpy as np
from matplotlib import pyplot, dates

# Here we are saying we want to predict the column 'Stock Goes Up Tomorrow' by 
#  storing the column name in a variable.
predict = 'Stock Goes Up Tomorrow'
X = tsla_data_by_year
# Treat the date as a number
X['Date'] = X['Date'].dt.strftime('%Y%m%d')

# For each column, apply a LabelEncoder. Regression problems need numerical values 
#  or categorical values. 
# With columns like 'Position', we need to apply a LabelEncoding  
# to set 1 = Hold, 2 = Buy, 3 = Sell  
# (this is an example, the LabelEncoder will determine  
# the numerical values of the categories at runtime.)
for column in X.columns:
  if column != 'Date':
    if X[column].dtype == type(object):
        le = LabelEncoder()
        X[column] = le.fit_transform(X[column])

# Set the y dataset to just the single column we want to predict.
y = tsla_data_by_year[predict]

# Set the X dataset (what we will use to predict), to all the columns mentioned.
X = tsla_data_by_year[['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', '30 Day MA', '30 Day STD', 'Upper Band', 'Lower Band']]

# This is used to stratify. Learn more here: https://en.wikipedia.org/wiki/Stratified_sampling
targets = tsla_data_by_year[predict]

# This splits the dataset into training and testing. 60% of the data will be  
# used to train, 40% will be used to test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=101, stratify=targets)
X_test.head()
Date Open High Low Close Adj Close Volume 30 Day MA 30 Day STD Upper Band Lower Band
65 20100930 22.00 22.15 20.19 20.41 20.41 2195800 20.382333 0.786847 21.956028 18.808639
106 20101129 35.41 35.95 33.33 34.33 34.33 1145600 26.106333 5.340324 36.786982 15.425684
57 20100920 20.67 21.35 20.16 21.06 21.06 947500 19.866667 1.058318 21.983302 17.750031
99 20101117 30.20 30.75 28.61 29.49 29.49 750000 23.079667 3.586195 30.252057 15.907276
90 20101104 22.60 25.33 22.15 24.90 24.90 1874000 20.957667 0.907603 22.772872 19.142461
from sklearn import ensemble

# hyper parameters for the GradientBoostingRegressor algorithm.. More on this much later.
params = {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 3,
          'learning_rate': 0.01, 'loss': 'ls'}
clf = ensemble.GradientBoostingRegressor(**params)

# Fit the classifier with the training data.
clf.fit(X_train, y_train)

# Use the trained model to predict the testing dataset.
y_pred_original = clf.predict(X_test)
from sklearn.metrics import (confusion_matrix, precision_score, recall_score, f1_score, classification_report)
y_pred = y_pred_original > .5
y_pred = pd.DataFrame(y_pred)
y_pred = y_pred * 1
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# The confusion matrix will show us True/False Positives, True/False Negatives.
# This dataset is really small to get an accurate reading of the score,  
# but so far it looks like we're close to 50% accurate.
[[12  7]
 [14  7]]
              precision    recall  f1-score   support

       False       0.46      0.63      0.53        19
        True       0.50      0.33      0.40        21

    accuracy                           0.48        40
   macro avg       0.48      0.48      0.47        40
weighted avg       0.48      0.47      0.46        40

We're slightly worse than a coinflip! Let's see how this works.

predictions = clf.predict(X)
# The values of predictions are stored as a value from 0.00 to 1.00, but we need them 
#  as a true/false to work with our algorithm to calculate $$. Here I compare to .5 (threshold)  
# to determine if the prediction is true or false.
# You can manually adjust the threshold to get a better True Positive / True Negative rate,  
# sometimes it's beneficial if they're trying to reduce a particular metric.
predictions = predictions > .5
X['Stock Goes Up Tomorrow'] = predictions
# Same code as above, functionized. Use a dataset to determine  
# how much money we'll have made with our trades.
def howMuchMoneyDidWeMake(X):
  if('Money In Wallet' not in X ):
    X['Position'] = 'Hold'
    X['Number Of Stocks Held'] = 0
    X['Money In Wallet'] = 0
    X.at[0, 'Money In Wallet'] = 1000

  previousRow = ''
  for index, row in X.iterrows():
    # print(row)
    if(index > 0):
      row['Money In Wallet'] = previousRow['Money In Wallet']
      row['Number Of Stocks Held'] = previousRow['Number Of Stocks Held']
    if(row['Number Of Stocks Held'] == 0 and not row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Hold'
      # print(1)
    elif(row['Number Of Stocks Held'] == 0 and row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Buy'
      row['Number Of Stocks Held'] = row['Money In Wallet'] / row['Adj Close']
      row['Money In Wallet'] -= row['Number Of Stocks Held'] * row['Adj Close']
      # print(2)
    elif(row['Number Of Stocks Held'] > 0 and not row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Sell'
      row['Money In Wallet'] += row['Number Of Stocks Held'] * row['Adj Close']
      row['Number Of Stocks Held'] = 0
      # print(3)
    elif(row['Number Of Stocks Held'] > 0 and row['Stock Goes Up Tomorrow']):
      row['Position'] = 'Hold'
      # print(4)
    previousRow = row
    X.at[index] = row
    X = X.round(2)

  return X
X = X.reset_index()
X = howMuchMoneyDidWeMake(X)
X.tail(1)
index Date Open High Low Close Adj Close Volume 30 Day MA 30 Day STD Upper Band Lower Band Stock Goes Up Tomorrow Position Number Of Stocks Held Money In Wallet
99 128 20101230 27.7 27.9 26.38 26.5 26.5 2041100 31.28 2.63 36.55 26.02 False Hold 0.0 2917.96

Not bad! We almost tripled our money. We end up with $2917.

But there is a rookie mistake here. We're measuring our success with the same data we used to train the algorithm. We need another dataset to test this against. Let's pull the next year and try again.


# Let's pull 2011 data.
years_to_pull = [2011]
tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)

def addBollingerBands(df):
  df['30 Day MA'] = df['Adj Close'].rolling(window=30).mean()
  df['30 Day STD'] = df['Adj Close'].rolling(window=30).std() 
  df['Upper Band'] = df['30 Day MA'] + (df['30 Day STD'] * 2)
  df['Lower Band'] = df['30 Day MA'] - (df['30 Day STD'] * 2)
  df = df.dropna()
  return df

tsla_data_by_year = addBollingerBands(tsla_data_by_year)
tsla_data_by_year.head()
Date Open High Low Close Adj Close Volume 30 Day MA 30 Day STD Upper Band Lower Band
159 2011-02-14 23.639999 24.139999 23.049999 23.08 23.08 1283100 24.937333 1.720240 28.377814 21.496853
160 2011-02-15 23.010000 23.170000 22.559999 22.84 22.84 953700 24.811333 1.731142 28.273617 21.349049
161 2011-02-16 23.100000 24.969999 23.070000 24.73 24.73 4115100 24.746667 1.695178 28.137023 21.356310
162 2011-02-17 24.629999 25.490000 23.549999 23.60 23.60 2618400 24.639000 1.660516 27.960031 21.317969
163 2011-02-18 23.330000 23.490000 22.959999 23.18 23.18 2370700 24.482333 1.563047 27.608427 21.356240
# Function to add the predicted column to a dataset using a trained classifier
def addPredictedColumn(df, clf):
  df['Date'] = df['Date'].dt.strftime('%Y%m%d')
  # df["Adj Close Tomorrow"] = df["Adj Close"].shift(-1)

  df = df.dropna()

  for column in df.columns:
    if column != 'Date':
      if df[column].dtype == type(object):
          le = LabelEncoder()
          df[column] = le.fit_transform(df[column])


  predictions = clf.predict(df)
  predictions = predictions > .5
  df['Stock Goes Up Tomorrow'] = predictions
  return df

tsla_data_by_year = addPredictedColumn(tsla_data_by_year, clf)
tsla_data_by_year = tsla_data_by_year.reset_index()
tsla_data_by_year = howMuchMoneyDidWeMake(tsla_data_by_year)
tsla_data_by_year.tail(1)
index Date Open High Low Close Adj Close Volume 30 Day MA 30 Day STD Upper Band Lower Band Stock Goes Up Tomorrow Position Number Of Stocks Held Money In Wallet
222 381 20111230 28.49 28.98 28.25 28.56 28.56 339800 30.66 2.33 35.33 25.99 False Hold 0.0 1652.3

If we run a model trained in 2010 against data in 2011, we end up with a total of $1652.01. That's pretty bad. I think the issue is we're not taking into account new data about TSLA in 2011, we're simply using 2010 to predict 2011. That won't do.

What if we re-train the model every 30 days in 2011? That way the classifer 'resets' every 30 days to any new patterns discovered.

existing_df = tsla_data_by_year
years_to_pull = [2011]
tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)
tsla_data_by_year = addBollingerBands(tsla_data_by_year)
tsla_data_by_year = tsla_data_by_year.reset_index()
working_df = tsla_data_by_year[:30]
working_df = working_df.drop(columns=['index'])
working_df = addPredictedColumn(working_df, clf)
working_df = howMuchMoneyDidWeMake(working_df)

all_the_money = pd.concat([working_df], sort=True)

for i in range(1, 8):
  new_first_row = working_df[-1:]
  for column in working_df.columns:
    if column != 'Date':
      if working_df[column].dtype == type(object):
          le = LabelEncoder()
          working_df[column] = le.fit_transform(working_df[column])
  existing_df = pd.concat([existing_df, working_df], sort=True)
  working_df = tsla_data_by_year[30*i:30*i+30]
  existing_df["Adj Close Tomorrow"] = existing_df["Adj Close"].shift(-1)
  existing_df = existing_df.dropna()
  existing_df["Stock Goes Up Tomorrow"] = existing_df["Adj Close"] < existing_df["Adj Close Tomorrow"]
  predict = 'Stock Goes Up Tomorrow'
  X = existing_df
  for column in X.columns:
    if column != 'Date':
      if X[column].dtype == type(object):
          le = LabelEncoder()
          X[column] = le.fit_transform(X[column])

  y = existing_df[predict]
  X = existing_df[['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
        '30 Day MA', '30 Day STD', 'Upper Band', 'Lower Band']]
  targets = existing_df[predict]
  params = {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 3,
            'learning_rate': 0.01, 'loss': 'ls'}
  clf = ensemble.GradientBoostingRegressor(**params)
  clf.fit(X, y)
  working_df = working_df.drop(columns=['index'])
  working_df = addPredictedColumn(working_df, clf)
  working_df = pd.concat([new_first_row, working_df], sort=True)
  working_df = working_df.reset_index(drop=True)
  working_df = howMuchMoneyDidWeMake(working_df)
  all_the_money = pd.concat([all_the_money, working_df], sort=True)

all_the_money.tail(1)
30 Day MA 30 Day STD Adj Close Close Date High Low Lower Band Money In Wallet Number Of Stocks Held Open Position Stock Goes Up Tomorrow Upper Band Volume
13 30.66 2.33 28.56 28.56 20111230 28.98 28.25 25.99 0.0 322.57 28.49 Buy True 35.33 339800

Retraining the model every 30 days in 2011 we end up with 322 shares at $28.56, that amounts to $9,196.

That's pretty good!

Brb calling my broker..


Top comments (0)