DEV Community: Yavuz

Financial Data Processing with AdaBoost Regression

Yavuz — Thu, 19 Oct 2023 13:40:08 +0000

In this article, I present a Python script I wrote to analyze Bitcoin prices. It takes a machine learning approach based on AdaBoost regression. The script can serve as a starting point for those who want to perform financial analysis, create price predictions, or develop trading strategies.

First of all, I imported the libraries I will use in the code.

import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
import numpy as np

In the first step, I pulled Bitcoin price data with the symbol "BTC-USD" using the yfinance library. You can pull the price of any index or stock the same way. I limited the data to dates from "2023-01-01" onward because I wanted to analyze a more recent time period.

stock = yf.Ticker("BTC-USD")
data = stock.history(start="2023-01-01")

I then added a column that stores each date as an integer, because the regression model needs numeric input rather than datetime values. This makes it possible to use the date as a feature in the time series analysis.

data['Date_Int'] = pd.to_datetime(data.index).astype('int64')
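To illustrate what this conversion produces, here is a minimal sketch on a made-up three-day frame (the prices and the `Days` column are my own assumptions, not part of the original script). The integers are epoch timestamps, which are very large numbers; counting days since the first observation is one possible alternative that keeps the same ordering and is easier to read.

```python
import pandas as pd

# Hypothetical three-day frame standing in for the yfinance data
idx = pd.date_range("2023-01-01", periods=3, freq="D")
frame = pd.DataFrame({"Close": [16600.0, 16700.0, 16650.0]}, index=idx)

# Same conversion as in the article: datetime index -> int64 epoch timestamps
# (nanoseconds since 1970-01-01 in pandas 2.x)
frame["Date_Int"] = pd.to_datetime(frame.index).astype("int64")

# Assumed alternative: number of days since the first observation
frame["Days"] = (frame.index - frame.index[0]).days

print(frame)
```

Because decision trees only compare feature values against thresholds, either column should lead to the same kind of splits.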

I chose independent and dependent variables to process the data. The independent variable is set to "Date_Int" as a representation of the date column, while Bitcoin prices ("Close") are chosen as the dependent variable.

X = data[['Date_Int']].values
y = data['Close'].values

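In this step, I created an AdaBoostRegressor model. AdaBoost boosts a base regressor, here a decision tree whose depth is limited to 4 with max_depth. Note that n_estimators=1 fits only a single tree, so no further boosting rounds take place; increase this value to build a real ensemble.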

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
regr = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=1, random_state=1)
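As a rough sketch of how n_estimators affects the model, the snippet below fits the same regressor with 1 and with 50 estimators on synthetic data. The noisy sine curve and the variable names are assumptions made for illustration; they are not Bitcoin prices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

# Synthetic data (assumption): a noisy sine curve instead of real prices
rng = np.random.default_rng(0)
X_demo = np.arange(200, dtype=float).reshape(-1, 1)
y_demo = np.sin(X_demo.ravel() / 20.0) + rng.normal(scale=0.1, size=200)

# One estimator: the "ensemble" is a single depth-4 tree
single = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                           n_estimators=1, random_state=1).fit(X_demo, y_demo)

# Up to 50 estimators: boosting can stop early, so the count may be lower
boosted = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                            n_estimators=50, random_state=1).fit(X_demo, y_demo)

print("trees:", len(single.estimators_), len(boosted.estimators_))
print("in-sample R^2:", single.score(X_demo, y_demo), boosted.score(X_demo, y_demo))
```

The scores printed here are in-sample, so they only show how closely each model follows the training data.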

The model is trained on the selected dataset, so it learns the relationship between the date and the historical closing price.

regr.fit(X, y)

Using the trained model, I obtained predictions for the same dates the model was trained on, so these are fitted values rather than forecasts of future prices. I visualized these predictions and the actual price data on a graph on a logarithmic scale. I used a logarithmic rather than a linear scale because it is more readable, especially over long time periods with large price changes. This visualization shows how the predictions compare to the actual data.

y_pred = regr.predict(X)
y = np.log(y)
y_pred = np.log(y_pred)

plt.figure(figsize=(14, 7))
# Plot against the datetime index so the x-axis shows dates instead of raw integers
plt.scatter(data.index, y, color='blue', label='Logarithmic Real Prices')
plt.plot(data.index, y_pred, color='red', label='Logarithmic Predictions', linewidth=2)
plt.title('Boosted Decision Tree Regression - Bitcoin Prices (2023 - Now) - Logarithmic Scale')
plt.xlabel('Date')
plt.ylabel('Log Price')
plt.legend()
plt.grid(True)
plt.show()
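One point worth checking before using a model like this for forecasting is how it behaves on dates after the training window. The sketch below uses a synthetic upward trend (an assumption, not real price data) to show that tree-based regressors return the same value for any input beyond the training range, because every such input falls into the last leaf of each tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

# Synthetic, steadily rising series (assumption for illustration)
X_train = np.arange(100, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel() + 10.0

model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                          n_estimators=20, random_state=1)
model.fit(X_train, y_train)

# Two "future" inputs well beyond the last training value (99)
X_future = np.array([[150.0], [200.0]])
future = model.predict(X_future)

print("last training target:", y_train[-1])
print("predictions for 150 and 200:", future)
```

Both predictions are identical and do not exceed the largest training target, even though the underlying trend keeps rising.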

In particular, this code can help developers of financial software build applications that predict the prices of various stocks and indices. The values the code outputs can also be used to generate financial reports. You can find the full code below. Thank you.

import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
import numpy as np

stock = yf.Ticker("BTC-USD")
data = stock.history(start="2023-01-01")

data['Date_Int'] = pd.to_datetime(data.index).astype('int64')

X = data[['Date_Int']].values
y = data['Close'].values

regr = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=1, random_state=1)

regr.fit(X, y)

y_pred = regr.predict(X)
y = np.log(y)
y_pred = np.log(y_pred)

plt.figure(figsize=(14, 7))
# Plot against the datetime index so the x-axis shows dates instead of raw integers
plt.scatter(data.index, y, color='blue', label='Logarithmic Real Prices')
plt.plot(data.index, y_pred, color='red', label='Logarithmic Predictions', linewidth=2)
plt.title('Boosted Decision Tree Regression - Bitcoin Prices (2023 - Now) - Logarithmic Scale')
plt.xlabel('Date')
plt.ylabel('Log Price')
plt.legend()
plt.grid(True)
plt.show()

Autoscout24 SQL Analysis

Yavuz — Thu, 28 Sep 2023 22:15:44 +0000

In this article, I will share the various analyses I performed on a data set. To do some SQL analysis, I got the German cars dataset from Kaggle at this link:

https://www.kaggle.com/datasets/ander289386/cars-germany

This dataset contains various attributes of many new and used cars listed on the German autoscout24 site, such as price, brand, model, and gear type. Although the file is not large, it contains a variety of data, which makes it suitable not only for SQL but also for those who want to use matplotlib in Python.

The first analysis I did on this data set was a cumulative price analysis. My aim was to determine the running total of the value of the vehicles offered for sale, accumulated by model year.

SELECT DISTINCT Year,
       SUM(Price) OVER (ORDER BY Year) AS CumulativePrice
FROM autoscout24
ORDER BY Year DESC;
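To see what this query returns, here is a small sketch that runs it with Python's built-in sqlite3 module on a four-row table. The rows are invented for illustration, and window functions require SQLite 3.25 or newer. Rows with the same Year are peers in the window, so they all receive the same running total, and DISTINCT then collapses them into one row per year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE autoscout24 (Year INTEGER, Price INTEGER)")

# Invented sample rows: two cars from 2019, one from 2020, one from 2021
conn.executemany(
    "INSERT INTO autoscout24 VALUES (?, ?)",
    [(2019, 10000), (2019, 15000), (2020, 20000), (2021, 30000)],
)

rows = conn.execute(
    """
    SELECT DISTINCT Year,
           SUM(Price) OVER (ORDER BY Year) AS CumulativePrice
    FROM autoscout24
    ORDER BY Year DESC
    """
).fetchall()

print(rows)  # [(2021, 75000), (2020, 45000), (2019, 25000)]
```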

The second analysis is more practical than the first: it computes the average price of vehicles on the market by model year, so we can see the average price for each model year. I also grouped the averages by vehicle type. This makes it possible, for example, to compare the average price of a used vehicle with the average price of a new vehicle from the most recent year in the data.

SELECT Year,
       Type,
       AVG(Price) AS AveragePrice
FROM autoscout24
GROUP BY Year, Type
ORDER BY Year DESC, Type;

The purpose of my third analysis was to identify the most expensive and cheapest models of each brand on the market and their prices. In the output, the brands are listed from A to Z and, within each brand, the prices are listed from expensive to cheap. Therefore, you can see the most expensive and cheapest vehicles of each brand in order.

SELECT Year, Make, Model, Price
FROM autoscout24 AS A
WHERE Price IN (
    SELECT MAX(Price) FROM autoscout24 WHERE Make = A.Make
    UNION
    SELECT MIN(Price) FROM autoscout24 WHERE Make = A.Make
)
ORDER BY Make ASC, Price DESC;
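The correlated subquery can be tried out on a few invented rows with Python's sqlite3 module, as in the sketch below (the cars and prices are assumptions for illustration). For every row, the subquery computes the highest and lowest price of that row's brand, and the row is kept only if its own price matches one of the two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE autoscout24 (Year INTEGER, Make TEXT, Model TEXT, Price INTEGER)"
)

# Invented sample rows: three Audi listings and two BMW listings
conn.executemany(
    "INSERT INTO autoscout24 VALUES (?, ?, ?, ?)",
    [
        (2018, "Audi", "A3", 15000),
        (2019, "Audi", "A4", 25000),
        (2020, "Audi", "A8", 60000),
        (2019, "BMW", "320", 22000),
        (2021, "BMW", "X5", 70000),
    ],
)

extremes = conn.execute(
    """
    SELECT Year, Make, Model, Price
    FROM autoscout24 AS A
    WHERE Price IN (
        SELECT MAX(Price) FROM autoscout24 WHERE Make = A.Make
        UNION
        SELECT MIN(Price) FROM autoscout24 WHERE Make = A.Make
    )
    ORDER BY Make ASC, Price DESC
    """
).fetchall()

for row in extremes:
    print(row)
```

The mid-priced Audi A4 is filtered out, leaving the most expensive and cheapest listing for each brand.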

The purpose of the fourth analysis is to find the average price and horsepower of each vehicle model. Only models whose average horsepower is above the overall average are kept, and they are ranked by horsepower from highest to lowest.

SELECT Make, Model, AVG(Price) AS AvgPrice, AVG(HP) AS AvgHP
FROM autoscout24
GROUP BY Make, Model
HAVING AVG(HP) > (SELECT AVG(HP) FROM autoscout24)
ORDER BY AvgHP DESC;

For now, these are a few example data analyses I have done with SQL. I will also share more comprehensive analyses on my page. I hope it was useful.

Financial Portfolio Comparison

Yavuz — Tue, 26 Sep 2023 16:02:13 +0000

In this post, I will talk about two portfolio comparison scripts I wrote. My goal is to be useful to those who develop software for financial markets. The libraries I use in the code are yfinance, pandas, and matplotlib.

Let's start with the first script. This Python code compares the performance of a specific portfolio (which we call the Bayford Commodity Portfolio) with the GSCI Index over a specific period.

import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from datetime import datetime

index = yf.Ticker("GD=F")  # ticker used here for the GSCI Index
index_data = index.history(period="max")
symbols = ["SB=F", "ZL=F", "KC=F", "ZS=F", "ZO=F", "OJ=F"]

data = {}
for symbol in symbols:
    stock = yf.Ticker(symbol)
    stock_data = stock.history(period="max")
    data[symbol] = stock_data['Close']

portfolio_index = pd.DataFrame(data).mean(axis=1)

end_date = datetime.today().strftime('%Y-%m-%d')
start_date = "2022-09-26"
index_comparison = index_data.loc[start_date:end_date]
portfolio_comparison = portfolio_index.loc[start_date:end_date]

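# .iloc selects by position; plain [0] / [-1] on a date-indexed Series is deprecated in pandas 2.x
gsci_return = (index_comparison['Close'].iloc[-1] - index_comparison['Close'].iloc[0])/index_comparison['Close'].iloc[0]

portfolio_return = (portfolio_comparison.iloc[-1] - portfolio_comparison.iloc[0])/portfolio_comparison.iloc[0]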

print(f"GSCI INDEX: {gsci_return:.2%}")
print(f"BAYFORD COMMODITY PORTFOLIO: {portfolio_return:.2%}")

index_normalized = index_comparison['Close']/index_comparison['Close'].iloc[0]
portfolio_normalized = portfolio_comparison/portfolio_comparison.iloc[0]

plt.style.use('dark_background')
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(index_normalized, label="GSCI INDEX")
ax.plot(portfolio_normalized, label="BAYFORD COMMODITY PORTFOLIO")
text = plt.text(0.5, 0.5, 'BAYFORD ANALYTICS', fontsize=40, color='gray', ha='center', va='center', alpha=0.5, transform=ax.transAxes)
plt.legend()  # add the legend before saving so it appears in the PNG
plt.savefig('myfigure.png', dpi=900)
plt.show()

Using the yfinance library, I pulled historical data for a set of commodity futures and the GSCI Index. I used this data to create a portfolio series equal to the average of the closing prices of these contracts on each day. Then, I calculated the return of both the GSCI Index and the portfolio from exactly one year ago to today. Next, I normalized both series by dividing each by its first value, so that they start at the same level and the difference is easier to see on the chart. Normalizing does not change the returns, only how easily we can compare them on the graph; this matters most on long-term charts, where price changes are larger. Finally, I used the matplotlib library to visualize the two series.
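The return and normalization steps can be checked on a toy example. The sketch below uses two short made-up series (the values and the helper name `total_return` are assumptions) and shows that the last value of a normalized series is simply one plus the total return.

```python
import pandas as pd

# Made-up closing values for an index and a portfolio over four days
dates = pd.date_range("2022-09-26", periods=4, freq="D")
index_close = pd.Series([600.0, 612.0, 606.0, 630.0], index=dates)
portfolio_close = pd.Series([50.0, 51.0, 49.5, 52.0], index=dates)

def total_return(series):
    """Return over the whole period: last value relative to the first."""
    return series.iloc[-1] / series.iloc[0] - 1

# Normalizing rescales each series to start at 1.0
index_normalized = index_close / index_close.iloc[0]
portfolio_normalized = portfolio_close / portfolio_close.iloc[0]

print(f"index return: {total_return(index_close):.2%}")
print(f"portfolio return: {total_return(portfolio_close):.2%}")
print(index_normalized.iloc[-1], portfolio_normalized.iloc[-1])
```

Although the raw levels differ by a factor of about twelve, the normalized series can be drawn on the same axis and compared directly.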

Now for the second script. Like the first, it shows the difference between the portfolio and the index.

import yfinance as yf
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt

symbols = ["SB=F", "ZL=F", "KC=F", "ZS=F", "ZO=F", "OJ=F"]
weights = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

start_date = "2022-09-26"
end_date = "2023-09-26"

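# auto_adjust=False keeps the 'Adj Close' column, which recent yfinance versions drop by default
data = yf.download(symbols, start=start_date, end=end_date, auto_adjust=False)['Adj Close']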

daily_returns = data.pct_change()

portfolio_returns = daily_returns.dot(weights)

total_return = (portfolio_returns + 1).prod() - 1

print(f"Portfolio Return: {total_return:.2%}")

index = yf.Ticker("GD=F")  # ticker used here for the GSCI Index
index_data = index.history(start=start_date, end=end_date)

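# .iloc selects by position; plain [0] / [-1] on a date-indexed Series is deprecated in pandas 2.x
index_return = (index_data['Close'].iloc[-1] - index_data['Close'].iloc[0])/index_data['Close'].iloc[0]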

print(f"GSCI Index Return: {index_return:.2%}")

# Creating a graph
portfolio_normalized = (portfolio_returns + 1).cumprod()
index_normalized = index_data['Close']/index_data['Close'].iloc[0]

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(index_normalized, label="GSCI INDEX")
ax.plot(portfolio_normalized, label="BAYFORD COMMODITY PORTFOLIO")
text = plt.text(0.5, 0.5, 'BAYFORD ANALYTICS', fontsize=40, color='gray', ha='center', va='center', alpha=0.5, transform=ax.transAxes)
plt.legend()  # add the legend before saving so it appears in the PNG
plt.savefig('myfigure.png', dpi=900)
plt.show()

Both scripts compare the performance of the same portfolio with the GSCI Index. However, they use different calculation methods, which leads to different results. In the first script, the portfolio is the average of the closing prices of the contracts on each day. This makes the portfolio effectively price-weighted: a contract with a higher price level moves the average more than a cheaper one. In the second script, the portfolio return is the weighted average of the daily returns (percentage changes) of each contract. With the weights used here (1/6 each), every contract contributes equally regardless of its price level, which corresponds to an equally weighted portfolio that is rebalanced daily. The weights list can be changed to give greater weight to certain contracts, for example based on an investment strategy or risk tolerance. Because the two methods weight the contracts differently, they represent different investment strategies and therefore produce different results.
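A small numeric sketch makes the difference between the two methods concrete. The two assets and their prices below are invented for illustration: asset A is cheap and doubles, asset B is expensive and later halves.

```python
import numpy as np
import pandas as pd

# Invented prices for two assets over three days
prices = pd.DataFrame({"A": [10.0, 20.0, 20.0],
                       "B": [100.0, 100.0, 50.0]})

# Method 1 (first script): average the price levels, then take the return
price_avg = prices.mean(axis=1)                      # 55, 60, 35
price_avg_return = price_avg.iloc[-1] / price_avg.iloc[0] - 1

# Method 2 (second script): weighted average of daily returns, compounded
weights = np.array([0.5, 0.5])
daily_returns = prices.pct_change().dropna()         # day 2: +100%, 0%; day 3: 0%, -50%
portfolio_returns = daily_returns.dot(weights)       # +50%, -25%
rebalanced_return = (portfolio_returns + 1).prod() - 1

print(f"average of prices:      {price_avg_return:.2%}")   # about -36.36%
print(f"average of daily returns: {rebalanced_return:.2%}")  # 12.50%
```

In the first method the expensive asset B dominates the average, so its fall outweighs the doubling of A. In the second method both assets count equally each day, and the result is positive.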

I hope these projects will help those who want to develop portfolio management software at the entry level.