DEV Community: Viper

ImageBaker - Making Image Labelling Fun

Viper — Sun, 27 Apr 2025 07:22:57 +0000


Originally published at q-viper.github.io.

What is the most boring task in machine learning? As a software engineer focusing on computer vision who has shipped more than a dozen computer vision models in production, I would say image labeling. Yet, a machine learning engineer can’t escape this essential task—their models depend on it. So, how do we make it bearable? For me, I try to make it fun. And that’s how ImageBaker was born.

The Data Labeling Challenge

When clients request an anomaly detection system based on camera feeds, we need curated, labeled datasets to train our algorithms. However, without examples of anomalies, we can’t effectively detect them. After repeatedly facing the challenge of developing anomaly detection systems without adequate datasets, I created smaller tools to generate synthetic anomaly data. These experiments eventually led to ImageBaker, a comprehensive solution I hope to use in the long term and improve with community feedback.

Garbage In, Garbage Out

The performance of any machine learning model depends heavily on the quality and quantity of the labeled data it’s trained on. This is especially true for computer vision tasks. The process typically involves multiple cycles of labeling, training, and evaluation, making it time-consuming and often tedious.

What if we could generate multiple realistic labeled datasets from a single image? This approach could significantly reduce the time spent on manual labeling while maintaining the quality needed for effective model training.

What’s in a Name?

The concept behind ImageBaker involves extracting portions of an image (such as objects of interest) using tools like polygons or models such as Segment Anything. These extractions are treated as layers that can be copied, pasted, and manipulated to create multiple instances of the desired object.

By combining these layers step by step, we create new labeled images with annotations in JSON format. The term “baking” refers to the process of merging these layers into a single cohesive image, similar to how ingredients are combined and baked to create something new.
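To make the idea of baking concrete, here is a minimal sketch in plain Python. It is not ImageBaker's actual code: the `bake` function, the layer dictionary keys, and the tiny `leaf` pixel grid are all hypothetical names made up for illustration. Each layer is pasted onto a canvas in order, and a bounding box is recorded for every layer so the annotations come for free.

```python
import json

def bake(canvas_w, canvas_h, layers, background=0):
    """Merge layers (in order) onto one canvas and collect a bounding box per layer."""
    canvas = [[background] * canvas_w for _ in range(canvas_h)]
    annotations = []
    for layer in layers:
        pixels = layer["pixels"]         # 2D list; None marks a transparent pixel
        x0, y0 = layer["x"], layer["y"]  # top-left position of the layer on the canvas
        xs, ys = [], []
        for r, row in enumerate(pixels):
            for c, value in enumerate(row):
                cx, cy = x0 + c, y0 + r
                if value is None or not (0 <= cx < canvas_w and 0 <= cy < canvas_h):
                    continue
                canvas[cy][cx] = value   # later layers overwrite earlier ones
                xs.append(cx)
                ys.append(cy)
        if xs:
            annotations.append({
                "label": layer["label"],
                "bbox": [min(xs), min(ys), max(xs), max(ys)],  # x_min, y_min, x_max, y_max
            })
    return canvas, annotations

# A hypothetical 3x2 "leaf" cut out of a source image
leaf = [[None, 7, None],
        [7, 7, 7]]
layers = [
    {"label": "leaf", "pixels": leaf, "x": 0, "y": 0},
    {"label": "leaf", "pixels": leaf, "x": 3, "y": 2},  # a pasted copy of the same layer
]
baked, annotations = bake(6, 4, layers)
print(json.dumps(annotations))
```

One extracted object pasted twice yields two labeled instances, which is the core reason a single source image can produce many training samples.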

Key Features of ImageBaker

An example of baked images. (Each object is a layer, and annotations are automatically extracted for all layers.)

1. Annotate with Ease

Load a folder of images and annotate them using bounding boxes or polygons. The intuitive interface makes labeling faster and more precise.

Annotation page

2. Model Testing

Define models for detection, segmentation, and prompts (e.g., points or rectangles) by following the base model structure. This allows you to test how your models perform on different data variations.

On the annotation page shown above, the bottom-right corner displays DummyDetectionModel; this is where we select one of the models we have defined, which then runs in the backend. When we hit the Predict button, the prompts are passed to the selected model along with the loaded image, and the model’s results are drawn back onto the image as annotations.

3. Layerify Images

Crop images based on annotations to create reusable layers. Each cropped image represents a single object that can be manipulated independently.

A sample baker page.

4. Bake Custom States

Arrange layers to create image variations by dragging, rotating, adjusting opacity, and more. Save these arrangements as states with a simple button click or keyboard shortcut.

States can be saved with the Save button or with the shortcut Ctrl + S.

We can also draw on a selected layer with a brush.

5. Export for Training

Export the final annotated JSON and baked multilayer images for use in training your computer vision models.

A sample exported annotated image with fake leaves.

Powerful Shortcuts for Productivity

Ctrl + C: Copy selected annotation/layer.
Ctrl + V: Paste copied annotation/layer in its parent image/layer if it is currently open.
Delete: Delete selected annotation/layer.
Left Click: Select an annotation/layer at the mouse position.
Left Click + Drag: Drag a selected annotation/layer.
Double Left Click: When using polygon annotation, completes the polygon.
Right Click: Deselect an annotation/layer. While annotating the polygon, undo the last point.
Ctrl + Mouse Wheel: Zoom In/Out on the mouse position, i.e., resize the viewport.
Ctrl + Drag: If done on the background, the viewport is panned.
Ctrl + S: Save State on Baker Tab.
Ctrl + D: Draw Mode on Baker Tab. Drawing can happen on a selected or main layer.
Ctrl + E: Erase Mode on Baker Tab.
Wheel: Change the size of the drawing pointer.

The generated images can also be tested within the application to see how the model performs on them. If the model predicts poorly on a generated image, we can retrain the model with that image.

I have also made a video about the project on YouTube, which can be viewed below.

I made this application in a weekend, so it could still contain bugs. To make it easy to use, the project is open source, and I hope that more people will find it useful and help make the app more stable.

MySQL Triggers

Viper — Sat, 25 Jun 2022 11:52:13 +0000

Triggers in SQL

Originally published in dataqoil.com.

Triggers in SQL are a way to invoke something in response to events on the table to which the trigger is attached. Examples of such events are Insert, Update and Delete. Triggers are of two types, Row Level and Statement Level. A row level trigger fires once for each affected row, while a statement level trigger fires once per statement, no matter how many rows the statement affects.

Why do we need trigger?

In Data Engineering or Data Pipelining, to reflect changes in the data without having to listen for them.
To perform data validation by executing a trigger before inserting data. Examples include integrity checks.
To handle database layer errors.
To record the history of the data changes.
To achieve some kind of table monitoring functionalities.

Triggers in MySQL

MySQL provides only row level triggers.

Syntax

CREATE TRIGGER name_of_trigger
{BEFORE | AFTER} {INSERT | UPDATE | DELETE}
ON table_name FOR EACH ROW
body_of_trigger;

A trigger’s body can be a single statement or several; a multi-statement body is enclosed between BEGIN and END.

While using Update, we can access both the existing value and the new value (existing as OLD and new as NEW), and we can compare them too. Example: to compare the old and new values of a column age, we can write OLD.age != NEW.age.
While using Insert, we can access the new value using the NEW keyword.
While using Delete, we can access the old value using the OLD keyword.
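The OLD and NEW rules above can be tried without a MySQL server. The sketch below is an assumption-laden stand-in: it uses Python's built-in `sqlite3` module, so the SQL is SQLite's trigger dialect rather than MySQL's, and the simplified tables and trigger names are made up for this illustration. The behaviour of OLD and NEW per event is the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()
cur.executescript("""
CREATE TABLE student_bio (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, age REAL);
CREATE TABLE student_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    student_name TEXT, old_age REAL, new_age REAL, operation TEXT
);

-- Insert: only NEW is available
CREATE TRIGGER after_student_insert AFTER INSERT ON student_bio
FOR EACH ROW
BEGIN
    INSERT INTO student_logs (student_name, old_age, new_age, operation)
    VALUES (NEW.name, NULL, NEW.age, 'insert');
END;

-- Update: both OLD and NEW are available and can be compared
CREATE TRIGGER after_student_update AFTER UPDATE ON student_bio
FOR EACH ROW
WHEN OLD.age != NEW.age
BEGIN
    INSERT INTO student_logs (student_name, old_age, new_age, operation)
    VALUES (OLD.name, OLD.age, NEW.age, 'update');
END;

-- Delete: only OLD is available
CREATE TRIGGER after_student_delete AFTER DELETE ON student_bio
FOR EACH ROW
BEGIN
    INSERT INTO student_logs (student_name, old_age, new_age, operation)
    VALUES (OLD.name, OLD.age, NULL, 'delete');
END;
""")

cur.execute("INSERT INTO student_bio (name, age) VALUES ('John', 15)")
cur.execute("UPDATE student_bio SET age = 16 WHERE name = 'John'")
cur.execute("DELETE FROM student_bio WHERE name = 'John'")

logs = cur.execute(
    "SELECT student_name, old_age, new_age, operation FROM student_logs ORDER BY id"
).fetchall()
for row in logs:
    print(row)
```

Each of the three statements leaves one row in the log table, with the missing side (OLD on insert, NEW on delete) stored as NULL.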

Alert After Insert

Let’s insert a row into the logs after inserting the values.

First of all, let’s create a database, Student, in MySQL.

create database Student;

Create table, student_bio.

create table Student.student_bio (
                        id INT AUTO_INCREMENT PRIMARY KEY,
                        `name` varchar(255),
                        class varchar(255),
                        age float
                        );

Create table, student_logs

CREATE TABLE Student.student_logs (
    id INT AUTO_INCREMENT PRIMARY KEY,
    student_name varchar(255) NOT NULL,
    student_age float NOT NULL,
    created_date DATETIME DEFAULT NULL,
    operation VARCHAR(50) DEFAULT NULL
);

Create a trigger to log info on logs on inserting.

CREATE TRIGGER Student.after_student_insert 
    after insert ON Student.student_bio
    FOR EACH ROW 
 INSERT INTO Student.student_logs
 SET operation = 'insert',
     student_name = new.name,
     student_age = new.age,
     created_date = NOW();

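Insert some data into it.

-- id is AUTO_INCREMENT, so it is left out; repeating id 1 would violate the primary key
INSERT into Student.student_bio(`name`, class, age) values('John', 5, 15), ('Johnny', 7, 25);

Now look into Student.student_logs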

Alert Before Insert

Let’s write to the logs before inserting the values.

Define a trigger as:

delimiter // 
CREATE TRIGGER Student.before_student_insert 
    before insert ON Student.student_bio
    FOR EACH ROW 

 begin
 INSERT INTO Student.student_logs (student_name, student_age, created_date, operation) values(new.name, new.age,now(), 'insert_before');
 end
 //
 delimiter ;

Now insert some data:

INSERT into Student.student_bio(`name`, class, age) values('Diwo', 5, 15), ('Ben', 7, 25);

Now see the data of student_logs

Alert Before Update

Let’s create a trigger that checks the new value before updating. If the new age is greater than the old age, set age to the average of the two; otherwise keep the old age. Additionally, write a row to the logs too.

Create a trigger as:

 delimiter // 
CREATE TRIGGER Student.before_student_update
    before update ON Student.student_bio
    FOR EACH ROW 

 begin
if old.age<new.age then set new.age=(old.age+new.age)/2;
    else set new.age=old.age; 
 end if;
 INSERT INTO Student.student_logs (student_name, student_age, created_date, operation) values(old.name, new.age,now(), 'update_before');
 end
 //
 delimiter ;

Now update student_bio as:

update Student.student_bio set age =10 where class=5;

Again, update student_bio as:

update Student.student_bio set age =20 where class=5;

In the first update, the condition was False (the new age 10 is not greater than the old age 15), so the age was not changed. In the second update, the condition is True, so the age was set to the average of the two, (15 + 20) / 2 = 17.5.
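The rule inside the trigger body can be checked by hand with a few lines of Python. This is only an emulation of the IF/ELSE logic, not something MySQL runs; the function name is made up, and the starting age of 15 assumes the rows with class 5 that were inserted earlier.

```python
def age_after_before_update_trigger(old_age, new_age):
    """Emulate the trigger body: average when the age increases, otherwise keep the old age."""
    if old_age < new_age:
        return (old_age + new_age) / 2
    return old_age

# First update asks for age 10, second update asks for age 20
first = age_after_before_update_trigger(15, 10)
second = age_after_before_update_trigger(first, 20)
print(first, second)  # 15 17.5
```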

Alert Before Delete

Will be updated soon.

Drawbacks

Now that we know the benefits and the use cases, let’s get into the drawbacks of Triggers:

They increase server overhead and can cause server hang-ups.
They are difficult to test because they are run by the database itself.
They can be used for advanced data validation, but simple validation can be achieved with constraints like UNIQUE, NOT NULL, CHECK, FOREIGN KEY etc.

Python for Stock Market Analysis: Exploring Technical Trend Indicators

Viper — Wed, 30 Mar 2022 13:40:19 +0000

Introduction

Hello and welcome back, everyone, to the second part of our new blog series Python for Stock Market Analysis. In the last part, we explored different types of moving averages like Simple Moving Average (SMA), Exponential Moving Average (EMA) and Weighted Moving Average (WMA), along with other moving metrics like Moving Median and Moving Variance. Until now, we were looking only at the trend over a period of time. These simple metrics are used under the hood to make assumptions about the stock market. In this blog, we will explore some of the popular stock market metrics that are based on Moving Averages. Please refer to the interactive version of this blog if you want to see the interactive plots.

Disclaimer: This blog is for educational purpose only and we do not recommend taking the knowledge gained from this blog to implement in real financial exercises.

Technical indicators in stock markets are categorized in many ways and some of the most common are:

Trend Indicators
Momentum Indicators
Volatility Indicators
Volume Indicators

All four of the above are used to either predict or alert us about the future of the stock. Indicators are often described as leading or lagging. Leading indicators try to anticipate a price rise or trend by using short term moving averages (like the EMA of period 12 in MACD (Moving Average Convergence Divergence)). Lagging indicators describe what has already happened and might continue to happen, like EMAs of different periods.

Before diving into the coding part, let’s read our data.

import pandas as pd
import plotly.express as px
import cufflinks
import plotly.io as pio 
import yfinance as yf
import warnings 
warnings.filterwarnings("ignore")
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
pio.renderers.default = "notebook" # should change by looking into pio.renderers

pd.options.display.max_columns = None


symbols = ["AAPL"]

df = yf.download(tickers=symbols)
df.head()


# convert column names into lowercase
df.columns = [c.lower() for c in df.columns]
df.rename(columns={"adj close":"adj_close"},inplace=True)

Trend Indicators

Trend indicators are a basic way to visualize the flow of a stock’s performance over the course of time (daily, weekly, monthly, in the last 3 weeks etc). We can apply these indicators to a stock’s performance measures like volume, price and transactions. There are different kinds of trend indicators, and we explored some of them in the previous blog, where we looked at the trend of Open, High, Low and Volume in Apple’s floorsheet data. The trend itself does not predict anything about a future price rise or fall, but we can make some inferences based on the recent performance of the stock.

Despite the price being high/low throughout the day, most traders find the closing price to be the most important for describing the performance of the stock on that day. So, we will calculate most single-variable indicators based on the Closing Price of that day.

Some of the popular trend indicators are:

Moving Average
Guppy Multiple Moving Average
Moving Average Convergence Divergence

We will calculate these in our data next.

Moving Averages

Moving Averages are common trend indicators and the building blocks of popular indicators like GMMA (Guppy Multiple Moving Average), MACD (Moving Average Convergence Divergence) and PPO (Percentage Price Oscillator). But first, let’s write a simple function that gives us the moving average for a given window.

import numpy as np

def moving_average(series, window=5, kind="sma"):
    if kind=="sma":
        return series.rolling(window=window, min_periods=window).mean()
    elif kind=="ema":
        # exponentially weighted mean; a rolling mean here would just repeat the SMA
        return series.ewm(span=window, min_periods=window, adjust=False).mean()
    elif kind=="wma":
        # linear weights 1..window, so the latest value weighs the most
        return series.rolling(window=window, min_periods=window).apply(lambda x: np.average(x, weights=np.arange(1, window+1,1)))




tdf = df.copy()

window=30
tdf[f"close_sma_{window}"] = moving_average(tdf.close, window=window)
tdf[f"close_ema_{window}"] = moving_average(tdf.close, window=window, kind="ema")
tdf[f"close_wma_{window}"] = moving_average(tdf.close, window=window, kind="wma")
tdf

Trend of Closing Price Over a 30-Day Period

cols = [c for c in tdf.columns if "close" in c and "adj" not in c]
tdf[cols].iplot(kind="line")

The above plot looks a little spiky, but if we zoom in a bit, we can see the changes in the closing price and how the moving averages follow them. The EMA stays closer to the actual trend of the close because it gives more importance to the latest values through its decay term.
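To see why the weighted averages react faster than the plain mean, here is a small sketch in plain Python with made-up prices. The `ema` function follows the usual recursion with alpha = 2 / (span + 1), which is the same recursion pandas applies for `ewm(span=..., adjust=False)`; the `wma` function uses linear weights 1..n. Both function names are just for this illustration.

```python
def ema(values, span):
    """Recursive EMA: each new value gets weight alpha, the running average keeps 1 - alpha."""
    alpha = 2 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def wma(window_values):
    """Weighted average with linear weights 1..n, so the latest value weighs the most."""
    weights = range(1, len(window_values) + 1)
    return sum(w * v for w, v in zip(weights, window_values)) / sum(weights)

prices = [10, 10, 10, 10, 20]         # made-up series with a jump on the last day
sma_last = sum(prices) / len(prices)  # plain mean of the window
wma_last = wma(prices)
ema_last = ema(prices, span=5)[-1]

print(sma_last, round(wma_last, 3), round(ema_last, 3))
```

After the jump, the SMA moves to 12.0 while the WMA and the EMA both move to about 13.333, because they put more weight on the latest value.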

Guppy Multiple Moving Average (GMMA)

GMMA is a technical indicator where we use two groups of EMAs (12 in total) and compare their flow over time to make assumptions. Guppy in GMMA comes from the Australian trader Daryl Guppy.

The two groups of EMAs are Long term EMA and Short term EMA. The periods of the long term EMAs (6 terms) are typically set to 30, 35, 40, 45, 50 and 60, and the short term EMA periods to 3, 5, 8, 10, 12 and 15.
When the SEMA (Short term EMA) group moves above the LEMA (Long term EMA) group, it indicates that a price rise in the stock could be happening.
Conversely, when SEMA moves below the LEMA, it indicates that a price fall in the stock could be happening.
A trade is suggested when one group crosses over the other: SEMA crossing above LEMA is commonly read as a buy signal, and SEMA crossing below LEMA as a sell signal.
The separation between the two EMA groups gives the strength of the trend: the larger the difference between the two groups of EMAs, the stronger the rise/fall.
Reversals are when SEMA crosses over LEMA and vice versa.
A bullish crossover happens when SEMA crosses above the LEMA, and it indicates a bullish reversal.
A bearish crossover happens when SEMA crosses below the LEMA, and it indicates a bearish reversal.
When the two groups are moving horizontally or parallel to each other, no trend is present.
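The reading rules above can be turned into a rough check for a single day. This is only one simplistic way to encode them, not a standard GMMA formula: the function names, the "whole group above the other group" criterion, and the EMA values are assumptions made for illustration.

```python
def gmma_state(short_emas, long_emas):
    """Classify one day from the two EMA groups (values for a single date)."""
    if min(short_emas) > max(long_emas):
        return "bullish"  # whole short term group sits above the long term group
    if max(short_emas) < min(long_emas):
        return "bearish"  # whole short term group sits below the long term group
    return "mixed"        # groups overlap: no clear trend

def group_separation(short_emas, long_emas):
    """Gap between the group averages; a larger magnitude suggests a stronger trend."""
    return sum(short_emas) / len(short_emas) - sum(long_emas) / len(long_emas)

# Made-up EMA values for one day (periods 3..15 and 30..60)
short_group = [105, 104, 103, 102, 101, 100]
long_group = [95, 94, 93, 92, 91, 90]

state = gmma_state(short_group, long_group)
separation = group_separation(short_group, long_group)
print(state, separation)  # bullish 10.0
```

A crossover is then a change of state between two consecutive days, for example from "bearish" or "mixed" to "bullish".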

Calculation of GMMA

Calculate EMAs for both short term and long term trends.
Plot them and check whether a trend forms or not.

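def guppy_multiple_ma(tdf, col="close", sma=None, lma=None):
    """
        sma: short term EMA periods, default [3, 5, 8, 10, 12, 15]
        lma: long term EMA periods, default [30, 35, 40, 45, 50, 60]
    """

    # None instead of [] as default avoids sharing one mutable list between calls
    if not sma:
        sma = [3, 5, 8, 10, 12, 15]
    if not lma:
        lma = [30, 35, 40, 45, 50, 60]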


    for sm in sma:
        tdf[f"sema_{col}_{sm}"] = tdf[col].ewm(span=sm, min_periods=sm, adjust=False).mean()
    for lm in lma:
        tdf[f"lema_{col}_{lm}"] = tdf[col].ewm(span=lm, min_periods=lm, adjust=False).mean()
    return tdf
tdf = guppy_multiple_ma(tdf, col="close") 
tdf

Viewing GMMA with Candlestick

Let’s use a candlestick chart to visualize OHLC and the trend at the same time. A candle is shown green if the closing price is higher than the opening price and red if it is lower. The top of the wick is the high and the bottom of the wick is the low. The top edge of the rectangle is the open if the close is lower than the open; otherwise it is the close. An example of a candlestick is:

In above image, green represents where closing is greater than the opening price.

import plotly.graph_objects as go

layout = go.Layout(
    autosize=False,
    width=1000,
    height=1000,

    xaxis= go.layout.XAxis(linecolor = 'black',
                          linewidth = 1,
                          mirror = True),

    yaxis= go.layout.YAxis(linecolor = 'black',
                          linewidth = 1,
                          mirror = True),

)
fig=go.Figure(layout=layout)

lastn = 1000
ldf = tdf[-lastn:]
fig.add_trace(go.Candlestick(x=ldf.index,
                open=ldf['open'],
                high=ldf['high'],
                low=ldf['low'],
                close=ldf['close'], 
                 name = 'OHLC Market Data'))

for s in tdf.columns:
    # go.Line is deprecated in plotly; go.Scatter with mode="lines" draws the same line
    if "sema" in s:
        fig.add_trace(go.Scatter(x=ldf.index, y=ldf[s], mode="lines",
                      line=dict(color='rgb(104, 204, 204)'),
                      name=s.upper()))
    if "lema" in s:
        fig.add_trace(go.Scatter(x=ldf.index, y=ldf[s], mode="lines",
                      line=dict(color='rgb(255, 24, 24)'),
                      name=s.upper()))

fig.update_layout(
    title= "AAPL Stock Data",
    yaxis_title="Stock's Price in USD",
    xaxis_title="Date")               

fig.show()

Looking over the last 1000 days of the trends, a crossover can be seen around November 2, where SEMA was crossing below the LEMA, which is a sign of a price fall. Similarly, after February 15, SEMA crossed above the LEMA again, which is a sign of a price rise.

The above plot is interactive in the interactive version of this blog; please refer to it for the interactive plots.


Percentage Price Oscillator

This is a momentum indicator (it determines the strength or weakness of a value), but we can view volatility with it too.
Two EMAs, of 26 periods and 12 periods, are used to calculate PPO.
It contains 2 lines, the PPO line and the signal line. The signal line is the 9 period EMA of the PPO, so it moves slower than the PPO.
When the PPO line crosses the signal line, it signals a possible rise/fall in the price of the stock.
When the PPO line crosses above the signal line from below, it is a buy signal. Conversely, it is a sell signal when the PPO line crosses below the signal line from above.
When the PPO line is below 0, the short term average is below the long term average, which helps indicate a fall in price.
Conversely, when the PPO line is above 0, the short term average is above the long term average, which helps indicate a rise in price.

Calculation

Calculate the 12 and 26 period EMA of Closing Price.
Apply the EMAs in the formula below to get the current PPO value:
[PPO = \frac{\text{12 Period EMA} - \text{26 Period EMA}}{\text{26 Period EMA}} \times 100]
[\text{signal\_line} = \text{9 Period EMA of PPO}]
[\text{PPO\_histogram} = \text{PPO} - \text{Signal Line}]
Calculate the signal line as the 9 Period EMA of the PPO generated from the above step.
Because PPO is a percentage, we can compare different assets in terms of performance and volatility even when the assets vary significantly in price.
MACD (Moving Average Convergence Divergence) is similar to PPO in the sense that both compare two EMAs. The main difference is that PPO measures the percentage difference between the two EMAs, while MACD measures the absolute difference.
RSI (Relative Strength Index) is similar to PPO in that both are momentum oscillators. The main difference is that RSI does not compare two EMAs; it measures the magnitude of recent price changes.
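The point about comparing assets with very different prices can be checked with a small experiment. The sketch below uses plain Python and two made-up series where one is exactly ten times the other; the helper names are only for this illustration. PPO comes out the same for both, while the absolute difference of the two EMAs (what MACD measures) scales with the price.

```python
def ema(values, span):
    """Recursive EMA with alpha = 2 / (span + 1), seeded with the first value."""
    alpha = 2 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def macd_and_ppo(prices, short=12, long=26):
    """Return the absolute (MACD) and percentage (PPO) difference of the two EMAs."""
    s, l = ema(prices, short), ema(prices, long)
    macd = [a - b for a, b in zip(s, l)]
    ppo = [(a - b) / b * 100 for a, b in zip(s, l)]
    return macd, ppo

cheap = [10 + 0.1 * i for i in range(60)]  # made-up rising series
pricey = [p * 10 for p in cheap]           # same shape, ten times the price

macd_c, ppo_c = macd_and_ppo(cheap)
macd_p, ppo_p = macd_and_ppo(pricey)

print(round(ppo_c[-1], 4), round(ppo_p[-1], 4))    # same percentage for both assets
print(round(macd_c[-1], 4), round(macd_p[-1], 4))  # absolute gap is ten times larger
```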

def ppo(tdf, col="close", sm=12, lm=26):


    tdf[f"sema_{col}_{sm}"] = tdf[col].ewm(span=sm, min_periods=sm, adjust=False).mean()
    tdf[f"lema_{col}_{lm}"] = tdf[col].ewm(span=lm, min_periods=lm, adjust=False).mean()

    tdf["ppo"] = (tdf[f"sema_{col}_{sm}"]-tdf[f"lema_{col}_{lm}"]) / tdf[f"lema_{col}_{lm}"] * 100
    tdf["signal_line"] = tdf.ppo.ewm(span=9, min_periods=9, adjust=False).mean()
    tdf["ppo_hist"] = tdf["ppo"]-tdf["signal_line"]

    return tdf
tdf = df.copy()
tdf=ppo(tdf)
tdf


from plotly.subplots import make_subplots

fig=make_subplots(specs=[[{"secondary_y": True}]])

lastn = 1000
ldf = tdf[-lastn:]
fig.add_trace(go.Candlestick(x=ldf.index,
                open=ldf['open'],
                high=ldf['high'],
                low=ldf['low'],
                close=ldf['close'], 
                 name = 'OHLC Market Data'))

for s in tdf.columns:
    # go.Line is deprecated in plotly; go.Scatter with mode="lines" draws the same line
    if "sema" in s:
        fig.add_trace(go.Scatter(x=ldf.index, y=ldf[s], mode="lines",
                      line=dict(color='rgb(104, 204, 204)'),
                      name=s.upper()))
    if "lema" in s:
        fig.add_trace(go.Scatter(x=ldf.index, y=ldf[s], mode="lines",
                      line=dict(color='rgb(255, 24, 24)'),
                      name=s.upper()))

clrred = 'rgb(222,0,0)'
clrgrn = 'rgb(0,222,0)'
# red bars where the histogram is negative, green otherwise
clrs = [clrred if p<0 else clrgrn for p in ldf.ppo_hist]

fig.add_trace(go.Scatter(x=ldf.index, y=ldf.ppo, mode="lines", name="PPO"),secondary_y=True)
fig.add_trace(go.Bar(x=ldf.index, y=ldf.ppo_hist, name="PPO_Hist", marker=dict(color=clrs)),secondary_y=True)
fig.add_trace(go.Scatter(x=ldf.index, y=ldf.signal_line, mode="lines", name="Signal_Line"),secondary_y=True)


fig.update_layout(
    title= "AAPL Stock Data (PPO Plot)",
    yaxis_title="Stock's Price in USD",
    xaxis_title="Date")  

fig.show()

In the above plot, the color of the histogram changes once a crossover happens, which lets us make assumptions based on the color. Also, by plotting the candlestick, we can see the daily performance and the performance over a period of time at the same time.

Reference

Percentage Price Oscillator-Investopedia

Moving Average Convergence Divergence (MACD)

MACD is often considered an oscillator indicator, but it also shows the trend and some sort of volatility over a period of time by subtracting the 26 period EMA from the 12 period EMA. A period in this case can be a day, week, month and so on, so the periods can be changed according to our need. This is the same as the PPO except that we do not take the percentage.

Calculation

When MACD crosses above 0, the trend is considered bullish; conversely, when MACD crosses below 0, it is considered bearish.
When the MACD forms highs or lows that diverge from the corresponding highs and lows on the price, it is called a divergence.
A bullish divergence appears when the MACD forms two rising lows that correspond with two falling lows on the price. This is a valid bullish signal when the long-term trend is still positive.
The signal line is plotted on top of the MACD line. The signal line is the 9 period EMA of the MACD.
When MACD crosses below the signal line, it is a sell signal, and when MACD crosses above the signal line, it is a buy signal.
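The signal line rule can be sketched as a small crossover detector. This is an illustrative helper in plain Python, not part of any library; the function name and the MACD and signal values are made up. A cross above the signal line is marked as a buy and a cross below as a sell.

```python
def crossover_signals(macd, signal):
    """Return (index, 'buy' or 'sell') wherever the MACD line crosses the signal line."""
    signals = []
    for i in range(1, len(macd)):
        prev_diff = macd[i - 1] - signal[i - 1]
        diff = macd[i] - signal[i]
        if prev_diff <= 0 < diff:
            signals.append((i, "buy"))   # MACD crossed above the signal line
        elif prev_diff >= 0 > diff:
            signals.append((i, "sell"))  # MACD crossed below the signal line
    return signals

macd_line = [-0.5, -0.2, 0.3, 0.6, 0.2, -0.1]      # made-up values
signal_line = [-0.3, -0.25, 0.0, 0.4, 0.35, 0.1]   # made-up values

signals = crossover_signals(macd_line, signal_line)
print(signals)  # [(1, 'buy'), (4, 'sell')]
```

The sign of `macd - signal` is exactly what the histogram bars show, so each signal lines up with a color change in the histogram.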

def macd(tdf, col="close", sm=12, lm=26):


    tdf[f"sema_{col}_{sm}"] = tdf[col].ewm(span=sm, min_periods=sm, adjust=False).mean()
    tdf[f"lema_{col}_{lm}"] = tdf[col].ewm(span=lm, min_periods=lm, adjust=False).mean()

    tdf["macd"] = (tdf[f"sema_{col}_{sm}"]-tdf[f"lema_{col}_{lm}"])
    tdf["signal_line"] = tdf.macd.ewm(span=9, min_periods=9, adjust=False).mean()
    tdf["macd_hist"] = tdf["macd"]-tdf["signal_line"]

    return tdf
tdf = df.copy()
tdf=macd(tdf)
tdf


from plotly.subplots import make_subplots

fig=make_subplots(specs=[[{"secondary_y": True}]])

lastn = 1000
ldf = tdf[-lastn:]
fig.add_trace(go.Candlestick(x=ldf.index,
                open=ldf['open'],
                high=ldf['high'],
                low=ldf['low'],
                close=ldf['close'], 
                 name = 'OHLC Market Data'))

for s in tdf.columns:
    # go.Line is deprecated in plotly; go.Scatter with mode="lines" draws the same line
    if "sema" in s:
        fig.add_trace(go.Scatter(x=ldf.index, y=ldf[s], mode="lines",
                      line=dict(color='rgb(104, 204, 204)'),
                      name=s.upper()))
    if "lema" in s:
        fig.add_trace(go.Scatter(x=ldf.index, y=ldf[s], mode="lines",
                      line=dict(color='rgb(255, 24, 24)'),
                      name=s.upper()))

clrred = 'rgb(222,0,0)'
clrgrn = 'rgb(0,222,0)'
# red bars where the histogram is negative, green otherwise
clrs = [clrred if p<0 else clrgrn for p in ldf.macd_hist]

fig.add_trace(go.Scatter(x=ldf.index, y=ldf.macd, mode="lines", name="MACD"),secondary_y=True)
fig.add_trace(go.Bar(x=ldf.index, y=ldf.macd_hist, name="MACD_Hist", marker=dict(color=clrs)),secondary_y=True)
fig.add_trace(go.Scatter(x=ldf.index, y=ldf.signal_line, mode="lines", name="Signal_Line"),secondary_y=True)


fig.update_layout(
    title= "AAPL Stock Data (MACD Plot)",
    yaxis_title="Stock's Price in USD",
    xaxis_title="Date")  

fig.show()

The above plot looks similar to the PPO plot because both use the same EMAs; the only difference is that PPO takes the percentage. Below is a zoomed version.

References

MACD-Fidelity.com

Conclusion

In this blog, we explored some of the popular trend indicators like GMMA, PPO and MACD. In the next blog, we will explore other indicators and so on. This blogging series will not end soon :P.

Python for Stock Market Analysis: Working with Moving Averages

Viper — Mon, 14 Mar 2022 13:43:23 +0000

Introduction

Originally published in dataqoil.com.
This blog is a part of our series Python for Stock Market Analysis.

Disclaimer: This blog is for educational purpose only and we do not recommend taking the knowledge gained from this blog to implement in real financial exercises.

This blog tries to implement preliminary metrics that are used in the stock market analysis. The dataset we will be using is available via yahoofinance.

For interactive version of this blog, please visit this link.

Preliminary Actions

Install Libraries

Please install:

YahooFinance as pip install yfinance for downloading data of stock’s history.
Pandas as pip install pandas for data analysis.
Plotly as pip install plotly for interactive visualizations.
Cufflinks as pip install cufflinks for using interactive plots in pandas DataFrame.

You might need to run pip install -U kaleido if you need to save plots as PNG images.

If you are new to plotly, we have an awesome blog about it where we made plots based on a COVID-19 dataset.

!pip install yfinance


Requirement already satisfied: yfinance in c:\programdata\anaconda3\lib\site-packages (0.1.63)
Requirement already satisfied: numpy>=1.15 in c:\users\dell\appdata\roaming\python\python38\site-packages (from yfinance) (1.19.5)
Requirement already satisfied: requests>=2.20 in c:\users\dell\appdata\roaming\python\python38\site-packages (from yfinance) (2.26.0)
Requirement already satisfied: multitasking>=0.0.7 in c:\programdata\anaconda3\lib\site-packages (from yfinance) (0.0.9)
Requirement already satisfied: pandas>=0.24 in c:\programdata\anaconda3\lib\site-packages (from yfinance) (1.2.4)
Requirement already satisfied: lxml>=4.5.1 in c:\programdata\anaconda3\lib\site-packages (from yfinance) (4.6.3)
Requirement already satisfied: pytz>=2017.3 in c:\programdata\anaconda3\lib\site-packages (from pandas>=0.24->yfinance) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\programdata\anaconda3\lib\site-packages (from pandas>=0.24->yfinance) (2.8.1)
Requirement already satisfied: six>=1.5 in c:\programdata\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas>=0.24->yfinance) (1.15.0)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\dell\appdata\roaming\python\python38\site-packages (from requests>=2.20->yfinance) (2.0.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\programdata\anaconda3\lib\site-packages (from requests>=2.20->yfinance) (1.26.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\programdata\anaconda3\lib\site-packages (from requests>=2.20->yfinance) (2020.12.5)
Requirement already satisfied: idna<4,>=2.5 in c:\programdata\anaconda3\lib\site-packages (from requests>=2.20->yfinance) (2.10)

Import Required Libraries

import pandas as pd
import plotly.express as px
import cufflinks
import plotly.io as pio 
import yfinance as yf
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
pio.renderers.default = "notebook" # should change by looking into pio.renderers

pd.options.display.max_columns = None

Download Stock Data of Apple

By default, we are allowed to download data from 1900-01-01

symbols = ["AAPL"]

df = yf.download(tickers=symbols)
df.head()


[*********************100%***********************] 1 of 1 completed

Date	Open	High	Low	Close	Adj Close	Volume
1980-12-12	0.128348	0.128906	0.128348	0.128348	0.100326	469033600
1980-12-15	0.122210	0.122210	0.121652	0.121652	0.095092	175884800
1980-12-16	0.113281	0.113281	0.112723	0.112723	0.088112	105728000
1980-12-17	0.115513	0.116071	0.115513	0.115513	0.090293	86441600
1980-12-18	0.118862	0.119420	0.118862	0.118862	0.092911	73449600

It seems that data is only available from 1980-12-12. The columns in the above table are:

Open: It is the price of the stock at the start of that date.
Close: It is the price of the stock at the end of that date.
High: It is the highest price of the stock on that date.
Low: It is the lowest price of the stock on that date.
Adj Close: It is the closing price adjusted for corporate actions such as stock splits and dividends.
Volume: It is the number of shares traded on that date.

Perform EDA

EDA, or Exploratory Data Analysis, is the first step in any data analysis, so let’s do it for our stock data too. We have blogs about EDA and about statistical and inferential analysis; please check them out for more on EDA.

Checking for Null Value

# convert column names into lowercase
df.columns = [c.lower() for c in df.columns]
df.rename(columns={"adj close":"adj_close"},inplace=True)

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for c in df.columns:
    nc = df[c].isna().sum()
    tr = len(df[c])
    rate = nc/tr
    rows.append({"col_name":c,"total_rows": tr, 
                "null_rows": nc,
                "rate": rate})
ndf = pd.DataFrame(rows)
ndf

	col_name	total_rows
0	open	10390.0
1	high	10390.0
2	low	10390.0
3	close	10390.0
4	adj_close	10390.0
5	volume	10390.0

It seems that we do not have any null rows present on the data.

View the Distribution

A distribution gives us the frequency of values within some range. It is simply a histogram.

fig = df.iplot(kind="hist",subplots=True, title="Distribution of All Variables", asFigure=True)
fig.write_image("stock_analysis/dist.png")
fig.show()

It seems that the values of all columns are right-skewed: most values sit at the low end, with a long tail toward the high values.

View the Box Plot

A Box Plot gives a clear picture of the descriptive statistics of the data.

fig = df.iplot(kind="box",subplots=True, title="Box of All Variables", asFigure=True)
fig.write_image("stock_analysis/box.png")
fig.show()

It seems that we have too many outliers but it does not matter right now.

Summary of our data

df.describe()

	open	high	low	close	adj_close	volume
count	10390.000000	10390.000000	10390.000000	10390.000000	10390.000000	1.039000e+04
mean	13.689530	13.837209	13.542035	13.695320	13.077773	3.326112e+08
std	29.525352	29.857351	29.199483	29.542847	29.249790	3.394925e+08
min	0.049665	0.049665	0.049107	0.049107	0.038385	0.000000e+00
25%	0.281250	0.287946	0.273996	0.281250	0.234167	1.251712e+08
50%	0.466518	0.476004	0.459732	0.466518	0.385693	2.205952e+08
75%	14.034375	14.205357	13.918214	14.033482	12.025377	4.136293e+08
max	182.630005	182.940002	179.119995	182.009995	181.778397	7.421641e+09

The box plot already gave us a summary of the data. We can see that the average volume is 3.326112e+08, but does it give a true picture of how the volume moved over time? It does not, because the values rise and fall over time. Let's visualize the data as a line plot.

fig=df.iplot(kind="line",subplots=True, title="Trend of All Variables", asFigure=True)
fig.write_image("stock_analysis/trend.png")
fig.show()

As we can see in the plot above, OHLC shows an increasing trend while volume does not. The share price rises and falls along the way, but overall it is increasing.

Moving Average

A moving average is an average taken over only a limited time frame of the data. For time series with high volatility (for example, a large standard deviation), the simple average over the whole series DOES NOT give a clear picture of the typical value. One reason is that real-world financial prices rise and fall with unexpected factors such as the COVID outbreak, or expected ones such as Tesla's new car. So, to get a figure that represents the current average well, we average over a recent window only. By doing so, we give no weight to history that is too old to affect the present.

Simple Moving Average (SMA)

The Simple Moving Average is the simplest moving average: we sum the data points within some time frame and divide by the number of data points. The size of the time frame is often known as the window. SMA is an example of a technical indicator (a heuristic or pattern-based signal produced from price or volume).

A formula to calculate Simple Moving Average is:

$$SMA = \frac{V_1 + V_2 + V_3 + \dots + V_n}{n}$$

Where,

$V_i$ is the value at period $i$
$n$ is the number of periods in the window
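To make the formula concrete, here is a minimal sketch that computes the SMA by hand and with pandas for the first five opening prices of our data (the same values appear in the tables further down). The variable names are only illustrative.

```python
import pandas as pd

# first five "open" values of the stock data (1980-12-12 to 1980-12-18)
values = pd.Series([0.128348, 0.122210, 0.113281, 0.115513, 0.118862])
n = 5

# SMA by the formula: sum of the window divided by its size
manual_sma = values.sum() / n

# the same thing with a rolling window; the first n-1 entries are NaN
# because a full window is not available yet
rolling_sma = values.rolling(window=n).mean()

print(round(manual_sma, 6))
print(rolling_sma)
```

The last rolling value agrees with the manual one, and the first four entries are NaN, which is why the SMA table starts with four empty rows.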

Let's implement this concept on our data, taking a window size (n) of 5.

tdf = df.copy()
smadf = tdf.rolling(window=5).mean()
smadf

	open	high	low	close	adj_close	volume
Date
---	---	---	---	---	---	---
1980-12-12	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-15	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-16	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-17	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-18	0.119643	0.119978	0.119420	0.119420	0.093347	182107520.0
...	...	...	...	...	...	...
2022-02-18	170.208002	171.663998	168.304001	170.080002	170.080002	72770540.0
2022-02-22	169.730002	171.085999	167.422000	169.168002	169.168002	73766000.0
2022-02-23	168.644000	169.725998	165.322000	166.624005	166.624005	78910580.0
2022-02-24	164.789999	167.628000	161.712000	164.662006	164.662006	94904600.0
2022-02-25	163.351999	166.269998	160.191998	163.856006	163.856006	99363080.0

10390 rows × 6 columns

for c in smadf.columns:
    tdf[f"sma_{c}"] = smadf[c]
tdf

	open	high	low	close	adj_close	volume	sma_open	sma_high	sma_low	sma_close	sma_adj_close	sma_volume
Date
---	---	---	---	---	---	---	---	---	---	---	---	---
1980-12-12	0.128348	0.128906	0.128348	0.128348	0.100326	469033600	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-15	0.122210	0.122210	0.121652	0.121652	0.095092	175884800	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-16	0.113281	0.113281	0.112723	0.112723	0.088112	105728000	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-17	0.115513	0.116071	0.115513	0.115513	0.090293	86441600	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-18	0.118862	0.119420	0.118862	0.118862	0.092911	73449600	0.119643	0.119978	0.119420	0.119420	0.093347	182107520.0
...	...	...	...	...	...	...	...	...	...	...	...	...
2022-02-18	169.820007	170.539993	166.190002	167.300003	167.300003	82614200	170.208002	171.663998	168.304001	170.080002	170.080002	72770540.0
2022-02-22	164.979996	166.690002	162.149994	164.320007	164.320007	91162800	169.730002	171.085999	167.422000	169.168002	169.168002	73766000.0
2022-02-23	165.539993	166.149994	159.750000	160.070007	160.070007	90009200	168.644000	169.725998	165.322000	166.624005	166.624005	78910580.0
2022-02-24	152.580002	162.850006	152.000000	162.740005	162.740005	141147500	164.789999	167.628000	161.712000	164.662006	164.662006	94904600.0
2022-02-25	163.839996	165.119995	160.869995	164.850006	164.850006	91881700	163.351999	166.269998	160.191998	163.856006	163.856006	99363080.0

10390 rows × 12 columns

Plotting SMA of All

smac = [c for c in tdf.columns if "sma" in c]
col = [c for c in tdf.columns if "sma" not in c]

for s,c in zip(smac,col):
    fig = tdf[[c, s]].iplot(kind="line", title=f"{s} vs {c}", xTitle="Date", asFigure=True)
    fig.write_image(f"stock_analysis/sma_{c}.png")
    fig.show()
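The comparison that follows also uses a Weighted Moving Average (WMA), whose computation does not appear in this section. As a minimal sketch, assuming linear weights 1 to 5 over a window of 5 (an assumption, but one that reproduces the wma_open value of 0.117932 shown for 1980-12-18 in the combined table further down), it could look like this. The helper name `weighted_moving_average` is hypothetical.

```python
import numpy as np
import pandas as pd

def weighted_moving_average(frame, window=5):
    """Linearly weighted moving average: the newest value gets the largest weight."""
    weights = np.arange(1, window + 1)  # 1, 2, ..., window (assumed weighting)
    return frame.rolling(window=window).apply(
        lambda x: np.dot(x, weights) / weights.sum(), raw=True
    )

# first five "open" values of the stock data
demo = pd.DataFrame({"open": [0.128348, 0.122210, 0.113281, 0.115513, 0.118862]})
wma = weighted_moving_average(demo, window=5)
print(wma)
```

On the real data, the resulting columns would then be added to tdf with a wma_ prefix, in the same way as the sma_ columns above.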

We cannot see much difference between WMA and SMA because of the level (daily) of our data. Let's plot only the last 100 days.

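# assumes the wma_* columns have already been added to tdf
wmac = [c for c in tdf.columns if "wma" in c]
for s,c,w in zip(smac,col, wmac):
    fig=tdf[-100:][[c, s, w]].iplot(kind="line", title=f"{s} vs {c} vs {w}", xTitle="Date", asFigure=True)
    fig.write_image(f"stock_analysis/sma_{c}2.png")
    fig.show()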

Now it is much clearer. Looking at the plot of open:

The raw open value has some spikes and sharp drops.
The SMA and WMA are affected much less by those sudden rises and falls because they also contain values from the past.
From November to December, the open value increases rapidly, but the SMA increases slowly because its window still includes the smaller values from the last 5 days. The WMA also increases slowly, but it always stays closer to the open value because it gives more weight to the latest values.
A sudden rise or fall in open is not by itself a sign that the price will go up or down in the long run; one should always consider the moving averages as well.

Exponential Moving Average (EMA)

It is similar to the WMA in the sense of giving weights to values but, instead of the linear weights, we will give exponential weights.

A general formula of EMA at time t is:

$$EMA_t = \left[V_t \times \left(\frac{s}{1+d}\right)\right] + EMA_y \times \left[1-\left(\frac{s}{1+d}\right)\right]$$

Where,

$EMA_t$ is the EMA value at time t
$V_t$ is the value at time t
$EMA_y$ is the EMA at time t-1
s is the smoothing parameter
d is the number of time periods

The purpose of the EMA is to give higher weights to more recent values, which makes it more sensitive to recent data. This average responds to the latest price changes faster than the SMA.

We do not have to implement this scary formula from scratch because pandas lets us do it with little code. Please refer to the pandas documentation for more about EWM.

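$$y_0 = x_0, \qquad y_t = (1 - \alpha) y_{t-1} + \alpha x_t$$

Where alpha is either a value given by us or smoothing/(time periods + 1). Smoothing is generally taken as 2, and the number of time periods is chosen as required; with span=5 this gives alpha = 2/(5+1) = 1/3. Note that this recursive form is what pandas computes with adjust=False. The code below uses adjust=True, where each value is instead a weighted average with weights (1 - alpha)^i normalized by their sum; the two versions differ mostly in the early periods and converge as more data accumulates.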
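To see what these weights do in practice, here is a minimal sketch on the first five opening prices of our data, comparing pandas with a hand computation for both adjust=True and adjust=False. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# first five "open" values of the stock data
opens = pd.Series([0.128348, 0.122210, 0.113281, 0.115513, 0.118862])
span = 5
alpha = 2 / (span + 1)  # smoothing / (time periods + 1) with smoothing = 2

# pandas: normalized weights (adjust=True) and the recursive form (adjust=False)
ema_adjusted = opens.ewm(span=span, adjust=True).mean()
ema_recursive = opens.ewm(span=span, adjust=False).mean()

# adjust=True by hand: weights (1 - alpha)^i, largest for the newest value
weights = (1 - alpha) ** np.arange(len(opens))[::-1]
manual_adjusted = np.dot(opens.to_numpy(), weights) / weights.sum()

# adjust=False by hand: y_0 = x_0, y_t = (1 - alpha) * y_{t-1} + alpha * x_t
y = opens.iloc[0]
for x in opens.iloc[1:]:
    y = (1 - alpha) * y + alpha * x

print(ema_adjusted.iloc[-1], manual_adjusted)
print(ema_recursive.iloc[-1], y)
```

The adjust=True value for the fifth day agrees with the ema_open of 0.118153 reported for 1980-12-18 in the EMA table, while the recursive version gives a slightly different number this early in the series.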

emadf=df.ewm(span=5, min_periods=5, adjust=True).mean()
emadf

	open	high	low	close	adj_close	volume
Date
---	---	---	---	---	---	---
1980-12-12	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-15	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-16	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-17	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-18	0.118153	0.118552	0.117994	0.117994	0.092232	1.239301e+08
...	...	...	...	...	...	...
2022-02-18	170.786802	171.961456	168.279383	169.672761	169.670227	7.574409e+07
2022-02-22	168.851200	170.204305	166.236253	167.888510	167.886821	8.088366e+07
2022-02-23	167.747464	168.852868	164.074169	165.282342	165.281216	8.392551e+07
2022-02-24	162.691644	166.851914	160.049446	164.434897	164.434146	1.029995e+08
2022-02-25	163.074428	166.274608	160.322962	164.573267	164.572766	9.929357e+07

10390 rows × 6 columns

for c in emadf.columns:
    tdf[f"ema_{c}"] = emadf[c]
tdf

	open	high	low	close	adj_close	volume	sma_open	sma_high	sma_low	sma_close	sma_adj_close	sma_volume	wma_open	wma_high	wma_low	wma_close	wma_adj_close	wma_volume	ema_open	ema_high	ema_low	ema_close	ema_adj_close	ema_volume
Date
---	---	---	---	---	---	---	---	---	---	---	---	---	---	---	---	---	---	---	---	---	---	---	---	---
1980-12-12	0.128348	0.128906	0.128348	0.128348	0.100326	469033600	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-15	0.122210	0.122210	0.121652	0.121652	0.095092	175884800	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-16	0.113281	0.113281	0.112723	0.112723	0.088112	105728000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-17	0.115513	0.116071	0.115513	0.115513	0.090293	86441600	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-18	0.118862	0.119420	0.118862	0.118862	0.092911	73449600	0.119643	0.119978	0.119420	0.119420	0.093347	182107520.0	0.117932	0.118304	0.117746	0.117746	0.092038	1.234001e+08	0.118153	0.118552	0.117994	0.117994	0.092232	1.239301e+08
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2022-02-18	169.820007	170.539993	166.190002	167.300003	167.300003	82614200	170.208002	171.663998	168.304001	170.080002	170.080002	72770540.0	170.538670	171.722664	168.136002	169.608669	169.608669	7.264790e+07	170.786802	171.961456	168.279383	169.672761	169.670227	7.574409e+07
2022-02-22	164.979996	166.690002	162.149994	164.320007	164.320007	91162800	169.730002	171.085999	167.422000	169.168002	169.168002	73766000.0	168.796001	170.064666	166.084666	167.688671	167.688671	7.877865e+07	168.851200	170.204305	166.236253	167.888510	167.886821	8.088366e+07
2022-02-23	165.539993	166.149994	159.750000	160.070007	160.070007	90009200	168.644000	169.725998	165.322000	166.624005	166.624005	78910580.0	167.399332	168.419331	163.527333	164.656006	164.656006	8.419305e+07	167.747464	168.852868	164.074169	165.282342	165.281216	8.392551e+07
2022-02-24	152.580002	162.850006	152.000000	162.740005	162.740005	141147500	164.789999	167.628000	161.712000	164.662006	164.662006	94904600.0	162.044666	166.127334	159.086666	163.361339	163.361339	1.049387e+08	162.691644	166.851914	160.049446	164.434897	164.434146	1.029995e+08
2022-02-25	163.839996	165.119995	160.869995	164.850006	164.850006	91881700	163.351999	166.269998	160.191998	163.856006	163.856006	99363080.0	161.727998	165.291332	158.805998	163.424006	163.424006	1.039311e+08	163.074428	166.274608	160.322962	164.573267	164.572766	9.929357e+07

10390 rows × 24 columns

Plotting EMA of All

Instead of viewing the EMA of the entire data, let's view it for the last 100 days only.

smac = [c for c in tdf.columns if "sma" in c]
wmac = [c for c in tdf.columns if "wma" in c]
emac = [c for c in tdf.columns if "ema" in c]
col = [c for c in tdf.columns if "sma" not in c and "wma" not in c and "ema" not in c]

for s,c,w,e in zip(smac,col, wmac, emac):
    fig=tdf[-100:][[c, s, w, e]].iplot(kind="line", title=f"{s} vs {c} vs {w} vs {e}", xTitle="Date", asFigure=True)
    fig.write_image(f"stock_analysis/ema_{c}.png")
    fig.show()

Looking at the EMA, it seems to be smoother than the other values. How smooth it is depends on the smoothing value. Many other important metrics in stock market analysis are built on the EMA, to name a few:

Guppy Multiple Moving Average (GMMA)
Percentage Price Oscillator (PPO)
Relative Strength Index (RSI)
Moving Average Convergence Divergence (MACD)

We will explore these 4 metrics in the next blog, so please stay tuned.

Plotting Candlestick

Candlesticks are often used in stock data analysis for clear visualization, so let's try them as well. We will use Plotly's graph_objects.

import plotly.graph_objects as go

fig=go.Figure()

fig.add_trace(go.Candlestick(x=tdf[-1000:].index,
                open=tdf[-1000:]['open'],
                high=tdf[-1000:]['high'],
                low=tdf[-1000:]['low'],
                close=tdf[-1000:]['close'], 
                 name = 'Stock Market Data'))
fig.add_trace(go.Candlestick(x=tdf[-1000:].index,
                open=tdf[-1000:]['ema_open'],
                high=tdf[-1000:]['ema_high'],
                low=tdf[-1000:]['ema_low'],
                close=tdf[-1000:]['ema_close'], 
                 name = 'EMA Stock Market Data'))

fig.update_layout(
    title= "AAPL Stock Data",
    yaxis_title="Stock's Price in USD",
    xaxis_title="Date")               

fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=150, label="150D", step="day", stepmode="backward"),
            dict(count=4, label="4m", step="month", stepmode="backward"),
            dict(step="all")
        ])
    )
)

color_hi_fill = 'black'
color_hi_line = 'blue'

color_lo_fill = 'yellow'
color_lo_line = 'purple'

fig.data[0].increasing.fillcolor = color_hi_fill
fig.data[0].increasing.line.color = color_hi_line
fig.data[0].decreasing.fillcolor = 'rgba(0,0,0,0)'
fig.data[0].decreasing.line.color = 'rgba(0,0,0,0)'

fig.data[1].increasing.fillcolor = 'rgba(0,0,0,0)'
fig.data[1].increasing.line.color = 'rgba(0,0,0,0)'
fig.data[1].decreasing.fillcolor = color_lo_fill
fig.data[1].decreasing.line.color = color_lo_line

fig.write_image("stock_analysis/candle.png")

fig.show()

Moving Median

What if we used the median instead of the mean? Let's reuse the code from the steps above and calculate the median instead of the mean.

tdf = df.copy()
smmdf = tdf.rolling(window=5).median()

for c in smmdf.columns:
    tdf[f"smm_{c}"] = smmdf[c]

emadf=df.ewm(span=5, min_periods=5, adjust=True).mean()

for c in emadf.columns:
    tdf[f"ema_{c}"] = emadf[c]

smmc = [c for c in tdf.columns if "smm" in c]
emac = [c for c in tdf.columns if "ema" in c]
col = [c for c in tdf.columns if "smm" not in c and "ema" not in c]

for s,c,e in zip(smmc,col,emac):
    fig=tdf[-100:][[c, s, e]].iplot(kind="line", title=f"{s} vs {c} vs {e}", xTitle="Date", asFigure=True)
    fig.write_image(f"stock_analysis/mma_{c}.png")
    fig.show()

The EMA stays much closer to the open value and is more sensitive to changes than the Simple Moving Median.
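The flip side of that lower sensitivity is robustness to single spikes. A minimal sketch on made-up prices (the `prices` series is hypothetical) shows how one outlier moves the rolling mean but not the rolling median:

```python
import pandas as pd

# hypothetical flat price series with a single spike
prices = pd.Series([10.0, 10.0, 10.0, 50.0, 10.0, 10.0, 10.0])

rolling_mean = prices.rolling(window=5).mean()
rolling_median = prices.rolling(window=5).median()

# every full window contains the spike: the mean is pulled up, the median is not
print(pd.DataFrame({"price": prices, "mean": rolling_mean, "median": rolling_median}))
```

So a moving median can be a reasonable choice when isolated spikes should be ignored, while the EMA is better when recent changes should show up quickly.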

Moving Variance

tdf = df.copy()
smmdf = tdf.rolling(window=5).var()

for c in smmdf.columns:
    tdf[f"smv_{c}"] = smmdf[c]

emadf=df.ewm(span=5, min_periods=5, adjust=True).var()

for c in emadf.columns:
    tdf[f"emv_{c}"] = emadf[c]

smmc = [c for c in tdf.columns if "smv" in c]
emac = [c for c in tdf.columns if "emv" in c]
col = [c for c in tdf.columns if "smv" not in c and "emv" not in c]

for s,c,e in zip(smmc,col,emac):
    fig=tdf[-100:][[c, s, e]].iplot(kind="line", y = [s,e], secondary_y=c, title=f"{s} vs {e}", xTitle="Date", asFigure=True)
    fig.write_image(f"stock_analysis/mva_{c}.png")
    fig.show()

The variance increases when there is a sudden change in the trend and decreases again when the changes are gradual.
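This behaviour is easy to reproduce on a toy series. The sketch below uses made-up prices (hypothetical data) that jump from one level to another; the rolling variance is zero on the flat parts and peaks while the window straddles the jump:

```python
import pandas as pd

# hypothetical prices: flat at 10, then a sudden jump to 20
prices = pd.Series([10.0] * 5 + [20.0] * 5)

# sample variance (ddof=1, the pandas default) over a window of 5
rolling_var = prices.rolling(window=5).var()

print(rolling_var)
```

Once the window has fully moved past the jump, the variance falls back to zero, matching what we see in the plots.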

Conclusion

In this blog, we explored some of the popular moving average algorithms used in stock market analysis. In the next blog, we will explore some of the popular metrics that use a moving average as their base.

Data Analysis and Importance of Groupby in Pandas but not Just pd.groupby

Viper — Tue, 08 Mar 2022 16:58:09 +0000

Data Analysis and Importance of Groupby in Pandas but not Just pd.groupby

Originally published in dataqoil.com.
This blog will be continuously updated as I find new ways, tricks to make things work faster and easier.

Updates

January 5 2022
- Started blog and written up to Rate of Views Change Per Month According to Category.

What would you like to become in $y= mx+c$ ? Please don’t say +.

Introduction

I work with Pandas frequently, and most of the time I have to do a groupby. But I have noticed that pd.groupby is not always the right tool. Before diving into the hands-on part, I would like to share some scenarios. First, let's assume that you are working in a media company:

What if your manager asks you to find the monthly trend of content reach/growth, so that they can tell whether the contents have the desired effect or not, and all you have is one datetime column in timestamp format?
What if your social media manager asks you to find the top 10 categories of posts with respect to the profession of viewers, so that they can make more focused and personalized content for them and increase views?
You see a chance of being promoted and you want to give some valuable insights. What if you want to present the best time to post a particular type of content? For example, comedy or funny content might get the best views during the day, nature or motivating content might do well in the morning, and loving or musical content might do well at night.

The 3 examples above are high-level problem statements, but at the ground level almost every analyst has to group data. In this blog, I am going to create dummy data and perform some analysis on it using groupby.

Creating a Dummy Data

The data will be generated randomly, so it might not make any sense in the real world, but the goal of this blog is to explain and explore ways to do groupby in Pandas.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings

sns.set(rc={'figure.figsize':(40, 20),
                "axes.titlesize" : 24,
                "axes.labelsize" : 20,
                "xtick.labelsize" : 16,
                "ytick.labelsize" : 16})
plt.rc("figure", figsize=(16,8))
warnings.filterwarnings('ignore')

Let's suppose the number of posts per day ranges from 5 to 10. The base views of a post range from 20k to 100k, and they are scaled by a rate that grows by 0.1 after every 100 posts, so later posts tend to get more views.

dates = pd.date_range(pd.to_datetime("2020-01-01"), pd.to_datetime("2021-01-01"))
times = ["Morning", "Day", "Night"]
categories = ["Motivating", "Musical", "Career", "News", "Funny"]
posts = list(range(5, 11))

content_dict = {"post_id":[], "date":[], "dtime": [], "category":[], "views": []}
post_id = 0

month = []
rate = 0.1

for d in dates:
    post_count = posts[np.random.randint(len(posts))]

    for p in range(post_count):
        if len(content_dict["date"])%100==0:
            rate+=0.1
        dtime = times[np.random.randint(len(times))]
        category = categories[np.random.randint(len(categories))]
        views = np.random.randint(20000, 100000) * rate

        content_dict["post_id"].append(post_id)
        content_dict["date"].append(d)
        content_dict["dtime"].append(dtime)
        content_dict["category"].append(category)
        content_dict["views"].append(views)
        post_id+=1

df = pd.DataFrame(content_dict, columns=list(content_dict.keys()))
df

	post_id	date	dtime	category	views
0	0	2020-01-01	Night	Funny	19536.8
1	1	2020-01-01	Day	Funny	5048.0
2	2	2020-01-01	Night	News	13165.0
3	3	2020-01-01	Day	Career	12326.8
4	4	2020-01-01	Morning	News	19512.8
...	...	...	...	...	...
2735	2735	2021-01-01	Day	Musical	180803.4
2736	2736	2021-01-01	Morning	Motivating	203542.3
2737	2737	2021-01-01	Morning	Musical	161295.1
2738	2738	2021-01-01	Morning	Motivating	143900.9
2739	2739	2021-01-01	Night	Career	84401.6

2740 rows × 5 columns

The data is ready and now we could start our analysis.

Number of Posts According to Category

Using a plain groupby; see the pandas groupby documentation for more.

df.groupby("category").post_id.count().plot(kind="bar")


<AxesSubplot:xlabel='category'>

Number of Views According to Category

df.groupby("category").views.sum().plot(kind="bar")


<AxesSubplot:xlabel='category'>

Pretty easy right?

Lets try something more.

Views and Count According to Day Time

df.groupby("dtime").post_id.count().plot(kind="bar")


<AxesSubplot:xlabel='dtime'>

df.groupby("dtime").views.sum().plot(kind="bar")


<AxesSubplot:xlabel='dtime'>

Views According to Month

Using resample on date with a monthly rule. We could also resample by week, quarter, or other more flexible frequencies. More at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html#pandas.DataFrame.resample.

df.resample(rule='M', on='date')["views"].sum().plot(kind="bar")


<AxesSubplot:xlabel='date'>

In the step above, we grouped the data by month and took the sum of views. But will it meet our next requirement?

Number of Views Per Month According to Category

Using Grouper to group by month inside a groupby; see the pandas Grouper documentation for more.

df.groupby(["category", pd.Grouper(key="date", freq="1M")]).views.sum().plot(kind="bar")


<AxesSubplot:xlabel='category,date'>

Why not make our plot a little bit more awesome?

vdf = df.groupby(["category", pd.Grouper(key="date", freq="1M")]).views.sum().rename("Views").reset_index()
vdf["date"] = vdf["date"].dt.date

def bar_plot(data, title="Views", xax=None,yax=None, hue=None):

    fig, ax = plt.subplots(figsize = (50, 30))   
    fig = sns.barplot(x = xax, y = yax, data = data, 
                 ci = None, ax=ax, hue=hue)
    plt.legend(fontsize=40)
    plt.yticks(fontsize=40)
    plt.xticks(fontsize=40, rotation=80)
    plt.title(title, fontsize=50)
    plt.xlabel(xax, fontsize=50)
    plt.ylabel(yax, fontsize=50)
    plt.show()

I love to make my own custom visualization function. That gives me more flexibility and less time to tune sizes.

bar_plot(vdf, title="Views Plot", xax="date", yax="Views", hue="category")

Can you find some insights or make some argument by looking at the data above? Your result will definitely be different from mine because the data is random.
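To recap how resample and Grouper relate, here is a minimal sketch on a tiny made-up frame (the `demo` data is hypothetical). It uses the month-start alias "MS", which labels each group by the first day of the month, whereas the code above uses month-end labels; that choice is only an assumption made for this example.

```python
import pandas as pd

# hypothetical posts spread over three months
demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-03",
                            "2020-02-25", "2020-03-10"]),
    "category": ["News", "Funny", "News", "News", "Funny"],
    "views": [100, 200, 300, 400, 500],
})

# monthly totals: resample and a Grouper-based groupby give the same numbers
by_resample = demo.resample("MS", on="date")["views"].sum()
by_grouper = demo.groupby(pd.Grouper(key="date", freq="MS"))["views"].sum()

# Grouper can be combined with other keys, which resample alone cannot do
per_category = demo.groupby(["category", pd.Grouper(key="date", freq="MS")])["views"].sum()

print(by_resample)
print(per_category)
```

In short, resample is enough for a single time key, while Grouper is the piece to reach for when the time grouping has to sit next to another column such as category.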

Rate of Views Change Per Month

df.groupby([pd.Grouper(key="date", freq="1M")]).views.sum().pct_change().plot(kind="line")


<AxesSubplot:xlabel='date'>

Rate of Views Change Per Month According to Category

Using shift inside the groupby object.

vdf = df.groupby(["category", pd.Grouper(key="date", freq="1M")]).views.sum().rename("Sums").reset_index()
lags = vdf.groupby("category").Sums.shift(1)
vdf["Rate"] = (vdf["Sums"]-lags)/lags
vdf

	category	date	Sums	Rate
0	Career	2020-01-31	738533.2	NaN
1	Career	2020-02-29	1199889.7	0.624693
2	Career	2020-03-31	1913763.8	0.594950
3	Career	2020-04-30	3330908.0	0.740501
4	Career	2020-05-31	3390679.1	0.017944
...	...	...	...	...
58	News	2020-08-31	4054761.3	0.221326
59	News	2020-09-30	4865650.4	0.199984
60	News	2020-10-31	4976088.2	0.022697
61	News	2020-11-30	5903048.6	0.186283
62	News	2020-12-31	9044649.8	0.532200

63 rows × 4 columns
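The shift-based rate is the same thing as a percentage change computed within each group. As a minimal sketch on made-up sums (the `sums` frame is hypothetical), both routes give the same numbers, and the first month of each category has no previous value to compare with:

```python
import pandas as pd

# hypothetical monthly sums for two categories, already sorted by month
sums = pd.DataFrame({
    "category": ["Career", "Career", "Career", "News", "News", "News"],
    "Sums": [100.0, 150.0, 120.0, 200.0, 220.0, 330.0],
})

# route 1: previous value within each category, then the relative change
lags = sums.groupby("category")["Sums"].shift(1)
sums["Rate"] = (sums["Sums"] - lags) / lags

# route 2: pct_change applied per group
sums["Rate_pct"] = sums.groupby("category")["Sums"].pct_change()

print(sums)
```

Grouping before shifting matters: a plain shift on the whole column would compare the first News month with the last Career month.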

sns.lineplot(data=vdf,x="date", y="Rate", hue="category")


<AxesSubplot:xlabel='date', ylabel='Rate'>

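Because the data is random, we cannot draw any valuable conclusions, but we can see that the rate drops below zero in some months, such as June. Note that the data ends on 2021-01-01, so January 2021 contains a single day and its totals are not comparable with full months. Let's verify the negative rates.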

vdf[vdf.Rate<0]

	category	date	Sums	Rate
5	Career	2020-06-30	3044816.8	-0.102004
11	Career	2020-12-31	8259240.0	-0.183001
12	Career	2021-01-31	84401.6	-0.989781
16	Funny	2020-04-30	2716427.3	-0.029595
21	Funny	2020-09-30	5526894.3	-0.108572
23	Funny	2020-11-30	5058885.1	-0.180990
30	Motivating	2020-06-30	3270395.5	-0.029415
33	Motivating	2020-09-30	4252837.6	-0.032287
36	Motivating	2020-12-31	6643837.5	-0.075897
37	Motivating	2021-01-31	347443.2	-0.947704
42	Musical	2020-05-31	2558136.9	-0.110001
47	Musical	2020-10-31	5345765.1	-0.058601
50	Musical	2021-01-31	621339.5	-0.933231
57	News	2020-07-31	3319967.6	-0.035456

More ways and ideas will be updated soon.

Creating Awesome Data Dashboard with Plotly in Streamlit: Clustering

Viper — Thu, 03 Mar 2022 15:04:31 +0000

This is a continuation of our previous blog, Creating Data Dashboards With Streamlit and Python.

Adding A Clustering Functionality

Before adding the clustering functionality to our existing app, please make sure you have followed the previous part. Or you can grab all the code from below:

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks

@st.cache
def get_data(url):
    df = pd.read_csv(url)
    df["date"] = pd.to_datetime(df.date).dt.date
    df['date'] = pd.DatetimeIndex(df.date)

    return df

url = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
data = get_data(url)

locations = data.location.unique().tolist()

sidebar = st.sidebar

analysis_type = sidebar.radio("Analysis Type", ["Single", "Multiple"])
st.markdown(f"Analysis Mode: {analysis_type}")

if analysis_type=="Single":
    location_selector = sidebar.selectbox(
        "Select a Location",
        locations
    )
    st.markdown(f"# Currently Selected {location_selector}")
    trend_level = sidebar.selectbox("Trend Level", ["Daily", "Weekly", "Monthly", "Quarterly", "Yearly"])
    st.markdown(f"### Currently Selected {trend_level}")

    show_data = sidebar.checkbox("Show Data")

    trend_kwds = {"Daily": "1D", "Weekly": "1W", "Monthly": "1M", "Quarterly": "1Q", "Yearly": "1Y"}
    trend_data = data.query(f"location=='{location_selector}'").\
        groupby(pd.Grouper(key="date", 
        freq=trend_kwds[trend_level])).aggregate(new_cases=("new_cases", "sum"),
        new_deaths = ("new_deaths", "sum"),
        new_vaccinations = ("new_vaccinations", "sum"),
        new_tests = ("new_tests", "sum")).reset_index()

    trend_data["date"] = trend_data.date.dt.date

    new_cases = sidebar.checkbox("New Cases")
    new_deaths = sidebar.checkbox("New Deaths")
    new_vaccinations = sidebar.checkbox("New Vaccinations")
    new_tests = sidebar.checkbox("New Tests")

    lines = [new_cases, new_deaths, new_vaccinations, new_tests]
    line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]
    trends = [c[1] for c in zip(lines,line_cols) if c[0]==True]

    if show_data:
        tcols = ["date"] + trends
        st.dataframe(trend_data[tcols])

    subplots=sidebar.checkbox("Show Subplots", True)
    if len(trends)>0:
        fig=trend_data.iplot(kind="line", asFigure=True, xTitle="Date", yTitle="Values",
                            x="date", y=trends, title=f"{trend_level} Trend of {', '.join(trends)}.", subplots=subplots)
        st.plotly_chart(fig, use_container_width=False)

if analysis_type=="Multiple":
    selected = sidebar.multiselect("Select Locations ", locations)
    st.markdown(f"## Selected Locations: {', '.join(selected)}")
    show_data = sidebar.checkbox("Show Data")
    trend_level = sidebar.selectbox("Trend Level", ["Daily", "Weekly", "Monthly", "Quarterly", "Yearly"])
    st.markdown(f"### Currently Selected {trend_level}")

    trend_kwds = {"Daily": "1D", "Weekly": "1W", "Monthly": "1M", "Quarterly": "1Q", "Yearly": "1Y"}

    trend_data = data.query(f"location in {selected}").\
        groupby(["location", pd.Grouper(key="date", 
        freq=trend_kwds[trend_level])]).aggregate(new_cases=("new_cases", "sum"),
        new_deaths = ("new_deaths", "sum"),
        new_vaccinations = ("new_vaccinations", "sum"),
        new_tests = ("new_tests", "sum")).reset_index()

    trend_data["date"] = trend_data.date.dt.date

    new_cases = sidebar.checkbox("New Cases")
    new_deaths = sidebar.checkbox("New Deaths")
    new_vaccinations = sidebar.checkbox("New Vaccinations")
    new_tests = sidebar.checkbox("New Tests")

    lines = [new_cases, new_deaths, new_vaccinations, new_tests]
    line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]
    trends = [c[1] for c in zip(lines,line_cols) if c[0]==True]

    ndf = pd.DataFrame(data=trend_data.date.unique(),columns=["date"])

    for s in selected:
        new_cols = ["date"]+[f"{s}_{c}" for c in line_cols]
        tdf = trend_data.query(f"location=='{s}'")
        tdf.drop("location", axis=1, inplace=True)
        tdf.columns=new_cols
        ndf=ndf.merge(tdf,on="date",how="inner")

    if show_data:
        if len(ndf)>0:
            st.dataframe(ndf)
        else:
            st.markdown("Empty Dataframe")

    new_trends = []
    for c in trends:
        new_trends.extend([f"{s}_{c}" for s in selected])

    subplots=sidebar.checkbox("Show Subplots", True)
    if len(trends)>0:
        st.markdown("### Trend of Selected Locations")

        fig=ndf.iplot(kind="line", asFigure=True, xTitle="Date", yTitle="Values",
                            x="date", y=new_trends, title=f"{trend_level} Trend of {', '.join(trends)}.", subplots=subplots)
        st.plotly_chart(fig, use_container_width=False)

K Means Clustering

We have written an awesome blog about implementing K Means clustering from scratch, if you would like to build it yourself. Here, we are going to use the implementation from scikit-learn.

Algorithm

Let $P = \{p_1, p_2, p_3, ..., p_n\}$ be the set of data points and $C = \{c_1, c_2, c_3, ..., c_k\}$ be the set of cluster centers.

Step 1 : Randomly select k initial cluster centers.
Step 2 : Calculate the distance between each data point in $P$ and each cluster center in $C$.
Step 3 : Assign each data point to the cluster center nearest to it. Here we measure distance with the Euclidean distance, $d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$, where $m$ is the number of features.
Step 4 : Recalculate each cluster center as the mean of its points, $\frac{1}{n_j}\sum_{i=1}^{n_j} x_i$, where $n_j$ is the number of data points in the $j^{th}$ cluster.
Step 5 : Calculate the distance between the new cluster centers and each data point again.
Step 6 : If any data point changed its cluster, repeat from step 3; otherwise terminate.
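The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration under simple assumptions (random initial centers picked from the data, a fixed iteration cap, toy 2D points); the function name `simple_kmeans` is hypothetical and this is not how scikit-learn implements it.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=100, seed=0):
    """Toy K Means following steps 1 to 6; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Steps 2 and 5: Euclidean distance from every point to every center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step 3: assign each point to its nearest center
        new_labels = distances.argmin(axis=1)
        # Step 6: stop when no point changed its cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each center to the mean of its assigned points
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

# two well separated toy blobs
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)
centers, labels = simple_kmeans(points, k=2)
print(centers)
print(labels)
```

In practice scikit-learn's KMeans adds smarter initialization and multiple restarts, which is why we use it for the real data.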

For our experiment, let's try to cluster countries based on total deaths.

Let's import scikit-learn's KMeans class along with Plotly's plotting functions, and also import and configure cufflinks. We won't be using this setup in our Streamlit app though.

import cufflinks
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import plotly.express as px
import plotly.io as pio 
import plotly.graph_objects as go
import warnings
warnings.filterwarnings("ignore")
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
pio.renderers.default = "notebook" # should change by looking into pio.renderers


data = pd.read_csv("owid-covid-data.csv")

First, let's clean our data: keep only the latest row for each location, and only locations that are actual countries. That means we exclude aggregate locations such as continents and income groups.

df = data[~data.location.isin(["Lower middle income", "North America", "World", "Asia", "Europe", 
                           "European Union", "Upper middle income", 
                           "High income", "South America"])]

tdf = df.sort_values("date").drop_duplicates(subset=["location"],keep="last")

Let's select some of the useful columns from our data because there are lots of them.

columns=['total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'new_tests', 'total_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
       'tests_per_case', 'positive_rate', 'stringency_index',
       'population', 'population_density', 'median_age', 'aged_65_older',
       'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
       'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers',
       'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand',
       'life_expectancy', 'human_development_index']

Let's train KMeans and look for the optimal number of clusters K. We only run a simple version here because our goal is to implement it in the Streamlit app, not in the notebook.

from sklearn.cluster import KMeans

k=5

cols=["gdp_per_capita","new_cases_per_million", "new_deaths_per_million", "population_density","total_deaths"]

vdf = tdf.dropna(subset=cols)
X = vdf[cols]

inertias = []

# loop through range of Ks to find optimal value of k
for c in range(2, k):
    # initialize model with number of clusters
    model = KMeans(n_clusters=c)

    # fit a model
    model.fit(X)

    # predict a model
    y_kmeans = model.predict(X)

    # insert the prediction
    vdf["cluster"]=y_kmeans

    # append the inertia so that we can visualize it later on
    inertias.append((c, model.inertia_))

    # plot a scatterplot using plotly's go, select x as cols[0] and y as cols[1]
    # show location name while hovering point, 
    # make size of marker variable of total deaths
    # make color variable of cluster value
    # also plot cluster center
    fig=go.Figure(data=[go.Scatter(x=vdf[cols[0]],y=vdf[cols[1]],mode="markers",
                                   name="Countries",
                                   text=vdf["location"],
                                      marker = dict(
                                    size = vdf["total_deaths"]%20,
                                    opacity = 0.9,
                                    reversescale = True,
                                    symbol = 'pentagon',
                                    color=vdf["cluster"]

                                    ),
                                  ),
                        go.Scatter(x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1],
                                   mode="markers", name="Cluster Center",
                                   text=("Cluster " + vdf.cluster.astype(str)).unique(),
                                   marker = dict(
                                    size = 20,
                                    opacity = 0.8,
                                    reversescale = True,
                                    autocolorscale = False,
                                    symbol = 'circle',
                                    color=vdf.cluster.unique(),
                                    line = dict(
                                        width=1,
                                        color='rgba(102, 102, 102)'
                                    ))),
                       ])
    fig.show()

# plot the inertia to find the optimal value of k in KMeans
inertias=np.array(inertias).reshape(-1,2)
performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
layout = go.Layout(
    title="Cluster Number vs Inertia",
    xaxis=dict(
        title="Ks"
    ),
    yaxis=dict(
        title="Inertia"
    ) ) 
fig=go.Figure(data=[performance])  # go.Data is deprecated; a plain list of traces works
fig.update_layout(layout)
fig.show()

When choosing the best number of clusters, K, we should look at the last plot, the elbow curve. A good value of K is the point after which the inertia starts to decrease slowly. Also, as we can see, the marker colors are not that appealing, so we will change them by picking colors at random from the list of colors supported by plotly. Please follow this comment for more about it. We create a new Python file, colors.py, add the code below to it, and use it later on.

import random
def get_colors():
    s='''
        aliceblue, antiquewhite, aqua, aquamarine, azure,
        beige, bisque, black, blanchedalmond, blue,
        blueviolet, brown, burlywood, cadetblue,
        chartreuse, chocolate, coral, cornflowerblue,
        cornsilk, crimson, cyan, darkblue, darkcyan,
        darkgoldenrod, darkgray, darkgrey, darkgreen,
        darkkhaki, darkmagenta, darkolivegreen, darkorange,
        darkorchid, darkred, darksalmon, darkseagreen,
        darkslateblue, darkslategray, darkslategrey,
        darkturquoise, darkviolet, deeppink, deepskyblue,
        dimgray, dimgrey, dodgerblue, firebrick,
        floralwhite, forestgreen, fuchsia, gainsboro,
        ghostwhite, gold, goldenrod, gray, grey, green,
        greenyellow, honeydew, hotpink, indianred, indigo,
        ivory, khaki, lavender, lavenderblush, lawngreen,
        lemonchiffon, lightblue, lightcoral, lightcyan,
        lightgoldenrodyellow, lightgray, lightgrey,
        lightgreen, lightpink, lightsalmon, lightseagreen,
        lightskyblue, lightslategray, lightslategrey,
        lightsteelblue, lightyellow, lime, limegreen,
        linen, magenta, maroon, mediumaquamarine,
        mediumblue, mediumorchid, mediumpurple,
        mediumseagreen, mediumslateblue, mediumspringgreen,
        mediumturquoise, mediumvioletred, midnightblue,
        mintcream, mistyrose, moccasin, navajowhite, navy,
        oldlace, olive, olivedrab, orange, orangered,
        orchid, palegoldenrod, palegreen, paleturquoise,
        palevioletred, papayawhip, peachpuff, peru, pink,
        plum, powderblue, purple, red, rosybrown,
        royalblue, saddlebrown, salmon, sandybrown,
        seagreen, seashell, sienna, silver, skyblue,
        slateblue, slategray, slategrey, snow, springgreen,
        steelblue, tan, teal, thistle, tomato, turquoise,
        violet, wheat, white, whitesmoke, yellow,
        yellowgreen
        '''
    li=s.split(',')
    li=[l.replace('\n','') for l in li]
    li=[l.replace(' ','') for l in li]
    random.shuffle(li)
    return li

Using PCA for Feature Reduction

We will have a separate blog post about PCA in the future; for now, we will use PCA only for dimensionality reduction of the data.
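The code below standardizes every feature as z = (x - u) / s before running PCA. This matters because PCA looks for directions of largest variance, so a column with huge values (such as total_deaths) would otherwise dominate a column with small values. Here is a minimal, dependency-free sketch of that scaling step for a single column. The function name `standardize` and the sample numbers are made up for illustration; in the real code StandardScaler does this for all columns at once.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return z-scores z = (x - u) / s for one feature column."""
    u = fmean(column)
    s = pstdev(column)  # population standard deviation, as StandardScaler uses
    if s == 0:
        # a constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - u) / s for x in column]


# made-up values standing in for one feature column
gdp = [1000.0, 2000.0, 3000.0]
z = standardize(gdp)
print(z)
```

After scaling, every column has mean 0 and unit variance, so each feature contributes on the same footing to the principal components.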

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# scale our data as z = (x - u) / s
ks=5
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# initialize PCA with components as 2 and fit it!
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X_scaled)
feat = list(range(pca.n_components_))

# using the new features given by PCA, create a new DataFrame and use it for training
PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
choosed_component=[0,1]

inertias = []
for c in range(1,ks+1):
    X = PCA_components[choosed_component]

    model = KMeans(n_clusters=c)
    model.fit(X)
    y_kmeans = model.predict(X)
    vdf["cluster"] = y_kmeans
    inertias.append((c,model.inertia_))

    trace0 = go.Scatter(x=X[0],y=X[1],mode='markers', marker=dict(
        color=vdf.cluster,
        colorscale='Viridis',
        showscale=True
    ),name="Cluster Points")

    trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
        color=vdf.cluster.unique(),
        size=20,
        showscale=True
    ),name="Cluster Mean")

    data7 = [trace0, trace1]  # go.Data is deprecated; a plain list of traces works
    fig = go.Figure(data=data7)
    fig.update_layout(title=f"Cluster Size {c}")
    fig.show()

inertias=np.array(inertias).reshape(-1,2)
performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
layout = go.Layout(
    title="Cluster Number vs Inertia",
    xaxis=dict(
        title="Ks"
    ),
    yaxis=dict(
        title="Inertia"
    ) ) 
fig=go.Figure(data=[performance])  # go.Data is deprecated; a plain list of traces works
fig.update_layout(layout)
fig.show()
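Reading the elbow off the plot is a judgment call, but we can get a rough hint from the numbers too. The sketch below takes a list of (k, inertia) pairs like the one built in the loops above and returns the k where the drop in inertia slows down the most (the largest second difference). The function name `elbow_k` and the example numbers are assumptions for illustration, and this heuristic is only a starting point, not a replacement for looking at the curve.

```python
def elbow_k(inertias):
    """Pick the k at which the decrease in inertia slows down the most.

    `inertias` is a list of (k, inertia) pairs sorted by k.
    """
    if len(inertias) < 3:
        raise ValueError("need at least three (k, inertia) pairs")
    best_k, best_bend = None, float("-inf")
    triples = zip(inertias, inertias[1:], inertias[2:])
    for (_, prev), (k, cur), (_, nxt) in triples:
        # drop before k minus drop after k: large when the curve bends at k
        bend = (prev - cur) - (cur - nxt)
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# made-up inertia values, only to show the call
example = [(1, 1000.0), (2, 500.0), (3, 120.0), (4, 100.0), (5, 90.0)]
print(elbow_k(example))
```

Note that the first and last k can never be returned, because the bend needs a neighbour on both sides.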

Adding K Means on Streamlit App

Starting from our previous Streamlit code, we will modify it to add new features.

Import everything required, as before.

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from colors import *

Make a function to load the data, and call get_colors from the colors.py file we created earlier.

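# st.cache is deprecated in recent Streamlit releases in favour of st.cache_data;
# on an older Streamlit, use @st.cache instead
@st.cache_data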
def get_data(url):
    df = pd.read_csv(url)
    df["date"] = pd.to_datetime(df.date).dt.date
    df['date'] = pd.DatetimeIndex(df.date)

    return df

colors = get_colors()

Get the data, prepare the locations and the sidebar, and show radio buttons to select the mode of our app: EDA or Clustering. We have already done EDA. We will also show the current mode in the main panel.

url = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
data = get_data(url)
columns = ['total_cases', 'new_cases',
                'new_cases_smoothed', 'total_deaths', 'new_deaths',
                'new_deaths_smoothed', 'total_cases_per_million',
                'new_cases_per_million', 'new_cases_smoothed_per_million',
                'total_deaths_per_million', 'new_deaths_per_million',
                'new_deaths_smoothed_per_million', 'new_tests', 'total_tests',
                'total_tests_per_thousand', 'new_tests_per_thousand',
                'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
                'tests_per_case', 'positive_rate', 'stringency_index',
                'population', 'population_density', 'median_age', 'aged_65_older',
                'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
                'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers',
                'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand',
                'life_expectancy', 'human_development_index']

locations = data.location.unique().tolist()

sidebar = st.sidebar

mode = sidebar.radio("Mode", ["EDA", "Clustering"])
st.markdown("<h1 style='text-align: center; color: #ff0000;'>COVID-19</h1>", unsafe_allow_html=True)
st.markdown("# Mode: {}".format(mode), unsafe_allow_html=True)

Put everything we did in the previous part inside the EDA mode.

if mode=="EDA":
    analysis_type = sidebar.radio("Analysis Type", ["Single", "Multiple"])
    st.markdown(f"# Analysis Mode: {analysis_type}")

    if analysis_type=="Single":
        location_selector = sidebar.selectbox(
            "Select a Location",
            locations
        )
        st.markdown(f"# Currently Selected {location_selector}")
        trend_level = sidebar.selectbox("Trend Level", ["Daily", "Weekly", "Monthly", "Quarterly", "Yearly"])
        st.markdown(f"### Currently Selected {trend_level}")

        show_data = sidebar.checkbox("Show Data")

        trend_kwds = {"Daily": "1D", "Weekly": "1W", "Monthly": "1M", "Quarterly": "1Q", "Yearly": "1Y"}
        trend_data = data.query(f"location=='{location_selector}'").\
            groupby(pd.Grouper(key="date", 
            freq=trend_kwds[trend_level])).aggregate(new_cases=("new_cases", "sum"),
            new_deaths = ("new_deaths", "sum"),
            new_vaccinations = ("new_vaccinations", "sum"),
            new_tests = ("new_tests", "sum")).reset_index()

        trend_data["date"] = trend_data.date.dt.date

        new_cases = sidebar.checkbox("New Cases")
        new_deaths = sidebar.checkbox("New Deaths")
        new_vaccinations = sidebar.checkbox("New Vaccinations")
        new_tests = sidebar.checkbox("New Tests")

        lines = [new_cases, new_deaths, new_vaccinations, new_tests]
        line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]
        trends = [c[1] for c in zip(lines,line_cols) if c[0]==True]

        if show_data:
            tcols = ["date"] + trends
            st.dataframe(trend_data[tcols])

        subplots=sidebar.checkbox("Show Subplots", True)
        if len(trends)>0:
            fig=trend_data.iplot(kind="line", asFigure=True, xTitle="Date", yTitle="Values",
                                x="date", y=trends, title=f"{trend_level} Trend of {', '.join(trends)}.", subplots=subplots)
            st.plotly_chart(fig, use_container_width=False)

    if analysis_type=="Multiple":
        selected = sidebar.multiselect("Select Locations ", locations)
        st.markdown(f"## Selected Locations: {', '.join(selected)}")
        show_data = sidebar.checkbox("Show Data")
        trend_level = sidebar.selectbox("Trend Level", ["Daily", "Weekly", "Monthly", "Quarterly", "Yearly"])
        st.markdown(f"### Currently Selected {trend_level}")

        trend_kwds = {"Daily": "1D", "Weekly": "1W", "Monthly": "1M", "Quarterly": "1Q", "Yearly": "1Y"}

        trend_data = data.query(f"location in {selected}").\
            groupby(["location", pd.Grouper(key="date", 
            freq=trend_kwds[trend_level])]).aggregate(new_cases=("new_cases", "sum"),
            new_deaths = ("new_deaths", "sum"),
            new_vaccinations = ("new_vaccinations", "sum"),
            new_tests = ("new_tests", "sum")).reset_index()

        trend_data["date"] = trend_data.date.dt.date

        new_cases = sidebar.checkbox("New Cases")
        new_deaths = sidebar.checkbox("New Deaths")
        new_vaccinations = sidebar.checkbox("New Vaccinations")
        new_tests = sidebar.checkbox("New Tests")

        lines = [new_cases, new_deaths, new_vaccinations, new_tests]
        line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]
        trends = [c[1] for c in zip(lines,line_cols) if c[0]==True]

        ndf = pd.DataFrame(data=trend_data.date.unique(),columns=["date"])

        for s in selected:
            new_cols = ["date"]+[f"{s}_{c}" for c in line_cols]
            # drop returns a new frame, which avoids an in-place edit on a query result
            tdf = trend_data.query(f"location=='{s}'").drop("location", axis=1)
            tdf.columns=new_cols
            ndf=ndf.merge(tdf,on="date",how="inner")

        if show_data:
            if len(ndf)>0:
                st.dataframe(ndf)
            else:
                st.markdown("Empty Dataframe")

        new_trends = []
        for c in trends:
            new_trends.extend([f"{s}_{c}" for s in selected])

        subplots=sidebar.checkbox("Show Subplots", True)
        if len(trends)>0:
            st.markdown("### Trend of Selected Locations")

            fig=ndf.iplot(kind="line", asFigure=True, xTitle="Date", yTitle="Values",
                                x="date", y=new_trends, title=f"{trend_level} Trend of {', '.join(trends)}.", subplots=subplots)
            st.plotly_chart(fig, use_container_width=False)

For Clustering mode, first select features from our predefined columns list. Please follow the comment lines above each step in the code below. Most of the parts have already been done in the steps above.

if mode=="Clustering":    
    features = sidebar.multiselect("Select Features", columns, default=columns[:3])

    # select a clustering algorithm
    calg = sidebar.selectbox("Select a clustering algorithm", ["K-Means","K-Medoids", "Spectral Clustering", "Agglomerative Clustering"])

    # select number of clusters
    ks = sidebar.slider("Select number of clusters", min_value=2, max_value=10, value=2)

    # select a dataframe to apply cluster on

    udf = data.sort_values("date").drop_duplicates(subset=["location"],keep="last").dropna(subset=features)
    udf = udf[~udf.location.isin(["Lower middle income", "North America", "World", "Asia", "Europe", 
                           "European Union", "Upper middle income", 
                           "High income", "South America"])]

    # udf[features].dropna()

    if len(features)>=2:

        if calg == "K-Means":
            st.markdown("### K-Means Clustering")        
            use_pca = sidebar.radio("Use PCA?",["Yes","No"])
            if use_pca=="No":
                st.markdown("### Not Using PCA")
                inertias = []
                for c in range(1,ks+1):
                    tdf = udf.copy()
                    X = tdf[features]                
                    # colors=['red','green','blue','magenta','black','yellow']
                    model = KMeans(n_clusters=c)
                    model.fit(X)
                    y_kmeans = model.predict(X)
                    tdf["cluster"] = y_kmeans
                    inertias.append((c,model.inertia_))

                    trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  
                                        marker=dict(
                                                color=tdf.cluster.apply(lambda x: colors[x]),
                                                colorscale='Viridis',
                                                showscale=True,
                                                size = udf["total_cases"]%20,
                                                opacity = 0.9,
                                                reversescale = True,
                                                symbol = 'pentagon'
                                                ),
                                        name="Locations", text=udf["location"])

                    trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],
                                        mode='markers', 
                                        marker=dict(
                                            color=colors,
                                            size=20,
                                            symbol="circle",
                                            showscale=True,
                                            line = dict(
                                                width=1,
                                                color='rgba(102, 102, 102)'
                                                )

                                            ),
                                        name="Cluster Mean")

                    data7 = [trace0, trace1]  # go.Data is deprecated; a plain list of traces works
                    fig = go.Figure(data=data7)
                    layout = go.Layout(
                                height=600, width=800, title=f"KMeans Cluster Size {c}",
                                xaxis=dict(
                                    title=features[0],
                                ),
                                yaxis=dict(
                                    title=features[1]
                                ) ) 

                    fig.update_layout(layout)
                    st.plotly_chart(fig)

                inertias=np.array(inertias).reshape(-1,2)
                performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
                layout = go.Layout(
                    title="Cluster Number vs Inertia",
                    xaxis=dict(
                        title="Ks"
                    ),
                    yaxis=dict(
                        title="Inertia"
                    ) ) 
                fig=go.Figure(data=[performance])  # go.Data is deprecated; a plain list works
                fig.update_layout(layout)
                st.plotly_chart(fig)

            if use_pca=="Yes":
                st.markdown("### Using PCA")
                comp = sidebar.number_input("Choose Components",1,10,3)

                tdf=udf.copy()

                X = udf[features]
                scaler = StandardScaler()
                X_scaled = scaler.fit_transform(X)

                pca = PCA(n_components=int(comp))
                principalComponents = pca.fit_transform(X_scaled)
                feat = list(range(pca.n_components_))
                PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
                choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
                choosed_component=[int(i) for i in choosed_component]
                inertias = []
                if len(choosed_component)>1:
                    for c in range(1,ks+1):
                        X = PCA_components[choosed_component]

                        model = KMeans(n_clusters=c)
                        model.fit(X)
                        y_kmeans = model.predict(X)
                        tdf["cluster"] = y_kmeans
                        inertias.append((c,model.inertia_))

                        trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  
                                            marker=dict(
                                                    color=tdf.cluster.apply(lambda x: colors[x]),
                                                    colorscale='Viridis',
                                                    showscale=True,
                                                    size = udf["total_cases"]%20,
                                                    opacity = 0.9,
                                                    reversescale = True,
                                                    symbol = 'pentagon'
                                                    ),
                                            name="Locations", text=udf["location"])

                        trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],
                                            mode='markers', 
                                            marker=dict(
                                                color=colors,
                                                size=20,
                                                symbol="circle",
                                                showscale=True,
                                                line = dict(
                                                    width=1,
                                                    color='rgba(102, 102, 102)'
                                                    )

                                                ),
                                            name="Cluster Mean")


                        data7 = [trace0, trace1]  # go.Data is deprecated; a plain list of traces works
                        fig = go.Figure(data=data7)

                        layout = go.Layout(
                                    height=600, width=800, title=f"KMeans Cluster Size {c}",
                                    xaxis=dict(
                                        title=f"Component {choosed_component[0]}",
                                    ),
                                    yaxis=dict(
                                        title=f"Component {choosed_component[1]}"
                                    ) ) 
                        fig.update_layout(layout)
                        st.plotly_chart(fig)

                    inertias=np.array(inertias).reshape(-1,2)
                    performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
                    layout = go.Layout(
                        title="Cluster Number vs Inertia",
                        xaxis=dict(
                            title="Ks"
                        ),
                        yaxis=dict(
                            title="Inertia"
                        ) ) 
                    fig=go.Figure(data=[performance])  # go.Data is deprecated; a plain list works
                    fig.update_layout(layout)
                    st.plotly_chart(fig)
    else:
        st.markdown("### Please Select at Least 2 Features for Visualization.")

We should be seeing something like below on our APP:

K Medoids Clustering

K-medoids is a prominent clustering algorithm, developed as an improvement on its predecessor, the K-Means algorithm. Although it is widely used and less sensitive to noise and outliers, the performance of the K-medoids clustering algorithm is affected by the choice of distance function. From here.

When the k-means algorithm is not appropriate for assigning the data points to clusters, the k-medoids algorithm is preferred. The medoid is the object in a cluster whose total dissimilarity to all the other objects in the cluster is minimal. The main difference between K-means and K-medoids is that K-medoids can work with an arbitrary matrix of distances instead of only the Euclidean distance. K-medoids is a classical partitioning technique that clusters the dataset into k clusters. It is more robust to noise and outliers because it minimizes a sum of pairwise dissimilarities, whereas k-means minimizes a sum of squared Euclidean distances. The most common distances used in KMedoids clustering are the Manhattan distance and the Minkowski distance; here we will use the Manhattan distance.

Manhattan Distance

The Manhattan distance between two points $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$ is: $$ |x_2 - x_1| + |y_2 - y_1| $$
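The Manhattan distance sums the absolute differences of the coordinates, and the medoid of a cluster is the member with the smallest total distance to all other members. A small sketch of both ideas follows; the helper names `manhattan` and `medoid` and the sample points are made up for illustration.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences between points p and q."""
    return sum(abs(a - b) for a, b in zip(p, q))


def medoid(points):
    """Return the member of `points` with the smallest total distance to the rest."""
    return min(points, key=lambda c: sum(manhattan(c, p) for p in points))


# made-up cluster with one outlier at (10, 10)
cluster = [(0, 0), (2, 2), (3, 3), (4, 4), (10, 10)]
print(manhattan((1, 2), (4, 6)))
print(medoid(cluster))
```

Unlike a mean, the medoid is always one of the actual data points, which is part of why the outlier pulls it around less.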

Algorithm

Step 1 : Randomly select (without replacement) k of the n data points as the medoids.
Step 2 : Associate each data point with the closest medoid.
Step 3 : While the cost of the configuration decreases:
- For each medoid m, for each non-medoid data point o:
- Swap m and o, re-compute the cost.
- If the total cost of the configuration increased in the previous step, undo the swap.
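The steps above can be sketched in plain Python as follows. This is only a toy version under our own assumptions (the names `total_cost` and `k_medoids`, the fixed seed and the sample points are all made up), not how scikit-learn-extra implements it. Instead of swapping and then undoing, it evaluates each trial swap and keeps it only when the cost goes down, which has the same effect.

```python
import random


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def total_cost(points, medoids):
    """Sum of distances from every point to its closest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)


def k_medoids(points, k, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial medoids
    medoids = rng.sample(points, k)
    cost = total_cost(points, medoids)
    improved = True
    # Step 3: keep trying swaps while the cost of the configuration decreases
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = total_cost(points, trial)
                if trial_cost < cost:  # keep the swap only if it helps
                    medoids, cost = trial, trial_cost
                    improved = True
    # Step 2: associate each point with its closest medoid
    labels = [min(range(k), key=lambda j: manhattan(p, medoids[j]))
              for p in points]
    return medoids, labels, cost


# two made-up, well separated groups of points
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
medoids, labels, cost = k_medoids(points, k=2)
print(medoids, labels, cost)
```

Every trial swap recomputes the distance from all points to all medoids, which is why this approach gets slow quickly as the number of rows grows.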

We will use scikit-learn-extra rather than scikit-learn itself, because it provides additional algorithms, including KMedoids, that scikit-learn does not. But there is a big problem with KMedoids: its time and memory complexity. The algorithm works from the pairwise distances between data points, so the cost grows roughly quadratically with the number of rows. So we will try to cluster on a sample of the data instead of the original data.

pip install scikit-learn-extra

Everything else is the same as what we did for KMeans, so we will skip the exploration here and add it directly to the app.

KMedoids to Streamlit App

To use KMedoids, we import it with from sklearn_extra.cluster import KMedoids. Everything else is similar to KMeans.

        # if selected kmedoids, do respective operations
        if calg == "K-Medoids":  
            st.markdown("### K-Medoids Clustering")      

            # if using PCA or not
            use_pca = sidebar.radio("Use PCA?",["Yes","No"])
            # if not using pca, do default clustering
            if use_pca=="No":
                st.markdown("### Not Using PCA")
                inertias = []
                for c in range(1,ks+1):
                    tdf = udf.copy()
                    X = tdf[features]                
                    # colors=['red','green','blue','magenta','black','yellow']
                    # use the Manhattan distance described above (the default metric is Euclidean)
                    model = KMedoids(n_clusters=c, metric="manhattan")
                    model.fit(X)
                    y_kmeans = model.predict(X)
                    tdf["cluster"] = y_kmeans
                    inertias.append((c,model.inertia_))


                    trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  
                                        marker=dict(
                                                color=tdf.cluster.apply(lambda x: colors[x]),
                                                colorscale='Viridis',
                                                showscale=True,
                                                size = udf["total_cases"]%20,
                                                opacity = 0.9,
                                                reversescale = True,
                                                symbol = 'pentagon'
                                                ),
                                        name="Locations", text=udf["location"])

                    trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],
                                        mode='markers', 
                                        marker=dict(
                                            color=colors,
                                            size=20,
                                            symbol="circle",
                                            showscale=True,
                                            line = dict(
                                                width=1,
                                                color='rgba(102, 102, 102)'
                                                )

                                            ),
                                        name="Cluster Medoid")

                    data7 = [trace0, trace1]  # go.Data is deprecated; a plain list of traces works
                    fig = go.Figure(data=data7)
                    layout = go.Layout(
                                height=600, width=800, title=f"KMedoids Cluster Size {c}",
                                xaxis=dict(
                                    title=features[0],
                                ),
                                yaxis=dict(
                                    title=features[1]
                                ) ) 

                    fig.update_layout(layout)
                    st.plotly_chart(fig)

                inertias=np.array(inertias).reshape(-1,2)
                performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
                layout = go.Layout(
                    title="Cluster Number vs Inertia",
                    xaxis=dict(
                        title="Ks"
                    ),
                    yaxis=dict(
                        title="Inertia"
                    ) ) 
                fig=go.Figure(data=[performance])  # go.Data is deprecated; a plain list works
                fig.update_layout(layout)
                st.plotly_chart(fig)

            # if using pca, use pca to reduce dimensionality and then do clustering    
            if use_pca=="Yes":
                st.markdown("### Using PCA")
                comp = sidebar.number_input("Choose Components",1,10,3)

                tdf=udf.copy()

                X = udf[features]
                scaler = StandardScaler()
                X_scaled = scaler.fit_transform(X)

                pca = PCA(n_components=int(comp))
                principalComponents = pca.fit_transform(X_scaled)
                feat = list(range(pca.n_components_))
                PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
                choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
                choosed_component=[int(i) for i in choosed_component]
                inertias = []
                if len(choosed_component)>1:
                    for c in range(1,ks+1):
                        X = PCA_components[choosed_component]

                        # use the Manhattan distance described above (the default metric is Euclidean)
                        model = KMedoids(n_clusters=c, metric="manhattan")
                        model.fit(X)
                        y_kmeans = model.predict(X)
                        tdf["cluster"] = y_kmeans
                        inertias.append((c,model.inertia_))

                        trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  
                                            marker=dict(
                                                    color=tdf.cluster.apply(lambda x: colors[x]),
                                                    colorscale='Viridis',
                                                    showscale=True,
                                                    size = udf["total_cases"]%20,
                                                    opacity = 0.9,
                                                    reversescale = True,
                                                    symbol = 'pentagon'
                                                    ),
                                            name="Locations", text=udf["location"])

                        trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],
                                            mode='markers', 
                                            marker=dict(
                                                color=colors,
                                                size=20,
                                                symbol="circle",
                                                showscale=True,
                                                line = dict(
                                                    width=1,
                                                    color='rgba(102, 102, 102)'
                                                    )

                                                ),
                                            name="Cluster Medoid")


                        data7 = [trace0, trace1]  # go.Data is deprecated; a plain list of traces works
                        fig = go.Figure(data=data7)

                        layout = go.Layout(
                                    height=600, width=800, title=f"KMedoids Cluster Size {c}",
                                    xaxis=dict(
                                        title=f"Component {choosed_component[0]}",
                                    ),
                                    yaxis=dict(
                                        title=f"Component {choosed_component[1]}"
                                    ) ) 
                        fig.update_layout(layout)
                        st.plotly_chart(fig)

                    inertias=np.array(inertias).reshape(-1,2)
                    performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
                    layout = go.Layout(
                        title="Cluster Number vs Inertia",
                        xaxis=dict(
                            title="Ks"
                        ),
                        yaxis=dict(
                            title="Inertia"
                        ) ) 
                    fig=go.Figure(data=[performance])
                    fig.update_layout(layout)
                    st.plotly_chart(fig)

We should be seeing something like below:

Making It a Little Bit Dynamic

Instead of using if/else to build the clustering model, we can keep the clustering classes in a dictionary keyed by the selected value of calg and instantiate whichever one is chosen.

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from colors import *

@st.cache
def get_data(url):
    df = pd.read_csv(url)
    df["date"] = pd.to_datetime(df.date).dt.date
    df['date'] = pd.DatetimeIndex(df.date)

    return df

colors = get_colors()

url = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
data = get_data(url)
columns = ['total_cases', 'new_cases',
                'new_cases_smoothed', 'total_deaths', 'new_deaths',
                'new_deaths_smoothed', 'total_cases_per_million',
                'new_cases_per_million', 'new_cases_smoothed_per_million',
                'total_deaths_per_million', 'new_deaths_per_million',
                'new_deaths_smoothed_per_million', 'new_tests', 'total_tests',
                'total_tests_per_thousand', 'new_tests_per_thousand',
                'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
                'tests_per_case', 'positive_rate', 'stringency_index',
                'population', 'population_density', 'median_age', 'aged_65_older',
                'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
                'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers',
                'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand',
                'life_expectancy', 'human_development_index']

locations = data.location.unique().tolist()

sidebar = st.sidebar

mode = sidebar.radio("Mode", ["EDA", "Clustering"])
st.markdown("<h1 style='text-align: center; color: #ff0000;'>COVID-19</h1>", unsafe_allow_html=True)
st.markdown("# Mode: {}".format(mode), unsafe_allow_html=True)

if mode=="EDA":
    analysis_type = sidebar.radio("Analysis Type", ["Single", "Multiple"])
    st.markdown(f"# Analysis Mode: {analysis_type}")

    if analysis_type=="Single":
        location_selector = sidebar.selectbox(
            "Select a Location",
            locations
        )
        st.markdown(f"# Currently Selected {location_selector}")
        trend_level = sidebar.selectbox("Trend Level", ["Daily", "Weekly", "Monthly", "Quarterly", "Yearly"])
        st.markdown(f"### Currently Selected {trend_level}")

        show_data = sidebar.checkbox("Show Data")

        trend_kwds = {"Daily": "1D", "Weekly": "1W", "Monthly": "1M", "Quarterly": "1Q", "Yearly": "1Y"}
        trend_data = data[data.location == location_selector].\
            groupby(pd.Grouper(key="date", 
            freq=trend_kwds[trend_level])).aggregate(new_cases=("new_cases", "sum"),
            new_deaths = ("new_deaths", "sum"),
            new_vaccinations = ("new_vaccinations", "sum"),
            new_tests = ("new_tests", "sum")).reset_index()

        trend_data["date"] = trend_data.date.dt.date

        new_cases = sidebar.checkbox("New Cases")
        new_deaths = sidebar.checkbox("New Deaths")
        new_vaccinations = sidebar.checkbox("New Vaccinations")
        new_tests = sidebar.checkbox("New Tests")

        lines = [new_cases, new_deaths, new_vaccinations, new_tests]
        line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]
        trends = [c[1] for c in zip(lines,line_cols) if c[0]==True]

        if show_data:
            tcols = ["date"] + trends
            st.dataframe(trend_data[tcols])

        subplots=sidebar.checkbox("Show Subplots", True)
        if len(trends)>0:
            fig=trend_data.iplot(kind="line", asFigure=True, xTitle="Date", yTitle="Values",
                                x="date", y=trends, title=f"{trend_level} Trend of {', '.join(trends)}.", subplots=subplots)
            st.plotly_chart(fig, use_container_width=False)

    if analysis_type=="Multiple":
        selected = sidebar.multiselect("Select Locations ", locations)
        st.markdown(f"## Selected Locations: {', '.join(selected)}")
        show_data = sidebar.checkbox("Show Data")
        trend_level = sidebar.selectbox("Trend Level", ["Daily", "Weekly", "Monthly", "Quarterly", "Yearly"])
        st.markdown(f"### Currently Selected {trend_level}")

        trend_kwds = {"Daily": "1D", "Weekly": "1W", "Monthly": "1M", "Quarterly": "1Q", "Yearly": "1Y"}

        trend_data = data.query(f"location in {selected}").\
            groupby(["location", pd.Grouper(key="date", 
            freq=trend_kwds[trend_level])]).aggregate(new_cases=("new_cases", "sum"),
            new_deaths = ("new_deaths", "sum"),
            new_vaccinations = ("new_vaccinations", "sum"),
            new_tests = ("new_tests", "sum")).reset_index()

        trend_data["date"] = trend_data.date.dt.date

        new_cases = sidebar.checkbox("New Cases")
        new_deaths = sidebar.checkbox("New Deaths")
        new_vaccinations = sidebar.checkbox("New Vaccinations")
        new_tests = sidebar.checkbox("New Tests")

        lines = [new_cases, new_deaths, new_vaccinations, new_tests]
        line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]
        trends = [c[1] for c in zip(lines,line_cols) if c[0]==True]

        ndf = pd.DataFrame(data=trend_data.date.unique(),columns=["date"])

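        for s in selected:
            new_cols = ["date"]+[f"{s}_{c}" for c in line_cols]
            # Filter with a boolean mask (safe for names containing an apostrophe) and
            # drop "location" on a new frame to avoid SettingWithCopyWarning.
            tdf = trend_data[trend_data.location == s].drop("location", axis=1)
            tdf.columns=new_cols
            ndf=ndf.merge(tdf,on="date",how="inner")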

        if show_data:
            if len(ndf)>0:
                st.dataframe(ndf)
            else:
                st.markdown("Empty Dataframe")

        new_trends = []
        for c in trends:
            new_trends.extend([f"{s}_{c}" for s in selected])

        subplots=sidebar.checkbox("Show Subplots", True)
        if len(trends)>0:
            st.markdown("### Trend of Selected Locations")

            fig=ndf.iplot(kind="line", asFigure=True, xTitle="Date", yTitle="Values",
                                x="date", y=new_trends, title=f"{trend_level} Trend of {', '.join(trends)}.", subplots=subplots)
            st.plotly_chart(fig, use_container_width=False)

elif mode=="Clustering":
    colors = get_colors()    
    features = sidebar.multiselect("Select Features", columns, default=columns[:3])

    # select a clustering algorithm
    calg = sidebar.selectbox("Select a clustering algorithm", ["K-Means","K-Medoids"])
    algs = {"K-Means": KMeans, "K-Medoids": KMedoids}

    # select number of clusters
    ks = sidebar.slider("Select number of clusters", min_value=2, max_value=10, value=2)

    # select a dataframe to apply cluster on

    udf = data.sort_values("date").drop_duplicates(subset=["location"],keep="last").dropna(subset=features)
    udf = udf[~udf.location.isin(["Lower middle income", "North America", "World", "Asia", "Europe", 
                           "European Union", "Upper middle income", 
                           "High income", "South America"])]

    # udf[features].dropna()

    if len(features)>=2:
        st.markdown(f"### {calg} Clustering")      

        # if using PCA or not
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        # if not using pca, do default clustering
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                # colors=['red','green','blue','magenta','black','yellow']
                model = algs[calg](n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))


                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  
                                    marker=dict(
                                            color=tdf.cluster.apply(lambda x: colors[x]),
                                            colorscale='Viridis',
                                            showscale=True,
                                            size = udf["total_cases"]%20,
                                            opacity = 0.9,
                                            reversescale = True,
                                            symbol = 'pentagon'
                                            ),
                                    name="Locations", text=udf["location"])

                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],
                                    mode='markers', 
                                    marker=dict(
                                        color=colors,
                                        size=20,
                                        symbol="circle",
                                        showscale=True,
                                        line = dict(
                                            width=1,
                                            color='rgb(102, 102, 102)'
                                            )

                                        ),
                                    name="Cluster Center")

                data7 = [trace0, trace1]
                fig = go.Figure(data=data7)
                layout = go.Layout(
                            height=600, width=800, title=f"{calg} Cluster Size {c}",
                            xaxis=dict(
                                title=features[0],
                            ),
                            yaxis=dict(
                                title=features[1]
                            ) ) 

                fig.update_layout(layout)
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=[performance])
            fig.update_layout(layout)
            st.plotly_chart(fig)

        # if using pca, use pca to reduce dimensionality and then do clustering    
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            comp = sidebar.number_input("Choose Components",1,10,3)

            tdf=udf.copy()

            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)

            pca = PCA(n_components=int(comp))
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            if len(choosed_component)>1:
                for c in range(1,ks+1):
                    X = PCA_components[choosed_component]

                    model = algs[calg](n_clusters=c)
                    model.fit(X)
                    y_kmeans = model.predict(X)
                    tdf["cluster"] = y_kmeans
                    inertias.append((c,model.inertia_))

                    trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  
                                        marker=dict(
                                                color=tdf.cluster.apply(lambda x: colors[x]),
                                                colorscale='Viridis',
                                                showscale=True,
                                                size = udf["total_cases"]%20,
                                                opacity = 0.9,
                                                reversescale = True,
                                                symbol = 'pentagon'
                                                ),
                                        name="Locations", text=udf["location"])

                    trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],
                                        mode='markers', 
                                        marker=dict(
                                            color=colors,
                                            size=20,
                                            symbol="circle",
                                            showscale=True,
                                            line = dict(
                                                width=1,
                                                color='rgb(102, 102, 102)'
                                                )

                                            ),
                                        name="Cluster Center")


                    data7 = [trace0, trace1]
                    fig = go.Figure(data=data7)

                    layout = go.Layout(
                                height=600, width=800, title=f"{calg} Cluster Size {c}",
                                xaxis=dict(
                                    title=f"Component {choosed_component[0]}",
                                ),
                                yaxis=dict(
                                    title=f"Component {choosed_component[1]}"
                                ) ) 
                    fig.update_layout(layout)
                    st.plotly_chart(fig)

                inertias=np.array(inertias).reshape(-1,2)
                performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
                layout = go.Layout(
                    title="Cluster Number vs Inertia",
                    xaxis=dict(
                        title="Ks"
                    ),
                    yaxis=dict(
                        title="Inertia"
                    ) ) 
                fig=go.Figure(data=[performance])
                fig.update_layout(layout)
                st.plotly_chart(fig)


    else:
        st.markdown("### Please Select at Least 2 Features for Visualization.")
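
The dictionary lookup that replaces the if/else can be shown in isolation. The sketch below is only an illustration and not the app's code: FakeKMeans and FakeKMedoids are hypothetical stand-ins so that it runs without scikit-learn, and make_model is a helper name introduced only here.

```python
# Hypothetical stand-ins for KMeans and KMedoids (assumptions for illustration).
class FakeKMeans:
    def __init__(self, n_clusters):
        self.n_clusters = n_clusters


class FakeKMedoids:
    def __init__(self, n_clusters):
        self.n_clusters = n_clusters


# Map each sidebar label to a class, not to an instance.
algs = {"K-Means": FakeKMeans, "K-Medoids": FakeKMedoids}


def make_model(calg, n_clusters):
    """Look up the class for the selected label and instantiate it."""
    return algs[calg](n_clusters=n_clusters)


model = make_model("K-Medoids", 3)
print(type(model).__name__, model.n_clusters)
```

Adding another algorithm then only needs one more entry in the dictionary and one more label in the selectbox, as long as its class accepts n_clusters.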

That is all for this part. In the next part, we will add a regression feature to our app.

Make Awesome Data Dashboard using Streamlit and Plotly: Simple Trends

Viper — Sun, 27 Feb 2022 06:37:21 +0000

Originally published on dataqoil.com.

One of the ways we find insights in data is via dashboards. For data analysts, there are options like Tableau, but not all of them are free. However, we can make some cool dashboards using Streamlit, and in this blog we will explore how.

This blog is just the beginning of creating a simple data dashboard with Plotly in Streamlit. Here we will only plot lines. The next blog will be about plotting maps. Please stay tuned.

Updates

2022/2/20 : This blog

Installation

We have written a cool blog about getting started with Plotly and Cufflinks for making awesome analysis and plots in Jupyter Notebook. Please do not forget to read it.

Plotting Interactive Plots with Plotly and Cufflinks

pip install plotly cufflinks streamlit

First Streamlit App

To make our first Streamlit app:

We will create a new project folder (this is optional).
We will create a new Python file named main.py inside it.
Then inside that Python file we will add:

import streamlit as st

st.markdown("Hello world, this is my new Data Dashboard.")

Now save the file and, from the project folder, run Streamlit:

streamlit run main.py

We should see something like below in the terminal:

If the link does not open in the browser by itself, open it manually. We should then see our markdown text on the web page.

First Plotly Plot in Streamlit

Plotting graphs in a Streamlit app is relatively easy compared to most other web apps. Let's see how.

We will be making a data dashboard, so we will first prepare some real-world data.
The data is the COVID-19 data from this repository. It is updated daily, so your results may differ from ours in this blog.

Let's put the code below in our main.py file and see the changes in the browser by refreshing.

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks

@st.cache
def get_data(url):
    df = pd.read_csv(url)
    df["date"] = pd.to_datetime(df.date).dt.date
    df['date'] = pd.DatetimeIndex(df.date)

    return df

url = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
data = get_data(url)

daily_cases = data.groupby(pd.Grouper(key="date", freq="1D")).aggregate(new_cases=("new_cases", "sum")).reset_index()
fig = daily_cases.iplot(kind="line", asFigure=True, 
                        x="date", y="new_cases")
st.plotly_chart(fig)

In the above code, we:

Imported Streamlit, NumPy, Pandas and Cufflinks.
Read a CSV file from a given URL inside a function wrapped with a cache decorator. We do this because we do not want the CSV file to be reloaded every time we make a small change in the source file.
Made a date column with datetime values.
Aggregated the data at the daily level by summing the new cases.
Plotted a line plot using iplot. Cufflinks is what allows us to call iplot on a Pandas object.
Passed asFigure=True to iplot so that it returns a figure, and then passed that figure to st.plotly_chart to show it in the Streamlit app.
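
The benefit of caching can be sketched without Streamlit. The example below uses functools.lru_cache from the standard library as a stand-in. It is not how st.cache is implemented, and load_rows, calls and the example URL are hypothetical names used only to show that the function body runs once per distinct argument.

```python
from functools import lru_cache

calls = []  # records every real (non-cached) execution


@lru_cache(maxsize=None)
def load_rows(url):
    """Pretend to download a file; the body runs once per distinct url."""
    calls.append(url)
    return ("row-1", "row-2")  # placeholder data, not real COVID-19 rows


first = load_rows("https://example.com/data.csv")
second = load_rows("https://example.com/data.csv")  # served from the cache

print(len(calls))       # number of real executions
print(first is second)  # whether both calls returned the same cached object
```

This matters in Streamlit because the script is rerun from top to bottom on every interaction, so an uncached download would repeat each time a widget changes.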

Adding Dropdown for Location

The above plot covers all locations at once. If we look carefully at the locations, there are values like World, Asia and so on, which are already aggregated values. So to view the world's daily trend, we must either filter out the rows of locations like World and Asia, or select only the rows with those values. Doing such a filter or selection inside the code is not a good idea, so let's make a drop down. Just below the function, we will modify the code to look like below:

url = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
data = get_data(url)

locations = data.location.unique().tolist()

sidebar = st.sidebar
location_selector = sidebar.selectbox(
    "Select a Location",
    locations
)
st.markdown(f"# Currently Selected {location_selector}")

daily_cases = data.groupby(pd.Grouper(key="date", freq="1D")).aggregate(new_cases=("new_cases", "sum")).reset_index()
fig = daily_cases.iplot(kind="line", asFigure=True, 
                        x="date", y="new_cases")
st.plotly_chart(fig)

What we did is:

Took a unique list of countries or locations from the above dataframe.
Created a sidebar object to make our drop down visible on the sidebar.
Created a selectbox in that sidebar with the locations as its options.
Showed the currently selected location in markdown. Streamlit returns the selected value from the selectbox.

We can see something like below:

Adding A Checkbox to Show Data

It is even simpler. Add the code below just after the markdown that shows the selected location.


show_data = sidebar.checkbox("Show Data")

if show_data:
    st.dataframe(data)

daily_cases = data.groupby(pd.Grouper(key="date", freq="1D")).aggregate(new_cases=("new_cases", "sum")).reset_index()
fig = daily_cases.iplot(kind="line", asFigure=True, 
                        x="date", y="new_cases")
st.plotly_chart(fig)

We created a checkbox on the sidebar and, if it is checked, we pass the data to st.dataframe. Below is the result in the web app.

But the data is not very readable. So let's create a new drop down where we will select the type of trend. But first, let's list the metrics or trends of the data that we want to visualize:

Daily Cases : How many cases were there at the daily level?
Daily Deaths : How many deaths were there at the daily level?
Daily Tests : How many tests were there at the daily level?
Daily Vaccination : How many vaccinations were there at the daily level?

For the above 4 metrics, we could make weekly, monthly, quarterly and yearly aggregations just as easily, so let's handle them all together.
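
Since all four metrics are plain sums, moving from daily to weekly or monthly only means changing the frequency string passed to pd.Grouper. Below is a minimal sketch with made-up data. The aggregate helper and the toy values are assumptions for illustration and are not part of the app.

```python
import pandas as pd

# Toy data: ten consecutive days with one new case per day (made-up values).
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "new_cases": [1] * 10,
})


def aggregate(frame, freq):
    """Sum new_cases over bins of the given pandas frequency string."""
    return (
        frame.groupby(pd.Grouper(key="date", freq=freq))
        .aggregate(new_cases=("new_cases", "sum"))
        .reset_index()
    )


daily = aggregate(df, "1D")
weekly = aggregate(df, "1W")

print(len(daily), len(weekly))
print(weekly)
```

The total number of cases stays the same at every level; only the number of rows changes.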

Date Level Trend Data

Just below the locations line, we will write code like below:

sidebar = st.sidebar
location_selector = sidebar.selectbox(
    "Select a Location",
    locations
)
st.markdown(f"# Currently Selected {location_selector}")
trend_level = sidebar.selectbox("Trend Level", ["Daily", "Weekly", "Monthly", "Quarterly", "Yearly"])
st.markdown(f"### Currently Selected {trend_level}")

show_data = sidebar.checkbox("Show Data")

trend_kwds = {"Daily": "1D", "Weekly": "1W", "Monthly": "1M", "Quarterly": "1Q", "Yearly": "1Y"}
trend_data = data[data.location == location_selector].\
    groupby(pd.Grouper(key="date", 
    freq=trend_kwds[trend_level])).aggregate(new_cases=("new_cases", "sum"),
    new_deaths = ("new_deaths", "sum"),
    new_vaccinations = ("new_vaccinations", "sum"),
    new_tests = ("new_tests", "sum")).reset_index()

trend_data["date"] = trend_data.date.dt.date

new_cases = sidebar.checkbox("New Cases")
new_deaths = sidebar.checkbox("New Deaths")
new_vaccinations = sidebar.checkbox("New Vaccinations")
new_tests = sidebar.checkbox("New Tests")

lines = [new_cases, new_deaths, new_vaccinations, new_tests]
line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]
trends = [c[1] for c in zip(lines,line_cols) if c[0]==True]

if show_data:
    tcols = ["date"] + trends
    st.dataframe(trend_data[tcols])

daily_cases = data.groupby(pd.Grouper(key="date", freq="1D")).aggregate(new_cases=("new_cases", "sum")).reset_index()
fig = daily_cases.iplot(kind="line", asFigure=True, 
                        x="date", y="new_cases")
st.plotly_chart(fig)

What we did in the above code is:

Created a selectbox for selecting a trend level: daily, weekly, monthly, quarterly or yearly.
Showed the selected level as a third-level heading in markdown.
Kept the show data checkbox that we had already made.
Prepared a dictionary of keywords, one for each level. It is used while grouping on the respective date level, so 1D is for daily, 1W is for weekly and so on. We use the selected trend level as the key and pass the corresponding value as the Grouper's frequency.
Took the data of the currently selected location and grouped it according to the given trend level. Then we calculated the sums of new cases, new deaths, new vaccinations and new tests at that level.
Converted the date column to plain dates without the time part.
Made a separate checkbox for each of the trend data columns created above.
Created a list, lines, holding all the checkbox variables from the previous step, because we will plot one line per checked column.
Created another list, line_cols, where we kept the names of the trend_data columns in the same order as the checkboxes in lines.
Created a third list, trends, holding the column names from line_cols whose respective checkbox is checked.
If the show_data checkbox is checked, we show the data, but only the columns that are checked.
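
The pairing of checkboxes with column names can be tried outside Streamlit. In the sketch below the four booleans are assumed values standing in for what sidebar.checkbox would return.

```python
# Stand-in checkbox states (assumed values; in the app they come from sidebar.checkbox).
new_cases, new_deaths, new_vaccinations, new_tests = True, False, True, False

lines = [new_cases, new_deaths, new_vaccinations, new_tests]
line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]

# Keep a column name only when the checkbox at the same position is ticked.
trends = [col for checked, col in zip(lines, line_cols) if checked]

print(trends)
```

Because zip pairs items by position, lines and line_cols must stay in the same order.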

The result should look like below:

And if we select all the columns with the weekly trend of Afghanistan:

Date Level Trend Visualization

In the above web app, our dashboard contained only a data table and the plot that we initially created. Now, let's create a visualization of the trend data as well.

Let's put the code below just after the part where we showed our data.

subplots=sidebar.checkbox("Show Subplots", True)
if len(trends)>0:
    fig=trend_data.iplot(kind="line", asFigure=True, xTitle="Date", yTitle="Values",
                         x="date", y=trends, title=f"{trend_level} Trend of {', '.join(trends)}.", subplots=subplots)
    st.plotly_chart(fig, use_container_width=False)

Also remove the visualization code we added earlier.

We can see something like below:

Comparison Between N Countries

In the above plots, we only plotted a single location, but what if we want to compare two or more locations on the same figure? Our code cannot do this yet, so we will tweak it a little bit.

Make a radio button and pass two values, Single and Multiple. If Single is selected, we will do the analysis on a single location, else on multiple locations.
For the Single selection, put everything we've done until now inside an if condition.

analysis_type = sidebar.radio("Analysis Type", ["Single", "Multiple"])
st.markdown(f"Analysis Mode: {analysis_type}")

if analysis_type=="Single":
    location_selector = sidebar.selectbox(
        "Select a Location",
        locations
    )
    st.markdown(f"# Currently Selected {location_selector}")
    trend_level = sidebar.selectbox("Trend Level", ["Daily", "Weekly", "Monthly", "Quarterly", "Yearly"])
    st.markdown(f"### Currently Selected {trend_level}")

    show_data = sidebar.checkbox("Show Data")

    trend_kwds = {"Daily": "1D", "Weekly": "1W", "Monthly": "1M", "Quarterly": "1Q", "Yearly": "1Y"}
    trend_data = data[data.location == location_selector].\
        groupby(pd.Grouper(key="date", 
        freq=trend_kwds[trend_level])).aggregate(new_cases=("new_cases", "sum"),
        new_deaths = ("new_deaths", "sum"),
        new_vaccinations = ("new_vaccinations", "sum"),
        new_tests = ("new_tests", "sum")).reset_index()

    trend_data["date"] = trend_data.date.dt.date

    new_cases = sidebar.checkbox("New Cases")
    new_deaths = sidebar.checkbox("New Deaths")
    new_vaccinations = sidebar.checkbox("New Vaccinations")
    new_tests = sidebar.checkbox("New Tests")

    lines = [new_cases, new_deaths, new_vaccinations, new_tests]
    line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]
    trends = [c[1] for c in zip(lines,line_cols) if c[0]==True]

    if show_data:
        tcols = ["date"] + trends
        st.dataframe(trend_data[tcols])

    subplots=sidebar.checkbox("Show Subplots", True)
    if len(trends)>0:
        fig=trend_data.iplot(kind="line", asFigure=True, xTitle="Date", yTitle="Values",
                            x="date", y=trends, title=f"{trend_level} Trend of {', '.join(trends)}.", subplots=subplots)
        st.plotly_chart(fig, use_container_width=False)

For Multiple, we will first select a few locations using a multiselect and then show them in markdown.

if analysis_type=="Multiple":
    selected = sidebar.multiselect("Select Locations ", locations)
    st.markdown(f"## Selected Locations: {', '.join(selected)}")

Create the show data checkbox and do the same as above until we have created the trends list.

    show_data = sidebar.checkbox("Show Data")
    trend_level = sidebar.selectbox("Trend Level", ["Daily", "Weekly", "Monthly", "Quarterly", "Yearly"])
    st.markdown(f"### Currently Selected {trend_level}")

    trend_kwds = {"Daily": "1D", "Weekly": "1W", "Monthly": "1M", "Quarterly": "1Q", "Yearly": "1Y"}

    trend_data = data.query(f"location in {selected}").\
        groupby(["location", pd.Grouper(key="date", 
        freq=trend_kwds[trend_level])]).aggregate(new_cases=("new_cases", "sum"),
        new_deaths = ("new_deaths", "sum"),
        new_vaccinations = ("new_vaccinations", "sum"),
        new_tests = ("new_tests", "sum")).reset_index()

    trend_data["date"] = trend_data.date.dt.date

    new_cases = sidebar.checkbox("New Cases")
    new_deaths = sidebar.checkbox("New Deaths")
    new_vaccinations = sidebar.checkbox("New Vaccinations")
    new_tests = sidebar.checkbox("New Tests")

    lines = [new_cases, new_deaths, new_vaccinations, new_tests]
    line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]
    trends = [c[1] for c in zip(lines,line_cols) if c[0]==True]

Create a new dataframe to which we will add new columns for each selected country.

    ndf = pd.DataFrame(data=trend_data.date.unique(),columns=["date"])

For each selected country, create new columns and merge them back into ndf with date as the key.

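    for s in selected:
        new_cols = ["date"]+[f"{s}_{c}" for c in line_cols]
        # Filter with a boolean mask (safe for names containing an apostrophe) and
        # drop "location" on a new frame to avoid SettingWithCopyWarning.
        tdf = trend_data[trend_data.location == s].drop("location", axis=1)
        tdf.columns=new_cols
        ndf=ndf.merge(tdf,on="date",how="inner")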
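
To see what this loop produces, here is a small self-contained sketch with made-up numbers. The locations A and B and their values are placeholders, and only one metric is used to keep the output short.

```python
import pandas as pd

# Toy long-format data with made-up values; "A" and "B" are placeholder locations.
trend_data = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"] * 2),
    "new_cases": [1, 2, 10, 20],
})
line_cols = ["new_cases"]
selected = ["A", "B"]

# Start from the unique dates, then attach one block of columns per location.
ndf = pd.DataFrame({"date": trend_data["date"].unique()})
for s in selected:
    tdf = trend_data[trend_data["location"] == s].drop(columns="location")
    tdf.columns = ["date"] + [f"{s}_{c}" for c in line_cols]
    ndf = ndf.merge(tdf, on="date", how="inner")

print(ndf.columns.tolist())
print(ndf)
```

Note that how="inner" keeps only the dates that exist for every selected location, so a location with a shorter history shortens the whole table.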

If show_data is selected, we will show the dataframe.

    if show_data:
        if len(ndf)>0:
            st.dataframe(ndf)
        else:
            st.markdown("Empty Dataframe")

Create a new list where we will put the column names for each selected location.

    new_trends = []
    for c in trends:
        new_trends.extend([f"{s}_{c}" for s in selected])
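
With example values, the resulting list looks as follows. The two locations and two metrics below are assumed selections used only for illustration.

```python
selected = ["Nepal", "India"]         # example locations picked in the multiselect
trends = ["new_cases", "new_deaths"]  # example metrics ticked in the sidebar

new_trends = []
for c in trends:
    # One column per location for the current metric.
    new_trends.extend([f"{s}_{c}" for s in selected])

print(new_trends)
```

The names follow the same location_metric pattern used when the columns of ndf were created, which is why they can be passed directly as the y columns of the plot.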

Create a subplots checkbox and plot a line plot with new_trends column names.

    subplots=sidebar.checkbox("Show Subplots", True)
    if len(trends)>0:
        st.markdown("### Trend of Selected Locations")

        fig=ndf.iplot(kind="line", asFigure=True, xTitle="Date", yTitle="Values",
                            x="date", y=new_trends, title=f"{trend_level} Trend of {', '.join(trends)}.", subplots=subplots)
        st.plotly_chart(fig, use_container_width=False)

Full Code

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks

@st.cache
def get_data(url):
    df = pd.read_csv(url)
    df["date"] = pd.to_datetime(df.date).dt.date
    df['date'] = pd.DatetimeIndex(df.date)

    return df

url = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
data = get_data(url)

locations = data.location.unique().tolist()

sidebar = st.sidebar

analysis_type = sidebar.radio("Analysis Type", ["Single", "Multiple"])
st.markdown(f"Analysis Mode: {analysis_type}")

if analysis_type=="Single":
    location_selector = sidebar.selectbox(
        "Select a Location",
        locations
    )
    st.markdown(f"# Currently Selected {location_selector}")
    trend_level = sidebar.selectbox("Trend Level", ["Daily", "Weekly", "Monthly", "Quarterly", "Yearly"])
    st.markdown(f"### Currently Selected {trend_level}")

    show_data = sidebar.checkbox("Show Data")

    trend_kwds = {"Daily": "1D", "Weekly": "1W", "Monthly": "1M", "Quarterly": "1Q", "Yearly": "1Y"}
    trend_data = data[data.location == location_selector].\
        groupby(pd.Grouper(key="date", 
        freq=trend_kwds[trend_level])).aggregate(new_cases=("new_cases", "sum"),
        new_deaths = ("new_deaths", "sum"),
        new_vaccinations = ("new_vaccinations", "sum"),
        new_tests = ("new_tests", "sum")).reset_index()

    trend_data["date"] = trend_data.date.dt.date

    new_cases = sidebar.checkbox("New Cases")
    new_deaths = sidebar.checkbox("New Deaths")
    new_vaccinations = sidebar.checkbox("New Vaccinations")
    new_tests = sidebar.checkbox("New Tests")

    lines = [new_cases, new_deaths, new_vaccinations, new_tests]
    line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]
    trends = [c[1] for c in zip(lines,line_cols) if c[0]==True]

    if show_data:
        tcols = ["date"] + trends
        st.dataframe(trend_data[tcols])

    subplots=sidebar.checkbox("Show Subplots", True)
    if len(trends)>0:
        fig=trend_data.iplot(kind="line", asFigure=True, xTitle="Date", yTitle="Values",
                            x="date", y=trends, title=f"{trend_level} Trend of {', '.join(trends)}.", subplots=subplots)
        st.plotly_chart(fig, use_container_width=False)

if analysis_type=="Multiple":
    selected = sidebar.multiselect("Select Locations ", locations)
    st.markdown(f"## Selected Locations: {', '.join(selected)}")
    show_data = sidebar.checkbox("Show Data")
    trend_level = sidebar.selectbox("Trend Level", ["Daily", "Weekly", "Monthly", "Quarterly", "Yearly"])
    st.markdown(f"### Currently Selected {trend_level}")

    trend_kwds = {"Daily": "1D", "Weekly": "1W", "Monthly": "1M", "Quarterly": "1Q", "Yearly": "1Y"}

    trend_data = data.query(f"location in {selected}").\
        groupby(["location", pd.Grouper(key="date", 
        freq=trend_kwds[trend_level])]).aggregate(new_cases=("new_cases", "sum"),
        new_deaths = ("new_deaths", "sum"),
        new_vaccinations = ("new_vaccinations", "sum"),
        new_tests = ("new_tests", "sum")).reset_index()

    trend_data["date"] = trend_data.date.dt.date

    new_cases = sidebar.checkbox("New Cases")
    new_deaths = sidebar.checkbox("New Deaths")
    new_vaccinations = sidebar.checkbox("New Vaccinations")
    new_tests = sidebar.checkbox("New Tests")

    lines = [new_cases, new_deaths, new_vaccinations, new_tests]
    line_cols = ["new_cases", "new_deaths", "new_vaccinations", "new_tests"]
    trends = [col for checked, col in zip(lines, line_cols) if checked]

    ndf = pd.DataFrame(data=trend_data.date.unique(),columns=["date"])

    for s in selected:
        new_cols = ["date"]+[f"{s}_{c}" for c in line_cols]
        # drop() returns a new frame, which avoids modifying a filtered slice in place
        tdf = trend_data.query(f"location=='{s}'").drop(columns="location")
        tdf.columns=new_cols
        ndf=ndf.merge(tdf,on="date",how="inner")

    if show_data:
        if len(ndf)>0:
            st.dataframe(ndf)
        else:
            st.markdown("Empty Dataframe")

    new_trends = []
    for c in trends:
        new_trends.extend([f"{s}_{c}" for s in selected])

    subplots=sidebar.checkbox("Show Subplots", True)
    if len(trends)>0:
        st.markdown("### Trend of Selected Locations")

        fig=ndf.iplot(kind="line", asFigure=True, xTitle="Date", yTitle="Values",
                            x="date", y=new_trends, title=f"{trend_level} Trend of {', '.join(trends)}.", subplots=subplots)
        st.plotly_chart(fig, use_container_width=False)

Output

Single

Multiple

Connecting MySQL Server in Windows Machine From WSL

Viper — Wed, 16 Feb 2022 09:27:39 +0000

Connecting MySQL Server in Windows Machine from WSL

Originally published at dataqoil.com.

What does this mean? Put simply: how do we connect from WSL to a MySQL server that is hosted on Windows? It might sound easy, but let me tell you, it is not.

I was trying to connect from WSL to a local MySQL server installed on my Windows machine, because my Airflow installation lived in WSL. It took me a long time to figure out a reliable way to do it, and I hope this helps you too.

MySQL Client in WSL

First, install a MySQL client in WSL. If no client is installed, typing mysql in the WSL terminal suggests the commands below.



sudo apt install mysql-client-core-8.0     # version 8.0.27-0ubuntu0.20.04.1, or
sudo apt install mariadb-client-core-10.3  # version 1:10.3.31-0ubuntu0.20.04.1

I used the first one.

Find IPv4 Address of WSL

Go to Settings -> Network and Internet -> Status -> View hardware and connection properties. Look for the adapter named vEthernet (WSL); it is usually near the bottom, and its IPv4 address is the one we need.
Mine looks like the following, with the addresses shaded.
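As a cross-check from inside WSL, here is a minimal sketch. It assumes WSL 2 with its default NAT networking, where the nameserver entry in /etc/resolv.conf usually points at the Windows host; that assumption does not hold for every WSL configuration (mirrored networking, for example). The helper name find_nameserver is made up for this illustration.

```python
from pathlib import Path


def find_nameserver(resolv_conf_text):
    """Return the first nameserver address in resolv.conf-style text, or None."""
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            return parts[1]
    return None


# Assumption: /etc/resolv.conf exists and was generated by WSL.
resolv_conf = Path("/etc/resolv.conf")
if resolv_conf.exists():
    print("Candidate Windows host address:", find_nameserver(resolv_conf.read_text()))
```

If the printed address matches the vEthernet (WSL) IPv4 address from the Settings page, that is the value to pass to mysql with -h.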

Now try to connect to MySQL from WSL using below command:



mysql -u wsl_root -p -h 172.24.xxx.xxx

Remember that xxx in the command above is just a placeholder, and wsl_root is the username we are trying to log in with. This command fails with an error for now, and the next step fixes it.

Making New User in MySQL to make a Call from WSL



CREATE USER 'wsl_root'@'localhost' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON *.* TO 'wsl_root'@'localhost' WITH GRANT OPTION;
CREATE USER 'wsl_root'@'%' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON *.* TO 'wsl_root'@'%' WITH GRANT OPTION;
FLUSH PRIVILEGES;

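In the queries above,

wsl_root is the name of the user that we will use from WSL, the part after @ is the host the user may connect from (localhost for the Windows machine itself, % for any host, which covers WSL), and password is the password. :)
We have granted all privileges to that user, so it is effectively another admin account.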

From WSL

Now run mysql -u wsl_root -p -h 172.24.xxx.xxx again and enter the password when asked. This time we can connect to the MySQL server.

References

https://stackoverflow.com/questions/1559955/host-xxx-xx-xxx-xxx-is-not-allowed-to-connect-to-this-mysql-server

Walk through of Statistical Data Analysis in Python

Viper — Tue, 08 Feb 2022 04:22:42 +0000

What is Statistical Analysis in Data Science?

Originally published at dataqoil.com.

This blog starts with definitions and then explains and experiments with different parts of statistics.

Introduction

Statistics is a very important field, and there are lots of definitions and use cases available. Here are a few.

Statistics is:

a systematic collection of data on measurements or observations, often related to demographic information such as population counts, incomes, population counts at different ages, etc. From Wikipedia.
the science concerned with developing and studying methods for collecting, analyzing, interpreting and presenting empirical data. From UCI.

Data Science is not a new term this decade, but it was rarely heard before the 90s. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data. From Wikipedia.

If we look carefully at the definition of Data Science above, we can see that statistics is involved in processing the data, extracting knowledge from it and finding actionable insights. But when it comes to insights, there are many approaches, and the insights are not predefined. EDA (Exploratory Data Analysis) is one of the earliest steps in finding insights, and in that phase one might try to answer the questions below:

What is the format of the data? Is it structured or unstructured?
What are the types (categorical, boolean, integer, fractional, text and so on) of the columns in the data?
How much data is available (non-missing) in each column?
What is the distribution of each column?
What is the summary (minimum, maximum, mean, spread, kurtosis and so on) of the data?
Are there any outliers in the data?
Is there some pattern in the data?
Is there any relationship between columns?
Do all the values in a single column behave the same way when sampled?

The list goes on but the theory and the concepts on the base level are all the same. All of the questions above are answered by the Statistical techniques and in general, one could divide them into two categories:

Descriptive Statistics
Inferential Statistics

Our main goal in this blog article is to take a small dive into these two categories to answer some of the EDA questions. Analysis that is done based on statistics is statistical analysis.

Definitions

Descriptive Statistics is all about describing the data in terms of numbers, charts, graphs or plots. In descriptive statistics, our focus is on summaries of the data such as the mean, spread, quartiles, percentiles and so on.

In Inferential Statistics, by contrast, we take a step forward from the descriptive information we have and try to make inferences or predictions. In general, we try to estimate something or test a hypothesis about the population by taking a sample from it. In inferential statistics, our focus is on drawing conclusions.

Data We Are Using

We are using one of the most popular datasets in the machine learning world, the Titanic dataset.

import pandas as pd
import numpy as np
from plotly.offline import init_notebook_mode, iplot
import plotly.figure_factory as ff
import cufflinks
import plotly.io as pio 
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
pio.renderers.default = "colab" # should change by looking into pio.renderers

pd.options.display.max_columns = None

df=pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")

We are using:

pandas for data analysis; where would we be without it?
numpy for possible array operations.
plotly for plotting nice interactive plots.
cufflinks for making a connection between pandas and plotly.
Finally, we set some configurations.

Descriptive Statistical Analysis

If we want to go further into descriptive analysis, we measure:

Central tendency, which focuses on the average.
Variability (measure of dispersion), which focuses on how far the data is spread.

Along with these two, we often look at the frequency distribution and the quartiles/percentiles of the data.

Measuring Frequency Distribution

The frequency distribution is simply the number of occurrences of each value, and it can be as simple as counting the number of people of each age. We could choose any column, but I found this one easy to explain.

Frequency Distribution of Numerical Data

Let's see the distribution of the age of the passengers.

df.Age.iplot(kind="hist")

In the plot above, if we hover over a bar, we can see the number of people of a certain age. Plotly groups the ages into ranges (bins) rather than showing each age separately, which is not a big deal here.

Frequency Distribution of Categorical Data

Let's see the frequency distribution of gender.

df.Sex.iplot(kind="hist")

As expected, data is shown in categorical counts.

Central Tendency

The most common thing we do in this part is find the summary statistics of the data, which includes calculating the mean, median, mode, range and other percentile values.

Mean

What is the average age of the passengers?

We calculate the mean using,

\[\mu = \frac{\sum_{i=1}^{N} x_i}{N}\]

df.Age.mean()


29.471443066516347

The mean age is about 29.5 years.

Median

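What is the middle age of the passengers?

We calculate the median by sorting the values and, for an odd number of items N, taking

\[\text{median} = \left(\frac{N+1}{2}\right)^{th} \text{item}\]

For an even N, the median is the average of the two middle items.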

df.Age.median()


28.0

Median age is 28 years.

Mode

What is the most repeated age?

It is simply the most repeated value.

df.Age.mode()


0 22.0
dtype: float64

Most repeated age seems to be 22.
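To see how the three measures differ on the same data, here is a small sketch using Python's built-in statistics module. The ages list is made up for illustration and is not taken from the Titanic data.

```python
import statistics

# Hypothetical ages, not taken from the Titanic data.
ages = [22, 38, 26, 35, 22, 54, 2]

mean_age = statistics.mean(ages)      # sum of the values divided by their count
median_age = statistics.median(ages)  # middle value of the sorted list
mode_age = statistics.mode(ages)      # most frequent value

print(f"mean={mean_age:.2f}, median={median_age}, mode={mode_age}")
```

The mean is pulled around by extreme values such as 2 and 54, while the median and mode are not, which is why it helps to look at all three.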

Dispersion or Measure of Variability

The dispersion or variability, as the term suggests, is a measure of how spread out the data is. While measuring the spread of the data, one might try to find out:

Range of the data (Min, Max)
Standard Deviation
Variance

Let's see them in action.

Range of the data

Range means the minimum and maximum of the values.

df.Age.min(), df.Age.max()


(0.42, 80.0)

It seems that the values in the Age field are spread from 0.4 years to 80 years. But this does not give us a clear picture of how far the data deviates from the mean. So let's find that out with the variance and the standard deviation.

Variance

Variance is very important and widely used while measuring the volatility and spread of the data.

The population variance is denoted by the Greek letter sigma squared and calculated as,

\[\sigma^2 = \frac{\displaystyle\sum_{i=1}^{N}(X_i - \mu)^2} {N}\]

But for the sample variance,

\[s^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2} {n-1}\]

Where,

s^2 is the sample variance,
n is the size of the sample,
x_i is the ith element of the sample,
x bar is the sample mean

df.Age.var()


199.428297012274

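The variance of Age is about 199.4 squared years. On its own this number is hard to interpret because its unit is squared, which is why the standard deviation is usually reported alongside it.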

Standard Deviation

A standard deviation is a single number that represents how far the data typically lies from the mean. A low standard deviation means that the data are close to the mean, and a high standard deviation means the data are far from the mean and more spread out.

It is the square root of the variance.

For the population standard deviation,

\[\sigma = \sqrt{\frac{\displaystyle\sum_{i=1}^{N}(X_i - \mu)^2} {N}}\]

df.Age.std()


14.121908405462555

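Our standard deviation is about 14.1 years, which is large relative to a mean age of about 29.5 years. This agrees with the wide spread we saw in the frequency distribution and in the range. Note that pandas divides by n-1 by default in .var() and .std(), so the values above are the sample variance and the sample standard deviation.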
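The difference between dividing by N and dividing by n-1 is easy to check on a tiny example. This is a sketch with made-up numbers, using only Python's statistics module (pvariance and pstdev divide by N, variance and stdev divide by n-1).

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

pop_var = statistics.pvariance(values)     # divides by N
sample_var = statistics.variance(values)   # divides by n - 1
pop_std = statistics.pstdev(values)        # square root of the population variance
sample_std = statistics.stdev(values)      # square root of the sample variance

print(pop_var, sample_var, pop_std, sample_std)
```

The sample versions are always a little larger than the population versions, and the gap shrinks as the number of values grows.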

Percentiles

I find percentiles to be one of the simplest ways to look at the outliers and spread of the data. Let's find the summary of our Age data.

Using pandas, we do not have to calculate all these values manually; we can get them by simply calling .describe() on a series or dataframe.

df.Age.describe()


count 887.000000
mean 29.471443
std 14.121908
min 0.420000
25% 20.250000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64

Looking at the output above, we can say that the middle half of the data lies between ages 20.25 and 38, and this can be viewed well with a box plot.

df.Age.iplot(kind="box")

The box plot gives us an idea of how many outliers the data contains. The bars in the plot represent max, q3, median, q1 and min from top to bottom. Another way of looking at this density is via a density plot.

df.Age.plot(kind="kde")
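To make the link between percentiles and outliers concrete, here is a sketch of the common 1.5 x IQR rule that box plots typically use for their whiskers. The ages are hypothetical, and the rule itself is a convention rather than something the plots above guarantee.

```python
import statistics

ages = [1, 18, 20, 22, 25, 28, 30, 35, 38, 40, 80]  # hypothetical ages

# method="inclusive" interpolates linearly between the sorted data points
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower or a > upper]

print(f"q1={q1}, median={q2}, q3={q3}, IQR={iqr}, outliers={outliers}")
```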

Univariate vs Multivariate Analysis

Until now we have done univariate analysis, but it does not end here. We often combine various columns of the data and view them together to find relationships between them. Once we have done univariate analysis and the required data processing, we can do multivariate analysis. One example is getting the summary of age according to gender.

df.groupby("Sex").Age.describe()

Sex	count	mean	std	min	25%	50%	75%	max
female	314.0	27.719745	13.834740	0.75	18.0	27.0	36.0	63.0
male	573.0	30.431361	14.197273	0.42	21.0	28.0	38.0	80.0

In the output above, we can see that the number of female passengers is lower than that of male passengers, that the mean age of female passengers is lower, and so on.

Correlation and Covariance

Both terms are used to find the relationship between two fields, and the major differences between the two are:

Covariance is a measure of how two random variables change together, whereas correlation is a measure of how strongly two random variables are related to each other.
Covariance depends on the units of the variables, whereas correlation is a scaled (unit-free) form of covariance.
Covariance can vary between -∞ and +∞, whereas correlation ranges between -1 and +1.

Covariance

\[\text{cov}(X,Y) = \frac{\sum_{i=1}^{N} (X_i-\mu_x)(Y_i-\mu_y)}{N}\]

Correlation

\[\rho(X,Y) = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}\]

where $\sigma_{XY}$ is the covariance of X and Y.

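# let's give a numerical value to gender, say male as 1 and female as 0
df["gender"] = 0
df.loc[df.Sex=='male', "gender"]=1
# keep only numeric columns, since text columns such as Name cannot be correlated
df.select_dtypes(include="number").corr().iplot(kind="heatmap")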

Looking at the heatmap above, we can see:

a positive correlation between Age and Fare,
a positive correlation between Fare and Siblings/Spouses Aboard,
a positive correlation between Fare and Parents/Children Aboard,
and some relationship between gender and Survived, but we cannot claim it from this coefficient alone because gender is a binary variable.
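The claim that correlation is a scaled form of covariance can be checked by hand. Below is a plain-Python sketch of the two formulas; the helper names and the numbers are made up for illustration.

```python
import math


def covariance(xs, ys):
    """Population covariance: the mean of the products of deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


x = [1, 2, 3, 4, 5]
y_small = [2, 4, 6, 8, 10]
y_large = [200, 400, 600, 800, 1000]  # same relationship, different unit

print(covariance(x, y_small), covariance(x, y_large))
print(correlation(x, y_small), correlation(x, y_large))
```

Rescaling y multiplies the covariance by the same factor, while the correlation stays the same, which is why correlation is easier to compare across columns.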

Inferential Statistical Analysis

While Descriptive Analysis was more about EDA, Inferential Analysis takes a step further and makes predictions, assumptions and conclusions from the data. This is a much larger area than descriptive analysis, and it really requires another blog for a proper explanation, but I am trying to include as much as possible.

In inferential analysis there are mainly two things we do:

Making inferences or predictions about the population. Example: the average age of the passengers is 29 years.
Making and testing hypotheses about the population. Example: whether the survival rate of one gender differs from another's.

Sampling

One popular thing we do in inferential statistics is sampling, which is done when the population is large. The assumption is that if the distribution of the sample is close to that of the population, then we can make assumptions about the population based on tests on the sample. A similar thing is often done in the machine learning world. For example, we first train a model on a training set and validate it on a validation set. Then we assume that the model was trained and tested on data that resembles the real data it will make predictions on.

While working with a sample, two terms are used to represent sample and population metrics:

Statistic : It is a measure or metric of a sample, e.g. sample mean age.
Parameter : It is a measure or metric of a population, e.g. population mean age.

Problems with Sampling

To sample simply means to draw a subset of the data from the population, and its size should always be smaller than that of the population. One major problem with sampling is that the mean and variance of the sample might not resemble those of the population. This is often called sampling error.
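A quick simulation shows this error. The sketch below uses a synthetic population (normally distributed "ages" with made-up parameters) rather than the Titanic data, and a fixed seed so that it is repeatable.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic population; the mean of 30 and spread of 14 are illustrative only.
population = [random.gauss(30, 14) for _ in range(10_000)]
sample = random.sample(population, 100)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
sampling_error = sample_mean - pop_mean

print(f"population mean={pop_mean:.2f}, sample mean={sample_mean:.2f}, error={sampling_error:.2f}")
```

Drawing a different sample gives a different error, and larger samples tend to give smaller errors.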

Estimation

While working with predictions and hypotheses in inferential analysis, we often have to deal with two types of estimates:

Point Estimation : It is a single-value estimate, for example using the sample mean age as the estimate of the population mean age.
Interval Estimation : This estimate gives a range of values rather than a single value. For example, confidence intervals are used with tests like Chi Square, t-test etc. In the Titanic example above, we could make an estimate like: the difference between the sample mean age and the population mean age is not more than 5%.

Test

There are lots of tests based upon the nature of the estimation, calculation and prediction, but all of them can be divided into 3 categories:

Comparison Test
Correlation Test
Regression Test

Based on parameters, we can also categorize tests into two groups:

Parametric Test : Parametric tests are those that assume the data follows a distribution described by parameters such as the mean and variance. One example is the t-test.
Non Parametric Test : These tests are non-parametric because they do not assume such a parameterized distribution. One example is the Mann-Whitney U test.

Based on the level of measurement (Nominal, Ordinal, Interval and Ratio) of the data, we can choose the best test for our data.

Terms Widely Used in Testing

Confidence Interval : A confidence interval is a range of values, computed from the sample, that leaves some room for error in an estimate, and it is often used with tests. For example, suppose our hypothesis is that the mean lies within the range 25 to 35, the sample mean was calculated to be 28 and the population mean is 30. The sample mean differs from the population mean, but both lie inside the interval, so we would still accept the sample estimate.
Confidence Level : It sounds similar to confidence interval, but it is not the same thing, although the two terms are related. The confidence level tells us how probable it is that the interval built from the sample contains the estimated parameter. For example, if we set the significance level to 5% (a confidence level of 95%), we are claiming that out of 100 repeated tests, at most about 5 will produce a wrong conclusion. In other words, in about 95 of 100 tests the estimated value will lie within the confidence interval.
Hypothesis : As the term suggests, a hypothesis is something that we assume to be true. In hypothesis testing, we set one or more hypotheses against the default, or null, hypothesis. Those set against the default are known as alternative hypotheses.
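As a rough illustration of how the interval and the level fit together, here is a sketch of a confidence interval for a mean. It uses the normal approximation (a z value of about 1.96 for 95%), which is an assumption that works best for larger samples; the function name and the ages are made up.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # z is about 1.96 for a 95% confidence level
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error


ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 31]  # hypothetical
low, high = mean_confidence_interval(ages)
low99, high99 = mean_confidence_interval(ages, confidence=0.99)

print(f"95% interval: ({low:.2f}, {high:.2f})")
print(f"99% interval: ({low99:.2f}, {high99:.2f})")
```

Asking for a higher confidence level widens the interval: we become more confident only by claiming less precision.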

Comparison Test

A comparison test compares parameters like the mean, variance and median.

Test	Parametric	Comparison With	No. Samples
t-test	Yes	Mean, Variance	2
ANOVA	Yes	Variance, Mean	3+
Mann-Whitney U (Wilcoxon Rank Sum)	No	Sum of rankings	2
Wilcoxon Signed Rank	No	Distributions	2
Kruskal-Wallis H	No	Mean Rankings	3+
Mood’s Median	No	Medians	2+

Student’s t-test

This test is done to determine whether there is a significant difference between the means of two groups, for example between two samples or between a sample and the population. There are different variations of the t-test.

“The t-test is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis.”

Wikipedia

It is called Student's t-test because William Sealy Gosset, who developed it, published it under the pseudonym "Student".

In this test, the means from two samples are checked to see if they are significantly different from each other. The difference is judged against the standard error of the means. The t statistic calculated by the test is compared with a critical value from the t-distribution. The critical value is calculated from the degrees of freedom and the significance level (generally 5%) using the percent point function (PPF).

The t-test is useful when we need to know whether two statistics differ, which is a two-tailed test. If we only care whether one statistic is greater (or smaller) than the other, we can do a right-tailed (or left-tailed) test instead. In a comparison test, we have

Null Hypothesis: Two statistics are equal.
Alternative Hypothesis: Two statistics are not equal.

We can test the statistics in two ways: using the t-statistic and the critical value, or using alpha (the significance level) and the p-value (p).

If abs(t-statistic) <= critical value:
- Unable to reject null hypothesis.
Else:
- Reject null hypothesis.

If p > alpha:
- Unable to reject null hypothesis.
Else:
- Reject null hypothesis.

There are two main versions of Student’s t-test:

Independent Samples : The case where the two samples are unrelated.
Dependent Samples : The case where the samples are related, such as repeated measures on the same population. Also called a paired test.

To keep this blog short, we will do only one example, Independent Samples. We will take two samples from the data above and compare whether the average age of each sample differs.

\[t = \frac{\bar{x}_1-\bar{x}_2}{s_p \sqrt{\frac{2}{n}}}\]

where $s_p$ is the pooled standard deviation of the two samples, each of size $n$,

\[s_p = \sqrt{\frac{s^2_1+s^2_2}{2}}\]

and $s^2_1$ and $s^2_2$ are the sample variances.

But we won't do any of these calculations from scratch; we will use SciPy.

from scipy.stats import ttest_ind,t

# take sample of Age
sample1 = df.Age.sample(500)
sample2 = df.Age.sample(500)

print(f"Means of each sample is: {sample1.mean(), sample2.mean()}")
alpha = 0.05

stat, p = ttest_ind(sample1, sample2)

# degrees of freedom
dof = len(sample1) + len(sample2) - 2
# critical value for a two-tailed test: alpha is split across both tails
cv = t.ppf(1.0 - alpha / 2, dof)

print('t=%.3f, p=%.3f' % (stat, p))

print(f"Comparing statistics with alpha={alpha}, Critical Value={cv:.3f}, p={p}, t-stat={stat}.")

print("\nUsing t-stat.")
if abs(stat) <= cv:
    print('Unable to reject the null hypothesis.')
else:
    print('Reject the null hypothesis that the means are equal.')

print("\nUsing p value.")
if p > alpha:
    print('Unable to reject the null hypothesis.')
else:
    print('Reject the null hypothesis that the means are equal.')


Means of each sample is: (29.6095, 29.06434)
t=0.614, p=0.540
Comparing statistics with alpha=0.05, Critical Value=1.962, p=0.5396670298167929, t-stat=0.6135280743226035.

Using t-stat.
Unable to reject the null hypothesis.

Using p value.
Unable to reject the null hypothesis.

The above is just a simple example. Note that being unable to reject the null hypothesis does not mean that we accept it.
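For readers who want to see the arithmetic behind the statistic, here is a from-scratch sketch of the equal-sample-size t statistic, t = (mean1 - mean2) / (s_p * sqrt(2/n)), where s_p is the pooled standard deviation. The function name and the two groups are made up for illustration; in practice SciPy's ttest_ind does this for us.

```python
import math
import statistics


def t_statistic_equal_n(sample1, sample2):
    """Independent two-sample t statistic for samples of equal size."""
    n = len(sample1)
    if n != len(sample2):
        raise ValueError("this sketch assumes equal sample sizes")
    # pooled standard deviation: root of the average of the two sample variances
    pooled_sd = math.sqrt((statistics.variance(sample1) + statistics.variance(sample2)) / 2)
    return (statistics.mean(sample1) - statistics.mean(sample2)) / (pooled_sd * math.sqrt(2 / n))


group_a = [20, 22, 24, 26, 28]  # hypothetical ages
group_b = [23, 25, 27, 29, 31]
t_stat = t_statistic_equal_n(group_a, group_b)
degrees_of_freedom = len(group_a) + len(group_b) - 2

print(f"t={t_stat:.3f}, dof={degrees_of_freedom}")
```

The sign of t only tells us which group has the larger mean; for a two-tailed test we compare its absolute value with the critical value.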

ANOVA Test

ANOVA means Analysis of Variance. This test is used when we have to compare the means of three or more samples. If we have only two samples, we use the t-test.

Let's compare the average age of each Pclass and check whether the average age differs by passenger class.

import statsmodels.api as sm
from statsmodels.formula.api import ols

tdf = df.groupby("Pclass").Age.mean().rename("mage").reset_index()

model = ols('mage' + '~' + "Pclass", data = tdf).fit() # Ordinary least squares method
result_anova = sm.stats.anova_lm(model) # ANOVA Test
print(result_anova)


           df     sum_sq    mean_sq          F    PR(>F)
Pclass    1.0  92.483183  92.483183  30.859642  0.113386
Residual  1.0   2.996897   2.996897        NaN       NaN

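Looking at the p-value in the first row of the result above (0.113, which is greater than 0.05), we are unable to reject the null hypothesis. Keep in mind that this model was fitted on only the three class means, which leaves a single residual degree of freedom, so the test has very little power.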
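A one-way ANOVA is more commonly run on the individual observations of each group rather than on the group means. The sketch below computes the F statistic by hand for three small made-up groups; the function name and the numbers are hypothetical, and this is not the computation used above.

```python
import statistics


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    # variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # variation of the observations around their own group mean
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


first = [38, 40, 42]   # hypothetical ages in three classes
second = [28, 30, 32]
third = [24, 25, 26]
f_value, df_between, df_within = one_way_anova_f([first, second, third])

print(f"F={f_value:.3f}, df=({df_between}, {df_within})")
```

A large F means the group means differ by much more than the spread inside each group would explain.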

Correlation Tests

Correlation tests are done to calculate the strength of the association between data.

Test	Parametric	Data Type
Pearson’s r	Yes	Interval/Ratio
Spearman’s r	No	Ordinal/Interval/Ratio
Chi Square Test of Independence	No	Nominal/Ordinal

Pearson’s r is statistically more powerful than Spearman’s r, but Spearman’s r is also appropriate for ordinal data, not just interval and ratio data.

Of these, the Chi Square Test of Independence is the only test that can be used with nominal variables.

Pearson’s r Test

The coefficient returns a value between -1 and 1, ranging from a full negative correlation to a full positive correlation. A value of 0 means no correlation. The value must be interpreted: often a value below -0.5 or above 0.5 indicates a notable correlation, and values between those suggest a less notable correlation.

The formula (from Wikipedia) is:

\[\rho_{X,Y} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}\]

Let's check whether there is any relationship between the fare and the age of a passenger.

df.Fare.corr(df.Age)


0.11232863699941618

The value of about 0.11 indicates a weak positive correlation.

Spearman’s Correlation: Non-Linear Relationship between two variables.

Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables. In this case Spearman’s correlation is used.

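Pearson correlation measures linear association and, for significance testing, assumes the data is normally distributed. Spearman's correlation works on the ranks of the values, so it makes no assumption about the distribution of the data and captures any monotonic relationship. That is the main difference between the two.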

From Wikipedia

df.Fare.corr(df.Age, method="spearman")


0.15606180426955454
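One way to see the difference is to compute Spearman's correlation as Pearson's correlation on ranks. This is a plain-Python sketch with made-up data; the helper names are hypothetical and the ranking function assumes there are no tied values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def ranks(values):
    """Ranks from 1 to n; this sketch assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic but not linear

print(f"pearson={pearson(x, y):.3f}, spearman={spearman(x, y):.3f}")
```

Because y always increases with x, the ranks agree perfectly and Spearman's value is at its maximum, while Pearson's value is lower because the relationship is not a straight line.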

Chi-Square Test: Does Survived Depend on Gender?

When to use Chi Square?

The Chi-square test is a non-parametric statistic, also called a distribution free test. Non-parametric tests should be used when any one of the following conditions pertains to the data:

The level of measurement of all the variables is nominal or ordinal.
The sample sizes of the study groups are unequal; for the χ2 the groups may be of equal size or unequal size whereas some parametric tests require groups of equal or approximately equal size.
The original data were measured at an interval or ratio level, but violate one of the following assumptions of a parametric test:
- The distribution of the data was seriously skewed or kurtotic (parametric tests assume approximately normal distribution of the dependent variable), and thus the researcher must use a distribution free statistic rather than a parametric statistic.
- The data violate the assumptions of equal variance or homoscedasticity.
- For any of a number of reasons (1), the continuous data were collapsed into a small number of categories, and thus the data are no longer interval or ratio.

Note:

Null Hypothesis(H0): Two variables are not dependent. (no association between the two variables)
Alternate Hypothesis(H1): There is relationship between variables.
If Statistic >= Critical Value: significant result, reject null hypothesis (H0), dependent.
If Statistic < Critical Value: not significant result, fail to reject null hypothesis (H0), independent.

In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows:

If p-value <= alpha: significant result, reject null hypothesis (H0), dependent.
If p-value > alpha: not significant result, fail to reject null hypothesis (H0), independent.

The Chi-Square test works in our case by arranging our data into a table of categorical counts (a contingency table) like the one below.

Gender/Survived	1	0
1	a	b
0	c	d

# make a contingency table
cdf = pd.crosstab(df['Sex'],
                            df['Survived'],
                           margins=True, margins_name="Total")
cdf

Sex / Survived	0	1	Total
female	81	233	314
male	464	109	573
Total	545	342	887

from scipy import stats

# Calculation of the chi-square test statistic
chi_square = 0
rows = df['Sex'].unique()
columns = df['Survived'].unique()
for i in columns:
    for j in rows:
        O = cdf[i][j]  # observed count
        E = cdf[i]['Total'] * cdf['Total'][j] / cdf['Total']['Total']  # expected count under independence
        chi_square += (O-E)**2/E

print("Approach 1: The p-value approach to hypothesis testing in the decision rule")
# the statistic follows a chi-square distribution with (rows-1)*(columns-1) degrees of freedom
p_value = 1 - stats.chi2.cdf(chi_square, (len(rows)-1)*(len(columns)-1))
conclusion = "Failed to reject the null hypothesis."
if p_value <= alpha:
    conclusion = "Null Hypothesis is rejected."

print("chisquare-score is:", chi_square, " and p value is:", p_value)
print(conclusion)


Approach 1: The p-value approach to hypothesis testing in the decision rule
chisquare-score is: 260.71530379938315 and p value is: 0.0
Null Hypothesis is rejected.

Since we were able to reject the null hypothesis, we can conclude that there is some relationship between Sex and the survival of a person.
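The same statistic can be reproduced without pandas from the four counts in the contingency table above. This is a plain-Python sketch; for 1 degree of freedom the chi-square tail probability has a closed form through the complementary error function, which is the shortcut used here.

```python
import math

# Observed counts from the contingency table above: [Survived=0, Survived=1]
observed = {"female": [81, 233], "male": [464, 109]}

row_totals = {sex: sum(counts) for sex, counts in observed.items()}
col_totals = [sum(counts[k] for counts in observed.values()) for k in range(2)]
grand_total = sum(row_totals.values())

chi_square = 0.0
for sex, counts in observed.items():
    for k, observed_count in enumerate(counts):
        # expected count if Sex and Survived were independent
        expected = row_totals[sex] * col_totals[k] / grand_total
        chi_square += (observed_count - expected) ** 2 / expected

# Tail probability of a chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square={chi_square:.3f}, p-value={p_value:.3g}")
```

The p-value is so small that subtracting the cumulative probability from 1 rounds it to 0.0, which is why the output above prints exactly zero.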

Readings

Chi Square Test With Python

McNemar’s Test: Test if sending SMS had any effect (before vs after).

This test works for paired binary data, which includes cases like pass/fail and, in our case, converted/not-converted.

We will assume that the cross-selling technique increases the rate of conversion, but we have to prove it statistically.
We will measure the rate of conversion before and after the implementation of our campaign. If a person took any service, their conversion value is 1, else 0.
The contingency table will look something like the one below.

After/Before SMS	converted	not-converted
converted	a	b
not-converted	c	d

Assumptions

H0: There is no difference in conversion. i.e. sending sms had no effect on conversion.
H1: There is significant difference in conversion. i.e. sending sms had significant effect on conversion.
p > alpha: fail to reject H0, no difference in the disagreement (e.g. sending sms had no effect).
p <= alpha: reject H0, significant difference in the disagreement (e.g. sending sms had an effect).
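A minimal sketch of the test, assuming the common chi-square form of McNemar's statistic without continuity correction, (b - c)^2 / (b + c), where b and c are the two cells of the table in which the before and after outcomes disagree. The counts are made up for illustration.

```python
import math


def mcnemar_test(b, c):
    """McNemar chi-square statistic (no continuity correction) and its p-value.

    b and c are the two discordant cells of the 2x2 before/after table.
    """
    statistic = (b - c) ** 2 / (b + c)
    # chi-square tail probability with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: 25 people changed in one direction, 10 in the other.
statistic, p_value = mcnemar_test(b=25, c=10)
alpha = 0.05

print(f"statistic={statistic:.3f}, p={p_value:.4f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

Only the people whose outcome changed enter the statistic; those who converted both times or neither time carry no information about the effect of the SMS.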


Regression Tests

Regression tests are done when we try to estimate some parameter. If we have one dependent and one independent variable, we use simple linear regression, like $y=mx+c$. If we have multiple independent variables, it is multi linear regression. Besides linear regression, there is logistic regression, which tries to classify between two classes.

The regression test examines whether a change in the independent variable has any effect on the dependent variable.

Test	Predictor	Outcome
Simple Linear	1 interval/ratio	1 interval/ratio
Multi Linear	2+ interval/ratio	1 interval/ratio
Logistic regression	1+	1 binary
Nominal regression	1+	1 nominal
Ordinal regression	1+	1 ordinal
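For the simple linear case, the slope and intercept of y = mx + c have closed-form least-squares estimates. Here is a plain-Python sketch with made-up points; the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # hypothetical points lying exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)

print(f"m={slope}, c={intercept}")
```

A slope close to zero would suggest that changing the predictor has little effect on the outcome.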

Problems in Tests

Errors

A Type I error, also called a false positive, is when we reject a null hypothesis that is actually true; that is, we consider an effect significant when it was actually due to chance.
A Type II error, also called a false negative, is when we fail to reject a null hypothesis that is actually false; that is, we attribute an effect to chance when it was actually real.

We can decrease the chance of a false positive by decreasing the significance threshold (alpha). For example, if the threshold is 1%, there is only a 1% chance of a false positive.

But there is a price to pay: decreasing the threshold raises the standard of evidence, which increases the chance of missing a real effect. In general there is a tradeoff between Type I and Type II errors. The only way to decrease both at the same time is to increase the sample size (or, in some cases, decrease measurement error).
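The link between the threshold and the false positive rate can be checked with a small simulation. In this sketch the null hypothesis is true by construction (the synthetic data really has mean 0 and a known standard deviation of 1), so every rejection is a Type I error; the number of trials and the sample size are arbitrary choices.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable
alpha = 0.05
z_critical = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed threshold

trials, n = 2000, 30
false_positives = 0
for _ in range(trials):
    # The null hypothesis is true here: the data really has mean 0 and sd 1.
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = statistics.mean(sample) / (1 / math.sqrt(n))
    if abs(z) > z_critical:
        false_positives += 1

false_positive_rate = false_positives / trials
print(f"false positive rate={false_positive_rate:.3f} (alpha={alpha})")
```

The observed rate should land close to alpha, and lowering alpha lowers it accordingly.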

Takeaways From Tests

How to find the relationship between two variables when the dependent variable is in binary format?
- The Chi-Square test will work if we convert our data into categorical format, i.e. treat the binary data as 2 categories, 1 and 0.
How to handle a false test result, i.e. when our statistics fail?
How to make an assumption that one population is better than another population?
- A two-sample t-test can be used. First find whether the means of the two populations are different and, if they are, find the difference.
Which test is best when the number of examples is different for each sample?
- The Chi-Square test should work fine, but the expected count in each cell should be at least 5. Otherwise, Fisher’s Exact Test is an alternative to the chi-square test.


Plotting High Quality Plots in Python with Plotly and Cufflinks

Viper — Sun, 30 Jan 2022 02:57:52 +0000

Originally Published on DataQoil.com.

Interactive Plot

This blog contains static images and does not render interactive plots, so we request you to visit this or this interactive blog.

Introduction

Hello everyone, in this blog we are going to explore some of the most used and simplest plots in data analysis. If you have gotten your hands dirty playing with data, then you have probably come across at least one of these plots. In Python, we have been making these plots with Matplotlib. On top of that, we have tools like Seaborn (built on top of Matplotlib) which give us nice graphs. But those are not interactive plots. Plotly is all about interactivity!

This blog will be updated frequently.

January 28 2022, started blog writing.

Installation

This blog was prepared and run on Google Colab. If you are trying to run the code on a local computer, please install plotly first with pip install plotly. You can visit the official link if you want. Then install cufflinks with pip install cufflinks.

import pandas as pd
import numpy as np
import warnings
from plotly.offline import init_notebook_mode, iplot
import plotly.figure_factory as ff
import cufflinks
import plotly.io as pio 
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
pio.renderers.default = "colab" # should change by looking into pio.renderers

pd.options.display.max_columns = None
# pd.options.display.max_rows = None

pio.renderers

Renderers configuration
-----------------------
    Default renderer: 'colab'
    Available renderers:
        ['plotly_mimetype', 'jupyterlab', 'nteract', 'vscode',
         'notebook', 'notebook_connected', 'kaggle', 'azure', 'colab',
         'cocalc', 'databricks', 'json', 'png', 'jpeg', 'jpg', 'svg',
         'pdf', 'browser', 'firefox', 'chrome', 'chromium', 'iframe',
         'iframe_connected', 'sphinx_gallery', 'sphinx_gallery_png']

If you are running Plotly on Colab, use pio.renderers.default = "colab"; otherwise choose a renderer according to your need.
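
One possible way to pick a renderer automatically is to look at which modules are already loaded. This is only a heuristic sketch: the function pick_renderer is hypothetical, and the module checks are assumptions about how each environment can be recognised. The renderer names themselves come from the list printed above.

```python
import sys

def pick_renderer():
    """Guess a Plotly renderer name from the running environment (heuristic)."""
    if "google.colab" in sys.modules:
        return "colab"                 # running inside Google Colab
    if "ipykernel" in sys.modules:
        return "notebook_connected"    # some other Jupyter-style kernel
    return "browser"                   # plain script: open plots in a browser tab

print(pick_renderer())
# With plotly installed, the result could then be applied as:
# import plotly.io as pio
# pio.renderers.default = pick_renderer()
```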

Get Dataset

For the purpose of visualization, we are going to look into the COVID-19 dataset publicly available on GitHub.

Since the main goal of this blog is to explore visualization, not analysis, we will skip most of the analysis and focus only on the plots.

df = pd.read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")
df["date"] = pd.to_datetime(df.date)
df

Data is not shown here to avoid a huge page.

157476 rows × 67 columns

Check Missing Columns

The first step of any data analysis is checking for missing values in each column.

total = df.isnull().sum().sort_values(ascending = False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending = False)
mdf = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
mdf = mdf.reset_index()
mdf

	index	Total	Percent
0	weekly_icu_admissions	153085	0.972116
1	weekly_icu_admissions_per_million	153085	0.972116
2	excess_mortality_cumulative_per_million	152056	0.965582
3	excess_mortality	152056	0.965582
4	excess_mortality_cumulative_absolute	152056	0.965582
...	...	...	...
62	total_cases	2850	0.018098
63	population	1037	0.006585
64	date	0	0.000000
65	location	0	0.000000
66	iso_code	0	0.000000

67 rows × 3 columns

It seems that we have lots of missing data: some columns are missing about 97% of their values.
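
The same summary can be written a little more compactly, since the mean of a boolean mask is the fraction of True values. The toy frame below is invented for illustration and stands in for the real dataset.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the COVID dataframe.
toy = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "new_cases": [1.0, np.nan, 3.0, 4.0],
    "icu": [np.nan, np.nan, np.nan, 7.0],
})

summary = pd.DataFrame({
    "Total": toy.isnull().sum(),      # number of missing values per column
    "Percent": toy.isnull().mean(),   # fraction of rows that are missing
}).sort_values("Total", ascending=False)
print(summary)
```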

Pie Chart

Missing Values Columns

How about plotting the counts of missing values per column in a pie chart?

To make it faster, we will use only the columns that are missing more than 100000 values.

mdf.query("Total>100000").iplot(kind='pie',labels = "index", 
                                values="Total", textinfo="percent+label",
                                title='Top Columns with Missing Values', hole = 0.5)

The above plot looks a little cluttered, and we could clean it up by not providing textinfo.

mdf.query("Total>100000").iplot(kind='pie',labels = "index", 
                                values="Total",
                                title='Top Columns with Missing Values', hole = 0.5)

Line Chart

New Cases Per day

The location field of our data contains country names, continent names, income groups and the world, so we will skip those aggregate locations first. Then we will calculate the aggregated value for each day by grouping on the date.

Let's first plot a simple line chart with only new cases. We could always plot more lines within it.

todf = df[~df.location.isin(["Lower middle income", "North America", "World", "Asia", "Europe", 
                           "European Union", "Upper middle income", 
                           "High income", "South America"])]
tdf = todf.groupby("date").aggregate(new_cases=("new_cases", "sum"),
                                   new_deaths = ("new_deaths", "sum"),
                                   new_vaccinations = ("new_vaccinations", "sum"),
                                   new_tests = ("new_tests", "sum")
                                   ).reset_index()

tdf.iplot(kind="line",
          y="new_cases",
          x="date",
          xTitle="Date",
          width=2,
          yTitle="new_cases", 
          title="New Cases from Jan 2020 to Jan 2022")

The above plot looks cool, but now let's plot multiple lines at the same time on the same figure.

tdf.iplot(kind="line",
          y=["new_deaths", "new_vaccinations", "new_tests"],
          x="date",
          xTitle="Date",
          width=2,
          yTitle="Cases", 
          title="Cases from Jan 2020 to Jan 2022")

It does not look that good because new_deaths is not clearly visible. Let's draw them in subplots so that we can see each line distinctly.

tdf.iplot(kind="line",
          y=["new_deaths", "new_vaccinations", "new_tests"],
          x="date",
          xTitle="Date",
          width=2,
          yTitle="Cases", 
          subplots=True,
          title="Cases from Jan 2020 to Jan 2022")

Now it's better.

We could even plot a secondary y variable. Now let's plot new vaccinations and new tests together, each on its own y axis.

tdf.iplot(kind="line",
          y=["new_vaccinations"],
          secondary_y = "new_tests",
          x="date",
          xTitle="Date",
          width=2,
          yTitle="new_vaccinations",
          secondary_y_title="new_tests", 
          title="Cases from Jan 2020 to Jan 2022")

In the above plot, we were able to insert two y axes.
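
The line charts in this section drop aggregate rows by listing location names by hand, which can miss entries such as other continents or income groups. A possible alternative, sketched below on invented data, is to filter on iso_code. The assumption here is that aggregate rows in this dataset carry an "OWID_" prefix in iso_code. A few non-aggregate entries may use such codes as well, so the resulting list is worth inspecting before relying on it.

```python
import pandas as pd

# Invented rows that mimic the shape of the dataset.
toy = pd.DataFrame({
    "iso_code": ["NPL", "OWID_WRL", "USA", "OWID_ASI"],
    "location": ["Nepal", "World", "United States", "Asia"],
    "new_cases": [10, 500, 90, 300],
})

# Assumption: aggregate rows carry an "OWID_" prefix in iso_code.
is_aggregate = toy["iso_code"].str.startswith("OWID_")
countries = toy[~is_aggregate]
print(countries["location"].tolist())
```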

Scatter Plot

New deaths vs New Cases

How about viewing deaths vs cases in a scatter plot?

tdf.iplot(kind="scatter",
              y="new_deaths", x='new_cases',
              mode='markers',
              yTitle="New Deaths", xTitle="New Cases",
              title="New Deaths vs New Cases")

It seems that most of the deaths happened while cases were few.

We could even plot a secondary y. Let's visualize new tests along with them.

tdf.iplot(kind="scatter",
              x="new_deaths", y='new_cases',
              secondary_y="new_tests",
              secondary_y_title="New Tests",
              mode='markers',
              xTitle="New Deaths", yTitle="New Cases",
              title="New Deaths vs New Cases")

We could even use subplots on it.

tdf.iplot(kind="scatter",
              x="new_deaths", y='new_cases',
              secondary_y="new_tests",
              secondary_y_title="New Tests",
              mode='markers',
              subplots=True,
              xTitle="New Deaths", yTitle="New Cases",
              title="New Deaths vs New Cases")

Bar Plot

How about plotting the top 20 countries where the most deaths have occurred?

But first, aggregate the data by taking the maximum of the total deaths column. Thanks to the author of this dataset, we do not have to get our hands dirty much. Then take the top 20 by using nlargest.

tdf = df[~df.location.isin(["Lower middle income", "North America", "World", "Asia", "Europe", 
                           "European Union", "Upper middle income", 
                           "High income", "South America"])].groupby("location").aggregate(total_deaths=("total_deaths", "max"),
                                                                                           total_cases = ("total_cases", "max"),
                                                                                           total_tests = ("total_tests", "max")).reset_index()
topdf = tdf.nlargest(20, "total_deaths")

topdf.iplot(kind="bar", x="location",
                                      y="total_deaths",
                                      theme="polar",
                                      xTitle="Countries", yTitle="Total Deaths", 
                                       title="Top 20 Countries according to total deaths")

It seems awesome. We could play with theme also.

We could even make it horizontal.

topdf.iplot(kind="bar", x="location",
            y="total_deaths",
            theme="polar", orientation='h',
            xTitle="Countries", yTitle="Total Deaths", 
            title="Top 20 Countries according to total deaths")

We could even plot multiple bars at the same time. In seaborn, we could do this by using hue, but here we only have to pass a list of columns in y. Let's plot bars of total deaths, total cases and total tests.

topdf.iplot(kind="bar", x="location",
            y=["total_deaths", "total_cases", "total_tests"],
            theme="polar",
            xTitle="Countries", yTitle="Total Deaths", 
            title="Top 20 Countries according to total deaths")

But total deaths is not clearly visible, so let's try a different bar mode. We could choose one from 'stack', 'group', 'overlay' and 'relative'.

topdf.iplot(kind="bar", x="location",
                        y=["total_deaths", "total_cases", "total_tests"],
                        theme="polar",
                        barmode="overlay",
                        xTitle="Countries", yTitle="Total Deaths", 
                        title="Top 20 Countries according to total deaths")

But it is still not clear. One solution is to plot in subplots.

topdf.iplot(kind="bar", x="location",
                        y=["total_deaths", "total_cases", "total_tests"],
                        theme="polar",
                        barmode="overlay",
                        xTitle="Countries", yTitle="Total Deaths", 
                        subplots=True,
                        title="Top 20 Countries according to total deaths")

Much better.

Histogram Chart

How about viewing the distribution of total tests done?

tdf.iplot(kind="hist",
              bins=50, 
              colors=["red"],
              keys=["total_tests"],
              title="Total tests Histogram")

To see histograms of other columns in the same figure, we will use keys.

tdf.iplot(kind="hist",
              bins=100, 
              colors=["red"],
              keys=["total_tests", "total_cases", "total_deaths"],
              title="Multiple Histogram")

It does not look good as the data is not distributed evenly. Let's visualize it in separate plots.

tdf.iplot(kind="hist",
              subplots=True,
              keys=["total_tests", "total_cases", "total_deaths"],
              title="Multiple Histogram")

Box Plot

How about viewing outliers in data?

tdf.iplot(kind="box",
              keys=["total_tests", "total_cases", "total_deaths"], 
              boxpoints="outliers",
              x="location",
              xTitle="Columns", title="Box Plot Tests, Cases and Deaths")

It is not clearly visible as the data has a lot of outliers and not all columns have similar distributions.

tdf.iplot(kind="box",
              keys=["total_tests", "total_cases", "total_deaths"], 
              boxpoints="outliers",
              x="location",
              subplots=True,
              xTitle="Columns", title="Box Plot Tests, Cases and Deaths")

HeatMaps

How about viewing the correlation between columns? We will not check all 67 columns, but let's test with 3.

df[["new_cases", "new_deaths", "new_tests"]].corr().iplot(kind="heatmap")

Simple yet very informative and interactive, right?

Choropleth on Map

Plotting on a map was once my dream, but now it can be done within a few clicks.

Let's plot a choropleth on the world map for the total deaths as of the latest day.

import plotly.graph_objects as go

ldf = df[~df.location.isin(["Lower middle income", "North America", "World", "Asia", "Europe", 
                           "European Union", "Upper middle income", 
                           "High income", "South America"])].drop_duplicates("location", keep="last") 

fig = go.Figure(data=go.Choropleth(
    locations = ldf['iso_code'],
    z = ldf['total_deaths'],
    text = ldf['location'],
    colorscale = 'Blues',
    autocolorscale=False,
    reversescale=True,
    marker_line_color='darkgray',
    marker_line_width=0.5,
    colorbar_title = 'total_deaths',
))

fig.update_layout(
    title_text='total_deaths vs Country',
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    )
)

fig.show()

The above plot is for the latest date only, but what if we want to view the data for each available date?

Choropleth with Slider

We could add a slider to slide between different dates, but it will be a very resource-hungry plot, so beware of your system. We will plot the total number of cases at the end of the month for each country.

tldf = df[~df.location.isin(["Lower middle income", "North America", "World", "Asia", "Europe", 
                           "European Union", "Upper middle income", 
                           "High income", "South America"])]
tldf = tldf.groupby(["location", "iso_code", pd.Grouper(key="date", freq="1M")]).aggregate(total_cases=("total_cases", "max")).reset_index()
tldf["date"] = tldf["date"].dt.date
tldf

	location	iso_code	date	total_cases
0	Afghanistan	AFG	2020-02-29	5.0
1	Afghanistan	AFG	2020-03-31	166.0
2	Afghanistan	AFG	2020-04-30	1827.0
3	Afghanistan	AFG	2020-05-31	15180.0
4	Afghanistan	AFG	2020-06-30	31445.0
...	...	...	...	...
5101	Zimbabwe	ZWE	2021-09-30	130820.0
5102	Zimbabwe	ZWE	2021-10-31	132977.0
5103	Zimbabwe	ZWE	2021-11-30	134625.0
5104	Zimbabwe	ZWE	2021-12-31	213258.0
5105	Zimbabwe	ZWE	2022-01-31	228943.0

5106 rows × 4 columns
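
The monthly aggregation above relies on total_cases being cumulative, so the maximum within a month stands in for the month-end value. A small sketch of the same idea on invented data is shown below. It uses dt.to_period("M") to label months instead of pd.Grouper, which is a swapped-in technique and gives month labels such as 2021-01 instead of month-end dates.

```python
import pandas as pd

# Invented cumulative counts for two locations.
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B"],
    "date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-03", "2021-01-09"]),
    "total_cases": [5.0, 12.0, 30.0, 7.0],
})

# Label each row with its calendar month, then keep the largest cumulative count.
toy["month"] = toy["date"].dt.to_period("M")
monthly = (toy.groupby(["location", "month"])
              .agg(total_cases=("total_cases", "max"))
              .reset_index())
print(monthly)
```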


first_day = tldf.date.min()

scl = [[0.0, '#ffffff'],[0.2, '#b4a8ce'],[0.4, '#8573a9'],
       [0.6, '#7159a3'],[0.8, '#5732a1'],[1.0, '#2c0579']] # purples

data_slider = []
for date in tldf['date'].unique():
    df_segmented = tldf[tldf['date'] == date].copy()  # copy so the casts below do not touch tldf

    for col in df_segmented.columns:
        df_segmented[col] = df_segmented[col].astype(str)

    data_each_yr = dict(
                        type='choropleth',
                        locations = df_segmented['iso_code'],
                        z=df_segmented["total_cases"].astype(float),
                        colorbar= {'title':'Total Cases'}
                        )

    data_slider.append(data_each_yr)

steps = []
for i,date in enumerate(tldf.date.unique()):
    step = dict(method='restyle',
                args=['visible', [False] * len(data_slider)],
                label='Date {}'.format(date))
    step['args'][1][i] = True
    steps.append(step)

sliders = [dict(active=0, pad={"t": 1}, steps=steps)]

layout = dict(title ='Total Cases at the End of Month Across the World',
              sliders=sliders)

fig = dict(data=data_slider, layout=layout)
iplot(fig)

To explain the above code: we create one data entry for each slider point, and in our case a single slider point is the end of a month.

Loop through the unique dates.
- Mask the data to get the rows of the current date.
- Make a dictionary with the common and essential values required to make a choropleth.
- Give locations as iso_code.
- Give z as total cases.
- Use Total Cases as the color bar title.
- Add this data to the slider list.
For each date step, prepare a label.
Update sliders and layout, then make the figure and plot it using iplot.
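
The step-building part of that explanation can be isolated into a small function. This is a sketch with a hypothetical name, build_steps, and invented date labels; it produces plain dictionaries of the same shape as the steps in the code above and does not draw anything.

```python
def build_steps(labels):
    """One slider step per trace; step i shows trace i and hides the rest."""
    steps = []
    for i, label in enumerate(labels):
        visible = [False] * len(labels)
        visible[i] = True
        steps.append(dict(method="restyle",
                          args=["visible", visible],
                          label=f"Date {label}"))
    return steps

# Hypothetical month-end labels standing in for tldf.date.unique().
steps = build_steps(["2020-02-29", "2020-03-31", "2020-04-30"])
for step in steps:
    print(step["label"], step["args"][1])
```

Each step carries a visibility mask with exactly one True entry, which is what makes the slider switch between months.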

Density Mapbox

Another useful plot is the density mapbox, where we plot a density plot on a map. But we need longitude and latitude for that, and I have already prepared them on GitHub. Please find them at the link below:

State Location Coordinates

country_df = pd.read_csv("https://github.com/q-viper/State-Location-Coordinates/raw/main/world_country.csv")
country_df = country_df[["country", "lon", "lat", "iso_con"]]
tldf["country"] = tldf.location
tldf = tldf.merge(country_df[["country", "lat", "lon"]], on="country")

tldf.head()

	location	iso_code	date	total_cases	country	lat_x	lon_x	lat_y	lon_y
0	Afghanistan	AFG	2020-02-29	5.0	Afghanistan	33.768006	66.238514	33.768006	66.238514
1	Afghanistan	AFG	2020-03-31	166.0	Afghanistan	33.768006	66.238514	33.768006	66.238514
2	Afghanistan	AFG	2020-04-30	1827.0	Afghanistan	33.768006	66.238514	33.768006	66.238514
3	Afghanistan	AFG	2020-05-31	15180.0	Afghanistan	33.768006	66.238514	33.768006	66.238514
4	Afghanistan	AFG	2020-06-30	31445.0	Afghanistan	33.768006	66.238514	33.768006	66.238514
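
Suffixes such as lat_x and lat_y usually appear when a frame that already has lat and lon columns is merged with the coordinates again, for example after re-running the cell. A minimal sketch of a merge that is safe to repeat is shown below. The helper add_coords and the coordinate values are invented for illustration.

```python
import pandas as pd

cases = pd.DataFrame({"location": ["Nepal", "India"], "total_cases": [5.0, 9.0]})
coords = pd.DataFrame({"country": ["Nepal", "India"],
                       "lat": [28.4, 20.6], "lon": [84.1, 79.0]})  # placeholder values

def add_coords(frame, coords):
    """Attach lat/lon by country name; safe to call more than once."""
    # Drop coordinates left over from an earlier run so no _x/_y columns appear.
    frame = frame.drop(columns=["lat", "lon"], errors="ignore")
    lookup = coords.rename(columns={"country": "location"})
    return frame.merge(lookup[["location", "lat", "lon"]], on="location", how="left")

once = add_coords(cases, coords)
twice = add_coords(once, coords)
print(list(twice.columns))
```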

import plotly.express as px


fig = px.density_mapbox(tldf.drop_duplicates(keep="last"), 
                          lat="lat",  # column names, so values come from the de-duplicated frame
                          lon="lon",
                          hover_name="location", 
                          hover_data=["total_cases"], 
                          color_continuous_scale="Portland",
                          radius=7, 
                          zoom=0,
                          height=700,
                          z="total_cases"
                          )
fig.update_layout(title=f'Country vs total_cases',
                  font=dict(family="Courier New, monospace",
                            size=18,
                            color="#7f7f7f")
                )
fig.update_layout(mapbox_style="open-street-map", mapbox_center_lon=0)


fig.show()

A density map is more useful and clear when we plot at the state or city level, because the points are easier to see. Here it is not clearly visible.

Density Mapbox with Slider


first_day = tldf.date.min()

scl = [[0.0, '#ffffff'],[0.2, '#b4a8ce'],[0.4, '#8573a9'],
       [0.6, '#7159a3'],[0.8, '#5732a1'],[1.0, '#2c0579']] # purples

data_slider = []
for date in tldf['date'].unique():
    df_segmented = tldf[tldf['date'] == date].copy()  # copy so the casts below do not touch tldf

    for col in df_segmented.columns:
        df_segmented[col] = df_segmented[col].astype(str)

    data_each_yr = dict(
                        type='densitymapbox',
                        lat = df_segmented["lat"],
                        lon = df_segmented["lon"],
                        hoverinfo="text",
                        # name = "country",
                        text = df_segmented["country"],                        
                        z=df_segmented["total_cases"].astype(float),
                        colorbar= {'title':'Total Cases'}
                        )

    data_slider.append(data_each_yr)

steps = []
for i,date in enumerate(tldf.date.unique()):
    step = dict(method='restyle',
                args=['visible', [False] * len(data_slider)],
                label='Date {}'.format(date))
    step['args'][1][i] = True
    steps.append(step)

sliders = [dict(active=0, pad={"t": 1}, steps=steps)]

layout = dict(mapbox_style="open-street-map",
              title ='Total Cases at the End of Month Across the World',
              sliders=sliders)

fig = dict(data=data_slider, layout=layout)

iplot(fig)


Advent of Code 2021 Python Solution: Day 16

Viper — Fri, 17 Dec 2021 12:06:31 +0000

I was too busy to solve this challenge (though I tried for around 30 minutes), and I did not want to skip a day, so I had to look at other people's code.
The following code is taken from here. All credit goes to the author of this repository.

Part 1

data,data1=get_data(day=16)

data = '''38006F45291200'''.splitlines()
data=data1[0].splitlines()

s = bin(int(data[0], 16))[2:]
n = len(s)
if n % 4 != 0:
    s = '0' * (4 - n % 4) + s
n = len(s)
res = 0
c = 0

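while c < n and '1' in s[c:]:
    v = int(s[c: c + 3], 2)  # 3-bit packet version
    res += v
    c += 3
    t = int(s[c: c + 3], 2)  # 3-bit packet type ID
    c += 3

    if t == 4:
        # literal value: 5-bit groups, a leading 1 means another group follows
        num = ''
        while s[c] == '1':
            num += s[c + 1: c + 5]
            c += 5
        num += s[c + 1: c + 5]
        c += 5
        num = int(num, 2)
    else:
        l = int(s[c], 2)  # length type ID of an operator packet
        c += 1
        if l == 0:
            # next 15 bits: total bit length of the sub-packets
            num = int(s[c: c + 15], 2)
            c += 15
        else:
            # next 11 bits: number of sub-packets
            num = int(s[c: c + 11], 2)
            c += 11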

print(res)

Part 2

from functools import reduce

funcDict = {
    0: sum,
    1: lambda a: reduce(lambda x, y: x * y, a),
    2: min,
    3: max,
    5: lambda a: int(a[0] > a[1]),
    6: lambda a: int(a[0] < a[1]),
    7: lambda a: int(a[0] == a[1])
}

def evaluate(u):
    if packets[u][1] == 4:
        return packets[u][2]

    res = []
    for v in graph[u]:
        res.append(evaluate(v))
    return funcDict[packets[u][1]](res)

s = bin(int(data[0], 16))[2:]
for i in data[0]:
    if i != '0':
        break
    s = '0' * 4 + s
n = len(s)
if n % 4 != 0:
    s = '0' * (4 - n % 4) + s
n = len(s)
c = 0
packets = []

while c < n and '1' in s[c:]:
    v = int(s[c: c + 3], 2)
    c += 3
    t = int(s[c: c + 3], 2)
    c += 3

    if t == 4:
        num = ''
        while s[c] == '1':
            num += s[c + 1: c + 5]
            c += 5
        num += s[c + 1: c + 5]
        c += 5
        num = int(num, 2)

        packets.append([v, t, num, c])
    else:
        l = int(s[c], 2)
        c += 1
        if l == 0:
            num = int(s[c: c + 15], 2)
            c += 15
        else:
            num = int(s[c: c + 11], 2)
            c += 11

        packets.append([v, t, l, num, c])

stack = []
graph = [[] for _ in range(len(packets))]

for i, u in enumerate(packets):
    if len(stack) > 0:
        p = stack[-1]
        graph[p].append(i)
        packets[p][3] -= 1
        if packets[p][3] == 0:
            stack.pop()

    while len(stack) > 0:
        p = stack[-1]
        if packets[p][2] == 0 and packets[p][3] <= u[-1] - packets[p][-1]:
            stack.pop()
        else:
            break

    if u[1] != 4:
        stack.append(i)

print(evaluate(0))
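
For comparison, here is a recursive sketch of the same decoding that returns both the version sum and the evaluated value in one pass. It is not the code from the credited repository; the function names hex_to_bits and parse are my own, and the packet layout is inferred from the solution above (3-bit version, 3-bit type, literal groups of 5 bits, and a 15-bit or 11-bit length field for operators).

```python
from math import prod

def hex_to_bits(hex_text):
    """Expand each hex digit to 4 bits, keeping leading zeros."""
    return bin(int(hex_text, 16))[2:].zfill(4 * len(hex_text))

def parse(bits, pos=0):
    """Decode one packet at pos; return (version_sum, value, next_pos)."""
    version = int(bits[pos:pos + 3], 2)
    type_id = int(bits[pos + 3:pos + 6], 2)
    pos += 6

    if type_id == 4:                       # literal: 5-bit groups, first bit = "more follows"
        value = 0
        while True:
            group = bits[pos:pos + 5]
            pos += 5
            value = value * 16 + int(group[1:], 2)
            if group[0] == "0":
                return version, value, pos

    values = []
    if bits[pos] == "0":                   # next 15 bits: total length of sub-packets
        end = pos + 16 + int(bits[pos + 1:pos + 16], 2)
        pos += 16
        while pos < end:
            v, value, pos = parse(bits, pos)
            version += v
            values.append(value)
    else:                                  # next 11 bits: number of sub-packets
        count = int(bits[pos + 1:pos + 12], 2)
        pos += 12
        for _ in range(count):
            v, value, pos = parse(bits, pos)
            version += v
            values.append(value)

    # Same operator table as funcDict above, keyed by type ID.
    ops = {0: sum, 1: prod, 2: min, 3: max,
           5: lambda a: int(a[0] > a[1]),
           6: lambda a: int(a[0] < a[1]),
           7: lambda a: int(a[0] == a[1])}
    return version, ops[type_id](values), pos

version_sum, value, _ = parse(hex_to_bits("38006F45291200"))
print(version_sum, value)
```

The recursion replaces the explicit stack and graph: each operator packet simply parses its own children before applying its operation.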

Advent of Code 2021 Python Solution: Day 15

Viper — Wed, 15 Dec 2021 18:16:03 +0000

I once failed DSA in my bachelor's degree and never really understood graphs and path finding, but each year Advent of Code makes me try once more. Instead of writing Dijkstra from scratch, I used something easier: skimage has a way to find the minimum cost path.

Solution

import numpy as np
from skimage import graph

data,data1 = get_data(15)

data = np.array([int(i) for dt in data for i in dt ]).reshape(-1, len(data[0]))
data
data1 = np.array([int(i) for dt in data1 for i in dt ]).reshape(-1, len(data1[0]))

window = data1.copy()

rs,cs = window.shape

cost = graph.MCP(window, fully_connected=False)
cost.find_costs(starts = [(0,0)])

journey = [window[pos] for pos in  cost.traceback((rs-1,cs-1))[1:]]
print(f"Part1: {sum(journey)}")

# 5times bigger
new_window = window.copy()
nrow = np.hstack([new_window, new_window+1, new_window+2, new_window+3, new_window+4])
new_window = np.vstack([nrow,nrow+1,nrow+2,nrow+3,nrow+4])
rs,cs = new_window.shape

new_window%=9
new_window[new_window==0]=9

cost = graph.MCP(new_window, fully_connected=False)
cost.find_costs(starts = [(0,0)])

journey = [new_window[pos] for pos in  cost.traceback((rs-1,cs-1))[1:]]
print(f"Part2: {sum(journey)}")
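
For readers who do want to see the from-scratch version, below is a minimal Dijkstra sketch using heapq. It is not what the solution above uses; the function name min_path_cost and the tiny grid are made up. It mirrors the same conventions: moves are 4-connected, and the start cell is not counted in the cost.

```python
import heapq

def min_path_cost(grid):
    """Cheapest 4-connected path from top-left to bottom-right.

    The cost of a path is the sum of the cells entered (the start cell is free).
    """
    rows, cols = len(grid), len(grid[0])
    best = {(0, 0): 0}
    queue = [(0, 0, 0)]                      # (cost so far, row, col)
    while queue:
        cost, r, c = heapq.heappop(queue)
        if (r, c) == (rows - 1, cols - 1):
            return cost
        if cost > best[(r, c)]:
            continue                         # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + grid[nr][nc]
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    heapq.heappush(queue, (new_cost, nr, nc))
    return None

grid = [[1, 9, 1],
        [1, 9, 1],
        [1, 1, 1]]
print(min_path_cost(grid))  # 4: down, down, right, right
```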