<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Durga Pokharel</title>
    <description>The latest articles on DEV Community by Durga Pokharel (@iamdurga).</description>
    <link>https://dev.to/iamdurga</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F546410%2F5aeebbef-a01e-4795-bd65-e72b3f578b4b.png</url>
      <title>DEV Community: Durga Pokharel</title>
      <link>https://dev.to/iamdurga</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iamdurga"/>
    <language>en</language>
    <item>
      <title>How Can We Generate Random Number From Congruential Method?</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Mon, 17 Oct 2022 12:30:12 +0000</pubDate>
      <link>https://dev.to/iamdurga/how-can-we-generate-random-number-from-congruential-method-2m04</link>
      <guid>https://dev.to/iamdurga/how-can-we-generate-random-number-from-congruential-method-2m04</guid>
      <description>&lt;h1&gt;
  
  
  What Is a Random Number?
&lt;/h1&gt;

&lt;p&gt;A random number is, as the name suggests, a number selected at random from a set of numbers. The earliest methods of producing random numbers, such as dice, coin flipping, and roulette wheels, are far too slow for most applications in statistics and cryptography, so today they are employed mainly in games and gambling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who generated the first random numbers?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;John von Neumann proposed a method for generating random numbers in 1946&lt;/code&gt;. His plan was to square an initial seed value, extract the middle digits, and repeat. The sequence of integers that results from repeatedly squaring the result and extracting the middle digits exhibits the statistical characteristics of randomness.&lt;/p&gt;

&lt;p&gt;Consider a reasonably large number such as 2934: its square is 8608356, from which we take the middle digits 083; the square of 83 is 6889, whose middle digits give the next number, 88, and so on. Note that the initial seed needs to be sufficiently large for the method to work well.&lt;/p&gt;
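&lt;p&gt;The steps above can be sketched in Python. This is a minimal illustration of the middle-square idea, assuming a fixed four-digit window; the function name is ours, not a standard one.&lt;/p&gt;

```python
def middle_square(seed, digits=4, n=5):
    """Generate n numbers with von Neumann's middle-square method."""
    results = []
    for _ in range(n):
        squared = str(seed * seed).zfill(2 * digits)  # pad so the middle is well defined
        mid = len(squared) // 2
        half = digits // 2
        seed = int(squared[mid - half: mid + half])   # take the middle digits as the next seed
        results.append(seed)
    return results

print(middle_square(2934))
```

&lt;p&gt;Note that the sequence quickly collapses toward zero for many seeds, which is a well-known weakness of the middle-square method.&lt;/p&gt;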

&lt;h1&gt;
  
  
  Main Characteristics of Random Numbers
&lt;/h1&gt;

&lt;p&gt;A good random number generator should have the following desirable properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The numbers we generate should be random, meaning there is no discernible pattern in the data: the sequence should behave as if it follows no rule.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reproducible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another important property is reproducibility: starting from the same seed, the generator must reproduce exactly the same sequence, which is essential for debugging and repeating experiments.&lt;/p&gt;
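&lt;p&gt;Reproducibility can be illustrated with Python's standard &lt;code&gt;random&lt;/code&gt; module (a quick sketch, not part of the article's method): seeding the generator twice with the same value yields the same sequence.&lt;/p&gt;

```python
import random

random.seed(42)
first = [random.random() for _ in range(3)]
random.seed(42)
second = [random.random() for _ in range(3)]
assert first == second  # same seed, same sequence
```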

&lt;ul&gt;
&lt;li&gt;Portable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A random number generator should be portable: implemented on different machines or in different programming languages, it should produce the same sequence.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Generation should be efficient: producing each number should be fast and require little memory, since simulations may need millions of them.&lt;/p&gt;

&lt;h1&gt;
  
  
  Generating Random Numbers for the Monte Carlo Method
&lt;/h1&gt;

&lt;p&gt;Multiple sources of systematic and statistical error can affect MC simulations; poor-quality random numbers introduce systematic error. The creation and testing of random numbers remain significant issues that have not been fully resolved. As already indicated, the random number sequences required for MC should be uniform, uncorrelated, of very long period, and should not repeat over the length of a simulation.&lt;/p&gt;

&lt;p&gt;Additionally, if we employ parallel computing (which is necessary to handle massive amounts of data), we must ensure that all generated random number sequences are independent and uncorrelated.&lt;/p&gt;
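&lt;p&gt;As a sketch of how independent parallel streams are obtained in practice, NumPy's &lt;code&gt;SeedSequence.spawn&lt;/code&gt; derives statistically independent child generators from one parent seed (this is NumPy's mechanism, not part of the article's method):&lt;/p&gt;

```python
import numpy as np

# Spawn four independent child seed sequences, one per parallel worker
parent = np.random.SeedSequence(12345)
children = parent.spawn(4)
rngs = [np.random.default_rng(c) for c in children]
samples = [rng.random() for rng in rngs]
print(samples)
```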

&lt;h1&gt;
  
  
  Generating Random Numbers Using the Congruential Method
&lt;/h1&gt;

&lt;p&gt;The fundamental principle is that a seed is picked together with a fixed multiplier c, and successive numbers are then produced by simple multiplication:&lt;/p&gt;

&lt;p&gt;[X_n = (c \times X_{n-1} + a_0) \bmod N_{max}]&lt;/p&gt;

&lt;p&gt;where (X_n) is an integer between 1 and (N_{max}).&lt;/p&gt;

&lt;p&gt;Experience has shown that a good 32-bit linear congruential generator is&lt;/p&gt;

&lt;p&gt;[X_n = (16807 \times X_{n-1} + a_0) \bmod (2^{31}-1)]&lt;/p&gt;

&lt;p&gt;The multiplier 16807 (which is 7^5) is often called a magic number.&lt;/p&gt;
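&lt;p&gt;The recurrence above can be written directly as a small Python function (a sketch; the function name and default arguments are ours):&lt;/p&gt;

```python
def lcg(seed, n, a=16807, c=0, m=2**31 - 1):
    """X_n = (a * X_{n-1} + c) mod m, the recurrence from the text."""
    numbers = []
    x = seed
    for _ in range(n):
        x = (a * x + c) % m
        numbers.append(x)
    return numbers

print(lcg(10000, 3))
```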

&lt;h3&gt;
  
  
  Let’s Generate Random Numbers Using the Congruential Method in Python
&lt;/h3&gt;

&lt;p&gt;First, import the necessary modules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import matplotlib.pyplot as plt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can choose any value for the seed, but it should be a large number. We set &lt;code&gt;ran = (16807*seed)%(2**31)&lt;/code&gt; and then assign &lt;code&gt;seed = ran&lt;/code&gt; so that the seed changes on each iteration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seed = 10000
count = 0
random1 = []
while count &amp;lt; 100:
    ran = (16807 * seed) % (2**31)
    seed = ran  # update the seed for the next iteration
    random1.append(ran)
    count += 1
#print(random1)




&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s scale the random numbers above into the range [0, 1):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L1 = [x/(2**31) for x in random1]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s generate another sequence using a different seed. The procedure is the same except for the seed value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seed = 137474
count = 0
random2 = []
while count &amp;lt; 100:
    ran = (16807 * seed) % (2**31)
    seed = ran  # update the seed for the next iteration
    random2.append(ran)
    count += 1
#print(random2)




&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scale these numbers into the range [0, 1) as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L2 = [x/(2**31) for x in random2]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Plot of the Two Random Number Sequences
&lt;/h4&gt;

&lt;p&gt;We should check the randomness of the numbers we generated, and a scatter plot is a good choice for this. If the scatter of points is uniform, the numbers behave randomly. However, we cannot produce perfectly random numbers this way: because we set a seed value, we produce pseudorandom numbers. The only sources of truly random numbers are physical processes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.scatter(random1,random2)
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FRandomn%2Foutput_14_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FRandomn%2Foutput_14_0.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above plot, points are distributed nearly, but not completely, uniformly. This is because we generated only a few numbers. What happens if we increase the sample size? Let’s check.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seed = 10000
count = 0
random1 = []
while count &amp;lt; 10000:
    ran = (16807 * seed) % (2**31)
    seed = ran
    random1.append(ran)
    count += 1
L1 = [x/(2**31) for x in random1]


seed = 137474
count = 0
random2 = []
while count &amp;lt; 10000:
    ran = (16807 * seed) % (2**31)
    seed = ran
    random2.append(ran)
    count += 1
L2 = [x/(2**31) for x in random2]





plt.scatter(random1,random2,s = 0.3)
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FRandomn%2Foutput_18_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FRandomn%2Foutput_18_0.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the sample size increases, the points become more uniformly distributed. Hence, we are able to build a pseudorandom number generator.&lt;/p&gt;

&lt;h4&gt;
  
  
  Let’s Transform the Random Numbers into a Normal Distribution
&lt;/h4&gt;

&lt;p&gt;For this, let’s build new numbers from the ones above using the Box-Muller transform:&lt;/p&gt;

&lt;p&gt;[y_1 = (-2\log(random1))^\frac{1}{2} \cos(2\pi \times random2)]&lt;/p&gt;

&lt;p&gt;[y_2 = (-2\log(random1))^\frac{1}{2} \sin(2\pi \times random2)]&lt;/p&gt;

&lt;h4&gt;
  
  
  Using Python
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arr1 = np.array(L1)
arr2 = np.array(L2)
ran_guss1 = ((-2*np.log(arr1)))**0.5 *np.cos(2*np.pi*arr2)
ran_guss2 = ((-2*np.log(arr1)))**0.5 *np.sin(2*np.pi*arr2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Let’s Check the Distribution of the New Numbers
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.hist(ran_guss1, bins= 10)
plt.title("Histogram From First Method")
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FRandomn%2Foutput_23_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FRandomn%2Foutput_23_0.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.hist(ran_guss2, bins= 10)
plt.title("Histogram From Second Method")
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FRandomn%2Foutput_24_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FRandomn%2Foutput_24_0.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparing Random Numbers from the Congruential Method and a Library
&lt;/h3&gt;

&lt;p&gt;Here, we generate random numbers with NumPy’s &lt;code&gt;random.rand()&lt;/code&gt; function, which produces numbers in the range [0, 1).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ra = np.random.rand(10000)


plt.scatter(L1, ra, s = 0.3)
plt.title("Random Number From Library and Congruential Method")
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FRandomn%2Foutput_27_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FRandomn%2Foutput_27_0.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;
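&lt;p&gt;As an aside, modern NumPy code would typically use the &lt;code&gt;Generator&lt;/code&gt; API rather than &lt;code&gt;np.random.rand&lt;/code&gt;; a minimal sketch:&lt;/p&gt;

```python
import numpy as np

# default_rng uses the PCG64 algorithm, a much stronger generator than a basic LCG
rng = np.random.default_rng(seed=10000)
sample = rng.random(5)  # uniform numbers in [0, 1)
print(sample)
```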

</description>
    </item>
    <item>
      <title>News Classification using Neural Network</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Wed, 28 Sep 2022 11:16:28 +0000</pubDate>
      <link>https://dev.to/iamdurga/news-classification-using-neural-network-2jma</link>
      <guid>https://dev.to/iamdurga/news-classification-using-neural-network-2jma</guid>
      <description>&lt;p&gt;News Classification with Simple Neural Network is one of the application of Deep Learning. And here in this part of the blog, I am going to perform a Nepali News Classification. Before jumping into the main part, I would love to share some of my previous contents based upon which this blog has been written.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dataqoil.com/2021/03/20/nepali-news-annapurna-post-scrapping-using-beautifulsoup-and-python/"&gt;How I collected news data?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataqoil.com/2022/04/23/nepali-news-analysis-eda/"&gt;EDA in Nepali News Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataqoil.com/2022/04/30/nepali-news-classification-with-logistic-regression/"&gt;Nepali News Classification with Logistic Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dataqoil.com/2022/05/07/nepali-news-classification-with-naive-bayes/"&gt;Nepali News Classification with Naive Bayes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The blogs above were written by me in sequential order. The part of this blog up to text pre-processing is the same as in the other classification blogs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Import Necessary Modules
&lt;/h1&gt;

&lt;p&gt;Let’s import the modules that we need for data preprocessing before modelling.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;os&lt;/code&gt;: The OS module in Python provides functions for interacting with the operating system and files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;pandas&lt;/code&gt;: Working with DataFrame and data analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;numpy&lt;/code&gt;: For numerical operations and arrays.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;matplotlib&lt;/code&gt;: For visualization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;matplotlib.font_manager&lt;/code&gt;: A module for finding, managing, and using fonts across platforms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;warnings&lt;/code&gt;: Warnings are provided to warn the developer of circumstances that aren’t always exceptions.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from matplotlib.font_manager import FontProperties
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import pprint

plt.style.use("seaborn-whitegrid")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Data Load
&lt;/h1&gt;

&lt;p&gt;The data is currently in &lt;a href="https://drive.google.com/drive/folders/1eaZUvctC6mqK6kBxWOi5Ab-Jc5_rQBtO?usp=sharing"&gt;my drive&lt;/a&gt; which is available publicly. And I run the scraping code frequently to get more data so the number of rows could be different later.&lt;/p&gt;

&lt;p&gt;I used data that I had gathered over the course of a month or two by scraping news from several news portals. The daily news was amalgamated into a final consolidated CSV file, which I use here. That file has 5838 rows and 9 columns; the Category column labels each article with a news field such as business, sports, news, or entertainment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv("/content/drive/MyDrive/News Scraping/combined_csv.csv")
df.shape


(5838, 9)


df.Category.value_counts()


business 1550
news 1228
entertainment 1092
technology 441
prabhas-news 441
sports 420
world 331
national 120
international 120
province 95
Name: Category, dtype: int64

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the above, we can see that most news belongs to the business, news, and entertainment categories. For classification problems, we would ideally have an equal number of data points in all classes. If not, we have a class imbalance problem: the model will tend to predict the majority classes and ignore the rest. Hence, we should be concerned with how to achieve class balance.&lt;/p&gt;

&lt;p&gt;One way to improve the balance here is to combine two or more classes into a single class. In doing so, we combined classes that contain similar types of data, such as news and prabhas-news, or international and world.&lt;br&gt;
&lt;/p&gt;
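&lt;p&gt;The merging of similar classes can be sketched with a pandas mapping (a toy illustration using a few category names from the list above):&lt;/p&gt;

```python
import pandas as pd

# Fold similar categories into a single class
toy = pd.DataFrame({"Category": ["news", "prabhas-news", "world", "international"]})
merge_map = {"prabhas-news": "news", "international": "world"}
toy["Category"] = toy["Category"].replace(merge_map)
print(toy["Category"].tolist())
```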

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# # business, entertainment 
# df.query('Category in ("business", "entertainment")')

# # business, entertainment, technology, sports, world + international

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Open the stopwords.txt file.
&lt;/h1&gt;

&lt;p&gt;Stop words are a collection of terms that are commonly used in any language. Stop words in English include words like “the,” “is,” and “and.” In NLP and text mining applications, stop words are removed so that the model can focus on the important terms. The following is how I loaded the stop-words file. Because stop words carry little information for news classification, we should eliminate them during preprocessing.&lt;/p&gt;
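&lt;p&gt;The idea of stop-word removal can be shown with a tiny English example (the article itself uses a Nepali stop-word file):&lt;/p&gt;

```python
# Keep only the words that are not in the stop-word set
stop_words = {"the", "is", "and"}
sentence = "the model is fast and accurate"
filtered = [w for w in sentence.split() if w not in stop_words]
print(" ".join(filtered))
```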

&lt;h1&gt;
  
  
  Stop words file
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stop_file = "/content/drive/MyDrive/News Scraping/News classification/nepali_stopwords.txt"
stop_words = []
with open(stop_file) as fp:
  lines = fp.readlines()
  stop_words =list( map(lambda x:x.strip(), lines))
#stop_words

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Open the Punctuation file.
&lt;/h1&gt;

&lt;p&gt;The code below is for loading a punctuation file. Punctuation is a set of tools used in writing to clearly distinguish sentences, phrases, and clauses so that their intended meaning may be understood. These tools provide no useful information during categorization, thus they should be eliminated before we train our model.&lt;br&gt;
&lt;/p&gt;
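&lt;p&gt;Removing punctuation tokens works the same way as stop-word removal; a minimal sketch with a few of the marks from the punctuation file:&lt;/p&gt;

```python
# Drop punctuation tokens before training
punctuation_words = [":", "?", "|", "!", ".", ","]
tokens = ["नेपाल", ",", "समाचार", "!"]
clean = [t for t in tokens if t not in punctuation_words]
print(clean)
```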

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;punctuation_file = "/content/drive/MyDrive/News Scraping/News classification/nepali_punctuation (1).txt"
punctuation_words = []
with open(punctuation_file) as fp:
  lines = fp.readlines()
  punctuation_words =list( map(lambda x:x.strip(), lines))
punctuation_words


[':', '?', '|', '!', '.', ',', '" "', '( )', '—', '-', "?'"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Pre-processing of text
&lt;/h1&gt;

&lt;p&gt;I’m only going to use the titles from all of the categories. The content columns contain an enormous quantity of words; I’ll use them in a later blog post. In this blog, I’ll show you how to use a simple neural network on title data to classify news by category.&lt;/p&gt;

&lt;p&gt;First, I created a function named &lt;code&gt;preprocess_text&lt;/code&gt; that accepts the data, stop words, and punctuation words as parameters. I made a list called &lt;code&gt;new_cat&lt;/code&gt; to hold the processed rows and initialized a &lt;code&gt;noise&lt;/code&gt; list of digits, as you can see in the code. Then I loop over the rows, strip the whitespace from each row, split it into words, and keep only the words that are not stop words, punctuation, or noise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
def preprocess_text(cat_data, stop_words, punctuation_words):
  new_cat = []
  noise = "1,2,3,4,5,6,7,8,9,0,०,१,२,३,४,५,६,७,८,९".split(",")

  for row in cat_data:
    words = row.strip().split(" ")      
    nwords = "" # []

    for word in words:
      if word not in punctuation_words and word not in stop_words:
        is_noise = False
        for n in noise:
          #print(n)
          if n in word:
            is_noise = True
            break
        if is_noise == False:
          word = word.replace("(","")
          word = word.replace(")","")
          # nwords.append(word)
          if len(word)&amp;gt;1:
            nwords+=word+" "

    new_cat.append(nwords.strip())
  # print(new_cat)
  return new_cat

title_clean = preprocess_text(["शिक्षण संस्थामा ज जनस्वास्थ्य 50 मापदण्ड पालना शिक्षा मन्त्रालयको निर्देशन"], stop_words, punctuation_words)
print(title_clean)


['शिक्षण संस्थामा जनस्वास्थ्य मापदण्ड पालना शिक्षा मन्त्रालयको निर्देशन']

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we take only the title from our data and apply the stop-word and punctuation filtering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ndf = df.copy()
cat_title = []
for i, row in ndf.iterrows():
  ndf.loc[i, "Title"]= preprocess_text([row.Title], stop_words, punctuation_words)[0]

ndf.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Unnamed: 0&lt;/th&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;URL&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Author&lt;/th&gt;
&lt;th&gt;Author URL&lt;/th&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;प्रधानमन्त्री देउवा, दाहाल नेपाल भारतीय राजदूत...&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://ekantipur.com/news/2022/04/12/16497794"&gt;https://ekantipur.com/news/2022/04/12/16497794&lt;/a&gt;...&lt;/td&gt;
&lt;td&gt;चैत्र २९, २०७८&lt;/td&gt;
&lt;td&gt;कान्तिपुर संवाददाता&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ekantipur.com/author/author-14301"&gt;https://ekantipur.com/author/author-14301&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;काठमाडौँ — प्रधानमन्त्री शेरबहादुर देउवा, नेकप...&lt;/td&gt;
&lt;td&gt;news&lt;/td&gt;
&lt;td&gt;प्रधानमन्त्री शेरबहादुर देउवा, नेकपा (माओवादी ...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;गठबन्धनले महानगर उपमहानगरमा प्रमुख-उपप्रमुख के...&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://ekantipur.com/news/2022/04/12/16497772"&gt;https://ekantipur.com/news/2022/04/12/16497772&lt;/a&gt;...&lt;/td&gt;
&lt;td&gt;चैत्र २९, २०७८&lt;/td&gt;
&lt;td&gt;कान्तिपुर संवाददाता&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ekantipur.com/author/author-14301"&gt;https://ekantipur.com/author/author-14301&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;काठमाडौँ — स्थानीय तहको निर्वाचनका लागि सत्ता ...&lt;/td&gt;
&lt;td&gt;news&lt;/td&gt;
&lt;td&gt;स्थानीय तहको निर्वाचनका लागि सत्ता गठबन्धन दलह...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;परराष्ट्रमन्त्री खड्कासँग भारतीय राजदूत क्वात्...&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://ekantipur.com/news/2022/04/12/16497754"&gt;https://ekantipur.com/news/2022/04/12/16497754&lt;/a&gt;...&lt;/td&gt;
&lt;td&gt;चैत्र २९, २०७८&lt;/td&gt;
&lt;td&gt;कान्तिपुर संवाददाता&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ekantipur.com/author/author-14301"&gt;https://ekantipur.com/author/author-14301&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;काठमाडौँ — भारतको विदेश मन्त्रालयमा सचिव पदमा ...&lt;/td&gt;
&lt;td&gt;news&lt;/td&gt;
&lt;td&gt;भारतको विदेश मन्त्रालयमा सचिव पदमा नियुक्त भएप...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;स्थानीय तहको नेतृत्व बाँडफाँट केन्द्रमा पठाउन ...&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://ekantipur.com/news/2022/04/12/16497720"&gt;https://ekantipur.com/news/2022/04/12/16497720&lt;/a&gt;...&lt;/td&gt;
&lt;td&gt;चैत्र २९, २०७८&lt;/td&gt;
&lt;td&gt;कान्तिपुर संवाददाता&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ekantipur.com/author/author-14301"&gt;https://ekantipur.com/author/author-14301&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;काठमाडौँ — सत्ता गठबन्धनले स्थानीय तहको नेतृत्...&lt;/td&gt;
&lt;td&gt;news&lt;/td&gt;
&lt;td&gt;सत्ता गठबन्धनले स्थानीय तहको नेतृत्व बाँडफाँट ...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;प्रधानसेनापति भारतीय सेनाका रथीबीच भेटवार्ता&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://ekantipur.com/news/2022/04/12/16497700"&gt;https://ekantipur.com/news/2022/04/12/16497700&lt;/a&gt;...&lt;/td&gt;
&lt;td&gt;चैत्र २९, २०७८&lt;/td&gt;
&lt;td&gt;कान्तिपुर संवाददाता&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ekantipur.com/author/author-14301"&gt;https://ekantipur.com/author/author-14301&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;काठमाडौँ — प्रधानसेनापति प्रभुराम शर्मा र भारत...&lt;/td&gt;
&lt;td&gt;news&lt;/td&gt;
&lt;td&gt;प्रधानसेनापति प्रभुराम शर्मा र भारतीय सेनाका र...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&amp;lt;h1&amp;gt;
  &amp;lt;a name="importing-necessary-module-for-simple-neural-network" href="#importing-necessary-module-for-simple-neural-network" class="anchor"&amp;gt;
  &amp;lt;/a&amp;gt;
  Importing Necessary Module for simple neural network
&amp;lt;/h1&amp;gt;

&amp;lt;p&amp;gt;Here we first import to_categorical from tensorflow.keras.utils. It is used to turn our label data into one-hot encoded (OHE) form. OHE is very useful in multiclass classification because it assigns 1 to the class a sample belongs to and 0 to all other classes.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Let’s import:&amp;lt;/p&amp;gt;

&amp;lt;ul&amp;gt;
&amp;lt;li&amp;gt;&amp;lt;code&amp;gt;OneHotEncoder&amp;lt;/code&amp;gt; : Encode categorical features as a one-hot numeric array&amp;lt;/li&amp;gt;
&amp;lt;li&amp;gt;&amp;lt;code&amp;gt;confusion_matrix&amp;lt;/code&amp;gt;: To evaluate classification model performance&amp;lt;/li&amp;gt;
&amp;lt;li&amp;gt;&amp;lt;code&amp;gt;sequence&amp;lt;/code&amp;gt;: The sequential API allows you to create models layer-by-layer for most problems &amp;lt;a href="https://stackoverflow.com/questions/57751417/what-is-meant-by-sequential-model-in-keras"&amp;gt;from&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt;
&amp;lt;li&amp;gt;&amp;lt;code&amp;gt;Tokenizer&amp;lt;/code&amp;gt; : To tokenize a paragraph into sentences or a sentence into words
&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;
&amp;lt;div class="highlight"&amp;gt;&amp;lt;pre class="highlight plaintext"&amp;gt;&amp;lt;code&amp;gt;from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.utils import to_categorical


data = pd.DataFrame()

data["text"]=ndf.Title
data["label"]=ndf.Category
data["target"] = data["label"].apply(lambda x: "news" if x=="prabhas-news" else "national" if x=="province" else "world" if x=="international" else x)
classes = {c:i for i,c in enumerate(data.target.unique())}
data["target"] = data.target.apply(lambda x: classes[x])

targets = to_categorical(data.target)

vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(data.text)
vectext = vectorizer.transform(data.text)

X_train, X_test, Y_train, Y_test = train_test_split(vectext, 
                                                    targets, 
                                                    random_state=0)


targets, targets.shape


(array([[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), (5838, 7))

&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Above is the OHE form of our target: each row of the array contains exactly one 1 and 0s elsewhere. The number of columns is 7, which is the total number of classes.&amp;lt;/p&amp;gt;
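&amp;lt;p&amp;gt;The shape of a one-hot encoding can be sketched with plain NumPy (an illustration, not the to_categorical call used above):&amp;lt;/p&amp;gt;

```python
import numpy as np

# Label i becomes a row with a single 1 in column i
labels = np.array([0, 2, 1])
one_hot = np.eye(3)[labels]
print(one_hot)
```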
&amp;lt;h1&amp;gt;
  &amp;lt;a name="simple-neural-network-having-one-layer" href="#simple-neural-network-having-one-layer" class="anchor"&amp;gt;
  &amp;lt;/a&amp;gt;
  Simple Neural Network having one layer
&amp;lt;/h1&amp;gt;

&amp;lt;p&amp;gt;First I want to go through what a simple neural network is. It is the most straightforward type of deep learning architecture: the source nodes in the input layer are directly connected to the neurons in the output layer (the computation nodes), but not the other way around. The simplest form of neural network is also known as a perceptron. This type of network is mainly used for classification tasks. If our data is linearly separable, it is beneficial to use a perceptron.&amp;lt;/p&amp;gt;
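&amp;lt;p&amp;gt;As a minimal sketch of the perceptron idea (the weights, bias, and input below are made-up values, not part of this article's pipeline), a single neuron computes a weighted sum of its inputs and applies a step activation:&amp;lt;/p&amp;gt;

```python
import numpy as np

# Hypothetical weights, bias, and input for a single neuron
w = np.array([0.5, -0.6, 0.2])
b = 0.1
x = np.array([1.0, 0.4, 2.0])

# Weighted sum of the inputs plus the bias
z = np.dot(w, x) + b
# Step activation: the neuron fires (1) if the sum is positive, else 0
output = 1 if z > 0 else 0
print(output)  # 1, since z = 0.76 > 0
```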

&amp;lt;p&amp;gt;&amp;lt;img src="https://iamdurga.github.io/assets/simple_nn/simple_nn.png" alt="image"&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;We use only one neuron to classify binary-class data. If we want to classify data with multiple classes, we should add more neurons to the output layer. If our data is not linearly separable, we can use a multi-layer perceptron with at least one hidden layer; we can add as many hidden layers as we want. Adding more hidden layers helps us extract higher-order statistics from the input. Our news classification is an example of a non-linear problem. The linearity also depends on the activation function that we use.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Since our target output is in one-hot encoded form, we should use an activation function that produces probabilities, because in our one-hot encoded labels there is a 1 for the true class and a 0 for every other class. Our goal is to make the model's predicted probabilities as close as possible to these values.&amp;lt;br&amp;gt;
&amp;lt;/p&amp;gt;
&amp;lt;div class="highlight"&amp;gt;&amp;lt;pre class="highlight plaintext"&amp;gt;&amp;lt;code&amp;gt;# simple NN
in_shape = X_train.shape[-1]
out_shape = targets.shape[-1]
model = Sequential()
model.add(Dense(input_dim=in_shape, units=out_shape, activation='softmax'))

model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type) Output Shape Param #   
=================================================================
 dense (Dense) (None, 7) 65814     

=================================================================
Total params: 65,814
Trainable params: 65,814
Non-trainable params: 0
_________________________________________________________________

&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;In the above block of code we build the model using Sequential, which lets us construct the model layer by layer. The input dimension is 9401, which is our vocabulary size, and the output layer has 7 units because we want to classify our data into 7 categories.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;It is beneficial to use the softmax activation function in multiclass classification. In its most basic form, softmax is a vector function: it takes a vector as input and produces a vector as output. Softmax squashes each unit's output to lie between 0 and 1, and it also normalizes the outputs so that they sum to 1. From softmax's output you can therefore read off the likelihood that each class is the true one.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;\begin{equation} \mathrm{softmax}(x_j) = \frac{e^{x_j}}{\sum_{i=1}^n e^{x_i}} \end{equation}&amp;lt;/p&amp;gt;

&amp;lt;blockquote&amp;gt;
&amp;lt;p&amp;gt;If we do not specify an activation function, a linear activation is used by default.&amp;lt;/p&amp;gt;
&amp;lt;/blockquote&amp;gt;
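&amp;lt;p&amp;gt;A quick numeric check of the softmax formula, using NumPy with made-up logits (this snippet is illustrative, not part of the article's pipeline):&amp;lt;/p&amp;gt;

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # made-up raw scores
probs = softmax(logits)
print(probs.round(3))  # each value lies between 0 and 1
print(probs.sum())     # the probabilities sum to 1
```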

&amp;lt;div class="highlight"&amp;gt;&amp;lt;pre class="highlight plaintext"&amp;gt;&amp;lt;code&amp;gt;model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;We should always compile the model after building it. We can view compilation as a setup phase that configures how the model will be trained. While compiling the model we use 'adam' as the optimizer; there are many other optimizers, and choosing the right one is crucial. The loss function is "categorical_crossentropy" because we classify our data into multiple classes.&amp;lt;/p&amp;gt;
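&amp;lt;p&amp;gt;To see what categorical cross-entropy measures, here is a hand-computed example with made-up values (not taken from the training run). For a one-hot target, the loss reduces to the negative log of the probability the model assigns to the true class:&amp;lt;/p&amp;gt;

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])  # one-hot target: class 1 is correct
y_pred = np.array([0.1, 0.7, 0.2])  # made-up softmax output of a model

# Categorical cross-entropy: -sum(y_true * log(y_pred))
loss = -np.sum(y_true * np.log(y_pred))
print(round(loss, 4))  # 0.3567, i.e. -log(0.7)
```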
&amp;lt;h1&amp;gt;
  &amp;lt;a name="model-fit" href="#model-fit" class="anchor"&amp;gt;
  &amp;lt;/a&amp;gt;
  Model Fit
&amp;lt;/h1&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;
&amp;lt;div class="highlight"&amp;gt;&amp;lt;pre class="highlight plaintext"&amp;gt;&amp;lt;code&amp;gt;history = model.fit(X_train.toarray(), Y_train, epochs=100, batch_size=32, validation_data=(X_test.toarray(), Y_test))


Epoch 1/100
137/137 [==============================] - 3s 6ms/step - loss: 1.7741 - accuracy: 0.5160 - val_loss: 1.6109 - val_accuracy: 0.6096
Epoch 2/100
137/137 [==============================] - 1s 4ms/step - loss: 1.4537 - accuracy: 0.6802 - val_loss: 1.3941 - val_accuracy: 0.6418
Epoch 3/100
137/137 [==============================] - 1s 4ms/step - loss: 1.2407 - accuracy: 0.7165 - val_loss: 1.2510 - val_accuracy: 0.6623
Epoch 4/100
137/137 [==============================] - 1s 4ms/step - loss: 1.0887 - accuracy: 0.7478 - val_loss: 1.1483 - val_accuracy: 0.6877
Epoch 5/100
137/137 [==============================] - 1s 4ms/step - loss: 0.9735 - accuracy: 0.7730 - val_loss: 1.0708 - val_accuracy: 0.7082
Epoch 6/100
137/137 [==============================] - 1s 4ms/step - loss: 0.8817 - accuracy: 0.7988 - val_loss: 1.0110 - val_accuracy: 0.7322
Epoch 7/100
137/137 [==============================] - 1s 4ms/step - loss: 0.8066 - accuracy: 0.8202 - val_loss: 0.9620 - val_accuracy: 0.7384
Epoch 8/100
137/137 [==============================] - 1s 4ms/step - loss: 0.7433 - accuracy: 0.8387 - val_loss: 0.9230 - val_accuracy: 0.7466
Epoch 9/100
137/137 [==============================] - 1s 5ms/step - loss: 0.6893 - accuracy: 0.8556 - val_loss: 0.8906 - val_accuracy: 0.7527
Epoch 10/100
137/137 [==============================] - 1s 4ms/step - loss: 0.6426 - accuracy: 0.8677 - val_loss: 0.8633 - val_accuracy: 0.7575
Epoch 11/100
137/137 [==============================] - 1s 4ms/step - loss: 0.6019 - accuracy: 0.8789 - val_loss: 0.8398 - val_accuracy: 0.7644
Epoch 12/100
137/137 [==============================] - 1s 5ms/step - loss: 0.5660 - accuracy: 0.8858 - val_loss: 0.8206 - val_accuracy: 0.7658
Epoch 13/100
137/137 [==============================] - 1s 8ms/step - loss: 0.5343 - accuracy: 0.8938 - val_loss: 0.8032 - val_accuracy: 0.7705
Epoch 14/100
137/137 [==============================] - 1s 4ms/step - loss: 0.5059 - accuracy: 0.8988 - val_loss: 0.7890 - val_accuracy: 0.7719
Epoch 15/100
137/137 [==============================] - 1s 4ms/step - loss: 0.4805 - accuracy: 0.9032 - val_loss: 0.7771 - val_accuracy: 0.7685
Epoch 16/100
137/137 [==============================] - 1s 4ms/step - loss: 0.4576 - accuracy: 0.9054 - val_loss: 0.7656 - val_accuracy: 0.7712
Epoch 17/100
137/137 [==============================] - 1s 4ms/step - loss: 0.4368 - accuracy: 0.9091 - val_loss: 0.7564 - val_accuracy: 0.7753
Epoch 18/100
137/137 [==============================] - 1s 4ms/step - loss: 0.4178 - accuracy: 0.9125 - val_loss: 0.7480 - val_accuracy: 0.7747
Epoch 19/100
137/137 [==============================] - 1s 4ms/step - loss: 0.4006 - accuracy: 0.9127 - val_loss: 0.7414 - val_accuracy: 0.7781
Epoch 20/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3847 - accuracy: 0.9155 - val_loss: 0.7352 - val_accuracy: 0.7795
Epoch 21/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3701 - accuracy: 0.9187 - val_loss: 0.7304 - val_accuracy: 0.7815
Epoch 22/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3566 - accuracy: 0.9226 - val_loss: 0.7263 - val_accuracy: 0.7849
Epoch 23/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3440 - accuracy: 0.9246 - val_loss: 0.7223 - val_accuracy: 0.7842
Epoch 24/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3326 - accuracy: 0.9260 - val_loss: 0.7190 - val_accuracy: 0.7856
Epoch 25/100
137/137 [==============================] - 1s 5ms/step - loss: 0.3220 - accuracy: 0.9260 - val_loss: 0.7155 - val_accuracy: 0.7849
Epoch 26/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3119 - accuracy: 0.9294 - val_loss: 0.7130 - val_accuracy: 0.7842
Epoch 27/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3025 - accuracy: 0.9310 - val_loss: 0.7120 - val_accuracy: 0.7842
Epoch 28/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2938 - accuracy: 0.9317 - val_loss: 0.7104 - val_accuracy: 0.7836
Epoch 29/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2856 - accuracy: 0.9322 - val_loss: 0.7096 - val_accuracy: 0.7808
Epoch 30/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2780 - accuracy: 0.9317 - val_loss: 0.7088 - val_accuracy: 0.7822
Epoch 31/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2708 - accuracy: 0.9331 - val_loss: 0.7088 - val_accuracy: 0.7815
Epoch 32/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2638 - accuracy: 0.9338 - val_loss: 0.7084 - val_accuracy: 0.7822
Epoch 33/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2576 - accuracy: 0.9335 - val_loss: 0.7083 - val_accuracy: 0.7815
Epoch 34/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2518 - accuracy: 0.9331 - val_loss: 0.7090 - val_accuracy: 0.7788
Epoch 35/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2459 - accuracy: 0.9331 - val_loss: 0.7095 - val_accuracy: 0.7788
Epoch 36/100
137/137 [==============================] - 1s 5ms/step - loss: 0.2407 - accuracy: 0.9347 - val_loss: 0.7106 - val_accuracy: 0.7788
Epoch 37/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2355 - accuracy: 0.9344 - val_loss: 0.7112 - val_accuracy: 0.7808
Epoch 38/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2306 - accuracy: 0.9338 - val_loss: 0.7135 - val_accuracy: 0.7808
Epoch 39/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2262 - accuracy: 0.9338 - val_loss: 0.7145 - val_accuracy: 0.7808
Epoch 40/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2219 - accuracy: 0.9347 - val_loss: 0.7161 - val_accuracy: 0.7795
Epoch 41/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2176 - accuracy: 0.9354 - val_loss: 0.7176 - val_accuracy: 0.7788
Epoch 42/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2139 - accuracy: 0.9351 - val_loss: 0.7197 - val_accuracy: 0.7795
Epoch 43/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2100 - accuracy: 0.9340 - val_loss: 0.7214 - val_accuracy: 0.7781
Epoch 44/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2066 - accuracy: 0.9342 - val_loss: 0.7237 - val_accuracy: 0.7774
Epoch 45/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2032 - accuracy: 0.9344 - val_loss: 0.7259 - val_accuracy: 0.7767
Epoch 46/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1999 - accuracy: 0.9351 - val_loss: 0.7279 - val_accuracy: 0.7774
Epoch 47/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1969 - accuracy: 0.9340 - val_loss: 0.7301 - val_accuracy: 0.7795
Epoch 48/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1940 - accuracy: 0.9367 - val_loss: 0.7329 - val_accuracy: 0.7781
Epoch 49/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1912 - accuracy: 0.9349 - val_loss: 0.7353 - val_accuracy: 0.7781
Epoch 50/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1887 - accuracy: 0.9344 - val_loss: 0.7382 - val_accuracy: 0.7774
Epoch 51/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1861 - accuracy: 0.9340 - val_loss: 0.7406 - val_accuracy: 0.7774
Epoch 52/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1836 - accuracy: 0.9370 - val_loss: 0.7432 - val_accuracy: 0.7774
Epoch 53/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1814 - accuracy: 0.9338 - val_loss: 0.7467 - val_accuracy: 0.7801
Epoch 54/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1790 - accuracy: 0.9333 - val_loss: 0.7492 - val_accuracy: 0.7815
Epoch 55/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1770 - accuracy: 0.9363 - val_loss: 0.7520 - val_accuracy: 0.7815
Epoch 56/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1750 - accuracy: 0.9351 - val_loss: 0.7555 - val_accuracy: 0.7788
Epoch 57/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1731 - accuracy: 0.9340 - val_loss: 0.7588 - val_accuracy: 0.7808
Epoch 58/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1710 - accuracy: 0.9347 - val_loss: 0.7619 - val_accuracy: 0.7808
Epoch 59/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1691 - accuracy: 0.9360 - val_loss: 0.7651 - val_accuracy: 0.7815
Epoch 60/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1674 - accuracy: 0.9347 - val_loss: 0.7686 - val_accuracy: 0.7808
Epoch 61/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1659 - accuracy: 0.9344 - val_loss: 0.7718 - val_accuracy: 0.7815
Epoch 62/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1642 - accuracy: 0.9342 - val_loss: 0.7748 - val_accuracy: 0.7842
Epoch 63/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1628 - accuracy: 0.9360 - val_loss: 0.7787 - val_accuracy: 0.7836
Epoch 64/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1613 - accuracy: 0.9335 - val_loss: 0.7823 - val_accuracy: 0.7849
Epoch 65/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1599 - accuracy: 0.9356 - val_loss: 0.7855 - val_accuracy: 0.7856
Epoch 66/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1585 - accuracy: 0.9367 - val_loss: 0.7888 - val_accuracy: 0.7849
Epoch 67/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1572 - accuracy: 0.9342 - val_loss: 0.7932 - val_accuracy: 0.7863
Epoch 68/100
137/137 [==============================] - 1s 8ms/step - loss: 0.1559 - accuracy: 0.9354 - val_loss: 0.7968 - val_accuracy: 0.7849
Epoch 69/100
137/137 [==============================] - 1s 6ms/step - loss: 0.1546 - accuracy: 0.9381 - val_loss: 0.8002 - val_accuracy: 0.7856
Epoch 70/100
137/137 [==============================] - 1s 6ms/step - loss: 0.1535 - accuracy: 0.9338 - val_loss: 0.8037 - val_accuracy: 0.7856
Epoch 71/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1523 - accuracy: 0.9349 - val_loss: 0.8071 - val_accuracy: 0.7842
Epoch 72/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1514 - accuracy: 0.9347 - val_loss: 0.8110 - val_accuracy: 0.7849
Epoch 73/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1503 - accuracy: 0.9344 - val_loss: 0.8150 - val_accuracy: 0.7849
Epoch 74/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1493 - accuracy: 0.9347 - val_loss: 0.8190 - val_accuracy: 0.7870
Epoch 75/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1484 - accuracy: 0.9356 - val_loss: 0.8224 - val_accuracy: 0.7842
Epoch 76/100
137/137 [==============================] - 1s 5ms/step - loss: 0.1474 - accuracy: 0.9358 - val_loss: 0.8271 - val_accuracy: 0.7856
Epoch 77/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1463 - accuracy: 0.9335 - val_loss: 0.8310 - val_accuracy: 0.7849
Epoch 78/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1455 - accuracy: 0.9349 - val_loss: 0.8349 - val_accuracy: 0.7863
Epoch 79/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1448 - accuracy: 0.9342 - val_loss: 0.8389 - val_accuracy: 0.7877
Epoch 80/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1439 - accuracy: 0.9342 - val_loss: 0.8428 - val_accuracy: 0.7863
Epoch 81/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1430 - accuracy: 0.9326 - val_loss: 0.8474 - val_accuracy: 0.7856
Epoch 82/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1423 - accuracy: 0.9354 - val_loss: 0.8513 - val_accuracy: 0.7863
Epoch 83/100
137/137 [==============================] - 1s 5ms/step - loss: 0.1416 - accuracy: 0.9354 - val_loss: 0.8560 - val_accuracy: 0.7836
Epoch 84/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1409 - accuracy: 0.9379 - val_loss: 0.8595 - val_accuracy: 0.7849
Epoch 85/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1403 - accuracy: 0.9354 - val_loss: 0.8633 - val_accuracy: 0.7836
Epoch 86/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1394 - accuracy: 0.9333 - val_loss: 0.8678 - val_accuracy: 0.7836
Epoch 87/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1389 - accuracy: 0.9349 - val_loss: 0.8727 - val_accuracy: 0.7842
Epoch 88/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1382 - accuracy: 0.9351 - val_loss: 0.8763 - val_accuracy: 0.7842
Epoch 89/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1378 - accuracy: 0.9351 - val_loss: 0.8805 - val_accuracy: 0.7849
Epoch 90/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1370 - accuracy: 0.9338 - val_loss: 0.8851 - val_accuracy: 0.7836
Epoch 91/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1365 - accuracy: 0.9338 - val_loss: 0.8884 - val_accuracy: 0.7849
Epoch 92/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1358 - accuracy: 0.9365 - val_loss: 0.8933 - val_accuracy: 0.7842
Epoch 93/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1354 - accuracy: 0.9374 - val_loss: 0.8968 - val_accuracy: 0.7842
Epoch 94/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1348 - accuracy: 0.9349 - val_loss: 0.9014 - val_accuracy: 0.7849
Epoch 95/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1345 - accuracy: 0.9347 - val_loss: 0.9061 - val_accuracy: 0.7836
Epoch 96/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1338 - accuracy: 0.9367 - val_loss: 0.9104 - val_accuracy: 0.7836
Epoch 97/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1332 - accuracy: 0.9351 - val_loss: 0.9144 - val_accuracy: 0.7856
Epoch 98/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1330 - accuracy: 0.9356 - val_loss: 0.9193 - val_accuracy: 0.7849
Epoch 99/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1323 - accuracy: 0.9347 - val_loss: 0.9237 - val_accuracy: 0.7836
Epoch 100/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1319 - accuracy: 0.9360 - val_loss: 0.9278 - val_accuracy: 0.7842

&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Finally, we fit the model with the required arguments. First we need to provide the training data. While training the model I used the &amp;lt;code&amp;gt;.toarray()&amp;lt;/code&amp;gt; method because fitting requires a dense array, whereas the vectorizer gives us a sparse matrix; without the conversion we would get an error.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;We set epochs to 100, which means the model trains for up to 100 passes over the data. We could also set it to 20 or 25; if the model already fits well within 20 iterations, training longer gains little, and pushing further risks overfitting. Hence the right choice of epoch count is crucial.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Similarly, in the above example we set batch_size to 32. The batch size splits the training data into mini-batches of 32 samples each; the gradient is computed on each batch and the weight updates are accumulated across batches. A smaller batch size means more updates per epoch and usually a longer training time, and vice versa.&amp;lt;/p&amp;gt;
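&amp;lt;p&amp;gt;The "137/137" shown in the training log follows directly from the batch size: with roughly 4,378 training samples (about 75% of the 5,838 rows, the default train/test split) and batches of 32, the number of gradient steps per epoch works out to:&amp;lt;/p&amp;gt;

```python
import math

n_train = 4378   # ~75% of the 5838 samples (default train_test_split)
batch_size = 32

# The last, possibly partial, batch still counts as one step
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)  # 137, matching the "137/137" in the log
```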
&amp;lt;h1&amp;gt;
  &amp;lt;a name="performance-of-model" href="#performance-of-model" class="anchor"&amp;gt;
  &amp;lt;/a&amp;gt;
  Performance of model
&amp;lt;/h1&amp;gt;

&amp;lt;p&amp;gt;In the above example we trained the model using only an input and an output layer. Our model works pretty well: both the training and validation accuracy reach a satisfactory level, and the loss keeps decreasing at each iteration.&amp;lt;br&amp;gt;
&amp;lt;/p&amp;gt;
&amp;lt;div class="highlight"&amp;gt;&amp;lt;pre class="highlight plaintext"&amp;gt;&amp;lt;code&amp;gt;def performance_model(hist, model, X_test, Y_test, classes):
  # subplot
  fig, axes = plt.subplots(1,2, figsize=(20,5))
  axes[0].set_title('Accuracy score')
  axes[0].plot(history.history['accuracy'])
  axes[0].plot(history.history['val_accuracy'])
  axes[0].legend(['accuracy', 'val_accuracy'])
  # plt.show()
  # plt.figure(figsize=(9,7))
  axes[1].set_title('Loss value')
  axes[1].plot(history.history['loss'])
  axes[1].plot(history.history['val_loss'])
  axes[1].legend(['loss', 'val_loss'])
  plt.show()

  predictions = model.predict(X_test.toarray())

  y_test_evaluate = np.argmax(Y_test, axis=1)
  pred = np.argmax(predictions, axis=1)

  cm = confusion_matrix(y_test_evaluate, pred)

  plt.figure(figsize=(8,8))
  plt.title('Confusion matrix on test data')
  sns.heatmap(cm, annot=True, fmt='d', xticklabels=classes.keys(), yticklabels=classes.keys(), 
              cmap=plt.cm.Blues, cbar=False, annot_kws={'size':14})
  plt.xlabel('Predicted Label')
  plt.ylabel('True Label')
  plt.show()

performance_model(history, model, X_test, Y_test, classes)

&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;img src="https://iamdurga.github.io/assets/simple_nn/output_31_0.png" alt="image"&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;img src="https://iamdurga.github.io/assets/simple_nn/output_31_1.png" alt="image"&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;The above graph shows that both the training and validation accuracy increase significantly, but there is a noticeable gap between the two. This may happen for the following reasons:&amp;lt;/p&amp;gt;

&amp;lt;ul&amp;gt;
&amp;lt;li&amp;gt;A class imbalance&amp;lt;/li&amp;gt;
&amp;lt;li&amp;gt;Underfitting&amp;lt;/li&amp;gt;
&amp;lt;li&amp;gt;Overfitting&amp;lt;/li&amp;gt;
&amp;lt;li&amp;gt;Imprecise preprocessing&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;

&amp;lt;p&amp;gt;Similarly, in our classification problem, accuracy alone is not a good measure of model performance, because we have multiple classes and there may be class imbalance. Hence, to evaluate our model, we used the multiclass confusion matrix. In multiclass classification problems, the F1 score is a better measure of model performance.&amp;lt;/p&amp;gt;
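&amp;lt;p&amp;gt;Once we have class indices for the true and predicted labels (as with &amp;lt;code&amp;gt;np.argmax&amp;lt;/code&amp;gt; above), the F1 score can be computed with scikit-learn; a small sketch with made-up labels for a hypothetical 3-class problem:&amp;lt;/p&amp;gt;

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted class indices for a 3-class problem
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

# 'macro' averages the per-class F1 scores equally, which is
# informative when the classes are imbalanced
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 3))  # 0.867
```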
&amp;lt;h1&amp;gt;
  &amp;lt;a name="adding-one-more-hidden-layer" href="#adding-one-more-hidden-layer" class="anchor"&amp;gt;
  &amp;lt;/a&amp;gt;
  Adding one more hidden layer
&amp;lt;/h1&amp;gt;

&amp;lt;p&amp;gt;Let's extend our model by adding a hidden layer between the input and output layers and compare its performance with the previous one.&amp;lt;br&amp;gt;
&amp;lt;/p&amp;gt;
&amp;lt;div class="highlight"&amp;gt;&amp;lt;pre class="highlight plaintext"&amp;gt;&amp;lt;code&amp;gt;# simple NN with one hidden layer
model = Sequential()
model.add(Dense(input_dim=9401, units=800))
model.add(Dense(800))
model.add(Dense(out_shape, activation='softmax'))
model.summary()


Model: "sequential_1"
_________________________________________________________________
 Layer (type) Output Shape Param #   
=================================================================
 dense_1 (Dense) (None, 800) 7521600   

 dense_2 (Dense) (None, 800) 640800    

 dense_3 (Dense) (None, 7) 5607      

=================================================================
Total params: 8,168,007
Trainable params: 8,168,007
Non-trainable params: 0
_________________________________________________________________

&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;In the above model everything is the same as before, except that we add an extra hidden layer of 800 units, i.e. 800 neurons; this raises the trainable parameter count from 65,814 to 8,168,007.&amp;lt;br&amp;gt;
&amp;lt;/p&amp;gt;
&amp;lt;div class="highlight"&amp;gt;&amp;lt;pre class="highlight plaintext"&amp;gt;&amp;lt;code&amp;gt;model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Let's fit the model.&amp;lt;br&amp;gt;
&amp;lt;/p&amp;gt;
&amp;lt;div class="highlight"&amp;gt;&amp;lt;pre class="highlight plaintext"&amp;gt;&amp;lt;code&amp;gt;history = model.fit(X_train.toarray(), Y_train, epochs=100, batch_size=35, validation_data=(X_test.toarray(), Y_test))


Epoch 1/100
126/126 [==============================] - 1s 8ms/step - loss: 1.0233 - accuracy: 0.6690 - val_loss: 0.7602 - val_accuracy: 0.7658
Epoch 2/100
126/126 [==============================] - 1s 7ms/step - loss: 0.3653 - accuracy: 0.8956 - val_loss: 0.8770 - val_accuracy: 0.7630
Epoch 3/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2748 - accuracy: 0.9143 - val_loss: 0.8179 - val_accuracy: 0.7719
Epoch 4/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2407 - accuracy: 0.9233 - val_loss: 0.8869 - val_accuracy: 0.7616
Epoch 5/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2325 - accuracy: 0.9203 - val_loss: 0.8362 - val_accuracy: 0.7849
Epoch 6/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2135 - accuracy: 0.9230 - val_loss: 0.8728 - val_accuracy: 0.7733
Epoch 7/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2102 - accuracy: 0.9255 - val_loss: 0.8760 - val_accuracy: 0.7788
Epoch 8/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2127 - accuracy: 0.9251 - val_loss: 0.9314 - val_accuracy: 0.7466
Epoch 9/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2107 - accuracy: 0.9230 - val_loss: 1.0136 - val_accuracy: 0.7616
Epoch 10/100
126/126 [==============================] - 1s 8ms/step - loss: 0.2009 - accuracy: 0.9274 - val_loss: 0.9555 - val_accuracy: 0.7692
Epoch 11/100
126/126 [==============================] - 1s 9ms/step - loss: 0.1977 - accuracy: 0.9274 - val_loss: 0.9830 - val_accuracy: 0.7781
Epoch 12/100
126/126 [==============================] - 1s 9ms/step - loss: 0.1908 - accuracy: 0.9278 - val_loss: 0.9244 - val_accuracy: 0.7555
Epoch 13/100
126/126 [==============================] - 1s 7ms/step - loss: 0.1828 - accuracy: 0.9258 - val_loss: 0.9641 - val_accuracy: 0.7548
Epoch 14/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1740 - accuracy: 0.9301 - val_loss: 1.0089 - val_accuracy: 0.7630
Epoch 15/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1811 - accuracy: 0.9301 - val_loss: 0.9290 - val_accuracy: 0.7603
Epoch 16/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1690 - accuracy: 0.9322 - val_loss: 0.9671 - val_accuracy: 0.7493
Epoch 17/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1743 - accuracy: 0.9280 - val_loss: 1.0466 - val_accuracy: 0.7822
Epoch 18/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1785 - accuracy: 0.9294 - val_loss: 1.2220 - val_accuracy: 0.7500
Epoch 19/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2090 - accuracy: 0.9260 - val_loss: 1.1526 - val_accuracy: 0.7514
Epoch 20/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1936 - accuracy: 0.9310 - val_loss: 1.0568 - val_accuracy: 0.7459
Epoch 21/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1746 - accuracy: 0.9278 - val_loss: 1.0345 - val_accuracy: 0.7527
Epoch 22/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1600 - accuracy: 0.9326 - val_loss: 1.0820 - val_accuracy: 0.7740
Epoch 23/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1823 - accuracy: 0.9278 - val_loss: 0.9914 - val_accuracy: 0.7664
Epoch 24/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1770 - accuracy: 0.9331 - val_loss: 1.0346 - val_accuracy: 0.7719
Epoch 25/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1738 - accuracy: 0.9342 - val_loss: 1.0696 - val_accuracy: 0.7678
Epoch 26/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1678 - accuracy: 0.9310 - val_loss: 1.0201 - val_accuracy: 0.7692
Epoch 27/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1551 - accuracy: 0.9340 - val_loss: 1.0547 - val_accuracy: 0.7521
Epoch 28/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1590 - accuracy: 0.9338 - val_loss: 0.9871 - val_accuracy: 0.7699
Epoch 29/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1527 - accuracy: 0.9340 - val_loss: 1.0202 - val_accuracy: 0.7767
Epoch 30/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1560 - accuracy: 0.9290 - val_loss: 1.0242 - val_accuracy: 0.7658
Epoch 31/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1507 - accuracy: 0.9326 - val_loss: 0.9984 - val_accuracy: 0.7781
Epoch 32/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1521 - accuracy: 0.9310 - val_loss: 1.0071 - val_accuracy: 0.7753
Epoch 33/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1542 - accuracy: 0.9322 - val_loss: 1.0931 - val_accuracy: 0.7644
Epoch 34/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1584 - accuracy: 0.9328 - val_loss: 1.0295 - val_accuracy: 0.7760
Epoch 35/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1521 - accuracy: 0.9324 - val_loss: 1.0787 - val_accuracy: 0.7801
Epoch 36/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1527 - accuracy: 0.9370 - val_loss: 1.0878 - val_accuracy: 0.7788
Epoch 37/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1695 - accuracy: 0.9317 - val_loss: 1.1328 - val_accuracy: 0.7685
Epoch 38/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1639 - accuracy: 0.9283 - val_loss: 1.0733 - val_accuracy: 0.7664
Epoch 39/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1525 - accuracy: 0.9322 - val_loss: 1.1290 - val_accuracy: 0.7829
Epoch 40/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1505 - accuracy: 0.9347 - val_loss: 1.0747 - val_accuracy: 0.7781
Epoch 41/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1483 - accuracy: 0.9319 - val_loss: 1.0657 - val_accuracy: 0.7671
Epoch 42/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1482 - accuracy: 0.9335 - val_loss: 1.0446 - val_accuracy: 0.7685
Epoch 43/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1424 - accuracy: 0.9349 - val_loss: 1.0357 - val_accuracy: 0.7863
Epoch 44/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1530 - accuracy: 0.9312 - val_loss: 1.3468 - val_accuracy: 0.7521
Epoch 45/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1688 - accuracy: 0.9296 - val_loss: 1.2449 - val_accuracy: 0.7589
Epoch 46/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1566 - accuracy: 0.9319 - val_loss: 1.1356 - val_accuracy: 0.7678
Epoch 47/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1547 - accuracy: 0.9322 - val_loss: 1.0903 - val_accuracy: 0.7644
Epoch 48/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1478 - accuracy: 0.9303 - val_loss: 1.0850 - val_accuracy: 0.7637
Epoch 49/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1443 - accuracy: 0.9335 - val_loss: 1.0880 - val_accuracy: 0.7562
Epoch 50/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1442 - accuracy: 0.9312 - val_loss: 1.0931 - val_accuracy: 0.7815
Epoch 51/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1503 - accuracy: 0.9349 - val_loss: 1.0529 - val_accuracy: 0.7630
Epoch 52/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1448 - accuracy: 0.9324 - val_loss: 1.0910 - val_accuracy: 0.7658
Epoch 53/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1444 - accuracy: 0.9290 - val_loss: 1.0953 - val_accuracy: 0.7692
Epoch 54/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1427 - accuracy: 0.9335 - val_loss: 1.0905 - val_accuracy: 0.7726
Epoch 55/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1402 - accuracy: 0.9358 - val_loss: 1.1630 - val_accuracy: 0.7664
Epoch 56/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1372 - accuracy: 0.9358 - val_loss: 1.2109 - val_accuracy: 0.7678
Epoch 57/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1468 - accuracy: 0.9322 - val_loss: 1.1316 - val_accuracy: 0.7568
Epoch 58/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1442 - accuracy: 0.9335 - val_loss: 1.1098 - val_accuracy: 0.7582
Epoch 59/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1422 - accuracy: 0.9335 - val_loss: 1.1099 - val_accuracy: 0.7582
Epoch 60/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1409 - accuracy: 0.9340 - val_loss: 1.0642 - val_accuracy: 0.7719
Epoch 61/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1413 - accuracy: 0.9356 - val_loss: 1.1239 - val_accuracy: 0.7678
Epoch 62/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1424 - accuracy: 0.9338 - val_loss: 1.1617 - val_accuracy: 0.7637
Epoch 63/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1447 - accuracy: 0.9306 - val_loss: 1.0902 - val_accuracy: 0.7603
Epoch 64/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1515 - accuracy: 0.9342 - val_loss: 1.1615 - val_accuracy: 0.7630
Epoch 65/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1452 - accuracy: 0.9340 - val_loss: 1.1811 - val_accuracy: 0.7603
Epoch 66/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1404 - accuracy: 0.9338 - val_loss: 1.1923 - val_accuracy: 0.7678
Epoch 67/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1387 - accuracy: 0.9372 - val_loss: 1.1625 - val_accuracy: 0.7671
Epoch 68/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1384 - accuracy: 0.9328 - val_loss: 1.1772 - val_accuracy: 0.7678
Epoch 69/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1445 - accuracy: 0.9333 - val_loss: 1.1646 - val_accuracy: 0.7719
Epoch 70/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1409 - accuracy: 0.9317 - val_loss: 1.1859 - val_accuracy: 0.7733
Epoch 71/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1363 - accuracy: 0.9340 - val_loss: 1.1493 - val_accuracy: 0.7767
Epoch 72/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1413 - accuracy: 0.9347 - val_loss: 1.1126 - val_accuracy: 0.7705
Epoch 73/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1395 - accuracy: 0.9335 - val_loss: 1.1422 - val_accuracy: 0.7637
Epoch 74/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1485 - accuracy: 0.9358 - val_loss: 1.4028 - val_accuracy: 0.7541
Epoch 75/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1518 - accuracy: 0.9354 - val_loss: 1.4361 - val_accuracy: 0.7726
Epoch 76/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1571 - accuracy: 0.9335 - val_loss: 1.3987 - val_accuracy: 0.7589
Epoch 77/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1570 - accuracy: 0.9347 - val_loss: 1.3393 - val_accuracy: 0.7637
Epoch 78/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1512 - accuracy: 0.9347 - val_loss: 1.3479 - val_accuracy: 0.7623
Epoch 79/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1458 - accuracy: 0.9331 - val_loss: 1.3049 - val_accuracy: 0.7562
Epoch 80/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1479 - accuracy: 0.9340 - val_loss: 1.3393 - val_accuracy: 0.7644
Epoch 81/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1394 - accuracy: 0.9386 - val_loss: 1.2484 - val_accuracy: 0.7671
Epoch 82/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1391 - accuracy: 0.9370 - val_loss: 1.2412 - val_accuracy: 0.7644
Epoch 83/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1368 - accuracy: 0.9365 - val_loss: 1.2545 - val_accuracy: 0.7664
Epoch 84/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1427 - accuracy: 0.9340 - val_loss: 1.2018 - val_accuracy: 0.7603
Epoch 85/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1325 - accuracy: 0.9372 - val_loss: 1.2868 - val_accuracy: 0.7651
Epoch 86/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1362 - accuracy: 0.9349 - val_loss: 1.2306 - val_accuracy: 0.7658
Epoch 87/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1338 - accuracy: 0.9333 - val_loss: 1.2871 - val_accuracy: 0.7616
Epoch 88/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1367 - accuracy: 0.9331 - val_loss: 1.2153 - val_accuracy: 0.7836
Epoch 89/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1351 - accuracy: 0.9363 - val_loss: 1.2591 - val_accuracy: 0.7699
Epoch 90/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1315 - accuracy: 0.9349 - val_loss: 1.2360 - val_accuracy: 0.7616
Epoch 91/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1317 - accuracy: 0.9372 - val_loss: 1.3478 - val_accuracy: 0.7534
Epoch 92/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1383 - accuracy: 0.9317 - val_loss: 1.2972 - val_accuracy: 0.7685
Epoch 93/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1365 - accuracy: 0.9349 - val_loss: 1.2775 - val_accuracy: 0.7562
Epoch 94/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1374 - accuracy: 0.9406 - val_loss: 1.2316 - val_accuracy: 0.7644
Epoch 95/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1352 - accuracy: 0.9338 - val_loss: 1.2494 - val_accuracy: 0.7630
Epoch 96/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1555 - accuracy: 0.9370 - val_loss: 1.3208 - val_accuracy: 0.7548
Epoch 97/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1346 - accuracy: 0.9360 - val_loss: 1.2599 - val_accuracy: 0.7651
Epoch 98/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1382 - accuracy: 0.9363 - val_loss: 1.2476 - val_accuracy: 0.7616
Epoch 99/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1324 - accuracy: 0.9344 - val_loss: 1.2851 - val_accuracy: 0.7637
Epoch 100/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1432 - accuracy: 0.9358 - val_loss: 1.2710 - val_accuracy: 0.7664


performance_model(history, model, X_test, Y_test, classes)

&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;img src="https://iamdurga.github.io/assets/simple_nn/output_41_0.png" alt="image"&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;img src="https://iamdurga.github.io/assets/simple_nn/output_41_1.png" alt="image"&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Compared to the previous model, the validation metrics fluctuate, which may be due to overfitting or possibly underfitting. Let’s add some dropout. Dropout is used to regularize our model in case of overfitting.&amp;lt;br&amp;gt;
&amp;lt;/p&amp;gt;
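Before wiring Dropout into the model below, it helps to see what the layer actually does: during training a random fraction `rate` of activations is zeroed, and the survivors are rescaled by `1/(1 - rate)` so the expected activation is unchanged (inverted dropout). This is a minimal numpy sketch for illustration, not Keras's internal implementation:

```python
import numpy as np

def dropout(x, rate=0.5, rng=None):
    """Inverted dropout: zero a fraction `rate` of activations at
    training time and rescale the survivors to keep the expected value."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= rate   # keep each unit with prob 1 - rate
    return x * mask / (1.0 - rate)

x = np.ones((4, 800))        # a batch of activations, same width as the layers below
y = dropout(x, rate=0.5)
print(y.shape)               # shape is unchanged; values are now 0.0 or 2.0
```

At inference time no units are dropped; the rescaling during training is what makes that consistent.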
&amp;lt;div class="highlight"&amp;gt;&amp;lt;pre class="highlight plaintext"&amp;gt;&amp;lt;code&amp;gt;# simple NN
model = Sequential()
model.add(Dense(input_dim=in_shape, units=800))
model.add(Dense(800))
model.add(Dropout(0.5))
model.add(Dense(out_shape, activation='softmax'))

model.summary()


Model: "sequential_2"
_________________________________________________________________
 Layer (type) Output Shape Param #   
=================================================================
 dense_4 (Dense) (None, 800) 7521600   

 dense_5 (Dense) (None, 800) 640800    

 dropout (Dropout) (None, 800) 0         

 dense_6 (Dense) (None, 7) 5607      

=================================================================
Total params: 8,168,007
Trainable params: 8,168,007
Non-trainable params: 0
_________________________________________________________________


model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])


history = model.fit(X_train.toarray(), Y_train, epochs=10, batch_size=32, validation_data=(X_test.toarray(), Y_test))


Epoch 1/10
137/137 [==============================] - 2s 8ms/step - loss: 1.0331 - accuracy: 0.6592 - val_loss: 0.7495 - val_accuracy: 0.7630
Epoch 2/10
137/137 [==============================] - 1s 6ms/step - loss: 0.3937 - accuracy: 0.8856 - val_loss: 0.8275 - val_accuracy: 0.7719
Epoch 3/10
137/137 [==============================] - 1s 6ms/step - loss: 0.3173 - accuracy: 0.9100 - val_loss: 0.8168 - val_accuracy: 0.7747
Epoch 4/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2610 - accuracy: 0.9223 - val_loss: 0.8826 - val_accuracy: 0.7877
Epoch 5/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2415 - accuracy: 0.9258 - val_loss: 0.8702 - val_accuracy: 0.7740
Epoch 6/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2325 - accuracy: 0.9267 - val_loss: 0.9430 - val_accuracy: 0.7630
Epoch 7/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2360 - accuracy: 0.9212 - val_loss: 0.9362 - val_accuracy: 0.7781
Epoch 8/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2266 - accuracy: 0.9233 - val_loss: 0.9364 - val_accuracy: 0.7815
Epoch 9/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2180 - accuracy: 0.9246 - val_loss: 0.8815 - val_accuracy: 0.7747
Epoch 10/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2200 - accuracy: 0.9269 - val_loss: 0.8999 - val_accuracy: 0.7822


performance_model(history, model, X_test, Y_test, classes)

&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;img src="https://iamdurga.github.io/assets/simple_nn/output_46_0.png" alt="image"&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;img src="https://iamdurga.github.io/assets/simple_nn/output_46_1.png" alt="image"&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;After adding &amp;lt;code&amp;gt;Dropout&amp;lt;/code&amp;gt;, our model does pretty well. In the block of code above we also reduced the number of epochs to get better performance from our model.&amp;lt;/p&amp;gt;
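Manually cutting the number of epochs approximates early stopping: train once, then keep the epoch with the lowest validation loss. A minimal sketch over a Keras-style `history.history` dict (the numbers here are made up and shortened for illustration):

```python
# history.history maps metric names to one value per epoch (Keras layout);
# these values are hypothetical, not taken from the training runs above.
history = {
    "loss":     [1.03, 0.39, 0.32, 0.26, 0.24],
    "val_loss": [0.75, 0.83, 0.82, 0.88, 0.87],
}

# Index of the epoch with the lowest validation loss.
best_epoch = min(range(len(history["val_loss"])),
                 key=history["val_loss"].__getitem__)
print(best_epoch + 1)  # Keras logs number epochs from 1
```

Keras can also automate this with the `EarlyStopping` callback (e.g. `restore_best_weights=True`), which monitors `val_loss` during `fit` instead of post-hoc.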
&amp;lt;h1&amp;gt;
  &amp;lt;a name="lets-add-few-more-layer-dropout-and-observe-the-result" href="#lets-add-few-more-layer-dropout-and-observe-the-result" class="anchor"&amp;gt;
  &amp;lt;/a&amp;gt;
  Let’s Add few more Layer, Dropout and Observe the result
&amp;lt;/h1&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;
&amp;lt;div class="highlight"&amp;gt;&amp;lt;pre class="highlight plaintext"&amp;gt;&amp;lt;code&amp;gt;# simple NN
model = Sequential()
model.add(Dense(input_dim=in_shape, units=800))
model.add(Dense(800))
model.add(Dropout(0.2))
model.add(Dense(400))
#model.add(Dropout(0.2))
model.add(Dense(out_shape, activation='softmax'))

model.summary()


Model: "sequential_4"
_________________________________________________________________
 Layer (type) Output Shape Param #   
=================================================================
 dense_11 (Dense) (None, 800) 7521600   

 dense_12 (Dense) (None, 800) 640800    

 dropout_2 (Dropout) (None, 800) 0         

 dense_13 (Dense) (None, 400) 320400    

 dense_14 (Dense) (None, 7) 2807      

=================================================================
Total params: 8,485,607
Trainable params: 8,485,607
Non-trainable params: 0
_________________________________________________________________


model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])


history = model.fit(X_train.toarray(), Y_train, epochs=10, batch_size=32, validation_data=(X_test.toarray(), Y_test))


Epoch 1/10
137/137 [==============================] - 2s 9ms/step - loss: 1.0494 - accuracy: 0.6585 - val_loss: 0.7984 - val_accuracy: 0.7610
Epoch 2/10
137/137 [==============================] - 1s 6ms/step - loss: 0.4105 - accuracy: 0.8862 - val_loss: 0.9538 - val_accuracy: 0.7534
Epoch 3/10
137/137 [==============================] - 1s 6ms/step - loss: 0.3477 - accuracy: 0.9091 - val_loss: 0.9417 - val_accuracy: 0.7445
Epoch 4/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2813 - accuracy: 0.9175 - val_loss: 0.9003 - val_accuracy: 0.7781
Epoch 5/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2623 - accuracy: 0.9217 - val_loss: 0.8869 - val_accuracy: 0.7767
Epoch 6/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2364 - accuracy: 0.9219 - val_loss: 0.8637 - val_accuracy: 0.7760
Epoch 7/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2442 - accuracy: 0.9221 - val_loss: 0.9527 - val_accuracy: 0.7932
Epoch 8/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2369 - accuracy: 0.9203 - val_loss: 0.9556 - val_accuracy: 0.7733
Epoch 9/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2210 - accuracy: 0.9233 - val_loss: 0.9772 - val_accuracy: 0.7808
Epoch 10/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2238 - accuracy: 0.9258 - val_loss: 0.9673 - val_accuracy: 0.7705


performance_model(history, model, X_test, Y_test, classes)

&amp;lt;/code&amp;gt;&amp;lt;/pre&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;img src="https://iamdurga.github.io/assets/simple_nn/output_51_0.png" alt="image"&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;img src="https://iamdurga.github.io/assets/simple_nn/output_51_1.png" alt="image"&amp;gt;&amp;lt;/p&amp;gt;
&amp;lt;h1&amp;gt;
  &amp;lt;a name="conclusion" href="#conclusion" class="anchor"&amp;gt;
  &amp;lt;/a&amp;gt;
  Conclusion:
&amp;lt;/h1&amp;gt;

&amp;lt;p&amp;gt;The graph above shows that our training accuracy and validation accuracy are both increasing significantly, but there is a significant gap between the two. This may happen for the following reasons:&amp;lt;/p&amp;gt;

&amp;lt;ul&amp;gt;
&amp;lt;li&amp;gt;A class imbalance&amp;lt;/li&amp;gt;
&amp;lt;li&amp;gt;Underfitting&amp;lt;/li&amp;gt;
&amp;lt;li&amp;gt;Overfitting&amp;lt;/li&amp;gt;
&amp;lt;li&amp;gt;Imprecise preprocessing&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;

&amp;lt;p&amp;gt;Hence we can improve our model and get better performance by adding more training data and by better preprocessing (we currently use only a limited set of stop words).&amp;lt;/p&amp;gt;
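One of the listed causes, class imbalance, is often countered by weighting the loss per class; Keras's `model.fit` accepts a `class_weight` dict. A common heuristic (sketched here with made-up per-class counts) weights each class inversely to its frequency:

```python
# Hypothetical sample counts for a 7-class problem (not the real dataset).
counts = {0: 1200, 1: 900, 2: 700, 3: 500, 4: 400, 5: 480, 6: 200}

total = sum(counts.values())
n_classes = len(counts)
# weight = total / (n_classes * count): rare classes get larger weights,
# so the weighted class totals are equalized in the loss.
class_weight = {c: total / (n_classes * n) for c, n in counts.items()}

print(class_weight[6] > class_weight[0])  # the rarest class is weighted most
```

The resulting dict would then be passed as `model.fit(..., class_weight=class_weight)`.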
&lt;/p&gt;

</description>
    </item>
    <item>
      <title>R Exercise: Social Network Analysis</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Fri, 02 Sep 2022 03:06:03 +0000</pubDate>
      <link>https://dev.to/iamdurga/r-exercise-social-network-analysis-2inh</link>
      <guid>https://dev.to/iamdurga/r-exercise-social-network-analysis-2inh</guid>
      <description>&lt;h1&gt;
  
  
  Social Network Analysis
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Definition
&lt;/h2&gt;

&lt;p&gt;Social networks are simply networks of social interactions and personal relationships. Think about our group of friends and how we got to know them. Maybe we met them a while ago at school, or perhaps through a hobby or our community. In fact, 72% of all Internet users are active on social media today, taking part in social interactions and developing personal relationships. However, to understand social networks we do not need to look only at the Internet or social media; they come in a variety of forms in our daily lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Social Network Analysis
&lt;/h2&gt;

&lt;p&gt;Social network analysis (SNA), also known as network science, is a field of data analytics that uses networks and graph theory to understand social structures. We can see networks all around us, such as road networks, online networks, and social media networks like Facebook and Twitter. Learning SNA allows us to explore insights from various data sources.&lt;/p&gt;

&lt;h2&gt;
  
  
  SNA Graph
&lt;/h2&gt;

&lt;p&gt;We are all familiar with graphs: a graph is simply a nonempty set of vertices together with a set of edges. To build SNA graphs, two key components are required: actors and relationships. Here an actor represents a vertex, and a relationship is an edge between two actors. Let us draw an SNA graph in R. To do this, we should have the &lt;code&gt;igraph&lt;/code&gt; package already installed in R or RStudio.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(igraph)


## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
## decompose, spectrum

## The following object is masked from 'package:base':
## 
## union


g &amp;lt;- graph(c(1,2))
plot(g)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-1-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-1-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the figure, we can see a directed edge from node 1 to node 2, which makes it clear that by default the graph is directed. Since we cannot clearly see the node and edge, let’s increase their size and give the node a different color.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot(g,
vertex.color = 'green',
vertex.size = 40,
edge.color ='red',
edge.size = 20)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-2-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-2-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have changed the node color and size, let’s add more nodes to the graph with the following code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g &amp;lt;- graph(c(1,2,2,3,3,4,4,1))
plot(g,
     vertex.color = 'green',
     vertex.size =40,
     edge.color = 'red',
     edge.size = 20)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-3-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We got a graph with the four vertices 1, 2, 3, 4, and it is again directed. Let’s make it undirected; for this we need to set directed to FALSE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g &amp;lt;- graph(c(1,2,2,3,3,4,4,1),directed = FALSE)
plot(g,
     vertex.color = 'green',
     vertex.size =40,
     edge.color = 'red',
     edge.size = 20)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-4-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We got the type of graph we wanted. Now let’s move forward. We can specify the number of vertices without writing them all out; see the following code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g &amp;lt;- graph(c(1,2,2,3,3,4,4,1), 
directed = F, n=7)
plot(g,
     vertex.color = 'green',
     vertex.size =40,
     edge.color = 'red',
     edge.size = 20)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-5-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-5-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the graph above we asked for seven vertices. Among them we can see three isolated nodes; they are isolated because we did not specify any edges or relationships for them. The graph also shows that it is not directed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adjacency Matrix
&lt;/h2&gt;

&lt;p&gt;Let’s see what will happen if we type only &lt;code&gt;g[]&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g[]


## 7 x 7 sparse Matrix of class "dgCMatrix"
##                   
## [1,] . 1 . 1 . . .
## [2,] 1 . 1 . . . .
## [3,] . 1 . 1 . . .
## [4,] 1 . 1 . . . .
## [5,] . . . . . . .
## [6,] . . . . . . .
## [7,] . . . . . . .

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us the 7 by 7 adjacency matrix. An adjacency matrix has a 1 wherever there is an edge between two vertices and a 0 otherwise. In the sparse matrix above, zeros are simply printed as &lt;code&gt;.&lt;/code&gt;.&lt;/p&gt;
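The same construction can be cross-checked outside igraph. Here is a small, purely illustrative Python sketch that builds the 0/1 adjacency matrix for the undirected edge list used above, padded to 7 vertices:

```python
def adjacency(edges, n):
    """Build an n x n 0/1 adjacency matrix from undirected edge pairs
    (1-based vertex labels, matching the igraph call above)."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1
        A[v - 1][u - 1] = 1   # undirected: the matrix is symmetric
    return A

A = adjacency([(1, 2), (2, 3), (3, 4), (4, 1)], n=7)
print(A[0])  # [0, 1, 0, 1, 0, 0, 0] -- vertex 1 is linked to vertices 2 and 4
```

Rows 5 to 7 are all zeros, matching the three isolated nodes seen earlier.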

&lt;p&gt;Let us try to build our graph using text labels in place of numbers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g1 &amp;lt;-
graph(c("Binu","Binita","Binita","Rita"
,"Rita","Binu","Binu","Rita", "Anju", 
"Binita"))
plot(g1,
vertex.color = "green",
vertex.size = 40,
edge.color = "red",
edge.size = 5)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-7-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-7-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we want to check the features of g1, we simply type g1 and press Ctrl+Enter. We get the following output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g1


## IGRAPH d030509 DN-- 4 5 -- 
## + attr: name (v/c)
## + edges from d030509 (vertex names):
## [1] Binu -&amp;gt;Binita Binita-&amp;gt;Rita Rita -&amp;gt;Binu Binu -&amp;gt;Rita Anju -&amp;gt;Binita

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It shows that the graph has 4 nodes and 5 edges, with the edges directed as &lt;code&gt;Binu -&amp;gt;Binita, Binita-&amp;gt;Rita, Rita -&amp;gt;Binu, Binu -&amp;gt;Rita, Anju -&amp;gt;Binita&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Degree
&lt;/h2&gt;

&lt;p&gt;Degree is the number of connections each node has. Let’s check the degrees in graph g1. To do so we can call &lt;code&gt;degree(g1)&lt;/code&gt; or &lt;code&gt;degree(g1, mode='all')&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;degree(g1) 


## Binu Binita Rita Anju 
## 3 3 3 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The degree of Binu is 3; similarly, Anju has degree 1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;degree(g1, mode='all')


## Binu Binita Rita Anju 
## 3 3 3 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Diameter
&lt;/h2&gt;

&lt;p&gt;The diameter is the length of the longest shortest path between any two nodes in the network. Now, let’s check the diameter of the graph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;diameter(g1, directed = F, weights = 
NA)


## [1] 2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We got a diameter of 2, realized for example by the paths &lt;code&gt;Anju &amp;lt;- Binita &amp;lt;- Rita&lt;/code&gt; and &lt;code&gt;Anju &amp;lt;- Binita &amp;lt;- Binu&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge density
&lt;/h2&gt;

&lt;p&gt;Edge density is the ratio of actual edges to possible edges: &lt;code&gt;ecount(g1)/(vcount(g1)*(vcount(g1) -1))&lt;/code&gt; for a directed graph. We can calculate it with the following code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;edge_density(g1, loops = F)


## [1] 0.4166667

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We got the value of edge density 0.4166667.&lt;/p&gt;
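We can verify this number by hand from the formula: g1 has 5 directed edges out of 4 × 3 = 12 possible ordered pairs of distinct vertices. A quick arithmetic check (plain Python, not igraph):

```python
ecount, vcount = 5, 4                       # edges and vertices of g1
density = ecount / (vcount * (vcount - 1))  # directed graph: n*(n-1) possible edges
print(round(density, 7))                    # 0.4166667, matching edge_density(g1)
```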

&lt;h2&gt;
  
  
  Reciprocity
&lt;/h2&gt;

&lt;p&gt;Reciprocity is the proportion of directed edges that are reciprocated, i.e. part of a mutual pair.&lt;/p&gt;

&lt;p&gt;Total edges = 5&lt;/p&gt;

&lt;p&gt;Tied edges = 2&lt;/p&gt;

&lt;p&gt;Reciprocity = 2/5 = 0.4&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reciprocity(g1)


## [1] 0.4

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Closeness
&lt;/h2&gt;

&lt;p&gt;Closeness measures how near a node is to all other nodes. Let’s now get the closeness of the graph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;closeness(g1, mode = "all", weights = NA)


## Binu Binita Rita Anju 
## 0.2500000 0.3333333 0.2500000 0.2000000

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the result above we can see that Binita is closest to the other three persons and Anju is farthest from the other three.&lt;/p&gt;

&lt;h2&gt;
  
  
  Betweenness
&lt;/h2&gt;

&lt;p&gt;Betweenness counts how often a node lies on the shortest paths between other nodes. Let’s calculate the betweenness of g1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;betweenness(g1, directed = T, weights = NA)


## Binu Binita Rita Anju 
## 1 2 2 0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Binita and Rita each lie on 2 shortest paths between other nodes; similarly, Binu lies on 1 and Anju on none.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge Betweenness
&lt;/h2&gt;

&lt;p&gt;For every pair of vertices in a connected graph, there exists at least one shortest path between them; the edge betweenness of an edge is the number of such shortest paths that pass through it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;edge_betweenness(g1, directed = T, weights = NA)


## [1] 2 4 4 1 3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SNA in TwitterData
&lt;/h2&gt;

&lt;p&gt;Here I have loaded the &lt;code&gt;Twitterdata&lt;/code&gt; from my local machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;load("F:/MDS R/termDocMatrix.rdata")


m&amp;lt;- as.matrix(termDocMatrix)
termM &amp;lt;- m %*% t(m)
termM[1:10,1:10]


## Terms
## Terms analysis applications code computing data examples introduction
## analysis 23 0 1 0 4 4 2
## applications 0 9 0 0 8 0 0
## code 1 0 9 0 1 6 0
## computing 0 0 0 10 2 0 0
## data 4 8 1 2 85 5 3
## examples 4 0 6 0 5 17 2
## introduction 2 0 0 0 3 2 10
## mining 4 7 3 1 50 5 3
## network 12 0 1 0 0 2 2
## package 2 1 0 2 12 2 0
## Terms
## Terms mining network package
## analysis 4 12 2
## applications 7 0 1
## code 3 1 0
## computing 1 0 2
## data 50 0 12
## examples 5 2 2
## introduction 3 2 0
## mining 64 1 6
## network 1 17 1
## package 6 1 27

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have built a term-term adjacency matrix, where the rows and columns represent terms, and every entry is the number of co-occurrences of two terms. Next we can build a graph with &lt;code&gt;graph.adjacency()&lt;/code&gt; from the igraph package.&lt;br&gt;
&lt;/p&gt;
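The `m %*% t(m)` step generalizes: if `m` is a term-document count matrix, then `m @ m.T` gives a term-term matrix whose off-diagonal entries count weighted co-occurrences across shared documents, and whose diagonal entries are each term's counts squared and summed over documents. A tiny numpy cross-check with a made-up 3-term, 2-document matrix (not the real termDocMatrix):

```python
import numpy as np

# Rows = terms, columns = documents (hypothetical toy counts).
m = np.array([[1, 0],   # term A appears once in doc 1
              [1, 1],   # term B appears once in each doc
              [0, 2]])  # term C appears twice in doc 2

termM = m @ m.T         # same as m %*% t(m) in R
print(termM)
# [[1 1 0]
#  [1 2 2]
#  [0 2 4]]
```

For example, A and B co-occur only in doc 1 (entry 1), while A and C share no document (entry 0).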

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g &amp;lt;- graph.adjacency(termM,weighted = T,mode = 'undirected')
g


## IGRAPH d066e18 UNW- 21 151 -- 
## + attr: name (v/c), weight (e/n)
## + edges from d066e18 (vertex names):
## [1] analysis --analysis analysis --code        
## [3] analysis --data analysis --examples    
## [5] analysis --introduction analysis --mining      
## [7] analysis --network analysis --package     
## [9] analysis --positions analysis --postdoctoral
## [11] analysis --r analysis --research    
## [13] analysis --series analysis --slides      
## [15] analysis --social analysis --time        
## + ... omitted several edges

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we have built a graph on &lt;code&gt;termDocMatrix&lt;/code&gt;. In the result we can see the edges between different nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g &amp;lt;- simplify(g)
g


## IGRAPH d06a87e UNW- 21 130 -- 
## + attr: name (v/c), weight (e/n)
## + edges from d06a87e (vertex names):
## [1] analysis --code analysis --data        
## [3] analysis --examples analysis --introduction
## [5] analysis --mining analysis --network     
## [7] analysis --package analysis --positions   
## [9] analysis --postdoctoral analysis --r           
## [11] analysis --research analysis --series      
## [13] analysis --slides analysis --social      
## [15] analysis --time analysis --tutorial    
## + ... omitted several edges

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function &lt;code&gt;simplify()&lt;/code&gt; in igraph handily removes self-loops from a network. The previous graph had 151 edges, while the simplified graph has only 130; hence the 21 self-loops were omitted from the graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check the degree of each node of the graph
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;V(g)$label &amp;lt;- V(g)$name
V(g)$label


## [1] "analysis" "applications" "code" "computing" "data"        
## [6] "examples" "introduction" "mining" "network" "package"     
## [11] "parallel" "positions" "postdoctoral" "r" "research"    
## [16] "series" "slides" "social" "time" "tutorial"    
## [21] "users"


V(g)$degree &amp;lt;- degree(g)
V(g)$degree


## [1] 17 6 9 9 18 14 12 20 14 13 8 7 8 17 9 11 15 11 11 16 15

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The maximum node degree in the graph is 20.&lt;/p&gt;

&lt;h2&gt;
  
  
  Histogram on the basis of degree
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hist(V(g)$degree, breaks = 100,col = 'green', main = "Frequency Of Degree", 
     xlab = " Degree of vertices", ylab = " Frequency")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-22-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-22-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the most common degrees are 9 and 11. Recall that the degree of a node is the number of edges incident to it.&lt;/p&gt;
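<p>Counting degrees needs nothing graph-specific: each edge adds one to the degree of both endpoints. A minimal Python sketch over a hypothetical edge list:</p>

```python
from collections import Counter

# Hypothetical undirected edge list (no self-loops).
edges = [("r", "mining"), ("r", "analysis"),
         ("mining", "analysis"), ("mining", "network")]

degree = Counter()
for a, b in edges:
    degree[a] += 1  # each edge contributes one to both endpoints
    degree[b] += 1

print(dict(degree))  # "mining" has the highest degree, 3
```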

&lt;h2&gt;
  
  
  Let’s Plot Graph on the Data.
&lt;/h2&gt;

&lt;p&gt;Let’s call &lt;code&gt;set.seed(3952)&lt;/code&gt;, which makes the random layout reproducible: the same seed always produces the same result.&lt;br&gt;
&lt;/p&gt;
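<p>The same reproducibility idea exists in any language: a fixed seed makes the pseudo-random stream, and hence the layout, repeatable. A quick Python sketch (purely illustrative):</p>

```python
import random

def pseudo_layout(seed, n=3):
    # A fixed seed yields the same "random" coordinates every time.
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]

assert pseudo_layout(3952) == pseudo_layout(3952)  # reproducible
```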

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;set.seed(3952)

layout1 &amp;lt;- layout.fruchterman.reingold(g)

plot(g, layout=layout1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-23-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-23-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A different layout can be generated with the code below. An interactive plot, which allows us to manually rearrange the layout, can be produced with igraph’s &lt;code&gt;tkplot()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot(g, layout=layout.kamada.kawai)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-24-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-24-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Make it better.
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;V(g)$label.cex &amp;lt;- 2.2 * V(g)$degree / max(V(g)$degree)+ .2

V(g)$label.color &amp;lt;- rgb(0, 0, .2, .8)

V(g)$frame.color &amp;lt;- NA

egam &amp;lt;- (log(E(g)$weight)+.4) / max(log(E(g)$weight)+.4)

E(g)$color &amp;lt;- rgb(.5, .5, 0, egam)

E(g)$width &amp;lt;- egam
# plot the graph in layout1

plot(g, layout=layout1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-25-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-25-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here each word is sized according to its degree. From the graph we can clearly see that the node &lt;code&gt;mining&lt;/code&gt; has the maximum degree.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community detection
&lt;/h2&gt;

&lt;p&gt;Communities are known as groups, clusters, coherent subgroups, or modules in different fields. Community detection in a social network means identifying sets of nodes such that nodes within a set are more densely connected to each other than to the rest of the network.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;comm &amp;lt;- cluster_edge_betweenness(g)


## Warning in cluster_edge_betweenness(g): At community.c:461 :Membership vector
## will be selected based on the lowest modularity score.

## Warning in cluster_edge_betweenness(g): At community.c:468 :Modularity
## calculation with weighted edge betweenness community detection might not make
## sense -- modularity treats edge weights as similarities while edge betwenness
## treats them as distances


plot(comm,g)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-26-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-26-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Connections are dense within each community and sparse between communities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prop &amp;lt;- cluster_label_prop(g)
plot(prop, g)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-27-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-27-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Label propagation is a different community-detection algorithm, and it produces a different grouping from the previous one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hubs
&lt;/h2&gt;

&lt;p&gt;Hubs are nodes with many outgoing edges. We compute a hub score for each node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hs &amp;lt;- hub_score(g,weights = NA)$vector
hs


## analysis applications code computing data examples 
## 0.9047777 0.3589289 0.5606314 0.5223206 0.9065063 0.8195307 
## introduction mining network package parallel positions 
## 0.7307838 1.0000000 0.7483791 0.7210610 0.4939614 0.3733995 
## postdoctoral r research series slides social 
## 0.4095660 0.9147530 0.4481802 0.6761093 0.8510808 0.6018664 
## time tutorial users 
## 0.6761093 0.8899001 0.8342594

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Authority
&lt;/h2&gt;

&lt;p&gt;Authorities are nodes with many incoming edges. We compute an authority score for each node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;as &amp;lt;- authority_score(g, weights = NA)$vector
as


## analysis applications code computing data examples 
## 0.9047777 0.3589289 0.5606314 0.5223206 0.9065063 0.8195307 
## introduction mining network package parallel positions 
## 0.7307838 1.0000000 0.7483791 0.7210610 0.4939614 0.3733995 
## postdoctoral r research series slides social 
## 0.4095660 0.9147530 0.4481802 0.6761093 0.8510808 0.6018664 
## time tutorial users 
## 0.6761093 0.8899001 0.8342594

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hub in Plot
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;par(mfrow = c(1,2))
plot(g,vertex.size= hs*50, main = "Hubs",
     vertex.label = NA,
     vertex.color = rainbow(50))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-30-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-30-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Authority in Plot
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot(g,vertex.size= as*30, main = "Authorities",
     vertex.label = NA,
     vertex.color = rainbow(50))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-31-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fsna%2Funnamed-chunk-31-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hubs are expected to have a large number of outgoing links, and authorities are expected to receive a large number of incoming links from hubs.&lt;/p&gt;
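<p>The hub and authority scores above come from the HITS algorithm: a node’s authority score sums the hub scores of the nodes pointing to it, and its hub score sums the authority scores of the nodes it points to, iterated until the scores stabilize. A minimal power-iteration sketch on a small hypothetical directed graph (not the article’s term graph, which is undirected; that is why its hub and authority scores coincide):</p>

```python
# Hypothetical directed graph as out-adjacency lists.
out_edges = {"a": ["c", "d"], "b": ["c"], "c": ["d"], "d": []}
nodes = list(out_edges)

hub = {n: 1.0 for n in nodes}
auth = {n: 1.0 for n in nodes}

for _ in range(50):
    # Authority: sum of hub scores of the nodes that point at you.
    auth = {n: sum(hub[m] for m in nodes if n in out_edges[m]) for n in nodes}
    # Hub: sum of authority scores of the nodes you point to.
    hub = {n: sum(auth[m] for m in out_edges[n]) for n in nodes}
    # Scale so the best score is 1, matching igraph's convention.
    auth = {n: v / max(auth.values()) for n, v in auth.items()}
    hub = {n: v / max(hub.values()) for n, v in hub.items()}

print(hub)   # "a" points at the most authoritative nodes: top hub
print(auth)  # "c" and "d" receive links from the best hubs: top authorities
```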

&lt;h2&gt;
  
  
  Application of SNA in Real World
&lt;/h2&gt;

&lt;p&gt;Social network analysis can provide information about the reach of gangs, their impact, and their activity. The approach may also allow us to identify those who may be at risk of gang association or of being exploited by gangs.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>R Exercise: Association Rule Mining in R</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Fri, 02 Sep 2022 03:04:45 +0000</pubDate>
      <link>https://dev.to/iamdurga/r-exercise-association-rule-mining-in-r-3lpb</link>
      <guid>https://dev.to/iamdurga/r-exercise-association-rule-mining-in-r-3lpb</guid>
      <description>&lt;h1&gt;
  
  
  Association Rule Mining
&lt;/h1&gt;

&lt;p&gt;Association rule mining (also known as Association Rule Learning) is a common technique for determining relationships (co-occurrence) between many variables in massive transactional databases. It is mostly used in grocery stores, e-commerce websites, and other similar establishments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon knows what else you want to buy when you order something on their site.&lt;/strong&gt; This is a very prevalent example in our daily life.&lt;/p&gt;

&lt;p&gt;Spotify works on the same principle: they know what song you want to listen to next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use of Association Mining Results
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Changing the store layout according to trends&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customer behavior analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Catalogue design&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cross-marketing in online stores&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identifying the trending items customers buy&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customized email with add-on sales&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When Association Mining is used?
&lt;/h2&gt;

&lt;p&gt;We use association rule mining when we wish to find associations between different objects in a collection, or frequent patterns in a transaction database, a relational database, or any other information repository. In retailing, association rule mining is best known as ‘Market Basket Analysis’; it is also used alongside clustering and classification.&lt;/p&gt;

&lt;p&gt;By developing a set of rules known as &lt;strong&gt;Association Rules&lt;/strong&gt; , it can tell us what things clients commonly buy together. In simple terms, it generates output in the manner of &lt;strong&gt;if this, then that&lt;/strong&gt; rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Apriori Algorithm and Rule?
&lt;/h2&gt;

&lt;p&gt;Data from a retail market or an online e-commerce store is typically used to mine association rules. The apriori algorithm makes it possible to detect these patterns or rules rapidly, because most transaction data is huge. Running &lt;code&gt;apriori()&lt;/code&gt; without any constraints on the rules is not a smart idea!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule&lt;/strong&gt; A rule is a notation that shows which items are frequently purchased together.&lt;/p&gt;

&lt;p&gt;It has two parts: a ‘LHS’ and a ‘RHS’, which can be represented as follows:&lt;/p&gt;

&lt;p&gt;‘itemsetA =&amp;gt; itemsetB’, read as: if itemsetA is bought, then itemsetB is likely to be bought as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Association Rule Mining Terms
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Support
&lt;/h3&gt;

&lt;p&gt;Association rules are given in the following form,&lt;/p&gt;

&lt;p&gt;&lt;code&gt;A=&amp;gt;B[support, confidence]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Where A and B are disjoint sets of items in the transaction data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Support = Number of transactions with both A and B / Total number of
transactions = P(A∩B) = frequency(A, B)/N

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Confidence
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Confidence = Number of transactions with both A and B / Total number of
transactions with A = P(A∩B)/P(A) = frequency(A, B)/frequency(A)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Expected Confidence
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Expected Confidence = Number of transactions withB/Total number of
transactions = P(B) = frequency(B)/N

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lift
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lift = Confidence/ExpectedConfidence =P(A∩B)/P(A).P(B) =
Support(A,B)/Support(A).Support(B)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
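<p>These formulas can be checked directly on the article’s five example transactions (a quick sketch in Python rather than R; the rule tested, {dipers} =&amp;gt; {beer}, matches the arules output later in the post):</p>

```python
# The article's five market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "dipers", "beer", "Egg"},
    {"milk", "dipers", "beer", "coka"},
    {"bread", "milk", "dipers", "beer"},
    {"bread", "milk", "dipers", "coka"},
]
N = len(transactions)

def count(items):
    # Number of transactions containing every item in `items`.
    return sum(1 for t in transactions if items.issubset(t))

A, B = {"dipers"}, {"beer"}
support = count(A.union(B)) / N            # P(A ∩ B)
confidence = count(A.union(B)) / count(A)  # P(A ∩ B) / P(A)
# Lift = Support(A,B) / (Support(A) · Support(B)), rearranged to stay in integers.
lift = count(A.union(B)) * N / (count(A) * count(B))
print(support, confidence, lift)  # 0.6 0.75 1.25
```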



&lt;h2&gt;
  
  
  Let’s Do Association Rule Mining in R
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Create a list of baskets
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;market_basket&amp;lt;- list(c("bread", "milk"),
                     c("bread","dipers","beer","Egg"),
                     c("milk","dipers","beer","coka"),
                     c("bread","milk","dipers","beer"),
                     c("bread","milk","dipers","coka")
                     )
names(market_basket) &amp;lt;- paste("T",c(1:5),sep = "")
market_basket


## $T1
## [1] "bread" "milk" 
## 
## $T2
## [1] "bread" "dipers" "beer" "Egg"   
## 
## $T3
## [1] "milk" "dipers" "beer" "coka"  
## 
## $T4
## [1] "bread" "milk" "dipers" "beer"  
## 
## $T5
## [1] "bread" "milk" "dipers" "coka"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The five transactions were created from the preceding data and given the names T1, T2, T3, T4, T5.&lt;/p&gt;

&lt;p&gt;Now we’ll use the &lt;code&gt;arules&lt;/code&gt; package to do some more association rule mining. The package must be installed before moving on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(arules)


## Warning: package 'arules' was built under R version 4.1.2

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
## abbreviate, write

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s convert the list of baskets into a transactions object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trans &amp;lt;- as(market_basket,"transactions")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s check the dimensions of the &lt;code&gt;trans&lt;/code&gt; object,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dim(trans)


## [1] 5 6

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response 5 6 means we have 5 transactions and 6 items. Let’s take a look at the labels of the transactions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;labels(trans)


## [1] "{bread,milk}" "{beer,bread,dipers,Egg}" 
## [3] "{beer,coka,dipers,milk}" "{beer,bread,dipers,milk}"
## [5] "{bread,coka,dipers,milk}"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These labels show the item set of each transaction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(trans)


## transactions as itemMatrix in sparse format with
## 5 rows (elements/itemsets/transactions) and
## 6 columns (items) and a density of 0.6 
## 
## most frequent items:
## bread dipers milk beer coka (Other) 
## 4 4 4 3 2 1 
## 
## element (itemset/transaction) length distribution:
## sizes
## 2 4 
## 1 4 
## 
## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 2.0 4.0 4.0 3.6 4.0 4.0 
## 
## includes extended item information - examples:
## labels
## 1 beer
## 2 bread
## 3 coka
## 
## includes extended transaction information - examples:
## transactionID
## 1 T1
## 2 T2
## 3 T3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Let’s inspect the &lt;code&gt;trans&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;inspect(trans)


## items transactionID
## [1] {bread, milk} T1           
## [2] {beer, bread, dipers, Egg} T2           
## [3] {beer, coka, dipers, milk} T3           
## [4] {beer, bread, dipers, milk} T4           
## [5] {bread, coka, dipers, milk} T5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is preferable to use the &lt;code&gt;inspect()&lt;/code&gt; function, which displays the transactions together with their IDs. If the data were really huge, we would inspect only the first few transactions rather than all of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Relative frequency plot and plot of trans
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;itemFrequencyPlot(trans, topN=10, cex.names =1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-8-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-8-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most frequently sold items were &lt;code&gt;bread&lt;/code&gt;, &lt;code&gt;dipers&lt;/code&gt; and &lt;code&gt;milk&lt;/code&gt;; the least sold item is &lt;code&gt;Egg&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;image(trans)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-9-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-9-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is the Apriori algorithm important here?
&lt;/h2&gt;

&lt;p&gt;Because it requires a full database scan, frequent item-set generation is the most computationally intensive stage. Our example has only 5 transactions, but real-world retail transaction data can exceed GBs and TBs, necessitating an optimized technique that prunes out item-sets which will not aid the later phases.&lt;/p&gt;
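<p>The key pruning idea (the apriori property: every subset of a frequent item-set must itself be frequent, so nothing built on an infrequent set ever needs to be counted) can be sketched on the five example transactions. This is an illustrative toy miner in Python, not the arules implementation:</p>

```python
# The article's five transactions; minimum support 0.3 as in the post.
transactions = [
    {"bread", "milk"},
    {"bread", "dipers", "beer", "Egg"},
    {"milk", "dipers", "beer", "coka"},
    {"bread", "milk", "dipers", "beer"},
    {"bread", "milk", "dipers", "coka"},
]
min_support = 0.3
N = len(transactions)

def support(items):
    return sum(1 for t in transactions if items.issubset(t)) / N

# Level 1: frequent single items ("Egg" is pruned, support 0.2).
all_items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in all_items
            if support(frozenset([i])) >= min_support]

# Level k+1: only extend frequent k-item-sets; pruned sets are never revisited.
level = list(frequent)
while level:
    candidates = {a.union(b) for a in level for b in level
                  if len(a.union(b)) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_support]
    frequent.extend(level)

print(len(frequent))  # 17 frequent item-sets in total
```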

&lt;h2&gt;
  
  
  Apriori algorithm on “trans”, without and with a min. support of 0.3 and min. confidence of 0.5
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rues &amp;lt;- apriori(trans)


## Apriori
## 
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
## 
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
## 
## Absolute minimum support count: 0 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 5 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [31 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].


rules


## function (rhs, lhs, itemLabels, quality = data.frame()) 
## {
## if (!is(lhs, "itemMatrix")) 
## lhs &amp;lt;- encode(lhs, itemLabels = itemLabels)
## if (!is(rhs, "itemMatrix")) 
## rhs &amp;lt;- encode(rhs, itemLabels = itemLabels)
## new("rules", lhs = lhs, rhs = rhs, quality = quality)
## }
## &amp;lt;bytecode: 0x0000000022947bd8&amp;gt;
## &amp;lt;environment: namespace:arules&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the default settings, 31 rules were generated. Note the typo above: the result was assigned to &lt;code&gt;rues&lt;/code&gt;, so printing &lt;code&gt;rules&lt;/code&gt; displayed the &lt;code&gt;rules()&lt;/code&gt; constructor from arules instead of the rule set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rules &amp;lt;- apriori(trans, parameter = list(supp=0.3,conf=0.5,
                                         maxlen=10,
                                         target ="rules"))


## Apriori
## 
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.3 1
## maxlen target ext
## 10 rules TRUE
## 
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
## 
## Absolute minimum support count: 1 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 5 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [32 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note: maxlen is the maximum number of items in a rule. We could have used maxlen=4 here, since no transaction contains more than 4 items, but this will not be known in real life!&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary of rules
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(rules)


## set of 32 rules
## 
## rule length distribution (lhs + rhs):sizes
## 1 2 3 
## 4 16 12 
## 
## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 1.00 2.00 2.00 2.25 3.00 3.00 
## 
## summary of quality measures:
## support confidence coverage lift       
## Min. :0.4000 Min. :0.5000 Min. :0.4000 Min. :0.8333  
## 1st Qu.:0.4000 1st Qu.:0.6667 1st Qu.:0.6000 1st Qu.:0.8333  
## Median :0.4000 Median :0.7500 Median :0.6000 Median :1.0000  
## Mean :0.4938 Mean :0.7474 Mean :0.6813 Mean :1.0473  
## 3rd Qu.:0.6000 3rd Qu.:0.8000 3rd Qu.:0.8000 3rd Qu.:1.2500  
## Max. :0.8000 Max. :1.0000 Max. :1.0000 Max. :1.6667  
## count      
## Min. :2.000  
## 1st Qu.:2.000  
## Median :2.000  
## Mean :2.469  
## 3rd Qu.:3.000  
## Max. :4.000  
## 
## mining info:
## data ntransactions support confidence
## trans 5 0.3 0.5
## call
## apriori(data = trans, parameter = list(supp = 0.3, conf = 0.5, maxlen = 10, target = "rules"))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A collection of 32 rules has been created: four rules of length 1, sixteen of length 2, and twelve of length 3. The length-1 rules have an empty left-hand side, which makes them useless; let’s get rid of them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rules &amp;lt;- apriori(trans, parameter = list(supp=0.3,conf = 0.5,
                                         maxlen =10,
                                         minlen=2,
                                         target="rules"))


## Apriori
## 
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.3 2
## maxlen target ext
## 10 rules TRUE
## 
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
## 
## Absolute minimum support count: 1 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 5 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [28 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Let’s set RHS rule for trans data
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# we set rhs =beer and default = lhs
beer_rules_rhs&amp;lt;- apriori(trans, parameter = list(supp= 0.3,conf= 0.5,
                                                 maxlen= 10,
                                                 minlen=2),
                         appearance = list(default="lhs",
                                           rhs ="beer"))


## Apriori
## 
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.3 2
## maxlen target ext
## 10 rules TRUE
## 
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
## 
## Absolute minimum support count: 1 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[6 item(s), 5 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].


inspect(beer_rules_rhs)


## lhs rhs support confidence coverage lift count
## [1] {bread} =&amp;gt; {beer} 0.4 0.5000000 0.8 0.8333333 2    
## [2] {milk} =&amp;gt; {beer} 0.4 0.5000000 0.8 0.8333333 2    
## [3] {dipers} =&amp;gt; {beer} 0.6 0.7500000 0.8 1.2500000 3    
## [4] {bread, dipers} =&amp;gt; {beer} 0.4 0.6666667 0.6 1.1111111 2    
## [5] {dipers, milk} =&amp;gt; {beer} 0.4 0.6666667 0.6 1.1111111 2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Customers who bought dipers, alone or together with bread or milk, frequently also bought beer. It’s a pretty interesting data insight. We can guess from this that fathers sent to the grocery store for the baby’s necessities most likely picked up beer as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s put beer in LHS and set RHS as default values
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;beer_rules_lhs &amp;lt;- apriori(trans, parameter = list(supp=0.3,conf=0.5,
                                                  maxlen =10,
                                                  minlen =2),
                          appearance = list(default="rhs",lhs ="beer"))


## Apriori
## 
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.3 2
## maxlen target ext
## 10 rules TRUE
## 
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
## 
## Absolute minimum support count: 1 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[6 item(s), 5 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [3 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].


inspect(beer_rules_lhs)


## lhs rhs support confidence coverage lift count
## [1] {beer} =&amp;gt; {bread} 0.4 0.6666667 0.6 0.8333333 2    
## [2] {beer} =&amp;gt; {milk} 0.4 0.6666667 0.6 0.8333333 2    
## [3] {beer} =&amp;gt; {dipers} 0.6 1.0000000 0.6 1.2500000 3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;People who bought beer also bought dipers: every transaction that contains beer contains dipers as well (confidence 1.0).&lt;/p&gt;

&lt;h2&gt;
  
  
  Product Recommendation Rule
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rules_conf&amp;lt;- sort(rules,by ="confidence",
                  decreasing = TRUE)
inspect(rules_conf)


## lhs rhs support confidence coverage lift count
## [1] {coka} =&amp;gt; {milk} 0.4 1.0000000 0.4 1.2500000 2    
## [2] {coka} =&amp;gt; {dipers} 0.4 1.0000000 0.4 1.2500000 2    
## [3] {beer} =&amp;gt; {dipers} 0.6 1.0000000 0.6 1.2500000 3    
## [4] {coka, milk} =&amp;gt; {dipers} 0.4 1.0000000 0.4 1.2500000 2    
## [5] {coka, dipers} =&amp;gt; {milk} 0.4 1.0000000 0.4 1.2500000 2    
## [6] {beer, milk} =&amp;gt; {dipers} 0.4 1.0000000 0.4 1.2500000 2    
## [7] {beer, bread} =&amp;gt; {dipers} 0.4 1.0000000 0.4 1.2500000 2    
## [8] {dipers} =&amp;gt; {beer} 0.6 0.7500000 0.8 1.2500000 3    
## [9] {milk} =&amp;gt; {bread} 0.6 0.7500000 0.8 0.9375000 3    
## [10] {bread} =&amp;gt; {milk} 0.6 0.7500000 0.8 0.9375000 3    
## [11] {milk} =&amp;gt; {dipers} 0.6 0.7500000 0.8 0.9375000 3    
## [12] {dipers} =&amp;gt; {milk} 0.6 0.7500000 0.8 0.9375000 3    
## [13] {bread} =&amp;gt; {dipers} 0.6 0.7500000 0.8 0.9375000 3    
## [14] {dipers} =&amp;gt; {bread} 0.6 0.7500000 0.8 0.9375000 3    
## [15] {beer} =&amp;gt; {milk} 0.4 0.6666667 0.6 0.8333333 2    
## [16] {beer} =&amp;gt; {bread} 0.4 0.6666667 0.6 0.8333333 2    
## [17] {dipers, milk} =&amp;gt; {coka} 0.4 0.6666667 0.6 1.6666667 2    
## [18] {beer, dipers} =&amp;gt; {milk} 0.4 0.6666667 0.6 0.8333333 2    
## [19] {dipers, milk} =&amp;gt; {beer} 0.4 0.6666667 0.6 1.1111111 2    
## [20] {beer, dipers} =&amp;gt; {bread} 0.4 0.6666667 0.6 0.8333333 2    
## [21] {bread, dipers} =&amp;gt; {beer} 0.4 0.6666667 0.6 1.1111111 2    
## [22] {bread, milk} =&amp;gt; {dipers} 0.4 0.6666667 0.6 0.8333333 2    
## [23] {dipers, milk} =&amp;gt; {bread} 0.4 0.6666667 0.6 0.8333333 2    
## [24] {bread, dipers} =&amp;gt; {milk} 0.4 0.6666667 0.6 0.8333333 2    
## [25] {milk} =&amp;gt; {coka} 0.4 0.5000000 0.8 1.2500000 2    
## [26] {dipers} =&amp;gt; {coka} 0.4 0.5000000 0.8 1.2500000 2    
## [27] {milk} =&amp;gt; {beer} 0.4 0.5000000 0.8 0.8333333 2    
## [28] {bread} =&amp;gt; {beer} 0.4 0.5000000 0.8 0.8333333 2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rules are sorted by confidence in decreasing order in the above results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Plotting rules with “arulesViz” package
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(arulesViz)


## Warning: package 'arulesViz' was built under R version 4.1.2


plot(rules)


## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-17-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-17-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, a darker orange color indicates rules with higher lift values; as the lift value decreases, the color fades to a lighter orange.&lt;/p&gt;

&lt;p&gt;Let’s plot the same plot by setting &lt;code&gt;measure = "confidence"&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot(rules, measure = "confidence")


## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-18-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-18-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Plot &lt;code&gt;two-key-plot&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(arulesViz)
plot(rules, method = 'two-key plot')


## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-19-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-19-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Plot with “ggplot2” engine
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(ggplot2)

plot(rules, engine = "ggplot2")


## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-20-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-20-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we hover our cursor over the orange points, we can see the values of support, confidence, and lift. The darker the orange color, the higher the value of the corresponding parameter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Coordinate plot for 10 rules
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot(subrules, method = "paracoord")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-22-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fassociation-rule%2Funnamed-chunk-22-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We used the parallel coordinate method to visualize rules in a higher-dimensional space. In this case, we visualize ten rules at once.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>R Exercise: Getting Started With ggplot in R</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Fri, 29 Jul 2022 15:34:19 +0000</pubDate>
      <link>https://dev.to/iamdurga/r-exercise-getting-started-with-ggplot-in-r-1a4c</link>
      <guid>https://dev.to/iamdurga/r-exercise-getting-started-with-ggplot-in-r-1a4c</guid>
      <description>&lt;h1&gt;
  
  
  Getting Started with ggplot2 in R
&lt;/h1&gt;

&lt;h2&gt;
  
  
Grammar
&lt;/h2&gt;

&lt;p&gt;A grammar provides a foundation for understanding different types of graphics. A grammar may also tell us what a well-formed or correct graphic looks like, but there will still be many grammatically correct yet nonsensical graphics. This is easy to see by analogy to the English language: good grammar is just the first step in creating a good sentence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grammar of Graphics
&lt;/h2&gt;

&lt;p&gt;A grammar of graphics is a tool that enables us to clearly describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structures that underlie statistical graphics. &lt;code&gt;ggplot2&lt;/code&gt; proposes an alternative parameterization of the grammar, based on the idea of building up a graphic from multiple layers of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Components of ggplot2
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Data and aesthetic mappings&lt;/li&gt;
&lt;li&gt;Geometric objects&lt;/li&gt;
&lt;li&gt;Scale&lt;/li&gt;
&lt;li&gt;Facet Specification&lt;/li&gt;
&lt;li&gt;Statistical Transformation&lt;/li&gt;
&lt;li&gt;Coordinate System&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Layered grammar of graphics
&lt;/h2&gt;

&lt;p&gt;Together, the data, mappings, statistical transformations, and geometric objects form a layer. A plot may have multiple layers. Layers are responsible for creating the objects that we see on the plot.&lt;/p&gt;
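
&lt;p&gt;As a minimal sketch of this idea, the plot below is built from two layers that share the same data and aesthetic mappings: a point layer and a smoother layer (using the built-in &lt;code&gt;mpg&lt;/code&gt; data).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(ggplot2)
# Layer 1: data + mappings + point geom
# Layer 2: same data and mappings, with a smoother geom stacked on top
ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;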

&lt;h2&gt;
  
  
  How to use ggplot2 in R?
&lt;/h2&gt;

&lt;p&gt;For this, we need to have the ggplot2 package installed. Let us use ggplot2 on the R built-in dataset &lt;code&gt;diamonds&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(ggplot2)
ggplot(diamonds, aes(carat,price)) + geom_point()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-1-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-1-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;geom_point&lt;/code&gt; is used for a scatter plot. From the figure above, we can see that as a diamond’s carat increases, its price also increases. We cannot see clearly how the data are distributed; for this, let’s make some changes in our code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(diamonds,aes(carat,price)) + geom_point() +
  scale_x_continuous() + scale_y_continuous()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-2-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-2-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the distribution of points better than in the previous plot. We can clearly see that the carat and price variables are not linearly related. To linearize the relationship, let’s make some changes to our code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(diamonds, aes(carat,price)) + geom_point() +
  stat_smooth(method = lm) + scale_x_log10() + scale_y_log10()


## `geom_smooth()` using formula 'y ~ x'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-3-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the graph above, the relationship between the price and carat variables (on log scales) is linear. If we run the code without &lt;code&gt;stat_smooth(method = lm)&lt;/code&gt;, we do not see the fitted line in the graph; here &lt;code&gt;lm&lt;/code&gt; stands for linear model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(diamonds, aes(carat,price)) + geom_point() +
  scale_x_log10() + scale_y_log10()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-4-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
Let’s make a histogram of the &lt;code&gt;diamonds&lt;/code&gt; data
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(diamonds, aes(price)) + geom_histogram()


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-5-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-5-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To build a histogram, we use the function &lt;code&gt;geom_histogram()&lt;/code&gt;. Note that a histogram is made from one-dimensional data. If we want to add a title to the plot, we can do so as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(diamonds, aes(price)) + geom_histogram() + ggtitle("ggplot2 Histogram")


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-6-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-6-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let us try some other ggplot2 features on the R built-in data &lt;code&gt;mpg&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-7-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-7-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The figure above shows a scatterplot between the &lt;code&gt;hwy&lt;/code&gt; and &lt;code&gt;displ&lt;/code&gt; variables of the &lt;code&gt;mpg&lt;/code&gt; data. From the figure, we can see that as &lt;code&gt;displ&lt;/code&gt; increases, &lt;code&gt;hwy&lt;/code&gt; tends to decrease.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s add &lt;code&gt;geom_smooth()&lt;/code&gt;: What will happen?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth()


## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-8-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-8-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see a smooth line passing through the middle of the data points.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding “wiggliness” in the smoothing plot
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(span = 0.2) 


## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-9-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-9-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What changes do we see between this graph and the previous one? Let us check again by setting &lt;code&gt;span = 1&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(displ, hwy)) + geom_point()+ geom_smooth(span = 1)


## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-10-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By default, &lt;code&gt;geom_smooth()&lt;/code&gt; uses &lt;code&gt;span = 0.75&lt;/code&gt;, so &lt;code&gt;span = 1&lt;/code&gt; gives a slightly smoother curve than the default. If we set &lt;code&gt;&lt;br&gt;
method = lm&lt;/code&gt; inside &lt;code&gt;geom_smooth()&lt;/code&gt;, we get a straight fitted line. Let us try.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(method = lm) 


## `geom_smooth()` using formula 'y ~ x'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-11-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-11-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s modify our code a little
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(method = lm, se= FALSE) 


## `geom_smooth()` using formula 'y ~ x'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-12-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-12-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;se = FALSE&lt;/code&gt; we removed the shaded confidence band around the fitted line.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fixed color
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(displ,hwy)) + geom_point(color = 'red')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-13-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-13-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Changing color by variable attributes
&lt;/h2&gt;

&lt;p&gt;Let’s change the color based on the &lt;code&gt;class&lt;/code&gt; variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(displ, hwy, colour = class)) + 
geom_point()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-14-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, points are colored according to the vehicle class.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting multiple scatterplots of attributes
&lt;/h2&gt;

&lt;p&gt;We can get multiple scatterplots by using the &lt;code&gt;facet_wrap()&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(displ, hwy)) + geom_point() + 
facet_wrap(~class)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-15-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the figure above, we see the relationship between the &lt;code&gt;displ&lt;/code&gt; and &lt;code&gt;hwy&lt;/code&gt; variables separately for each vehicle class.&lt;/p&gt;

&lt;h2&gt;
  
  
  Histogram
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(hwy)) + geom_histogram()


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-16-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-16-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;hwy&lt;/code&gt; variable is binned automatically (30 bins by default).&lt;/p&gt;

&lt;h2&gt;
  
  
  Changing bin size of the histogram
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 2.5)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-17-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-17-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequency polygon
&lt;/h2&gt;

&lt;p&gt;A frequency polygon is a line graph of class frequency plotted against class midpoint. It can be obtained by joining the midpoints of the top of the rectangles in the histogram.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(hwy)) + geom_freqpoly()


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-18-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-18-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Changing bin size of the frequency polygon
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth= 1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-19-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-19-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the effect of &lt;code&gt;binwidth&lt;/code&gt; by comparing this figure with the previous one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Histogram with faceting:
&lt;/h2&gt;

&lt;p&gt;We have already discussed what faceting does in a scatterplot. Similarly, for histograms it produces multiple subplots.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(displ, fill = drv)) + geom_histogram(binwidth = 0.5) + 
facet_wrap(~drv, ncol = 1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-20-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-20-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bar plot
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(manufacturer)) + geom_bar()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-21-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-21-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can draw a bar plot with the &lt;code&gt;geom_bar()&lt;/code&gt; function. From the bar plot we can see that &lt;code&gt;dodge&lt;/code&gt; and &lt;code&gt;toyota&lt;/code&gt; have the highest frequencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Use alpha inside &lt;code&gt;geom_point()&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(cty, hwy)) + geom_point(alpha = 1 / 3)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-22-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-22-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alpha refers to the opacity of a geom. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.&lt;/p&gt;
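
&lt;p&gt;To see the effect, we can compare a few alpha values (a quick sketch, not from the original post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# alpha = 1 is fully opaque; smaller values reveal where points overlap
ggplot(mpg, aes(cty, hwy)) + geom_point(alpha = 1)
ggplot(mpg, aes(cty, hwy)) + geom_point(alpha = 1 / 10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;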

&lt;h2&gt;
  
  
  Modifying the axes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(cty, hwy)) +geom_point(alpha = 1 / 3) + xlab("city driving (mpg)") + 
ylab("highway driving (mpg)")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-23-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-23-1.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(mpg, aes(cty, hwy)) + geom_point(alpha = 1 / 3) + xlab(NULL) + 
 ylab(NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-24-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fggplot%2Funnamed-chunk-24-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s all for this part. Thank you so much for reading.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>R Exercise: Monte Carlo Simulations in R</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Fri, 29 Jul 2022 15:33:02 +0000</pubDate>
      <link>https://dev.to/iamdurga/r-exercise-monte-carlo-simulations-in-r-2nk6</link>
      <guid>https://dev.to/iamdurga/r-exercise-monte-carlo-simulations-in-r-2nk6</guid>
      <description>&lt;h1&gt;
  
  
  Monte Carlo Simulations
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What is Monte Carlo Simulations?
&lt;/h2&gt;

&lt;p&gt;One of the main motivations to switch from spreadsheet-type tools (such as Microsoft Excel) to a program like R is for simulation modeling. R allows us to repeat the same (potentially complex and detailed) calculations with different random values over and over again.&lt;/p&gt;

&lt;p&gt;Within the same software, we can then summarize and plot the results of these replicated calculations. Monte Carlo methods are used to perform this type of analysis: they randomly sample from a set of values in order to generate and summarize a distribution of some statistic related to the sampled quantities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Randomness
&lt;/h2&gt;

&lt;p&gt;Random processes are an important aspect of simulation modeling. A random process is one that produces a different result each time it is run, according to some rules. Random processes are inextricably tied to the concept of uncertainty: you have no idea what will happen the next time the process is run.&lt;/p&gt;

&lt;p&gt;There are two basic ways to introduce randomness in R:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random deviates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resampling&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Random Deviates
&lt;/h2&gt;

&lt;p&gt;Each person alive at the start of the year either survives to the end of the year or dies. There are two possible outcomes, and each person has an 80% probability of surviving. The number of survivors is the outcome of a binomial random process in which &lt;code&gt;n&lt;/code&gt; individuals are alive at the start of the year and &lt;code&gt;p&lt;/code&gt; is the probability that any one of them lives to the next year.&lt;/p&gt;

&lt;p&gt;In R, we can simulate a binomial random process with p=0.8 and n=100.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rbinom(n= 1, size =100,
       prob= 0.8)


## [1] 80

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time we got 80, but you will almost certainly get a different number when you run it, since the process is random.&lt;/p&gt;

&lt;h2&gt;
  
  
  With a little tinkering, we can also plot it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;survivors = rbinom(1000,
                   100, 0.8)
hist(survivors,
  col = "skyblue")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fmonte_carlo%2Funnamed-chunk-2-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fmonte_carlo%2Funnamed-chunk-2-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  We can also use other processes, such as the log-normal
&lt;/h2&gt;

&lt;p&gt;The log-normal process is another random process. It generates random numbers whose logarithm is normally distributed, with mean meanlog and standard deviation sdlog.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hist(rlnorm(1000,0,0.1),col="skyblue")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fmonte_carlo%2Funnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fmonte_carlo%2Funnamed-chunk-3-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Need for sampling
&lt;/h2&gt;

&lt;p&gt;There are several situations in probability, and more broadly in machine learning, where an analytical solution cannot be calculated directly. In machine learning, for example, the problem of class imbalance is often tackled by resampling. In fact, some would argue that for most practical probabilistic models, exact inference is intractable.&lt;/p&gt;

&lt;p&gt;The desired calculation is usually a sum over discrete variables or an integral over continuous variables, and thus is computationally difficult. The calculation may be intractable for a variety of reasons: a huge number of random variables, the stochastic nature of the domain, noise in the data, a shortage of observations, and more.&lt;/p&gt;
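As a toy illustration of approximating such a quantity by sampling (the example and its values are mine, not from the article), we can estimate E[X^2] for X ~ N(0, 1), whose true value is 1:

```r
# Monte Carlo estimate of an expectation that we could also get analytically:
# E[X^2] = 1 for a standard normal. With enough draws, the sample average
# of x^2 gets close to the true value.
set.seed(1)
x <- rnorm(100000)   # draws from the target distribution
mc_est <- mean(x^2)  # Monte Carlo estimate of E[X^2]
mc_est
```

The same averaging-over-draws idea applies when no closed form exists, which is where sampling methods earn their keep.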

&lt;h2&gt;
  
  
  Resampling
&lt;/h2&gt;

&lt;p&gt;Using random deviates to generate fresh random numbers is excellent, but what if we already have a set of numbers to which we want to add randomness? We can use resampling techniques for this. To sample &lt;code&gt;size&lt;/code&gt; elements from the vector &lt;code&gt;x&lt;/code&gt; in R, use the &lt;code&gt;sample()&lt;/code&gt; function.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resampling of 1 to 10
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample(x = 1:10, size =5)


## [1] 4 3 10 9 2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Sample with replacement
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample(x = c("a","b","c"), size = 10, replace = T)


## [1] "a" "a" "a" "b" "a" "b" "a" "c" "b" "b"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Sample with set probabilities
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample(x = c("live","die"),size = 10, replace = T, prob = c(0.8,0.2))


## [1] "live" "live" "die" "die" "live" "live" "live" "die" "live" "live"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Reproducing Randomness
&lt;/h2&gt;

&lt;p&gt;For reproducibility, we may want to obtain the same random numbers each time we run our script. To do so, we must first set the random seed, which is the starting point of our computer’s random number generator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;set.seed(1234)
rnorm(1)


## [1] -1.207066

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s try without setting the seed&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rnorm(1)


## [1] 0.2774292

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each time, we get a different result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replication
&lt;/h2&gt;

&lt;p&gt;To use Monte Carlo methods, we need to be able to replicate some random process many times. There are two main ways this is commonly done: with &lt;code&gt;replicate()&lt;/code&gt; or with &lt;code&gt;for()&lt;/code&gt; loops.&lt;/p&gt;

&lt;p&gt;The replicate() function executes the same expression many times and returns the output of each execution. Say we have a vector x, which represents 30 observations of an animal’s length (mm).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = rnorm(30, 500,40)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to create the mean length sampling distribution “by hand.” We can take a random sample, determine the mean, and then repeat the process as many times as necessary.&lt;/p&gt;
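One way this could be sketched with replicate() (the per-replicate sample size of 10 is my choice for illustration, not from the original):

```r
# Build the sampling distribution of the mean "by hand": resample from x,
# take the mean, and repeat 1000 times with replicate().
x <- rnorm(30, 500, 40)                 # 30 animal lengths (mm)
means <- replicate(1000, {
  samp <- sample(x, size = 10, replace = TRUE)  # resample 10 lengths
  mean(samp)                                    # mean of this resample
})
hist(means, col = "skyblue")            # sampling distribution of the mean
```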

&lt;h2&gt;
  
  
  Replication with “for” loop
&lt;/h2&gt;

&lt;p&gt;A loop is a programming construct that repeats a command until it reaches a specified point. R has a few types of loops: repeat(), while(), and for(). for() loops are among the most common in simulation modeling. A for() loop performs an operation once for each value in a vector.&lt;/p&gt;

&lt;p&gt;For loop syntax&lt;/p&gt;

&lt;p&gt;for(var in seq){ expression(var) }&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for( i in 1:5){
print(i^2)
}


## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25


nt = 100     # number of years to simulate
N = NULL     # abundance vector
N[1] = 1000  # initial abundance
for(t in 2:nt){
  # 10% growth with log-normal noise, followed by 8% mortality
  N[t] = (N[t-1]*1.1*rlnorm(1,0,0.1))*(1-0.08)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s plot it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot(N, type= "l", pch = 15, xlab = "Year", ylab = "Abundance")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fmonte_carlo%2Funnamed-chunk-12-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fmonte_carlo%2Funnamed-chunk-12-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summarizing the simulation
&lt;/h2&gt;

&lt;p&gt;After replicating a calculation many times, we will need to summarize the results.&lt;/p&gt;
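For instance (the summaries chosen here are my illustration), 1000 binomial replicates of the survival process can be reduced to a few key numbers:

```r
# Summarize 1000 replicates of the survival process with a point estimate,
# a measure of spread, and a central interval.
survivors <- rbinom(1000, 100, 0.8)      # 1000 simulated years
mean(survivors)                          # average number of survivors
sd(survivors)                            # spread across replicates
quantile(survivors, c(0.025, 0.975))     # central 95% interval
```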

&lt;h2&gt;
  
  
  Simulation-Based Learning
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mu = 500
sig = 30
random = rnorm(100,mu,sig)
p = seq(0.01, 0.99, 0.01)
random_q = quantile(random,p)
normal_q = qnorm(p,mu,sig)
plot(normal_q~random_q)
abline(c(0,1))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fmonte_carlo%2Funnamed-chunk-13-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fmonte_carlo%2Funnamed-chunk-13-1.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;q = seq(400,600,10)
random_cdf = ecdf(random)
random_p =random_cdf(q)
normal_p = pnorm(q,mu,sig)
plot(normal_p~q, type= "l", col = "blue")
points(random_p~q,col = "red")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fmonte_carlo%2Funnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fr_exercises%2Fmonte_carlo%2Funnamed-chunk-14-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>R Exercise: Validation and Cross Validation for Predictive Modeling R</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Fri, 15 Jul 2022 15:49:00 +0000</pubDate>
      <link>https://dev.to/iamdurga/r-exercise-validation-and-cross-validation-for-predictive-modeling-r-gng</link>
      <guid>https://dev.to/iamdurga/r-exercise-validation-and-cross-validation-for-predictive-modeling-r-gng</guid>
      <description>&lt;h1&gt;
  
  
  Validation &amp;amp; Cross-validation for Predictive Modelling including Linear Model as well as Multi Linear Model
&lt;/h1&gt;

&lt;p&gt;Before starting the topic, let’s become familiar with some terms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation&lt;/strong&gt; : An act of confirming something as true or correct. Also, Validation is the process of establishing documentary evidence that a procedure, process, or activity was carried out in testing before being put into production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-Validation&lt;/strong&gt; : Cross-validation, also known as rotation estimation or out-of-sample testing, is a set of model validation procedures for determining how well the results of a statistical analysis will generalize to new data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linear Model&lt;/strong&gt; : The term “linear model” refers to a model that has a linear relationship between the target variable and the independent variable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi Linear Model&lt;/strong&gt; : A regression model that uses a straight line to evaluate the connection between a quantitative dependent variable and two or more independent variables is known as multiple linear regression.&lt;/p&gt;

&lt;p&gt;Here we will use R’s built-in data &lt;code&gt;mtcars&lt;/code&gt; for coding purposes. First, let’s divide the data into a train set and a test set in the ratio of 70% to 30%. While doing that, never forget to use the &lt;code&gt;set.seed()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;set.seed()&lt;/strong&gt;: The random number generator is initialized using the set.seed() function. To generate a random number, the random number generator requires a starting value (seed value). By default, the random number generator is seeded from the current system time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Define the mtcars data as “data”:
data &amp;lt;- mtcars
#Use random seed to replicate the result
set.seed(123)
#Do random sampling to divide the cases into two independent samples
ind &amp;lt;- sample(2, nrow(mtcars), replace = T, prob = c(0.7, 0.3))
#Data partition
train.data &amp;lt;- data[ind==1,]
test.data &amp;lt;- data[ind==2,]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We divided our data into a training set and a testing set in the ratio of 70% to 30%.&lt;/p&gt;

&lt;h1&gt;
  
  
  Let’s fit Linear Model
&lt;/h1&gt;

&lt;p&gt;Set miles per gallon (mpg) as the dependent variable and weight (wt) as the independent variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lmodel &amp;lt;- lm(mpg~wt, data = train.data, method = "lm")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s do model prediction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pred &amp;lt;- predict(lmodel, data= test.data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the R-squared and error values. To do this, we should first load &lt;code&gt;library(caret)&lt;/code&gt; into our R session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(caret)


## Loading required package: ggplot2

## Loading required package: lattice


pred &amp;lt;- predict(lmodel)  # fitted values on the training data
R2 &amp;lt;- R2(pred, train.data$mpg)
R2


## [1] 0.7377021

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we found an R-squared value of 0.7377, which means the model explains about 73.77% of the variance in the training data. Let’s check the error,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RMSE &amp;lt;- RMSE(pred, test.data$mpg)


## Warning in pred - obs: longer object length is not a multiple of shorter object
## length


RMSE


## [1] 8.786064

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives the root mean squared error (RMSE) of the model.&lt;/p&gt;

&lt;h1&gt;
  
  
  Leave-One-Out Cross-Validation approach
&lt;/h1&gt;

&lt;p&gt;It’s usual practice when building a machine learning model to validate your methods by setting aside a subset of your data as a test set.&lt;/p&gt;

&lt;p&gt;LOOCV (leave-one-out cross-validation) is a type of cross-validation that uses each individual observation as a “test” set. It is a form of k-fold cross-validation in which the number of folds, k, equals the number of observations in the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(caret)
# Define training control
train.control &amp;lt;- trainControl(method = "LOOCV")


# Train the model
model1 &amp;lt;- train(mpg ~wt, data = mtcars, method = 
"lm",
trControl = train.control)
print(model1)


## Linear Regression 
## 
## 32 samples
## 1 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 31, 31, 31, 31, 31, 31, ... 
## Resampling results:
## 
## RMSE Rsquared MAE     
## 3.201673 0.7104641 2.517436
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE


pred1 &amp;lt;- predict(model1, test.data)
R2 &amp;lt;- R2(pred1, test.data$mpg)
R2


## [1] 0.7864736

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We obtain an R-squared of 78.65% when evaluating the model fitted with the leave-one-out strategy on the test set, which is higher than for the plain linear regression model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RMSE &amp;lt;- RMSE(pred1, test.data$mpg)
RMSE


## [1] 2.843768

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error is only 2.84, which is much lower than the previous one.&lt;/p&gt;

&lt;h1&gt;
  
  
  Let’s fit the model using K-folds Cross-Validation approach
&lt;/h1&gt;

&lt;p&gt;A K-fold CV is one in which a given data set is divided into K sections/folds, with each fold serving as a testing set at some point. Let’s look at a 10-fold cross validation case (K=10). The data set is divided into ten folds here. The first fold is used to test the model, while the others are used to train it in the first iteration. The second iteration uses the second fold as the testing set and the rest as the training set. This procedure is repeated until each of the ten folds has been utilized as a test set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#k-fold cross validation
library(caret)
# Define training control
set.seed(123) 
train.control &amp;lt;- trainControl(method = "cv", number = 10)
# Train the model
model2 &amp;lt;- train(mpg ~ wt, data = train.data, method = 
"lm",
trControl = train.control)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calculate the R-squared and error values and observe whether they differ from the previous ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(caret)
pred2 &amp;lt;- predict(model2, train.data)
R2 &amp;lt;- R2(pred2, train.data$mpg)
R2


## [1] 0.7377021

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method gives an R-squared value of 73.77%, which means the model explains about 73.77% of the variance in the training data.&lt;/p&gt;

&lt;h1&gt;
  
  
  Fit the model using Repeated K-folds Cross-Validation approach
&lt;/h1&gt;

&lt;p&gt;Repeated k-fold cross-validation is a technique for improving a machine learning model’s predicted performance. Simply repeat the cross-validation technique several times and return the mean result across all folds from all runs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#repeated k-fold cross validation
library(caret)
# Define training control
set.seed(123)
train.control &amp;lt;- trainControl(method = "repeatedcv", 
number = 10, repeats = 3)
# Train the model
model &amp;lt;- train(mpg ~wt, data = mtcars, method = 
"lm",
trControl = train.control)
# Summarize the results
print(model)


## Linear Regression 
## 
## 32 samples
## 1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 28, 28, 29, 29, 29, 30, ... 
## Resampling results:
## 
## RMSE Rsquared MAE     
## 2.975392 0.8351572 2.539797
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hence we get an R-squared value of 83.52% and, similarly, an RMSE of 2.975.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary: Which one should be used based on R-squared values of “lm” model?
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;R-square for training set: 0.7013&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R-square for training with LOOCV: 0.7104641&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R-square for training with k-folds CV: 0.7346939&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R-square for training with repeated k-folds CV: 0.8351572&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R-square for testing set: 0.9031085&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R-square for testing with LOOCV: 0.9031085&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R-square for testing with k-folds CV: 0.9031085&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R-square for testing with repeated k-folds CV: 0.9031085&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Which one should be used based on RMSE value?
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;RMSE for training set: 3.08648&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RMSE for training with LOOCV: 3.201673&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RMSE for training with k-folds CV: 2.85133&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RMSE for training with repeated k-folds CV: 2.975392&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RMSE for testing set: 2.279303&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RMSE for testing with LOOCV: 2.244232&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RMSE for testing with k-folds CV: 2.244232&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RMSE for testing with repeated k-folds CV: 2.244232&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Let’s Repeat the Same Process for the Multiple Linear Regression Model
&lt;/h1&gt;

&lt;p&gt;Multiple linear regression is an extension of simple linear regression. It has more than one (two or more) independent variables and one continuous dependent variable. It is a supervised learning method. All the assumptions of simple linear regression also apply here, with one additional condition.&lt;/p&gt;

&lt;p&gt;Multicollinearity must not be present, i.e., correlations between the independent variables must not be “high”.&lt;/p&gt;

&lt;h1&gt;
  
  
  Fitting Multi Linear Regression Model
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mlr &amp;lt;- lm(mpg~., data = mtcars)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s check the variance inflation factor (VIF) of &lt;code&gt;mlr&lt;/code&gt;. The variance inflation factor is the ratio of the variance of a parameter estimate in a model with many other terms to its variance in a model with only that term. It is available in the car package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(car)


## Loading required package: carData


vif(mlr)


## cyl disp hp drat wt qsec vs am 
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487 
## gear carb 
## 5.357452 7.908747

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to drop the independent variable with the highest VIF and run the model again until all VIFs are below 10.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Removing “disp” variable:
mlr1 &amp;lt;- lm(mpg ~ cyl+hp+drat+wt+qsec+vs+am+gear+carb, data = mtcars)
vif(mlr)


## cyl disp hp drat wt qsec vs am 
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487 
## gear carb 
## 5.357452 7.908747


#Removing “cyl” variable:
mlr2 &amp;lt;- lm(mpg ~ 
hp+drat+wt+qsec+vs+am+gear+carb, data = mtcars)
summary(mlr1)


## 
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt + qsec + vs + am + gear + 
## carb, data = mtcars)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -3.7863 -1.4055 -0.2635 1.2029 4.4753 
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)  
## (Intercept) 12.55052 18.52585 0.677 0.5052  
## cyl 0.09627 0.99715 0.097 0.9240  
## hp -0.01295 0.01834 -0.706 0.4876  
## drat 0.92864 1.60794 0.578 0.5694  
## wt -2.62694 1.19800 -2.193 0.0392 *
## qsec 0.66523 0.69335 0.959 0.3478  
## vs 0.16035 2.07277 0.077 0.9390  
## am 2.47882 2.03513 1.218 0.2361  
## gear 0.74300 1.47360 0.504 0.6191  
## carb -0.61686 0.60566 -1.018 0.3195  
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.623 on 22 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8105 
## F-statistic: 15.73 on 9 and 22 DF, p-value: 1.183e-07


vif(mlr2)


## hp drat wt qsec vs am gear carb 
## 6.015788 3.111501 6.051127 5.918682 4.270956 4.285815 4.690187 4.290468

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now all VIFs are less than 10, so the data are ready for fitting different prediction models.&lt;/p&gt;

&lt;h1&gt;
  
  
  Leave-One-Out Cross-Validation approach on Multi Regression Model.
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Leave one out CV
library(caret)
# Define training control
train.control &amp;lt;- trainControl(method = "LOOCV")
# Train the model
mlr &amp;lt;- train(mpg ~ hp+drat+wt+qsec+vs+am+gear+carb, data = mtcars, method = "lm",
trControl = train.control)
# Summarize 
summary(mlr)


## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -3.8187 -1.3903 -0.3045 1.2269 4.5183 
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)  
## (Intercept) 13.80810 12.88582 1.072 0.2950  
## hp -0.01225 0.01649 -0.743 0.4650  
## drat 0.88894 1.52061 0.585 0.5645  
## wt -2.60968 1.15878 -2.252 0.0342 *
## qsec 0.63983 0.62752 1.020 0.3185  
## vs 0.08786 1.88992 0.046 0.9633  
## am 2.42418 1.91227 1.268 0.2176  
## gear 0.69390 1.35294 0.513 0.6129  
## carb -0.61286 0.59109 -1.037 0.3106  
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8187 
## F-statistic: 18.5 on 8 and 23 DF, p-value: 2.627e-08

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We got an R-squared value of 86.55% and a residual standard error of 2.566 on 23 degrees of freedom.&lt;/p&gt;

&lt;h1&gt;
  
  
  Let’s fit the model using K-folds Cross-Validation approach on Multi Linear Regression Model.
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#K- folds Cross- Validation
library(caret)
# Define training control
train.control &amp;lt;- trainControl(method = "cv", number = 10)
# Train the model
mlr1&amp;lt;- train(mpg ~ hp+drat+wt+qsec+vs+am+gear+carb, data = mtcars, method = "lm",
trControl = train.control)
# Summarize 
summary(mlr1)


## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -3.8187 -1.3903 -0.3045 1.2269 4.5183 
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)  
## (Intercept) 13.80810 12.88582 1.072 0.2950  
## hp -0.01225 0.01649 -0.743 0.4650  
## drat 0.88894 1.52061 0.585 0.5645  
## wt -2.60968 1.15878 -2.252 0.0342 *
## qsec 0.63983 0.62752 1.020 0.3185  
## vs 0.08786 1.88992 0.046 0.9633  
## am 2.42418 1.91227 1.268 0.2176  
## gear 0.69390 1.35294 0.513 0.6129  
## carb -0.61286 0.59109 -1.037 0.3106  
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8187 
## F-statistic: 18.5 on 8 and 23 DF, p-value: 2.627e-08

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, we got an R-squared value of 86.55%; similarly, the residual standard error is 2.566.&lt;/p&gt;

&lt;h1&gt;
  
  
  Fit the model using Repeated K-folds Cross-Validation approach
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;set.seed(224)
# Repeated K- folds Cross- Validation
library(caret)
# Define training control
train.control &amp;lt;- trainControl(method = "repeatedcv", 
number = 10, repeats = 3)
# Train the model
mlr2&amp;lt;- train(mpg ~ hp+drat+wt+qsec+vs+am+gear+carb, data = mtcars, method = "lm",
trControl = train.control)
# Summarize 
summary(mlr2)


## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -3.8187 -1.3903 -0.3045 1.2269 4.5183 
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)  
## (Intercept) 13.80810 12.88582 1.072 0.2950  
## hp -0.01225 0.01649 -0.743 0.4650  
## drat 0.88894 1.52061 0.585 0.5645  
## wt -2.60968 1.15878 -2.252 0.0342 *
## qsec 0.63983 0.62752 1.020 0.3185  
## vs 0.08786 1.88992 0.046 0.9633  
## am 2.42418 1.91227 1.268 0.2176  
## gear 0.69390 1.35294 0.513 0.6129  
## carb -0.61286 0.59109 -1.037 0.3106  
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8187 
## F-statistic: 18.5 on 8 and 23 DF, p-value: 2.627e-08

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We got an R-squared value of 86.55% and a residual standard error of 2.566.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;             Than you for Reading

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>R Exercise: Polynomial Regression Model in R</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Fri, 15 Jul 2022 15:48:05 +0000</pubDate>
      <link>https://dev.to/iamdurga/r-exercise-polynomial-regression-model-in-r-3dfh</link>
      <guid>https://dev.to/iamdurga/r-exercise-polynomial-regression-model-in-r-3dfh</guid>
      <description>&lt;h1&gt;
  
  
  Polynomial Regression
&lt;/h1&gt;

&lt;p&gt;Curvilinear regression and curve fitting are other terms for the same thing. It is used when a scatterplot shows a non-linear relationship. It is most typically employed with time-series data, but it can be applied in a variety of other situations.&lt;/p&gt;
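A minimal sketch of fitting such a model with lm() and poly() (toy data of my own, not the Covid data used below):

```r
# Quadratic polynomial regression on simulated curvilinear data.
set.seed(42)
x <- 1:50
y <- 2 + 0.5 * x + 0.03 * x^2 + rnorm(50, 0, 5)  # non-linear trend plus noise
fit <- lm(y ~ poly(x, 2, raw = TRUE))            # fit a degree-2 polynomial
coef(fit)                                        # intercept, linear and quadratic terms
```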

&lt;h2&gt;
  
  
  Let’s use the Nepal Covid data and fit a polynomial models on Covid deaths using R
&lt;/h2&gt;

&lt;p&gt;To do this, first import the Excel file into RStudio using the &lt;code&gt;readxl&lt;/code&gt; library, like below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(readxl)
data &amp;lt;- read_excel("F:/MDS-Private-Study-Materials/First Semester/Statistical Computing with R/Assignments/Data/covid_tbl_final.xlsx")
head(data)


## # A tibble: 6 x 14
## SN Date Confirmed_cases_~ Confirmed_cases~ `Confirmed _case~
## &amp;lt;dbl&amp;gt; &amp;lt;dttm&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 1 2020-01-23 00:00:00 1 1 1
## 2 2 2020-01-24 00:00:00 1 0 1
## 3 3 2020-01-25 00:00:00 1 0 1
## 4 4 2020-01-26 00:00:00 1 0 1
## 5 5 2020-01-27 00:00:00 1 0 1
## 6 6 2020-01-28 00:00:00 1 0 1
## # ... with 9 more variables: Recoveries_total &amp;lt;dbl&amp;gt;, Recoveries_daily &amp;lt;dbl&amp;gt;,
## # Deaths_total &amp;lt;dbl&amp;gt;, Deaths_daily &amp;lt;dbl&amp;gt;, RT-PCR_tests_total &amp;lt;dbl&amp;gt;,
## # RT-PCR_tests_daily &amp;lt;dbl&amp;gt;, Test_positivity_rate &amp;lt;dbl&amp;gt;, Recovery_rate &amp;lt;dbl&amp;gt;,
## # Case_fatality_rate &amp;lt;dbl&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;head()&lt;/code&gt; function returns the top 6 rows of the dataframe along with all columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;str(data)


## tibble [495 x 14] (S3: tbl_df/tbl/data.frame)
## $ SN : num [1:495] 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : POSIXct[1:495], format: "2020-01-23" "2020-01-24" ...
## $ Confirmed_cases_total : num [1:495] 1 1 1 1 1 1 1 1 1 1 ...
## $ Confirmed_cases_new : num [1:495] 1 0 0 0 0 0 0 0 0 0 ...
## $ Confirmed _cases_active: num [1:495] 1 1 1 1 1 1 0 0 0 0 ...
## $ Recoveries_total : num [1:495] 0 0 0 0 0 0 1 1 1 1 ...
## $ Recoveries_daily : num [1:495] 0 0 0 0 0 0 1 0 0 0 ...
## $ Deaths_total : num [1:495] 0 0 0 0 0 0 0 0 0 0 ...
## $ Deaths_daily : num [1:495] 0 0 0 0 0 0 0 0 0 0 ...
## $ RT-PCR_tests_total : num [1:495] NA NA NA NA NA 3 4 5 5 NA ...
## $ RT-PCR_tests_daily : num [1:495] NA NA NA NA NA NA 1 1 0 NA ...
## $ Test_positivity_rate : num [1:495] NA NA NA NA NA ...
## $ Recovery_rate : num [1:495] 0 0 0 0 0 0 100 100 100 100 ...
## $ Case_fatality_rate : num [1:495] 0 0 0 0 0 0 0 0 0 0 ...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;str()&lt;/code&gt; function shows each column’s data type. Confirmed_cases_total is numeric, as are the other count columns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let us plot the daily deaths by date and see what is causing the problem
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot(data$Date,data$Deaths_daily, main= "Daily Deaths:23jan 2020-31 may 2021 ",xlab = "Date",
  ylab = "Daily Deaths" )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EKzx_FFD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EKzx_FFD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-3-1.png" alt="" width="672" height="480"&gt;&lt;/a&gt;The problem is associated with the three outliers (all the missed deaths a priori added to the data on those 3 days!)&lt;/p&gt;

&lt;h2&gt;
  
  
  Let us plot the cumulative deaths again before these outliers i.e. till 23 Feb 2021
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot.data &amp;lt;- data[data$SN &amp;lt;= 398,]
plot(plot.data$Date, plot.data$Deaths_total,
     main= "Daily Covid Deaths,Nepal:23 jan-23 feb2021",
     xlab= "Date",
     ylab= "Daily Deaths")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rOVRPelg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rOVRPelg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-4-1.png" alt="" width="672" height="480"&gt;&lt;/a&gt;As a result, we eliminate outliers. Our data is now ready to be fitted into a model. Let’s divide our model into a train set and a test set in the proportions of 70% to 30%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;set.seed(132)
ind &amp;lt;- sample(2, nrow(plot.data), replace = T, prob = c(0.7,0.3))
train_data &amp;lt;- plot.data[ind==1,]
test_data &amp;lt;- plot.data[ind==2,]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;set.seed()&lt;/code&gt; function in R is used to make results reproducible, i.e. it produces the same sample again and again. When we generate random numbers without calling &lt;code&gt;set.seed()&lt;/code&gt;, each execution produces a different sample.&lt;/p&gt;
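
&lt;p&gt;As a quick illustration (a toy sketch, not part of the original analysis), resetting the seed before each draw yields an identical sample:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;set.seed(42)
sample(1:10, 3)  # three values drawn at random

set.seed(42)
sample(1:10, 3)  # identical to the first draw, because the seed was reset

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;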

&lt;h2&gt;
  
  
  Let us fit a linear model to the filtered data (plot.data) using SN as the time variable
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(caret)


## Warning: package 'caret' was built under R version 4.1.2

## Loading required package: ggplot2

## Loading required package: lattice


lm1 &amp;lt;- lm(plot.data$Deaths_total~plot.data$SN, 
         data= train_data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After loading the caret package, we fit a linear model to the covid data. We can later use the test set to evaluate predictions.&lt;/p&gt;

&lt;p&gt;Before calculating the linear model summary, it is necessary to master some concepts in order to comprehend the summary.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Coefficient of Determination&lt;/code&gt; :&lt;/p&gt;

&lt;p&gt;The coefficient of determination (R-squared) is a statistical measure of how much of the variation in one variable can be explained by variation in another. The higher the value of R-squared, the better the model fits the data.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Residual Standard Error&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;The residual standard error is used to measure how well a regression model fits a dataset. The lower the residual standard error, the better the model.&lt;br&gt;
&lt;/p&gt;
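
&lt;p&gt;Both quantities can be computed by hand from a fitted model. As a sketch (using a simple &lt;code&gt;lm()&lt;/code&gt; on the built-in &lt;code&gt;mtcars&lt;/code&gt; data rather than the covid data), this shows where the numbers reported by &lt;code&gt;summary()&lt;/code&gt; come from:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fit &amp;lt;- lm(mpg ~ wt, data = mtcars)
res &amp;lt;- residuals(fit)
# R-squared: proportion of variance explained
r2 &amp;lt;- 1 - sum(res^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)
# Residual standard error: typical size of a residual
rse &amp;lt;- sqrt(sum(res^2) / df.residual(fit))
c(r2, rse)  # match summary(fit)$r.squared and summary(fit)$sigma

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;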

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(lm1)


## 
## Call:
## lm(formula = plot.data$Deaths_total ~ plot.data$SN, data = train_data)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -537.91 -344.76 22.38 351.50 582.90 
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept) -588.8326 35.1575 -16.75 &amp;lt;2e-16 ***
## plot.data$SN 5.9315 0.1527 38.84 &amp;lt;2e-16 ***
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 350 on 396 degrees of freedom
## Multiple R-squared: 0.7921, Adjusted R-squared: 0.7916 
## F-statistic: 1509 on 1 and 396 DF, p-value: &amp;lt; 2.2e-16

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we fit a linear model, we get an R-squared of 79.21%, which means the independent variable explains only 79.21% of the variance in the dependent variable. The residual standard error is 350 on 396 degrees of freedom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s plot the linear model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot(plot.data$SN, plot.data$Deaths_total, data= plot.data,
     main= "Daily Covid Deaths,Nepal:23 jan-23 feb2021",
     xlab= "Date",
     ylab= "Daily Deaths")
abline(lm(plot.data$Deaths_total~plot.data$SN,data= plot.data), col="red",lwd=2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ya1Oauwq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-8-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ya1Oauwq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-8-1.png" alt="" width="672" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let us fit a quadratic model to the filtered data
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;qlm &amp;lt;- lm(plot.data$Deaths_total~ poly(plot.data$SN,2), data= train_data)
summary(qlm)


## 
## Call:
## lm(formula = plot.data$Deaths_total ~ poly(plot.data$SN, 2), 
## data = train_data)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -422.04 -110.87 8.94 81.97 282.94 
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept) 594.495 6.763 87.90 &amp;lt;2e-16 ***
## poly(plot.data$SN, 2)1 13595.485 134.928 100.76 &amp;lt;2e-16 ***
## poly(plot.data$SN, 2)2 6428.710 134.928 47.65 &amp;lt;2e-16 ***
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.9 on 395 degrees of freedom
## Multiple R-squared: 0.9692, Adjusted R-squared: 0.969 
## F-statistic: 6211 on 2 and 395 DF, p-value: &amp;lt; 2.2e-16

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we obtain an R-squared of 96.92 percent, i.e. the model explains 96.92 percent of the variability in the dependent variable. Similarly, the residual standard error is 134.9 on 395 degrees of freedom. Compared to the linear model, the R-squared has increased and the error has decreased.&lt;/p&gt;
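
&lt;p&gt;Nested polynomial models can also be compared formally with an F-test via &lt;code&gt;anova()&lt;/code&gt;. A minimal sketch, assuming the models are refitted with regular formulas (column names plus a &lt;code&gt;data&lt;/code&gt; argument) so they refer to the same observations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;m1 &amp;lt;- lm(Deaths_total ~ SN, data = plot.data)
m2 &amp;lt;- lm(Deaths_total ~ poly(SN, 2), data = plot.data)
anova(m1, m2)  # a small p-value indicates the quadratic term significantly improves the fit

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;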

&lt;h2&gt;
  
  
  Let’s plot the quadratic model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot(plot.data$SN, plot.data$Deaths_total, data= plot.data,
     main= "Daily Covid Deaths,Nepal:23 jan-23 feb2021",
     xlab= "Date",
     ylab= "Daily Deaths")


## Warning in plot.window(...): "data" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter

## Warning in box(...): "data" is not a graphical parameter

## Warning in title(...): "data" is not a graphical parameter


lines(fitted(qlm)~SN, data=plot.data, col= "red",lwd=2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RBllzK6U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RBllzK6U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-10-1.png" alt="" width="672" height="480"&gt;&lt;/a&gt;The quadratic model fits the data better than the linear model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Fit a Cubic Model
&lt;/h2&gt;

&lt;p&gt;We fit the cubic model in the following way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clm &amp;lt;- lm(plot.data$Deaths_total~poly(SN,3), data= plot.data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s calculate the summary of the cubic model and observe what changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(clm)


## 
## Call:
## lm(formula = plot.data$Deaths_total ~ poly(SN, 3), data = plot.data)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -369.58 -123.49 12.82 99.36 267.65 
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept) 594.495 6.696 88.789 &amp;lt; 2e-16 ***
## poly(SN, 3)1 13595.485 133.576 101.781 &amp;lt; 2e-16 ***
## poly(SN, 3)2 6428.710 133.576 48.128 &amp;lt; 2e-16 ***
## poly(SN, 3)3 -401.539 133.576 -3.006 0.00282 ** 
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 133.6 on 394 degrees of freedom
## Multiple R-squared: 0.9699, Adjusted R-squared: 0.9696 
## F-statistic: 4228 on 3 and 394 DF, p-value: &amp;lt; 2.2e-16

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The R-square value is 96.99 percent, and the residual standard error is 133.6. When we compare the prior model to this one, we can immediately see the differences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Plot the Cubic Model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot(plot.data$SN, plot.data$Deaths_total, data= plot.data,
     main= "Daily Covid Deaths,Nepal:23 jan-23 feb2021",
     xlab= "Date",
     ylab= "Daily Deaths")


## Warning in plot.window(...): "data" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter

## Warning in box(...): "data" is not a graphical parameter

## Warning in title(...): "data" is not a graphical parameter


lines(fitted(clm)~plot.data$SN,data = plot.data, col= "red",lwd= 2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0juCSMP2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-13-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0juCSMP2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-13-1.png" alt="" width="672" height="480"&gt;&lt;/a&gt;From the figure we can see that the fitted curve follows the actual data more closely than in the quadratic case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Fit the Double Quadratic (Quartic) Model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dlm &amp;lt;- lm(plot.data$Deaths_total~poly(plot.data$SN,4))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s calculate the summary of it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(dlm)


## 
## Call:
## lm(formula = plot.data$Deaths_total ~ poly(plot.data$SN, 4))
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -105.44 -53.22 -12.50 53.61 159.13 
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept) 594.50 3.13 189.92 &amp;lt; 2e-16 ***
## poly(plot.data$SN, 4)1 13595.49 62.45 217.71 &amp;lt; 2e-16 ***
## poly(plot.data$SN, 4)2 6428.71 62.45 102.94 &amp;lt; 2e-16 ***
## poly(plot.data$SN, 4)3 -401.54 62.45 -6.43 3.71e-10 ***
## poly(plot.data$SN, 4)4 -2344.63 62.45 -37.55 &amp;lt; 2e-16 ***
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.45 on 393 degrees of freedom
## Multiple R-squared: 0.9934, Adjusted R-squared: 0.9934 
## F-statistic: 1.486e+04 on 4 and 393 DF, p-value: &amp;lt; 2.2e-16

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this scenario, the model explains 99.34 percent of the variability in the dependent variable. In addition, the residual standard error is 62.45, less than half that of the cubic model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Plot the Double Quadratic Model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot(plot.data$SN, plot.data$Deaths_total, data= plot.data,
     main= "Daily Covid Deaths,Nepal:23 jan-23 feb2021",
     xlab= "Date",
     ylab= "Daily Deaths")


## Warning in plot.window(...): "data" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter

## Warning in box(...): "data" is not a graphical parameter

## Warning in title(...): "data" is not a graphical parameter


lines(fitted(dlm)~plot.data$SN,data = plot.data, col= "red",lwd= 2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RUuSYWd_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-16-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RUuSYWd_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://iamdurga.github.io/assets/r_exercises/poly/unnamed-chunk-16-1.png" alt="" width="672" height="480"&gt;&lt;/a&gt;Here the fitted curve and the actual data nearly overlap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Fit a Fifth-Order Polynomial
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flm &amp;lt;- lm(plot.data$Deaths_total~poly(plot.data$SN,5),data= plot.data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s calculate the summary of flm to see the value of R square and residual standard error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(flm)


## 
## Call:
## lm(formula = plot.data$Deaths_total ~ poly(plot.data$SN, 5), 
## data = plot.data)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -77.300 -16.980 -3.571 19.199 140.089 
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept) 594.495 1.716 346.36 &amp;lt;2e-16 ***
## poly(plot.data$SN, 5)1 13595.485 34.242 397.04 &amp;lt;2e-16 ***
## poly(plot.data$SN, 5)2 6428.710 34.242 187.74 &amp;lt;2e-16 ***
## poly(plot.data$SN, 5)3 -401.539 34.242 -11.73 &amp;lt;2e-16 ***
## poly(plot.data$SN, 5)4 -2344.634 34.242 -68.47 &amp;lt;2e-16 ***
## poly(plot.data$SN, 5)5 -1035.863 34.242 -30.25 &amp;lt;2e-16 ***
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.24 on 392 degrees of freedom
## Multiple R-squared: 0.998, Adjusted R-squared: 0.998 
## F-statistic: 3.973e+04 on 5 and 392 DF, p-value: &amp;lt; 2.2e-16

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, the residual standard error is approximately half that of the double quadratic (quartic) model, and the R-squared is 99.8 percent. This model fits the training data better than the previous one since we used a higher-order polynomial; within this dataset, each increase in polynomial order has reduced the error and improved the fit.&lt;/p&gt;
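
&lt;p&gt;One caveat: in-sample error always shrinks as the polynomial order grows, so it is worth checking the error on the held-out test set as well. A hedged sketch (note that for &lt;code&gt;predict()&lt;/code&gt; to work on new data, the model must be fitted with a regular formula such as &lt;code&gt;Deaths_total ~ poly(SN, 5)&lt;/code&gt; rather than with &lt;code&gt;plot.data$&lt;/code&gt; references):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;m5 &amp;lt;- lm(Deaths_total ~ poly(SN, 5), data = train_data)
pred &amp;lt;- predict(m5, newdata = test_data)
rmse &amp;lt;- sqrt(mean((test_data$Deaths_total - pred)^2))
rmse  # test-set error; a large gap from the training error would suggest overfitting

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;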

</description>
    </item>
    <item>
      <title>R Exercise: Different Hypothesis Testing in R</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Fri, 15 Jul 2022 15:46:33 +0000</pubDate>
      <link>https://dev.to/iamdurga/r-exercise-different-hypothesis-testing-in-r-31nh</link>
      <guid>https://dev.to/iamdurga/r-exercise-different-hypothesis-testing-in-r-31nh</guid>
      <description>&lt;h2&gt;
  
  
  What is Hypothesis Testing
&lt;/h2&gt;

&lt;p&gt;It is a type of inferential statistics that involves extrapolating results from a sample (random) to the entire population. It’s used to make decisions based on statistical tests and models that use the p-value, also known as the Type I error or alpha error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type I Error&lt;/strong&gt; : When we reject a true null hypothesis, it is called a &lt;code&gt;type I error&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type II Error&lt;/strong&gt; : When we fail to reject a false null hypothesis, it is called a type II error.&lt;/p&gt;

&lt;p&gt;It can be done using parametric or non-parametric methods/models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parametric&lt;/strong&gt; : They have certain assumptions about the data (model) and/or errors that must be validated before the results can be accepted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-parametric&lt;/strong&gt; : They make no assumptions about the data distribution (model) or the errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why to use parametric test?
&lt;/h2&gt;

&lt;p&gt;Because they are based on the mean, standard deviation, and the normal distribution, parametric tests are regarded as “more powerful” than non-parametric tests/models. Non-parametric tests are based on the median, IQR, and non-normal distributions, and are therefore deemed “less powerful” than parametric tests/models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Statistical Hypothesis
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Null Hypothesis&lt;/code&gt; : It is also known as the hypothesis of no difference. &lt;code&gt;Alternative Hypothesis&lt;/code&gt; : It is complementary to the null hypothesis and is also known as the research hypothesis.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to accept Null or Alternative Hypothesis
&lt;/h2&gt;

&lt;p&gt;To accept (fail to reject) the null hypothesis in a parametric or non-parametric test, the p-value must be &amp;gt; 0.05 (goodness-of-fit tests).&lt;/p&gt;

&lt;p&gt;To accept the alternative hypothesis in a parametric or non-parametric test (research hypothesis tests!), the p-value must be less than 0.05.&lt;/p&gt;
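
&lt;p&gt;The decision rule itself is mechanical and can be expressed as a toy snippet (the p-value here is a made-up placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;p_value &amp;lt;- 0.03  # hypothetical p-value from some test
if (p_value &amp;lt; 0.05) {
  print("Reject the null hypothesis")
} else {
  print("Fail to reject the null hypothesis")
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;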

&lt;h1&gt;
  
  
  Some Commonly Used Parametric Test Using R
&lt;/h1&gt;

&lt;h2&gt;
  
  
  One Sample Z Test On Mtcars Data
&lt;/h2&gt;

&lt;p&gt;In this blog I am only going to show how to perform a one-sample z-test in R, without explaining what the z-test is or how it works, because I already explained that in a past blog.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# we need to define parameter
muO &amp;lt;- 20
sigma &amp;lt;- 6
xbar &amp;lt;- mean(mtcars$mpg)
n &amp;lt;- length(mtcars$mpg)
z &amp;lt;-sqrt(n)*(xbar-muO)/sigma
p_value&amp;lt;-2*pnorm(-abs(z))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s check z value and p value,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z


## [1] 0.08544207

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hence, we found the value of z to be 0.08544207.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;p_value


## [1] 0.9319099

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We found a p-value of 0.9319099, which is &amp;gt; 0.05; hence we accept (fail to reject) the null hypothesis, i.e. the sample mean and the population mean are equal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why there is no one sample z-test in base R package?
&lt;/h2&gt;

&lt;p&gt;Because the t-distribution behaves like the z-distribution for n &amp;gt;= 30, the t-test can be employed for both small and large samples. Thus, we don’t need a one-sample z-test in R!&lt;/p&gt;
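
&lt;p&gt;If a dedicated z-test is still wanted, add-on packages provide one. For example, the &lt;code&gt;BSDA&lt;/code&gt; package offers &lt;code&gt;z.test()&lt;/code&gt; (a sketch; check the package documentation for the exact arguments):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# install.packages("BSDA")  # if not already installed
library(BSDA)
z.test(mtcars$mpg, mu = 20, sigma.x = 6)  # should agree with the manual calculation above

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;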

&lt;h1&gt;
  
  
  One Sample t-test: We can work for small sample as well as for large sample
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t.test(mtcars$mpg, mu =20)


## 
## One Sample t-test
## 
## data: mtcars$mpg
## t = 0.08506, df = 31, p-value = 0.9328
## alternative hypothesis: true mean is not equal to 20
## 95 percent confidence interval:
## 17.91768 22.26357
## sample estimates:
## mean of x 
## 20.09062

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hence we obtained a p-value of 0.9328, which means we do not reject the null hypothesis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Sample T-test
&lt;/h2&gt;

&lt;p&gt;It is used to compare the means of a dependent variable across the two categories of a grouping (independent) variable. For instance, we can compare exam scores (dependent variable) between male and female groups of students!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Assumptions&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For each category, the dependent variable must follow the normal distribution (Test of normality-GOF)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The variance is homogeneous (i.e. equal) across independent variable categories (Test of equal variance-GOF)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What to do if variance across independent variable categories not equal
&lt;/h2&gt;

&lt;p&gt;In this case we use the Welch test.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Assumption&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For each category, the dependent variable must follow the normal distribution (Test of normality-GOF)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Variance across independent variable categories are not homogenous i.e; not equal.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
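
&lt;p&gt;In R, the Welch test is in fact the default behaviour of &lt;code&gt;t.test()&lt;/code&gt;; passing &lt;code&gt;var.equal = TRUE&lt;/code&gt; is what switches to the classical Student t-test:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t.test(mpg ~ am, data = mtcars)                    # Welch two-sample t-test (default)
t.test(mpg ~ am, var.equal = TRUE, data = mtcars)  # classical Student t-test

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;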

&lt;h2&gt;
  
  
  Let’s do a normality test on the mtcars data
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with(mtcars, shapiro.test(mpg[am == 0]))


## 
## Shapiro-Wilk normality test
## 
## data: mpg[am == 0]
## W = 0.97677, p-value = 0.8987

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the p-value is 0.8987. Hence we do not reject the null hypothesis, which means this group follows a normal distribution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with(mtcars, shapiro.test(mpg[am == 1]))


## 
## Shapiro-Wilk normality test
## 
## data: mpg[am == 1]
## W = 0.9458, p-value = 0.5363

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also follows a normal distribution. Hence the first condition is satisfied, i.e. the dependent variable mpg follows a normal distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Variance Check
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;var.test(mpg ~ am, data = mtcars)


## 
## F test to compare two variances
## 
## data: mpg by am
## F = 0.38656, num df = 18, denom df = 12, p-value = 0.06691
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1243721 1.0703429
## sample estimates:
## ratio of variances 
## 0.3865615

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the p-value is 0.06691, which is greater than 0.05. Hence we can say the variances across the independent variable categories are the same. Now we can use the two-sample Student t-test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t.test(mpg ~ am, var.equal= T, data = mtcars)


## 
## Two Sample t-test
## 
## data: mpg by am
## t = -4.1061, df = 30, p-value = 0.000285
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -10.84837 -3.64151
## sample estimates:
## mean in group 0 mean in group 1 
## 17.14737 24.39231

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we see a p-value of 0.000285, which is less than 0.05. Hence we reject the null hypothesis, which means mileage (mpg) is statistically different between cars with automatic and manual transmission systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s check two sample student t-test result with simple linear regression model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(lm(mpg ~ am, data = mtcars))


## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -9.3923 -3.0923 -0.2974 3.2439 9.5077 
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385 
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This difference is statistically significant, and the p-value is the same as that given by the two-sample t-test.&lt;/p&gt;

&lt;h1&gt;
  
  
  What test should we use to compare the means of more than two samples?
&lt;/h1&gt;

&lt;p&gt;If we need to compare the means of more than two samples, we use the 1-way ANOVA test.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Assumption&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Dependent variable must be “normally distributed”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Variance across categories must be same&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  1-way ANOVA assumptions checks
&lt;/h1&gt;

&lt;p&gt;Normality by categories&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with(mtcars, shapiro.test(mpg[gear == 3]))


## 
## Shapiro-Wilk normality test
## 
## data: mpg[gear == 3]
## W = 0.95833, p-value = 0.6634

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Category 3 follows normal distribution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with(mtcars, shapiro.test(mpg[gear == 4]))


## 
## Shapiro-Wilk normality test
## 
## data: mpg[gear == 4]
## W = 0.90908, p-value = 0.2076

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Category 4 also follows normal distribution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with(mtcars, shapiro.test(mpg[gear == 5]))


## 
## Shapiro-Wilk normality test
## 
## data: mpg[gear == 5]
## W = 0.90897, p-value = 0.4614

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Category 5 also follows a normal distribution. So the dependent variable follows a normal distribution in all three categories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s do variance test
&lt;/h2&gt;

&lt;p&gt;In the case of more than two samples we do not use &lt;code&gt;var.test()&lt;/code&gt;. Instead we use &lt;code&gt;leveneTest()&lt;/code&gt;, available in the car package. Before doing this we need to convert our independent variable into a factor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(car)


## Loading required package: carData


leveneTest(mpg ~ as.factor(gear), data=mtcars)


## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(&amp;gt;F)
## group 2 1.4886 0.2424
## 29

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we find a p-value of 0.2424, which is greater than 0.05. Hence the variances across categories are the same, so we can now use the classical one-way ANOVA test.&lt;/p&gt;
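
&lt;p&gt;One caution before running it: with a numeric &lt;code&gt;gear&lt;/code&gt; column, &lt;code&gt;aov(mpg ~ gear)&lt;/code&gt; fits a regression slope (1 degree of freedom) rather than comparing the three group means. For a genuine between-groups ANOVA, convert the grouping variable to a factor first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(aov(mpg ~ as.factor(gear), data = mtcars))  # 2 df for gear: a true 3-group comparison

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;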

&lt;h2&gt;
  
  
  1-Way Classical ANOVA test
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(aov(mpg ~ gear, data = mtcars))


## Df Sum Sq Mean Sq F value Pr(&amp;gt;F)   
## gear 1 259.7 259.75 8.995 0.0054 **
## Residuals 30 866.3 28.88                  
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We find a p-value less than 0.05, hence we reject the null hypothesis, which means the sample means are not all equal. This means a post-hoc test (pairwise comparison) is required. If the alternative hypothesis is accepted we need to do a post-hoc test; for classical 1-way ANOVA the TukeyHSD post-hoc test is best. Let’s use it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TukeyHSD(aov(mpg ~ as.factor(gear), data = mtcars))


## Tukey multiple comparisons of means
## 95% family-wise confidence level
## 
## Fit: aov(formula = mpg ~ as.factor(gear), data = mtcars)
## 
## $`as.factor(gear)`
## diff lwr upr p adj
## 4-3 8.426667 3.9234704 12.929863 0.0002088
## 5-3 5.273333 -0.7309284 11.277595 0.0937176
## 5-4 -3.153333 -9.3423846 3.035718 0.4295874

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Let’s check this result with simple linear model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(lm(mpg ~ gear, data = mtcars))


## 
## Call:
## lm(formula = mpg ~ gear, data = mtcars)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -10.240 -2.793 -0.205 2.126 12.583 
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)   
## (Intercept) 5.623 4.916 1.144 0.2618   
## gear 3.923 1.308 2.999 0.0054 **
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.374 on 30 degrees of freedom
## Multiple R-squared: 0.2307, Adjusted R-squared: 0.205 
## F-statistic: 8.995 on 1 and 30 DF, p-value: 0.005401


pairwise.t.test(mtcars$mpg, mtcars$gear, p.adj= "none")


## 
## Pairwise comparisons using t tests with pooled SD 
## 
## data: mtcars$mpg and mtcars$gear 
## 
## 3 4    
## 4 7.3e-05 -    
## 5 0.038 0.218
## 
## P value adjustment method: none

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When gear is treated as a factor, the gear = 3 category is omitted from the result because R automatically creates dummy variables for the 3 categories of the gear variable (3, 4 and 5), uses only the last two of them in the model, and takes the first one as the reference level.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Making a Stack Data Type in Python</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Thu, 14 Jul 2022 07:22:59 +0000</pubDate>
      <link>https://dev.to/iamdurga/making-a-stack-data-type-in-python-2lci</link>
      <guid>https://dev.to/iamdurga/making-a-stack-data-type-in-python-2lci</guid>
      <description>&lt;h1&gt;
  
  
  Making a Stack Data Type in Python
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A stack is one of the primitive data structures that we always have to study before diving into Data Structures and Algorithms. It is an example of an ADT (Abstract Data Type) &lt;a href="https://en.wikipedia.org/wiki/Abstract_data_type" rel="noopener noreferrer"&gt;where operations are predefined&lt;/a&gt;. There are other ADTs as well, such as Queue and List.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operations
&lt;/h2&gt;

&lt;p&gt;For any data type, the most common operations include inserting data, removing data and retrieving data. A simple stack supports 3 operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Push&lt;/strong&gt; : To insert a data on the top of a stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pop&lt;/strong&gt; : To remove a data from the top of a stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top&lt;/strong&gt; : To retrieve a top element data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A stack operates in a LIFO (Last In, First Out) way. This means that at any time the pointer sits at the top of the stack, and we are only allowed to operate on the data the pointer points to. If a stack uses an array its size is fixed, but if it uses a list its size can vary.&lt;/p&gt;
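
&lt;p&gt;Python’s built-in &lt;code&gt;list&lt;/code&gt; already behaves this way, which gives a quick feel for LIFO order before we build our own class:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stack = []
stack.append(4)    # push
stack.append(5)
stack.append(6)
print(stack[-1])   # 6 -- top element, without removing it
print(stack.pop()) # 6 -- last in, first out
print(stack.pop()) # 5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;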

&lt;h3&gt;
  
  
  Push Operation
&lt;/h3&gt;

&lt;p&gt;Let’s assume we are using an array of size 6 for the stack, with data &lt;code&gt;[4 5 6 8 5 2]&lt;/code&gt;. Initially the stack is empty and the pointer points at the bottom, which is the top of the empty stack. In the first step, to push a value, we place 4 at the 0th position; the pointer then moves one step upwards. In the next step, the next value is inserted at the 1st position, and so on until the stack is full.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fdsa%2Finsert_stack.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2Fdsa%2Finsert_stack.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s write it in Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Stack:
    def __init__ (self, size):
        self.size = size
        self.storage = ["~"]*size
        self.pointer = 0
        print(f"New Stack: {self.storage}")
        print(f"Pointer at {self.pointer}.\n")

    def push(self, x):
        self.storage[self.pointer] = x
        print(f"New Stack: {self.storage}")
        print(f"Pointer at {self.pointer+1}.\n")

        self.pointer+=1




data = [4, 5, 6, 8, 5, 2]
stack = Stack(size=6)

for x in data:
    stack.push(x)


New Stack: ['~', '~', '~', '~', '~', '~']
Pointer at 0.

New Stack: [4, '~', '~', '~', '~', '~']
Pointer at 1.

New Stack: [4, 5, '~', '~', '~', '~']
Pointer at 2.

New Stack: [4, 5, 6, '~', '~', '~']
Pointer at 3.

New Stack: [4, 5, 6, 8, '~', '~']
Pointer at 4.

New Stack: [4, 5, 6, 8, 5, '~']
Pointer at 5.

New Stack: [4, 5, 6, 8, 5, 2]
Pointer at 6.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have inserted the data into our stack, but what if we insert more data than the stack's size allows? In that case, an overflow happens. In the example above we used a list as storage, so Python will also raise an error if we try to insert beyond its bounds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stack.push(0)


---------------------------------------------------------------------------

IndexError Traceback (most recent call last)

&amp;lt;ipython-input-23-c549502547a3&amp;gt; in &amp;lt;module&amp;gt;
----&amp;gt; 1 stack.push(0)

&amp;lt;ipython-input-21-dfbe390820f6&amp;gt; in push(self, x)
      8 
      9 def push(self, x):
---&amp;gt; 10 self.storage[self.pointer] = x
     11 print(f"New Stack: {self.storage}")
     12 print(f"Pointer at {self.pointer+1}.\n")

IndexError: list assignment index out of range

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is expected, but let's make the error message a little more descriptive for our case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Stack:
    def __init__ (self, size):
        self.size = size
        self.storage = ["~"]*size
        self.pointer = 0
        print(f"New Stack: {self.storage}")
        print(f"Pointer at {self.pointer}.\n")

    def push(self, x):
        try:
            self.storage[self.pointer] = x
            print(f"New Stack: {self.storage}")
            print(f"Pointer at {self.pointer+1}.\n")

            self.pointer+=1
        except IndexError as e:
            print(f"Overflow Occured at pointer: {self.pointer}")



data = [4, 5, 6, 8, 5, 2, 0]
stack = Stack(size=6)

for x in data:
    stack.push(x)


New Stack: ['~', '~', '~', '~', '~', '~']
Pointer at 0.

New Stack: [4, '~', '~', '~', '~', '~']
Pointer at 1.

New Stack: [4, 5, '~', '~', '~', '~']
Pointer at 2.

New Stack: [4, 5, 6, '~', '~', '~']
Pointer at 3.

New Stack: [4, 5, 6, 8, '~', '~']
Pointer at 4.

New Stack: [4, 5, 6, 8, 5, '~']
Pointer at 5.

New Stack: [4, 5, 6, 8, 5, 2]
Pointer at 6.

Overflow Occured at pointer: 6

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pop Operation
&lt;/h3&gt;

&lt;p&gt;Now that our stack is fully filled, let's remove values from it. The pop operation again works on the data at the pointer's position.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Stack:
    def __init__ (self, size):
        self.size = size
        self.storage = ["~"]*size
        self.pointer = 0
        print(f"New Stack: {self.storage}")
        print(f"Pointer at {self.pointer}.\n")

    def push(self, x):
        try:
            self.storage[self.pointer] = x
            print(f"New Stack: {self.storage}")
            print(f"Pointer at {self.pointer+1}.\n")

            self.pointer+=1
        except IndexError as e:
            print(f"Overflow Occured at pointer: {self.pointer} \n")

    def pop(self):
        self.storage = self.storage[:-1]

        print(f"New Stack: {self.storage}")
        print(f"Pointer at {self.pointer-1}.\n")

        self.pointer-=1


data = [4, 5, 6, 8, 5, 2, 0]
stack = Stack(size=6)

for x in data:
    stack.push(x)
stack.pop()


New Stack: ['~', '~', '~', '~', '~', '~']
Pointer at 0.

New Stack: [4, '~', '~', '~', '~', '~']
Pointer at 1.

New Stack: [4, 5, '~', '~', '~', '~']
Pointer at 2.

New Stack: [4, 5, 6, '~', '~', '~']
Pointer at 3.

New Stack: [4, 5, 6, 8, '~', '~']
Pointer at 4.

New Stack: [4, 5, 6, 8, 5, '~']
Pointer at 5.

New Stack: [4, 5, 6, 8, 5, 2]
Pointer at 6.

Overflow Occured at pointer: 6 

New Stack: [4, 5, 6, 8, 5]
Pointer at 5.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
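
&lt;p&gt;One detail worth noting: the pop above shrinks the storage list by slicing, so the storage no longer keeps its fixed size. A hypothetical alternative for a fixed-size, array-style stack (a sketch only, not the class used in this post) overwrites the freed slot with the placeholder instead:&lt;br&gt;
&lt;/p&gt;

```python
# Sketch of a fixed-size, array-style stack: pop restores the "~"
# placeholder instead of shrinking the storage list.
class FixedStack:
    def __init__(self, size):
        self.size = size
        self.storage = ["~"] * size
        self.pointer = 0

    def push(self, x):
        self.storage[self.pointer] = x
        self.pointer += 1

    def pop(self):
        self.pointer -= 1
        value = self.storage[self.pointer]
        self.storage[self.pointer] = "~"  # storage length stays fixed
        return value

s = FixedStack(size=3)
s.push(4)
s.push(5)
print(s.pop())    # 5
print(s.storage)  # [4, '~', '~'] -- still length 3
```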



&lt;p&gt;Now that we have implemented the pop operation, what happens if our stack is empty and we try to remove an element? That case is a stack underflow. Let's write that as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for i in range(len(data)):
    stack.pop()


New Stack: [4, 5, 6, 8]
Pointer at 4.

New Stack: [4, 5, 6]
Pointer at 3.

New Stack: [4, 5]
Pointer at 2.

New Stack: [4]
Pointer at 1.

New Stack: []
Pointer at 0.

New Stack: []
Pointer at -1.

New Stack: []
Pointer at -2.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, no error is shown when the pointer becomes negative. Let's say that if the pointer is already at 0 and we try to remove an element, it should be an error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Stack:
    def __init__ (self, size):
        self.size = size
        self.storage = ["~"]*size
        self.pointer = 0
        print(f"New Stack: {self.storage}")
        print(f"Pointer at {self.pointer}.\n")

    def push(self, x):
        try:
            self.storage[self.pointer] = x
            print(f"New Stack: {self.storage}")
            print(f"Pointer at {self.pointer+1}.\n")

            self.pointer+=1
        except IndexError as e:
            print(f"Overflow Occured at pointer: {self.pointer} \n")

    def pop(self):
        if self.pointer&amp;lt;1:
            print("Stack Underflow occured.")
        else:
            self.storage = self.storage[:-1]
            print(f"New Stack: {self.storage}")
            print(f"Pointer at {self.pointer-1}.\n")

            self.pointer-=1



data = [4, 5, 6, 8, 5, 2, 0]
stack = Stack(size=6)

for x in data:
    stack.push(x)
stack.pop()

for i in range(len(data)):
    stack.pop()


New Stack: ['~', '~', '~', '~', '~', '~']
Pointer at 0.

New Stack: [4, '~', '~', '~', '~', '~']
Pointer at 1.

New Stack: [4, 5, '~', '~', '~', '~']
Pointer at 2.

New Stack: [4, 5, 6, '~', '~', '~']
Pointer at 3.

New Stack: [4, 5, 6, 8, '~', '~']
Pointer at 4.

New Stack: [4, 5, 6, 8, 5, '~']
Pointer at 5.

New Stack: [4, 5, 6, 8, 5, 2]
Pointer at 6.

Overflow Occured at pointer: 6 

New Stack: [4, 5, 6, 8, 5]
Pointer at 5.

New Stack: [4, 5, 6, 8]
Pointer at 4.

New Stack: [4, 5, 6]
Pointer at 3.

New Stack: [4, 5]
Pointer at 2.

New Stack: [4]
Pointer at 1.

New Stack: []
Pointer at 0.

Stack Underflow occured.
Stack Underflow occured.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's tweak it a little more to make the output more readable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Stack:
    def __init__ (self, size):
        self.size = size
        self.storage = ["~"]*size
        self.pointer = 0
        print(f"New Stack: {self.storage}")
        print(f"Pointer at {self.pointer}.\n")

    def push(self, x):
        print("=" * 20+" Push Operation "+"=" * 20)
        try:
            self.storage[self.pointer] = x
            print(f"New Stack: {self.storage}")
            print(f"Pointer at {self.pointer+1}.")

            self.pointer+=1
        except IndexError as e:
            print(f"Overflow Occured at pointer: {self.pointer}")
        print("="*55 + "\n")

    def pop(self):
        print("=" * 20+" Pop Operation "+"=" * 20)
        if self.pointer&amp;lt;1:
            print("Stack Underflow occured.")
        else:
            self.storage = self.storage[:-1]
            print(f"New Stack: {self.storage}")
            print(f"Pointer at {self.pointer-1}.")

            self.pointer-=1
        print("="*55 + "\n")



data = [4, 5, 6, 8, 5, 2, 0]
stack = Stack(size=6)

for x in data:
    stack.push(x)
stack.pop()

for i in range(len(data)):
    stack.pop()


New Stack: ['~', '~', '~', '~', '~', '~']
Pointer at 0.

==================== Push Operation ====================
New Stack: [4, '~', '~', '~', '~', '~']
Pointer at 1.
=======================================================

==================== Push Operation ====================
New Stack: [4, 5, '~', '~', '~', '~']
Pointer at 2.
=======================================================

==================== Push Operation ====================
New Stack: [4, 5, 6, '~', '~', '~']
Pointer at 3.
=======================================================

==================== Push Operation ====================
New Stack: [4, 5, 6, 8, '~', '~']
Pointer at 4.
=======================================================

==================== Push Operation ====================
New Stack: [4, 5, 6, 8, 5, '~']
Pointer at 5.
=======================================================

==================== Push Operation ====================
New Stack: [4, 5, 6, 8, 5, 2]
Pointer at 6.
=======================================================

==================== Push Operation ====================
Overflow Occured at pointer: 6
=======================================================

==================== Pop Operation ====================
New Stack: [4, 5, 6, 8, 5]
Pointer at 5.
=======================================================

==================== Pop Operation ====================
New Stack: [4, 5, 6, 8]
Pointer at 4.
=======================================================

==================== Pop Operation ====================
New Stack: [4, 5, 6]
Pointer at 3.
=======================================================

==================== Pop Operation ====================
New Stack: [4, 5]
Pointer at 2.
=======================================================

==================== Pop Operation ====================
New Stack: [4]
Pointer at 1.
=======================================================

==================== Pop Operation ====================
New Stack: []
Pointer at 0.
=======================================================

==================== Pop Operation ====================
Stack Underflow occured.
=======================================================

==================== Pop Operation ====================
Stack Underflow occured.
=======================================================

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Top Operation
&lt;/h3&gt;

&lt;p&gt;This is a simple operation. It returns the value just below the position currently pointed to by the pointer, i.e. the top element.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Stack:
    def __init__ (self, size):
        self.size = size
        self.storage = ["~"]*size
        self.pointer = 0
        print(f"New Stack: {self.storage}")
        print(f"Pointer at {self.pointer}.\n")

    def push(self, x):
        print("=" * 20+" Push Operation "+"=" * 20)
        try:
            self.storage[self.pointer] = x
            print(f"New Stack: {self.storage}")
            print(f"Pointer at {self.pointer+1}.")

            self.pointer+=1
        except IndexError as e:
            print(f"Overflow Occured at pointer: {self.pointer}")
        print("="*55 + "\n")

    def pop(self):
        print("=" * 20+" Pop Operation "+"=" * 20)
        if self.pointer&amp;lt;1:
            print("Stack Underflow occured.")
        else:
            self.storage = self.storage[:-1]
            print(f"New Stack: {self.storage}")
            print(f"Pointer at {self.pointer-1}.")

            self.pointer-=1
        print("="*55 + "\n")

    def top(self):
        print("=" * 20+" Top Operation "+"=" * 20)
        try:
            print(f"Pointer at {self.pointer}.")
            print(f"Return: {self.storage[self.pointer-1]}")
        except:
            print("Nothing on the top. Stack is empty.")
        print("="*55 + "\n")



data = [4, 5, 6, 8, 5, 2, 0]
stack = Stack(size=6)

for x in data:
    stack.push(x)
    stack.top()
stack.pop()

for i in range(len(data)):
    stack.pop()
    stack.top()


New Stack: ['~', '~', '~', '~', '~', '~']
Pointer at 0.

==================== Push Operation ====================
New Stack: [4, '~', '~', '~', '~', '~']
Pointer at 1.
=======================================================

==================== Top Operation ====================
Pointer at 1.
Return: 4
=======================================================

==================== Push Operation ====================
New Stack: [4, 5, '~', '~', '~', '~']
Pointer at 2.
=======================================================

==================== Top Operation ====================
Pointer at 2.
Return: 5
=======================================================

==================== Push Operation ====================
New Stack: [4, 5, 6, '~', '~', '~']
Pointer at 3.
=======================================================

==================== Top Operation ====================
Pointer at 3.
Return: 6
=======================================================

==================== Push Operation ====================
New Stack: [4, 5, 6, 8, '~', '~']
Pointer at 4.
=======================================================

==================== Top Operation ====================
Pointer at 4.
Return: 8
=======================================================

==================== Push Operation ====================
New Stack: [4, 5, 6, 8, 5, '~']
Pointer at 5.
=======================================================

==================== Top Operation ====================
Pointer at 5.
Return: 5
=======================================================

==================== Push Operation ====================
New Stack: [4, 5, 6, 8, 5, 2]
Pointer at 6.
=======================================================

==================== Top Operation ====================
Pointer at 6.
Return: 2
=======================================================

==================== Push Operation ====================
Overflow Occured at pointer: 6
=======================================================

==================== Top Operation ====================
Pointer at 6.
Return: 2
=======================================================

==================== Pop Operation ====================
New Stack: [4, 5, 6, 8, 5]
Pointer at 5.
=======================================================

==================== Pop Operation ====================
New Stack: [4, 5, 6, 8]
Pointer at 4.
=======================================================

==================== Top Operation ====================
Pointer at 4.
Return: 8
=======================================================

==================== Pop Operation ====================
New Stack: [4, 5, 6]
Pointer at 3.
=======================================================

==================== Top Operation ====================
Pointer at 3.
Return: 6
=======================================================

==================== Pop Operation ====================
New Stack: [4, 5]
Pointer at 2.
=======================================================

==================== Top Operation ====================
Pointer at 2.
Return: 5
=======================================================

==================== Pop Operation ====================
New Stack: [4]
Pointer at 1.
=======================================================

==================== Top Operation ====================
Pointer at 1.
Return: 4
=======================================================

==================== Pop Operation ====================
New Stack: []
Pointer at 0.
=======================================================

==================== Top Operation ====================
Pointer at 0.
Nothing on the top. Stack is empty.
=======================================================

==================== Pop Operation ====================
Stack Underflow occured.
=======================================================

==================== Top Operation ====================
Pointer at 0.
Nothing on the top. Stack is empty.
=======================================================

==================== Pop Operation ====================
Stack Underflow occured.
=======================================================

==================== Top Operation ====================
Pointer at 0.
Nothing on the top. Stack is empty.
=======================================================

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
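
&lt;p&gt;For comparison, here is a hypothetical, more compact sketch of the same three operations that relies on Python's built-in list and raises exceptions instead of printing. This illustrates an alternative design, not the class developed above.&lt;br&gt;
&lt;/p&gt;

```python
# Compact stack sketch: exceptions signal overflow/underflow,
# leaving error handling to the caller.
class Stack:
    def __init__(self, size):
        self.size = size
        self.storage = []

    def push(self, x):
        if len(self.storage) >= self.size:
            raise OverflowError("stack is full")
        self.storage.append(x)

    def pop(self):
        if not self.storage:
            raise IndexError("stack underflow")
        return self.storage.pop()

    def top(self):
        if not self.storage:
            raise IndexError("stack is empty")
        return self.storage[-1]

stack = Stack(size=2)
stack.push(1)
stack.push(2)
print(stack.top())  # 2
print(stack.pop())  # 2
```

&lt;p&gt;Raising exceptions lets the caller decide how to handle overflow and underflow, at the cost of the step-by-step tracing that the printed version provides.&lt;/p&gt;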



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Thank you for reading this blog all the way to the end. In the next post, I will take a similar approach to the queue, which will also be fun to try.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Dataframe in R.</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Thu, 14 Jul 2022 07:20:43 +0000</pubDate>
      <link>https://dev.to/iamdurga/dataframe-in-r-22id</link>
      <guid>https://dev.to/iamdurga/dataframe-in-r-22id</guid>
      <description>&lt;h1&gt;
  
  
  Getting Started With Dataframes
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Data frames are the most widely used data structure in R. A data frame is a list in which every component has a name and the same length. The easiest way to understand a data frame is to visualize a spreadsheet: the first row is the header, given by the names of the list components. Each column stores a single data type and is called a variable, and each row is an observation across multiple variables. Since data frames are like spreadsheets, we can insert data however we like; there are many possibilities for inserting data.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;apple&lt;/th&gt;
&lt;th&gt;Banana&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;price store A&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;price store B&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The table above is not a data frame, because the price per store is split across two rows. If we rearrange the data so that product is one variable, price is another, and store is a third, it becomes a data frame.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Store&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;apple&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;apple&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;banana&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;banana&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Attributes of dataframe
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Length&lt;/li&gt;
&lt;li&gt;Dimension&lt;/li&gt;
&lt;li&gt;Name&lt;/li&gt;
&lt;li&gt;Class&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Create a DataFrame
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;product &amp;lt;- c('apple','banana','orange','papaya','rice','wheat','pee','noodle')
catagory &amp;lt;- c( 'groceries','groceries','electronic','electronic','groceries','electronic','electronic','groceries')
price &amp;lt;- c(24,45,67,88,56,78,89,90)
quality &amp;lt;- c('high','low','high','low','high','low','high','low') 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To create a data frame from the above data, we can do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; shopping_data &amp;lt;- data.frame(product,catagory,price,quality,
                           budget = c(120,3000,600,500,45,67,89,90))
shopping_data

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the above code is the data frame.&lt;/p&gt;

&lt;p&gt;To check whether it is a data frame or not, we can use the following code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;str(shopping_data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output of the above code is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'data.frame':   8 obs. of 5 variables:
 $ product : chr "apple" "banana" "orange" "papaya" ...
 $ catagory: chr "groceries" "groceries" "electronic" "electronic" ...
 $ price : num 24 45 67 88 56 78 89 90
 $ quality : chr "high" "low" "high" "low" ...
 $ budget : num 120 3000 600 500 45 67 89 90 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Check the attributes of the dataframe.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; names(shopping_data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Check dimension of dataframe.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; dim(shopping_data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Check first six rows of dataframe
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; head(shopping_data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Check last six rows of dataframe.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; tail(shopping_data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Take only two rows of dataframe.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; head(shopping_data, n = 2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Access a specified column of the dataframe.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; shopping_data$product

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output of the above code is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 'apple''banana''orange''papaya''rice''wheat''pee''noodle'


 shopping_data[['product']]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output of the above code is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 'apple''banana''orange''papaya''rice''wheat''pee''noodle'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Manipulating a DataFrame
&lt;/h2&gt;

&lt;p&gt;By manipulating a data frame we learn how to select data, how to add new rows, and how to sort and rank within a data frame. Data frames are lists where each element is a named vector of the same length, so we can select elements just as in a list, using [[]] or $column. Data frames are also two-dimensional matrices, which means we can index them like matrices using square brackets: [row, column]. If we fix one dimension, they behave like lists. Therefore a data frame can be indexed either like a list or like a matrix, by position, by logical rule, or by name.&lt;/p&gt;

&lt;h3&gt;
  
  
  List subsetting
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#list subsetting
shopping_data[[2]]
shopping_data[['budget']]
shopping_data$price
shopping_data$price[1:3]
shopping_data[[3]][3]
shopping_data$price[3]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output of the above code is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'groceries''groceries''electronic''electronic''groceries''electronic''electronic''groceries'
120300060050045678990
2445678856788990
244567
67
67

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Matrix subsetting
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Matrix subsetting
shopping_data[,1]
shopping_data[,"product"]
shopping_data[1,]
shopping_data[1,"price"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output will be&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'apple''banana''orange''papaya''rice''wheat''pee''noodle'
'apple''banana''orange''papaya''rice''wheat''pee''noodle'
A data.frame: 1 × 5
1   apple   groceries   24  high    120
24

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add new attribute into dataframe.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feedback&amp;lt;- c('good','outstanding','ordinary','nice','excilent','brillent','extra-ordinary','satisfactory')
shopping_data &amp;lt;- cbind(shopping_data,feedback)
shopping_data

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output will be&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A data.frame: 8 × 6
apple   groceries   24  high    120 good
banana  groceries   45  low 3000    outstanding
orange  electronic  67  high    600 ordinary
papaya  electronic  88  low 500 nice
rice    groceries   56  high    45  excilent
wheat   electronic  78  low 67  brillent
pee electronic  89  high    89  extra-ordinary
noodle  groceries   90  low 90  satisfactory

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  We can use the following operations to access data from the dataframe
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shopping_data[c(1:3),1]
shopping_data[1]
shopping_data[[1]]
is.vector(shopping_data[1])
is.vector(shopping_data[[1]])
is.list(shopping_data[1])
is.list(shopping_data[1])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'apple''banana''orange'
A data.frame: 8 × 1
apple
banana
orange
papaya
rice
wheat
pee
noodle
'apple''banana''orange''papaya''rice''wheat''pee''noodle'
FALSE
TRUE
TRUE
TRUE

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Working with tidyverse
&lt;/h2&gt;

&lt;p&gt;During data analysis we spend most of our time cleaning and transforming the raw data. The tidyverse is an add-on collection of packages that lets us perform operations such as cleaning data and creating powerful graphs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;product &amp;lt;- c('apple','banana','orange','papaya','Rice','wheat','pee','noodle')
catagory &amp;lt;- c( 'groceries','groceries','electronic','electronic','groceries','electronic','electronic','groceries')
price &amp;lt;- c(24,45,67,88,56,78,89,90)
quality &amp;lt;- c('high','low','high','low','high','low','high','low')
shopping_data &amp;lt;- data.frame(product,catagory,price,quality,
                           budget = c(120,3000,600,500,45,67,89,90))
#arrange(desc(price))
shopping_data

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A data.frame: 8 × 5
apple   groceries   24  high    120
banana  groceries   45  low 3000
orange  electronic  67  high    600
papaya  electronic  88  low 500
Rice    groceries   56  high    45
wheat   electronic  78  low 67
pee electronic  89  high    89
noodle  groceries   90  low 90

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Select Function
&lt;/h3&gt;

&lt;p&gt;The select function allows us to select specified columns from a data frame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# dplyr never change the original data
#install.packages("tidyverse")
#library(tidyverse)
library(dplyr) 
product &amp;lt;- select(shopping_data,price,budget)
product

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A data.frame: 8 × 2
24  120
45  3000
67  600
88  500
56  45
78  67
89  89
90  90

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Filter
&lt;/h3&gt;

&lt;p&gt;The filter function works similarly to select. Using the pipe operator %&amp;gt;%, we can chain multiple operations at once without naming the intermediate results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filter(product,budget &amp;gt; 100)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A data.frame: 4 × 2
24  120
45  3000
67  600
88  500


dataset2 &amp;lt;- shopping_data %&amp;gt;%
select(product,price)%&amp;gt;%
filter(price&amp;gt;45)%&amp;gt;%
group_by( product)%&amp;gt;%
summarize(avg = mean(price))

dataset2 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A tibble: 6 × 2
noodle  90
orange  67
papaya  88
pee 89
Rice    56
wheat   78

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Arrange function
&lt;/h3&gt;

&lt;p&gt;It sorts our data frame in ascending order: &lt;code&gt;arrange(price)&lt;/code&gt;. To arrange the data frame in descending order we use &lt;code&gt;arrange(desc(price))&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arrange(product,price)


Output is,
A data.frame: 8 × 2
24  120
45  3000
56  45
67  600
78  67
88  500
89  89
90  90

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Managing control statements:
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;If statement:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The if statement is the most common control statement; it executes code only when the condition placed between the brackets is true. Otherwise, the if statement ignores that particular piece of code: &lt;code&gt;if(condition){ code to be executed }&lt;/code&gt;. To overcome this obstacle we add an extra element, else.&lt;/p&gt;

&lt;h1&gt;
  
  
  Paste Function
&lt;/h1&gt;

&lt;p&gt;Paste converts its arguments (via as.character) to character strings and concatenates them (separating them by the string given by sep). If the arguments are vectors, they are concatenated term-by-term to give a character vector result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;product &amp;lt;- "tshirt"
price&amp;lt;- 110
if(price &amp;lt; 100){
    print(paste('adding',product,'to cart'))
}else
{
    print(paste('adding',product,'to wishlist'))
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] "adding tshirt to wishlist"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
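&lt;p&gt;As a small sketch of &lt;code&gt;paste&lt;/code&gt; on its own (the variable names here are only illustrative), the &lt;code&gt;sep&lt;/code&gt; argument controls the string placed between the pieces, and vector arguments are combined term by term:&lt;/p&gt;

```r
# paste() joins its arguments, separated by sep (default is a space)
item = "apple"
print(paste("adding", item, "to cart"))             # "adding apple to cart"

# a custom separator
print(paste("adding", item, "to cart", sep = "_"))  # "adding_apple_to_cart"

# vector arguments are concatenated term by term
print(paste(c("a", "b"), c(1, 2), sep = "-"))       # "a-1" "b-2"
```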



&lt;h1&gt;
  
  
  Control Statement in vectors
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;quantity &amp;lt;- c(1,1,2,3,4)
ifelse(quantity == 1,'Yes','No')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'Yes''Yes''No''No''No'


price &amp;lt;- 100
if(price &amp;lt; 100){
    print("price"&amp;lt; "budget")
}else if(price == 100){
    print("the price is equal to budget")

}else{
    print("The budget is less then price")
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] "the price is equal to budget"


price &amp;lt;- c(58,100,110)
if(price &amp;lt; 100){
    print("price"&amp;lt; "budget")
}else if(price == 100){
    print("the price is equal to budget")

}else{
    print("The budget is less then price")
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the condition has length greater than one, only the first element is tested (and in R 4.2 and later this raises an error outright). That means it checks the first element and then stops. This problem is resolved by using the &lt;code&gt;any&lt;/code&gt; function.&lt;/p&gt;

&lt;h1&gt;
  
  
  Any Function
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if(any(price &amp;lt; 100)){

    print('At least one price is under budget')
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] "At least one price is under budget"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  All Function
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if(all(price&amp;lt;100)){
    print('all the price are under budget')
}else{
    print('Not all prices satisfies the condition.')
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] "Not all prices satisfies the condition."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To combine conditions we can use the &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; and &lt;code&gt;||&lt;/code&gt; operators. The single &lt;code&gt;&amp;amp;&lt;/code&gt; and &lt;code&gt;|&lt;/code&gt; work element-wise on vectors, while the double forms compare a single value on each side (the non-vectorised form), which is what &lt;code&gt;if&lt;/code&gt; expects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;price &amp;lt;- 58
if(price&amp;gt; 50 &amp;amp;&amp;amp; price &amp;lt; 100){
    print('The price is between 50 and 100')
}else {
    print("the price is not in between 50 and 100")
}


[1] "The price is between 50 and 100"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
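&lt;p&gt;A minimal sketch of the difference, using the element-wise &lt;code&gt;|&lt;/code&gt; on a vector and the single-value &lt;code&gt;||&lt;/code&gt; inside an if (the vectors here are only illustrative):&lt;/p&gt;

```r
# single | is vectorised: it tests every element
price = c(58, 100, 110)
print((price == 100) | (price > 105))   # FALSE TRUE TRUE

# double || expects single values, as used inside if()
x = 120
if (x > 100 || x == 0) {
    print("x is outside the 1-100 range")
}
```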



&lt;h1&gt;
  
  
  Switch Statement
&lt;/h1&gt;

&lt;p&gt;We can chain as many if...else statements as we like; however, with more than about four it becomes difficult to keep track of what happens when each condition is true. The switch statement works with cases: its syntax contains the value to be tested, followed by the possible cases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;quantity &amp;lt;- c(1,3,4,5)

average_quantity &amp;lt;- function(quantity,type) {
    switch(type,
          arithmetic = mean(quantity),
          geometric = prod(quantity)^(1/length(quantity)))
}
average_quantity(quantity,"arithmetic")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3.25


x &amp;lt;- c(1,2,3,4,5)
sumfunction &amp;lt;- function(x,i){
    switch(i, 
          s = sum(x)
        )
}
sumfunction(x,"s")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;15

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
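&lt;p&gt;One detail worth knowing: when no case matches, &lt;code&gt;switch&lt;/code&gt; returns NULL invisibly, and a final unnamed argument acts as a default. A sketch (the &lt;code&gt;describe&lt;/code&gt; function is hypothetical):&lt;/p&gt;

```r
# the last unnamed argument of switch() serves as the default case
describe = function(type) {
    switch(type,
          arithmetic = "uses the mean",
          geometric = "uses the product",
          "unknown type")   # returned when nothing matches
}
print(describe("geometric"))   # "uses the product"
print(describe("harmonic"))    # "unknown type"
```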



&lt;h1&gt;
  
  
  Loop
&lt;/h1&gt;

&lt;p&gt;A loop is a sequence of instructions that is repeated until a certain condition is reached.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For loop: it performs the same operation on every element of the input. Its syntax is &lt;code&gt;for(variable in sequence){ expression }&lt;/code&gt;. Between the parentheses there are three parts: first a variable, which can take any name, then the keyword &lt;code&gt;in&lt;/code&gt;, and last a sequence or vector of any kind.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A for loop does not display its output unless we print it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cart &amp;lt;- c('apple','cookie','lemoan')
    for(product in cart){
        print(product)
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] "apple"
[1] "cookie"
[1] "lemoan"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
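&lt;p&gt;Because a for loop only prints, results are usually collected into a pre-allocated vector instead. A minimal sketch using &lt;code&gt;seq_along&lt;/code&gt; (the prices are only illustrative):&lt;/p&gt;

```r
# collect loop results instead of only printing them
prices = c(120, 3000, 45)
discounted = numeric(length(prices))   # pre-allocate the output vector
for (i in seq_along(prices)) {
    discounted[i] = prices[i] * 0.9    # apply a 10% discount
}
print(discounted)
```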



&lt;h1&gt;
  
  
  While loop
&lt;/h1&gt;

&lt;p&gt;A while loop performs its operation as long as the given condition is true. The syntax is similar to that of a for loop. To make the loop stop, there must be a relationship between the condition and the body; otherwise the loop never stops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;index &amp;lt;- 1
while(index &amp;lt;3 ) {
    print(paste("The index value is",index))
    index &amp;lt;- index + 1
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] "The index value is 1"
[1] "The index value is 2"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Repeat Loop
&lt;/h1&gt;

&lt;p&gt;A repeat loop repeats the same operation until we interrupt it or insert a special statement (&lt;code&gt;break&lt;/code&gt;) to stop it. Repeat loops are important in optimization and maximization algorithms. The syntax is &lt;code&gt;repeat { expression }&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The next statement is used to skip the rest of one particular iteration and move on to the next one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x &amp;lt;- 1
repeat {
    print(x)
    x = x + 1
    if( x==3){
        break
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] 1
[1] 2


price &amp;lt;- c(123,456,78,900,987)
for(value in price){
    if( value &amp;lt; 100){
        next
    }
    discount &amp;lt;- value - value * 0.1
    print(discount)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] 110.7
[1] 410.4
[1] 810
[1] 888.3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>Getting Started with R Programming Language.</title>
      <dc:creator>Durga Pokharel</dc:creator>
      <pubDate>Tue, 12 Jul 2022 15:39:33 +0000</pubDate>
      <link>https://dev.to/iamdurga/getting-started-with-r-programming-language-1oin</link>
      <guid>https://dev.to/iamdurga/getting-started-with-r-programming-language-1oin</guid>
      <description>&lt;h1&gt;
  
  
  Getting Started With R
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;R is a programming language and software environment for statistical computing and graphics, supported by the R Foundation. R is not a general-purpose programming language like Java or C; it was created by statisticians as an interactive environment. Interactivity is the critical characteristic that allows us to explore our data in R. It supports statistical techniques such as linear and non-linear modeling, classification, and many more, and data analysis also requires many different types of plots. To run R we will use an IDE (according to Wikipedia, an integrated development environment (IDE) is a software application that provides comprehensive facilities to the programmer for software development). The core component required for every R program is base R, which contains only the most important bits needed to run our code successfully.&lt;/p&gt;

&lt;h2&gt;
  
  
  History About R
&lt;/h2&gt;

&lt;p&gt;Bell Labs developed the S language in 1976. In 1993 Ross Ihaka and Robert Gentleman created R in New Zealand. R became free and open source in 1995. R version 1.0.0 was released to the public in 2000. The RStudio IDE was released in 2011.&lt;/p&gt;

&lt;h2&gt;
  
  
  Drawback
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;R is built on &lt;code&gt;S&lt;/code&gt;. If we want to build general-purpose apps, R is probably not the right choice.&lt;/li&gt;
&lt;li&gt;The objects we work with must be stored in memory, so working with very large data sets can quickly exhaust it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Installing and Setting up R in your Windows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Downloading installation file
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Download R tools from &lt;a href="https://cran.r-project.org/bin/windows/Rtools/" rel="noopener noreferrer"&gt;Official Website&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Next, we need to have an IDE, most popular one is Rstudio. We can download it from &lt;a href="https://www.rstudio.com/products/rstudio/download/" rel="noopener noreferrer"&gt;this link&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After downloading the installation files, install them in the desired locations and then open the console.&lt;/p&gt;

&lt;p&gt;After the installation completes, open R and we get a window just like the one below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FMath_blog%2Fwindows%25203.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fiamdurga.github.io%2Fassets%2FMath_blog%2Fwindows%25203.PNG" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can write our R codes within console or we can do it via Rstudio.&lt;/p&gt;

&lt;p&gt;I prefer to use Jupyter Notebook for running R because it is friendlier for me. A good tutorial is available at &lt;a href="https://docs.anaconda.com/anaconda/navigator/tutorials/r-lang/" rel="noopener noreferrer"&gt;Anaconda’s Documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  My First R program
&lt;/h2&gt;

&lt;p&gt;Assigning a variable in R is my first R program.&lt;/p&gt;

&lt;h3&gt;
  
  
  Assigning Variable and operator in R
&lt;/h3&gt;

&lt;p&gt;A variable is a container that stores values. An assignment statement sets or resets the value stored in the storage location(s) denoted by a variable name (per Wikipedia). The assignment operator is a command telling the computer, for example, to assign the text apple to the variable product. We can also assign with &lt;code&gt;assign('product', 'apple')&lt;/code&gt;. We can assign a variable in R in several ways, as shown below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Way 1
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;('apple'-&amp;gt; product)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Way 2
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(product = 'apple')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Way 3
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assign('products', ' apple)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Logical Operators in R
&lt;/h2&gt;

&lt;p&gt;Logical operators are those that give &lt;code&gt;TRUE&lt;/code&gt; or &lt;code&gt;FALSE&lt;/code&gt; values. For example:&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 1
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apple &amp;lt;- 2
banana &amp;lt;- 3
most_expensive &amp;lt;- banana&amp;gt; apple
most_expensive

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output of above code is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TRUE

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example 2
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apple &amp;lt;- 2
banana &amp;lt;- 3
most_expensive &amp;lt;- banana&amp;lt; apple
most_expensive

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output of above code is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FALSE

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example 3
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apple &amp;lt;- 2
banana &amp;lt;- 2
most_expensive &amp;lt;- banana == apple
most_expensive

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TRUE

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example 4
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apple &amp;lt;- 2
banana &amp;lt;- 2
most_expensive &amp;lt;- banana != apple
most_expensive

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FALSE

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Some Commonly Used Data Types in R
&lt;/h2&gt;

&lt;p&gt;Data is central to analysis: if there is no data, there is no analysis. Every piece of data we work with has some characteristics, and these characteristics can be summarized by its data type.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Character&lt;/code&gt; : Anything inside quotation is a character.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Number&lt;/code&gt;: numbers in R are doubles by default. Holding both whole numbers and fractions is the distinctive feature of a double. The other numeric type is integer.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Integer&lt;/code&gt;: an integer is a simplified version of a double. To store a value as an integer we must append the capital letter L, as in &lt;code&gt;2L&lt;/code&gt;. In most everyday use we want doubles rather than integers.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Logical(Boolean)&lt;/code&gt;: &lt;code&gt;TRUE&lt;/code&gt; or &lt;code&gt;FALSE&lt;/code&gt;, which can be abbreviated &lt;code&gt;T&lt;/code&gt; or &lt;code&gt;F&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Complex Number&lt;/code&gt;: (2 + 6i)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Raw&lt;/code&gt;: a less common data type. It is not easy to create a variable of raw type directly; when we really need one, we call a function such as &lt;code&gt;as.raw()&lt;/code&gt;, and the result is raw-type data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All these fundamental data types are called atomic data types.&lt;/p&gt;
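&lt;p&gt;A quick way to verify these atomic types is to ask &lt;code&gt;class()&lt;/code&gt; for each kind of value (a small sketch):&lt;/p&gt;

```r
# class() reports the data type of each atomic value
print(class("apple"))     # "character"
print(class(2))           # "numeric"
print(class(2L))          # "integer"
print(class(TRUE))        # "logical"
print(class(2 + 6i))      # "complex"
print(class(as.raw(10)))  # "raw"
```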

&lt;h3&gt;
  
  
  Example of numbers
&lt;/h3&gt;

&lt;p&gt;An integer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a &amp;lt;- 2L
class(a)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'integer'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A numeric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a &amp;lt;- 2
class(a)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'numeric'


quantity &amp;lt;- 2
typeof(quantity)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'double'


quantity_integer &amp;lt;- 2L
typeof(quantity_integer)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'integer'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Comments
&lt;/h2&gt;

&lt;p&gt;Comments are used to give important information about the code. Comments are not executed by the program; a programmer writes them to better explain the code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# This is a comment in R

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Exploring vectors and factors
&lt;/h2&gt;

&lt;p&gt;A data structure, as the name suggests, represents a way to organize data so that different operations can be performed faster.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Vectors&lt;/code&gt;: a collection of data of the same type.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Factors&lt;/code&gt;: used to store categorical data.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Array&lt;/code&gt;: a generalization of a matrix to more than two dimensions, just as a matrix is a generalization of a vector.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;List\DataFrame&lt;/code&gt;: lists are more complex data structures because they allow us to store elements of different types, including other lists. We can think of a data frame as a spreadsheet where data are organized in rows and columns and each column has a specific data type. Within a data frame we can have all kinds of data types, but within one column only one. Another criterion for categorizing our data structures is dimensionality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vectors and lists are one-dimensional objects. Matrices and data frames are two-dimensional data structures. Arrays are objects that can have more than two dimensions.&lt;/p&gt;
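&lt;p&gt;The dimensionality of these structures can be inspected directly with &lt;code&gt;dim()&lt;/code&gt; and &lt;code&gt;length()&lt;/code&gt; (a minimal sketch; the example values are only illustrative):&lt;/p&gt;

```r
m = matrix(1:6, nrow = 2)           # matrix: two-dimensional
df = data.frame(a = 1:2, b = 3:4)   # data frame: rows and columns
l = list(1:3, m, "text")            # list: holds mixed types

print(dim(m))     # 2 3
print(dim(df))    # 2 2
print(length(l))  # 3
```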

&lt;blockquote&gt;
&lt;p&gt;A vector has two properties: it is one-dimensional and it contains elements of the same type.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Assigning a column vector
&lt;/h3&gt;

&lt;p&gt;Lets assign a column vector,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assign('b',c(1,2,3,4))
print(b)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 2 3 4

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Vectors attributes:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;length&lt;/code&gt;: denoted by &lt;code&gt;length(a)&lt;/code&gt;; it gives the number of elements.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Name&lt;/code&gt;: &lt;code&gt;names(a)&lt;/code&gt; lets us attach labels to the elements.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Type&lt;/code&gt;: &lt;code&gt;typeof(a)&lt;/code&gt; gives the type of the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are six vector types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Double&lt;/li&gt;
&lt;li&gt;logical&lt;/li&gt;
&lt;li&gt;character&lt;/li&gt;
&lt;li&gt;complex&lt;/li&gt;
&lt;li&gt;Raw&lt;/li&gt;
&lt;li&gt;Integer
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vector &amp;lt;- c("Durga","Puja","Ram","Hari")
vector
length(vector) # length 
names(vector)= "Sita" #names
typeof(vector) # type
vector

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'Durga''Puja''Ram''Hari'
4
'character'
Sita'Durga'2'Puja'3'Ram'4'Hari'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Manipulating vectors.
&lt;/h3&gt;

&lt;p&gt;Manipulating of vectors consists of sorting, ordering, indexing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sorting&lt;/code&gt;: Sort the data in some order.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Ordering&lt;/code&gt;: The order function returns the indices needed to sort the vector.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Indexing&lt;/code&gt;: selecting specific items by position.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;quantity &amp;lt;- c(1,3,2,5,6,7)
sort(quantity)
order(quantity)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 2 3 5 6 7
1 3 2 4 5 6


a &amp;lt;- c(1,7,36,0,7,5)
a[2]
a[3:5]
a[c(2,4)]
a[c(4,7)] # returns the requested elements; out-of-range indices give NA
a[-2]
a[-(2:4)] # negative indices skip those elements
a[a==1]
a[a&amp;gt;3]
a[a %in% c(2,4)] # keeps the elements that match the set (none here)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;7
36 0 7
7 0
0 &amp;lt;NA&amp;gt;
1 36 0 7 5
1 7 5
1
7 36 7 5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Operating vector
&lt;/h3&gt;

&lt;p&gt;Adding or multiplying vectors of different sizes invokes the recycling rule. For recycling to work cleanly, the length of the larger vector must be a multiple of the length of the smaller one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;c &amp;lt;- 1:6
d &amp;lt;- 1:3
c * d

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 4 9 4 10 18

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
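&lt;p&gt;When the longer length is not a multiple of the shorter one, R still recycles but emits a warning. A sketch (the warning is suppressed here just to show the result):&lt;/p&gt;

```r
# lengths 6 and 4: 6 is not a multiple of 4, so R warns that the
# longer object length is not a multiple of the shorter object length
result = suppressWarnings(1:6 + 1:4)
print(result)   # 2 4 6 8 6 8
```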



&lt;h2&gt;
  
  
  Sequence generation
&lt;/h2&gt;

&lt;p&gt;It is used to create a sequence of elements in a vector. The &lt;code&gt;seq()&lt;/code&gt; function takes the length and the difference between values as optional arguments. In the code below, I take elements in the range 1 to 5 with a step of 1.5.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seq(1,5,by = 1.5)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 2.5 4

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Replicating elements
&lt;/h2&gt;

&lt;p&gt;It is used to replicate the elements of a vector a specified number of times. In the following code I replicate the numbers from 1 to 6 two times, using the built-in function &lt;code&gt;rep()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;e&amp;lt;- rep(1:6,times = 2)
e

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 2 3 4 5 6 1 2 3 4 5 6

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also replicate a single number as many times as desired.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x &amp;lt;- rep(c(1),each = 10)
x

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Out put is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 1 1 1 1 1 1 1 1 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Scan Function
&lt;/h2&gt;

&lt;p&gt;The scan function reads a file into a vector. It is a very powerful function. In the code given below, scan reads &lt;code&gt;covid data.csv&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f &amp;lt;- scan("covid data.csv", what = "Character")
f

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Out put of the above code is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'date,totalCases,newCases,totalRecoveries,newRecoveries,totalDeaths,newDeaths' '1/23/2020,1,1,0,0,0,0' '1/24/2020,0,0,0,0,0,0' '1/25/2020,0,0,0,0,0,0' '1/26/2020,0,0,0,0,0,0' '1/27/2020,0,0,0,0,0,0' '1/28/2020,0,0,0,0,0,0' '1/29/2020,0,0,0,0,0,0' '1/30/2020,0,0,0,0,0,0' '1/31/2020,0,0,1,1,0,0' '2/1/2020,0,0,1,0,0,0' '2/2/2020,0,0,1,0,0,0' '2/3/2020,0,0,1,0,0,0' '2/4/2020,0,0,1,0,0,0' '2/5/2020,0,0,1,0,0,0' '2/6/2020,0,0,1,0,0,0' '2/7/2020,0,0,1,0,0,0' '2/8/2020,0,0,1,0,0,0' '2/9/2020,0,0,1,0,0,0' '2/10/2020,0,0,1,0,0,0' '2/11/2020,0,0,1,0,0,0' '2/12/2020,0,0,1,0,0,0' '2/13/2020,0,0,1,0,0,0' '2/14/2020,0,0,1,0,0,0' '2/15/2020,0,0,1,0,0,0' '2/16/2020,0,0,1,0,0,0' '2/17/2020,0,0,1,0,0,0' '2/18/2020,0,0,1,0,0,0' '2/19/2020,0,0,1,0,0,0' '2/20/2020,0,0,2,1,0,0' '2/21/2020,0,0,2,0,0,0' '2/22/2020,0,0,2,0,0,0' '2/23/2020,0,0,2,0,0,0' '2/24/2020,0,0,2,0,0,0' '2/25/2020,0,0,2,0,0,0' '2/26/2020,0,0,2,0,0,0' '2/27/2020,0,0,2,0,0,0' '2/28/2020,0,0,2,0,0,0' '2/29/2020,0,0,2,0,0,0' '3/1/2020,0,0,2,0,0,0' '3/2/2020,0,0,2,0,0,0' '3/3/2020,0,0,2,0,0,0' '3/4/2020,0,0,2,0,0,0' '3/5/2020,0,0,2,0,0,0' '3/6/2020,0,0,2,0,0,0' '3/7/2020,0,0,2,0,0,0' '3/8/2020,0,0,2,0,0,0' '3/9/2020,0,0,2,0,0,0' '3/10/2020,0,0,2,0,0,0' '3/11/2020,0,0,2,0,0,0' '3/12/2020,0,0,2,0,0,0' '3/13/2020,0,0,2,0,0,0' '3/14/2020,0,0,2,0,0,0' '3/15/2020,0,0,2,0,0,0' '3/16/2020,0,0,2,0,0,0' '3/17/2020,0,0,2,0,0,0' '3/18/2020,0,0,2,0,0,0' '3/19/2020,0,0,2,0,0,0' '3/20/2020,0,0,2,0,0,0' '3/21/2020,0,0,2,0,0,0' '3/22/2020,0,0,2,0,0,0' '3/23/2020,1,1,2,0,0,0' '3/24/2020,1,0,2,0,0,0' '3/25/2020,2,1,2,0,0,0' '3/26/2020,2,0,2,0,0,0' '3/27/2020,3,1,2,0,0,0' '3/28/2020,4,1,2,0,0,0' '3/29/2020,4,0,2,0,0,0' '3/30/2020,4,0,2,0,0,0' '3/31/2020,4,0,2,0,0,0' '4/1/2020,4,0,2,0,0,0' '4/2/2020,5,1,2,0,0,0' '4/3/2020,5,0,2,0,0,0' '4/4/2020,8,3,2,0,0,0' '4/5/2020,8,0,2,0,0,0' '4/6/2020,8,0,2,0,0,0' '4/7/2020,8,0,2,0,0,0' '4/8/2020,8,0,2,0,0,0' '4/9/2020,8,0,2,0,0,0' '4/10/2020,8,0,2,0,0,0' 
'4/11/2020,8,0,2,0,0,0' '4/12/2020,11,3,2,0,0,0' '4/13/2020,13,2,2,0,0,0' '4/14/2020,15,2,2,0,0,0' '4/15/2020,15,0,2,0,0,0' '4/16/2020,15,0,2,0,0,0' '4/17/2020,29,14,2,0,0,0' '4/18/2020,30,1,4,2,0,0' '4/19/2020,30,0,5,1,0,0' '4/20/2020,30,0,5,0,0,0' '4/21/2020,41,11,6,1,0,0' '4/22/2020,44,3,8,2,0,0' '4/23/2020,47,3,9,1,0,0' '4/24/2020,48,1,11,2,0,0' '4/25/2020,48,0,12,1,0,0' '4/26/2020,51,3,14,2,0,0' '4/27/2020,51,0,14,0,0,0' '4/28/2020,53,2,14,0,0,0' '4/29/2020,56,3,14,0,0,0' '4/30/2020,56,0,14,0,0,0' '5/1/2020,58,2,14,0,0,0' '5/2/2020,58,0,14,0,0,0' '5/3/2020,74,16,14,0,0,0' '5/4/2020,74,0,14,0,0,0' '5/5/2020,81,7,14,0,0,0' '5/6/2020,98,17,20,6,0,0' '5/7/2020,100,2,20,0,0,0' '5/8/2020,101,1,28,8,0,0' '5/9/2020,108,7,29,1,0,0' '5/10/2020,109,1,29,0,0,0' '5/11/2020,133,24,31,2,0,0' '5/12/2020,216,83,31,0,0,0' '5/13/2020,242,26,33,2,0,0' '5/14/2020,248,6,33,0,1,1' '5/15/2020,266,18,34,1,1,0' '5/16/2020,280,14,34,0,1,0' '5/17/2020,294,14,34,0,3,2' '5/18/2020,374,80,34,0,3,0' '5/19/2020,401,27,35,1,3,0' '5/20/2020,426,25,43,8,3,0' '5/21/2020,456,30,47,4,4,1' '5/22/2020,515,59,68,21,4,0' '5/23/2020,583,68,68,0,5,1' '5/24/2020,602,19,85,17,5,0' '5/25/2020,681,79,110,25,5,0' '5/26/2020,771,90,152,42,5,0' '5/27/2020,885,114,180,28,6,1' '5/28/2020,1041,156,184,4,6,0' '5/29/2020,1211,170,184,0,6,0' '5/30/2020,1400,189,188,4,7,1' '5/31/2020,1571,171,189,1,8,1' '6/1/2020,1810,239,190,1,8,0' '6/2/2020,2098,288,235,45,9,1' '6/3/2020,2299,201,238,3,11,2' '6/4/2020,2633,334,256,18,12,1' '6/5/2020,2911,278,289,33,12,0' '6/6/2020,3234,323,295,6,13,1' '6/7/2020,3447,213,340,45,13,0' '6/8/2020,3760,313,363,23,14,1' '6/9/2020,4083,323,394,31,15,1' '6/10/2020,4362,279,394,0,17,2' '6/11/2020,4612,250,394,0,17,0' '6/12/2020,5059,447,394,0,18,1' '6/13/2020,5334,275,394,0,19,1' '6/14/2020,5759,425,394,0,19,0' '6/15/2020,6210,451,1044,650,20,1' '6/16/2020,6590,380,1161,117,20,0' '6/17/2020,7176,586,1170,9,22,2' '6/18/2020,7847,671,1189,19,22,0' '6/19/2020,8273,426,1405,216,23,1' 
'6/20/2020,8604,331,1581,176,23,0' '6/21/2020,9025,421,1775,194,23,0' '6/22/2020,9558,533,2151,376,24,1' '6/23/2020,10098,540,2225,74,24,0' '6/24/2020,10727,629,2339,114,25,1' '6/25/2020,11161,434,2651,312,27,2' '6/26/2020,11754,593,2699,48,27,0' '6/27/2020,12308,554,2835,136,29,2' '6/28/2020,12771,463,3014,179,30,1' '6/29/2020,13247,476,3135,121,30,0' '6/30/2020,13563,316,3195,60,30,0' '7/1/2020,14045,482,4657,1462,33,3' '7/2/2020,14518,473,5321,664,33,0' '7/3/2020,15258,740,6144,823,33,0' '7/4/2020,15490,232,6416,272,34,1' '7/5/2020,15783,293,6548,132,35,1' '7/6/2020,15963,180,6812,264,35,0' '7/7/2020,16167,204,7500,688,36,1' '7/8/2020,16422,255,7753,253,36,0' '7/9/2020,16530,108,7892,139,38,2' '7/10/2020,16648,118,8012,120,39,1' '7/11/2020,16718,70,8443,431,39,0' '7/12/2020,16800,82,8590,147,39,0' '7/13/2020,16944,144,10295,1705,39,0' '7/14/2020,17060,116,10329,34,39,0' '7/15/2020,17176,116,11026,697,40,1' '7/16/2020,17343,167,11250,224,41,1' '7/17/2020,17444,101,11388,138,41,0' '7/18/2020,17501,57,11491,103,41,0' '7/19/2020,17657,156,11549,58,41,0' '7/20/2020,17843,186,11722,173,41,0' '7/21/2020,17993,150,12331,609,42,1' '7/22/2020,18093,100,12538,207,44,2' '7/23/2020,18240,147,12694,156,44,0' '7/24/2020,18373,133,12801,107,47,3' '7/25/2020,18482,109,12907,106,47,0' '7/26/2020,18612,130,12982,75,49,2' '7/27/2020,18751,139,13608,626,50,1' '7/28/2020,19062,311,13729,121,50,0' '7/29/2020,19272,210,13875,146,53,3' '7/30/2020,19546,274,14102,227,56,3' '7/31/2020,19770,224,14253,151,57,1' '8/1/2020,20085,315,14346,93,59,2' '8/2/2020,20331,246,14457,111,59,0' '8/3/2020,20749,418,14815,358,61,2' '8/4/2020,21008,259,14880,65,62,1' '8/5/2020,21389,381,15010,130,67,5' '8/6/2020,21749,360,15243,233,71,4' '8/7/2020,22213,464,15668,425,74,3' '8/8/2020,22591,378,16167,499,76,2' '8/9/2020,22971,380,16207,40,80,4' '8/10/2020,23309,338,16347,140,83,3' '8/11/2020,23947,638,16518,171,86,3' '8/12/2020,24431,484,16582,64,95,9' '8/13/2020,24956,525,16691,109,96,1' 
'8/14/2020,25550,594,16931,240,101,5' '8/15/2020,26018,468,17055,124,102,1' '8/16/2020,26659,641,17189,134,104,2' '8/17/2020,27240,581,17349,160,107,3' '8/18/2020,28256,1016,17434,85,114,7' '8/19/2020,28937,681,17554,120,120,6' '8/20/2020,29644,707,17818,264,126,6' '8/21/2020,30482,838,18068,250,137,11' '8/22/2020,31116,634,18204,136,146,9' '8/23/2020,31934,818,18485,281,149,3' '8/24/2020,32677,743,18660,175,157,8' '8/25/2020,33532,855,18973,313,164,7' '8/26/2020,34417,885,19358,385,175,11' '8/27/2020,35528,1111,19927,569,183,8' '8/28/2020,36455,927,20096,169,195,12' '8/29/2020,37339,884,20409,313,207,12' '8/30/2020,38560,1221,20676,267,221,14' '8/31/2020,39459,899,21264,588,228,7' '9/1/2020,40528,1069,22032,768,239,11' '9/2/2020,41648,1120,23144,1112,250,11' '9/3/2020,42876,1228,24061,917,257,7' '9/4/2020,44235,1359,25415,1354,271,14' '9/5/2020,45276,1041,26981,1566,280,9' '9/6/2020,46256,980,28795,1814,289,9' '9/7/2020,47235,979,30531,1736,300,11' '9/8/2020,48137,902,32818,2287,306,6' '9/9/2020,49218,1081,33736,918,312,6' '9/10/2020,50464,1246,35554,1818,317,5' '9/11/2020,51918,1454,36526,972,322,5' '9/12/2020,53119,1201,37378,852,336,14' '9/13/2020,54158,1039,38551,1173,345,9' '9/14/2020,55328,1170,39430,879,360,15' '9/15/2020,56787,1459,40492,1062,371,11' '9/16/2020,58326,1539,41560,1068,379,8' '9/17/2020,59572,1246,42803,1243,383,4' '9/18/2020,61592,2020,43674,871,390,7' '9/19/2020,62796,1204,45121,1447,401,11' '9/20/2020,64121,1325,46087,966,411,10' '9/21/2020,65275,1154,47092,1005,427,16' '9/22/2020,66631,1356,47915,823,429,2' '9/23/2020,67803,1172,49808,1893,436,7' '9/24/2020,69300,1497,50265,457,452,16' '9/25/2020,70613,1313,51720,1455,458,6' '9/26/2020,71820,1207,52867,1147,466,8' '9/27/2020,73393,1573,53752,885,476,10' '9/28/2020,74744,1351,54494,742,481,5' '9/29/2020,76257,1513,55225,731,491,10' '9/30/2020,77816,1559,56282,1057,498,7'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implicit Type Coercion: Converting Mixed Data Types to Character
&lt;/h2&gt;

&lt;p&gt;When a vector mixes values of different types, R implicitly coerces every element to the most flexible type present; if any element is a character string, the whole vector becomes character.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x &amp;lt;- c(1,'two',4,"durga")
x
typeof(x)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'1' 'two' '4' 'durga'
'character'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
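
&lt;p&gt;As a minimal sketch (these exact calls are illustrative, not from the example above), implicit coercion follows a fixed hierarchy, and a mixed vector is promoted to the most flexible type it contains:&lt;br&gt;
&lt;/p&gt;

```r
# Implicit coercion hierarchy: logical -&gt; integer -&gt; double -&gt; character.
# A mixed vector is coerced to the most flexible type present.
typeof(c(TRUE, 1L))    # "integer": the logical value is promoted to integer
typeof(c(1L, 2.5))     # "double": the integer is promoted to double
typeof(c(2.5, "two"))  # "character": everything becomes character
```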



&lt;h2&gt;
  
  
  Explicit Type Coercion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;We do this with the &lt;code&gt;as.*()&lt;/code&gt; family of functions, such as &lt;code&gt;as.character()&lt;/code&gt; and &lt;code&gt;as.numeric()&lt;/code&gt;. Explicit type coercion helps us deal with incorrectly categorized data.&lt;/li&gt;
&lt;li&gt;We can always convert numeric values into character.&lt;/li&gt;
&lt;li&gt;We can convert character values into numeric only when they look like numbers; otherwise R produces &lt;code&gt;NA&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;num &amp;lt;- 1:5
num_char &amp;lt;- as.character(num)
num_char

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'1' '2' '3' '4' '5'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Coercing character values that are not number-like does not work cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;product &amp;lt;- c("apple",1,"banana")
as.numeric(product)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Warning message in eval(expr, envir, enclos):
"NAs introduced by coercion"
&amp;lt;NA&amp;gt; 1 &amp;lt;NA&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
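
&lt;p&gt;As a small follow-up sketch (an assumption about how one might handle this, not part of the original example), the &lt;code&gt;NA&lt;/code&gt; values introduced by a failed coercion can be found with &lt;code&gt;is.na()&lt;/code&gt; and dropped:&lt;br&gt;
&lt;/p&gt;

```r
# Values that cannot be parsed as numbers become NA (with a warning);
# is.na() locates them so they can be inspected or removed.
product = c("apple", 1, "banana")
nums = suppressWarnings(as.numeric(product))
is.na(nums)         # TRUE FALSE TRUE
nums[!is.na(nums)]  # 1 - keep only the values that converted
```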



&lt;h2&gt;
  
  
  Installing Packages in R
&lt;/h2&gt;

&lt;p&gt;R has numerous useful packages for all kinds of tasks, and with them we can work better and faster. One simple way to install a package is via the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;install.packages("haven")


library("haven") # provides read_sav() for reading SPSS .sav files
saq8 &amp;lt;- read_sav("F:/Statisticts with R/CSV file for covid data/SAQ8.sav")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example above, I first installed the package named &lt;code&gt;haven&lt;/code&gt; and then used it to read a &lt;code&gt;sav&lt;/code&gt; file.&lt;/p&gt;
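
&lt;p&gt;A common idiom (an addition of mine, not from the original post) is to install a package only when it is missing, so the script can be re-run without reinstalling:&lt;br&gt;
&lt;/p&gt;

```r
# Install "haven" only if it is not already available, then load it.
if (!requireNamespace("haven", quietly = TRUE)) {
  install.packages("haven")
}
library(haven)  # read_sav() can then read SPSS .sav files
```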

&lt;p&gt;That is all for this blog, and I hope you enjoyed it. Please leave feedback and stay tuned for my next blog.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
