DEV Community: mary kariuki

Numpy (numerical python)

mary kariuki — Wed, 16 Nov 2022 08:08:14 +0000

Numpy in python

NumPy is a Python library used for working with arrays.
NumPy stands for Numerical Python. It is used to perform a wide variety of mathematical operations on arrays, and it also has functions for linear algebra, Fourier transforms, and matrices. Python lists can play the same role as arrays, but they are slow to process when they hold a lot of numeric data. NumPy helps solve this problem because its arrays are stored in one contiguous block of memory, unlike lists, so programs can access and manipulate them very efficiently.
To install NumPy we use the following command:

pip install numpy

After installing NumPy, import the library using the following command:

import numpy as np

Here np is an alias used to refer to numpy.
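To make the difference from lists concrete, here is a minimal sketch (the variable names are illustrative and not part of the original post). It shows that arithmetic on a NumPy array is applied to every element at once, while the same operator on a list does something quite different:

```python
import numpy as np

prices_list = [10, 20, 30]            # a plain Python list
prices_array = np.array(prices_list)  # the same values as a NumPy array

# multiplying a list by 2 repeats the list
doubled_list = prices_list * 2
# multiplying an array by 2 multiplies each element
doubled_array = prices_array * 2

print(doubled_list)   # [10, 20, 30, 10, 20, 30]
print(doubled_array)  # [20 40 60]
```

This element-by-element behaviour is what makes arrays convenient for mathematical work.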

Creating an array

The array object in NumPy is called ndarray. We can create a NumPy ndarray object by using the array() function as shown below:

import numpy as np

# creating an array

x = np.array([2,4,6,8,10])

print(x)

print(type(x))

Note: an array can have one, two, three, or more dimensions.

A one-dimensional (1-D) array is an array that has 0-D arrays (single values) as its elements.
A two-dimensional (2-D) array is an array that has 1-D arrays as its elements.
A three-dimensional (3-D) array is an array that has 2-D arrays as its elements.
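As a quick check of these definitions, the sketch below builds one array of each kind and prints its number of dimensions with the ndim attribute and its shape. The names a0 to a3 are just placeholders chosen for this example:

```python
import numpy as np

a0 = np.array(7)                                     # 0-D: a single value
a1 = np.array([1, 2, 3])                             # 1-D: holds 0-D values
a2 = np.array([[1, 2, 3], [4, 5, 6]])                # 2-D: holds 1-D arrays
a3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # 3-D: holds 2-D arrays

for arr in (a0, a1, a2, a3):
    # ndim is the number of dimensions, shape is the size of each one
    print(arr.ndim, arr.shape)
```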

There are many operations that can be performed on NumPy arrays, including:

numpy array indexing
numpy array slicing
numpy array shape
numpy array reshape
numpy array split
numpy array join

Numpy array indexing

We can access an array element through its index number. Indexing starts at 0, so the first element has index 0.

Indexing in 1-D array

#indexing in 1-D array

x = np.array([1, 3, 4, 6])

print(x[0])

Output:

1

Indexing in 2-D array

#indexing in 2-D array

y = np.array([[1,4,6,9,0], [2,7,3,9,1]])

print('2nd element on 1st row: ', y[0, 1])

Output:

2nd element on 1st row:  4

Numpy array slicing

Slicing means taking elements from one given index up to another given index.

#Slice elements from index 1 up to (but not including) index 4 of the following array

y = np.array([10,20,30,40,50,60])

print(y[1:4])

Output

[20 30 40]

Note: the result includes the start index but excludes the end index.
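The same start:end rule also works with a missing start, a missing end, negative indexes, and a step. Here is a small sketch reusing the array from the example above:

```python
import numpy as np

y = np.array([10, 20, 30, 40, 50, 60])

first_three = y[:3]   # no start: begin at index 0
from_third = y[3:]    # no end: go to the last element
last_two = y[-2:]     # negative index: count from the end
every_other = y[::2]  # step of 2: take every second element

print(first_three)  # [10 20 30]
print(from_third)   # [40 50 60]
print(last_two)     # [50 60]
print(every_other)  # [10 30 50]
```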

Numpy array shape

The shape of an array is the number of elements in each dimension.
NumPy arrays have an attribute called shape that returns a tuple with one entry per dimension; each entry is the number of elements along that dimension.

arr = np.array([[2,4,6,8], [8,8,3,4]])

print(arr.shape)

Output:

(2, 4)

The example above returns (2, 4), which means the array has 2 rows and 4 columns. The first number is the number of rows and the second is the number of columns.

Numpy array reshape

Reshaping means changing the shape of an array, where the shape is the number of elements in each dimension. By reshaping we can add or remove dimensions, or change the number of elements in each dimension, as long as the total number of elements stays the same.

Reshaping from 1-D to 2-D

z = np.array([2,4,6,8,10,12,14,16,18,20,22,24])

newarray = z.reshape(4, 3)

print(newarray)

Output:

[[ 2  4  6]
 [ 8 10 12]
 [14 16 18]
 [20 22 24]]

Note: the array has been reshaped from a 1-D array into a 2-D array with 4 rows and 3 columns.

Reshaping 1-D to 3-D

z = np.array([2,4,6,8,10,12,14,16,18,20,22,24])

newarray = z.reshape(2, 3, 2)

print(newarray)

Output:

[[[ 2  4]
  [ 6  8]
  [10 12]]

 [[14 16]
  [18 20]
  [22 24]]]

Note: the outermost dimension has 2 arrays, each containing 3 arrays of 2 elements.
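Two details of reshape() are worth seeing in code: passing -1 lets NumPy work out one dimension for us, and a shape that does not match the number of elements raises an error. This is a small sketch using the same 12-element array; the names auto and flat are only for illustration:

```python
import numpy as np

z = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24])

# -1 means "work this dimension out": 12 elements / 3 rows = 4 columns
auto = z.reshape(3, -1)
print(auto.shape)  # (3, 4)

# 5 * 3 = 15 does not match the 12 elements, so NumPy refuses
try:
    z.reshape(5, 3)
except ValueError as err:
    print("cannot reshape:", err)

# reshape(-1) flattens any array back to 1-D
flat = auto.reshape(-1)
print(flat.shape)  # (12,)
```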

Numpy joining array

Joining means putting the contents of two or more arrays into a single array.

x = np.array([10, 20, 30])

y = np.array([40, 50, 60])

arr1 = np.concatenate((x, y))

print(arr1)

Output:

[10 20 30 40 50 60]

We can also join arrays using stack functions such as vstack(), which stacks the arrays vertically so that each input array becomes a row of the result. Here is an example:


x = np.array([10, 20, 30])

y = np.array([40, 50, 60])

arr2 = np.vstack((x,y))

print(arr2)

Output:

[[10 20 30]
 [40 50 60]]
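For comparison, here is a short sketch of hstack(), which joins arrays side by side, and of the axis argument of concatenate() on 2-D arrays. The arrays a and b are made up for this example:

```python
import numpy as np

x = np.array([10, 20, 30])
y = np.array([40, 50, 60])

# hstack joins the arrays horizontally, one after the other
side_by_side = np.hstack((x, y))
print(side_by_side)  # [10 20 30 40 50 60]

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 adds the rows of b below a, axis=1 adds its columns beside a
rows_added = np.concatenate((a, b), axis=0)
cols_added = np.concatenate((a, b), axis=1)
print(rows_added.shape)  # (4, 2)
print(cols_added.shape)  # (2, 4)
```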

Numpy splitting array

Splitting is the reverse operation of joining.
Joining merges multiple arrays into one, and splitting breaks one array into several. To split an array we use the array_split() function, passing it the array to be split and the number of splits.

x = np.array([20,40, 60,70,80,100])

arr3 = np.array_split(x, 3)

print(arr3)

Output:

[array([20, 40]), array([60, 70]), array([ 80, 100])]

Note: the return value in the example above is a list containing three arrays.
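One reason to prefer array_split() is that it copes with arrays that cannot be divided evenly, while split() raises an error in that case. A minimal sketch with a 7-element array (the values are only illustrative):

```python
import numpy as np

x = np.array([20, 40, 60, 70, 80, 100, 120])  # 7 elements

# 7 cannot be divided evenly by 3, so the first part gets the extra element
parts = np.array_split(x, 3)
print([p.tolist() for p in parts])  # [[20, 40, 60], [70, 80], [100, 120]]

# split() insists on equal parts and raises ValueError otherwise
try:
    np.split(x, 3)
except ValueError as err:
    print("split failed:", err)
```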

visualization of data using matplotlib and seaborn

mary kariuki — Sat, 22 Oct 2022 22:54:34 +0000

Visualization of data.

Data visualization is the graphical representation of data.
Matplotlib is a Python library used for plotting graphs, often together with other modules such as pandas and NumPy. Seaborn is also a Python library for plotting graphs; it is built on top of Matplotlib and works closely with NumPy and pandas.
The difference between Seaborn and Matplotlib is that Seaborn offers a higher-level interface for statistical graphics and can summarise an entire dataset in a single plot, while Matplotlib is a lower-level, general-purpose library used for plotting 2-D graphs of arrays.

Matplotlib

The first step is to install Matplotlib with a simple command:

pip install matplotlib

After Matplotlib has been installed, import the pyplot module as shown below:

import matplotlib.pyplot as plt

Note: plt is an alias.
Matplotlib is used to plot various graphs such as:

bar graphs
histograms
pie charts
scatter plots

Scatter plot

To draw a scatter plot we use the scatter() method, which draws one dot for each observation. It needs two arrays of the same length: the x-axis values and the y-axis values.

import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([3,4,5,7,1,0,5,8,6,4])
ypoints = np.array([70,20,70,30,50,90,55,49,34,28])

plt.scatter(xpoints, ypoints)
plt.show()

Bar graphs

To draw a bar graph we use the bar() method and provide the x-axis values (the categories) and the y-axis values (the bar heights).

import matplotlib.pyplot as plt
import numpy as np

xvalues = np.array(["mary", "anne", "simon", "james"])
yvalues = np.array([90,10,50,70])

plt.bar(xvalues,yvalues)
plt.show()

Histogram

A histogram is a graph that shows a frequency distribution.
We use the hist() method to create histograms. It takes an array of numbers, groups the values into bins, and draws a bar showing how many values fall into each bin.

import matplotlib.pyplot as plt
import numpy as np

# 500 random values from a normal distribution with mean 20 and standard deviation 40
y = np.random.normal(20, 40, 500)

plt.hist(y)
plt.show()
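To see the kind of counting that sits behind a histogram, we can do the binning ourselves with NumPy's histogram() function, without drawing anything. This is only a sketch: the small values array and the choice of 4 bins are assumptions made so the numbers are easy to follow.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 8, 9])

# split the range 1..9 into 4 equal bins and count the values in each one
counts, edges = np.histogram(values, bins=4)

print(edges)   # [1. 3. 5. 7. 9.]
print(counts)  # [3 3 0 2]
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why the three 3s are counted in the second bin and 9 is counted in the last one.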

Pie chart

We use the pie() method to create pie charts.

import matplotlib.pyplot as plt
import numpy as np

z = np.array([10,30,5,60,59,70,2])

plt.pie(z)
plt.show()

The pie chart is divided into 7 wedges because we passed 7 elements in the array.

Seaborn

To use the seaborn module, first install it as shown:

pip install seaborn

After installing, import both matplotlib and seaborn, since they go hand in hand:

import matplotlib.pyplot as plt

import seaborn as sns

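Seaborn is used for statistical graphics in Python. Now let's load our data.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# load_dataset() needs the name of one of seaborn's example datasets;
# "iris" is used here as a stand-in for the placeholder name "data"
df = sns.load_dataset("iris")
print(df.head())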

From our data we can produce a single figure that describes the entire dataset.

import matplotlib.pyplot as plt
import seaborn as sns
sns.pairplot(df)
plt.show()

Note: pairplot() allows us to plot the pairwise relationships between the numeric variables in a dataset.

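Distplot in seaborn
Distplot stands for distribution plot. It takes an array as input and plots a histogram together with a curve corresponding to the distribution of the points in the array. Note that distplot() has been deprecated since seaborn 0.11; histplot() and displot() are its replacements.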

import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot([2,4,6,8,10])

plt.show()

Analyzing and cleaning data using pandas

mary kariuki — Tue, 18 Oct 2022 16:03:39 +0000

Analyzing data using pandas

It is advisable to get a quick overview of your dataset once you load it into a DataFrame. A dataset is often in CSV format, so to load it into a DataFrame we use the following:

import pandas as pd
dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dx)

We can view our data in various ways, including:

head() function

It is a function that returns the column headers and a specified number of rows from the top (5 by default).

import pandas as pd
dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\datas.csv")
print(dt.head(10))

tail() function

It returns the column headers and a specified number of rows from the bottom (5 by default).

import pandas as pd
dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dt.tail())

info() function

It is a function that gives more information about your dataset: the data type of each column, the number of non-null cells, and memory usage, among others.

dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
# info() prints its report itself and returns None, so print() is not needed
dt.info()

describe() function

The function gives a statistical description of the numeric columns in your data: the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum.

dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dt.describe())
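If you do not have a CSV file at hand, the same overview functions can be tried on a small DataFrame built in code. The records table below is hypothetical and only stands in for a real dataset:

```python
import pandas as pd

# hypothetical data standing in for a CSV file
records = pd.DataFrame({
    "fname": ["Ann", "Ben", "Cat", "Dan", "Eve", "Fay"],
    "score": [50, 60, 70, 80, 90, 100],
})

print(records.head(2))  # first two rows
print(records.tail(2))  # last two rows

# describe() returns a DataFrame, so single statistics can be looked up
summary = records.describe()
print(summary.loc["mean", "score"])  # 75.0
print(summary.loc["50%", "score"])   # 75.0 (the median)
```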

Cleaning data using pandas

Cleaning data simply means removing or fixing bad data in a dataset. This may involve removing empty cells, removing duplicates, and fixing data that is wrong or in the wrong format.

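removing duplicates

Duplicates are rows that have been recorded more than once.
To find duplicates we use the duplicated() function, which returns a Boolean value for every row: True if the row repeats an earlier row, otherwise False. Summing those values counts the duplicate rows, and drop_duplicates() returns the dataset without them.

print(dt.duplicated().sum())
dt = dt.drop_duplicates()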
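Here is a minimal sketch of finding and dropping duplicates on a small made-up table (the people DataFrame and its values are assumptions for illustration):

```python
import pandas as pd

# hypothetical data: the third row repeats the first one
people = pd.DataFrame({
    "fname": ["Ann", "Ben", "Ann", "Cat"],
    "score": [50, 60, 50, 70],
})

# one True/False flag per row; True marks a repeat of an earlier row
flags = people.duplicated()
print(flags.tolist())  # [False, False, True, False]
print(flags.sum())     # 1

# drop_duplicates() returns a new DataFrame without the repeated rows
unique_people = people.drop_duplicates()
print(len(unique_people))  # 3
```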

removing empty cells

We remove rows that contain empty cells using the dropna() function. By default this method returns a new dataset and does not change the original one.

dt = pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\datas.csv")
data1=dt.dropna()
print(data1)
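The sketch below, on a hypothetical marks table, shows that dropna() leaves the original DataFrame untouched and that the subset argument limits the check to chosen columns:

```python
import numpy as np
import pandas as pd

# hypothetical data with one missing name and one missing score
marks = pd.DataFrame({
    "fname": ["Ann", None, "Cat"],
    "score": [50.0, 60.0, np.nan],
})

# drop every row that has at least one empty cell
cleaned = marks.dropna()
print(len(marks))    # 3, the original is unchanged
print(len(cleaned))  # 1

# only look at the fname column when deciding which rows to drop
by_name = marks.dropna(subset=["fname"])
print(len(by_name))  # 2
```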

cleaning wrong data

We fix wrong data in two ways: by removing the wrong values or by replacing them.

1. removing wrong data

dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
dx.dropna(subset=['fname'],inplace=True)
print(dx)

2. replacing wrong data

dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
# replace every 'ASSIGN 2' value above 90 with 10
for line in dx.index:
    if dx.loc[line,'ASSIGN 2']>90:
        dx.loc[line,'ASSIGN 2']=10
print(dx)
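The same replacement can also be written without a loop by selecting the rows with a Boolean condition. This is an alternative sketch, not the method used above; the column name 'ASSIGN 2' comes from the example, while the grades values are made up:

```python
import pandas as pd

# hypothetical scores for the 'ASSIGN 2' column
grades = pd.DataFrame({"ASSIGN 2": [45, 95, 70, 120]})

# True for every row whose value is above 90
too_high = grades["ASSIGN 2"] > 90

# overwrite only the selected rows
grades.loc[too_high, "ASSIGN 2"] = 10

print(grades["ASSIGN 2"].tolist())  # [45, 10, 70, 10]
```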

cleaning wrong format

It may be difficult or even impossible to analyze data when some columns or rows are in the wrong format. A wrong format can be a row holding multiple data types. To fix this, one can convert the entire row into one data type or remove the entire row from the dataset.

removing entire row

dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
# dropna() returns a new DataFrame, so the result is assigned back to dx
dx = dx.dropna(subset=['row 3'])
print(dx)
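For the other option, converting to one data type, a possible approach is pandas' to_numeric() function with errors="coerce", which turns anything that is not a number into a missing value that dropna() can then remove. This is a sketch on a hypothetical score column, not part of the original example:

```python
import pandas as pd

# hypothetical column mixing numeric text, a word, and an empty cell
raw = pd.DataFrame({"score": ["50", "sixty", "70", None]})

# values that cannot be read as numbers become NaN instead of raising an error
raw["score"] = pd.to_numeric(raw["score"], errors="coerce")
print(raw["score"].isna().sum())  # 2

# now the rows in the wrong format can be dropped like any empty cell
fixed = raw.dropna(subset=["score"])
print(fixed["score"].tolist())  # [50.0, 70.0]
```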