DEV Community: Elvis Mburu

Machine Learning

Elvis Mburu — Wed, 10 May 2023 11:21:24 +0000

In this day and age, a few buzzwords are booming in the tech sphere, among them artificial intelligence, machine learning and deep learning.

What do these words mean? We'll cover them briefly, but our main focus will be machine learning.

Artificial Intelligence

According to a research paper by John McCarthy, artificial intelligence is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.

Alan Turing, on the other hand, proposed a test (now known as the Turing test) that an artificially intelligent computer has to pass.

Generally, AI focuses on the simulation of human intelligence in computers/machines.

Machine Learning

It is a subset of artificial intelligence (AI) in which algorithms are trained on a dataset and then make predictions or decisions based on the data they are fed.
This means that computers learn from data rather than being explicitly programmed.

There are three main types of machine learning algorithms:

Supervised Learning
Unsupervised Learning
Reinforcement Learning

a. Supervised Learning
The algorithms are trained with labeled datasets. This means that the training data has both the input features and the expected output.
It is pretty similar to teaching a class and also giving the students the answers to the test, so that they can gauge their understanding.
Thus the algorithm learns to recognize patterns to make predictions/decisions about new, unlabeled data.
Some of the common supervised algorithms are:

linear regression

logistic regression

random forest

support vector machine (SVM)
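To make the idea of learning from labeled data concrete, here is a minimal sketch of linear regression fitted by ordinary least squares in plain Python. The dataset and the helper name fit_line are made up for illustration; in practice a library such as scikit-learn would normally be used.

```python
# Toy labeled dataset: each input x comes with a known output y (here y = 2x + 1).
# The numbers are invented for this sketch.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": learn the pattern from the labeled examples.
slope, intercept = fit_line(xs, ys)

# Prediction for a new, unlabeled input.
new_x = 6.0
prediction = slope * new_x + intercept
print(slope, intercept, prediction)
```

The model never sees the rule y = 2x + 1 directly; it recovers the slope and intercept from the labeled examples and then applies them to an input it has not seen.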

b. Unsupervised Learning
The algorithm is trained on unlabeled data. This means that the algorithm has to recognize patterns in the data and then group similar data together.

Some common algorithms used in unsupervised machine learning are:

neural networks

k-means clustering

probabilistic clustering methods
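As a rough illustration of grouping unlabeled data, the sketch below runs a very small one-dimensional k-means in plain Python. The data points, the starting centroids and the function name kmeans_1d are assumptions made for this example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Tiny 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Unlabeled data: two visible groups, but no labels are provided.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

The algorithm is never told which point belongs to which group; the grouping emerges from the distances between the points alone.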

c. Reinforcement Learning
The algorithm learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. The algorithm strives to maximize rewards while reducing penalties.

An example of reinforcement learning is a robot that learns to pick up items. It holds an item and attempts to lift it. If the item falls, the robot adjusts, through trial and error, factors such as how tightly it grips or where it holds the item. Over time it learns the best way to pick up the item and does so with little effort.
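The trial-and-error loop can be sketched with a simple epsilon-greedy learner. This is a toy model and not a real robot controller: the three grip strategies, their rewards and the train function are hypothetical, chosen only to show how rewards and penalties shape the choice of action.

```python
import random

# Hypothetical grip strategies and the reward each earns in this toy
# environment: +1 when the item is lifted, -1 when it is dropped.
REWARDS = {"loose grip": -1.0, "firm grip": 1.0, "crushing grip": -1.0}


def train(episodes=200, epsilon=0.1, learning_rate=0.5, seed=0):
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    actions = list(REWARDS)
    value = {a: 0.0 for a in actions}  # estimated reward per action
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(actions)          # explore: try a random action
        else:
            action = max(actions, key=value.get)  # exploit: best known action
        reward = REWARDS[action]                  # feedback from the environment
        # Nudge the estimate toward the observed reward.
        value[action] += learning_rate * (reward - value[action])
    return value


value = train()
best_action = max(value, key=value.get)
print(best_action)
```

Actions that were penalized end up with negative estimates, so the learner settles on the grip that was rewarded.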

Applications of Machine Learning

self-driving cars
speech recognition
computer vision
recommendation engines
fraud detection

Machine learning is an essential tool in data science. We'll explore these algorithms as we progress.

Comprehensive Guide to GitHub for Data Scientists

Elvis Mburu — Sun, 02 Apr 2023 20:40:31 +0000

What is GitHub

GitHub is a code hosting platform for collaboration and version control.
It facilitates social coding by providing a hosting service and web interface for Git code repositories.

What is version control

It is the practice of tracking and managing changes to software code.
Version control software keeps track of every modification to the code in a special kind of database.
Git is a version control system.

Git identifies every commit, that is every recorded set of modifications to the source code, by a unique hash.
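To give a feel for where such hashes come from, the sketch below reproduces the ID Git computes for a file's content (a "blob") when the repository uses the default SHA-1 object format. The function name git_blob_hash is made up for this example; commits are hashed in a similar way, but over the commit's own metadata.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 ID Git assigns to a file's content (a blob)."""
    # Git prefixes the content with "blob <size in bytes>" and a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_hash(b"")
hello_id = git_blob_hash(b"hello world\n")
print(empty_id)
print(hello_id)
```

Because the hash depends on the content, any change to a file produces a different ID, which is how Git can tell versions apart.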

Version Control Benefits

History Tracking
Collaborative history tracking

Github Terms

Installing git on linux

sudo apt install git-all

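If you are on another system, see the Git installation instructions for your platform.

Configure user
Set the name to attach to your commits:

git config --global user.name "[username]"

Set the email to attach to your commits:

git config --global user.email "[email]"

Git has no password setting that it uses for authentication, so there is nothing to configure here. GitHub authenticates pushes over HTTPS with a personal access token, or over SSH with an SSH key.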

Repository

A repository is a centralized location in Git where files and their version history are stored.
In other words, it's a directory that contains all the files and sub-directories associated with a project, along with the entire revision history of each file.

Branch

A branch is a parallel version of a repository.
The default branch is traditionally called master; GitHub names it main for new repositories.
Any other branch starts as a copy of the default branch at a particular time.
Each branch contains changes that are different from the main codebase, i.e. the default branch.
Benefits of using branches

It allows parallel development without disrupting the main codebase
It facilitates collaboration across teams

Commits

A commit is a recorded snapshot of the changes in a repository.
Each commit has a message describing what was changed and why.

Pull requests

They are instrumental in enabling seamless collaboration.
With a pull request you are proposing that your changes should be merged into the master branch.
They show content differences, changes, additions and deletions in colors (red and green).

Pull requests are merged into the main branch by the repository owner or the code reviewer.

Github Events

Now that we have a brief overview of what GitHub is all about, let's dive into some of the events:

creating and deleting a repository
pushing code to a repository
creating a branch
opening and closing a pull request
code reviewing
merging
opening and closing issues
assigning issues

Creating a repository

A repository is also called a repo.
There are two ways of creating a repo:

from the GitHub user interface
from an existing folder

a. Repo from the GitHub user interface
On GitHub, click the green button at the top-left.

b. Repo from a folder
You may want to turn an existing folder on your local machine into a repo.
In the terminal, go to the existing project you want to start tracking.
Then enter the command below to initialize the folder as a repository.

git init

This creates a new repository in the current directory.

You then use the command

git add .

This command adds all changes in the current directory and its sub-directories to the staging area (the temporary storage area in Git where you prepare changes to be committed to the repository).
If you want to stage only selected files, use the following instead of git add .

git add filename

The command adds a file named filename to the staging area,
or

git add file1 file2

in case of multiple files.

To commit the changes in the staging area to the repository we use the git commit command. Example:

git commit -m "first commit in the repository"

The -m option allows you to specify a message for a commit.
The message is usually a brief summary of the changes that were made.
Benefits of a good commit message

Enhance clarity of what changes were made
Acts as a historical record of the changes made
Facilitates collaboration among team members
Helps in debugging as it helps identify which changes caused the errors/bugs
A commit message can serve as documentation for the code changes

Now let's rename the current branch to main

git branch -M main

This command simply renames the current branch to main.
Git's traditional default branch name is master.

Now let's add a new remote repo named origin to the local git repo.

git remote add origin git@github.com:username/new_repo

We now push our changes to the remote repository named origin

git push -u origin main

If you want to clone a repo from GitHub to your local machine,
you can use the command:

git clone url/to/the/repo

This creates a directory with the same name as your repo, containing the project contents.

Following Github Flow

Create a branch
Create a branch in your repository.
There are two ways of creating a branch in your repository:

from the GitHub interface
from the terminal

Create a branch from the GitHub interface
Click on the branch dropdown on the left of your screen

Write the name of your branch

Then click Create branch

create a branch from the terminal
Check the current branch using the command

git branch

create a new branch using the command

git branch <branch_name>

Now switch to the new branch using the command

git checkout <branch_name>

Alternatively, create the branch and switch to it in one step:

git checkout -b <branch_name>

Now you'll be making changes to the new branch instead of the main/master branch.
To list the branches present in the repo

git branch --list

You can commit and push your changes to the branch.
You can also revert your changes if a mistake is made.

Deleting a branch
To delete a branch you use the command

git branch -d [branch-name]

Create a pull request

Creating a pull request is vital, especially in a collaborative environment.
Some pull requests require approval before they can be merged.
When you create a pull request, include a summary of the changes and the problem they solve.

On github web interface
navigate to the main page of the repository
in the branch menu, choose the branch that contains your commits

click on New pull request
You can choose the branch you want to create a pull request for

If there are no issues, you can click on the green Create pull request button

The repo owner or code reviewer will then review the pull request and merge it to the main branch.

Create the pull request using the CLI
To create a pull request from the terminal we use the GitHub CLI (gh):

gh pr create --assignee "username"

or you can use "@me" to self-assign the pull request

Synchronize changes

To synchronize your local repository with the remote repository on Github

git fetch
It downloads all history from the remote tracking branches

git merge
It combines remote tracking branch into the current local branch

git push
Uploads all local branch commits to Github

git pull
Updates your current local working branch with new commits from the corresponding remote branch
It is a combination of git fetch and git merge

Commit Changes

To list the version history for the current branch use the command

git log

To list the version history for a file, including renames

git log --follow [file]

To show content differences between two branches

git diff [first_branch]...[second_branch]

To snapshot a file in preparation for versioning (that is, to stage it)

git add [file]

Redo Commits

To undo all commits after [commit], preserving changes locally

git reset [commit]

To discard all history and changes back to the specified commit

git reset --hard [commit]

Sentiment Analysis

Elvis Mburu — Sat, 25 Mar 2023 12:23:26 +0000

Getting Started With Sentiment Analysis

It is the process of detecting positive or negative sentiment in text.
It is also referred to as opinion mining.
It is an approach to natural language processing (NLP) that identifies the emotional tone behind a body of text.
It is widely used by organizations to determine and categorize opinions about a product, service or idea.

Sentiment analysis involves the use of data mining, machine learning (ML), artificial intelligence and computational linguistics to mine text for sentiment and subjective information.

Such information may be classified as:

positive
neutral
negative

This classification is also known as the polarity of a text.

Graded Sentiment Analysis

very positive
positive
neutral
negative
very negative

This is also referred to as graded or fine-grained sentiment analysis.

Types of Sentiment Analysis

Intent-based - recognizes motivation behind a text
Fine-grained - graded sentiment analysis
Emotion-detection - allows detection of various emotions
Aspect-based - analyses text to identify the particular aspects/features it mentions and the polarity expressed towards each.

We will not dive into these types for now.

Sentiment analysis in turn helps organizations gather insights into real-time customer sentiment, customer experience and brand reputation.

Generally, sentiment analysis tools use text analytics to analyze online sources.

Benefits of sentiment analysis

sorting data at scale
real-time analysis
consistent criteria

Steps involved in Sentiment Analysis

Sentiment analysis generally follows these steps:

Collect data - The text to be analyzed is identified and collected.
Clean the data - The data is processed and cleaned to remove noise and parts of speech that don't have meaning relevant to the sentiment of the text.
Extract features - A machine learning algorithm automatically extracts text features to identify negative or positive sentiment.
Pick an ML model - A sentiment analysis tool scores the text using a rule-based, automatic or hybrid ML model.
Sentiment classification - Once a model is picked and used to analyze a piece of text, it assigns a sentiment score to the text: positive, negative or neutral.
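The steps above can be sketched end to end with a tiny rule-based scorer in plain Python. Everything in it is an assumption made for illustration: the mini-lexicon, the stopword list, the sample reviews and the function names. A real project would use a proper lexicon or a trained model.

```python
import re

# Hypothetical mini-lexicon and stopword list, invented for this sketch.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "awful": -2, "hate": -2}
STOPWORDS = {"the", "a", "an", "is", "was", "this", "it", "i", "and"}


def clean(text):
    """Clean the data: lowercase, keep alphabetic words only, drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def score(tokens):
    """Extract features and score: look up each token and sum the values."""
    return sum(LEXICON.get(t, 0) for t in tokens)


def classify(text):
    """Sentiment classification: turn the numeric score into a polarity label."""
    total = score(clean(text))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


# Collect data: three made-up reviews.
reviews = [
    "I love this movie, it was great!",
    "The plot was awful and I hate the ending.",
    "It is a movie.",
]
labels = [classify(r) for r in reviews]
print(labels)
```

This corresponds to the rule-based option mentioned in the steps; the automatic and hybrid options replace or complement the lexicon with a model learned from labeled data.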

Let's take a deep dive into sentiment analysis using an example.

Step 1. Collect Data

We are going to use a data set from the UCI Machine Learning Repository.

Let's start by importing the libraries that we will be using.
punkt is a data package that contains pre-trained models for tokenization.

# import the required packages and libraries
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')

loading the dataset

pd.set_option('display.max_colwidth', None)
df = pd.read_csv('https://gist.githubusercontent.com/fmnobar/88703ec6a1f37b3eabf126ad38c392b8/raw/76b84540ccd4b0b207a6978eb7e9d938275886ff/imdb_labelled.csv')
df.head()

Output

We can now see that there are only two columns: text and label.
The label indicates the sentiment of the review:

1 indicates a positive sentiment
0 indicates a negative sentiment

The label thus gives the polarity of the sentiment.

We now create a sample string, which is the first entry in the text column of the dataframe df.

sample = df.text[0]
sample

Output

Tokens and Bigrams

a. Tokens

A token is a single unit of meaning that can be identified in a text.
It is also known as a unigram.
Tokenization is the process of breaking down a text into individual tokens.
The functions that perform tokenization are called tokenizers.
This concept is implemented with the nltk.word_tokenize function.

The function takes a string of text as input and returns a list of tokens.
It splits the text into individual words and punctuation marks.
Let's see an example of the function's usage by tokenizing the sample text.

sample_tokens = nltk.word_tokenize(sample)
sample_tokens[:10] # view the first 10 tokens

Output

b. Bigrams

If we combine two unigrams/tokens we form a bigram.
A bigram is a pair of adjacent tokens in a text.
Bigrams are used to capture some of the context in which a particular word
or phrase appears.
They are a special case of n-grams, which are sequences of n words/tokens
and are used to build statistical models of language.
By analyzing the frequency of different n-grams in a large corpus of text,
NLP systems can learn to predict the probability of different words occurring in a particular context.

Bigrams are implemented with the nltk.bigrams function.

Let's see this in action

sample_bitokens = list(nltk.bigrams(sample_tokens))

# Return the first 10 bigrams
sample_bitokens[:10]

Output
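For intuition about what a bigram list looks like, here is a minimal pure-Python sketch that pairs each token with its successor. The token list and the helper name make_bigrams are made up for this example.

```python
def make_bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical token list standing in for tokenized text.
tokens = ["a", "very", "slow", "moving", "movie"]
bigrams = make_bigrams(tokens)
print(bigrams)
```

A list of n tokens yields n - 1 bigrams, since the last token has no successor.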

Frequency Distribution

Refers to the count or proportion of words or phrases associated with positive or negative sentiment.
It basically counts the occurrences of each sentiment-bearing word/phrase
and then calculates the frequency distribution.

It is implemented using nltk.FreqDist.

What are the top 10 most frequently used tokens in our sample?

sample_freqdist = nltk.FreqDist(sample_tokens)

# Return the top 10 most frequent tokens
sample_freqdist.most_common(10)

Output

These results make sense:

a comma, "the", "a" or a period can be quite common in a phrase.
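The same kind of counting can be sketched with the standard library's collections.Counter, which offers a most_common method of the same name. The token list below is made up and simply stands in for a tokenized review.

```python
from collections import Counter

# Hypothetical token list; "the" and the comma are deliberately the most frequent.
tokens = ["the", "plot", ",", "the", "cast", ",", "the", "end"]

freq = Counter(tokens)
top_2 = freq.most_common(2)
print(top_2)
```

As in the review sample, function words and punctuation dominate the counts, which is one reason stopwords and punctuation are often removed before analysis.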

Let's create a function named tokens_top that takes in a text
as input and returns the top n most common tokens in a given text.

def tokens_top(text, n):
    # create tokens
    tokens = nltk.word_tokenize(text)

    # create the frequency distribution
    freqdist = nltk.FreqDist(tokens)

    # return the top n most common tokens
    return freqdist.most_common(n)

# Call the function 
tokens_top(df.text[1], 10)

Output

Document-Term Matrix

It is a matrix that represents the frequency of terms that occur in a collection of documents.
The rows represent the documents in the corpus and the columns represent the terms.
Each cell of the matrix represents the frequency or weight of a term in a document.
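To make the layout concrete, here is a minimal pure-Python sketch that builds a document-term matrix by hand for three made-up documents. It splits on whitespace only, which is a simplification compared with a real tokenizer.

```python
# Three tiny made-up documents.
docs = ["good movie", "bad movie", "good good plot"]

# Vocabulary: every distinct term in the corpus, sorted for a stable column order.
vocabulary = sorted({term for doc in docs for term in doc.split()})

# One row per document, one column per term, each cell a term count.
dtm = [[doc.split().count(term) for term in vocabulary] for doc in docs]

print(vocabulary)
for row in dtm:
    print(row)
```

Most cells are zero even in this tiny example, which is why libraries usually store such matrices in a sparse format.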

We can implement this with scikit-learn's CountVectorizer

Example

#import the package
from sklearn.feature_extraction.text import CountVectorizer

def create_dtm(series):
    # Create an instance/object of the class
    cv = CountVectorizer()

    # create a dtm from the series parameter
    dtm = cv.fit_transform(series)

    # convert the sparse array to a dense array
    dtm = dtm.todense()

    # get column names
    features = cv.get_feature_names_out()

    # create a dataframe
    dtm_df = pd.DataFrame(dtm, columns = features)

    # return the dataframe
    return dtm_df
# Call the function on df['text'].head()
create_dtm(df['text'].head())

Output

Data Cleaning

Feature Importance

Refers to the extent to which a specific feature/variable contributes to the
prediction or classification in sentiment analysis.

There are different methods that can be used to determine feature importance:

machine learning algorithms, e.g. decision trees and random forests
statistical methods, e.g. correlation or regression analysis

Feature importance is a useful tool in sentiment analysis as it can help identify
the most important features for accurately predicting the sentiment of a text.

Example
We'll define a function "top_n_tokens" that has 3 parameters:
text, sentiment and n.

The function will return the top n most important tokens
for predicting the sentiment of the text.

We'll use LogisticRegression from sklearn.linear_model
with the following parameters:

solver = 'lbfgs'
max_iter = 2500
random_state = 1234

from sklearn.linear_model import LogisticRegression

def top_n_tokens(text, sentiment, n):
    # create an instance of the class
    lgr = LogisticRegression(solver = 'lbfgs', max_iter = 2500, random_state = 1234)
    cv = CountVectorizer()

    # create the DTM
    dtm = cv.fit_transform(text)

    # fit the logistic regression model
    lgr.fit(dtm, sentiment)

    # get the coefficients
    coefs = lgr.coef_[0]

    # create the features/column names
    features = cv.get_feature_names_out()

    # create the dataframe
    df = pd.DataFrame({'Tokens' : features, 'Coefficients' : coefs})

    # return the rows with the n largest coefficients
    return df.nlargest(n, 'Coefficients')

# Test it on df['text']
top_n_tokens(df.text, df.label, 10)

Output

The largest coefficients should belong to tokens that indicate a strong positive sentiment.
To check the other end of the scale, let's look at the 10 smallest coefficients, which should belong to tokens that indicate a negative sentiment.

from sklearn.linear_model import LogisticRegression

def bottom_n_tokens(text, sentiment, n):
    # create an instance of the class
    lgr = LogisticRegression(solver = 'lbfgs', max_iter = 2500, random_state = 1234)
    cv = CountVectorizer()

    # create the DTM
    dtm = cv.fit_transform(text)

    # fit the logistic regression model
    lgr.fit(dtm, sentiment)

    # get the coefficients
    coefs = lgr.coef_[0]

    # create the features/column names
    features = cv.get_feature_names_out()

    # create the dataframe
    df = pd.DataFrame({'Tokens' : features, 'Coefficients' : coefs})

    # return the rows with the n smallest coefficients
    return df.nsmallest(n, 'Coefficients')

# Test it on df['text']
bottom_n_tokens(df.text, df.label, 10)

Output

In the examples we've covered so far we've used labelled data.
What if we do not have labelled data?
Then we can use pre-trained models or tools such as:

TextBlob
VADER
Stanford CoreNLP
Google Cloud Natural Language API
Hugging Face Transformers

Let's explore TextBlob

TextBlob

It is a Python library that provides a simple API for performing common
NLP tasks such as sentiment analysis.
It uses a pre-trained model to assign a sentiment score to a piece of text, ranging from -1 to 1

It is built on top of NLTK (natural language toolkit)
It also provides additional information such as:

subjectivity score

It returns the sentiment of a given text in the form of a named tuple as follows:
(polarity, subjectivity)

polarity score is a float within the range of [-1.0, 1.0].

it aims at differentiating whether the text is positive or negative

subjectivity is a float within the range [0.0, 1.0]

0.0 is very objective
1.0 is very subjective

TextBlob also provides other features such as:

part-of-speech tagging
noun phrase extraction

Example

Let's define a function named polarity_subjectivity that accepts two arguments.
The function applies TextBlob to the provided text.
If print_results = True, it prints the polarity and subjectivity of the text; else
it returns a tuple of float values, the 1st being polarity and the 2nd being subjectivity.

You can install TextBlob using

!pip install textblob

#import TextBlob
from textblob import TextBlob

def polarity_subjectivity(text = sample, print_results = False):
    # create an instance of TextBlob
    tb= TextBlob(text)

    # if the condition is met, print the results
    if print_results:
        print(f"Polarity is {round(tb.sentiment[0], 2)} : Subjectivity {round(tb.sentiment[1], 2)}")
    else:
        return (tb.sentiment[0], tb.sentiment[1])

# Test the function
polarity_subjectivity(sample, print_results =  True)

Output

The results indicate that our sample has a slightly positive polarity and is relatively subjective, though not by a high degree.

Let's define a function token_count that accepts a string and, using nltk's word_tokenize,
returns the number of tokens in the given string as an integer.

Then define another function series_tokens that accepts a Pandas Series as an argument
and applies the function
token_count to the given series.
Use the second function on the top 10 rows of our dataframe.

# import libraries
from nltk import word_tokenize

# Define the first function that counts the number of tokens in a given string
def token_count(string):
    return (len(word_tokenize(string)))

# Define the second function that applies the token_count function to a given Pandas series
def series_tokens(series):
    return series.apply(token_count)

# Apply the function to the top 10 rows of the data frame
series_tokens(df.text.head(10))

Output

Let's define a function named series_polarity_subjectivity
that applies the polarity_subjectivity function we defined earlier

# define the function
def series_polarity_subjectivity(series):
    return series.apply(polarity_subjectivity)

# apply to the top 10 rows of df['text']
series_polarity_subjectivity(df['text'].head(10))

Output

Measure of Complexity - Lexical Diversity

Lexical diversity refers to the variety of words used in a piece of writing or speech.
It is a measure of how often different words are used in a given text or speech and is often used as an indicator of the richness and complexity of vocabulary.
It thus defines the number of unique tokens over the total number of tokens.

Example

Let's define a complexity function that accepts a string as an argument and returns the lexical complexity score defined as the number of unique tokens over the total number of tokens.

def complexity(string):
    # create a list of all tokens
    total_tokens = nltk.word_tokenize(string)

    # create a set of words(It keeps only unique values)
    unique_tokens = set(total_tokens)

    # Return the complexity measure
    if len(total_tokens) > 0:
        return len(unique_tokens) / len(total_tokens)

# apply the function to top 10 rows
df.text.head(10).apply(complexity)

Output

An interesting insight: the rows at index 3 and 4 have the highest lexical diversity. All the tokens in them are unique.

Text Cleanup - Stopwords and Non-alphabeticals

This step ensures that the text data is in a consistent format and removes noise, irrelevant information and other inconsistencies.
Some of the techniques for text cleanup:

Lowercasing
Tokenization
Stopword Removal
Removing Punctuation
Stemming and Lemmatization
Removing URLs and mentions
Removing emojis and emoticons

Example

# import the library and download the stopword lists (needed once)
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Select only English stopwords
english_stop_words = stopwords.words('english')

# print the first 20
print(english_stop_words[:20])

Let's look at an example of detecting non-alphabetical strings.
We'll use isalpha, which returns True only if every character in the string is a letter.

string_1 = "Crite_Jes.cd"
string_2 = "a quick dog"
string_3 = "We are good!"

print(f"String_1: {string_1.isalpha()}\n")
print(f"String_2: {string_2.isalpha()}\n")
print(f"String_3: {string_3.isalpha()}\n")

Output
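Applied to individual tokens rather than whole strings, isalpha becomes a practical cleanup filter. The sketch below combines it with stopword removal; the token list and the small stop_words set are made up for this example and stand in for tokenizer output and the full NLTK stopword list.

```python
# Hypothetical subset of stopwords; the full list would come from
# stopwords.words('english').
stop_words = {"a", "the", "is", "are", "we"}

# Hypothetical tokens, as a tokenizer might return them.
tokens = ["We", "are", "good", "!", "A", "quick", "dog", "."]

# Lowercase each token, keeping it only if it is alphabetic and not a stopword.
cleaned = [t.lower() for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(cleaned)
```

Checking tokens one by one avoids the problem seen above, where a whole sentence fails isalpha simply because it contains spaces.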

Essential SQL Commands for Data Science

Elvis Mburu — Mon, 13 Mar 2023 19:24:51 +0000

Structured Query Language (SQL) is a programming language designed for managing and manipulating relational databases.
A database on the other hand is a collection of data that is organized in a manner that facilitates ease of access, as well as efficient management and updating.
A database is made up of tables that store relevant information.
The language is used by data analysts and data scientists to extract insights from large datasets.
SQL is a powerful tool that can be used to perform a wide variety of data manipulation tasks, including filtering, sorting, grouping and aggregating data.
A table stores and displays data in a structured format consisting of columns and rows that are similar to those seen in Excel spreadsheets.

SQL can:

Insert, update or delete records in a database.
Create new databases, tables, triggers and views.
Retrieve data from a database.

Basic SQL Commands

I shall be demonstrating these commands using the MySQL terminal.

1. SHOW

The show statement displays information contained in the database and its tables

SHOW DATABASES;

This command (SHOW DATABASES) lists the databases managed by the server.

SHOW TABLES;

SHOW COLUMNS FROM table_names;

This command shows the columns in the table_names table.
It displays the:

Field : the column name
Type : the data type of the values stored in the column
Null : whether the column can contain NULL values
Key : whether the column is indexed, for example PRI for the primary key
Default : the default value assigned to the column
Extra : any additional information that is available about a given column

USE command

It is used (no pun intended) to specify which database to be used if there are multiple of them.

USE demo;

There are six databases managed by my server. With the help of USE we specify that we want to use the demo database.

2. SELECT Statement

It is used to retrieve data from one or more tables in a database.
The select statement can be used to filter, sort and group data using different functions which we'll cover as we progress.
Here's the syntax of SQL SELECT statement:

SELECT column_list
FROM table_name;

column_list : one or more columns from which data is retrieved
table_name : the name of the table from which the information is retrieved

A query may retrieve information from selected columns or from all columns in the table.
To create a simple SELECT statement, specify the name(s) of the column(s) you need from the table.

SELECT adm_no FROM table_names;

With the above statement we SELECT the values in the adm_no column of the table_names table. This means we have specified the column whose values we want selected. Here we have selected from just one column.

SELECT adm_no, Maths FROM table_name;

We can specify more columns to be queried. In the above statement we have queried from two columns (adm_no and Maths).
To SELECT from specific multiple columns you use a comma (,) to add the name of the column you want queried.

SELECT * FROM table_name;

We use an asterisk (*) if we want to query/fetch from all the columns in a table.

Multiple Queries
SQL allows you to run multiple queries or commands one after another.

SELECT * FROM pets;
SELECT * FROM table_name;

The above statements retrieve all the columns and rows in the pets and table_name tables.

The DISTINCT Keyword
In situations where a table contains duplicate records, you may want to retrieve only the unique records instead of fetching the duplicates.
Syntax

SELECT DISTINCT col_name1, col_name2
FROM table_name;

Example

SELECT DISTINCT * FROM pets;

We fetch all the distinct rows, with all their columns, from the pets table.
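The effect of DISTINCT can be tried without a MySQL server by using Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite stands in for MySQL, and the pets columns and rows are made up, since the original table's structure is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE pets (name TEXT, species TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [("Rex", "dog"), ("Rex", "dog"), ("Tom", "cat")],  # made-up rows, one duplicate
)

all_rows = conn.execute("SELECT * FROM pets").fetchall()
distinct_rows = conn.execute("SELECT DISTINCT * FROM pets ORDER BY name").fetchall()
print(len(all_rows), distinct_rows)
conn.close()
```

The plain SELECT returns all three rows, while SELECT DISTINCT collapses the two identical rows into one.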

3. CREATE Database

The CREATE DATABASE statement is used to create a new SQL database.

CREATE DATABASE employees;

With the above command we've created a database called employees.

We can now USE the database and create its tables.

4. CREATE TABLE

The CREATE TABLE statement is used to create a new table in a database.

CREATE TABLE table_name (
column1 datatype,
column2 datatype,
...
);

That is the syntax of creating a table.
Let's create a new table

CREATE TABLE employees (
emp_id VARCHAR(50),
firstName VARCHAR(50),
lastName VARCHAR(50),
department VARCHAR(50));

We've created a table called employees
It has the following columns:

emp_id : of type varchar
firstName : of type varchar
lastName : of type varchar
department : of type varchar

Now that we have created a new table we want to insert values to the table:

INSERT INTO

The INSERT INTO statement is used to insert new records into a table.

Syntax

INSERT INTO table_name
VALUES (value1, value2, ...);

Here's an example inserting values to the employees table.

INSERT INTO employees
VALUES ('IT_210', 'John', 'Doe', 'IT');

We have inserted values into the employees table.
Note we have supplied values for every column in the employees table.

What if we do not want to supply a value for every column/field of the table?
We then have to supply a list of the fields we want to supply values for

INSERT INTO employees (emp_id, firstName, lastName)
VALUES ('mn_210', 'Lucky', 'Lard');

Here we've supplied the list of fields we want to supply values for: emp_id, firstName, lastName
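What happens to the column that was left out can be checked with Python's built-in sqlite3 module. SQLite is used here as a stand-in for MySQL, which is an assumption of this sketch; the table definition and inserted values mirror the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE employees ("
    "emp_id VARCHAR(50), firstName VARCHAR(50), "
    "lastName VARCHAR(50), department VARCHAR(50))"
)

# A value for every column.
conn.execute("INSERT INTO employees VALUES ('IT_210', 'John', 'Doe', 'IT')")

# Values for a subset of the columns only.
conn.execute(
    "INSERT INTO employees (emp_id, firstName, lastName) "
    "VALUES ('mn_210', 'Lucky', 'Lard')"
)

rows = conn.execute("SELECT * FROM employees ORDER BY emp_id").fetchall()
print(rows)
conn.close()
```

The department of the second row comes back as None, Python's representation of SQL NULL, because no value and no default were supplied for it.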

What if we want to update the values already contained in the database?
Maybe we passed the wrong department or name for an employee. We'll use the UPDATE statement.

5. UPDATE statement

The UPDATE statement is used to modify the existing records in a table.
syntax

UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;

Let's update the lastName of the employee with id mn_210

UPDATE employees
SET lastName = 'Angie'
WHERE emp_id = 'mn_210';

We've updated the last name from Lard to Angie.
You can also update several columns at once by separating the assignments with commas, as shown in the syntax definition.
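A multi-column update can be sketched with Python's built-in sqlite3 module, again with SQLite standing in for MySQL. The department value 'Marketing' is hypothetical and only there to show a second assignment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE employees (emp_id TEXT, firstName TEXT, lastName TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [("IT_210", "John", "Doe", "IT"), ("mn_210", "Lucky", "Lard", None)],
)

# Update two columns of one row; the WHERE clause limits which rows change.
cursor = conn.execute(
    "UPDATE employees SET lastName = ?, department = ? WHERE emp_id = ?",
    ("Angie", "Marketing", "mn_210"),
)
changed = cursor.rowcount
updated = conn.execute(
    "SELECT lastName, department FROM employees WHERE emp_id = 'mn_210'"
).fetchone()
untouched = conn.execute(
    "SELECT lastName FROM employees WHERE emp_id = 'IT_210'"
).fetchone()
print(changed, updated, untouched)
conn.close()
```

Only the row matching the WHERE condition changes. Without a WHERE clause, the UPDATE would apply to every row in the table.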

6. WHERE clause

It is used to filter data based on a specified condition.
We provide the conditions that have to be met before the data is returned using the WHERE clause.
The WHERE clause is not only used in SELECT statements, but also in UPDATE, DELETE and other statements.

SELECT * FROM employees
WHERE emp_id = 'mk_23';

Here we've specified that we want to retrieve the record for the employee with id mk_23.
In this case the query will fetch one row, since every id is unique and only one employee has this id.
Example 2

SELECT * FROM employees
WHERE department = 'IT';

Here we fetch all the records having the department as IT.

In this case it returns two records.

We will see more of WHERE as we progress with other statements, commands etc.

7. DELETE

The DELETE statement is used to delete existing records in a table.
Syntax

DELETE FROM table_name WHERE condition;

Let's delete the records for Lucky

DELETE FROM employees
WHERE emp_id = 'mn_210';

Here we have deleted the record that has the emp_id with the value mn_210, which refers to Lucky Angie.
When we retrieve all the records we can see that the record has been deleted.

We can also delete all records at once.
Syntax

DELETE FROM table_name;

Example

DELETE FROM pets;

This format is used to delete every record from a table.
This however does not delete the table.

8. ORDER BY

The keyword is used to sort the result-set in ascending or descending order.
It sorts the records in ascending order by default; to sort in descending order, use the DESC keyword.
Syntax

SELECT col_1, col_2, ...
FROM table_name
ORDER BY col_1, col_2, ... ASC|DESC;

Let's explore the keyword using the employees database and employees table.

SELECT *
FROM employees
ORDER BY emp_id;

Here we've ordered the records of the employees table using the emp_id column in ascending order

SELECT * FROM employees
ORDER BY firstName, lastName;

Here we've ordered the records by firstName and lastName. Where two or more records share the same firstName, those records are ordered by lastName.
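A quick way to see the tie-breaking behaviour and the DESC keyword is a sqlite3 sketch like the one below. The three rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id TEXT, firstName TEXT, lastName TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("e3", "Ann", "Zulu"), ("e1", "Bob", "Kim"), ("e2", "Ann", "Ali")],  # invented rows
)

# Sort by firstName first; rows with the same firstName are then sorted by lastName
by_name = conn.execute(
    "SELECT firstName, lastName FROM employees ORDER BY firstName, lastName"
).fetchall()

# DESC reverses the default ascending order
by_id_desc = conn.execute(
    "SELECT emp_id FROM employees ORDER BY emp_id DESC"
).fetchall()

print(by_name)
print(by_id_desc)
conn.close()
```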

9. GROUP BY

The GROUP BY statement groups rows that have the same values into summary rows.
It is often used with aggregate functions:

COUNT()
MAX()
SUM()
AVG()

Syntax

SELECT col_name(s)
FROM table_name
WHERE condition
GROUP BY col_name(s)
ORDER BY col_name(s)

Let's view various instances of GROUP BY statement

SELECT COUNT(department), department FROM employees
GROUP BY department;

Here we count the number of employees in each department
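The same count-per-department query can be run from Python with sqlite3. This is a sketch on invented rows, so the counts below describe the example data only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("e1", "IT"), ("e2", "IT"), ("e3", "Finance")],  # invented rows
)

# One summary row per department, with the number of employees in it
rows = conn.execute(
    "SELECT department, COUNT(department) FROM employees GROUP BY department"
).fetchall()
counts = dict(rows)

print(counts)
conn.close()
```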

Exploratory Data Analysis Ultimate Guide

Elvis Mburu — Tue, 28 Feb 2023 20:15:28 +0000

Overview

Introduction

Data Science is an inter-disciplinary field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data.

Data Science encompasses many steps and activities.
The main data science steps are:

Business understanding
Data collection
Data Exploration
Data Modelling
Model Evaluation
Model Deployment

We will dive deep into exploratory data analysis commonly referred to as EDA.

Exploratory Data Analysis (EDA)

What exactly is EDA?

EDA generally means the process of exploring the data to gain insights and to identify trends, patterns and relationships between the various features in the data.

To demonstrate various activities in the EDA phase of Data Science we'll use the Python programming language.

If you are new to Python, here is a link to an earlier article about Python for data science; it targets beginners to programming and gradually introduces the relevant concepts for Data Science.

Exploratory data analysis is often used to see what the data can reveal beyond formal modelling or hypothesis testing and provides a better understanding of data set variables and the relationships between them.

Aim
The main aim of EDA is to help look at data before making any assumptions.

Exploratory Data Analysis Tools

Specific statistical functions and techniques you can perform with EDA tools include:

Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
Univariate visualization of each field in the data.
Bivariate visualization and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable.
K-means Clustering is a clustering method in unsupervised learning where the data points are assigned into K groups.
Predictive models, such as linear models, use statistics and data to predict outcomes.

Types of exploratory data analysis

Univariate non-graphical: It is simple since we just consider one variable/feature. The primary goal is to know the underlying sample distribution and make observations about the population. Outlier detection is also part of the analysis:

Central Tendency: The commonly useful measures of central tendency are mean, median and sometimes mode.
Spread: It's an indicator of how far from the center the data values tend to lie. Common measures are the variance, the standard deviation and the interquartile range.
Skewness and Kurtosis: Skewness is a measure of symmetry. A dataset is symmetric if it looks the same to the left and right of the center point. Kurtosis is a measure of whether the data are heavily-tailed or light tailed relative to a normal distribution.
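These univariate measures can be computed directly with pandas. The sketch below uses a handful of made-up numbers (not the housing data), with one large value on the right so that the skewness comes out positive.

```python
import pandas as pd

# Made-up values with one large observation on the right
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 10])

# Central tendency
mean = values.mean()
median = values.median()
mode = values.mode().tolist()

# Spread
std = values.std()
iqr = values.quantile(0.75) - values.quantile(0.25)

# Shape: skew() > 0 means a longer right tail;
# kurt() is excess kurtosis, which is about 0 for a normal distribution
skewness = values.skew()
kurtosis = values.kurt()

print(mean, median, mode, std, iqr, skewness, kurtosis)
```

Because of the single large value, the mean is pulled above the median, which is the usual signature of a positively skewed variable.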

Multivariate Non-graphical
It's an EDA technique that shows the relationship between two or more variables, usually in the form of either cross-tabulation or statistics.
Univariate Graphical
They involve a degree of subjective analysis

Histogram: They are used to describe a feature/variable in terms of frequency/distribution (central tendency, spread, outliers)
Boxplots: They are often used to describe measures of central tendency and show robust measures of location, spread and symmetry.
Multivariate graphical: They display relationships between two or more features/variables(columns).

Examples of multivariate graphics are:

Scatterplot: It plots one variable against another, with one point per observation.
Heatmap: It's a graphical representation where values are depicted by color.

Exploratory Data Analysis is a continuous loop.

A Practical Approach in EDA

Libraries
We will use the following libraries in EDA

numpy: it's a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms and matrices.

Installing numpy

!pip install numpy

Check the official documentation numpy

pandas: It's a Python library used for working with datasets. It has various functions for manipulating the data.

installing pandas

!pip install pandas

Official documentation pandas

matplotlib: It's a comprehensive library for creating static, animated and interactive visualizations in Python.

installing matplotlib

!pip install matplotlib

Official documentation matplotlib

EDA Example

We are going to explore EDA using a housing dataset.
Here's a link

We want to predict the prices of houses based on certain factors like:

area - the area of the house in square feet
bedrooms - the number of bedrooms in the house
bathrooms - the number of bathrooms
stories - the number of floors (story building)
mainroad - nearness to the main road; yes if near, no if not
guestroom - yes if present, no if absent
basement - yes if present, no if absent
hotwaterheating - yes if present, no if absent
airconditioning - yes if present, no if absent
parking - the number of vehicles the parking can accommodate
prefarea - yes if the house is in a locality preferred by many, no if it isn't
furnishingstatus - furnished, semi-furnished or unfurnished

1.1 Importing packages
This imports the necessary packages and modules.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

1.2 Loading Dataset
Load the dataset to conduct EDA. My dataset is local, but one may pass a URL if the data is hosted online.

housing_data = pd.read_csv("Housing.csv")

2. Exploratory Data Analysis

2.1. Preprocessing

View the first n rows of the dataset to get a general idea of what the data looks like

housing_data.head()

2.2. Checking the shape of the dataset

housing_data.shape

2.3. Checking the statistical metrics of the dataset

housing_data.describe()

2.4. Checking the info about the data

housing_data.info()

The dataset at hand is already clean, so there is no need for a cleaning phase.

Univariate Analysis

As we said earlier, in univariate analysis you analyze the data of just one variable.
A variable refers to a single feature/column.
Some visual methods include :

Histograms: Bar plots in which frequency of data is represented with rectangle bars
Box-plots: The variable values are represented in form of boxes

Let's make a histogram of the price column

plt.title("House Prices")
plt.xlabel("Prices")
plt.ylabel("Frequency")
plt.hist(housing_data.price)
plt.show()

From the image we see that the house price is positively skewed: most values sit on the left side of the distribution, with a long tail to the right.
Most houses are priced between 3 million and 4 million.

Area

plt.title("House Area")
plt.xlabel("area")
plt.ylabel("Frequency")
plt.hist(housing_data.area)
plt.show()

The house areas are positively skewed, with the majority of houses having an area of 4,000 square feet or below.
The modal area is about 3,000 square feet, with 200 houses in total having that area.

bedrooms

plt.title("Number of Bedrooms")
plt.xlabel("Bedrooms")
plt.ylabel("Frequency")
plt.hist(housing_data.bedrooms)
plt.show()

From the image we can infer that the number of bedrooms is roughly normally distributed.
The majority of the houses have 3 bedrooms.

bathrooms

plt.title("Number of Bathrooms")
plt.xlabel("Bathrooms")
plt.ylabel("Frequency")
plt.hist(housing_data.bathrooms)
plt.show()

The histogram shows that most of the houses have only one bathroom

stories

plt.title("Number of Stories")
plt.xlabel("Stories")
plt.ylabel("Frequency")
plt.hist(housing_data.stories)
plt.show()

The histogram shows that majority of the houses are one or two stories

mainroad

plt.title("Nearness to Main Road")
plt.xlabel("Near main road")
plt.ylabel("Frequency")
plt.hist(housing_data.mainroad)
plt.show()

The histogram shows that many of the houses are near the main road

guestroom

plt.title("guest room presence")
plt.xlabel("Guest room")
plt.ylabel("Frequency")
plt.hist(housing_data.guestroom)
plt.show()

This histogram reveals that a majority of the houses have no guest room

basement

plt.title("Basement presence")
plt.xlabel("basement")
plt.ylabel("Frequency")
plt.hist(housing_data.basement)
plt.show()

The histogram shows that more houses have no basement as compared to those that have.

hotwaterheating

plt.title("Hotwater heating presence")
plt.xlabel("hotwater")
plt.ylabel("Frequency")
plt.hist(housing_data.hotwaterheating)
plt.show()

This histogram reveals that more houses do not have hot water heating than those that have.

airconditioning

plt.title("airconditioning presence")
plt.xlabel("airconditioning")
plt.ylabel("Frequency")
plt.hist(housing_data.airconditioning)
plt.show()

The histogram shows that more houses do not have air conditioning than those that have

parking

plt.title("parking size (no. cars)")
plt.xlabel("no. of cars")
plt.ylabel("Frequency")
plt.hist(housing_data.parking)
plt.show()

Many houses have no parking

prefarea

plt.title("prefarea")
plt.xlabel("prefarea")
plt.ylabel("Frequency")
plt.hist(housing_data.prefarea)
plt.show()

This histogram reveals that the majority of the houses are not in the area of preference

furnishing status

plt.title("Furnishing Status")
plt.xlabel("furnishing")
plt.ylabel("Frequency")
plt.hist(housing_data.furnishingstatus)
plt.show()

There are three states of furnishing (furnished, semi-furnished, unfurnished)
Majority of the houses are semi-furnished
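For the yes/no and category columns, the numbers behind the bars can also be read off with value_counts. The sketch below uses a tiny stand-in DataFrame with invented values; with the real data you would call the same method on housing_data.

```python
import pandas as pd

# Tiny stand-in for the housing data (values invented for illustration)
sample = pd.DataFrame({
    "mainroad": ["yes", "yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "semi-furnished", "unfurnished"],
})

# Number of rows per category
mainroad_counts = sample["mainroad"].value_counts()

# Share of rows per category (fractions that sum to 1)
furnishing_share = sample["furnishingstatus"].value_counts(normalize=True)

print(mainroad_counts)
print(furnishing_share)
```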

Bivariate Analysis

As we discussed earlier, Bivariate analysis is a kind of statistical analysis in which two variables are observed against each other. One variable will be dependent and the other is independent.

Using Bivariate analysis we will see how the various features relate to house price:

We'll use various visualizations to uncover the relationships, some are:

scatter plot
bar charts etc

1.1 area vs price
Let's plot a scatter plot of area and price.

plt.title("Area vs price")
plt.scatter(housing_data.price, housing_data.area)  # scatter is a pyplot function, not a DataFrame method
plt.xlabel("Price")
plt.ylabel("Area")
plt.show()

From the resulting image we can infer that most of the houses priced at 6 million and below have an area of 8,000 square feet or less. Thus the cheaper the house, the smaller the area.
Some houses are expensive even though the area is small, and vice versa. This could either be:

an outlier
affected by other features

We will uncover this later as we look for correlation between the variables.

1.2 bedrooms vs price
Let's try to understand how the number of bedrooms affects the price of a house.

plt.title("bedrooms vs price")
plt.xlabel("Bedrooms")
plt.ylabel("Price")
plt.bar(housing_data.bedrooms, housing_data.price)
plt.show()

From the graph we can observe that the fewer the bedrooms, the lower the price. But the relationship is not strictly linear as in regression: the most expensive houses have 4 bedrooms. We would expect houses with 5 and 6 bedrooms to be more expensive. Some other factors could be affecting this.
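One caveat with plotting raw rows as bars: when several houses share the same number of bedrooms, their bars are drawn on top of each other, so the visible height is the highest price in that group. A grouped summary makes the comparison explicit. This is a minimal sketch on invented numbers, not the article's results.

```python
import pandas as pd

# Invented stand-in rows; the real analysis would use housing_data
sample = pd.DataFrame({
    "bedrooms": [2, 2, 3, 3, 4],
    "price": [2_000_000, 3_000_000, 4_000_000, 6_000_000, 9_000_000],
})

# Count, average and highest price for each number of bedrooms
summary = sample.groupby("bedrooms")["price"].agg(["count", "mean", "max"])

print(summary)
```

Plotting the mean column instead of the raw rows gives one bar per bedroom count, and the count column shows how many houses stand behind each bar.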

Let's see the same graph using a scatter plot:

plt.title("bedrooms vs price")
plt.xlabel("Bedrooms")
plt.ylabel("price")
plt.scatter(housing_data.bedrooms, housing_data.price)
plt.show()

We can now see how the prices are distributed for each number of bedrooms.
If we look closely we can see that there are more 3-bedroom houses than 4-bedroom houses.

1.3 bathrooms vs price
Let's see how the price of the houses compare to the number of bathrooms

plt.title("bathrooms vs price")
plt.xlabel("Bathrooms")
plt.ylabel("price")
plt.scatter(housing_data.bathrooms, housing_data.price)
plt.show()

From the graph above we can infer the following observations:

Most of the houses have 1 or 2 bathrooms (534 houses in total), with the majority having 1 bathroom (401).

1.4 stories
Let's explore the relationship between the number of stories and the price of the house

plt.title("Stories vs Price")
plt.xlabel("Stories")
plt.ylabel("Price")
plt.scatter(housing_data.stories, housing_data.price)
plt.show()

We can observe that many houses have 1 or 2 stories.

Some outliers can be seen, such as among houses with 3 stories.
The number of stories can be seen to affect the price of the house.
There must be a reason that makes people opt for 1 or 2 story houses. If we view a bar graph of the same relationship, we see that the fewer the stories, the wider the range of prices. This indicates there are other factors that still affect the price of the house, such as furnishing and maybe nearness to the main road.

1.5. main road
Nearness to the main road definitely affects the price of the house. It also affects the number of houses available.
Let's dive into visualization and see the relationship between the prices of the houses and nearness to the main road.

plt.title("Nearness to mainroad vs price")
plt.xlabel("Near main road")
plt.ylabel("Price")
plt.bar(housing_data.mainroad, housing_data.price)
plt.show()

We can see that houses that are near the main road are more expensive than those that are not.
Such insights could lead to many conclusions, but we won't dive into that since the project scope is still minimal.

1.6. guestroom
Shows if a house has a guest room or not.
We want to view the relationship between houses having a guest room or not and the price in each case.

plt.title("Guest room present vs Price")
plt.xlabel("Guest room")
plt.ylabel("Price")
plt.scatter(housing_data.guestroom, housing_data.price)
plt.show()

From the graph we can infer that most of the houses have no guest room.
This does not mean that they are cheap; other features could be making the houses expensive.
We will see this in the correlation graph.
The houses that have no guest room have a wider range of prices compared to those that have a guest room.

1.7. basement
Having a basement may make the house price go up. We'll explore the effect that having a basement has on the price of the house.
Let's have a visual view of the relationship:

plt.title("Basement present vs Price")
plt.xlabel("Basement Present")
plt.ylabel("Price")
plt.scatter(housing_data.basement, housing_data.price)
plt.show()

The graph shows the distribution of houses in respect to it having a basement or not.
We see also that more houses do not have a basement as compared to those that have.
The price range of houses in respect to having a basement is wider in those that do not have a basement as compared to those that have.

1.8. hot water heating
Hot water heating can influence the price of a house. Let's check how the hotwaterheating feature relates to the price.

plt.title("Hot Waterheating present vs Price")
plt.xlabel("hotWaterheating Present")
plt.ylabel("Price")
plt.scatter(housing_data.hotwaterheating, housing_data.price)
plt.show()

From the graph above we can clearly see that most houses do not have the hot water heating feature.
The range of price for the houses without the hot water heating feature is much greater than that of the houses that do have the feature.
The wide range for the houses without the hot water heating feature are probably affected by other features since the price ranges are not majorly related.

1.9. airconditioning
Airconditioning feature is either present in a house or absent. This presence or absence of airconditioning can affect the price of houses.
Let's explore the relationship between the airconditioning feature and house prices

plt.title("Airconditioning present vs Price")
plt.xlabel("airconditioning Present")
plt.ylabel("Price")
plt.scatter(housing_data.airconditioning, housing_data.price)
plt.show()

By looking at the graph we can infer that the houses are almost evenly distributed.
But those houses with air-conditioning have a wider range of prices.

2.0. Parking
Various houses have a whole number of vehicles that can be accommodated. Houses that have 0 parking value mean that the houses do not have any parking space.
Let's explore the relationship between the parking and the price of the houses.

plt.title("Parking vs Price")
plt.xlabel("parking")
plt.ylabel("Price")
plt.scatter(housing_data.parking, housing_data.price)
plt.show()

We can infer from the graph that there are 4 categories of parking:

0 parking - no parking space
1 vehicle parking space
2 vehicle parking space
3 vehicle parking space

The price ranges are large in all the 4 categories.
There may be a factor affecting this which we will uncover when checking for correlation.
The houses with 1 parking space can be viewed as closely distributed.

2.1. prefarea
Houses can either be in an area of preference or not. We might want to uncover the relationship between preference and the price of the house.
Let's compare the prefarea to the price of the house

plt.title("Prefarea vs Price")
plt.xlabel("prefarea")
plt.ylabel("Price")
plt.scatter(housing_data.prefarea, housing_data.price)
plt.show()

From the graph we can infer that the majority of the houses are not in the preferred area. These houses (not in the preferred area) have a wider price range compared to those that are.
Seemingly, the houses in the preferred area are priced much higher than those that are not.

2.2. furnishing status
A house can either be furnished , semi-furnished or unfurnished.
This may have an effect in the pricing of the house.
We will use a scatter plot to uncover insights about the relationship between furnishing status and the price of the houses.

plt.title("Furnishing Status vs Price")
plt.xlabel("furnishing status")
plt.ylabel("Price")
plt.scatter(housing_data.furnishingstatus, housing_data.price)
plt.show()

From the graph we can see that the houses are almost evenly distributed in terms of numbers per category.
We can also see that furnished houses have a wide range of prices and also record the highest prices.

I will add an update to cater for correlation between every feature.
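Until that update, here is a rough sketch of how such a correlation check could look with pandas. The rows are invented, the column names follow the dataset description above, and encoding yes/no as 1/0 is one assumption among several possible encodings.

```python
import pandas as pd

# Invented stand-in rows; the real analysis would use housing_data
sample = pd.DataFrame({
    "price": [3_000_000, 4_000_000, 5_000_000, 7_000_000, 9_000_000],
    "area": [2500, 3000, 4200, 6000, 7500],
    "bedrooms": [2, 3, 3, 4, 4],
    "mainroad": ["no", "yes", "no", "yes", "yes"],
})

# Encode the yes/no column as 1/0 so it can take part in the correlation
sample["mainroad"] = sample["mainroad"].map({"yes": 1, "no": 0})

# Pearson correlation of every column with price, strongest first
corr_with_price = sample.corr()["price"].sort_values(ascending=False)

print(corr_with_price)
```

Values close to 1 or -1 indicate a strong linear relationship with price, while values near 0 indicate a weak one. Correlation alone does not show which feature causes the price to change.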

Python 101 - Python for Data Science

Elvis Mburu — Sun, 19 Feb 2023 14:42:53 +0000

Python is a high-level, general purpose programming language.
Python is dynamically typed.
The language is object-oriented and supports functional programming too.
Python was developed by Guido van Rossum in the late 1980s and first released in 1991.

Since there are many programming languages, let's look at why Python may be the best fit for you:

Advantages of Python

Simplicity in its use, and hence simple to understand.
It's free and open-source: this is made possible by a diverse and vibrant community determined to develop and improve it.
Interpreted Language: This means that Python executes the code line by line. In case of an error, it stops further execution and reports the error that has occurred.
Extensive library: Python has an extensive library of different packages and methods thus reducing coding many functions from scratch.
Dynamically Typed
Portability: This ensures code developed in one machine runs in another machine including those having different architectures.
Supportive and vibrant large community.

Applications of Python
Python as a language has traversed many use cases and is now being used in many fields and domains.

Here's a few of them:

Web applications
Automation
Artificial Intelligence
Statistics
Data Analysis
Machine Learning
Desktop Applications
Back-end Development

Deep Dive Into Python

We'll put our focus on the Python for Data Science.
But first we'll build our Python muscles by understanding the basics.

Outline

Introduction to variables
Data types in python
Operators in Python
Data Structures
Control Flows
Functions
Packages
Data Science

Setting up Coding Environment
You can use the following tools:

Jupyter notebooks Windows:
install Python link: click here
Download and install Anaconda here : link: click here
Mac OS : click here
Linux OS : click here
Google colab : It's an online environment to run Python code
you can access it : click here

1. Introduction to variables

What are variables? You might ask.
A variable is a name that refers to a value, and the value it refers to can change.

Remember this in your O and A levels:
let x be 12 or even y=mx+c
In this case x and y are variables that refer to/represent something else

Something amazing with them is that they can be used multiple times and refer to different values each time.
Example:

x = 2 
x = 4
x = 8
or in the case of `y=mx+c` where c is a constant
y = 43 + 3
y = 45 + 4

In Python variables are pretty much the same as the concept used in Mathematics.
They are used to refer to various values

Example

x=56
y=45.34
hello= "Hello, world"

There are various rules governing variable naming.
Here's a few:

Variable names cannot be the same as Python keywords
variable names can only contain letters, digits or an underscore
Variable names can only start with a letter or an underscore
Variable names cannot contain spaces
Variable names are case-sensitive, thus myName and MyName are regarded as different variable names
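The naming rules above can be checked in code. The helper below is a hypothetical function written for this illustration; it relies on the built-in str.isidentifier() method and the standard keyword module.

```python
import keyword


def is_valid_variable_name(name):
    """Return True if `name` is a legal Python identifier and not a keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


# A few candidate names: some legal, some breaking one of the rules
candidates = ["my_name", "_total", "2cool", "my name", "class"]
results = {name: is_valid_variable_name(name) for name in candidates}

for name, ok in results.items():
    print(name, ok)
```

Here "2cool" starts with a digit, "my name" contains a space, and "class" is a Python keyword, so all three are rejected.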

Here's a link to the official guide

2. Data Types

A data type is a classification that specifies which type of value a variable holds.
There are various data types used in Python
Here's a few that are supported in Python

strings: They refer to a sequence of characters, digits or symbols and are always treated as text.
Boolean: True or False values
Integer: Numeric data types that do not have fractions/decimals
Float: Numeric data types that have fractions
Example in code

num1 = 1 # Integer
num2 = 2.0 # Float
bool1 = True # Boolean True
bool2 = False # Boolean False
myStr = "Hello, world" # String

In the above code example you have noticed something new that we have not talked about: The # character.
This character is used to denote a comment.
What is a comment?
A comment is an explanation/annotation in the source code of a computer program
They are added to make the code easier to understand and are ignored by the interpreter hence not executed
A comment that starts with # runs to the end of that line

3. Operators in Python

We'll look at two groups of operators in Python

Arithmetic Operators
Conditional Operators

a. Arithmetic Operators
They perform basic Mathematical functions.
Here's a simple list:

+ addition x + y
- Subtraction x-y
* Multiplication x*y
/ Division x/y
% Modulus x%y
** Exponentiation x**y
// Floor Division x//y
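As a quick check of the operators listed above, the sketch below applies each one to x = 7 and y = 2 (values chosen only for illustration).

```python
x, y = 7, 2

results = {
    "x + y": x + y,    # addition
    "x - y": x - y,    # subtraction
    "x * y": x * y,    # multiplication
    "x / y": x / y,    # division always returns a float
    "x % y": x % y,    # modulus: remainder of the division
    "x ** y": x ** y,  # exponentiation: x to the power of y
    "x // y": x // y,  # floor division: quotient rounded down
}

for expression, value in results.items():
    print(expression, "=", value)
```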

b. Conditional Operators
They are used in conditional statements that evaluate to True or False.
Examples:

and Logical AND: True if both operands are true (x and y)
or Logical OR: True if at least one of the operands is true (x or y)
not Logical NOT: True if the operand is false, and vice versa (not x)
> Greater than: True if the left operand is greater than the right (x > y)
< Less than: True if the left operand is less than the right (x < y)
>= Greater than or equal to (x >= y)
<= Less than or equal to (x <= y)

4. Data Structures

Data structures are a way of organizing data so that it can be accessed efficiently, depending on the situation.

Here's a list of some of the main data structures in Python.

Lists
Dictionaries
Sets
Tuples

a. Lists
Lists refer to a data structure that holds multiple items in one variable; they are created using [] brackets
Example
fruits = [] # Here we create an empty list
names = ['John', 'Doe'] # Here we create a list containing two items

Lists are ordered and their items can be accessed by what we call indexing.
In Python the first index is always 0.
So in order to access an item in a list we use:
list_name[index]
for example:

fruits = ['apple', 'mango', 'melon', 'orange'] # a list containing 4 items
fruits[0] # accessing the first item 'apple' from the list
fruits[1] # accessing the second item 'mango' from the list

Some list methods and manipulation

slicing
Refers to retrieving items from a specified portion in a list
Examples:

fruits = ['apple', 'mango', 'melon', 'orange']
fruits[:] # retrieving every item in the list
fruits[0:2] # retrieving items from the first element to the element at index 2 exclusive
fruits[-1] # negative indexing, retrieving the last item

len()
the function returns the length of the list
Example:

fruits = ['apple', 'mango', 'melon', 'orange']
print(len(fruits)) # prints 4 which is the number of elements in the list fruits

type()
Return the data type
Example:
print(type(fruits)) # prints <class 'list'>

Lists are mutable; this means that they can be modified.
Thus:

you can add items to a list
you can remove an item from a list
you can change the list items

Examples:

fruits = ['apple', 'mango', 'melon', 'orange']
fruits.append('guava') # adding 'guava' at the end of the list
fruits.insert(1, 'passion') # inserting 'passion' at index 1 of the list
fruits.pop() # remove the last item in the list
fruits.remove('apple') # removing apple from the list
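To make the effect of each call visible, the sketch below repeats the same operations and records the list after every step.

```python
fruits = ['apple', 'mango', 'melon', 'orange']

fruits.append('guava')            # add at the end
after_append = list(fruits)

fruits.insert(1, 'passion')       # insert at index 1, shifting the rest right
after_insert = list(fruits)

removed = fruits.pop()            # remove and return the last item
after_pop = list(fruits)

fruits.remove('apple')            # remove the first occurrence of 'apple'
fruits[0] = 'pawpaw'              # change an item in place by index

print(after_append)
print(after_insert)
print(removed, after_pop)
print(fruits)
```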

b. tuple
Tuples are used to store multiple items in a single variable.
Tuples are immutable, thus you cannot change their items after they are created.
They store items in ()

Example
thisTuple = ('apple', 'banana', 'berry') # creating a tuple named 'thisTuple' with three items

Tuples are ordered
Tuples are immutable
Tuples allow duplicates
Tuples can contain different data types

type()
Returns the tuple's data type

mytuple = (1, 2, 3, 4)
print(type(mytuple)) # prints <class 'tuple'>
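The properties listed above (ordered, duplicates allowed, immutable) can be seen in a short sketch. The tuple contents are just example values.

```python
this_tuple = ('apple', 'banana', 'berry', 'apple')

first = this_tuple[0]                     # ordered: items are accessed by index
apple_count = this_tuple.count('apple')   # duplicates are allowed

# Immutable: assigning to an index raises a TypeError
try:
    this_tuple[0] = 'mango'
    changed = True
except TypeError:
    changed = False

print(first, apple_count, changed)
```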

c. Set

It is a collection which is unordered, mutable and un-indexed; the items themselves must be immutable (hashable).
No duplicate members are allowed.

names = {'one', 'two', 'three'}
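A short sketch of how a set behaves: duplicates are dropped when the set is created, items can be added and removed, and membership is tested with the in keyword.

```python
names = {'one', 'two', 'three', 'one'}   # the duplicate 'one' is dropped
size_before = len(names)

names.add('four')          # add a new item
names.discard('two')       # remove an item (no error if it is missing)
has_three = 'three' in names

print(size_before, has_three, names)
```

Because sets are unordered, the printed order of the items may differ from the order in which they were written.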

d. Dictionary

It's a data structure that consists of key-value pairs.
It's ordered (it keeps insertion order since Python 3.7), mutable and doesn't allow duplicate keys.

Dictionaries are written with curly brackets and have keys and values.

Example:

myDict = {
    'brand': 'Ford',
    'model': 'Mustang',
    'year': 1964
} # creating a dictionary with 3 key-value pairs
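Reading, updating and adding entries could look like the sketch below. The 'color' and 'price' keys are made up for the example.

```python
my_dict = {'brand': 'Ford', 'model': 'Mustang', 'year': 1964}

brand = my_dict['brand']                  # read a value by its key
my_dict['year'] = 2020                    # update the value of an existing key
my_dict['color'] = 'red'                  # add a new key-value pair
price = my_dict.get('price', 'unknown')   # get() returns a default for a missing key
keys = list(my_dict)                      # keys come back in insertion order

for key, value in my_dict.items():
    print(key, value)
```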

5. Control Flows

a. if Statements
It is a conditional statement that is used to determine whether a block of code will be executed or not.

If the condition defined evaluates to true, it will continue to execute the code block in the if statement

Example of if-statement

age = 20
if (age > 18):
    print("You are an adult")

What if you want to execute another block of code if age is not greater than 18?
We make use of the else statement

age = 20
if (age > 18):
    print("You are an adult")
else:
    print("You are still a minor")

What if you want to test many conditions?
We'll make use of elif statement

age = 20
if (age < 18):
    print("You are a minor")
elif (age <= 35):  # reached only when age >= 18, so 18 counts as an adult
    print("You are an adult")
else:
    print("You are a senior adult")

we can even use if statements inside other if statements.
They are called nested if statements.

Example:

age = 20
if (age > 18):
    if (age < 35):
        print("You are a youth")

b. for Statements

It iterates over the items of any sequence, in the order that they appear in the sequence

words = ['cat', 'window', 'defenestrate']
for word in words:
    print(word, len(word))

c. while Statement

It is used for repeated execution as long as an expression is true.
Example:

number = 5
x = 0
while ( x < number):
    print(x)
    x += 1  # Python has no ++ operator

The range() Function

It generates arithmetic progressions

for i in range(5):
    print(i)

This generates 5 numbers 0 through 4 (remember python starts counting from 0)

The break and continue Statements

The break statement breaks out of the innermost enclosing for or while loop

for n in range(2, 10):
    for x in range(2, n):
        if n % x == 0:
            print(n, 'equals', x, '*', n//x)
            break
    else:
        # the inner loop finished without break, so no factor was found
        print(n, 'is a prime number')

The continue statement continues with the next iteration of the loop

for num in range(2, 10):
    if num % 2 == 0:
        print("Found an even number", num)
        continue
    print("Found an odd number", num)

pass Statements

The pass statement does nothing.
It is often used when a statement is required syntactically but the program requires no action

Example

while True:
    pass

6. Functions

A function is a block of code which only runs when it is called.
A function can return data

There are four types of Python Functions:

Built-in functions - they are functions built into the Python interpreter and are ready for use.
You have certainly come across some by now, for example:
- len() - finding the length of a list, tuple etc
- print() - displaying values on the screen
- type() - returning the data type of an object
Recursive functions - functions that call themselves
Lambda functions - anonymous functions that are defined without a name
User defined functions - functions defined by the user to do a specific task

Example of user defined functions

def greetings(): #defining the function
    print("Hello All")

greetings() #calling the function
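The other kinds of functions listed above can be sketched in a few lines. The names factorial, greet and words are made up for this illustration.

```python
def factorial(n):
    """Recursive function: it calls itself until it reaches the base case."""
    if n <= 1:
        return 1
    return n * factorial(n - 1)


def greet(name="All"):
    """User defined function with a parameter and a return value."""
    return f"Hello {name}"


words = ['window', 'cat', 'door']
# Lambda function: a small unnamed function, here used as the sort key
by_length = sorted(words, key=lambda word: len(word))

print(factorial(5))
print(greet("Ann"))
print(by_length)
```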

7. Packages
Packages are collections of multiple Python files.
Packages are a directory of python scripts, where each script performs a specific function.

For Data Science, the commonly used packages are:

Numpy: Used for working with arrays
Matplotlib: Used for Data Visualization
Scikit-learn: For Machine Learning Algorithms

The Python files are known as modules. This approach helps achieve modularization.

importing packages

Packages can contain sub-packages which also have modules
To load any package or module, we use the keyword import followed by the module name or package name

i. numpy
numpy - Numerical Python
It's a core library for scientific computing.
It provides high performance multi-dimensional array object and tools for working with these objects

numpy vs. python list
Numpy is much faster in performance than purely Python based approach
creating numpy array from a python list

import numpy as np # importing numpy package and giving it a np alias
marks = [78, 47, 98, 43, 58] # creating a list
marks_np = np.array(marks)
print(type(marks_np)) # prints <class 'numpy.ndarray'>

ndarray attributes
- ndim: number of dimensions of the array
- shape: shape of the array (n_rows, n_cols)
- dtype: data type of the elements stored in the array
- size: the total number of elements in the array
- strides: number of bytes to step in each dimension to reach the next element in memory (bytes_per_row, bytes_per_column)

Example:

print('dimension ', marks_np.ndim)
print('shape ', marks_np.shape)
print('size ', marks_np.size)
print('dtype ', marks_np.dtype)
print('strides ', marks_np.strides)
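The attributes are easier to read on a 2-D array. The sketch below builds a small 2 x 3 array with an explicit int64 dtype (chosen so the byte counts are the same on every platform) and inspects it.

```python
import numpy as np

# 2 rows, 3 columns, 8 bytes per element
grid = np.array([[1, 2, 3], [4, 5, 6]], dtype="int64")

print('dimension ', grid.ndim)     # number of axes
print('shape ', grid.shape)        # (rows, columns)
print('size ', grid.size)          # rows * columns
print('dtype ', grid.dtype)
print('strides ', grid.strides)    # bytes to the next row, bytes to the next column
```

Each row holds 3 elements of 8 bytes, so moving to the next row means stepping 24 bytes, while moving to the next column means stepping 8 bytes.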

some key functions defined for numpy arrays

zeros(shape=(n,m)) : creates a zero-array with the shape (n rows, m columns)

x = np.zeros(shape=(3,5), dtype ="int32")
print(x)
arange(start=i, stop=j, step=u) : creates a 1-D array whose first value is i (inclusive) and which stops before j (exclusive); consecutive values are u apart

x = np.arange(start=100, stop=1000, step=100, dtype="int32")
print(x)
linspace(start=i, stop=j, num=n) : creates a 1-D array whose first value is i inclusive, last value is j inclusive and contains n values in total

x_lin = np.linspace(start=10, stop=50, num=30)
print(x_lin)
full(shape=(n,m), fill_value=f) : allows to create an array with the shape (n rows, m columns), where all positions have the value f.

x_ful = np.full(shape=(5,6), fill_value=3)
print(x_ful)

ii. Pandas
Stands for Python Data Analysis Library
It is an open-source Python library
It is used by data scientists/analysts to:

read
write
manipulate
analyze the data

Why Pandas?

It helps you explore and manipulate data in an efficient manner
It helps you analyze large volumes of data with ease

Why is Pandas popular?

Easy to read and learn
Fast and powerful
Integrates well with other visualization libraries

importing pandas

import pandas
import pandas as pd # creating an alias for pandas

Pandas Series
A series is:

a 1-D labelled array
can hold data of any type
similar to a table's column

A Series can have:

Integers
Strings
Both numbers and strings

When a Series mixes numbers and strings, its data type (dtype) is object
By default, a Series is indexed with integers starting from 0

Creating a Series

import pandas as pd
numbers = [1, 3, 5, 5, 7, 9, 13, 56]
pd.Series(numbers) # A series from a list

country = {'Kenya': 'Nairobi', 'Tanzania': 'Dodoma', 'Uganda': 'Kampala'}
pd.Series(country) # Creating a series from a dictionary, the dict keys will be the index for the Series
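The points about indexing and data types can be checked directly. This is a minimal sketch; the variable names `capitals`, `numbers` and `mixed` are illustrative.

```python
import pandas as pd

# Dictionary keys become the index labels
capitals = pd.Series({'Kenya': 'Nairobi', 'Tanzania': 'Dodoma', 'Uganda': 'Kampala'})
print(capitals['Kenya'])      # Nairobi

# A list gets the default integer index starting from 0
numbers = pd.Series([1, 3, 5])
print(list(numbers.index))    # [0, 1, 2]

# Mixing numbers and strings gives the generic object dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)            # object
```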

Pandas DataFrame
A DataFrame is:

a 2-D table
made up of a collection of Series
Structured with labeled axes (rows and columns)

You create a DataFrame with the pd.DataFrame() constructor

import pandas as pd
data = {'item_id': [1, 2, 3, 4, 5], 'item_name': ['chocolate', 'flour', 'sugar', 'ice cream', 'soap'], "item_price": [356.00, 200.00, 150.00, 55.00, 187.00]}
pd.DataFrame(data)

The DataFrame has 3 columns each containing 5 entries.
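To see that a DataFrame is a collection of Series, you can inspect its shape and pull out one column. The sketch below uses a shortened, hypothetical version of the items table with three rows.

```python
import pandas as pd

items = pd.DataFrame({
    'item_id': [1, 2, 3],
    'item_name': ['chocolate', 'sugar', 'soap'],
    'item_price': [356.00, 150.00, 187.00],
})

print(items.shape)           # (3, 3)
print(list(items.columns))   # ['item_id', 'item_name', 'item_price']

# Each column of the DataFrame is itself a Series
prices = items['item_price']
print(isinstance(prices, pd.Series))  # True
print(prices.mean())                  # 231.0
```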

Some pandas functions and methods

.head() shows the first rows of a DataFrame (5 by default). The number of rows to show can be passed as an argument.
.tail() shows the last rows of a DataFrame (5 by default). The number of rows to show can be passed as an argument.
.describe() gives summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column in the DataFrame
.shape is an attribute that gives the number of rows and columns in the DataFrame
.info() gives a summary of the DataFrame, showing the data type and the count of non-null values for each column

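df = pd.DataFrame(data) # data is the dictionary defined above
df.shape # (rows, columns)
df.head(5) # first 5 rows
df.tail(9) # last 9 rows (all 5 here, since the DataFrame only has 5)
df.info()
df.describe()

You can access a column by using its name as the index of the DataFrame

print(df['item_name']) # This outputs the entries in the 'item_name' column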
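Column access returns different types depending on what you pass in the brackets. The sketch below, on a small hypothetical table, shows that a single column name gives a Series while a list of column names gives a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    'item_name': ['chocolate', 'sugar', 'soap'],
    'item_price': [356.00, 150.00, 187.00],
})

names = df['item_name']                    # one column name -> Series
subset = df[['item_name', 'item_price']]   # list of column names -> DataFrame

print(type(names).__name__)    # Series
print(type(subset).__name__)   # DataFrame
print(names.tolist())          # ['chocolate', 'sugar', 'soap']
```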

8. Data Science

Data Science is a field that combines math and statistics, specialized programming, advanced analytics, artificial intelligence and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization's data.

Steps involved in the data science process:

Business understanding/analysis
Data Exploration and Preparation
Data Transformation and Representation
Data visualization
Data Modelling, Training, Validation and Deployment

Some of the Python Libraries used for Data Science:

NumPy
Pandas
SciPy
Matplotlib

Since Data Science is a team sport, it relies on environments that allow collaboration, such as sharing code.
Such environments include:

Jupyter notebooks
GitHub

We'll take a deep dive into Data Science in the next article.