<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Belinda Florence</title>
    <description>The latest articles on DEV Community by Belinda Florence (@belinda).</description>
    <link>https://dev.to/belinda</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1019489%2F7274ee44-3e55-4fc2-a799-113f04f1c367.jpeg</url>
      <title>DEV Community: Belinda Florence</title>
      <link>https://dev.to/belinda</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/belinda"/>
    <language>en</language>
    <item>
      <title>Getting Started with Sentiment Analysis</title>
      <dc:creator>Belinda Florence</dc:creator>
      <pubDate>Mon, 27 Mar 2023 20:49:35 +0000</pubDate>
      <link>https://dev.to/belinda/getting-started-with-sentiment-analysis-3461</link>
      <guid>https://dev.to/belinda/getting-started-with-sentiment-analysis-3461</guid>
      <description>&lt;p&gt;Opinions are very important to a product data mining. They shape the improvement and corrections necessary for a product or service. It is from such opinions that companies are able to build end products that are customer centered and comfortable to use. Sentiment analysis can be expensive to do when it involves an actual counting to determine the quality of product as per the individual customer review. These are just among many other reasons why automated sentiment analysis is key to companies. In this article, we are going to discuss concepts that are crucial when getting started with sentiment analysis. In addition, we shall also see a simple roadmap on how to go about the analysis step by step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Sentiment analysis, also called &lt;strong&gt;&lt;em&gt;opinion mining&lt;/em&gt;&lt;/strong&gt;, is the identification and classification of sentiments expressed in text, a task in the field of natural language processing. Twitter is among the best places to look for a good amount of sentiment data to analyze. The data helps us understand differences in opinion on social media about various topics. To detect customer sentiment automatically, it is therefore paramount to develop a machine learning sentiment analysis model.&lt;/p&gt;

&lt;p&gt;Source data: &lt;a href="https://www.kaggle.com/datasets/kazanova/sentiment140"&gt;Sentiment140 dataset&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Machine learning pipeline contents: logistic regression, SVM, and Bernoulli Naive Bayes as classifiers; Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction; and the NLTK library for natural language processing.&lt;/p&gt;

&lt;p&gt;Now let us dive into the exciting part of data analysis, the pipeline steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A) IMPORT LIBRARIES AND DEPENDENCIES&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd

# for plotting purposes
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# scikit-learn tools
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# for extracting important features from text
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import confusion_matrix, classification_report

# natural language toolkit
import nltk
from nltk.stem import WordNetLemmatizer

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;B)LOADING DATA INTO OUR NOTEBOOK&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# we give the loaded data a variable name, data.
# Due to the volume of the data, we can limit the number of rows we load; here we use 5000 rows

data = pd.read_csv('file_path', nrows=5000)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some troubles can arise while loading the data in a Jupyter notebook. We will handle one here: if, after loading the data and reading the head, we find that the first data row is acting as the header instead of being part of the data, we can correct that by using the following command instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# header=None makes pandas number the columns instead of consuming the first row as a header
data = pd.read_csv('file_path', nrows=5000, header=None)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;C) EDA (Exploratory Data Analysis)&lt;/strong&gt;&lt;br&gt;
At this point, we want to explore and discover different aspects of the data we now have. I have written a separate article about EDA in case you need some clarification on what exactly it is.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;i) Read head&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This gives us access to the first five rows by default. We can use the following code to achieve this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data.head()&lt;/code&gt;&lt;br&gt;
If correctly loaded, our data has 5 columns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Shape&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This shows the dimensions of our data in terms of the number of rows and columns. We use the following code to achieve this:&lt;br&gt;
&lt;code&gt;data.shape&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) Length of data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
To check out the length of the data, we use the following code:&lt;br&gt;
&lt;code&gt;len(data)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;iv) Check for null values&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We want to see whether the loaded data has any null values. To achieve this, we use the following statement:&lt;br&gt;
&lt;code&gt;data.isnull()&lt;/code&gt;&lt;br&gt;
This returns a response for each row, which is a little cumbersome, so a better way is to return the sum of null values, if they exist, using the following command:&lt;br&gt;
&lt;code&gt;np.sum(data.isnull())&lt;/code&gt;&lt;br&gt;
If correctly done, it returns the number of null values in each of the five columns. We discover this particular data does not contain any null values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;v) Unique values&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We now want to check the 'target' column for any unique values that may be of interest to us. We use:&lt;br&gt;
&lt;code&gt;data['target'].unique()&lt;/code&gt;&lt;br&gt;
In our case, 0 is the only unique value, of datatype int; this is likely because the dataset is ordered by label, so loading only the first rows returns a single class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;vi) Data visualization&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We can plot a graph in the Jupyter notebook to understand our data graphically.&lt;/p&gt;
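As a minimal sketch of such a plot (the 'target' column name is taken from the EDA steps above; the small DataFrame below is a stand-in for the loaded data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# stand-in for the loaded dataframe
data = pd.DataFrame({"target": [0, 0, 4, 4, 0, 4, 0]})

# bar chart of how many tweets carry each sentiment label
counts = data["target"].value_counts()
ax = counts.plot(kind="bar")
ax.set_xlabel("Sentiment label")
ax.set_ylabel("Number of tweets")
plt.tight_layout()
plt.savefig("label_distribution.png")
```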

&lt;p&gt;&lt;strong&gt;D) PRE-PROCESSING&lt;/strong&gt;&lt;br&gt;
Removing noise (unnecessary data) from the data will help us build more accurate models. The noise in this kind of data includes stopwords (like 'the', 'is', 'in'), special characters, and hashtags if they exist. It is also advisable to change the data to lowercase for better generalization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;i) Create a new dataset using Target and Content&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We achieve this by selecting a sample for investigation from the given data. In our case we will use the 'Target' column, which is the sentiment classification, and the 'Tweet_Content' column, and work with them:&lt;br&gt;
&lt;code&gt;sample = data[['Tweet_Content','Target']]&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;ii) Check for unique values&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We can now check the unique values in the Target column of our new sample using the following code:&lt;br&gt;
&lt;code&gt;sample['Target'].unique()&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;iii) Divide the Target column&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We achieve this by treating the tweets labeled 0 as negative and those with the other label as positive (in Sentiment140, positive tweets are labeled 4).&lt;br&gt;
Due to the vastness of our data, we can choose a portion from each category to work with; in our case we will work with 0.25 million tweets from each, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# keep the first 250,000 rows from each sentiment subset
positive = positive.iloc[:250000]
negative = negative.iloc[:250000]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
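The positive and negative frames sliced above must first be created by filtering on the label; a minimal sketch, assuming the Sentiment140 convention of 0 for negative and 4 for positive, and a small stand-in sample:

```python
import pandas as pd

# stand-in for the sample built in the previous step
sample = pd.DataFrame({
    "Tweet_Content": ["bad day", "great movie", "awful food", "love this"],
    "Target": [0, 4, 0, 4],
})

# Sentiment140 encodes negative as 0 and positive as 4
negative = sample[sample["Target"] == 0]
positive = sample[sample["Target"] == 4]

# then slice each subset to the desired size, e.g.
positive = positive.iloc[:250000]
negative = negative.iloc[:250000]
```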



&lt;p&gt;&lt;em&gt;&lt;strong&gt;iv) Form a new dataset&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Now we form a new dataset of 500,000 rows with both negative and positive representation using the following code:&lt;br&gt;
&lt;code&gt;new_dataset = pd.concat([positive,negative])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;v) Change tweet content to lowercase&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We want our data to be generally uniform. The good news is that Python's string methods let us turn text into lowercase or uppercase. For our case we will turn it to lowercase using the following code:&lt;br&gt;
&lt;code&gt;new_dataset['Tweet_Content']=new_dataset['Tweet_Content'].str.lower()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;vi) Remove all stopwords&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We want to get rid of all stopwords. We start by creating a list that contains all the stopwords we care about, and we will use it in a function that removes them from our data. We create the list as shown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
                'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'before',
                'being', 'below', 'between', 'both', 'by', 'can', 'd', 'did', 'do',
                'does', 'doing', 'down', 'during', 'each', 'few', 'for', 'from',
                'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
                'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
                'into', 'is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
                'me', 'more', 'most', 'my', 'myself', 'now', 'o', 'of', 'on', 'once',
                'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'own', 're',
                's', 'same', 'she', "shes", 'should', "shouldve", 'so', 'some', 'such',
                't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
                'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
                'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was',
                'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom',
                'why', 'will', 'with', 'won', 'y', 'you', "youd", "youll", "youre",
                "youve", 'your', 'yours', 'yourself', 'yourselves']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let us create a function to remove all these stopwords, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STOPWORDS = set(stopwordlist)  # a set makes membership checks fast
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda text: cleaning_stopwords(text))
new_dataset['Tweet_Content'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code returns the tweet content minus the stopwords we have defined in the stopword list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;vii) Remove punctuation&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Punctuation marks can hinder the accuracy of our model. Luckily, Python's built-in string module provides the full set of English punctuation characters, which is very helpful right now. We remove them as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import string
punctuations_list = string.punctuation
def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda text: cleaning_punctuations(text))
new_dataset['Tweet_Content'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns the first five tweets without punctuation marks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;viii) Remove repeating characters&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
To collapse characters that repeat many times in a row (as in 'soooo'), we use regular expressions. A regular expression specifies a set of strings that match it; the functions in the re module let us check whether a particular string matches a given pattern. We import the library, in case it is not already present:&lt;br&gt;
&lt;code&gt;import re&lt;/code&gt;&lt;br&gt;
Now let us define a function to remove the repeats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def cleaning_repeating_char(text):
    # \1 refers back to the captured character, so long runs collapse to a single occurrence
    return re.sub(r'(.)\1+', r'\1', text)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda x: cleaning_repeating_char(x))
new_dataset['Tweet_Content'].tail()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ix) Remove URLs&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
On Twitter, comments sometimes contain URLs that refer the targeted audience to another location on the internet regarding their interest at the time. We will define a function to remove them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def cleaning_URLs(data):
    # match both www. links and http(s) links; \S is any non-whitespace character
    return re.sub(r'((www\.\S+)|(https?://\S+))', ' ', data)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda x: cleaning_URLs(x))
new_dataset['Tweet_Content'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;x) Remove numeric characters&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
At this point, we remove numbers from the tweets because we want our data to be as clean as possible. We achieve this using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def remove_numbers(data):
    return re.sub('[0-9]+', '', data)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda x: remove_numbers(x))
new_dataset['Tweet_Content'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;xi) Tokenizing tweet content&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Tokenization is splitting paragraphs and sentences into smaller units that can be more easily assigned meaning. The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words). It can be achieved as in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import nltk
nltk.download('punkt')  # tokenizer models, only needed once
def tokenize_tweet(tweet):
    return nltk.word_tokenize(tweet)
new_dataset['Tweet_Content_Token'] = new_dataset['Tweet_Content'].apply(tokenize_tweet)
new_dataset.head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;xii) Lemmatization&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Lemmatization in Natural Language Processing (NLP) helps models reduce a word to its dictionary root form to identify similarities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import nltk
nltk.download('wordnet')  # lemmatizer data, only needed once
lm = nltk.WordNetLemmatizer()
def lemmatizer_on_text(data):
    return [lm.lemmatize(word) for word in data]
new_dataset['Tweet_Content_Token'] = new_dataset['Tweet_Content_Token'].apply(lambda x: lemmatizer_on_text(x))
new_dataset['Tweet_Content_Token'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;xiii) Stemming the words&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Stemming is the process of reducing a word to its stem by stripping suffixes and prefixes. It is useful for indexing words. It is done as shown in the example below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import nltk
st = nltk.PorterStemmer()  # the stemmer was not defined in the original; PorterStemmer is a common choice
def stemming_on_text(data):
    return [st.stem(word) for word in data]
new_dataset['Tweet_Content_Token'] = new_dataset['Tweet_Content_Token'].apply(lambda x: stemming_on_text(x))
new_dataset['Tweet_Content_Token'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;xiv) Separating features and labels&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
At this point, we have to separate the input feature from the label. The tweet content is the input, and what we want to predict is its label, the sentiment of the tweet being either positive or negative. We therefore assign X the tweet content and y the sentiment, as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# use the cleaned dataset rather than the raw data
X = new_dataset.Tweet_Content
y = new_dataset.Target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;xv) Cloud of words&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Sometimes words change meaning with a change in context. A word cloud helps us see the most frequent words within one sentiment class, which is useful when developing the model. A word cloud for one class is achieved through the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# take the tweets of one sentiment class (here, the positive subset)
positive_text = positive['Tweet_Content']
wc = WordCloud(max_words = 1000, width = 1600, height = 800,
               collocations=False).generate(" ".join(positive_text))
plt.figure(figsize = (20,20))
plt.imshow(wc)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;xvi) Split data into training and testing data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We split our data into training data and testing data as shown in the example:&lt;br&gt;
&lt;code&gt;X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05, random_state = 26105111)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E) CREATING MODELS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---YS23tV4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lr0q31uaqcjpmtrq9nf1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---YS23tV4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lr0q31uaqcjpmtrq9nf1.jpg" alt="Image description" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have processed the data to our liking and can now fit it and create a model. We shall use the TF-IDF vectorizer to fit the data as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser.fit(X_train)
# in older scikit-learn versions this method was called get_feature_names()
print('Feature_words count: ', len(vectoriser.get_feature_names_out()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also use the vectorizer to transform our training and testing data as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train = vectoriser.transform(X_train)
X_test  = vectoriser.transform(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;First Model: Bernoulli Naive Bayes Classifier&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We shall first define a function that evaluates the score of our models, as in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def model_Evaluate(model):
    y_predict = model.predict(X_test)
    print(classification_report(y_test, y_predict))
    # compute and plot the confusion matrix
    cf_matrix = confusion_matrix(y_test, y_predict)
    categories = ['Negative', 'Positive']
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    sns.heatmap(cf_matrix, annot = labels, cmap = 'Blues', fmt = '',
                xticklabels = categories, yticklabels = categories)
    plt.xlabel("Predicted values", fontdict = {'size':14}, labelpad = 10)
    plt.ylabel("Actual values", fontdict = {'size':14}, labelpad = 10)
    plt.title("Confusion Matrix", fontdict = {'size':18}, pad = 20)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BNBmodel = BernoulliNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
y_predict1 = BNBmodel.predict(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Second Model: Logistic Regression&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We evaluate the accuracy of the second model the same way, as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)
y_pred3 = LRmodel.predict(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Third Model: SVM&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The last model is SVM; afterwards we compare the three models and choose the preferred one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
y_pred2 = SVCmodel.predict(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CONCLUSION&lt;/strong&gt;&lt;br&gt;
Comparing the scores, logistic regression has the highest among the three models. You can find the Jupyter notebook with the code for further reference on my GitHub account: &lt;a href="https://github.com/Gamalie/Data_Science-BootCamp/blob/main/Twitter%20Sentiment%201.ipynb"&gt;https://github.com/Gamalie/Data_Science-BootCamp/blob/main/Twitter%20Sentiment%201.ipynb&lt;/a&gt;&lt;/p&gt;
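To make the score comparison concrete, here is a self-contained sketch that fits all three classifiers on a tiny synthetic corpus (a stand-in for the real tweets) and compares their accuracies:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# tiny synthetic corpus: 1 = positive, 0 = negative
texts = ["good great love", "bad awful hate", "love this film",
         "hate this film", "great fun", "awful boring"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

# same TF-IDF setup as in the article, without the max_features cap
vec = TfidfVectorizer(ngram_range=(1, 2))
X_train = vec.fit_transform(X_train)
X_test = vec.transform(X_test)

scores = {}
for name, model in [("BernoulliNB", BernoulliNB()),
                    ("LogisticRegression", LogisticRegression(max_iter=1000)),
                    ("LinearSVC", LinearSVC())]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
print(scores)
```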

</description>
    </item>
    <item>
      <title>Essential SQL Commands for Data Science</title>
      <dc:creator>Belinda Florence</dc:creator>
      <pubDate>Tue, 14 Mar 2023 14:06:06 +0000</pubDate>
      <link>https://dev.to/belinda/essential-sql-commands-for-data-science-3hp</link>
      <guid>https://dev.to/belinda/essential-sql-commands-for-data-science-3hp</guid>
      <description>&lt;p&gt;Effective and essential tools are critical when it comes to good performance. This is not an exemption to data scientists, they too seek tools and ways of making their work enjoyable and fast despite the challenges that come with the tasks. SQL is helpful in extracting important information from a database. In this article, we shall look at SQL commands that are essential for data scientists to know and use in their tasks. The commands vary from the most basic one to the most advanced with examples to make it possible to follow when reading the article.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) SELECT...FROM...&lt;/strong&gt;&lt;br&gt;
This command is used to retrieve data from one or more tables in the database. SELECT retrieves data and, with accompanying clauses, sorts and filters it, as illustrated below.&lt;/p&gt;

&lt;p&gt;To get all data in a given table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get the retrieved data grouped, we can use the GROUP BY clause as illustrated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM students
GROUP BY class
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, it is advisable that GROUP BY be accompanied by an aggregate such as COUNT to make the command meaningful. For instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT count(_Column_name_) from _table_name_
group by _column_name2_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also select specific columns from a table and view them using the following SELECT command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select column1, column3
from _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Distinct&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
It is used to select only the unique values of a specified column. In many tables certain values are duplicated, which brings redundancy when viewing. This function helps solve that problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select distinct (_column_name_)
from _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;iii) Where&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
This is an additional clause used to filter data in the database. Here we specify the conditions that records must meet in order to be displayed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_ 
from _table_name_
where _condition_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the data that meets the conditions will be returned; otherwise, no record is displayed. We can also return more than one column. For instance, in a school table where we want to see students older than 10 years, we can use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where Age&amp;gt;10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command returns the age and first name columns for the students in the table who are older than 10 years.&lt;br&gt;
For the conditions, we use '&amp;gt;' for greater than, '&amp;gt;=' for greater than or equal to, '&amp;lt;' for less than, '&amp;lt;=' for less than or equal to, and '=' for equal to. It is awesome to filter data, isn't it?&lt;/p&gt;
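As a quick sketch of these operators against the same hypothetical School table:

```sql
-- students who are at most 10 years old
select Age, First_Name
from School
where Age <= 10;
```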

&lt;p&gt;WHERE can also be combined with logical operators to get specific records. For instance, where the condition should meet more than one criterion, we use the AND operator, as shown in the syntax below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where Age&amp;gt;10 and First_Name != 'James';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OR can also be used to get records that meet either of the conditions, or both, as in the example below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where Age&amp;gt;10 or First_Name = 'James';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition, NOT can be used to exclude the records that meet the given condition, as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where not Age&amp;gt;10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns records of students who are 10 years old or younger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) Order By&lt;/strong&gt;&lt;br&gt;
By default, this command sorts the desired data in ascending order, as in the sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_
from _table_name_
order by _column_name1_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To sort in descending order, we specify DESC within the command as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_
from _table_name_
order by _column_name1_ desc;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Order by more than one column&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Let us say we want to use more than one column to order our data; we may ask ourselves if that is possible. The answer is yes. It is done as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_
from _table_name_
order by _column_name1_ asc, _column_name2_ desc;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ordering by more than one column is most useful when some columns contain duplicated values. For example, in a school table several students may share a first name; to create some order, we might query the table with the age and class columns ordered, giving a neat and easy-to-read view.&lt;/p&gt;
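For instance, against the hypothetical School table used above, such a query might be sketched as:

```sql
select First_Name, Age, Class
from School
order by First_Name asc, Age desc;
```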

&lt;p&gt;&lt;strong&gt;c) Group By&lt;/strong&gt;&lt;br&gt;
This clause is used to summarize rows that have the same values into groups. In most cases, it is used with aggregate functions like SUM, AVG, MAX, MIN, and COUNT. For example, given a school table, we can count the students in each group as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count(_column_name_)
from _table_name_
group by column_name1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is also acceptable to use other aggregate functions, for example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;i) Use count&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count(_column_name_)
from _table_name_
group by column_name1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Use max&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select max(_column_name_)
from _table_name_
group by column_name1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) Use average&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select avg(_column_name_)
from _table_name_
group by column_name1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iv) Use min&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select min (_column_name_)
from _table_name_
group by _column_name1;_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trick with group by is that we categorize data according to a value we choose. The grouping is in ascending order by default, and we can reverse it with the desc keyword. Priority of ordering is given to the column named in the group by clause rather than the one in the select clause.&lt;/p&gt;
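&lt;p&gt;Here is a small runnable sketch of group by with count, again via Python's sqlite3 with an invented School table. An order by is added so the output order is deterministic:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer)")
conn.executemany(
    "insert into School values (?, ?)",
    [("Amina", 10), ("Brian", 10), ("Cate", 12)],
)

# group by collapses rows sharing the same Age; count() summarizes each group.
per_age = conn.execute(
    "select Age, count(First_Name) from School group by Age order by Age"
).fetchall()
print(per_age)  # [(10, 2), (12, 1)]
conn.close()
```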

&lt;p&gt;&lt;strong&gt;d) Join&lt;/strong&gt;&lt;br&gt;
This command combines rows from two or more tables based on related columns between them. There are four types of joins:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;i) Inner Join&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Returns records with matching values in both tables. If we query columns from two different tables that share a common column, only the rows whose values match in that common column are returned, and the remaining columns from both tables come along with them. Example:&lt;/p&gt;

&lt;p&gt;Our database has a table of customers_information and credit&lt;/p&gt;

&lt;p&gt;CUSTOMER_ID DATE    COUNTRY NAME    BUSINESS&lt;br&gt;
4   10/12/22    Kenya   Israel  Hairdressing&lt;br&gt;
7   14/12/22    Tanzania    Tesh    Hotel&lt;br&gt;
1   20/12/22    Uganda  Amor    School&lt;/p&gt;

&lt;p&gt;credit table&lt;br&gt;
CUSTOMER_ID CREDIT  STATUS&lt;br&gt;
7   Poor    Denied&lt;br&gt;
6   Good    In_process&lt;br&gt;
4   Excellent   Complete&lt;/p&gt;

&lt;p&gt;When we use an inner join, we write the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select customers_information.DATE, credit.STATUS
from credit
inner join customers_information on credit.CUSTOMER_ID=customers_information.CUSTOMER_ID;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The syntax is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name,_ _column_name2_
from _table_1
_inner join _table_2_
on _table_1.column_name = table_2.column_name2_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query returns:&lt;/p&gt;

&lt;p&gt;CUSTOMER_ID DATE    STATUS&lt;br&gt;
7   14/12/22    Denied&lt;br&gt;
4   10/12/22    Complete&lt;/p&gt;
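&lt;p&gt;We can reproduce the join above end to end with a small Python sqlite3 sketch, using the article's two tables (the customer ID is included in the select for readability):&lt;/p&gt;

```python
import sqlite3

# Rebuild the article's customers_information and credit tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers_information (CUSTOMER_ID int, DATE text, COUNTRY text, NAME text, BUSINESS text);
insert into customers_information values
    (4, '10/12/22', 'Kenya', 'Israel', 'Hairdressing'),
    (7, '14/12/22', 'Tanzania', 'Tesh', 'Hotel'),
    (1, '20/12/22', 'Uganda', 'Amor', 'School');
create table credit (CUSTOMER_ID int, CREDIT text, STATUS text);
insert into credit values
    (7, 'Poor', 'Denied'),
    (6, 'Good', 'In_process'),
    (4, 'Excellent', 'Complete');
""")

# Only customer IDs present in both tables (4 and 7) survive the inner join.
rows = conn.execute("""
    select credit.CUSTOMER_ID, customers_information.DATE, credit.STATUS
    from credit
    inner join customers_information
      on credit.CUSTOMER_ID = customers_information.CUSTOMER_ID
    order by credit.CUSTOMER_ID
""").fetchall()
print(rows)  # [(4, '10/12/22', 'Complete'), (7, '14/12/22', 'Denied')]
conn.close()
```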

&lt;p&gt;It is also possible to join three tables, all the more reason to like SQL 😂 right?&lt;br&gt;
 We use the same syntax as above, with a slight modification to accommodate the third table, as demonstrated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name,_ _column_name2_,column_name3
from _table_1
_inner join _table_2_
on _table_1.column_name = table_2.column_name2_;
on _table_1.column_name = table_3.column_name3_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Right Join&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This command returns all records of the second (right) table, together with any matching records from the left table; where there is no match, the columns from the left table are filled with null. The syntax is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_, _column_name2_
from _table_1_
right join _table_2_
on _table_1_.column_name = _table_2_.column_name2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) Left Join&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This is the mirror image of the right join: it returns all records of table 1 plus any matching records found in table 2. The syntax is also similar to that of the right join, as shown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name(s)_
from _table_1_
left join _table_2_
on _table_1_.column_name = _table_2_.column_name2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iv) Cross Join&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This returns the Cartesian product of the selected tables: every row of the first table combined with every row of the second, whether the rows match or not. When a conditional where clause is added, the results become similar to those of an inner join query.&lt;br&gt;
Its syntax is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_, _column_name2_
from _table_1_
cross join _table_2_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;e) Aggregate&lt;/strong&gt;&lt;br&gt;
In this part, we shall take a look at the syntax of the aggregate commands: sum, avg, count, min and max. Let us look at each of them briefly.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;i) Sum&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Say we have a numeric column, such as the number of children per family in a given city. We can get the total number of children in that city with the sum command, which uses the following syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select sum( _column_name)
from _table_name_;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to this, we can also give conditions to the command of sum to get sum of specific data as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select sum( _column_name_)
from _table_name_
where condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Count&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
If we are not sure of the number of rows, or just need to confirm it for safety purposes, this is the command to use. count() returns the number of rows in a given column using the following syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count( _column_name_)
from _table_name_;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As also in the sum syntax, conditions can be added to return a specific number of desired rows in a column.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) Min&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
As the name suggests, the min() function returns the least value in a given column. Additional conditions are also accepted using the where clause. It follows this syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select min( _column_name_)
from _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iv) Avg&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Given a numeric column, let's say the age of students in a class, we can get their average age using the following syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select avg( _column_name_)
from _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;v) Max&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
As the name suggests, this function returns the maximum value of a numeric column, as shown in the syntax below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select max( _column_name_)
from _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
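&lt;p&gt;All five aggregates can be exercised at once. Here is a runnable sketch via Python's sqlite3, with an invented table of three ages:&lt;/p&gt;

```python
import sqlite3

# One hypothetical numeric column to exercise sum, count, min, avg and max.
conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer)")
conn.executemany("insert into School values (?, ?)",
                 [("Amina", 10), ("Brian", 12), ("Cate", 14)])

row = conn.execute(
    "select sum(Age), count(Age), min(Age), avg(Age), max(Age) from School"
).fetchone()
print(row)  # (36, 3, 10, 12.0, 14)
conn.close()
```

&lt;p&gt;Note that avg returns a floating-point number even when the inputs are integers.&lt;/p&gt;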



&lt;p&gt;&lt;strong&gt;f) Null Values&lt;/strong&gt;&lt;br&gt;
To check whether a record has null values (no value at all), we use the is null and is not null operators.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;i) Is null&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
It returns only the records where the column is empty, as illustrated in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where Age is null;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If there are empty records in the Age column, the query returns those rows; otherwise, the result set comes back empty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Is not null&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
It is used to show all records that do not have empty values. It is illustrated in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where Age is not null;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
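&lt;p&gt;Both operators can be demonstrated together in a small Python sqlite3 sketch with an invented table in which one student's age is missing:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer)")
conn.executemany("insert into School values (?, ?)",
                 [("Amina", 10), ("Brian", None), ("Cate", 12)])

# is null finds the missing ages; is not null finds the filled ones.
missing = conn.execute(
    "select First_Name from School where Age is null").fetchall()
present = conn.execute(
    "select First_Name from School where Age is not null order by First_Name"
).fetchall()
print(missing)  # [('Brian',)]
print(present)  # [('Amina',), ('Cate',)]
conn.close()
```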



&lt;p&gt;&lt;strong&gt;g) Regulatory clauses&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;i) Limit&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Given that we sometimes work with large data sets where not all rows need to be viewed, we can use the limit clause to cap the number of rows returned. In the example below, only 5 rows are returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select *
from School
limit 5;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Between&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This clause gives a range of records that should be returned. An example is given below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select *
from School
where Age between 10 and 14;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) In&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
It is used to match a column against multiple values inside a where statement, as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select *
from School
where Age in (3,5);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the syntax is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_
from School
where column_name1 in (value1_in_column_name1,value5_in_column_name1);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
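&lt;p&gt;The three clauses above can be tried together in one Python sqlite3 sketch over an invented School table:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer)")
conn.executemany("insert into School values (?, ?)",
                 [("Amina", 3), ("Brian", 5), ("Cate", 11), ("Dan", 14)])

# limit caps the row count; between filters a range; in matches a value list.
first_two = conn.execute(
    "select First_Name from School limit 2").fetchall()
in_range = conn.execute(
    "select First_Name from School where Age between 10 and 14 order by Age"
).fetchall()
picked = conn.execute(
    "select First_Name from School where Age in (3, 5) order by Age"
).fetchall()
print(first_two, in_range, picked)
conn.close()
```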



&lt;p&gt;&lt;strong&gt;h) Manipulation&lt;/strong&gt;&lt;br&gt;
Several commands let us manipulate the tables in a database. For instance, suppose the name of a certain column is too technical for some of the people who need to view the table. In this case, the AS clause is used to rename it in the output. It is also called an alias, as illustrated in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Last_Name as Surname
from School
where Age in (3,5);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query returns the 'Last_Name' column, but displayed under the heading 'Surname'. &lt;br&gt;
It is important to note that the alias lasts only for the duration of the query and does not change the table permanently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Update statement&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
If we want to change some values in the records we have, we use the update statement. The where clause specifies which records are being changed; it is very important to include, because without it all records will be updated. The example below shows how this works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;update _table_name_
set column_name =(val_1,val_2,val_3)
where condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;for instance&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;update School
set First_Name = ('Julius'),Age = ('3')
where Residence = Nairobi;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
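&lt;p&gt;The example above can be run end to end with Python's sqlite3 (table contents invented), which also shows that rows outside the where condition are left untouched:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer, Residence text)")
conn.executemany("insert into School values (?, ?, ?)",
                 [("Amina", 10, "Nairobi"), ("Brian", 12, "Kisumu")])

# The where clause restricts the update to Nairobi rows only;
# without it, every record in the table would be changed.
conn.execute("update School set First_Name = 'Julius', Age = 3 "
             "where Residence = 'Nairobi'")
rows = conn.execute(
    "select First_Name, Age, Residence from School order by Residence"
).fetchall()
print(rows)  # [('Brian', 12, 'Kisumu'), ('Julius', 3, 'Nairobi')]
conn.close()
```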



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) Delete statement&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Suppose we have just realized that some of the values recorded in a table are wrong. Is there a way to delete these records until we find the right values? Of course: the delete statement makes this possible. Using the following example, we can delete the records of every student whose age is 3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;delete from school
where Age = 3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now if we are frustrated and have just learnt that all our records are wrong, we can delete every record and keep the empty table. What a fresh start, right? This is achieved using the following syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;delete from School;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All records in the 'School' table will be deleted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;i) Union&lt;/strong&gt;&lt;br&gt;
Just like in mathematics, a union statement combines the results of two select statements. The selected columns must be of the same data types and appear in the same order. Union returns only distinct values from the two queried columns, as in the syntax below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select column_name from table1
union
select column_name(n) from table2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get duplicates, as in all the values from the two columns from the two tables, use union all as in the syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select column_name from table1
union all
select column_name(n) from table2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
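&lt;p&gt;The difference between union and union all is easy to see in a Python sqlite3 sketch with two invented one-column tables that share one value:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table table1 (city text);
insert into table1 values ('Nairobi'), ('Mombasa');
create table table2 (city text);
insert into table2 values ('Nairobi'), ('Kisumu');
""")

# union keeps only distinct values; union all keeps duplicates.
distinct_cities = conn.execute(
    "select city from table1 union select city from table2").fetchall()
all_cities = conn.execute(
    "select city from table1 union all select city from table2").fetchall()
print(sorted(distinct_cities))  # [('Kisumu',), ('Mombasa',), ('Nairobi',)]
print(len(all_cities))          # 4, 'Nairobi' appears twice
conn.close()
```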



&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;i) Creation of a Database&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
A simple, straightforward command creates a database with the desired name, as in the example below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;create database _database_name_;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Deletion&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We have been discussing commands that affect tables and columns, but in this section the discussion will be on commands that affect the database as a whole.&lt;/p&gt;

&lt;p&gt;If we have been recording very confidential information for a given NGO, say medical reports of individuals in a given area, and the task the NGO desired has been achieved, we can delete the database if the records are no longer needed. For this case, we use the drop database command, as illustrated:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;drop database _database_name_;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tables&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;i) Creation&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We can create a table from scratch or using an existing table. All this is illustrated in the following examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table _table_name_(
      _column_name datatype,
      column_name1 datatype,
      column_name2 datatype_
     );

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command above creates a table from scratch. We can also create a table from an existing one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table _table_name_ as 
   select column_name,column_name1,column_name2
   from _existing_table_name_;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Modification&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
To add, delete or change columns of a given table permanently, we use the alter table command.&lt;br&gt;
To add a column, we use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alter table _table_name_
add _column_name datatype_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also change something about a column, such as its datatype, as illustrated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alter table _table_name_
modify column _column_name datatype_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is also the possibility of dropping a column altogether if it is no longer helpful. For instance, we can drop a year-of-birth column if the record already has an age column. We use the following syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alter table _table_name_
drop _column_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
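&lt;p&gt;Adding a column can be verified with a Python sqlite3 sketch (support for dropping columns varies by database engine and version, so only add column is demonstrated here):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer)")

# Add a new column; the change persists for the table's lifetime.
conn.execute("alter table School add column Class text")

# pragma table_info lists one row per column; index 1 holds the name.
columns = [row[1] for row in conn.execute("pragma table_info(School)")]
print(columns)  # ['First_Name', 'Age', 'Class']
conn.close()
```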



&lt;p&gt;These are the ABCs of the essential SQL commands for a data scientist; when fully practiced, they make data analysis efficient and easy.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploratory Data Analysis Ultimate Guide</title>
      <dc:creator>Belinda Florence</dc:creator>
      <pubDate>Wed, 01 Mar 2023 07:53:47 +0000</pubDate>
      <link>https://dev.to/belinda/exploratory-data-analysis-ultimate-guide-4a3f</link>
      <guid>https://dev.to/belinda/exploratory-data-analysis-ultimate-guide-4a3f</guid>
      <description>&lt;p&gt;Data is becoming more valuable to different institutions with time. Market behavior and potential customers can be predicted from existing data. This has led to the phenomena of "Big Data", large amounts of data that are analyzed computationally. &lt;br&gt;
Equally, EDA is a term that one can frequently come across in data science and it can be a little heavy for newbies especially when they are not familiar with what the abbreviation means. EDA in data science stands for Exploratory Data Analysis. So we shall take a look at this concept in depth in this article and perhaps, guide you into one or more issues in these concepts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;&lt;br&gt;
There are important characteristics of a given data set that one seeks when analyzing data. After the analysis, one can present the discovery (whether new or formerly known) in summary and perhaps in a visual form. Therefore, &lt;strong&gt;&lt;em&gt;exploratory data analysis&lt;/em&gt;&lt;/strong&gt; can be defined as an&lt;br&gt;
 &lt;em&gt;approach to closely analyzing data in order to become more familiar with it&lt;/em&gt;. Simply put, it is a way to get a basic understanding of the data at hand. Regardless of the approach, the end goal is to establish patterns in the data that are relevant to the institution or the analyst. There are several ways to go about this, and different depths one can reach in the analysis. In the following part, we shall take a look at the steps used in EDA and how to accomplish them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps in EDA&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data collection: data is gathered from different sources, such as an Excel file, an API endpoint, or a CSV file from sites like Kaggle and GitHub. &lt;/li&gt;
&lt;li&gt;Load data: data is what is to be explored, and it can be loaded in different ways: &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Upload already available data from local machine: This is done by use of an available button in the notebook environment, (for my case I am using Jupyter notebook). In the Jupyter Notebook's Home page, there is a button on the upper right side of the page tagged "upload". When pressed it takes one to the area in which the file to be uploaded is located. After selecting the file, click the upload blue button to allow the file to be in the Jupyter Hub.&lt;/p&gt;

&lt;p&gt;Data can also be loaded using the command line: next to the "upload" button is the "new" button. When clicked, it gives several options, among them a "Terminal" button. Click "Terminal" to open the command line, then enter the following command to download data into the current directory:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;wget &amp;lt;MY-FILE-URL&amp;gt;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;In the case that the file being downloaded is a zip file, make sure the unzip tool is available; if it is missing, install it with your system's package manager, for example on Debian/Ubuntu:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;apt-get install unzip&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can then now proceed to unzip the file using the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;unzip &amp;lt;"downloaded_file_name"&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the notebook and import a few libraries that will help you explore different aspects of your data; numpy, pandas, matplotlib, and seaborn. Please note, these are not the only libraries that can help in analysis, but they are the most used in the analysis, thus they are preferred.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib as plt
import numpy as np
import seaborn as sns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Afterward, click the "Run" button just above the cell to run it and get feedback. If no error is given, the cell has been successfully executed.&lt;/p&gt;

&lt;p&gt;Next, you read the data from where it is located, using the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name=pd.read_csv("directory of the data location")&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;When you run this command and it does not return an error message, it means the file can now be referred to using the variable name assigned to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Actual EDA&lt;/strong&gt;&lt;br&gt;
Now, this is the beginning of the most interesting part of EDA for me, where we get our hands dirty with the real thing. Analysis can be done on various aspects and aim at different goals. Depending on the need, analysis can be done using different libraries in the Python language. The following are some of the functions and tools that can be used to do data analysis.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;a) Read head&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Using the following command, one is able to get a preview of the nature of the data being used without much sweat.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.head()&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;It returns the value of the first five rows if not specified. To specify the number of rows to preview, just enter the integer inside the brackets. For example&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.head(10) # which returns the first 10 rows of the data&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Likewise, we can read the last 5 rows using the tail function that can be written in the following manner:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.tail()&lt;/code&gt;&lt;/p&gt;
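&lt;p&gt;As a quick sketch, with a small made-up DataFrame standing in for loaded CSV data:&lt;/p&gt;

```python
import pandas as pd

# Invented data: seven rows with one numeric column.
df = pd.DataFrame({"Age": [10, 11, 12, 13, 14, 15, 16]})

print(df.head())        # first five rows by default
print(df.head(2))       # first two rows
print(df.tail().shape)  # the last five rows have shape (5, 1)
```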

&lt;p&gt;&lt;strong&gt;&lt;em&gt;b) Number of rows and columns&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
If you are interested in how many rows and columns (dimensions) you are working with, you can use the following:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.shape&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;c) Check for null values and types of data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
 We use the info() method to achieve this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.info()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It returns the different columns present and their type of object. In addition, it tells you if any of the columns has null values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;d) A summary of statistical status of the data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
 It gives the count, mean, standard deviation, minimum, maximum, and the quartiles. We can get this information using the describe function as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.describe()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;e) Getting the unique values in a given column&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
To analyze the presence and nature of unique elements in the data set, use the unique() function as shown:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.column_name.unique()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;f) To see how many times a value appears in a certain column&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
The frequency of a certain value can give deeper insight during analysis and can be obtained with the value_counts() function, as below. It returns the values alongside their counts.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.column_name.value_counts()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;g) Number of array dimensions or axes&lt;/strong&gt;&lt;/em&gt; &lt;br&gt;
To see the nature of the dataframe in terms of dimensions, use the ndim attribute:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.ndim&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;h) Number of elements in an object&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The size attribute returns an integer giving the number of elements in the object:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.size&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;i) Check if dataframe is empty&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
We can check whether the data we have is populated at all using the following code:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.empty&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;j) To check memory usage of the dataframe&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Use the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.memory_usage()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;k) Access a single value&lt;/strong&gt;&lt;br&gt;
When we want to access a single value at a given row and column, we can use the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.at[row_label, "column_name"]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;l) Get columns in the data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
 To achieve this, we use the columns attribute, which returns all columns in order as an Index array:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.columns&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;m) Correlation&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
To see the negative, moderate, and positive correlation, we can use the corr() function to see this in a table.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.corr()&lt;/code&gt;&lt;/p&gt;
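&lt;p&gt;A tiny made-up DataFrame makes the output of corr() easy to read, with one column that rises with another and one that falls:&lt;/p&gt;

```python
import pandas as pd

# Invented data: b is exactly 2*a; c moves in the opposite direction to a.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
corr = df.corr()

print(corr.loc["a", "b"])  # approximately 1.0: perfect positive correlation
print(corr.loc["a", "c"])  # approximately -1.0: perfect negative correlation
```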

&lt;p&gt;&lt;strong&gt;Graphical Representation in Data Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can also have a visual representation of data using the functions offered by numpy, pandas, seaborn, and matplotlib. In this section, we shall have a deep insight into these libraries and the range of interesting things they are capable of achieving.&lt;/p&gt;

&lt;p&gt;Import pyplot from matplotlib in the case that you have not imported it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;a) Bar Chart&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
When we want a bar chart, we have to pass three parameters to the plot() function in matplotlib. They are the x-axis, the y-axis and the type of plotting we need. In our case, it is a bar chart.&lt;br&gt;
We can use the guideline to help us do this&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.plot(x="column1", y= "column2",kind ="bar",figsize=(20,15)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the case that the figsize is not explicitly given, the plot returns a default size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;b) Line Graph&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
When we want to see our data on a line graph, we use the same method as above, but pass in the kind field, line as our type of graph that we want drawn&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.plot(x="column_name", y= "column_name",kind ="line",figsize=(20,15)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;c) Plot a single column&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
We use the seaborn library to achieve this. As part of the code, we use distplot function to plot the data in the given column as below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sns.distplot(variable_name["column_name"])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;d) General Information&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
We can plot all the columns in a single graph and analyze it visually. However, when dealing with huge data, the analysis may be a little bit difficult cause of the congestion. We can try to reduce the congestion by passing the fig size that is a little bigger than the default one.&lt;br&gt;
We use the plot and the show function to achieve this as shown&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name.plot()

mplt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;d) General Plot of a Column&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Data from a single column can be plotted using this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name["column_name"].plot()

mplt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;e)Comparison of two different Columns&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A comparison of two different columns can be done and a relationship traced if it exists. We should be careful to reasonably choose the columns under analysis to avoid weird graphs that are trying to stretch and accommodate the outrageous data range difference. We can use the following guideline to achieve this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name. plot.scatter(x="column_name",y="column_name",alpha = 0.5)

mplt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this instance, we have chosen a scatter graph to be plotted in order to see the variations and relationships in between the columns.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;f) Box Graph&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A box graph represents a summary of a set of data in five numbers: the minimum, first quartile, median, third quartile and maximum. We can have a graph to represent this information using the plot() function as shown&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name.plot.box(figsize=(n,m))

plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns the data displayed in a graph and each column is represented in the graph and an analysis can be done from that information.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;g) Correlation Objects&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We have seen above that python has a function that enables us to get the correlation of the data we have. Now correlation objects will be very useful as we will see in the next section. In this section, we shall demonstrate how to create a correlation object&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name_of_object = variable_name.corr()

data_set_name.corr()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
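&lt;p&gt;As a tiny concrete sketch (the columns and their values are made up; pandas is assumed installed), two perfectly proportional columns should show a correlation of 1.0:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data: "y" is exactly 2 * "x", so they are perfectly correlated
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})

corr_matrix = df.corr()  # pairwise correlation of the numeric columns
print(corr_matrix)
```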



&lt;p&gt;&lt;em&gt;&lt;strong&gt;h) Heatmaps&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
We can view correlations graphically using a heatmap, whose colors change with the strength of the correlation between elements. In the following code, we demonstrate how to achieve this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sns.heatmap(variable_name_of_object, cmap='Reds', annot=True)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We can also manipulate the data: adding columns, and dropping others, until we get what we desire. Python offers these functionalities to let us explore the different possibilities that come with data analysis.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;a) Create a column from derived information&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now let us consider a circumstance in which we want a new column that is the result of a mathematical operation on an already existing column. Performing the operation on the elements one at a time can be tedious, especially when working with a very large data set. It is therefore important to seek a solution that is fast and efficient. Fortunately, a new column can be created from existing columns: a derived-information column.&lt;br&gt;
The code below illustrates how this can be achieved&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name_of_dataset ["new_column_name"] = variable_name_of_dataset ["existing_column_name"] *2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the mathematical operation multiplies the elements of the existing column by 2; hence the "*2" at the end of the operation.&lt;/p&gt;
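&lt;p&gt;A minimal runnable version of the derived-column step, assuming pandas is installed; the "price" column is hypothetical:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data set with a single numeric column
df = pd.DataFrame({"price": [10, 20, 30]})

# Derive a new column by multiplying every element of "price" by 2;
# the operation is applied to the whole column at once, not element by element
df["double_price"] = df["price"] * 2

print(df)
```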

&lt;p&gt;&lt;em&gt;&lt;strong&gt;b) Renaming Columns&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Downloaded data often has its columns named according to the needs and knowledge of the person who collected it. However, the naming may not suit the needs of the person doing the analysis. To make the data more familiar and usable, the analyst can rename the columns by giving each one the alternate name they desire. The guideline below shows how this is done with the rename() function, which takes a dictionary mapping the original column name to the new column name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Renaming Columns

new_variable_name = variable_name.rename(columns=
{"old_column_name": "new_column_name"})

new_variable_name.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above guideline renames just a single column, but more than one column can be renamed: we simply pass the old and new column names as key-value pairs of a Python dictionary.&lt;/p&gt;
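&lt;p&gt;A sketch of renaming several columns at once via a dictionary; the column names here are invented, and pandas is assumed installed:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data with terse collector-chosen column names
df = pd.DataFrame({"Ht(cm)": [150, 160], "Wt(kg)": [50, 58]})

# Rename both columns in one call; rename() returns a new DataFrame
# and leaves the original untouched
renamed = df.rename(columns={"Ht(cm)": "height_cm", "Wt(kg)": "weight_kg"})

print(renamed.head())
```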

&lt;p&gt;The rename can also change letter case, that is, to lower or upper case. We take the same steps, but instead of passing the old and new column names, we pass a case-conversion function to be applied to every column label&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Renaming Columns

new_variable_name = variable_name.rename(columns=str.lower)

new_variable_name.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data analysis is not limited to the above functions, but these are the key and most used ones for seeing different aspects of data. I have made a notebook covering most, if not all, of the things mentioned above for reference in case you are stuck. It is in my GitHub account and can be accessed via the link below; if it is helpful, please give it an upvote: &lt;a href="https://github.com/Gamalie/Data-Science" rel="noopener noreferrer"&gt;https://github.com/Gamalie/Data-Science&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It illustrates the things discussed above and should assist those stuck on how to analyze their data in preparation for building a machine learning or deep learning model.&lt;/p&gt;

&lt;p&gt;There are also several sites that can help with further data exploration. Kaggle has guidelines on how to analyze different data sets. The official pandas, Seaborn, NumPy, and Matplotlib documentation will help you understand these libraries and the analysis tools they offer.&lt;/p&gt;

</description>
      <category>portfolio</category>
      <category>career</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Introduction to Python</title>
      <dc:creator>Belinda Florence</dc:creator>
      <pubDate>Sun, 19 Feb 2023 20:53:16 +0000</pubDate>
      <link>https://dev.to/belinda/introduction-to-python-nn7</link>
      <guid>https://dev.to/belinda/introduction-to-python-nn7</guid>
<description>&lt;p&gt;Data science, big data, IoT, AI, you name it: the world is moving to the intelligent side, and we have no choice but to move with it. As tech enthusiasts, we have to be on our toes to keep up with the world's changing technology. Many of these technologies are built on the foundation of Python. By this point, we can agree that Python is an important programming language. So let us delve into this interesting language and get to know more about it.&lt;br&gt;
Python is an object-oriented, dynamically typed, interpreted programming language; it uses objects to organize software design rather than just functions. One interesting thing about this language is that there are no type declarations for variables or methods in the source code. You may ask, what is the advantage of not declaring types beforehand? Well, types are checked at run-time rather than compile-time, which makes the code brief and flexible. The shorter the code, the happier the developer. Python is open source and therefore allows developers to share ideas and learn from one another. The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python website, &lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;https://www.python.org/&lt;/a&gt;, and may be freely distributed.&lt;br&gt;
Now that you have an idea about Python, let us look at the installation process. Installation is easiest for Linux users, as the OS usually ships with a recent version of Python. The preferred installer for Python is pip; starting with Python 3.4, it is included by default with the Python binary installers. The standard packaging tools are all designed to be used from the command line. Using the following command, you can install the latest version of a module and its dependencies from the Python Package Index.&lt;br&gt;
&lt;code&gt;python -m pip install MyPackage&lt;/code&gt;&lt;br&gt;
One can also install a specific version in the command line using the following command:&lt;br&gt;
&lt;code&gt;python -m pip install SomePackage==1.0.4    # specific version&lt;/code&gt;&lt;br&gt;
Existing modules can also be upgraded using the following command:&lt;br&gt;
&lt;code&gt;python -m pip install --upgrade SomePackage&lt;/code&gt;&lt;br&gt;
The creation of virtual environments is done through the venv module. &lt;br&gt;
The Python interpreter is usually installed as /usr/local/bin/python3.11 on those machines where it is available; putting /usr/local/bin in your Unix shell’s search path makes it possible to start it by typing the command:&lt;br&gt;
&lt;code&gt;python3.11&lt;/code&gt;&lt;br&gt;
On Windows machines where you have installed Python from the Microsoft Store, the python3.11 command will be available. If you have the py.exe launcher installed, you can use the py command.&lt;br&gt;
To install the py.exe launcher, go to &lt;a href="https://docs.python.org/3/download.html" rel="noopener noreferrer"&gt;https://docs.python.org/3/download.html&lt;/a&gt; and download the appropriate exe file for your computer's specification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open the installed interpreter using the following command:
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;python3.11&lt;/code&gt;&lt;br&gt;
This runs the interpreter directly&lt;/p&gt;

&lt;h2&gt;
  
  
  Run the following commands to see the version
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys

print("User Current Version:-", sys.version)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Now you can run a simple code on the interpreter. Example:
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;a = 4&lt;/code&gt;&lt;br&gt;
&lt;code&gt;b = 2&lt;/code&gt;&lt;br&gt;
&lt;code&gt;c = a + b&lt;/code&gt;&lt;br&gt;
&lt;code&gt;c&lt;/code&gt;&lt;br&gt;
The answer will be&lt;br&gt;
&lt;code&gt;6&lt;/code&gt;&lt;br&gt;
From here you can run more mathematical simple operations to get more familiar with the language.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Syntax
&lt;/h2&gt;

&lt;p&gt;Now we do not want to be frustrated the first time we use this programming language, at least not by a syntax error. This calls for a basic knowledge of the dos and don'ts of this amazing language.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Python uses new lines to complete a command, as opposed to other programming languages which often use semicolons or parentheses.&lt;/li&gt;
&lt;li&gt;Python is indentation sensitive, which means the right indentation must be used for the program to run smoothly. Lines of code in the same block should be aligned with each other. For example:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def degree_year(cy,ey) #method to check number of years in school
    return(cy-ey)
degree_year(2023,2017)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, the call to the method is aligned with its definition, while the return statement is indented inside it. If you put the second line of code at the same level as the other two, an error will occur.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Commenting is done using the hashtag character "#" to indicate the beginning of the comment. Comments are writings within the code that are not executed but give further information about the particular line of code.&lt;/li&gt;
&lt;li&gt;Python is case-sensitive. Class names should begin with an uppercase letter, and method names should start in lowercase, with an uppercase letter at the beginning of each subsequent word when the name is made up of two or more words. For instance:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;def className():  # method name is camel case&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Nairobi:  # class name begins with an upper case&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For more identifiers and the rules that they follow, you can visit &lt;a href="https://www.scaler.com/topics/is-python-case-sensitive-when-dealing-with-identifiers/" rel="noopener noreferrer"&gt;https://www.scaler.com/topics/is-python-case-sensitive-when-dealing-with-identifiers/&lt;/a&gt;&lt;/p&gt;
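&lt;p&gt;A quick runnable illustration of case sensitivity: two names that differ only in letter case are entirely separate variables in Python (the names here are arbitrary):&lt;/p&gt;

```python
# Python treats identifiers case-sensitively, so these are two
# distinct variables, not one
value = 10
Value = 20

print(value, Value)
```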

&lt;h2&gt;
  
  
  What is Python used for
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Creation of web applications.&lt;/li&gt;
&lt;li&gt;Python can be used alongside software to create workflows.&lt;/li&gt;
&lt;li&gt;Read and modify files when used alongside a database&lt;/li&gt;
&lt;li&gt;Python can be used to handle big data and perform complex mathematics.&lt;/li&gt;
&lt;li&gt;Python can be used for rapid prototyping, or for production-ready software development.&lt;/li&gt;
&lt;li&gt;Python is used in data science and deep learning through the Jupyter and Colab notebook environments.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For more information, you can visit the following websites for resources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;[(&lt;a href="https://www.w3schools.com/python/python_intro.asp)" rel="noopener noreferrer"&gt;https://www.w3schools.com/python/python_intro.asp)&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;[(&lt;a href="https://www.geeksforgeeks.org/check-the-version-of-the-python-interpreter/)" rel="noopener noreferrer"&gt;https://www.geeksforgeeks.org/check-the-version-of-the-python-interpreter/)&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;[(&lt;a href="https://docs.python.org/3/tutorial/interpreter.html)" rel="noopener noreferrer"&gt;https://docs.python.org/3/tutorial/interpreter.html)&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Happy Coding!!!&lt;/p&gt;

</description>
      <category>pcgaming</category>
      <category>discuss</category>
      <category>gamedev</category>
    </item>
  </channel>
</rss>
