<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Belinda Florence</title>
    <description>The latest articles on DEV Community by Belinda Florence (@belinda).</description>
    <link>https://dev.to/belinda</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1019489%2F7274ee44-3e55-4fc2-a799-113f04f1c367.jpeg</url>
      <title>DEV Community: Belinda Florence</title>
      <link>https://dev.to/belinda</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/belinda"/>
    <language>en</language>
    <item>
      <title>Getting Started with Sentiment Analysis</title>
      <dc:creator>Belinda Florence</dc:creator>
      <pubDate>Mon, 27 Mar 2023 20:49:35 +0000</pubDate>
      <link>https://dev.to/belinda/getting-started-with-sentiment-analysis-3461</link>
      <guid>https://dev.to/belinda/getting-started-with-sentiment-analysis-3461</guid>
      <description>&lt;p&gt;Opinions are very important to a product data mining. They shape the improvement and corrections necessary for a product or service. It is from such opinions that companies are able to build end products that are customer centered and comfortable to use. Sentiment analysis can be expensive to do when it involves an actual counting to determine the quality of product as per the individual customer review. These are just among many other reasons why automated sentiment analysis is key to companies. In this article, we are going to discuss concepts that are crucial when getting started with sentiment analysis. In addition, we shall also see a simple roadmap on how to go about the analysis step by step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Sentiment analysis, also called &lt;strong&gt;&lt;em&gt;opinion mining&lt;/em&gt;&lt;/strong&gt;, is the identification and classification of sentiments expressed in text, a task in the field of natural language processing. Twitter is among the best places to look for a good amount of sentiment data to analyze. The data helps us understand differences in opinion on social media about various topics. To detect customer sentiment automatically, it is therefore paramount to develop a machine learning sentiment analysis model.&lt;/p&gt;

&lt;p&gt;Source data: &lt;a href="https://www.kaggle.com/datasets/kazanova/sentiment140"&gt;Sentiment140 dataset&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Machine learning pipeline contents: logistic regression, SVM, and Bernoulli Naive Bayes as classifiers; Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction; and the NLTK library for natural language processing.&lt;/p&gt;

&lt;p&gt;Now let us dive into the exciting part of data analysis, the pipeline steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A) IMPORT LIBRARIES AND DEPENDENCIES&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd

# for plotting purposes
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# scikit-learn tools
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# for extracting important features from text
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import confusion_matrix, classification_report

# natural language toolkit
import nltk
from nltk.stem import WordNetLemmatizer

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;B)LOADING DATA INTO OUR NOTEBOOK&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# we give the loaded data a variable name, data.
# Due to the volume of the data, we can limit the number of rows we load; here we use 5000 rows

data = pd.read_csv('file_path', nrows=5000)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some troubles can arise while loading the data in a Jupyter notebook. We will handle one here: if, after loading the data and reading the head, we find that the first data row is acting as the header instead of being part of the data, we can correct that by using the following command instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# header=None makes pandas number the columns instead of consuming the first row as a header
data = pd.read_csv('file_path', nrows=5000, header=None)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;C) EDA (Exploratory Data Analysis)&lt;/strong&gt;&lt;br&gt;
At this point, we want to explore and discover different aspects of the data we now have. I have written a separate article about EDA in case you need some clarification on what exactly it is.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;i) Read head&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This gives us access to the first five rows by default. We can use the following code to achieve this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data.head()&lt;/code&gt;&lt;br&gt;
If correctly loaded, our data has 5 columns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Shape&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This shows the dimensions of our data in terms of the number of rows and columns. We use the following code to achieve this:&lt;br&gt;
&lt;code&gt;data.shape&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) Length of data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
To check out the length of the data, we use the following code:&lt;br&gt;
&lt;code&gt;len(data)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;iv) Check for null values&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We want to see whether the loaded data has any null values. To achieve this, we use the following statement:&lt;br&gt;
&lt;code&gt;data.isnull()&lt;/code&gt;&lt;br&gt;
This returns a response for each row, which is a little cumbersome, so a better way is to return the sum of null values, if they exist, using the following command:&lt;br&gt;
&lt;code&gt;np.sum(data.isnull())&lt;/code&gt;&lt;br&gt;
If correctly done, it returns the number of null values in each of the five columns. We discover this particular data does not contain any null values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;v) Unique values&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We now want to check the 'target' column for any unique values that may be of interest to us. We use:&lt;br&gt;
&lt;code&gt;data['target'].unique()&lt;/code&gt;&lt;br&gt;
In our case, 0 is the only unique value, of datatype int; this is likely because the dataset is ordered by label, so loading only the first rows returns a single class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;vi) Data visualization&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We can plot a graph in the Jupyter notebook to understand our data graphically.&lt;/p&gt;
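As a minimal sketch of such a plot (the 'target' column name is taken from the EDA steps above; the small DataFrame below is a stand-in for the loaded data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# stand-in for the loaded dataframe
data = pd.DataFrame({"target": [0, 0, 4, 4, 0, 4, 0]})

# bar chart of how many tweets carry each sentiment label
counts = data["target"].value_counts()
ax = counts.plot(kind="bar")
ax.set_xlabel("Sentiment label")
ax.set_ylabel("Number of tweets")
plt.tight_layout()
plt.savefig("label_distribution.png")
```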

&lt;p&gt;&lt;strong&gt;D) PRE-PROCESSING&lt;/strong&gt;&lt;br&gt;
Removing noise (unnecessary data) from the data will help us build more accurate models. The noise in this kind of data includes stopwords (like 'the', 'is', 'in'), special characters, and hashtags if they exist. It is also advisable to change the data to lowercase for better generalization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;i) Create a new dataset using Target and Content&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We achieve this by selecting a sample for investigation from the given data. In our case we will use the 'Target' column, which is the sentiment classification, and the 'Tweet_Content' column, and work with them:&lt;br&gt;
&lt;code&gt;sample = data[['Tweet_Content','Target']]&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;ii) Check for unique values&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We can now check the unique values in the Target column of our new sample using the following code:&lt;br&gt;
&lt;code&gt;sample['Target'].unique()&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;iii) Divide the Target column&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We achieve this by treating the tweets labeled 0 as negative and those with the other label as positive (in Sentiment140, positive tweets are labeled 4).&lt;br&gt;
Due to the vastness of our data, we can choose a portion from each category to work with; in our case we will work with 0.25 million tweets from each, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# keep the first 250,000 rows from each sentiment subset
positive = positive.iloc[:250000]
negative = negative.iloc[:250000]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
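The positive and negative frames sliced above must first be created by filtering on the label; a minimal sketch, assuming the Sentiment140 convention of 0 for negative and 4 for positive, and a small stand-in sample:

```python
import pandas as pd

# stand-in for the sample built in the previous step
sample = pd.DataFrame({
    "Tweet_Content": ["bad day", "great movie", "awful food", "love this"],
    "Target": [0, 4, 0, 4],
})

# Sentiment140 encodes negative as 0 and positive as 4
negative = sample[sample["Target"] == 0]
positive = sample[sample["Target"] == 4]

# then slice each subset to the desired size, e.g.
positive = positive.iloc[:250000]
negative = negative.iloc[:250000]
```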



&lt;p&gt;&lt;em&gt;&lt;strong&gt;iv) Form a new dataset&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Now we form a new dataset of 500,000 rows with both negative and positive representation using the following code:&lt;br&gt;
&lt;code&gt;new_dataset = pd.concat([positive,negative])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;v) Change tweet content to lowercase&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We want our data to be generally uniform. The good news is that Python's string methods let us turn text into lowercase or uppercase. For our case we will turn it to lowercase using the following code:&lt;br&gt;
&lt;code&gt;new_dataset['Tweet_Content']=new_dataset['Tweet_Content'].str.lower()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;vi) Remove all stopwords&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We want to get rid of all stopwords. We start by creating a list that contains all the stopwords we care about, and we will use it in a function that removes them from our data. We create the list as shown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
                'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'before',
                'being', 'below', 'between', 'both', 'by', 'can', 'd', 'did', 'do',
                'does', 'doing', 'down', 'during', 'each', 'few', 'for', 'from',
                'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
                'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
                'into', 'is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
                'me', 'more', 'most', 'my', 'myself', 'now', 'o', 'of', 'on', 'once',
                'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'own', 're',
                's', 'same', 'she', "shes", 'should', "shouldve", 'so', 'some', 'such',
                't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
                'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
                'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was',
                'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom',
                'why', 'will', 'with', 'won', 'y', 'you', "youd", "youll", "youre",
                "youve", 'your', 'yours', 'yourself', 'yourselves']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let us create a function to remove all these stopwords, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STOPWORDS = set(stopwordlist)  # a set makes membership checks fast
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda text: cleaning_stopwords(text))
new_dataset['Tweet_Content'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code returns the tweet content minus the stopwords we have defined in the stopword list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;vii) Remove punctuation&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Punctuation marks can hinder the accuracy of our model. Luckily, Python's built-in string module provides the full set of English punctuation characters, which is very helpful right now. We remove them as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import string
punctuations_list = string.punctuation
def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda text: cleaning_punctuations(text))
new_dataset['Tweet_Content'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns the first five tweets without punctuation marks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;viii) Remove repeating characters&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
To collapse characters that repeat many times in a row (as in 'soooo'), we use regular expressions. A regular expression specifies a set of strings that match it; the functions in the re module let us check whether a particular string matches a given pattern. We import the library, in case it is not already present:&lt;br&gt;
&lt;code&gt;import re&lt;/code&gt;&lt;br&gt;
Now let us define a function to remove the repeats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def cleaning_repeating_char(text):
    # \1 refers back to the captured character, so long runs collapse to a single occurrence
    return re.sub(r'(.)\1+', r'\1', text)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda x: cleaning_repeating_char(x))
new_dataset['Tweet_Content'].tail()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ix) Remove URLs&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
On Twitter, comments sometimes contain URLs that refer the targeted audience to another location on the internet regarding their interest at the time. We will define a function to remove them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def cleaning_URLs(data):
    # match both www. links and http(s) links; \S is any non-whitespace character
    return re.sub(r'((www\.\S+)|(https?://\S+))', ' ', data)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda x: cleaning_URLs(x))
new_dataset['Tweet_Content'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;x) Remove numeric characters&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
At this point, we remove numbers from the tweets because we want our data to be as clean as possible. We achieve this using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def remove_numbers(data):
    return re.sub('[0-9]+', '', data)
new_dataset['Tweet_Content'] = new_dataset['Tweet_Content'].apply(lambda x: remove_numbers(x))
new_dataset['Tweet_Content'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;xi) Tokenizing tweet content&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Tokenization is splitting paragraphs and sentences into smaller units that can be more easily assigned meaning. The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words). It can be achieved as in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import nltk
nltk.download('punkt')  # tokenizer models, only needed once
def tokenize_tweet(tweet):
    return nltk.word_tokenize(tweet)
new_dataset['Tweet_Content_Token'] = new_dataset['Tweet_Content'].apply(tokenize_tweet)
new_dataset.head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;xii) Lemmatization&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Lemmatization in Natural Language Processing (NLP) helps models reduce a word to its dictionary root form to identify similarities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import nltk
nltk.download('wordnet')  # lemmatizer data, only needed once
lm = nltk.WordNetLemmatizer()
def lemmatizer_on_text(data):
    return [lm.lemmatize(word) for word in data]
new_dataset['Tweet_Content_Token'] = new_dataset['Tweet_Content_Token'].apply(lambda x: lemmatizer_on_text(x))
new_dataset['Tweet_Content_Token'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;xiii) Stemming the words&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Stemming is the process of reducing a word to its stem by stripping suffixes and prefixes. It is useful for indexing words. It is done as shown in the example below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import nltk
st = nltk.PorterStemmer()  # the stemmer was not defined in the original; PorterStemmer is a common choice
def stemming_on_text(data):
    return [st.stem(word) for word in data]
new_dataset['Tweet_Content_Token'] = new_dataset['Tweet_Content_Token'].apply(lambda x: stemming_on_text(x))
new_dataset['Tweet_Content_Token'].head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;xiv) Separating features and labels&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
At this point, we have to separate the input feature from the label. The tweet content is the input, and what we want to predict is its label, the sentiment of the tweet being either positive or negative. We therefore assign X the tweet content and y the sentiment, as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# use the cleaned dataset rather than the raw data
X = new_dataset.Tweet_Content
y = new_dataset.Target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;xv) Cloud of words&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Sometimes words change meaning with a change in context. A word cloud helps us see the most frequent words within one sentiment class, which is useful when developing the model. A word cloud for one class is achieved through the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# take the tweets of one sentiment class (here, the positive subset)
positive_text = positive['Tweet_Content']
wc = WordCloud(max_words = 1000, width = 1600, height = 800,
               collocations=False).generate(" ".join(positive_text))
plt.figure(figsize = (20,20))
plt.imshow(wc)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;xvi) Split data into training and testing data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We split our data into training data and testing data as shown in the example:&lt;br&gt;
&lt;code&gt;X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05, random_state = 26105111)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E) CREATING MODELS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---YS23tV4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lr0q31uaqcjpmtrq9nf1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---YS23tV4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lr0q31uaqcjpmtrq9nf1.jpg" alt="Image description" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have processed the data to our liking and can now fit it and create a model. We shall use the TF-IDF vectorizer to fit the data as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser.fit(X_train)
# in older scikit-learn versions this method was called get_feature_names()
print('Feature_words count: ', len(vectoriser.get_feature_names_out()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also use the vectorizer to transform our training and testing data as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train = vectoriser.transform(X_train)
X_test  = vectoriser.transform(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;First Model: Bernoulli Naive Bayes Classifier&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We shall first define a function that evaluates the score of our models, as in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def model_Evaluate(model):
    y_predict = model.predict(X_test)
    print(classification_report(y_test, y_predict))
    # compute and plot the confusion matrix
    cf_matrix = confusion_matrix(y_test, y_predict)
    categories = ['Negative', 'Positive']
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    sns.heatmap(cf_matrix, annot = labels, cmap = 'Blues', fmt = '',
                xticklabels = categories, yticklabels = categories)
    plt.xlabel("Predicted values", fontdict = {'size':14}, labelpad = 10)
    plt.ylabel("Actual values", fontdict = {'size':14}, labelpad = 10)
    plt.title("Confusion Matrix", fontdict = {'size':18}, pad = 20)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BNBmodel = BernoulliNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
y_predict1 = BNBmodel.predict(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Second Model: Logistic Regression&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We evaluate the accuracy of the second model the same way, as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)
y_pred3 = LRmodel.predict(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Third Model: SVM&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The last model is SVM; afterwards we compare the three models and choose the preferred one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
y_pred2 = SVCmodel.predict(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CONCLUSION&lt;/strong&gt;&lt;br&gt;
Comparing the scores, logistic regression has the highest among the three models. You can find the Jupyter notebook with the code for further reference on my GitHub account: &lt;a href="https://github.com/Gamalie/Data_Science-BootCamp/blob/main/Twitter%20Sentiment%201.ipynb"&gt;https://github.com/Gamalie/Data_Science-BootCamp/blob/main/Twitter%20Sentiment%201.ipynb&lt;/a&gt;&lt;/p&gt;
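To make the score comparison concrete, here is a self-contained sketch that fits all three classifiers on a tiny synthetic corpus (a stand-in for the real tweets) and compares their accuracies:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# tiny synthetic corpus: 1 = positive, 0 = negative
texts = ["good great love", "bad awful hate", "love this film",
         "hate this film", "great fun", "awful boring"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

# same TF-IDF setup as in the article, without the max_features cap
vec = TfidfVectorizer(ngram_range=(1, 2))
X_train = vec.fit_transform(X_train)
X_test = vec.transform(X_test)

scores = {}
for name, model in [("BernoulliNB", BernoulliNB()),
                    ("LogisticRegression", LogisticRegression(max_iter=1000)),
                    ("LinearSVC", LinearSVC())]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
print(scores)
```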

</description>
    </item>
    <item>
      <title>Essential SQL Commands for Data Science</title>
      <dc:creator>Belinda Florence</dc:creator>
      <pubDate>Tue, 14 Mar 2023 14:06:06 +0000</pubDate>
      <link>https://dev.to/belinda/essential-sql-commands-for-data-science-3hp</link>
      <guid>https://dev.to/belinda/essential-sql-commands-for-data-science-3hp</guid>
      <description>&lt;p&gt;Effective and essential tools are critical when it comes to good performance. This is not an exemption to data scientists, they too seek tools and ways of making their work enjoyable and fast despite the challenges that come with the tasks. SQL is helpful in extracting important information from a database. In this article, we shall look at SQL commands that are essential for data scientists to know and use in their tasks. The commands vary from the most basic one to the most advanced with examples to make it possible to follow when reading the article.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) SELECT...FROM...&lt;/strong&gt;&lt;br&gt;
This command is used to retrieve data from one or more tables in the database. SELECT retrieves data and, with accompanying clauses, sorts and filters it, as illustrated below.&lt;/p&gt;

&lt;p&gt;To get all data in a given table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get the retrieved data grouped, we can use the GROUP BY clause as illustrated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM students
GROUP BY class
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, it is advisable that GROUP BY be accompanied by an aggregate such as COUNT to make the command meaningful. For instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT count(_Column_name_) from _table_name_
group by _column_name2_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also select specific columns from a table and view them using the following SELECT command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select column1, column3
from _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Distinct&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
It is used to select only the unique values of a specified column. In many tables certain values are duplicated, which brings redundancy when viewing. This function helps solve that problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select distinct (_column_name_)
from _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;iii) Where&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
This is an additional clause used to filter data in the database. Here we specify the conditions that records must meet in order to be displayed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_ 
from _table_name_
where _condition_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the data that meets the conditions will be returned; otherwise, no record is displayed. We can also return more than one column. For instance, in a school table where we want to see students older than 10 years, we can use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where Age&amp;gt;10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command returns the age and first name columns for the students in the table who are older than 10 years.&lt;br&gt;
For the conditions, we use '&amp;gt;' for greater than, '&amp;gt;=' for greater than or equal to, '&amp;lt;' for less than, '&amp;lt;=' for less than or equal to, and '=' for equal to. It is awesome to filter data, isn't it?&lt;/p&gt;
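As a quick sketch of these operators against the same hypothetical School table:

```sql
-- students who are at most 10 years old
select Age, First_Name
from School
where Age <= 10;
```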

&lt;p&gt;WHERE can also be combined with logical operators to get specific records. For instance, where the condition should meet more than one criterion, we use the AND operator, as shown in the syntax below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where Age&amp;gt;10 and First_Name != 'James';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OR can also be used to get records that meet either of the conditions, or both, as in the example below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where Age&amp;gt;10 or First_Name = 'James';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition, NOT can be used to exclude the records that meet the given condition, as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where not Age&amp;gt;10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns records of students who are 10 years old or younger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) Order By&lt;/strong&gt;&lt;br&gt;
By default, this command sorts the desired data in ascending order, as in the sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_
from _table_name_
order by _column_name1_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To sort in descending order, we specify DESC within the command as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_
from _table_name_
order by _column_name1_ desc;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Order by more than one column&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Let us say we want to use more than one column to order our data; we may ask ourselves if that is possible. The answer is yes. It is done as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_
from _table_name_
order by _column_name1_ asc, _column_name2_ desc;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ordering by more than one column is most useful when some columns contain duplicated values. For example, in a school table several students may share a first name; to create some order, we might query the table with the age and class columns ordered, giving a neat and easy-to-read view.&lt;/p&gt;
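For instance, against the hypothetical School table used above, such a query might be sketched as:

```sql
select First_Name, Age, Class
from School
order by First_Name asc, Age desc;
```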

&lt;p&gt;&lt;strong&gt;c) Group By&lt;/strong&gt;&lt;br&gt;
This clause is used to summarize rows that have the same values into groups. In most cases, it is used with aggregate functions like SUM, AVG, MAX, MIN, and COUNT. For example, given a school table, we can count the students in each group as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count(_column_name_)
from _table_name_
group by column_name1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is also acceptable to use other aggregate functions, for example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;i) Use count&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count(_column_name_)
from _table_name_
group by column_name1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Use max&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select max(_column_name_)
from _table_name_
group by column_name1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) Use average&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select avg(_column_name_)
from _table_name_
group by column_name1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iv) Use min&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select min (_column_name_)
from _table_name_
group by _column_name1;_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trick with group by is that we categorize data according to a value we choose. The grouping is in ascending order by default, and we can reverse it with the desc keyword. Priority of ordering is given to the column named in the group by clause rather than the one in the select clause.&lt;/p&gt;
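&lt;p&gt;Here is a small runnable sketch of group by with count, again via Python's sqlite3 with an invented School table. An order by is added so the output order is deterministic:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer)")
conn.executemany(
    "insert into School values (?, ?)",
    [("Amina", 10), ("Brian", 10), ("Cate", 12)],
)

# group by collapses rows sharing the same Age; count() summarizes each group.
per_age = conn.execute(
    "select Age, count(First_Name) from School group by Age order by Age"
).fetchall()
print(per_age)  # [(10, 2), (12, 1)]
conn.close()
```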

&lt;p&gt;&lt;strong&gt;d) Join&lt;/strong&gt;&lt;br&gt;
This command combines rows from two or more tables based on related columns between them. There are four types of joins:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;i) Inner Join&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Returns records with matching values in both tables. If we query columns from two different tables that share a common column, only the rows whose values match in that common column are returned, and the remaining columns from both tables come along with them. Example:&lt;/p&gt;

&lt;p&gt;Our database has a table of customers_information and credit&lt;/p&gt;

&lt;p&gt;CUSTOMER_ID DATE    COUNTRY NAME    BUSINESS&lt;br&gt;
4   10/12/22    Kenya   Israel  Hairdressing&lt;br&gt;
7   14/12/22    Tanzania    Tesh    Hotel&lt;br&gt;
1   20/12/22    Uganda  Amor    School&lt;/p&gt;

&lt;p&gt;credit table&lt;br&gt;
CUSTOMER_ID CREDIT  STATUS&lt;br&gt;
7   Poor    Denied&lt;br&gt;
6   Good    In_process&lt;br&gt;
4   Excellent   Complete&lt;/p&gt;

&lt;p&gt;When we use an inner join, we write the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select customers_information.DATE, credit.STATUS
from credit
inner join customers_information on credit.CUSTOMER_ID=customers_information.CUSTOMER_ID;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The syntax is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name,_ _column_name2_
from _table_1
_inner join _table_2_
on _table_1.column_name = table_2.column_name2_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query returns:&lt;/p&gt;

&lt;p&gt;CUSTOMER_ID DATE    STATUS&lt;br&gt;
7   14/12/22    Denied&lt;br&gt;
4   10/12/22    Complete&lt;/p&gt;
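&lt;p&gt;We can reproduce the join above end to end with a small Python sqlite3 sketch, using the article's two tables (the customer ID is included in the select for readability):&lt;/p&gt;

```python
import sqlite3

# Rebuild the article's customers_information and credit tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers_information (CUSTOMER_ID int, DATE text, COUNTRY text, NAME text, BUSINESS text);
insert into customers_information values
    (4, '10/12/22', 'Kenya', 'Israel', 'Hairdressing'),
    (7, '14/12/22', 'Tanzania', 'Tesh', 'Hotel'),
    (1, '20/12/22', 'Uganda', 'Amor', 'School');
create table credit (CUSTOMER_ID int, CREDIT text, STATUS text);
insert into credit values
    (7, 'Poor', 'Denied'),
    (6, 'Good', 'In_process'),
    (4, 'Excellent', 'Complete');
""")

# Only customer IDs present in both tables (4 and 7) survive the inner join.
rows = conn.execute("""
    select credit.CUSTOMER_ID, customers_information.DATE, credit.STATUS
    from credit
    inner join customers_information
      on credit.CUSTOMER_ID = customers_information.CUSTOMER_ID
    order by credit.CUSTOMER_ID
""").fetchall()
print(rows)  # [(4, '10/12/22', 'Complete'), (7, '14/12/22', 'Denied')]
conn.close()
```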

&lt;p&gt;It is also possible to join three tables, all the more reason to like SQL 😂 right?&lt;br&gt;
 We use the same syntax as above, with a slight modification to accommodate the third table, as demonstrated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name,_ _column_name2_,column_name3
from _table_1
_inner join _table_2_
on _table_1.column_name = table_2.column_name2_;
on _table_1.column_name = table_3.column_name3_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Right Join&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This command returns all records of the second (right) table, together with any matching records from the left table; where there is no match, the columns from the left table are filled with null. The syntax is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_, _column_name2_
from _table_1_
right join _table_2_
on _table_1_.column_name = _table_2_.column_name2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) Left Join&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This is the mirror image of the right join: it returns all records of table 1 plus any matching records found in table 2. The syntax is also similar to that of the right join, as shown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name(s)_
from _table_1_
left join _table_2_
on _table_1_.column_name = _table_2_.column_name2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iv) Cross Join&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This returns the Cartesian product of the selected tables: every row of the first table combined with every row of the second, whether the rows match or not. When a conditional where clause is added, the results become similar to those of an inner join query.&lt;br&gt;
Its syntax is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_, _column_name2_
from _table_1_
cross join _table_2_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;e) Aggregate&lt;/strong&gt;&lt;br&gt;
In this part, we shall take a look at the syntax of the aggregate commands: sum, avg, count, min and max. Let us look at each of them briefly.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;i) Sum&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Say we have a numeric column, such as the number of children per family in a given city. We can get the total number of children in that city with the sum command, which uses the following syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select sum( _column_name)
from _table_name_;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to this, we can also give conditions to the command of sum to get sum of specific data as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select sum( _column_name_)
from _table_name_
where condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Count&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
If we are not sure of the number of rows, or just need to confirm it for safety purposes, this is the command to use. count() returns the number of rows in a given column using the following syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count( _column_name_)
from _table_name_;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As also in the sum syntax, conditions can be added to return a specific number of desired rows in a column.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) Min&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
As the name suggests, the min() function returns the least value in a given column. Additional conditions are also accepted using the where clause. It follows this syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select min( _column_name_)
from _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iv) Avg&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Given a numeric column, let's say the age of students in a class, we can get their average age using the following syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select avg( _column_name_)
from _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;v) Max&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
As the name suggests, this function returns the maximum value of a numeric column, as shown in the syntax below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select max( _column_name_)
from _table_name_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
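&lt;p&gt;All five aggregates can be exercised at once. Here is a runnable sketch via Python's sqlite3, with an invented table of three ages:&lt;/p&gt;

```python
import sqlite3

# One hypothetical numeric column to exercise sum, count, min, avg and max.
conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer)")
conn.executemany("insert into School values (?, ?)",
                 [("Amina", 10), ("Brian", 12), ("Cate", 14)])

row = conn.execute(
    "select sum(Age), count(Age), min(Age), avg(Age), max(Age) from School"
).fetchone()
print(row)  # (36, 3, 10, 12.0, 14)
conn.close()
```

&lt;p&gt;Note that avg returns a floating-point number even when the inputs are integers.&lt;/p&gt;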



&lt;p&gt;&lt;strong&gt;f) Null Values&lt;/strong&gt;&lt;br&gt;
To check whether a record has null values (no value at all), we use the is null and is not null operators.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;i) Is null&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
It returns only the records where the column is empty, as illustrated in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where Age is null;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If there are empty records in the Age column, the query returns those rows; otherwise, the result set comes back empty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Is not null&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
It is used to show all records that do not have empty values. It is illustrated in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Age, First_Name
from School
where Age is not null;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
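&lt;p&gt;Both operators can be demonstrated together in a small Python sqlite3 sketch with an invented table in which one student's age is missing:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer)")
conn.executemany("insert into School values (?, ?)",
                 [("Amina", 10), ("Brian", None), ("Cate", 12)])

# is null finds the missing ages; is not null finds the filled ones.
missing = conn.execute(
    "select First_Name from School where Age is null").fetchall()
present = conn.execute(
    "select First_Name from School where Age is not null order by First_Name"
).fetchall()
print(missing)  # [('Brian',)]
print(present)  # [('Amina',), ('Cate',)]
conn.close()
```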



&lt;p&gt;&lt;strong&gt;g) Regulatory clauses&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;i) Limit&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Given that we sometimes work with large data sets where not all rows need to be viewed, we can use the limit clause to cap the number of rows returned. In the example below, only 5 rows are returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select *
from School
limit 5;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Between&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This clause gives a range of records that should be returned. An example is given below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select *
from School
where Age between 10 and 14;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) In&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
It is used to match a column against multiple values inside a where statement, as shown in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select *
from School
where Age in (3,5);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the syntax is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select _column_name_
from School
where column_name1 in (value1_in_column_name1,value5_in_column_name1);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
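&lt;p&gt;The three clauses above can be tried together in one Python sqlite3 sketch over an invented School table:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer)")
conn.executemany("insert into School values (?, ?)",
                 [("Amina", 3), ("Brian", 5), ("Cate", 11), ("Dan", 14)])

# limit caps the row count; between filters a range; in matches a value list.
first_two = conn.execute(
    "select First_Name from School limit 2").fetchall()
in_range = conn.execute(
    "select First_Name from School where Age between 10 and 14 order by Age"
).fetchall()
picked = conn.execute(
    "select First_Name from School where Age in (3, 5) order by Age"
).fetchall()
print(first_two, in_range, picked)
conn.close()
```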



&lt;p&gt;&lt;strong&gt;h) Manipulation&lt;/strong&gt;&lt;br&gt;
Several commands let us manipulate the tables in a database. For instance, suppose the name of a certain column is too technical for some of the people who need to view the table. In this case, the AS clause is used to rename it in the output. It is also called an alias, as illustrated in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select Last_Name as Surname
from School
where Age in (3,5);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query returns the 'Last_Name' column, but displayed under the heading 'Surname'. &lt;br&gt;
It is important to note that the alias lasts only for the duration of the query and does not change the table permanently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Update statement&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
If we want to change some values in the records we have, we use the update statement. The where clause specifies which records are being changed; it is very important to include, because without it all records will be updated. The example below shows how this works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;update _table_name_
set column_name =(val_1,val_2,val_3)
where condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;for instance&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;update School
set First_Name = ('Julius'),Age = ('3')
where Residence = Nairobi;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
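&lt;p&gt;The example above can be run end to end with Python's sqlite3 (table contents invented), which also shows that rows outside the where condition are left untouched:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer, Residence text)")
conn.executemany("insert into School values (?, ?, ?)",
                 [("Amina", 10, "Nairobi"), ("Brian", 12, "Kisumu")])

# The where clause restricts the update to Nairobi rows only;
# without it, every record in the table would be changed.
conn.execute("update School set First_Name = 'Julius', Age = 3 "
             "where Residence = 'Nairobi'")
rows = conn.execute(
    "select First_Name, Age, Residence from School order by Residence"
).fetchall()
print(rows)  # [('Brian', 12, 'Kisumu'), ('Julius', 3, 'Nairobi')]
conn.close()
```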



&lt;p&gt;&lt;strong&gt;&lt;em&gt;iii) Delete statement&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Suppose we have just realized that some of the values recorded in a table are wrong. Is there a way to delete these records until we find the right values? Of course: the delete statement makes this possible. Using the following example, we can delete the records of every student whose age is 3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;delete from school
where Age = 3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now if we are frustrated and have just learnt that all our records are wrong, we can delete every record and keep the empty table. What a fresh start, right? This is achieved using the following syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;delete from School;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All records in the 'School' table will be deleted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;i) Union&lt;/strong&gt;&lt;br&gt;
Just like in mathematics, a union statement combines the results of two select statements. The selected columns must be of the same data types and appear in the same order. Union returns only distinct values from the two queried columns, as in the syntax below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select column_name from table1
union
select column_name(n) from table2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get duplicates, as in all the values from the two columns from the two tables, use union all as in the syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select column_name from table1
union all
select column_name(n) from table2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
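&lt;p&gt;The difference between union and union all is easy to see in a Python sqlite3 sketch with two invented one-column tables that share one value:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table table1 (city text);
insert into table1 values ('Nairobi'), ('Mombasa');
create table table2 (city text);
insert into table2 values ('Nairobi'), ('Kisumu');
""")

# union keeps only distinct values; union all keeps duplicates.
distinct_cities = conn.execute(
    "select city from table1 union select city from table2").fetchall()
all_cities = conn.execute(
    "select city from table1 union all select city from table2").fetchall()
print(sorted(distinct_cities))  # [('Kisumu',), ('Mombasa',), ('Nairobi',)]
print(len(all_cities))          # 4, 'Nairobi' appears twice
conn.close()
```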



&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;i) Creation of a Database&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
A simple, straightforward command creates a database with the desired name, as in the example below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;create database _database_name_;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Deletion&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We have been discussing commands that affect tables and columns, but in this section the discussion will be on commands that affect the database as a whole.&lt;/p&gt;

&lt;p&gt;If we have been recording very confidential information for a given NGO, say medical reports of individuals in a given area, and the task the NGO desired has been achieved, we can delete the database if the records are no longer needed. For this case, we use the drop database command, as illustrated:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;drop database _database_name_;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tables&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;i) Creation&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
We can create a table from scratch or using an existing table. All this is illustrated in the following examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table _table_name_(
      _column_name datatype,
      column_name1 datatype,
      column_name2 datatype_
     );

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command above creates a table from scratch. We can also create a table from an existing one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table _table_name_ as 
   select column_name,column_name1,column_name2
   from _existing_table_name_;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;ii) Modification&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
To add, delete or change columns of a given table permanently, we use the alter table command.&lt;br&gt;
To add a column, we use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alter table _table_name_
add _column_name datatype_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also change something about a column, such as its datatype, as illustrated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alter table _table_name_
modify column _column_name datatype_;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is also the possibility of dropping a column altogether if it is no longer helpful. For instance, we can drop a year-of-birth column if the record already has an age column. We use the following syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alter table _table_name_
drop _column_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
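&lt;p&gt;Adding a column can be verified with a Python sqlite3 sketch (support for dropping columns varies by database engine and version, so only add column is demonstrated here):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table School (First_Name text, Age integer)")

# Add a new column; the change persists for the table's lifetime.
conn.execute("alter table School add column Class text")

# pragma table_info lists one row per column; index 1 holds the name.
columns = [row[1] for row in conn.execute("pragma table_info(School)")]
print(columns)  # ['First_Name', 'Age', 'Class']
conn.close()
```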



&lt;p&gt;These are the ABCs of the essential SQL commands for a data scientist; when fully practiced, they make data analysis efficient and easy.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploratory Data Analysis Ultimate Guide</title>
      <dc:creator>Belinda Florence</dc:creator>
      <pubDate>Wed, 01 Mar 2023 07:53:47 +0000</pubDate>
      <link>https://dev.to/belinda/exploratory-data-analysis-ultimate-guide-4a3f</link>
      <guid>https://dev.to/belinda/exploratory-data-analysis-ultimate-guide-4a3f</guid>
      <description>&lt;p&gt;Data is becoming more valuable to different institutions with time. Market behavior and potential customers can be predicted from existing data. This has led to the phenomena of "Big Data", large amounts of data that are analyzed computationally. &lt;br&gt;
Equally, EDA is a term that one can frequently come across in data science and it can be a little heavy for newbies especially when they are not familiar with what the abbreviation means. EDA in data science stands for Exploratory Data Analysis. So we shall take a look at this concept in depth in this article and perhaps, guide you into one or more issues in these concepts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;&lt;br&gt;
There are important characteristics of a given data set that one seeks when analyzing data. After the analysis, one can present the discovery (whether new or formerly known) in summary and perhaps in a visual form. Therefore, &lt;strong&gt;&lt;em&gt;exploratory data analysis&lt;/em&gt;&lt;/strong&gt; can be defined as an&lt;br&gt;
 &lt;em&gt;approach to closely analyzing data in order to become more familiar with it&lt;/em&gt;. Simply put, it is a way to get a basic understanding of the data at hand. Regardless of the approach, the end goal is to establish patterns in the data that are relevant to the institution or the analyst. There are several ways to go about this, and different depths one can reach in the analysis. In the following part, we shall take a look at the steps used in EDA and how to accomplish them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps in EDA&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data collection: data is gathered from different sources, such as an Excel file, an API endpoint, or a CSV file from sites like Kaggle and GitHub. &lt;/li&gt;
&lt;li&gt;Load data: data is what is to be explored, and it can be loaded in different ways: &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Upload already available data from local machine: This is done by use of an available button in the notebook environment, (for my case I am using Jupyter notebook). In the Jupyter Notebook's Home page, there is a button on the upper right side of the page tagged "upload". When pressed it takes one to the area in which the file to be uploaded is located. After selecting the file, click the upload blue button to allow the file to be in the Jupyter Hub.&lt;/p&gt;

&lt;p&gt;Data can also be loaded using the command line: next to the "upload" button is the "new" button. When clicked, it gives several options, among them a "Terminal" button. Click "Terminal" to open the command line, then enter the following command to download data into the current directory:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;wget &amp;lt;MY-FILE-URL&amp;gt;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;In the case that the file being downloaded is a zip file, make sure the unzip tool is available; if it is missing, install it with your system's package manager, for example on Debian/Ubuntu:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;apt-get install unzip&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can then now proceed to unzip the file using the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;unzip &amp;lt;"downloaded_file_name"&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the notebook and import a few libraries that will help you explore different aspects of your data; numpy, pandas, matplotlib, and seaborn. Please note, these are not the only libraries that can help in analysis, but they are the most used in the analysis, thus they are preferred.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib as plt
import numpy as np
import seaborn as sns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Afterward, click the "Run" button just above the cell to run it and get feedback. If no error is given, the cell has been successfully executed.&lt;/p&gt;

&lt;p&gt;Next, you read the data from where it is located, using the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name=pd.read_csv("directory of the data location")&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;When you run this command and it does not return an error message, it means the file can now be referred to using the variable name assigned to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Actual EDA&lt;/strong&gt;&lt;br&gt;
Now, this is the beginning of the most interesting part of EDA for me, where we get our hands dirty with the real thing. Analysis can be done on various aspects and aim at different goals. Depending on the need, analysis can be done using different libraries in the Python language. The following are some of the functions and tools that can be used to do data analysis.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;a) Read head&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Using the following command, one is able to get a preview of the nature of the data being used without much sweat.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.head()&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;It returns the value of the first five rows if not specified. To specify the number of rows to preview, just enter the integer inside the brackets. For example&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.head(10) # which returns the first 10 rows of the data&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Likewise, we can read the last 5 rows using the tail function that can be written in the following manner:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.tail()&lt;/code&gt;&lt;/p&gt;
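&lt;p&gt;As a quick sketch, with a small made-up DataFrame standing in for loaded CSV data:&lt;/p&gt;

```python
import pandas as pd

# Invented data: seven rows with one numeric column.
df = pd.DataFrame({"Age": [10, 11, 12, 13, 14, 15, 16]})

print(df.head())        # first five rows by default
print(df.head(2))       # first two rows
print(df.tail().shape)  # the last five rows have shape (5, 1)
```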

&lt;p&gt;&lt;strong&gt;&lt;em&gt;b) Number of rows and columns&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
If you are interested in how many rows and columns (dimensions) you are working with, you can use the following:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.shape&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;c) Check for null values and types of data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
 We use the info() method to achieve this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.info()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It returns the different columns present and their type of object. In addition, it tells you if any of the columns has null values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;d) A summary of statistical status of the data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
 It gives the count, mean, standard deviation, minimum, maximum, and the quartiles. We can get this information using the describe function as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.describe()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;e) Getting the unique values in a given column&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
To analyze the presence and nature of unique elements in the data set, use the unique() function as shown:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.column_name.unique()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;f) To see how many times a value appears in a certain column&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
The frequency of a certain value can give deeper insight during analysis and can be obtained with the value_counts() function, as below. It returns the values alongside their counts.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.column_name.value_counts()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;g) Number of array dimensions or axes&lt;/strong&gt;&lt;/em&gt; &lt;br&gt;
To see the nature of the dataframe in terms of dimensions, use the ndim attribute:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.ndim&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;h) Number of elements in an object&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The size attribute returns an integer giving the number of elements in the object:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.size&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;i) Check if dataframe is empty&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
We can check whether the data we have is populated at all using the following code:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.empty&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;j) To check memory usage of the dataframe&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Use the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.memory_usage()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;k) Access a single value&lt;/strong&gt;&lt;br&gt;
When we want to access a single value at a given row and column, we can use the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.at[row_label, "column_name"]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;l) Get columns in the data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
 To achieve this, we use the columns attribute, which returns all columns in order as an Index array:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.columns&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;m) Correlation&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
To see the negative, moderate, and positive correlation, we can use the corr() function to see this in a table.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.corr()&lt;/code&gt;&lt;/p&gt;
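&lt;p&gt;A tiny made-up DataFrame makes the output of corr() easy to read, with one column that rises with another and one that falls:&lt;/p&gt;

```python
import pandas as pd

# Invented data: b is exactly 2*a; c moves in the opposite direction to a.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
corr = df.corr()

print(corr.loc["a", "b"])  # approximately 1.0: perfect positive correlation
print(corr.loc["a", "c"])  # approximately -1.0: perfect negative correlation
```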

&lt;p&gt;&lt;strong&gt;Graphical Representation in Data Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can also have a visual representation of data using the functions offered by numpy, pandas, seaborn, and matplotlib. In this section, we shall have a deep insight into these libraries and the range of interesting things they are capable of achieving.&lt;/p&gt;

&lt;p&gt;Import pyplot from matplotlib in the case that you have not imported it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;a) Bar Chart&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
When we want a bar chart, we have to pass three parameters to the plot() function in matplotlib. They are the x-axis, the y-axis and the type of plotting we need. In our case, it is a bar chart.&lt;br&gt;
We can use the guideline to help us do this&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.plot(x="column1", y= "column2",kind ="bar",figsize=(20,15)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the case that the figsize is not explicitly given, the plot returns a default size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;b) Line Graph&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
When we want to see our data on a line graph, we use the same method as above, but pass in the kind field, line as our type of graph that we want drawn&lt;/p&gt;

&lt;p&gt;&lt;code&gt;variable_name.plot(x="column_name", y= "column_name",kind ="line",figsize=(20,15)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;c) Plot a single column&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
We use the seaborn library to achieve this. As part of the code, we use distplot function to plot the data in the given column as below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sns.distplot(variable_name["column_name"])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;d) General Information&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
We can plot all the columns in a single graph and analyze it visually. However, when dealing with huge data, the analysis may be a little bit difficult cause of the congestion. We can try to reduce the congestion by passing the fig size that is a little bigger than the default one.&lt;br&gt;
We use the plot and the show function to achieve this as shown&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name.plot()

mplt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;d) General Plot of a Column&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Data from a single column can be plotted using this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name["column_name"].plot()

mplt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;e)Comparison of two different Columns&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A comparison of two different columns can be done and a relationship traced if it exists. We should be careful to reasonably choose the columns under analysis to avoid weird graphs that are trying to stretch and accommodate the outrageous data range difference. We can use the following guideline to achieve this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name. plot.scatter(x="column_name",y="column_name",alpha = 0.5)

mplt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this instance, we have chosen a scatter graph to be plotted in order to see the variations and relationships in between the columns.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;f) Box Graph&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A box graph represents a summary of a set of data in five numbers: the minimum, first quartile, median, third quartile and maximum. We can have a graph to represent this information using the plot() function as shown&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name.plot.box(figsize=(n,m))

plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns the data displayed in a graph and each column is represented in the graph and an analysis can be done from that information.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;g) Correlation Objects&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We have seen above that python has a function that enables us to get the correlation of the data we have. Now correlation objects will be very useful as we will see in the next section. In this section, we shall demonstrate how to create a correlation object&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name_of_object = variable_name.corr()

data_set_name.corr()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
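&lt;p&gt;As a tiny concrete sketch (the columns and their values are made up; pandas is assumed installed), two perfectly proportional columns should show a correlation of 1.0:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data: "y" is exactly 2 * "x", so they are perfectly correlated
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})

corr_matrix = df.corr()  # pairwise correlation of the numeric columns
print(corr_matrix)
```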



&lt;p&gt;&lt;em&gt;&lt;strong&gt;h) Heatmaps&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
We can view correlations graphically using a heatmap, whose colors change with the strength of the correlation between elements. In the following code, we demonstrate how to achieve this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sns.heatmap(variable_name_of_object, cmap='Reds', annot=True)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We can also manipulate the data: adding columns, and dropping others, until we get what we desire. Python offers these functionalities to let us explore the different possibilities that come with data analysis.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;a) Create a column from derived information&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now let us consider a circumstance in which we want a new column that is the result of a mathematical operation on an already existing column. Performing the operation on the elements one at a time can be tedious, especially when working with a very large data set. It is therefore important to seek a solution that is fast and efficient. Fortunately, a new column can be created from existing columns: a derived-information column.&lt;br&gt;
The code below illustrates how this can be achieved&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name_of_dataset ["new_column_name"] = variable_name_of_dataset ["existing_column_name"] *2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the mathematical operation multiplies the elements of the existing column by 2; hence the "*2" at the end of the operation.&lt;/p&gt;
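&lt;p&gt;A minimal runnable version of the derived-column step, assuming pandas is installed; the "price" column is hypothetical:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data set with a single numeric column
df = pd.DataFrame({"price": [10, 20, 30]})

# Derive a new column by multiplying every element of "price" by 2;
# the operation is applied to the whole column at once, not element by element
df["double_price"] = df["price"] * 2

print(df)
```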

&lt;p&gt;&lt;em&gt;&lt;strong&gt;b) Renaming Columns&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Downloaded data often has its columns named according to the needs and knowledge of the person who collected it. However, the naming may not suit the needs of the person doing the analysis. To make the data more familiar and usable, the analyst can rename the columns by giving each one the alternate name they desire. The guideline below shows how this is done with the rename() function, which takes a dictionary mapping the original column name to the new column name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Renaming Columns

new_variable_name = variable_name.rename(columns=
{"old_column_name": "new_column_name"})

new_variable_name.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above guideline renames just a single column, but more than one column can be renamed: we simply pass the old and new column names as key-value pairs of a Python dictionary.&lt;/p&gt;
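&lt;p&gt;A sketch of renaming several columns at once via a dictionary; the column names here are invented, and pandas is assumed installed:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data with terse collector-chosen column names
df = pd.DataFrame({"Ht(cm)": [150, 160], "Wt(kg)": [50, 58]})

# Rename both columns in one call; rename() returns a new DataFrame
# and leaves the original untouched
renamed = df.rename(columns={"Ht(cm)": "height_cm", "Wt(kg)": "weight_kg"})

print(renamed.head())
```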

&lt;p&gt;The rename can also change letter case, that is, to lower or upper case. We take the same steps, but instead of passing the old and new column names, we pass a case-conversion function to be applied to every column label&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Renaming Columns

new_variable_name = variable_name.rename(columns=str.lower)

new_variable_name.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data analysis is not limited to the above functions, but these are the key and most used ones for seeing different aspects of data. I have made a notebook covering most, if not all, of the things mentioned above for reference in case you are stuck. It is in my GitHub account and can be accessed via the link below; if it is helpful, please give it an upvote: &lt;a href="https://github.com/Gamalie/Data-Science" rel="noopener noreferrer"&gt;https://github.com/Gamalie/Data-Science&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It illustrates the things discussed above and should assist those stuck on how to analyze their data in preparation for building a machine learning or deep learning model.&lt;/p&gt;

&lt;p&gt;There are also several sites that can help with further data exploration. Kaggle has guidelines on how to analyze different data sets. The official pandas, Seaborn, NumPy, and Matplotlib documentation will help you understand these libraries and the analysis tools they offer.&lt;/p&gt;

</description>
      <category>portfolio</category>
      <category>career</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Introduction to Python</title>
      <dc:creator>Belinda Florence</dc:creator>
      <pubDate>Sun, 19 Feb 2023 20:53:16 +0000</pubDate>
      <link>https://dev.to/belinda/introduction-to-python-nn7</link>
      <guid>https://dev.to/belinda/introduction-to-python-nn7</guid>
<description>&lt;p&gt;Data science, big data, IoT, AI, you name it: the world is moving to the intelligent side, and we have no choice but to move with it. As tech enthusiasts, we have to be on our toes to keep up with the world's changing technology. Many of these technologies are built on the foundation of Python. By this point, we can agree that Python is an important programming language. So let us delve into this interesting language and get to know more about it.&lt;br&gt;
Python is an object-oriented, dynamically typed, interpreted programming language; it uses objects to organize software design rather than just functions. One interesting thing about this language is that there are no type declarations for variables or methods in the source code. You may ask, what is the advantage of not declaring types beforehand? Well, types are checked at run-time rather than compile-time, which makes the code brief and flexible. The shorter the code, the happier the developer. Python is open source and therefore allows developers to share ideas and learn from one another. The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python website, &lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;https://www.python.org/&lt;/a&gt;, and may be freely distributed.&lt;br&gt;
Now that you have an idea about Python, let us look at the installation process. Installation is easiest for Linux users, as the OS usually ships with a recent version of Python. The preferred installer for Python is pip; starting with Python 3.4, it is included by default with the Python binary installers. The standard packaging tools are all designed to be used from the command line. Using the following command, you can install the latest version of a module and its dependencies from the Python Package Index.&lt;br&gt;
&lt;code&gt;python -m pip install MyPackage&lt;/code&gt;&lt;br&gt;
One can also install a specific version in the command line using the following command:&lt;br&gt;
&lt;code&gt;python -m pip install SomePackage==1.0.4    # specific version&lt;/code&gt;&lt;br&gt;
Existing modules can also be upgraded using the following command:&lt;br&gt;
&lt;code&gt;python -m pip install --upgrade SomePackage&lt;/code&gt;&lt;br&gt;
The creation of virtual environments is done through the venv module. &lt;br&gt;
The Python interpreter is usually installed as /usr/local/bin/python3.11 on those machines where it is available; putting /usr/local/bin in your Unix shell’s search path makes it possible to start it by typing the command:&lt;br&gt;
&lt;code&gt;python3.11&lt;/code&gt;&lt;br&gt;
On Windows machines where you have installed Python from the Microsoft Store, the python3.11 command will be available. If you have the py.exe launcher installed, you can use the py command.&lt;br&gt;
To install the py.exe launcher, go to &lt;a href="https://docs.python.org/3/download.html" rel="noopener noreferrer"&gt;https://docs.python.org/3/download.html&lt;/a&gt; and download the appropriate exe file for your computer's specification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open the installed interpreter using the following command:
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;python3.11&lt;/code&gt;&lt;br&gt;
This runs the interpreter directly&lt;/p&gt;

&lt;h2&gt;
  
  
  Run the following commands to see the version
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys

print("User Current Version:-", sys.version)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Now you can run a simple code on the interpreter. Example:
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;a = 4&lt;/code&gt;&lt;br&gt;
&lt;code&gt;b = 2&lt;/code&gt;&lt;br&gt;
&lt;code&gt;c = a + b&lt;/code&gt;&lt;br&gt;
&lt;code&gt;c&lt;/code&gt;&lt;br&gt;
The answer will be&lt;br&gt;
&lt;code&gt;6&lt;/code&gt;&lt;br&gt;
From here you can run more mathematical simple operations to get more familiar with the language.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Syntax
&lt;/h2&gt;

&lt;p&gt;Now we do not want to be frustrated the first time we use this programming language, at least not by a syntax error. This calls for a basic knowledge of the dos and don'ts of this amazing language.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Python uses new lines to complete a command, as opposed to other programming languages which often use semicolons or parentheses.&lt;/li&gt;
&lt;li&gt;Python is indentation sensitive, which means the right indentation must be used for the program to run smoothly. Lines of code in the same block should be aligned with each other. For example:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def degree_year(cy,ey) #method to check number of years in school
    return(cy-ey)
degree_year(2023,2017)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, the call to the method is aligned with its definition, while the return statement is indented inside it. If you put the second line of code at the same level as the other two, an error will occur.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Commenting is done using the hashtag character "#" to indicate the beginning of the comment. Comments are writings within the code that are not executed but give further information about the particular line of code.&lt;/li&gt;
&lt;li&gt;Python is case-sensitive. Class names should begin with an uppercase letter, and method names should start in lowercase, with an uppercase letter at the beginning of each subsequent word when the name is made up of two or more words. For instance:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;def className():  # method name is camel case&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class Nairobi:  # class name begins with an upper case&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For more identifiers and the rules that they follow, you can visit &lt;a href="https://www.scaler.com/topics/is-python-case-sensitive-when-dealing-with-identifiers/" rel="noopener noreferrer"&gt;https://www.scaler.com/topics/is-python-case-sensitive-when-dealing-with-identifiers/&lt;/a&gt;&lt;/p&gt;
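&lt;p&gt;A quick runnable illustration of case sensitivity: two names that differ only in letter case are entirely separate variables in Python (the names here are arbitrary):&lt;/p&gt;

```python
# Python treats identifiers case-sensitively, so these are two
# distinct variables, not one
value = 10
Value = 20

print(value, Value)
```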

&lt;h2&gt;
  
  
  What is Python used for
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Creation of web applications.&lt;/li&gt;
&lt;li&gt;Python can be used alongside software to create workflows.&lt;/li&gt;
&lt;li&gt;Read and modify files when used alongside a database&lt;/li&gt;
&lt;li&gt;Python can be used to handle big data and perform complex mathematics.&lt;/li&gt;
&lt;li&gt;Python can be used for rapid prototyping, or for production-ready software development.&lt;/li&gt;
&lt;li&gt;Python is used in data science and deep learning through the Jupyter and Colab notebook environments.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For more information, you can visit the following websites for resources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;[(&lt;a href="https://www.w3schools.com/python/python_intro.asp)" rel="noopener noreferrer"&gt;https://www.w3schools.com/python/python_intro.asp)&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;[(&lt;a href="https://www.geeksforgeeks.org/check-the-version-of-the-python-interpreter/)" rel="noopener noreferrer"&gt;https://www.geeksforgeeks.org/check-the-version-of-the-python-interpreter/)&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;[(&lt;a href="https://docs.python.org/3/tutorial/interpreter.html)" rel="noopener noreferrer"&gt;https://docs.python.org/3/tutorial/interpreter.html)&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Happy Coding!!!&lt;/p&gt;

</description>
      <category>pcgaming</category>
      <category>discuss</category>
      <category>gamedev</category>
    </item>
  </channel>
</rss>
