<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: El Marie 💜</title>
    <description>The latest articles on DEV Community by El Marie 💜 (@lornamariak).</description>
    <link>https://dev.to/lornamariak</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F48924%2Ffae879e3-6c72-4c7b-9214-0ec1694f332f.jpg</url>
      <title>DEV Community: El Marie 💜</title>
      <link>https://dev.to/lornamariak</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lornamariak"/>
    <language>en</language>
    <item>
      <title>Answering questions about the UDACITY Artificial Intelligence Product Manager Nanodegree Certification  👩🏾‍🎓🥳</title>
      <dc:creator>El Marie 💜</dc:creator>
      <pubDate>Thu, 25 Jun 2020 13:10:04 +0000</pubDate>
      <link>https://dev.to/lornamariak/answering-questions-about-the-udacity-artificial-intelligence-product-manager-nanodegree-certification-52b6</link>
      <guid>https://dev.to/lornamariak/answering-questions-about-the-udacity-artificial-intelligence-product-manager-nanodegree-certification-52b6</guid>
<description>&lt;p&gt;I originally posted this on my &lt;a href="https://www.youtube.com/lornamaria"&gt;YouTube channel&lt;/a&gt;, hence all the opening theatrics.&lt;/p&gt;

</description>
      <category>udacity</category>
      <category>aiproductmanager</category>
      <category>nanodegree</category>
    </item>
    <item>
      <title>Building a personal data science project</title>
      <dc:creator>El Marie 💜</dc:creator>
      <pubDate>Thu, 11 Jun 2020 14:14:26 +0000</pubDate>
      <link>https://dev.to/lornamariak/building-a-personal-data-science-project-3c5b</link>
      <guid>https://dev.to/lornamariak/building-a-personal-data-science-project-3c5b</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;The key to learning anything is practice, practice, and more practice. The higher you go on the learning curve, the more complex your practice exercises get. Do you remember the leap from your first coding exercise to building side projects over the weekend? It all builds on incremental practice.&lt;/p&gt;

&lt;p&gt;One of my biggest challenges as a self-taught analyst (for the most part) was building my data science portfolio, so I began by building little data science projects.&lt;/p&gt;

&lt;p&gt;Let us talk about how you too can plan and build your first little project, which could be your gateway to a new job or an addition to your portfolio.&lt;/p&gt;

&lt;h3&gt;
  
  
  Identify a case/problem
&lt;/h3&gt;

&lt;p&gt;The case you choose could solve a problem, bring a new perspective to an existing phenomenon, or explore an unknown one. Take your time to research scenarios you are interested in and what you can potentially do with them, and whatever you choose, aim to do a comprehensive data analysis on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip: Always pick a problem or scenario that relates to something you are passionate about; that way you will stay motivated to work through the entire project.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding a data set
&lt;/h3&gt;

&lt;p&gt;Now that you have a case, you need to find data that relates to it. There are many free data sets across the internet; however, if you want to take it a notch higher, you can try collecting your own data and learn what goes into designing a form that captures the right details for your case. Otherwise, pick a sample data set from one of the open data forums on the internet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip: Kaggle has a large collection of datasets and lets you see what other people have done with them, a great start for your little project!&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Asking questions &amp;amp; Telling stories
&lt;/h3&gt;

&lt;p&gt;Data science centres on asking questions, but not just any questions: you need to ask the right ones. In this step, you formulate several questions that you will answer using the data set at hand, then present them in a story flow to make sense of the data. Storytelling is another common facet of data science, and using the answers to your questions you can write a compelling story that justifies the case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip: Write a script to prepare for this step; it will guide you in writing a great story and keep your analysis straight to the point.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Examining Trends and interesting facts
&lt;/h3&gt;

&lt;p&gt;While working with data, it is important to think outside the box and explore parameters from different categories to seek out unknown correlations; this step is also crucial for addressing bias and stereotypes in the data. At this point, list all the variants of test cases that can be applied to the data at hand and test each one while tweaking your ideas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip: This is a discovery process, so be open-minded.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Presentation
&lt;/h3&gt;

&lt;p&gt;Now it is time to visualize and communicate your findings. First, understand the audience you intend to communicate them to; as much as we love graphs, I recommend layering graphs with other graphics to captivate your audience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip: Explore visualizations out of the traditional graphs and consider layering different graph styles to create captivating graphs.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Once you have created your little data science project, share it with the world and add it to your portfolio. It could help someone in the industry, and it contributes to the wider knowledge the data science community is building around the world.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Exploring Sentiment Analysis as an application of text mining.</title>
      <dc:creator>El Marie 💜</dc:creator>
      <pubDate>Tue, 30 Jan 2018 20:13:47 +0000</pubDate>
      <link>https://dev.to/lornamariak/exploring-sentiment-analysis-o6j</link>
      <guid>https://dev.to/lornamariak/exploring-sentiment-analysis-o6j</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fh9dvvoo3rgbg7z4xjylf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fh9dvvoo3rgbg7z4xjylf.jpg" alt="Photo Credit: Pixabay"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This article is part 2 of Understanding Text Mining. If you just landed here, Part 1 is available &lt;a href="https://dev.to/lornamariak/understanding-and-writing-your-first-text-mining-script-withr-345k"&gt;here&lt;/a&gt;.&lt;br&gt;
One of the applications of text mining is sentiment analysis. Before we can carry out a sentiment analysis of our mined text, we are required to clean and prepare our data set as we saw in Part 1.&lt;/p&gt;
&lt;h2&gt;
  
  
  Understanding Sentiment Analysis
&lt;/h2&gt;

&lt;p&gt;Sentiment Analysis: the study of extracted information to identify reactions, attitudes, context, and emotions. As one of the applications of text mining, sentiment analysis exposes the attitudes in the mined text.&lt;/p&gt;

&lt;p&gt;It is based on word polarities: positive and negative words are taken into account, while neutral words are dismissed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flxz42i73p3jml05atk9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flxz42i73p3jml05atk9t.png" alt="Table showing word polarity examples"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sentiment analysis is done using lexicons. A lexicon, in simpler terms, is a vocabulary, say the English lexicon. In this context, a lexicon is a selection of words tagged with the two polarities that can be used as a metric in sentiment analysis.&lt;/p&gt;

&lt;p&gt;There are many different lexicons that can be used depending on the context of the data you are working with. It is also possible to create a custom lexicon, depending on how much customization you would like to apply to your data.&lt;/p&gt;

&lt;p&gt;In this article, we shall make use of the syuzhet package. While there are a number of packages for sentiment analysis on CRAN, the syuzhet package is great to learn with because it bundles the most common lexicons, like nrc, bing, and afinn.&lt;br&gt;
We also make use of ggplot2 to visualize the results of the sentiment analysis.&lt;/p&gt;
&lt;h2&gt;
  
  
  How does Sentiment analysis work?
&lt;/h2&gt;

&lt;p&gt;In simple terms, sentiment analysis is performed as an intersection of a term-document (built from the mined text) and a lexicon of choice.&lt;/p&gt;

&lt;p&gt;The first step is to have a term-document and a lexicon of your choice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd3t0evjy423rk00u2z8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd3t0evjy423rk00u2z8f.png" alt="The first step is to have a term-document and a lexicon of your choice."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then form an intersection between the two sets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F83i67718zz9dwcoqsgns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F83i67718zz9dwcoqsgns.png" alt="Then form an intersection between the two sets."&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Hands-on with Sentiment analysis
&lt;/h2&gt;

&lt;h4&gt;Example one&lt;/h4&gt; 

&lt;p&gt;This is a simple example where we extract emotions from a sentence. We load the sentence, split it into words with the strsplit() function to form a character vector, and use the get_nrc_sentiment() function from the syuzhet library. This function takes in new_sentence and compares it with the nrc emotion lexicon to return the scores shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;syuzhet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"i love cats such a bundle of joy."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;new_sentence&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.character&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strsplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;get_nrc_sentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#This is the output&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;anger&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;anticipation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;disgust&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fear&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;joy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sadness&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;surprise&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;trust&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;Example two&lt;/h4&gt;
 

&lt;p&gt;This second example makes use of a TED Talks data set downloaded from Kaggle under the name transcript.csv. It was cleaned using the tm package, following the steps in Part 1 of this article, and carried forward for sentiment analysis here in Part 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;#load the libraries&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;syuzhet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggplot2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#mydataCopy is a term document,generated from cleaning #transcripts.csv &lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mydataCopy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#carryout sentiment mining using the get_nrc_sentiment()function #log the findings under a variable result&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;get_nrc_sentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;as.character&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mydataCopy&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#change result from a list to a data frame and transpose it &lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;result1&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#rowSums computes column sums across rows for each level of a #grouping variable.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;new_result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rowSums&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#name rows and columns of the dataframe&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_result&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"count"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;new_result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cbind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"sentiment"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;new_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;NULL&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#plot the first 8 rows,the distinct emotions&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;qplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;ggtitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"TedTalk Sentiments"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;#plot the last 2 rows ,positive and negative&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;qplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;ggtitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"TedTalk Sentiments"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Farjlve7c28nv4vgm0l5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Farjlve7c28nv4vgm0l5z.png" alt="Plot 1: Shows distinct emotions"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fys5yzu537d4qin7rromf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fys5yzu537d4qin7rromf.png" alt="Plot 2: Shows the combination of emotions under two polarities."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Plot 1: Shows distinct emotions&lt;br&gt;
Plot 2: Shows the combination of emotions under two polarities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We have applied sentiment analysis to mined text to arrive at a clear description of the emotions attached to it.&lt;/p&gt;

&lt;p&gt;This could grow into a whole project that helps you gain insight into how and when to talk to your audience, what they feel about a certain topic, product, or service, and how you can better interact with them.&lt;/p&gt;

&lt;p&gt;Now, go ahead and choose an article, dataset, or campaign that you want to try sentiment analysis on, and follow the steps.&lt;/p&gt;

&lt;p&gt;Happy Coding, I am always here to help &amp;lt;- &lt;a class="mentioned-user" href="https://dev.to/lornamariak"&gt;@lornamariak&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>textmining</category>
      <category>sentimentanalysis</category>
    </item>
    <item>
      <title>Understanding and Writing your first Text Mining Script with R.</title>
      <dc:creator>El Marie 💜</dc:creator>
      <pubDate>Thu, 11 Jan 2018 13:09:32 +0000</pubDate>
      <link>https://dev.to/lornamariak/understanding-and-writing-your-first-text-mining-script-withr-345k</link>
      <guid>https://dev.to/lornamariak/understanding-and-writing-your-first-text-mining-script-withr-345k</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftj2hzs1pjc9grhiteh4f.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftj2hzs1pjc9grhiteh4f.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;One of the reasons data science has become popular is its ability to reveal so much information about large data sets in a split second, or with just a query.&lt;/p&gt;

&lt;p&gt;Think about it deeply: on a daily basis, how much information in the form of text do we give out? All of this information contains our sentiments, our opinions, our plans, pieces of advice, and our favorite phrases, among other things.&lt;/p&gt;

&lt;p&gt;However, revealing any of this can seem like finding a needle in a haystack, until we use techniques like text mining and analysis.&lt;br&gt;
Text mining takes into account information retrieval, the analysis and study of word frequencies, and pattern recognition to aid visualization and predictive analytics.&lt;/p&gt;

&lt;p&gt;In this article, we go through the major steps a dataset undergoes to get ready for further analysis. We shall write our script in R, with the code written in RStudio.&lt;/p&gt;

&lt;p&gt;To achieve our goal, we shall use an R package called “tm”. This package supports all the text mining functions we need, like loading data, cleaning data, and building a term matrix. It is available on CRAN.&lt;/p&gt;

&lt;h5&gt;Let’s install and load the package in our workspace to begin with.&lt;/h5&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;#downloading and installing the package from CRAN&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;install.packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"tm"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#loading tm&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h1&gt;
  
  
  Loading Data
&lt;/h1&gt;

&lt;p&gt;Text to be mined can be loaded into R from different source formats. It can come from text files (.txt), PDFs (.pdf), CSV files (.csv), etc., but no matter the source format, to be used in the tm package it must be turned into a “corpus”.&lt;/p&gt;

&lt;p&gt;A corpus is defined as “a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject”.&lt;br&gt;
The tm package uses the Corpus() function to create a corpus.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;#loading a text file from local computer&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;newdata&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#Load data as corpus&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#VectorSource() creates character vectors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Corpus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VectorSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newdata&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Refer to this guide to learn more about importing files into R.&lt;/p&gt;

&lt;h1&gt;
  
  
  Cleaning Data.
&lt;/h1&gt;

&lt;p&gt;Once we have successfully loaded the data into the workspace, it is time to clean it. Our goal at this step is to create independent terms (words) from the data file before we start counting how frequently they appear.&lt;/p&gt;

&lt;p&gt;Since R is case sensitive, we shall first convert the entire text to lowercase to avoid treating the same word, like “write” and “Write”, as two different terms.&lt;/p&gt;

&lt;p&gt;We shall remove URLs, emojis, non-English words, punctuation, numbers, whitespace, and stop words.&lt;br&gt;
Stop words: the commonly used English words like “a”, “is”, and “the” are referred to in the tm package as stop words. These words have to be eliminated to make the results more accurate. It is also possible to create your own custom stop words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# convert to lower case&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tm_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;content_transformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tolower&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#remove ������ what would be emojis&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;tm_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;content_transformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gsub&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"\\W"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# remove URLs&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;removeURL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gsub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"http[^[:space:]]*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tm_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;content_transformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;removeURL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# remove anything other than English letters or space&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;removeNumPunct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gsub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"[^[:alpha:][:space:]]*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tm_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;content_transformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;removeNumPunct&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# remove stopwords&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tm_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;removeWords&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"english"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#u can create custom stop words using the code below.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#myStopwords &amp;lt;- c(setdiff(stopwords('english'), c("r", "big")),"use", "see", "used", "via", "amp")&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;#mydata &amp;lt;- tm_map(mydata, removeWords, myStopwords)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# remove extra whitespace&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tm_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;stripWhitespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Remove numbers&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tm_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;removeNumbers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Remove punctuations&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tm_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;removePunctuation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Stemming
&lt;/h1&gt;

&lt;p&gt;Stemming is the process of reducing words of similar origin to a single base form, for example “communication”, “communicates”, and “communicate”. Stemming helps us increase accuracy in our mined text by removing suffixes and reducing words to their basic forms. We shall use the SnowballC library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SnowballC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tm_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;stemDocument&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Building a term Matrix and Revealing word frequencies
&lt;/h1&gt;

&lt;p&gt;After the cleaning process, we are left with the independent terms that occur throughout the document. These are stored in a matrix that records each term’s occurrences. Because it logs the number of times each term appears in our clean data set, it is called a term matrix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;#create a term matrix and store it as dtm&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;dtm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;TermDocumentMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mydata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Word frequencies are the number of times each word appears in the data set. Using the counts compiled in the term matrix, word frequencies rank the terms from the most frequently used in the data set to the least used.&lt;/p&gt;
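As a minimal sketch of the idea in base R (the docs vector below is hypothetical example data standing in for our corpus), word frequencies can be computed by splitting text into words and tabulating them:

```r
# Toy cleaned text standing in for the corpus (hypothetical example data)
docs = c("data science is fun", "text mining is part of data science")

# Split each document into words and count how often each term occurs
words = unlist(strsplit(docs, " "))
freqs = sort(table(words), decreasing = TRUE)
```

With the tm term matrix built above, the equivalent is sorting the row sums of as.matrix(dtm) in decreasing order.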

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We have just written a basic text mining script; however, it is just the beginning of text mining. The ability to take text in its raw format and clean it to this point gives us a path to things like building a word cloud, sentiment analysis, and building models.&lt;br&gt;
Hold on to this script because it will come in handy when we start doing sentiment analysis.&lt;br&gt;
Feel free to reach out to me with any questions &amp;gt; &lt;a class="mentioned-user" href="https://dev.to/lornamariak"&gt;@lornamariak&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>textmining</category>
    </item>
    <item>
      <title>"R libraries to aid you to learn data science in 2018" </title>
      <dc:creator>El Marie 💜</dc:creator>
      <pubDate>Wed, 03 Jan 2018 13:16:14 +0000</pubDate>
      <link>https://dev.to/lornamariak/r-libraries-to-aid-you-to-learn-data-science-in-2018-53oc</link>
      <guid>https://dev.to/lornamariak/r-libraries-to-aid-you-to-learn-data-science-in-2018-53oc</guid>
      <description>

&lt;p&gt;2018 is already here! What a year 2017 has been!&lt;br&gt;For someone who started learning data science late in the year, it feels like the year has been short. The R learning curve may seem steep; however, continuous exposure to different tools and libraries/packages can make your experience simpler.&lt;br&gt;
In this article, I share with you R packages, under different branches of data science, that have made my learning journey worthwhile so far.&lt;/p&gt;

&lt;h3&gt;Data Visualization&lt;/h3&gt; 
This is a very instrumental part of data science. For a data science newbie, the ability to create great visualizations gives you hope that you are on the right track. With great data visualizations comes a sense of appreciation for your work, especially from non-data scientists. The following packages will come in handy while visualizing in R.

&lt;h4&gt;1.ggplot2 &lt;/h4&gt;
This is an R package that makes all that visualization work much easier. It implements the grammar of graphics, takes care of plotting details, offers many graphical options, and does great graph layering.&lt;br&gt;
It is available on CRAN.
Here is a great ggplot2 cheat sheet to get you started: &lt;a href="https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf"&gt;ggplot2 cheat sheet&lt;/a&gt;
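As a minimal sketch (assuming ggplot2 is installed, and using the built-in mtcars data set), a basic layered plot looks like this:

```r
library(ggplot2)

# Scatter plot of car weight against fuel efficiency, built as layers
p = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# In an interactive session, print(p) draws the plot
```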

&lt;h4&gt;2.shiny &lt;/h4&gt; 
This is an R package that gives users the power to build dashboards and web apps. Shiny helps a lot with data collection and manipulation in real time, as it handles reactivity in a great way. Shiny apps can make use of HTML widgets, CSS themes, and JavaScript actions to interface with R scripts. It is an awesome library for someone interested in data storytelling on their website.&lt;br&gt;
shiny is available on CRAN.
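As a toy sketch of that reactivity (assuming shiny is installed; the inputs here are only illustrative, not a real dashboard), a slider can drive a text output:

```r
library(shiny)

# UI: a slider input and a text output
ui = fluidPage(
  sliderInput("n", "Sample size", min = 1, max = 100, value = 10),
  textOutput("msg")
)

# Server: the output re-renders whenever the slider moves
server = function(input, output) {
  output$msg = renderText(paste("You chose", input$n, "points"))
}

app = shinyApp(ui, server)
# runApp(app) starts the app in a browser
```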

&lt;h3&gt;Data Wrangling&lt;/h3&gt;  
One of the goals of every data scientist should be maximizing data analysis time. To achieve this, one needs to ensure the data they are working with is as clean as possible and can be manipulated easily. Data wrangling is the process of cleaning up data, removing redundancy, and organizing it in a way that makes analysis much easier. The following packages are great and simple data wrangling tools.

&lt;h4&gt;1.tidyr&lt;/h4&gt;
From the tidyr &lt;a href="http://tidyr.tidyverse.org/"&gt;website&lt;/a&gt;, tidy data is defined as data where:&lt;br&gt;
&lt;ul&gt;
&lt;li&gt;Each variable is in a column.&lt;/li&gt;
&lt;li&gt;Each observation is a row.&lt;/li&gt;
&lt;li&gt;Each value is a cell.&lt;/li&gt;
&lt;/ul&gt;
tidyr makes use of simple verbs as R functions, like gather(), to carry out quick data tidying operations on large datasets.
tidyr is available on CRAN.
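As a small sketch of those verbs (the scores data frame below is hypothetical example data), gather() turns wide year columns into tidy key/value rows:

```r
library(tidyr)

# A "wide" data frame: one column per year (hypothetical example data)
scores = data.frame(student = c("A", "B"),
                    y2016 = c(70, 80),
                    y2017 = c(75, 85))

# gather() moves the year columns into year/score pairs, one observation per row
tidy_scores = gather(scores, key = "year", value = "score", y2016, y2017)
```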

&lt;h4&gt;2.dplyr&lt;/h4&gt;
While dealing with data, there are common manipulations that have to be carried out, and dplyr helps by providing verb functions for these manipulations. It helps you filter your data and carry out operations that group the data for deeper meaning.
dplyr is available on CRAN.
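As a brief sketch using the built-in mtcars data set (assuming dplyr is installed), the verb functions chain together with the %&gt;% pipe:

```r
library(dplyr)

# Keep heavier cars, group them by cylinder count, and average their mpg
result = mtcars %>%
  filter(wt > 2) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))
```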


&lt;h3&gt;Data Mining &lt;/h3&gt; 
This is one of the biggest challenges for data science newbies. Although very many websites are full of free, open data sets, it is also an accomplishing feeling for a data science newbie to learn how to extract a data set from the numerous sources of information on and off the web.
The following libraries will do the magic:

&lt;h4&gt;1.httr &lt;/h4&gt; 
This package will enable you to access data via modern web APIs. It makes use of HTTP verb functions, requests return JSON data that can be parsed into R objects, and it supports OAuth. This makes it easy for a newbie working with APIs in R.
This package is available on CRAN.
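As a hedged sketch (the endpoint here is GitHub's public API root, picked only for illustration; it needs a network connection):

```r
library(httr)

# Send a GET request; the response object carries status, headers, and body
resp = GET("https://api.github.com")

status_code(resp)                      # e.g. 200 on success
parsed = content(resp, as = "parsed")  # JSON body parsed into an R list
```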


&lt;h4&gt;2.rvest&lt;/h4&gt; 
An R package for web scraping. It reads HTML documents from URLs, selects parts of the document using CSS selectors, and parses HTML tables into data frames in R.
This package is available on CRAN.
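As a hedged sketch (the Wikipedia URL is only an illustration; any page containing a table works, and a network connection is required):

```r
library(rvest)

# Read a page and parse its HTML tables into a list of data frames
page = read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
tables = html_table(html_nodes(page, "table"), fill = TRUE)
```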


&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;
The first days of data science can be a bit confusing; however, focusing on each of these branches can help you understand data science step by step.
I wish you a great learning experience in 2018. Don’t stop learning. &lt;br&gt;
Feel free to reach out to me via Twitter &lt;a href="https://twitter.com/lornamariak"&gt;@lornamariak&lt;/a&gt;. I am happy to help and give some hype/support. Happy coding!
&lt;/p&gt;




</description>
      <category>beginners</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
