<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yankho Chimpesa</title>
    <description>The latest articles on DEV Community by Yankho Chimpesa (@yankho817).</description>
    <link>https://dev.to/yankho817</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1025006%2F7a373c03-74a7-48a9-a7e3-4ba46ed1090d.png</url>
      <title>DEV Community: Yankho Chimpesa</title>
      <link>https://dev.to/yankho817</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yankho817"/>
    <language>en</language>
    <item>
      <title>Getting Started with Sentiment Analysis</title>
      <dc:creator>Yankho Chimpesa</dc:creator>
      <pubDate>Sat, 25 Mar 2023 10:07:17 +0000</pubDate>
      <link>https://dev.to/yankho817/getting-started-with-sentiment-analysis-1dp4</link>
      <guid>https://dev.to/yankho817/getting-started-with-sentiment-analysis-1dp4</guid>
      <description>&lt;p&gt;Sentiment analysis is the process of analyzing digital text to determine if the emotional tone of the message is positive, negative, or neutral. &lt;/p&gt;

&lt;p&gt;The implications of sentiment analysis for business productivity are hard to overstate. Sentiment analysis is one of the common NLP tasks that every data scientist needs to be able to perform.&lt;/p&gt;

&lt;p&gt;Using even basic sentiment analysis, you can understand whether the sentiment behind a piece of text is positive, negative, or neutral.&lt;/p&gt;

&lt;p&gt;Think about a scenario where you are a student enrolled in an online course and you run into a problem, so you post it on the class forum. &lt;br&gt;
The course staff can then use sentiment analysis not only to identify the topic you are struggling with, but also to gauge how frustrated or discouraged you are, and tailor their responses to that sentiment.&lt;/p&gt;

&lt;p&gt;Under the umbrella of text mining, sentiment analysis is routinely used to determine the voice of the customer in feedback materials and channels like reviews, surveys, web articles, and social media. As language evolves, it can become increasingly challenging to understand intent through these channels and defaulting to dictionary definitions may lead to inaccurate readings.&lt;/p&gt;

&lt;p&gt;Today, companies have large volumes of text data like emails, customer support chat transcripts, social media comments, and reviews. Sentiment analysis tools can scan this text to automatically determine the author’s attitude towards a topic. Companies use the insights from sentiment analysis to improve customer service and increase brand reputation. &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/n4L5hHFcGVk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is sentiment analysis important?
&lt;/h3&gt;

&lt;p&gt;Sentiment analysis, also known as opinion mining, is an important business intelligence tool that helps companies improve their products and services. &lt;/p&gt;

&lt;p&gt;A closely related technique, known as aspect-based sentiment analysis in Natural Language Processing (NLP), provides more granular information about the opinions attached to specific words (such as the attributes of products or services) in text.&lt;/p&gt;

&lt;p&gt;We give some benefits of sentiment analysis below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provide objective insights&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Businesses can avoid personal bias associated with human reviewers by using artificial intelligence (AI)–based sentiment analysis tools. As a result, companies get consistent and objective results when analyzing customers’ opinions.&lt;/p&gt;

&lt;p&gt;For example, consider the following sentence: &lt;/p&gt;

&lt;p&gt;&lt;em&gt;I'm amazed by the speed of the processor but disappointed that it heats up quickly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Marketers might dismiss the discouraging part of the review and be positively biased towards the processor's performance. However, accurate sentiment analysis tools sort and classify text to pick up emotions objectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build better products and services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A sentiment analysis system helps companies improve their products and services based on genuine and specific customer feedback. AI technologies identify real-world objects or situations (called entities) that customers associate with negative sentiment. From the above example, product engineers focus on improving the processor's heat management capability because the text analysis software associated disappointed (negative) with processor (entity) and heats up (entity).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze at scale&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Businesses constantly mine information from a vast amount of unstructured data, such as emails, chatbot transcripts, surveys, customer relationship management records, and product feedback. Cloud-based sentiment analysis tools allow businesses to scale the process of uncovering customer emotions in textual data at an affordable cost. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Businesses must be quick to respond to potential crises or market trends in today's fast-changing landscape. Marketers rely on sentiment analysis software to learn what customers feel about the company's brand, products, and services in real time and take immediate actions based on their findings. They can configure the software to send alerts when negative sentiments are detected for specific keywords.&lt;/p&gt;

&lt;h3&gt;
  
  
  Working Principles
&lt;/h3&gt;

&lt;p&gt;Sentiment analysis uses several technologies to distill all your customers’ words into a single, actionable item. The process of sentiment analysis follows these four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Breaking down the text into components: sentences, phrases, tokens, and parts of speech.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identifying each phrase and component.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assigning a sentiment score to each phrase with plus or minus points.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Combining scores for a final sentiment analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By recording descriptive words and phrases and assigning each a sentiment weight, you and your team can build a sentiment library. Through manual scoring, your team decides how strong or weak each word should be, and the polarity of the corresponding phrase score, noting whether it is positive, negative, or neutral. Multilingual sentiment analysis engines must also maintain a unique library for every language they support, through consistent scoring, new phrases, and the removal of irrelevant terms.&lt;/p&gt;

&lt;p&gt;Sentiment analysis is an application of natural language processing (NLP) technologies that train computer software to understand text in ways similar to humans. The analysis typically goes through several stages before providing the final result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preprocessing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During the preprocessing stage, sentiment analysis identifies key words to highlight the core message of the text.&lt;/p&gt;

&lt;p&gt;Tokenization breaks a sentence into several elements or tokens.&lt;br&gt;
Lemmatization converts words into their root form. For example, the root form of am is be.&lt;br&gt;
Stop-word removal filters out words that don't add meaningful value to the sentence. For example, with, for, at, and of are stop words. &lt;/p&gt;
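&lt;p&gt;The three preprocessing steps above can be sketched in plain Python. This is a toy illustration (the stop-word list and lemma table are tiny, made-up samples); real projects typically use libraries such as NLTK or spaCy.&lt;/p&gt;

```python
import re

# Tiny, made-up resources for illustration only
STOP_WORDS = {"with", "for", "at", "of", "the", "a", "an"}
LEMMAS = {"am": "be", "is": "be", "are": "be", "heats": "heat"}

def preprocess(text):
    # 1. Tokenization: split the sentence into lowercase word tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    # 2. Lemmatization: map each token to its root form where known
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    # 3. Stop-word removal: drop words that add little meaning
    return [tok for tok in lemmas if tok not in STOP_WORDS]

print(preprocess("I am amazed at the speed of the processor"))
```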

&lt;p&gt;&lt;strong&gt;Keyword analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;NLP technologies further analyze the extracted keywords and give them a sentiment score. A sentiment score is a measurement scale that indicates the emotional element in the sentiment analysis system. It provides a relative perception of the emotion expressed in text for analytical purposes. For example, researchers might use 10 to represent satisfaction and 0 for disappointment when analyzing customer reviews.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approaches to sentiment analysis
&lt;/h3&gt;

&lt;p&gt;There are three main approaches used by sentiment analysis software.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule-based&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The rule-based approach identifies, classifies, and scores specific keywords based on predetermined lexicons. Lexicons are compilations of words representing the writer's intent, emotion, and mood. Marketers assign sentiment scores to positive and negative lexicons to reflect the emotional weight of different expressions. To determine if a sentence is positive, negative, or neutral, the software scans for words listed in the lexicon and sums up the sentiment score. The final score is compared against the sentiment boundaries to determine the overall emotional bearing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule-based analysis example&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Consider a system with words like happy, affordable, and fast in the positive lexicon and words like poor, expensive, and difficult in the negative lexicon. Marketers assign positive word scores from +5 to +10 and negative word scores from -1 to -10. Special rules identify double negatives, such as not bad, as a positive sentiment. Marketers decide that an overall sentiment score above 3 is positive, below -3 is negative, and anything from -3 to 3 is labeled as mixed sentiment. &lt;/p&gt;
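&lt;p&gt;Following the example above, a minimal rule-based scorer can be sketched in Python. The lexicon values, the handling of "not bad", and the thresholds all mirror the illustrative numbers in this section; they are not from a real product.&lt;/p&gt;

```python
# Illustrative lexicons and thresholds, following the example in the text
POSITIVE = {"happy": 7, "affordable": 5, "fast": 6}
NEGATIVE = {"poor": -6, "expensive": -5, "difficult": -4, "bad": -5}

def rule_based_score(text):
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    score, skip = 0, False
    for i, tok in enumerate(tokens):
        if skip:          # token already consumed by a double-negative rule
            skip = False
            continue
        nxt = tokens[i + 1] if i + 1 != len(tokens) else None
        if tok == "not" and nxt in NEGATIVE:
            # Special rule: a double negative like "not bad" is mildly positive
            score += abs(NEGATIVE[nxt]) // 2
            skip = True
        else:
            score += POSITIVE.get(tok, 0) + NEGATIVE.get(tok, 0)
    if score > 3:
        return "positive"
    if score >= -3:
        return "mixed"
    return "negative"

print(rule_based_score("The service was fast and affordable"))
```

&lt;p&gt;Scaling this up is exactly where the approach gets hard: every new keyword, negation pattern, and cultural idiom needs another lexicon entry or rule.&lt;/p&gt;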

&lt;p&gt;&lt;em&gt;Pros and cons&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A rule-based sentiment analysis system is straightforward to set up, but it's hard to scale. For example, you'll need to keep expanding the lexicons when you discover new keywords for conveying intent in the text input. Also, this approach may not be accurate when processing sentences influenced by different cultures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Machine Learning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This approach uses machine learning (ML) techniques and sentiment classification algorithms, such as neural networks and deep learning, to teach computer software to identify emotional sentiment from text. This process involves creating a sentiment analysis model and training it repeatedly on known data so that it can guess the sentiment in unknown data with high accuracy. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Training&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;During the training, data scientists use sentiment analysis datasets that contain large numbers of examples. The ML software uses the datasets as input and trains itself to reach the predetermined conclusion. By training with a large number of diverse examples, the software differentiates and determines how different word arrangements affect the final sentiment score.&lt;/p&gt;
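&lt;p&gt;As a sketch of this training loop, here is a tiny Naive Bayes classifier written from scratch on a four-example dataset. It is a toy stand-in for the neural network and deep learning models mentioned above; real systems train on far larger labeled datasets.&lt;/p&gt;

```python
import math
from collections import Counter

def train_nb(docs):
    # docs: list of (text, label) pairs
    counts, totals, labels = {}, Counter(), Counter()
    for text, label in docs:
        labels[label] += 1
        for word in text.lower().split():
            counts.setdefault(label, Counter())[word] += 1
            totals[label] += 1
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, labels, vocab

def predict_nb(model, text):
    counts, totals, labels, vocab = model
    n_docs = sum(labels.values())
    best_label, best_score = None, float("-inf")
    for label in labels:
        # log prior plus log likelihood of each word, with Laplace smoothing
        score = math.log(labels[label] / n_docs)
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / (totals[label] + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [("great fast excellent", "positive"), ("love the screen", "positive"),
        ("terrible slow awful", "negative"), ("hate the battery", "negative")]
model = train_nb(docs)
print(predict_nb(model, "fast and excellent"))
```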

&lt;p&gt;&lt;em&gt;Pros and cons&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ML sentiment analysis is advantageous because it processes a wide range of text information accurately. As long as the software undergoes training with sufficient examples, ML sentiment analysis can accurately predict the emotional tone of the messages. However, a trained ML model is specific to one business area. This means sentiment analysis software trained with marketing data cannot be used for social media monitoring without retraining. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hybrid sentiment analysis combines ML and rule-based systems. It uses features of both methods to optimize speed and accuracy when deriving contextual intent in text. However, it takes time and technical effort to bring the two different systems together. &lt;/p&gt;

&lt;h3&gt;
  
  
  Different types of sentiment analysis
&lt;/h3&gt;

&lt;p&gt;Businesses use different types of sentiment analysis to understand how their customers feel when interacting with products or services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-grained scoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fine-grained sentiment analysis refers to categorizing the text intent into multiple levels of emotion. Typically, the method involves rating user sentiment on a scale of 0 to 100, with each equal segment representing very negative, negative, neutral, positive, and very positive.&lt;/p&gt;
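&lt;p&gt;A minimal mapping from a 0-100 score into five equal segments might look like the following Python function. The 20-point bucket boundaries are an assumption based on the equal-segment description above.&lt;/p&gt;

```python
def fine_grained_label(score):
    # Five equal 20-point segments across the 0-100 scale (assumed boundaries)
    labels = ["very negative", "negative", "neutral", "positive", "very positive"]
    bucket = min(int(score) // 20, 4)   # clamp a score of 100 into the top segment
    return labels[bucket]

print(fine_grained_label(85))
```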

&lt;p&gt;&lt;strong&gt;Aspect-based&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Aspect-based analysis focuses on particular aspects of a product or service. For example, laptop manufacturers survey customers on their experience with sound, graphics, keyboard, and touchpad. They use sentiment analysis tools to connect customer intent with hardware-related keywords. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intent-based&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Intent-based analysis helps understand customer sentiment when conducting market research. Marketers use opinion mining to understand the position of a specific group of customers in the purchase cycle. They run targeted campaigns on customers interested in buying after picking up words like discounts, deals, and reviews in monitored conversations. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emotional detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Emotional detection involves analyzing the psychological state of a person when they are writing the text. Emotional detection is a more complex discipline of sentiment analysis, as it goes deeper than merely sorting into categories. In this approach, sentiment analysis models attempt to interpret various emotions, such as joy, anger, sadness, and regret, through the person's choice of words. &lt;/p&gt;

&lt;h3&gt;
  
  
  Sentiment analysis use cases
&lt;/h3&gt;

&lt;p&gt;Businesses use sentiment analysis to derive intelligence and form actionable plans in different areas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improve customer service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Customer support teams use sentiment analysis tools to personalize responses based on the mood of the conversation. Artificial intelligence (AI) based chatbots with sentiment analysis capability spot urgent matters and escalate them to support personnel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brand monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Organizations constantly monitor mentions and chatter around their brands on social media, forums, blogs, news articles, and in other digital spaces. Sentiment analysis technologies allow the public relations team to be aware of related ongoing stories. The team can evaluate the underlying mood to address complaints or capitalize on positive trends. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Market research&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A sentiment analysis system helps businesses improve their product offerings by learning what works and what doesn't. Marketers can analyze comments on online review sites, survey responses, and social media posts to gain deeper insights into specific product features. They convey the findings to the product engineers who innovate accordingly. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track campaign performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Marketers use sentiment analysis tools to ensure that their advertising campaign generates the expected response. They track conversations on social media platforms and ensure that the overall sentiment is encouraging. If the net sentiment falls short of expectation, marketers tweak the campaign based on real-time data analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crisis prevention&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To monitor media publishing, sentiment analysis tools can collect mentions of predefined keywords in real time. Your public relations or customer success teams can use this information to inform their responses to negative posts, possibly shortening, or even averting, a social media crisis before it picks up speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges in sentiment analysis
&lt;/h3&gt;

&lt;p&gt;Despite advancements in natural language processing (NLP) technologies, understanding human language is challenging for machines. They may misinterpret finer nuances of human communication such as those given below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sarcasm&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is extremely difficult for a computer to analyze sentiment in sentences that contain sarcasm. Consider the following sentence: &lt;em&gt;Yeah, great. It took three weeks for my order to arrive.&lt;/em&gt; Unless the computer analyzes the sentence with a complete understanding of the scenario, it will label the experience as positive based on the word great.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Negation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Negation is the use of negative words to convey a reversal of meaning in a sentence. For example: &lt;em&gt;I wouldn't say the subscription was expensive.&lt;/em&gt; Sentiment analysis algorithms might have difficulty interpreting such sentences correctly, particularly if the negation spans two sentences, such as: &lt;em&gt;I thought the subscription was cheap. It wasn't.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multipolarity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multipolarity occurs when a sentence contains more than one sentiment. For example, a product review reads, I'm happy with the sturdy build but not impressed with the color. It becomes difficult for the software to interpret the underlying sentiment. You'll need to use aspect-based sentiment analysis to extract each entity and its corresponding emotion. &lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In conclusion, getting started with sentiment analysis can be a challenging but rewarding experience. Sentiment analysis has a wide range of applications across various industries and can provide valuable insights into customer behavior and opinions. By analyzing large amounts of text data, sentiment analysis can help businesses make informed decisions about their products, services, and marketing strategies.&lt;/p&gt;

&lt;p&gt;When getting started with sentiment analysis, it is important to have a clear understanding of the data you are working with and the goals of your analysis. Choosing the right tools and techniques for your project is also essential. Natural Language Processing (NLP) tools such as NLTK and spaCy can be helpful for pre-processing text data, while machine learning algorithms such as Naive Bayes, Support Vector Machines (SVM), and Random Forests can be used for classification tasks.&lt;/p&gt;

&lt;p&gt;One of the main challenges of sentiment analysis is the inherent subjectivity of human language. Words and phrases can have different meanings and connotations depending on context, cultural background, and personal experience. Therefore, it is important to develop a robust set of training data and validate the accuracy of your model regularly.&lt;/p&gt;

&lt;p&gt;Another important consideration when getting started with sentiment analysis is ethical concerns. It is essential to ensure that your analysis is not biased towards any particular group and that user privacy is respected. Additionally, it is important to consider the potential consequences of your analysis and ensure that it is being used for a beneficial purpose.&lt;/p&gt;

&lt;p&gt;Despite these challenges, sentiment analysis has the potential to provide valuable insights into customer opinions and behavior. By analyzing large amounts of text data, businesses can gain a better understanding of customer needs and preferences, as well as identify potential issues with their products or services. This can lead to more informed decision-making and improved customer satisfaction.&lt;/p&gt;

&lt;p&gt;In addition to businesses, sentiment analysis can also be used in a variety of other applications. For example, it can be used to analyze social media posts to track public opinion on political issues or to monitor customer sentiment towards a particular brand. It can also be used in healthcare to analyze patient feedback and improve the quality of care.&lt;/p&gt;

&lt;p&gt;Overall, getting started with sentiment analysis requires a combination of technical skills, domain expertise, and ethical considerations. With the right tools and techniques, businesses and organizations can gain valuable insights into customer behavior and opinions, which can lead to improved decision-making and better outcomes for all stakeholders. As the field of sentiment analysis continues to evolve, it is important to stay up-to-date with the latest techniques and trends to ensure that your analysis remains relevant and effective.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>Essential SQL Commands For Data Science</title>
      <dc:creator>Yankho Chimpesa</dc:creator>
      <pubDate>Sun, 12 Mar 2023 08:51:05 +0000</pubDate>
      <link>https://dev.to/yankho817/essential-sql-commands-for-data-science-3c2l</link>
      <guid>https://dev.to/yankho817/essential-sql-commands-for-data-science-3c2l</guid>
      <description>&lt;p&gt;Data is naturally at the heart of the job of a data scientist or data analyst. You can get your information from a variety of sources. &lt;br&gt;
Because data is frequently stored in a SQL database, understanding SQL query commands is often required to perform this role successfully.&lt;br&gt;
This article will introduce you to some of the more basic commands, as well as some of the more advanced operations that will be useful to you as a data analyst or data scientist. &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/D8Aq_0spVW0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The commands are classified based on multiple operations such as simple data retrieval, aggregations, joins and complex conditions.&lt;/p&gt;

&lt;p&gt;The following are some of the essential SQL commands you need to have knowledge of as a data scientist: &lt;/p&gt;

&lt;h3&gt;
  
  
  SELECT
&lt;/h3&gt;

&lt;p&gt;The SELECT command is used to retrieve data from a database. It specifies which columns to return and, together with clauses such as WHERE, which rows. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * 
FROM 
neighbourhoods

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;neighbourhood_id&lt;/th&gt;
&lt;th&gt;neighbourhood&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Ashfield&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Bankstown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Blacktown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Burwood&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Botany Bay&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this example, we are selecting all columns from a table called neighbourhoods.&lt;/p&gt;

&lt;p&gt;The * operator selects all columns in a table.&lt;/p&gt;
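&lt;p&gt;You can reproduce this query with Python's built-in sqlite3 module. The table and rows below are recreated from the sample output above, in an in-memory database.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE neighbourhoods (neighbourhood_id INTEGER, neighbourhood TEXT)")
conn.executemany(
    "INSERT INTO neighbourhoods VALUES (?, ?)",
    [(0, "Ashfield"), (1, "Bankstown"), (2, "Blacktown"), (3, "Burwood"), (4, "Botany Bay")],
)

# SELECT * returns every column of every row
rows = conn.execute("SELECT * FROM neighbourhoods").fetchall()
for row in rows:
    print(row)
```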

&lt;h3&gt;
  
  
  FROM
&lt;/h3&gt;

&lt;p&gt;The FROM command specifies the table or tables from which to retrieve data. In this example, we retrieve data from a table called names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * 
FROM 
names

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;reg_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Astrid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Barin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Blaje&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Brian&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Cody&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you need to retrieve data from multiple tables, you can use a JOIN statement. We will cover JOIN in more detail later in this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  WHERE
&lt;/h3&gt;

&lt;p&gt;The WHERE command is used to filter data based on a specified condition, narrowing the results down to only those rows that meet it. &lt;/p&gt;

&lt;p&gt;Here is an example:&lt;br&gt;
In this example, we adapt the query to display host_id and host, restrict the results to a particular neighbourhood (say, neighbourhood_id 35), and sort the output by host_id.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1/ Fetch only host_id, host from the listings table
# 2/ Make sure you filtered the data to just neighbourhood_id=35
# 3/ Make sure the output is sorted by host_id in descending order

SELECT host_id, host FROM listings
WHERE neighbourhood_id=35
ORDER BY host_id DESC

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;host_id&lt;/th&gt;
&lt;th&gt;host&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;285488167&lt;/td&gt;
&lt;td&gt;Rick&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;185783910&lt;/td&gt;
&lt;td&gt;Tiina&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;109067745&lt;/td&gt;
&lt;td&gt;Annie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;41506490&lt;/td&gt;
&lt;td&gt;Andrew&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
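&lt;p&gt;Here is a runnable sketch of the same filtered, sorted query using Python's sqlite3 module, against a small made-up listings table.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (host_id INTEGER, host TEXT, neighbourhood_id INTEGER)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [(41506490, "Andrew", 35), (285488167, "Rick", 35),
     (109067745, "Annie", 35), (99999999, "Other", 12)],
)

# WHERE keeps only neighbourhood 35; ORDER BY ... DESC sorts host_id downwards
rows = conn.execute(
    "SELECT host_id, host FROM listings "
    "WHERE neighbourhood_id = 35 "
    "ORDER BY host_id DESC"
).fetchall()
print(rows)
```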

&lt;h3&gt;
  
  
  GROUP BY
&lt;/h3&gt;

&lt;p&gt;The GROUP BY command is used to group the data based on one or more columns. It is used to aggregate data based on the grouping columns.&lt;/p&gt;

&lt;p&gt;GROUP BY is typically paired with an aggregate function:&lt;br&gt;
COUNT: total number of rows&lt;br&gt;
SUM: sum of all the values&lt;br&gt;
MAX: maximum value&lt;br&gt;
MIN: minimum value&lt;br&gt;
AVG: average value &lt;/p&gt;

&lt;p&gt;Here is an example:&lt;/p&gt;

&lt;p&gt;We're now interested in tracking all neighbourhoods in which we are "over-represented". Let's first count all the occurrences of each neighbourhood in our listings table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Instructions: 
# 1/ Fetch neighbourhood_id from the listings table
# 2/ For the second column get the number of listings in each neighbourhood

SELECT neighbourhood_id,
COUNT(neighbourhood_id)
FROM listings
GROUP BY neighbourhood_id

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;neighbourhood_id&lt;/th&gt;
&lt;th&gt;COUNT(neighbourhood_id)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  HAVING
&lt;/h3&gt;

&lt;p&gt;The HAVING command is used to filter the data after it has been grouped. It is used to filter out groups that do not meet a specified condition. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT listing_id, COUNT(host_id) as count
FROM reviews
GROUP BY host_name
HAVING COUNT(host_id) &amp;gt; 10;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we are selecting listing_id and counting the number of values in host_id for each group of values in listing_id. We then use the HAVING clause to filter the results so that only groups with a count greater than 10 are included in the results.&lt;/p&gt;
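&lt;p&gt;A runnable version of a grouped-and-filtered query like this one, using Python's sqlite3 module with a made-up reviews table:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (listing_id INTEGER, host_id INTEGER)")
# Listing 1 gets 12 reviews, listing 2 only 3
sample = [(1, i) for i in range(12)] + [(2, i) for i in range(3)]
conn.executemany("INSERT INTO reviews VALUES (?, ?)", sample)

# HAVING filters the groups after aggregation; only counts above 10 survive
rows = conn.execute(
    "SELECT listing_id, COUNT(host_id) AS count "
    "FROM reviews GROUP BY listing_id "
    "HAVING COUNT(host_id) > 10"
).fetchall()
print(rows)
```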

&lt;h3&gt;
  
  
  ORDER BY
&lt;/h3&gt;

&lt;p&gt;The ORDER BY command is used to sort the data based on one or more columns. It is used to sort the data in ascending or descending order. Here is an example:&lt;/p&gt;

&lt;p&gt;Find all the listings where we set our neighbourhood_id to 27 and "Private room".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Instructions: 
# 1/ Fetch host_id, host from the listings table
# 2/ Make sure you filtered the data to just neighbourhood_id=27 and room_type='Private room'
# 3/ Make sure the output is sorted by host_id in descending order

SELECT host_id, host 
FROM listings
WHERE 
neighbourhood_id=27 AND room_type='Private room'
ORDER BY host_id DESC

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  DISTINCT
&lt;/h3&gt;

&lt;p&gt;In SQL, the DISTINCT keyword is used to select only unique values from a column or set of columns. Here is an example of how to use the DISTINCT keyword:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT first_name
FROM Customers;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;first_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Edwin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;William&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Samuel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linda&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this example, we are selecting only the distinct values of first_name column from the table. The resulting query will return a list of unique values of the first_name column.&lt;/p&gt;

&lt;h3&gt;
  
  
  AS
&lt;/h3&gt;

&lt;p&gt;The AS command is used to create aliases, that is, to rename columns in the query output.&lt;br&gt;
In the example below, we rename customer_id to ID and first_name to Name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT customer_id AS ID,
       first_name AS Name
FROM Customers;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Edwin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;William&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Samuel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Linda&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  LIKE
&lt;/h3&gt;

&lt;p&gt;The LIKE command is used for pattern-based string filtering. You provide a pattern, and LIKE matches column values against it; the % wildcard matches any sequence of characters.&lt;br&gt;
Consider the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Instructions: 
# 1/ Fetch all columns from the listings table
# 2/ Make sure you filtered the data to names that start with Jos

SELECT 
*
FROM listings 
WHERE host LIKE 'Jos%'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;listing_id&lt;/th&gt;
&lt;th&gt;listing&lt;/th&gt;
&lt;th&gt;host_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;22296011&lt;/td&gt;
&lt;td&gt;Large private room on Camperdown park &amp;amp; Newtown&lt;/td&gt;
&lt;td&gt;10873080&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  JOIN
&lt;/h3&gt;

&lt;p&gt;In SQL, a JOIN statement is used to combine data from two or more tables based on a common column. Joining tables is a powerful way to retrieve data that is spread across multiple tables. There are several types of JOIN statements, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;INNER JOIN: An inner join returns only the rows that have matching values in both tables being joined. Here is an example:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.customer_id;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we are selecting the order_id from the orders table and the customer_name from the customers table where the customer_id in both tables matches.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LEFT JOIN: A left join returns all the rows from the left table (the table specified before the LEFT JOIN keyword) and the matching rows from the right table (the table specified after the LEFT JOIN keyword). If there are no matching rows in the right table, the result will contain NULL values for the right table columns. Here is an example:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we are selecting the customer_name from every row of the customers table, along with the order_id from the orders table wherever the customer_id values match. If there are no matching orders for a customer, the result will contain NULL values for the order_id column.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RIGHT JOIN: A right join is similar to a left join, but it returns all the rows from the right table and the matching rows from the left table. If there are no matching rows in the left table, the result will contain NULL values for the left table columns. Here is an example:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT customers.customer_name, orders.order_id
FROM customers
RIGHT JOIN orders
ON customers.customer_id = orders.customer_id;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we are selecting the order_id from every row of the orders table, along with the customer_name from the customers table wherever the customer_id values match. If there is no matching customer for an order, the result will contain NULL values for the customer_name column.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FULL OUTER JOIN: A full outer join returns all the rows from both tables and combines the matching rows from both tables. If there are no matching rows in one of the tables, the result will contain NULL values for the columns of the table that has no matching rows. Here is an example:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT customers.customer_name, orders.order_id
FROM customers
FULL OUTER JOIN orders
ON customers.customer_id = orders.customer_id;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we are selecting the customer_name and order_id for every row of both tables, matched on customer_id where possible. If there is no matching customer for an order, or no matching order for a customer, the result will contain NULL values for the respective columns. Note that not all database management systems support the FULL OUTER JOIN syntax.&lt;/p&gt;
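&lt;p&gt;For example, MySQL does not support FULL OUTER JOIN. A common workaround, sketched here against the same customers and orders tables, is to combine a LEFT JOIN with the unmatched rows of the reversed join using UNION ALL (which, unlike UNION, keeps duplicate rows):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Emulating FULL OUTER JOIN where it is unavailable
SELECT customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id
UNION ALL
-- Add the orders that have no matching customer
SELECT customers.customer_name, orders.order_id
FROM orders
LEFT JOIN customers
ON customers.customer_id = orders.customer_id
WHERE customers.customer_id IS NULL;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;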

&lt;p&gt;These are the main types of JOIN statements in SQL. Understanding the different types of JOINs and when to use them is an important skill for data scientists who work with relational databases.&lt;/p&gt;

&lt;h3&gt;
  
  
  UNION
&lt;/h3&gt;

&lt;p&gt;In SQL, the UNION operator is used to combine the results of two or more SELECT statements into a single result set. Here are some examples of how to use the UNION operator in SQL:&lt;/p&gt;

&lt;p&gt;Simple UNION example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT host_id, host_name
FROM listings
UNION
SELECT reg_number, reg_name
FROM reviews;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we are selecting columns from two different tables and combining the results using the UNION operator. The resulting query will return every distinct row from both queries, with host_id/reg_number values in the first column and host_name/reg_name values in the second; UNION removes duplicate rows from the combined result.&lt;/p&gt;

&lt;p&gt;UNION with ORDER BY:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT host_id, host_name
FROM listings
UNION
SELECT reg_number, reg_name
FROM reviews
ORDER BY reg_number ASC;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we are using the UNION operator to combine the results of two SELECT statements, but we are also using the ORDER BY clause to sort the results by column1 in ascending order. The resulting query will return all unique combinations of host_id, reg_number and host_name, reg_name from both tables, sorted by reg_number.&lt;/p&gt;

&lt;p&gt;UNION with WHERE clause:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT host_id, host_name
FROM listings
WHERE purchase &amp;gt; 10
UNION
SELECT reg_number, reg_name
FROM reviews
WHERE order_price &amp;lt; 5;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we use the UNION operator to combine the results of two SELECT statements, but we also use WHERE clauses to filter the results of each SELECT statement before combining them.&lt;br&gt;
The resulting query will return all distinct rows that satisfy the condition in either WHERE clause.&lt;/p&gt;

&lt;p&gt;The UNION operator is an extremely useful tool for combining the results of multiple SELECT statements into a single result set.&lt;br&gt;
You can use the UNION operator to perform complex queries on your data and extract meaningful insights from it. &lt;/p&gt;
&lt;h3&gt;
  
  
  CASE
&lt;/h3&gt;

&lt;p&gt;The CASE statement is a powerful tool that allows you to perform conditional logic within a SQL query. With CASE, you can evaluate conditions and return a different value depending on which condition matches. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT item,
       amount,
       CASE
           WHEN amount &amp;lt; 1000 THEN 'Low'
           ELSE 'High'
       END AS Priority
FROM Orders;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;item&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Keyboard&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mouse&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitor&lt;/td&gt;
&lt;td&gt;18000&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keyboard&lt;/td&gt;
&lt;td&gt;900&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mousepad&lt;/td&gt;
&lt;td&gt;850&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
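&lt;p&gt;A CASE statement can also chain several WHEN branches, which are evaluated in order until one matches. As a sketch against the same Orders table (the thresholds here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT item,
       amount,
       CASE
           WHEN amount &amp;lt; 1000 THEN 'Low'
           WHEN amount &amp;lt; 10000 THEN 'Medium'
           ELSE 'High'
       END AS Priority
FROM Orders;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;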

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, SQL is a critical tool for any data scientist as it provides a powerful way to query, filter, and analyze data stored in relational databases. The ability to extract valuable insights from large datasets is a key component of data science, and SQL provides an efficient and effective way to accomplish this task.&lt;/p&gt;

&lt;p&gt;In this article, we covered some essential SQL commands that every data scientist should know. &lt;/p&gt;

&lt;p&gt;However, there are many other SQL commands and techniques that data scientists can use to enhance their data analysis skills. For instance, aggregating data with GROUP BY, using subqueries, and applying window functions can all help data scientists to analyze data more effectively. Additionally, using SQL together with tools such as Python, R, and visualization software provides even more advanced capabilities in data analysis.&lt;/p&gt;

&lt;p&gt;Finally, it's worth noting that while SQL is a powerful tool, it's not the only tool that data scientists should rely on. Other tools and techniques, such as machine learning, deep learning, and natural language processing, can also provide valuable insights into data. The key to successful data analysis is to use the right tools and techniques for the task at hand and to constantly learn and adapt as new technologies and methods emerge.&lt;/p&gt;

&lt;p&gt;Mastering SQL commands is an essential skill for data scientists looking to extract valuable insights from large datasets. &lt;br&gt;
By understanding how to connect to a database, retrieve data, filter data, and sort data, data scientists can effectively manipulate data and extract insights that will help them make informed business decisions. However, it's important to remember that SQL is just one tool in the data scientist's toolkit, and that the most successful data analysis requires a diverse set of skills and techniques.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>datascience</category>
      <category>database</category>
      <category>codenewbie</category>
    </item>
    <item>
<title>Exploratory Data Analysis (EDA) Ultimate Guide</title>
      <dc:creator>Yankho Chimpesa</dc:creator>
      <pubDate>Fri, 24 Feb 2023 20:24:48 +0000</pubDate>
      <link>https://dev.to/yankho817/exploratory-data-analysis-edaultimate-guide-174d</link>
      <guid>https://dev.to/yankho817/exploratory-data-analysis-edaultimate-guide-174d</guid>
      <description>&lt;p&gt;An important phase in data analysis and data science is exploratory data analysis (EDA), which involves looking at and visualizing data to comprehend its properties and interactions between variables.&lt;br&gt;
It aids in the discovery of patterns, outliers, and potential data issues. This article will serve as the ultimate guide to exploratory data analysis, including its definition, steps, and techniques. &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/MoM6mighOJM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Definition
&lt;/h2&gt;

&lt;p&gt;Exploratory data analysis is the process of studying data to highlight its key features using quantitative and visual techniques.&lt;br&gt;
It entails comprehending the structure of the data, spotting trends and connections, and looking for probable outliers or abnormalities.&lt;br&gt;
Gaining insights into the data, spotting potential issues, and getting the data ready for further analysis are the key objectives of EDA. As a result, it is widely regarded as one of the most important steps in a data science project, often estimated to take 70-80% of the total time spent on the project. &lt;/p&gt;

&lt;p&gt;Since EDA is an iterative process, the analysis may be honed or expanded in response to the findings of earlier analysis. &lt;/p&gt;
&lt;h2&gt;
  
  
  Types
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Univariate data analysis
&lt;/h3&gt;

&lt;p&gt;Univariate analysis is a type of exploratory data analysis (EDA) that examines the distribution and characteristics of a single variable at a time.&lt;br&gt;
The primary goal of univariate analysis is to understand the data's central tendency, variability, and distribution &lt;a href="https://www.geeksforgeeks.org/exploratory-data-analysis-eda-types-and-tools/"&gt;https://www.geeksforgeeks.org/exploratory-data-analysis-eda-types-and-tools/&lt;/a&gt;. &lt;br&gt;
Some common techniques used in univariate analysis include:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Descriptive Statistics&lt;/em&gt;: Descriptive statistics, such as mean, median, mode, range, and standard deviation, provide a summary of the central tendency, dispersion, and shape of the distribution of a variable.&lt;/p&gt;

&lt;p&gt;To calculate descriptive statistics such as mean, median, and standard deviation, we can use the NumPy library. Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)

print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std_dev)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Output
Mean: 5.5
Median: 5.5
Standard Deviation: 2.8722813232690143

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;em&gt;Frequency Distributions&lt;/em&gt;: Frequency distributions show how many times each value or range of values occurs in a variable. This helps to understand the shape of the distribution, such as whether it is symmetric or skewed.&lt;br&gt;
To create a frequency distribution, we can use the pandas library. Here's an example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

data = [1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8, 9, 10]

freq_dist = pd.Series(data).value_counts()

print(freq_dist)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Output

5     4
3     3
6     2
4     2
1     2
8     2
2     1
7     1
9     1
10    1
dtype: int64

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;em&gt;Histograms&lt;/em&gt;: Histograms are graphical representations of frequency distributions that use bars to show the frequency of each value or range of values in a variable. Histograms provide a visual representation of the distribution of the data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Box Plots&lt;/em&gt;: Box plots, also known as box-and-whisker plots, provide a graphical summary of the distribution of a variable. They show the median, quartiles, and outliers of the data.&lt;/p&gt;
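&lt;p&gt;The histogram and box plot described above can be sketched with matplotlib, reusing the small sample from the frequency-distribution example:&lt;br&gt;
&lt;/p&gt;

```python
import matplotlib.pyplot as plt

# The same small sample used in the frequency-distribution example
data = [1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8, 9, 10]

# Histogram: bar heights show how many values fall in each bin
counts, bin_edges, _ = plt.hist(data, bins=5)
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()

# Box plot: shows the median, quartiles, and any outliers
plt.boxplot(data)
plt.show()
```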

&lt;p&gt;&lt;em&gt;Probability Distributions&lt;/em&gt;: Probability distributions, such as the normal distribution, provide a mathematical model for the distribution of the data. They can be used to make predictions about the data and to test hypotheses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bivariate Analysis
&lt;/h3&gt;

&lt;p&gt;Bivariate analysis is a type of exploratory data analysis (EDA) in which the relationship between two variables is examined.&lt;br&gt;
The goal of bivariate analysis is to identify any patterns or trends in the data and to understand how the two variables are related to each other &lt;a href="https://www.analyticsvidhya.com/blog/2022/02/a-quick-guide-to-bivariate-analysis-in-python/"&gt;https://www.analyticsvidhya.com/blog/2022/02/a-quick-guide-to-bivariate-analysis-in-python/&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;There are several techniques that can be used to perform bivariate analysis, including:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scatter plots&lt;/em&gt; - Scatter plots are a visual way to explore the relationship between two variables. A scatter plot displays the values of two variables as points on a two-dimensional graph, with one variable represented on the x-axis and the other on the y-axis. The pattern of the points can provide insights into the relationship between the two variables. For example, if the points are clustered around a straight line, it suggests a linear relationship between the variables.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Correlation analysis&lt;/em&gt; - Correlation analysis is a statistical technique used to measure the strength and direction of the relationship between two variables. The correlation coefficient ranges from -1 to +1, with a value of -1 indicating a perfect negative correlation, a value of +1 indicating a perfect positive correlation, and a value of 0 indicating no correlation. Correlation analysis can help to identify the strength and direction of the relationship between two variables.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Covariance analysis&lt;/em&gt; - Covariance is a statistical measure that describes how two variables are related to each other. Covariance is similar to correlation, but it does not take into account the scale of the variables. A positive covariance indicates that the two variables tend to move together, while a negative covariance indicates that the two variables tend to move in opposite directions.&lt;/p&gt;
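&lt;p&gt;To make correlation and covariance concrete, here is a minimal sketch on made-up numbers (x and y are illustrative, not columns from a real dataset):&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np

# Two made-up variables that move together
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])

# Pearson correlation coefficient: close to +1 for a strong positive linear relationship
r = np.corrcoef(x, y)[0, 1]

# Covariance: positive when the variables tend to move together
cov_xy = np.cov(x, y)[0, 1]

print("correlation:", round(r, 3))
print("covariance:", round(cov_xy, 3))
```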

&lt;p&gt;&lt;em&gt;Heat maps&lt;/em&gt; - Heat maps are graphical representations of data that use color coding to represent the value of a variable. Heat maps can be used to explore the relationship between two variables by displaying the correlation matrix in a color-coded format. This allows us to quickly identify patterns and trends in the data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Regression analysis&lt;/em&gt; - Regression analysis is a statistical technique used to model the relationship between two variables. Regression analysis can be used to predict the value of one variable based on the value of another variable. For example, we could use regression analysis to predict the sales of a product based on the advertising spend.&lt;/p&gt;
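&lt;p&gt;A minimal regression sketch with NumPy's polyfit, using made-up advertising and sales figures (the numbers are illustrative):&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np

# Made-up example: predict sales from advertising spend
ad_spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
sales = np.array([25.0, 45.0, 65.0, 85.0, 105.0])

# Fit a line: sales = slope * ad_spend + intercept
slope, intercept = np.polyfit(ad_spend, sales, 1)

# Predict sales for a new advertising spend
predicted = slope * 60.0 + intercept
print("slope:", round(slope, 2))
print("predicted sales at a spend of 60:", round(predicted, 2))
```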

&lt;p&gt;By using these techniques, we can gain insights into the relationship between two variables and use this information to inform further analysis and modeling.&lt;/p&gt;
&lt;h3&gt;
  
  
  Multivariate analysis
&lt;/h3&gt;

&lt;p&gt;Multivariate analysis is a type of exploratory data analysis (EDA) that involves analyzing the relationship between three or more variables. The goal of multivariate analysis is to understand how multiple variables are related to each other and to identify any patterns or trends in the data &lt;a href="https://towardsdatascience.com/multivariate-analysis-going-beyond-one-variable-at-a-time-5d341bd4daca"&gt;https://towardsdatascience.com/multivariate-analysis-going-beyond-one-variable-at-a-time-5d341bd4daca&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are several techniques that can be used to perform multivariate analysis, including:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Factor analysis&lt;/em&gt; - Factor analysis is a statistical technique used to identify patterns in the relationship between multiple variables. Factor analysis reduces the number of variables by grouping them into a smaller number of factors, based on their correlation with each other.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cluster analysis&lt;/em&gt; - Cluster analysis is a statistical technique used to group similar objects or individuals based on their characteristics. Cluster analysis can be used to identify patterns in the data and to identify subgroups of individuals or objects.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Principal component analysis&lt;/em&gt; - Principal component analysis (PCA) is a statistical technique used to transform a large number of variables into a smaller number of principal components. PCA can be used to reduce the dimensionality of the data and to identify the most important variables.&lt;/p&gt;
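&lt;p&gt;A minimal PCA sketch with scikit-learn, using a small made-up matrix of three correlated variables:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 6 observations of 3 strongly correlated variables
X = np.array([
    [1.0, 2.0, 1.1],
    [2.0, 4.1, 2.0],
    [3.0, 6.0, 3.2],
    [4.0, 8.2, 3.9],
    [5.0, 9.9, 5.1],
    [6.0, 12.1, 6.0],
])

# Reduce the 3 variables to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("reduced shape:", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```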

&lt;p&gt;&lt;em&gt;Discriminant analysis&lt;/em&gt; - Discriminant analysis is a statistical technique used to classify individuals or objects into two or more groups based on their characteristics. Discriminant analysis can be used to identify the variables that are most important in distinguishing between the groups.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Canonical correlation analysis&lt;/em&gt; - Canonical correlation analysis is a statistical technique used to identify the relationship between two sets of variables. Canonical correlation analysis can be used to identify the variables that are most important in explaining the relationship between the two sets of variables.&lt;/p&gt;

&lt;p&gt;By using these techniques, we can gain insights into the relationship between multiple variables and use this information to inform further analysis and modeling. Multivariate analysis is particularly useful when working with large datasets or when exploring complex relationships between variables.&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Correlation coeffiecient

corr_df = df.corr()
f,ax=plt.subplots(figsize=(20,20))
sns.heatmap(corr_df,annot=True,fmt=".2f", ax=ax,linewidths=0.5,linecolor="yellow")
plt.xticks(rotation=45)
plt.yticks(rotation=45)
plt.title('Correlations coefficient of the data')
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Steps of Exploratory Data Analysis
&lt;/h3&gt;

&lt;p&gt;EDA is typically carried out in several steps, which include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data Collection: This involves gathering relevant data for analysis. Data can be collected from various sources, including public datasets, surveys, and databases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Cleaning: This step involves checking for missing data, errors, and outliers. The data is cleaned by removing duplicates, correcting data entry errors, and filling in missing values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Visualization: This step involves creating visualizations to identify patterns and relationships in the data. Common visualization techniques include scatter plots, histograms, and box plots.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Transformation: This step involves transforming the data to make it more suitable for analysis. This can include normalization, scaling, and standardization.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Modeling: This step involves creating models to describe the relationships between variables. Models can be simple, such as linear regression, or complex, such as decision trees or neural networks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Data Collection
&lt;/h4&gt;

&lt;p&gt;The first step in EDA is to collect relevant data for analysis. The data can be collected from various sources, such as public datasets, surveys, and databases. In Python, you can use libraries like pandas to read and manipulate data.&lt;/p&gt;

&lt;p&gt;See the Example below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#the pyforest library helps reduce listing multiple import statements

import pyforest

# Read data from a CSV file
df= pd.read_csv('IT Salary Survey EU  2020.csv')
df.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Once the data has been put into a pandas dataframe, you may start exploring it with a variety of tools and functions. &lt;/p&gt;
&lt;h4&gt;
  
  
  Data Cleaning
&lt;/h4&gt;

&lt;p&gt;Cleaning the data is the second phase of EDA.&lt;br&gt;
Data cleaning is the process of identifying and correcting errors, inconsistencies, and missing values in a dataset.&lt;br&gt;
It is a crucial step because it helps to ensure that the data is accurate, complete, and reliable.&lt;br&gt;
In Python, the pandas and dask libraries can be used to clean the data.&lt;/p&gt;

&lt;p&gt;The process of data cleaning in EDA typically involves the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data inspection: In this step, the data is visually inspected to identify any obvious errors or inconsistencies, such as missing values, outliers, or incorrect data types.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling missing values: Missing values can be handled by either removing the rows or filling in the missing values with an appropriate estimate, such as the mean or median.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling outliers: Outliers are data points that are significantly different from the other data points in the dataset. Outliers can be handled by removing them from the dataset or by transforming the data to reduce the impact of outliers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Normalization: Normalization is the process of transforming the data so that it follows a standard distribution. This can help to reduce the impact of outliers and make it easier to compare data points.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Validation: Data validation involves checking the data to ensure that it meets the requirements of the analysis. This includes checking for errors, inconsistencies, and other issues that could affect the validity of the results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transformation: Data transformation involves converting the data into a form that is suitable for analysis. This can include aggregating data, creating new variables, or converting variables into different formats. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, data cleaning is an important step in EDA because it ensures that the data is accurate, reliable, and fit for analysis.&lt;br&gt;
By cleaning the data, analysts and data scientists can gain valuable insights and make informed decisions based on the data.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Displaying the info and the last columns, rows of the dataset
df.tail()
df.info

# Check for missing values
data.isnull().sum()

# Fill in missing values
data.fillna(data.mean(), inplace=True)

# Remove duplicates
data.drop_duplicates(inplace=True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In the above code, we first check for any missing values in the data by using the &lt;em&gt;isnull()&lt;/em&gt; method, which returns a boolean dataframe indicating which cells are null or missing. We then use the &lt;em&gt;fillna()&lt;/em&gt; method to replace any missing values with the mean value of the column. Finally, we use the &lt;em&gt;drop_duplicates()&lt;/em&gt; method to remove any duplicate rows in the dataframe.&lt;/p&gt;
&lt;h4&gt;
  
  
  Data Visualization
&lt;/h4&gt;

&lt;p&gt;The third step in EDA is to create visualizations of the data.&lt;br&gt;
Data visualization is an essential part of exploratory data analysis (EDA). It involves the creation of graphical representations of the data that make it easier to understand and interpret. &lt;/p&gt;

&lt;p&gt;Data visualization helps to identify patterns, trends, and relationships in the data that may be difficult to discern from raw data alone. Visualizations in Python can be created using libraries such as Matplotlib and Seaborn. &lt;/p&gt;


&lt;p&gt;There are many types of data visualizations that can be used in EDA, including:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scatterplots&lt;/em&gt;: Scatterplots are used to visualize the relationship between two continuous variables. They show how the values of one variable are related to the values of another variable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Histograms&lt;/em&gt;: Histograms are used to visualize the distribution of a single continuous variable. They show how the values of the variable are spread across a range of values.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Bar charts&lt;/em&gt;: Bar charts are used to visualize the distribution of a categorical variable. They show the frequency or proportion of each category.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Box plots&lt;/em&gt;: Box plots are used to visualize the distribution of a continuous variable. They show the median, quartiles, and outliers of the variable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Heat maps&lt;/em&gt;: Heat maps are used to visualize the relationship between two categorical variables. They show the frequency or proportion of each combination of categories.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Line charts&lt;/em&gt;: Line charts are used to visualize trends in a continuous variable over time.&lt;/p&gt;

&lt;p&gt;When creating data visualizations, it is important to choose the right type of visualization for the data being analyzed. The visualization should be clear and easy to understand, and the labels and axis should be clearly labeled.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot
plt.scatter(data["Years of experience in Germany"], data["Age"])
plt.xlabel("Years of experience in Germany")
plt.ylabel("Age")
plt.show()

# Histogram
sns.histplot(data["Years of experience in Germany"], bins=10)
plt.xlabel("Years of experience in Germany")
plt.ylabel("Frequency")
plt.show()

# Box plot
sns.boxplot(x=data["group"], y=data["value"])
plt.show


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://www.youtube.com/playlist?list=PLe9UEU4oeAuV7RtCbL76hca5ELO_IELk4" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ytimg.com%2Fvi%2F78ut-S-QOEQ%2Fhqdefault.jpg%3Fsqp%3D-oaymwEXCOADEI4CSFryq4qpAwkIARUAAIhCGAE%3D%26rs%3DAOn4CLAcT8GoUmIou0m8u64H9hVzEIGB3w%26days_since_epoch%3D20010" height="auto" class="m-0"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://www.youtube.com/playlist?list=PLe9UEU4oeAuV7RtCbL76hca5ELO_IELk4" rel="noopener noreferrer" class="c-link"&gt;
          Learn Exploratory Data Analysis (EDA) in Python - YouTube
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          This playlist is intended to follow the prior playlist on learning python programming for data analytics. If you have never programmed before (or have very l...
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.youtube.com%2Fs%2Fdesktop%2Ffd504f58%2Fimg%2Ffavicon.ico"&gt;
        youtube.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;h3&gt;
  
  
  Data Transformation
&lt;/h3&gt;

&lt;p&gt;The fourth step in EDA is to transform the data to make it more suitable for analysis. This can include normalization, scaling, and standardization. You can use libraries like Scikit-learn to transform the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Normalize the data
from sklearn.preprocessing import

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Exploratory Data Analysis is an essential process in data analysis: it provides insights into the data, identifies potential problems, and prepares the data for further analysis. In this article, we have covered the steps involved in EDA, including data collection, cleaning, visualization, transformation, and modeling.&lt;/p&gt;

&lt;p&gt;Data cleaning involves identifying and correcting errors in the data, while data visualization enables the identification of patterns and relationships in the data using techniques such as histograms, box plots, and density plots. Data transformation, including normalization, scaling, and standardization, prepares the data for modeling.&lt;/p&gt;

&lt;p&gt;EDA is a crucial step in data analysis that can help identify potential problems in the data, such as missing values, outliers, and anomalies, and provide insights into relationships between variables. This enables researchers and analysts to make informed decisions and gain insights that can be used to solve problems or make predictions.&lt;/p&gt;

&lt;p&gt;Overall, the EDA process is iterative, meaning that the analysis may be refined or expanded based on the results of previous analysis. EDA is an essential step in the data analysis process, and it is critical to ensure that the data is clean and ready for further analysis.&lt;/p&gt;

&lt;p&gt;For more information about EDA, check out my repository on GitHub, where I performed exploratory data analysis on the "IT Salary Survey EU" dataset that I found on Kaggle.&lt;br&gt;
Here is the link to the GitHub repo:  &lt;a href="https://dev.tourl"&gt;https://github.com/Yankho817/MyProjects&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>Python 101: Introduction to Python for Data Science</title>
      <dc:creator>Yankho Chimpesa</dc:creator>
      <pubDate>Sat, 18 Feb 2023 19:36:31 +0000</pubDate>
      <link>https://dev.to/yankho817/python-101-introduction-to-python-for-data-science-4ob7</link>
      <guid>https://dev.to/yankho817/python-101-introduction-to-python-for-data-science-4ob7</guid>
      <description>&lt;p&gt;Data science is the practice of deriving useful, actionable insights from data. Tools like Python, R, Jupyter, Spyder, etc. simply make that work possible. It's also crucial to realize that the fundamentals of data science remain the same regardless of the tools used  &lt;a href="https://dev.tourl"&gt;https://datascienceparichay.com/python-for-data-science/introduction/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Python is one of the most well-known and widely used programming languages due to its versatility; one of its applications is in data science. It includes a number of libraries that make it simple to perform core data analysis tasks. &lt;/p&gt;

&lt;p&gt;It has a simple syntax and structure that makes it easy to learn for both novice and experienced users. In this article, we'll discuss why python is so popular in data science, give a quick overview of the basics of python, look at some python libraries, and show an example of python in data science in action. &lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Python is a high-level programming language that was first released in 1991. It is an interpreted language, meaning that no compilation step is necessary before running a Python program. Its popularity has created a sizable community, which makes it easier to find information and help.&lt;/p&gt;

&lt;p&gt;Because it contains so many helpful libraries and tools for manipulating, analyzing, and visualizing data, Python is your go-to programming language for data science. NumPy, Pandas, Dask, Seaborn, and Matplotlib are some of the most widely used libraries for data science applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;Installing Anaconda is highly recommended: it installs not only Python but also other crucial tools like Jupyter, Spyder, and RStudio.&lt;br&gt;
The installation procedure is straightforward. &lt;br&gt;
Follow the steps below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Navigate to &lt;u&gt;Anaconda’s Individual Edition’s&lt;/u&gt; page download the Anaconda installer depending on your system requirements  &lt;a href="https://dev.tourl"&gt;https://www.anaconda.com/&lt;/a&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27gzxmmd1jfwc6u9gj6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27gzxmmd1jfwc6u9gj6k.png" alt="Anaconda Installer download" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run the downloaded installer to install Anaconda. Specify how you want Anaconda installed during the installation procedure. Use the default configurations if you're unsure, or consult the official installation guide for further information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After a successful installation, the Anaconda Navigator will be accessible. To quickly test the installation, launch a Jupyter Notebook and run a simple Python command,&lt;br&gt;
such as print("Hey Fellas").&lt;br&gt;
A Jupyter Notebook can be opened by launching it from the Anaconda Navigator. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Jupyter notebook is an open-source web-based application that allows you to create and share live documents that contain code, equations, visualizations, and narrative text. Jupyter notebooks are used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Jupyter notebook combines the interactive capabilities of IPython with the versatile document format of the notebook. The notebook is capable of running code in a wide range of programming languages, including Python, R, Julia, and Scala. It can also be used to produce rich, interactive visualizations &lt;a href="https://dev.tourl"&gt;https://jupyter.org/try-jupyter/retro/notebooks/?path=notebooks/Intro.ipynb&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That's all there is to it. After successfully completing the aforementioned steps, you will have all the tools required to execute Python, whether it is directly in the command prompt or through a straightforward program like Jupyter Notebook. &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/5pf0_bpNbkw"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Syntax Basics
&lt;/h2&gt;

&lt;p&gt;Python was designed to be easy to read and write.&lt;br&gt;
The following are some Python syntax examples: &lt;/p&gt;
&lt;h3&gt;
  
  
  Variables in Python
&lt;/h3&gt;

&lt;p&gt;A variable in Python is a name that refers to a value or an object.&lt;br&gt;
In a program, data is stored and managed using variables. &lt;br&gt;
In Python, you just use the = (assignment) operator to pair a name with a value and create a variable.&lt;br&gt;
Examples include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yob = 12
print(yob)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example, 'yob' is our variable, while '12' is the assigned value. &lt;/p&gt;

&lt;p&gt;Since variables in Python are dynamically typed, the type of a variable is determined at runtime, based on the value assigned to it. This is in contrast to statically typed languages, where a variable's type must be declared before use.&lt;/p&gt;
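
&lt;p&gt;A minimal sketch of dynamic typing in action (the variable name here is just an example): the same name can refer to values of different types during a program's run.&lt;/p&gt;

```python
# The type associated with a name is determined at runtime
# by the value currently bound to it.
age = 22
print(type(age).__name__)   # int

age = "twenty-two"          # rebinding to a str needs no declaration
print(type(age).__name__)   # str
```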

&lt;p&gt;Python allows letters, digits, and underscores in variable names, although a name cannot begin with a digit.&lt;br&gt;
By convention, variable names are written in lowercase, with words separated by underscores  &lt;a href="https://dev.tourl"&gt;https://www.pythontutorial.net/python-basics/python-variables/&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Understanding Data types
&lt;/h3&gt;

&lt;p&gt;Python includes a number of data types, each with its own set of operations and methods.&lt;br&gt;
Here are some of the most common Python data types: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Numbers: Python can handle integers, floating-point numbers, and complex numbers.&lt;br&gt;
Integers are represented by the int class, and floating-point numbers by the float class.&lt;br&gt;
Complex numbers are represented by the complex class.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strings: Using strings, you can represent text data.&lt;br&gt;
They are enclosed in single quotes (') or double quotes ("), and are represented by the str class. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Booleans: Truth values are represented by Booleans. They are represented by the bool class and can only have True or False values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lists: They are used to organize a collection of items. The list class represents them, and they can contain items of any data type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tuples: Tuples are similar to lists, except that they are immutable: their contents cannot be changed after creation. The tuple class is used to represent them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dictionary: A dictionary is a collection of key-value pairs. They are created with curly braces and are represented by the dict class.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sets: Sets are used to store values that are unique. They are created with curly braces and are represented by the set class.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are a few Python examples of how to use these data types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Numbers
X = 5          # integer
Y = 3.14       # float
Z = 2 + 3j     # complex number

# Strings
S = ‘Hey, There!’
T = “Python is amazing”

# Booleans
A = True
B = False

# Lists
My_list = [1, 2, 3, “four”, 5.0, 6]

# Tuples
My_tuple = (1, 2, 3, “four”, 5.0, "six")

# Dictionaries
My_dict = {“name”: “Yankho”, “age”: 22, “city”: “Blantyre”}

# Sets
My_set = {2, 4, 6, 8, 10}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Functions in Python
&lt;/h3&gt;

&lt;p&gt;In Python, a function is a block of code that performs a specific task or set of tasks. Functions are used to make code reusable, modular, and easier to read and maintain.&lt;/p&gt;

&lt;p&gt;To define a function in Python, you use the def keyword, followed by the function name and parentheses, and a colon. The function body is indented beneath the function header. &lt;br&gt;
Here’s a simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Def say_hello():
Print(“Hello, world!”)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we’ve defined a function called say_hello() that simply prints the string “Hello, world!” when it is called.&lt;/p&gt;

&lt;p&gt;To call a function in Python, you simply write the function name followed by parentheses. &lt;br&gt;
For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Copy code
Say_hello()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will call the say_hello() function, and the output will be “Hello, world!”.&lt;br&gt;
Functions can also take parameters, which are values that you pass to the function &lt;a href="https://dev.tourl"&gt;https://www.pythontutorial.net/python-basics/python-functions/&lt;/a&gt;.&lt;/p&gt;
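
&lt;p&gt;For instance, a small illustrative function (the names greet and greeting are hypothetical) that takes a required parameter and one with a default value:&lt;/p&gt;

```python
def greet(name, greeting="Hello"):
    # Build and return a greeting string for the given name
    return greeting + ", " + name + "!"

print(greet("Yankho"))             # Hello, Yankho!
print(greet("Yankho", "Welcome"))  # Welcome, Yankho!
```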
&lt;h3&gt;
  
  
  Control Statements in Python
&lt;/h3&gt;

&lt;p&gt;In Python, control statements are used to control the flow of the program. They allow you to perform different actions depending on conditions or iterate over data structures.&lt;/p&gt;

&lt;p&gt;Here are the three main types of control statements in Python:&lt;/p&gt;
&lt;h4&gt;
  
  
  Conditional statements (if, else, and elif)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Conditional statements are used to execute different code depending on certain conditions. The basic syntax of an if statement is:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If condition:
    # code to be executed if condition is True

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;An if statement can be followed by an optional else statement, which is executed if the condition is False:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If condition:
    # code to be executed if condition is True
Else:
    # code to be executed if condition is False

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have multiple conditions to check, you can use the elif statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If condition1:
    # code to be executed if condition1 is True
Elif condition2:
    # code to be executed if condition2 is True
Else:
    # code to be executed if both condition1 and condition2 are False

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Loops (for and while)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Loops are used to execute a block of code repeatedly. The for loop is used to iterate over a sequence (such as a list or a string), while the while loop is used to repeat a block of code as long as a certain condition is True.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The basic syntax of a for loop is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For element in sequence:
    # code to be executed for each element in sequence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The basic syntax of a while loop is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;While condition:
    # code to be executed as long as condition is 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
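
&lt;p&gt;As a concrete sketch of the idea, here is a while loop that counts down until its condition becomes False (the counter variable is illustrative):&lt;/p&gt;

```python
# The loop body must update count, otherwise
# the condition would remain True forever.
count = 3
while count != 0:
    print(count)
    count -= 1
# prints 3, then 2, then 1
```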



&lt;h4&gt;
  
  
  Control statements (break, continue, and pass)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Control statements are used to change the normal flow of a loop or a conditional statement. The break statement is used to exit a loop, the continue statement is used to skip the current iteration and move on to the next one, and the pass statement is used as a placeholder when you don’t want to execute any code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s an example that combines these control statements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For i in range(1, 10):
    If i % 2 == 0:
        Continue # skip even numbers
    If i == 7:
        Break # exit the loop when i is 7
    If i == 3:
        Pass # do nothing when i is 3
    Print(i)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code will print the numbers 1, 3, and 5. It skips even numbers using the continue statement and exits the loop when i reaches 7 using the break statement. The pass statement when i is 3 is just a placeholder that does nothing, so 3 is still printed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Libraries for Data Science
&lt;/h2&gt;

&lt;p&gt;Data science projects benefit greatly from the many libraries that Python has to offer. The following list of well-known Python data science libraries includes usage examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;NumPy&lt;/em&gt;: NumPy is a fundamental library for scientific computing in Python. It provides powerful array manipulation capabilities, mathematical functions, and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation, and much more &lt;a href="https://dev.tourl"&gt;https://numpy.org/doc/stable/user/whatisnumpy.html&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The library offers NumPy arrays, which resemble lists but can be up to 50 times faster than Python lists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Import numpy as np

# Creating an array
Arr = np.array([1, 2, 3, 4, 5])

# Mathematical operations on arrays
Print(np.sin(arr))
Print(np.exp(arr))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Pandas&lt;/em&gt;: Pandas is a library for analyzing and manipulating data. It offers data structures that make processing and analyzing tabular data efficient. 
Additionally, Pandas has a flexible dataframe object that can read data from numerous well-known formats, including Excel, SQL, CSV, and more.
It offers highly helpful tools for both reshaping and performing various types of analytics on your data &lt;a href="https://dev.tourl"&gt;https://pandas.pydata.org/docs/user_guide/index.html&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider the example below;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Import pandas as pd

# Reading a CSV file
Data = pd.read_csv(‘data.csv’)

# Grouping and aggregating data
Grouped = data.groupby(‘category’)
Averages = grouped.mean()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Matplotlib&lt;/em&gt;: Matplotlib is a library for creating visualizations in Python. It provides a wide range of plotting tools for visualizing data in various formats &lt;a href="https://dev.tourl"&gt;https://matplotlib.org/stable/index.html&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the example below;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import matplotlib.pyplot as plt

# Creating a scatter plot
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.scatter(x, y)

# Adding labels and a title
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Scatter plot of X and Y')
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Scikit-learn&lt;/em&gt;: Scikit-learn is a machine learning library for Python. It provides a wide range of algorithms and tools for machine learning tasks such as classification, regression, and clustering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/0Lt9w-BxKFQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;See the example below;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;From sklearn.linear_model import LinearRegression

# Creating a linear regression model
Model = LinearRegression()

# Fitting the model to data
X = [[1, 2], [3, 4], [5, 6]]
Y = [3, 7, 11]
Model.fit(X, y)

# Making predictions with the model
Print(model.predict([[7, 8]]))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Seaborn&lt;/em&gt;: Seaborn is a library for creating statistical visualizations in Python. It provides a wide range of tools for creating advanced statistical plots &lt;a href="https://dev.tourl"&gt;https://seaborn.pydata.org/&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the example below;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Creating a heatmap of the correlation matrix
data = pd.read_csv('data.csv')
corr = data.corr()
sns.heatmap(corr)

# Adding a title
plt.title('Correlation heatmap of variables in data.csv')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Dask&lt;/em&gt;: Dask is a library for parallel computing in Python. It provides tools for handling large datasets that do not fit into memory by partitioning them across multiple processors or machines.
This ease of transition from a single machine to a moderate cluster lets users start simple and scale up when necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dask is convenient on a laptop. It installs trivially with conda or pip and extends the size of convenient datasets from “fits in memory” to “fits on disk” &lt;a href="https://dev.tourl"&gt;https://docs.dask.org/en/stable/index.html&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Import dask.dataframe as dd

# Reading a CSV file
Df = dd.read_csv(‘large_file.csv’)

# Computing the mean of a column
Mean = df[‘column’].mean().compute()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Pyforest&lt;/em&gt;: Pyforest is a lazy-import library for data science. It automatically imports commonly used data science libraries when they are first used in a script, so you don’t have to manually import them.
pyforest offers the following solution:&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;You may utilize all of your libraries as usual.&lt;/li&gt;
&lt;li&gt;Pyforest will import a library if it isn't imported already and add an import statement to the first Jupyter cell.&lt;/li&gt;
&lt;li&gt;A library won't be imported if it isn't being used.&lt;/li&gt;
&lt;li&gt;Your notebooks continue to be duplicated and shared without your having to worry about imports.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After setting up pyforest and its Jupyter extension, you can use your preferred Python Data Science tools as usual without having to write import statements &lt;a href="https://dev.tourl"&gt;https://pypi.org/project/pyforest/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, if you want to read a CSV with pandas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Import pyforest

# No need to explicitly import pandas
Df = pd.read_csv(‘data.csv’)

# No need to explicitly import matplotlib
Plt.plot([1, 2, 3], [4, 5, 6])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that while Pyforest can make your code more concise, it can also make it less clear where your functions are coming from, which can be a downside in larger codebases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Python is a powerful and versatile programming language for data science. It has become increasingly popular due to its user-friendly syntax and the extensive range of libraries available for data analysis, manipulation, and visualization.&lt;/p&gt;

&lt;p&gt;The Jupyter Notebook environment is an essential tool for data scientists, as it allows for efficient documentation, visualization, and communication of code and analysis. Moreover, the popular Python libraries such as NumPy, Pandas, Matplotlib, and Scikit-learn are indispensable for a range of data science tasks.&lt;/p&gt;

&lt;p&gt;With Python, data scientists can explore, manipulate, and visualize data in a variety of formats, and can create predictive models and make data-driven decisions. Python’s popularity has led to a growing community of developers and data scientists, who share best practices, libraries, and techniques.&lt;/p&gt;

&lt;p&gt;In summary, Python is a powerful and versatile tool for data science, and it is a necessary skill for anyone in the field. With continued practice and experience, data scientists can leverage Python’s capabilities to analyze and draw insights from large and complex data sets, and make data-driven decisions that are crucial in today’s data-driven world.&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>debugging</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
