<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Clare Chebor</title>
    <description>The latest articles on DEV Community by Clare Chebor (@clare_c).</description>
    <link>https://dev.to/clare_c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F825379%2F68f813fe-b865-414b-ad49-8061e15eebf6.jpg</url>
      <title>DEV Community: Clare Chebor</title>
      <link>https://dev.to/clare_c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clare_c"/>
    <language>en</language>
    <item>
      <title>Transforming Data into Insightful Visualizations: My Tableau Journey</title>
      <dc:creator>Clare Chebor</dc:creator>
      <pubDate>Fri, 26 Jul 2024 11:38:55 +0000</pubDate>
      <link>https://dev.to/clare_c/transforming-data-into-insightful-visualizations-my-tableau-journey-3220</link>
      <guid>https://dev.to/clare_c/transforming-data-into-insightful-visualizations-my-tableau-journey-3220</guid>
      <description>&lt;p&gt;In today’s data-driven world, the ability to transform complex datasets into clear, actionable insights is more crucial than ever. As a Data Scientist with a passion for data visualization, I’ve worked on several exciting Tableau projects that have honed my skills in creating interactive and impactful dashboards. Here’s a glimpse into some of my most recent Tableau projects, including a British Airways reviews analysis, a university dashboard, and an Airbnb dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. British Airways Reviews Analysis&lt;/strong&gt;&lt;br&gt;
One of my standout projects involved analyzing customer reviews for British Airways using Tableau. This project aimed to uncover insights from a vast collection of reviews, focusing on sentiment analysis and emerging trends. By creating interactive dashboards, I was able to present data in a way that highlighted key areas for improvement and customer satisfaction. Features such as sentiment heatmaps and trend lines allowed stakeholders to visualize patterns over time, ultimately aiding in strategic decision-making to enhance customer service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Achievements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developed sentiment analysis visualizations to track customer feedback.&lt;/li&gt;
&lt;li&gt;Created trend analyses to identify periods of increased or decreased satisfaction.&lt;/li&gt;
&lt;li&gt;Provided actionable insights to improve service quality and customer experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. University Dashboard&lt;/strong&gt;&lt;br&gt;
For a university-focused project, I designed a comprehensive Tableau dashboard to visualize institutional data. The goal was to support research initiatives and facilitate informed decision-making across the university. This dashboard included key performance indicators, enrollment trends, and academic performance metrics. By providing a clear and interactive view of the data, the dashboard helped university administrators and researchers make data-driven decisions to enhance institutional effectiveness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Achievements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Designed a user-friendly interface for easy navigation and data exploration.&lt;/li&gt;
&lt;li&gt;Included metrics such as enrollment figures, graduation rates, and academic performance.&lt;/li&gt;
&lt;li&gt;Enabled stakeholders to interact with the data and generate customized reports.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Airbnb Dashboard&lt;/strong&gt;&lt;br&gt;
In another project, I analyzed Airbnb rental data to provide insights into pricing, occupancy rates, and market trends. The Tableau dashboard I created offered an interactive way to explore these metrics and uncover valuable patterns. By visualizing trends in rental prices and occupancy, the dashboard helped property owners and managers optimize their rental strategies and improve their competitive edge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Achievements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developed visualizations to track rental pricing trends and occupancy rates.&lt;/li&gt;
&lt;li&gt;Created comparative analyses to benchmark performance against market trends.&lt;/li&gt;
&lt;li&gt;Provided recommendations for optimizing pricing strategies and enhancing guest experiences.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Each Tableau project has been a valuable learning experience, allowing me to develop my skills in creating interactive and insightful visualizations. From analyzing customer reviews to supporting academic research and optimizing rental strategies, the power of Tableau has been instrumental in transforming complex data into actionable insights. As I continue to explore the possibilities of data visualization, I look forward to leveraging these skills to drive impactful decisions and support strategic initiatives in future projects.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploratory Data Analysis using Data Visualization Technique</title>
      <dc:creator>Clare Chebor</dc:creator>
      <pubDate>Sun, 08 Oct 2023 11:28:46 +0000</pubDate>
      <link>https://dev.to/clare_c/exploratory-data-analysis-using-data-visualization-technique-11pb</link>
      <guid>https://dev.to/clare_c/exploratory-data-analysis-using-data-visualization-technique-11pb</guid>
      <description>&lt;p&gt;&lt;strong&gt;Exploratory Data Analysis&lt;/strong&gt;&lt;br&gt;
Exploratory Data Analysis, or EDA, is an important step in any Data Analysis or Data Science project. EDA is the process of investigating a dataset to discover patterns and anomalies (outliers), and to form hypotheses based on our understanding of the data.&lt;/p&gt;

&lt;p&gt;EDA involves generating summary statistics for the numerical data in the dataset and creating various graphical representations to understand the data better. In this article, we will walk through EDA on an example dataset, using Python with the Pandas, Matplotlib, and Seaborn libraries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Importing libraries&lt;/strong&gt;&lt;br&gt;
We will start by importing the libraries we will require for performing EDA. These include NumPy, Pandas, Matplotlib, and Seaborn.&lt;br&gt;
&lt;code&gt;import numpy as np&lt;br&gt;
import pandas as pd&lt;br&gt;
import matplotlib.pyplot as plt&lt;br&gt;
%matplotlib inline&lt;br&gt;
import seaborn as sns&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reading data&lt;/strong&gt;&lt;br&gt;
We will now read the data from a CSV file into a Pandas DataFrame. You can &lt;a href="https://www.kaggle.com/datasets/rkiattisak/student-performance-in-mathematics"&gt;download the dataset&lt;/a&gt; for your reference.&lt;br&gt;
&lt;code&gt;df = pd.read_csv('exams.csv')&lt;/code&gt; &lt;em&gt;(replace 'exams.csv' with the name of your dataset file)&lt;/em&gt;&lt;br&gt;
Let us take a look at the dataset using &lt;code&gt;df.head()&lt;/code&gt;. The output should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ajuLHPuA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k09cr5c3xmw6d3fircsm.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ajuLHPuA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k09cr5c3xmw6d3fircsm.PNG" alt="Image description" width="800" height="162"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Descriptive Statistics&lt;/strong&gt;&lt;br&gt;
Everything is perfect! The data looks just like we wanted. You can easily tell that the dataset contains data about students at a school or college and their scores in 3 subjects. Let us start by looking at the descriptive statistics for the dataset. We will use describe() for this.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df.describe(include='all')&lt;/code&gt;&lt;br&gt;
&lt;em&gt;By giving the include parameter the value ‘all’, we make sure that categorical features are also included in the result.&lt;/em&gt;&lt;br&gt;
The output DataFrame should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j0KD6Fw7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/297ib6axlu0ewmebsekh.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j0KD6Fw7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/297ib6axlu0ewmebsekh.PNG" alt="Image description" width="800" height="306"&gt;&lt;/a&gt;&lt;br&gt;
For numerical features, fields such as the mean, standard deviation, percentiles, and maximum have been populated. For categorical features, the count, number of unique values, top (most frequent value), and its frequency have been populated. This gives us a broad idea of our dataset.&lt;br&gt;
&lt;strong&gt;Missing value imputation&lt;/strong&gt;&lt;br&gt;
We will now check for missing values in our dataset. If there are any missing entries, we will impute them with appropriate values (the mode for a categorical feature, and the median or mean for a numerical feature). We will use the isnull() function for this purpose.&lt;br&gt;
&lt;code&gt;df.isnull().sum()&lt;/code&gt;&lt;br&gt;
This will tell us how many missing values we have in each column in our dataset. The output (Pandas Series) should look like this:&lt;/p&gt;
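&lt;p&gt;Our dataset turns out to have no missing values, but if it did, the imputation described above (mode for a categorical feature, median for a numerical one) could be sketched as follows. The tiny DataFrame here is made up purely for illustration:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Small made-up DataFrame with gaps, for illustration only
df_demo = pd.DataFrame({
    'gender': ['female', 'male', None, 'female'],
    'math score': [72.0, np.nan, 90.0, 47.0],
})

# Categorical feature: fill with the mode (most frequent value)
df_demo['gender'] = df_demo['gender'].fillna(df_demo['gender'].mode()[0])

# Numerical feature: fill with the median (robust to outliers)
df_demo['math score'] = df_demo['math score'].fillna(df_demo['math score'].median())

print(df_demo.isnull().sum().sum())  # 0 missing values remain
```

&lt;p&gt;After filling, &lt;code&gt;df_demo.isnull().sum()&lt;/code&gt; reports zero missing values in every column.&lt;/p&gt;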

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OPm5HBqz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5eomf7kfk432dyt7ppo2.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OPm5HBqz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5eomf7kfk432dyt7ppo2.PNG" alt="Image description" width="292" height="177"&gt;&lt;/a&gt;&lt;br&gt;
Fortunately for us, there are no missing values in this dataset. We will now proceed to analyze this dataset, observe patterns, and identify outliers with the help of graphs and figures.&lt;br&gt;
&lt;strong&gt;Graphical representation&lt;/strong&gt;&lt;br&gt;
We will start with &lt;em&gt;Univariate Analysis&lt;/em&gt;, using bar graphs for this purpose. We will look at the distribution of students across gender, race/ethnicity, lunch status, and whether they have taken a test preparation course.&lt;br&gt;
&lt;code&gt;plt.subplot(221)&lt;br&gt;
df['gender'].value_counts().plot(kind='bar', title='Gender of students', figsize=(16,9))&lt;br&gt;
plt.xticks(rotation=0)&lt;br&gt;
plt.subplot(222)&lt;br&gt;
df['race/ethnicity'].value_counts().plot(kind='bar', title='Race/ethnicity of students')&lt;br&gt;
plt.xticks(rotation=0)&lt;br&gt;
plt.subplot(223)&lt;br&gt;
df['lunch'].value_counts().plot(kind='bar', title='Lunch status of students')&lt;br&gt;
plt.xticks(rotation=0)&lt;br&gt;
plt.subplot(224)&lt;br&gt;
df['test preparation course'].value_counts().plot(kind='bar', title='Test preparation course')&lt;br&gt;
plt.xticks(rotation=0)&lt;br&gt;
plt.show()&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--feVGqOIa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cc6z70fskh40iaq7sk25.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--feVGqOIa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cc6z70fskh40iaq7sk25.PNG" alt="Image description" width="771" height="447"&gt;&lt;/a&gt;&lt;br&gt;
We can infer many things from the graph. There are more female students than male students. The majority of the students belong to groups C and D. More than 60% of the students have a standard lunch at school, and more than 60% have not taken a test preparation course.&lt;/p&gt;
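&lt;p&gt;The percentages quoted above can be checked numerically with &lt;code&gt;value_counts(normalize=True)&lt;/code&gt;. A sketch on a tiny made-up sample of the lunch column:&lt;/p&gt;

```python
import pandas as pd

# Tiny made-up sample of the 'lunch' column, for illustration only
lunch = pd.Series(['standard', 'standard', 'free/reduced', 'standard', 'free/reduced'])

# normalize=True turns raw counts into proportions that sum to 1
shares = lunch.value_counts(normalize=True)
print(shares['standard'])  # 0.6, i.e. 60% of this sample
```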

&lt;p&gt;Continuing with Univariate Analysis, next, we will be making a boxplot of the numerical columns (math score, reading score, and writing score) in the dataset. A boxplot helps us visualize the data in terms of quartiles. It also identifies outliers in the dataset, if any. We will use the boxplot() function for this.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df.boxplot()&lt;/code&gt;&lt;br&gt;
The output should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zHwR1-aA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g5pf9omencsxzyxdd9kk.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zHwR1-aA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g5pf9omencsxzyxdd9kk.PNG" alt="Image description" width="520" height="383"&gt;&lt;/a&gt;&lt;br&gt;
The middle portion represents the interquartile range (IQR). The horizontal green line in the middle represents the median of the data. The hollow circles near the tails represent outliers in the dataset. However, since it is very much possible for a student to score extremely low marks on a test, we will not remove these outliers.&lt;/p&gt;
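&lt;p&gt;The outliers that the boxplot draws as hollow circles come from the standard 1.5 * IQR rule, which we can reproduce directly. The score list below is made up for illustration:&lt;/p&gt;

```python
import pandas as pd

# Made-up score list; 12 is an obvious low outlier
scores = pd.Series([62, 67, 70, 71, 74, 76, 79, 81, 84, 12])

# Quartiles and interquartile range (IQR)
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1

# The standard boxplot fences: 1.5 * IQR beyond each quartile
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are the hollow circles on the boxplot
outliers = scores[scores.lt(lower) | scores.gt(upper)]
print(outliers.tolist())  # [12]
```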

&lt;p&gt;We will now make a &lt;strong&gt;distribution plot&lt;/strong&gt; of the students’ math scores. A distribution plot tells us how the data is distributed. We will use the histplot() function (the older distplot() is deprecated in recent versions of Seaborn).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sns.histplot(df['math score'], kde=True)&lt;/code&gt;&lt;br&gt;
The plot in the output should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--53vb4KaT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r1wmw04zj65s6agthb2u.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--53vb4KaT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r1wmw04zj65s6agthb2u.PNG" alt="Image description" width="586" height="396"&gt;&lt;/a&gt;&lt;br&gt;
The graph closely resembles a bell curve. The peak is at around 65 marks, which is approximately the mean math score of the students in the dataset. Similar distribution plots can be made for the reading and writing scores.&lt;/p&gt;

&lt;p&gt;We will now look at the &lt;strong&gt;correlation&lt;/strong&gt; between the 3 scores with the help of a heatmap. For this, we will use the corr() and heatmap() functions.&lt;br&gt;
&lt;code&gt;corr = df.corr(numeric_only=True)&lt;br&gt;
sns.heatmap(corr, annot=True, square=True)&lt;br&gt;
plt.yticks(rotation=0)&lt;br&gt;
plt.show()&lt;/code&gt;&lt;br&gt;
&lt;em&gt;numeric_only=True restricts the correlation to the numerical score columns; recent versions of Pandas no longer drop non-numeric columns silently.&lt;/em&gt;&lt;br&gt;
The plot in the output should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wCtIUOn1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z385asje5ks1l0zhay7b.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wCtIUOn1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z385asje5ks1l0zhay7b.PNG" alt="Image description" width="582" height="386"&gt;&lt;/a&gt;&lt;br&gt;
The heatmap shows that the 3 scores are highly correlated. The reading score has a correlation coefficient of 0.95 with the writing score. The math score has a correlation coefficient of 0.82 with the reading score and 0.80 with the writing score.&lt;/p&gt;
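&lt;p&gt;A coefficient like those shown in each heatmap cell can be reproduced directly with &lt;code&gt;Series.corr()&lt;/code&gt;. The pair of score columns below is made up purely for illustration:&lt;/p&gt;

```python
import pandas as pd

# Made-up reading/writing scores that move together, for illustration only
demo = pd.DataFrame({
    'reading score': [60, 70, 80, 90],
    'writing score': [62, 68, 82, 88],
})

# Pearson correlation, the same statistic annotated in the heatmap
r = demo['reading score'].corr(demo['writing score'])
print(round(r, 3))  # 0.985
```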

&lt;p&gt;We will now move on to &lt;strong&gt;Bivariate Analysis&lt;/strong&gt;. We will look at a relational plot in Seaborn. It helps us to understand the relationship between 2 variables on different subsets of the dataset. We will try to understand the relationship between the math score and the reading score of students of different genders.&lt;br&gt;
&lt;code&gt;sns.relplot(x='math score', y='reading score', hue='gender', data=df)&lt;/code&gt;&lt;br&gt;
The relational plot should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2sm35Ph---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7e9ktli72n9bxepvunc4.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2sm35Ph---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7e9ktli72n9bxepvunc4.PNG" alt="Image description" width="643" height="447"&gt;&lt;/a&gt;&lt;br&gt;
The graph shows a clear difference in scores between male and female students. For the same math score, female students tend to have a higher reading score than male students; conversely, for the same reading score, male students tend to have a higher math score.&lt;br&gt;
Finally, we will analyze students’ performance in math, reading, and writing based on their parents’ level of education and the test preparation course. First, let us look at the impact of the parents’ level of education on their child’s performance in school using a line plot.&lt;br&gt;
&lt;code&gt;df.groupby('parental level of education')[['math score', 'reading score', 'writing score']].mean().T.plot(figsize=(12,8))&lt;/code&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tCDDcutA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gsqpq8ch2fyyiq8kftan.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tCDDcutA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gsqpq8ch2fyyiq8kftan.PNG" alt="Image description" width="781" height="493"&gt;&lt;/a&gt;&lt;br&gt;
It is very clear from this graph that students whose parents are more educated (master’s, bachelor’s, or associate’s degree) perform better on average than students whose parents are less educated (high school). This may simply reflect differences in the students’ environment at home: more educated parents may be more likely to encourage their children’s studies.&lt;br&gt;
Secondly, let’s look at the impact of the test preparation course on students’ performance using a horizontal bar graph.&lt;br&gt;
&lt;code&gt;df.groupby('test preparation course')[['math score', 'reading score', 'writing score']].mean().T.plot(kind='barh', figsize=(8,8))&lt;/code&gt;&lt;br&gt;
The output looks like this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TTdo8M_X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d9q4u02kz30k5qm21jq9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TTdo8M_X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d9q4u02kz30k5qm21jq9.PNG" alt="Image description" width="669" height="503"&gt;&lt;/a&gt;&lt;br&gt;
Again, it is very clear that students who completed the test preparation course performed better, on average, than students who did not opt for the course.&lt;br&gt;
&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
In this article, we explored the meaning of Exploratory Data Analysis (EDA) and Data Visualization with the help of an example dataset from &lt;a href="https://www.kaggle.com/datasets"&gt;Kaggle&lt;/a&gt;. We looked at how to analyze the dataset, draw conclusions from it, and form hypotheses based on those conclusions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Reference: Analytics Vidhya&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Science for Beginners</title>
      <dc:creator>Clare Chebor</dc:creator>
      <pubDate>Sat, 30 Sep 2023 18:40:32 +0000</pubDate>
      <link>https://dev.to/clare_c/data-science-for-beginners-36cj</link>
      <guid>https://dev.to/clare_c/data-science-for-beginners-36cj</guid>
      <description>&lt;p&gt;&lt;strong&gt;DATA SCIENCE&lt;/strong&gt;&lt;br&gt;
Data science is an interdisciplinary field that combines various techniques, algorithms, and processes to extract knowledge and insights from structured and unstructured data. These insights can be used for decision-making, predictive modeling, and problem-solving across a wide range of domains, from finance and healthcare to marketing and sports.&lt;br&gt;
Data scientists are the wizards who wield their analytical skills and domain knowledge to transform raw data into actionable insights. They are skilled in statistical analysis, machine learning, data visualization, and programming languages like Python or R.&lt;br&gt;
&lt;strong&gt;Key Concepts in Data Science&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Collection:
Data science begins with the collection of data. This data can come from various sources, such as databases, spreadsheets, sensors, social media, or even text documents. The quality and quantity of data are crucial for the success of any data science project.&lt;/li&gt;
&lt;li&gt;Data Cleaning and Preprocessing:
Raw data is often messy, incomplete, or inconsistent. Data scientists spend a significant amount of time cleaning and preprocessing data, which involves tasks like handling missing values, removing outliers, and standardizing data formats.&lt;/li&gt;
&lt;li&gt;Exploratory Data Analysis (EDA):
EDA is the process of visualizing and summarizing data to understand its underlying patterns and characteristics. Data scientists use tools like histograms, scatter plots, and statistical measures to gain insights from the data.&lt;/li&gt;
&lt;li&gt;Machine Learning:
Machine learning is a subset of data science that focuses on building predictive models from data. This involves training algorithms to make predictions or classifications based on historical data. Common algorithms include decision trees, support vector machines, and neural networks.&lt;/li&gt;
&lt;li&gt;Data Visualization:
Data visualization is the art of presenting data in a visual format, such as charts, graphs, and dashboards. It helps convey insights effectively to non-technical stakeholders.&lt;/li&gt;
&lt;li&gt;Model Evaluation:
After building a machine learning model, data scientists evaluate its performance using various metrics. This step is crucial for ensuring that the model is accurate and reliable.&lt;/li&gt;
&lt;/ol&gt;
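&lt;p&gt;The steps above can be sketched end to end in plain Python. The numbers and the mean-baseline "model" below are made up purely for illustration (step 5, visualization, is left to libraries like Matplotlib):&lt;/p&gt;

```python
from statistics import mean, median

# Steps 1-2 - Collect and clean: drop missing entries (None) from raw scores
raw = [72, None, 90, 47, 81, None, 64, 78]
clean = [x for x in raw if x is not None]

# Step 3 - Explore: simple summary statistics
print('mean:', mean(clean), 'median:', median(clean))

# Step 4 - Model: a trivial baseline that always predicts the training mean
train, test = clean[:4], clean[4:]
predict = mean(train)

# Step 6 - Evaluate: mean absolute error of the baseline on held-out data
mae = mean(abs(x - predict) for x in test)
print('MAE:', mae)  # 7.0
```

&lt;p&gt;In practice, libraries like Pandas and Scikit-Learn replace each of these hand-rolled steps, but the pipeline keeps the same shape.&lt;/p&gt;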

&lt;p&gt;&lt;strong&gt;Tools and Technologies&lt;/strong&gt;&lt;br&gt;
Data scientists use a variety of tools and technologies to perform their work. Some of the essential tools include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Programming Languages: Python and R are the primary languages used for data science due to their rich libraries and communities.&lt;/li&gt;
&lt;li&gt;Data Manipulation Libraries: Pandas (Python) and dplyr (R) are widely used for data manipulation tasks.&lt;/li&gt;
&lt;li&gt;Machine Learning Frameworks: Scikit-Learn (Python) and TensorFlow/PyTorch (Python) are popular for building machine learning models.&lt;/li&gt;
&lt;li&gt;Data Visualization Tools: Matplotlib, Seaborn, and Plotly (Python) are used for creating data visualizations.&lt;/li&gt;
&lt;li&gt;Statistical Packages: R offers a comprehensive suite of statistical packages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Learning Data Science&lt;/strong&gt;&lt;br&gt;
For beginners interested in data science, here are some steps to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Learn the Basics: Start with the fundamentals of programming, statistics, and mathematics.&lt;/li&gt;
&lt;li&gt;Choose a Language: Pick either Python or R as your primary programming language and become proficient in it.&lt;/li&gt;
&lt;li&gt;Explore Online Courses: Enroll in online courses and tutorials. Platforms like Lux Academy, Coursera, edX, DataCamp, and Khan Academy offer excellent introductory courses.&lt;/li&gt;
&lt;li&gt;Hands-on Practice: Apply what you learn by working on small data projects. Kaggle is a great platform for practicing and competing in data science competitions.&lt;/li&gt;
&lt;li&gt;Read and Stay Informed: Follow data science blogs, books, and academic papers to stay updated on the latest trends and techniques.&lt;/li&gt;
&lt;li&gt;Join a Community: Participate in data science communities, forums, and boot camps offered by various organizations, such as Data Science Academy, to seek help, share knowledge, and collaborate with others.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Data science is a dynamic and rewarding field that continues to evolve. While it might seem intimidating at first, remember that every data scientist started as a beginner. With determination, curiosity, and continuous learning, you can embark on a fascinating journey into the world of data science, unlocking the power of data to make informed decisions and solve real-world problems.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
