<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Angelica Lo Duca</title>
    <description>The latest articles on DEV Community by Angelica Lo Duca (@alod83).</description>
    <link>https://dev.to/alod83</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F667160%2Fb3649dcb-375a-4206-b408-394c01473ac2.jpeg</url>
      <title>DEV Community: Angelica Lo Duca</title>
      <link>https://dev.to/alod83</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alod83"/>
    <language>en</language>
    <item>
      <title>How Generative AI Can Help You Improve Your Data Visualization Charts</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Mon, 29 Jan 2024 11:03:24 +0000</pubDate>
      <link>https://dev.to/alod83/how-generative-ai-can-help-you-improve-your-data-visualization-charts-38on</link>
      <guid>https://dev.to/alod83/how-generative-ai-can-help-you-improve-your-data-visualization-charts-38on</guid>
      <description>&lt;p&gt;🚀 Exciting News in Data Visualization! 📊✨&lt;/p&gt;

&lt;p&gt;Transform your data visualization game with the power of generative AI! 🤖📈 Check out our latest blog article with 5 Key Takeaways:&lt;/p&gt;

&lt;p&gt;1️⃣ Master the basic structure of a data visualization chart.&lt;br&gt;
2️⃣ Harness the potential of Python Altair for chart creation.&lt;br&gt;
3️⃣ Turbocharge your chart generation with GitHub Copilot.&lt;br&gt;
4️⃣ Elevate your content using ChatGPT for relevant annotations.&lt;br&gt;
5️⃣ Spice up your charts with engaging images from DALL-E.&lt;/p&gt;

&lt;p&gt;🕒 Tired of spending hours on mundane charts? Discover how Python Altair, combined with generative AI tools like GitHub Copilot, ChatGPT, and DALL-E, can revolutionize your data visualization process. 🚀💡&lt;/p&gt;

&lt;p&gt;Follow the steps outlined in the article:&lt;/p&gt;

&lt;p&gt;1️⃣ Write your basic chart with GitHub Copilot.&lt;br&gt;
2️⃣ Utilize ChatGPT to generate catchy titles and annotations.&lt;br&gt;
3️⃣ Enhance readability and captivate your audience by adding DALL-E generated images to your chart.&lt;/p&gt;

&lt;p&gt;Ready to dive in? 💻🔍 &lt;a href="https://www.kdnuggets.com/how-generative-ai-can-help-you-improve-your-data-visualization-charts"&gt;Read here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datavisualization</category>
      <category>datascience</category>
      <category>python</category>
      <category>generativeai</category>
    </item>
    <item>
      <title>How to Tailor A Column Chart for Communication</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Thu, 18 Jan 2024 04:56:39 +0000</pubDate>
      <link>https://dev.to/alod83/how-to-tailor-a-column-chart-for-communication-5bcb</link>
      <guid>https://dev.to/alod83/how-to-tailor-a-column-chart-for-communication-5bcb</guid>
      <description>&lt;p&gt;❓Ever found yourself lost in the complexity of a column chart, struggling to decipher its meaning amidst a sea of information? You're not alone. Drawing a column chart is a fantastic way to represent categories and values, but it can become overwhelming with unnecessary details. In this blog post, we're diving into a strategy to simplify column charts, particularly when dealing with three main categories.&lt;/p&gt;

&lt;p&gt;🤔 The Challenge: Overwhelming Complexity&lt;br&gt;
Yes, column charts can sometimes be too much to handle, making it challenging for your audience to extract meaningful insights. But fear not, as we propose a strategic methodology to cut through the noise and bring clarity to your visualizations.&lt;/p&gt;

&lt;p&gt;🔍 The Three-Step Methodology:&lt;/p&gt;

&lt;p&gt;📊 Analyze Data: Start by understanding your data deeply. Identify the key categories and values crucial for your message.&lt;br&gt;
❌ Delete Useless Data: Streamline your chart by removing irrelevant data points. Less clutter means a clearer focus on the essentials.&lt;br&gt;
✂️ Approximate Remaining Data: Strike a balance by approximating the remaining data points. This step simplifies the chart while retaining vital trends.&lt;br&gt;
🚀 Drawing Results with Altair:&lt;br&gt;
Implementing this methodology is a breeze with Altair, a Python library for data visualization. Transform complex data into visually compelling charts that captivate your audience's attention and convey your message with impact.&lt;/p&gt;

&lt;p&gt;💡 Considerations:&lt;br&gt;
While this methodology offers a simplified view, it comes with a trade-off—a loss of information. Perfect for targeted communication during presentations, it might not be ideal for detailed technical reports. Choose wisely based on your communication goals.&lt;/p&gt;

&lt;p&gt;🌐 Learn More &lt;a href="https://medium.com/towards-artificial-intelligence/how-to-tailor-a-column-chart-for-communication-3e36e452113b"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datastorytelling</category>
      <category>datavisualization</category>
      <category>python</category>
      <category>columnchart</category>
    </item>
    <item>
      <title>Using Vega-Lite for Data Visualization</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Tue, 26 Dec 2023 16:49:41 +0000</pubDate>
      <link>https://dev.to/alod83/using-vega-lite-for-data-visualization-1bnk</link>
      <guid>https://dev.to/alod83/using-vega-lite-for-data-visualization-1bnk</guid>
<description>&lt;p&gt;Hi all,&lt;br&gt;
Today I want to share with you an article about Vega-Lite, a data visualization grammar. The article is a beginner tutorial showing how to get started with Vega-Lite.&lt;br&gt;
The idea behind Vega-Lite is to describe your visualization in JSON and let a renderer display the chart.&lt;/p&gt;
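&lt;p&gt;As a taste of the grammar, here is a minimal sketch of a Vega-Lite specification built as a Python dictionary and serialized to JSON (the data values are made up for illustration):&lt;/p&gt;

```python
import json

# A minimal, hypothetical Vega-Lite specification: a bar chart
# described entirely in JSON, to be shown by a Vega-Lite renderer.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": [
        {"category": "A", "value": 28},
        {"category": "B", "value": 55},
        {"category": "C", "value": 43},
    ]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "category", "type": "nominal"},
        "y": {"field": "value", "type": "quantitative"},
    },
}
print(json.dumps(spec, indent=2))
```

&lt;p&gt;A renderer such as the online Vega editor can then display this spec as a bar chart.&lt;/p&gt;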

&lt;p&gt;Read the full article &lt;a href="https://pub.towardsai.net/using-vega-lite-for-data-visualization-546020f46be2"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datavisualization</category>
      <category>datastorytelling</category>
      <category>datascience</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Improve Your ChatGPT Outputs Using Configuration Parameters</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Thu, 14 Dec 2023 09:34:15 +0000</pubDate>
      <link>https://dev.to/alod83/how-to-improve-your-chatgpt-outputs-using-configuration-parameters-1cao</link>
      <guid>https://dev.to/alod83/how-to-improve-your-chatgpt-outputs-using-configuration-parameters-1cao</guid>
      <description>&lt;p&gt;📚 Excited to share insights from my recent read! 🌟 David Clinton's "The Complete Obsolete Guide to Generative AI" from Manning Publications has been an eye-opener, especially diving into the second chapter.&lt;/p&gt;

&lt;p&gt;Ever wondered about the key parameters shaping an AI model? This book delves deep into configuring them to match specific needs. Parameters like temperature, Top P value, frequency penalty, and presence penalty play a pivotal role in fine-tuning output.&lt;/p&gt;

&lt;p&gt;Understanding and tweaking these settings can significantly impact ChatGPT's output. Setting parameters enables tailoring the output, whether you seek a more deterministic response closely linked to the input or desire a more creative and diverse output.&lt;/p&gt;

&lt;p&gt;To get hands-on, we'll simulate a scenario extracted from my book Data Storytelling with Generative AI using Python and Altair.&lt;/p&gt;

&lt;p&gt;Read the full article here 👇👇👇&lt;br&gt;
&lt;a href="https://towardsdatascience.com/how-to-improve-your-chatgpt-outputs-using-configuration-parameters-0eebd575646e"&gt;https://towardsdatascience.com/how-to-improve-your-chatgpt-outputs-using-configuration-parameters-0eebd575646e&lt;/a&gt;&lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>generativeai</category>
      <category>books</category>
      <category>python</category>
    </item>
    <item>
      <title>Using Slope Charts to Simplify Your Data Visualization</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Fri, 08 Dec 2023 21:55:31 +0000</pubDate>
      <link>https://dev.to/alod83/using-slope-charts-to-simplify-your-data-visualization-3767</link>
      <guid>https://dev.to/alod83/using-slope-charts-to-simplify-your-data-visualization-3767</guid>
      <description>&lt;p&gt;We may plot charts to include as many concepts as possible in our visualization. As a result, our chart could be difficult to read and distracting. For this reason, before plotting anything, sit in your chair and plan what you want to communicate. Then, look at your data and decide what is effectively necessary to plot. Leave the rest out of your visualization.&lt;/p&gt;

&lt;p&gt;In this tutorial, we’ll see &lt;strong&gt;how to use slope charts to simplify an overwhelming trendline&lt;/strong&gt;. If you are a data analyst, you might jump out of your chair and get scared because, using a slope chart, you will see a significant loss of information. But I assure you that, in some cases, it will really be worth it.&lt;/p&gt;

&lt;p&gt;Let’s see the cases where a slope chart can be used.&lt;/p&gt;

&lt;p&gt;Read more on &lt;a href="https://medium.com/towards-data-science/using-slope-charts-to-simplify-your-data-visualization-be1f0eaf1f0f"&gt;Towards Data Science&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datastorytelling</category>
      <category>datavisualization</category>
      <category>python</category>
    </item>
    <item>
      <title>3 Ways to Embed a Matplotlib Chart into an HTML Page</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Thu, 22 Jun 2023 20:32:36 +0000</pubDate>
      <link>https://dev.to/alod83/3-ways-to-embed-a-matplotlib-chart-into-an-html-page-48a1</link>
      <guid>https://dev.to/alod83/3-ways-to-embed-a-matplotlib-chart-into-an-html-page-48a1</guid>
      <description>&lt;p&gt;Are you struggling with integrating Matplotlib charts into your HTML pages? We've got you covered! Our latest article dives deep into innovative techniques for seamlessly embedding stunning charts while preserving interactivity. 🚀&lt;/p&gt;

&lt;p&gt;Discover three powerful solutions:&lt;br&gt;
1️⃣ mpld3 library: Unleash the potential of Matplotlib with mpld3's effortless chart integration.&lt;br&gt;
2️⃣ Base64 Encoding: Learn how to encode charts as base64 for seamless HTML integration.&lt;br&gt;
3️⃣ Leveraging py-script: Dive into the power of py-script to embed interactive charts with ease.&lt;/p&gt;
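&lt;p&gt;As a flavor of the second approach, here is a minimal, self-contained sketch of base64-encoding a Matplotlib figure into a data URI (illustrative code, not the article's exact example):&lt;/p&gt;

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

# Draw a simple chart
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 9])

# Serialize the figure to PNG bytes in memory, then base64-encode
buf = io.BytesIO()
fig.savefig(buf, format="png")
encoded = base64.b64encode(buf.getvalue()).decode("ascii")

# This string can be used directly as the src of an HTML img element
data_uri = "data:image/png;base64," + encoded
print(data_uri[:50])
```

&lt;p&gt;Because the chart travels inside the page itself, no separate image file needs to be served.&lt;/p&gt;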

&lt;p&gt;Don't miss this comprehensive guide that takes you step-by-step through each solution, providing clear instructions and examples. Elevate your data visualization game and conquer the complexities of chart integration effortlessly. 💪&lt;/p&gt;

&lt;p&gt;Read more on &lt;a href="https://towardsdatascience.com/3-ways-to-embed-a-matplotlib-chart-into-an-html-page-8e11fa66a4b0"&gt;Towards Data Science&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>matplotlib</category>
      <category>datavis</category>
    </item>
    <item>
      <title>How to Organize a Data Science Project</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Sat, 10 Jun 2023 07:51:40 +0000</pubDate>
      <link>https://dev.to/alod83/how-to-organize-a-data-science-project-52a1</link>
      <guid>https://dev.to/alod83/how-to-organize-a-data-science-project-52a1</guid>
      <description>&lt;p&gt;Having trouble figuring out the best way to organize your data science projects?&lt;/p&gt;

&lt;p&gt;Check out these strategies for efficient planning and organization: manual setup, Cookiecutter, or a cloud service 🤓 💻 #datascience #organization&lt;/p&gt;

&lt;p&gt;More details in this article: &lt;a href="https://towardsdatascience.com/how-to-organize-your-data-science-project-3710a476bf8c"&gt;https://towardsdatascience.com/how-to-organize-your-data-science-project-3710a476bf8c&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>projectmanagement</category>
      <category>developers</category>
      <category>python</category>
    </item>
    <item>
      <title>Book: Comet for Data Science</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Thu, 22 Sep 2022 11:56:29 +0000</pubDate>
      <link>https://dev.to/alod83/book-comet-for-data-science-263c</link>
      <guid>https://dev.to/alod83/book-comet-for-data-science-263c</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7-K2hyWN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l8kh9n67hzmrisfg757k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7-K2hyWN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l8kh9n67hzmrisfg757k.png" alt="Image description" width="113" height="148"&gt;&lt;/a&gt;&lt;br&gt;
About one year ago, I discovered Comet, an experimentation platform for model tracking and monitoring. Since then, I have used Comet’s features to keep my projects organized and move them from the early stages to production. The simplicity of the platform helped me complete all of my projects on it.&lt;/p&gt;

&lt;p&gt;While studying Comet, I came across Heartbeat, a Medium publication edited by the Comet team, and started writing for them. Thanks to Heartbeat, I explored several aspects of Comet in depth and moved from the role of data analyst to that of builder.&lt;/p&gt;

&lt;p&gt;I authored Comet for Data Science as the result of those studies, my own tests, and conversations with the Comet team.&lt;/p&gt;

&lt;p&gt;Throughout the book, you will learn how to build a successful Data Science project, from the first steps up to model deployment. The book takes you through the concepts of Data Science from a Comet perspective, with the hope of increasing your productivity. Along the way, you will find many practical examples that help clarify the concepts and can also serve as starting points for your own projects.&lt;/p&gt;

&lt;p&gt;The book is organized as follows:&lt;/p&gt;

&lt;p&gt;Getting Started with Comet&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An Overview of Comet&lt;/li&gt;
&lt;li&gt;Exploratory Data Analysis in Comet&lt;/li&gt;
&lt;li&gt;Model Evaluation in Comet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A Deep Dive into Comet&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workspaces, Projects, Experiments and Models&lt;/li&gt;
&lt;li&gt;How to Build a Narrative in Comet&lt;/li&gt;
&lt;li&gt;An Overview of DevOps concepts&lt;/li&gt;
&lt;li&gt;Extending the Gitlab DevOps platform with Comet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples and Use Cases&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comet for Machine Learning&lt;/li&gt;
&lt;li&gt;Comet for Natural Language Processing&lt;/li&gt;
&lt;li&gt;Comet for Deep Learning&lt;/li&gt;
&lt;li&gt;Comet for Time Series Analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find more details about the book at &lt;a href="https://www.cometfordatascience.com/"&gt;https://www.cometfordatascience.com/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy reading!&lt;/p&gt;

</description>
      <category>books</category>
      <category>datascience</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Model Evaluation in Scikit-learn</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Tue, 22 Mar 2022 12:16:22 +0000</pubDate>
      <link>https://dev.to/alod83/model-evaluation-in-scikit-learn-4g6p</link>
      <guid>https://dev.to/alod83/model-evaluation-in-scikit-learn-4g6p</guid>
      <description>&lt;p&gt;Scikit-learn is one of the most popular Python libraries for Machine Learning. It provides models, datasets, and other useful functions. In this article, I will describe the most popular techniques provided by scikit-learn for Model Evaluation.&lt;/p&gt;

&lt;p&gt;Model evaluation lets us measure the performance of a model and compare different models, so we can choose the best one to send into production. The appropriate techniques depend on the specific task we want to solve. In this article, we focus on the following tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regression&lt;/li&gt;
&lt;li&gt;Classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each task, I will describe how to calculate the most popular metrics through a practical example.&lt;/p&gt;

&lt;h2&gt;
  
  
  1 Loading the Dataset
&lt;/h2&gt;

&lt;p&gt;As an example dataset, I use the Wine Quality Data Set, provided by the UCI Machine Learning Repository. To use this dataset, you should cite the source properly, as follows:&lt;br&gt;
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [&lt;a href="http://archive.ics.uci.edu/ml"&gt;http://archive.ics.uci.edu/ml&lt;/a&gt;]. Irvine, CA: University of California, School of Information and Computer Science.&lt;br&gt;
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.&lt;br&gt;
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547–553, 2009.&lt;/p&gt;

&lt;p&gt;I download the data folder, which contains two datasets: one for the red wine, and the other for the white wine. I build a single dataset, which is the concatenation of the two datasets, as follows.&lt;/p&gt;

&lt;p&gt;I load both datasets as Pandas Dataframes, and, then, I merge them:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
targets = ['red', 'white']&lt;br&gt;
df_list = []&lt;br&gt;
for target in targets:&lt;br&gt;
    df_temp = pd.read_csv(f"../Datasets/winequality-{target}.csv", sep=';')&lt;br&gt;
    df_temp['target'] = target&lt;br&gt;
    df_list.append(df_temp)&lt;br&gt;
df = pd.concat(df_list)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I have added a new column, which contains the original dataset name (red or white). &lt;/p&gt;

&lt;p&gt;The dataset contains 6497 rows and 13 columns.&lt;br&gt;
Now, I define a function, which encodes all the categorical columns:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from sklearn.preprocessing import LabelEncoder&lt;br&gt;
def transform_categorical(data):&lt;br&gt;
    categories = (data.dtypes =="object")&lt;br&gt;
    cat_cols = list(categories[categories].index)&lt;br&gt;
    label_encoder = LabelEncoder()&lt;br&gt;
    for col in cat_cols:&lt;br&gt;
        data[col] = label_encoder.fit_transform(data[col])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I also define another function, which scales numerical columns:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from sklearn.preprocessing import MinMaxScaler&lt;br&gt;
def scale_numerical(data):&lt;br&gt;
    scaler = MinMaxScaler()&lt;br&gt;
    data[data.columns] = scaler.fit_transform(data[data.columns])&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2 Regression
&lt;/h2&gt;

&lt;p&gt;To evaluate a regression model, the most popular metrics are:&lt;/p&gt;
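&lt;p&gt;As a preview of the kind of computation involved, here is a minimal sketch that fits a linear model on synthetic data (a stand-in for the wine dataset, not the article's code) and computes MAE, RMSE, and R²:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the wine dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Three of the most common regression metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(mae, rmse, r2)
```

&lt;p&gt;Lower MAE and RMSE are better, while R² approaches 1 as the model explains more of the variance.&lt;/p&gt;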

&lt;p&gt;Continue Reading on &lt;a href="https://towardsdatascience.com/model-evaluation-in-scikit-learn-abce32ee4a99"&gt;Towards Data Science&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>scikitlearn</category>
      <category>datascience</category>
    </item>
    <item>
      <title>An Overview of Visual Techniques for Exploratory Data Analysis in Python</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Wed, 02 Mar 2022 14:27:49 +0000</pubDate>
      <link>https://dev.to/alod83/an-overview-of-visual-techniques-for-exploratory-data-analysis-in-python-f46</link>
      <guid>https://dev.to/alod83/an-overview-of-visual-techniques-for-exploratory-data-analysis-in-python-f46</guid>
      <description>&lt;p&gt;Before we can carry out our Data Science project, we must first try to understand the data and ask ourselves some questions. Exploratory Data Analysis (EDA) is the preliminary phase of a Data Science project, that allows us to extract important information from the data, understand which questions it can answer, and which ones it cannot.&lt;/p&gt;

&lt;p&gt;We can perform EDA using different techniques, such as visual and quantitative techniques. In this article, we focus on visual techniques. Many different types of graphs can be used to analyze data visually. They include line charts, bar charts, scatter plots, area plots, table charts, histograms, lollipop charts, maps, and much more.&lt;/p&gt;

&lt;p&gt;During the Visual EDA phase, the type of chart we use depends on the type of question we want to answer. We do not focus on aesthetics during this phase, because we are only interested in answering our questions. Aesthetics will be attended to in the final data narrative phase.&lt;br&gt;
We can perform two types of EDA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;univariate analysis, which focuses on a single variable at a time&lt;/li&gt;
&lt;li&gt;multivariate analysis, which focuses on multiple variables at a time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When performing EDA, we can have the following types of variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Numerical — a variable that can be quantified. It can be either discrete or continuous.&lt;/li&gt;
&lt;li&gt;Categorical — a variable that can assume only a limited number of values.&lt;/li&gt;
&lt;li&gt;Ordinal — a categorical variable whose values have a natural order.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, I show you some of the most common visual techniques for EDA through a practical example that uses the matplotlib and seaborn Python libraries. The described concepts are general, so you can easily adapt them to other Python libraries or programming languages.&lt;br&gt;
The article is organized as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setup of the Scenario&lt;/li&gt;
&lt;li&gt;Visual Techniques for Univariate Analysis&lt;/li&gt;
&lt;li&gt;Visual Techniques for Multivariate Analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1 Setup of the Scenario
&lt;/h2&gt;

&lt;p&gt;The purpose of this scenario is to illustrate the main graphs for Visual EDA. As a sample dataset, we use the IT Salary Survey for EU Region, available under the CC0 license. I would like to thank Parul Pandey, who wrote a fantastic article about 5 real-world datasets for EDA. I discovered the dataset used in this article there.&lt;/p&gt;

&lt;p&gt;Firstly we load the dataset as a Pandas dataframe:&lt;br&gt;
&lt;code&gt;import pandas as pd&lt;br&gt;
df = pd.read_csv('../Datasets/IT Salary Survey EU  2020.csv', parse_dates=['Timestamp'])&lt;br&gt;
df.head()&lt;/code&gt;&lt;br&gt;
The dataset contains 1253 rows and the following 23 columns:&lt;br&gt;
'Timestamp', &lt;br&gt;
'Age', &lt;br&gt;
'Gender', &lt;br&gt;
'City', &lt;br&gt;
'Position ',&lt;br&gt;
'Total years of experience', &lt;br&gt;
'Years of experience in Germany',&lt;br&gt;
'Seniority level', &lt;br&gt;
'Your main technology / programming language',&lt;br&gt;
'Other technologies/programming languages you use often',&lt;br&gt;
'Yearly brutto salary (without bonus and stocks) in EUR',&lt;br&gt;
'Yearly bonus + stocks in EUR',&lt;br&gt;
'Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country',&lt;br&gt;
'Annual bonus+stocks one year ago. Only answer if staying in same country',&lt;br&gt;
'Number of vacation days', &lt;br&gt;
'Employment status', &lt;br&gt;
'Сontract duration',&lt;br&gt;
'Main language at work', &lt;br&gt;
'Company size', &lt;br&gt;
'Company type',&lt;br&gt;
'Have you lost your job due to the coronavirus outbreak?',&lt;br&gt;
'Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week',&lt;br&gt;
'Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR'&lt;/p&gt;

&lt;h2&gt;
  
  
  2 Visual Techniques for Univariate Analysis
&lt;/h2&gt;

&lt;p&gt;Univariate Analysis considers a single variable at a time. We can consider two types of univariate analysis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;categorical variables&lt;/li&gt;
&lt;li&gt;numerical variables&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.1 Categorical Variables
&lt;/h3&gt;

&lt;p&gt;The first graph we can plot is the count plot, which counts the frequency of each category. In our example, we can plot the frequency of the Position column, by considering only positions with a frequency greater than 10. Firstly, we create the mask:&lt;br&gt;
&lt;code&gt;mask = df['Position '].value_counts()&lt;br&gt;
df_10 = df[df['Position '].isin(mask.index[mask &amp;gt; 10])]&lt;/code&gt;&lt;br&gt;
Then, we build the graph:&lt;br&gt;
&lt;code&gt;import matplotlib.pyplot as plt&lt;br&gt;
import seaborn as sns&lt;br&gt;
colors = sns.color_palette('rocket_r')&lt;br&gt;
plt.figure(figsize=(15,6))&lt;br&gt;
sns.set(font_scale=1.2)&lt;br&gt;
plt.xticks(rotation = 45)&lt;br&gt;
sns.countplot(x=df_10['Position '], palette=colors)&lt;br&gt;
plt.show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The second type of graph we can plot is the pie chart, which shows the same information as the count plot but also adds the percentage:&lt;br&gt;
&lt;code&gt;values = df_10['Position '].value_counts()&lt;br&gt;
plt.figure(figsize=(10,10))&lt;br&gt;
values.plot(kind='pie', colors = colors,fontsize=17, autopct='%.2f')&lt;br&gt;
plt.legend(labels=values.index, loc="best")&lt;br&gt;
plt.show()&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Numerical Variables
&lt;/h3&gt;

&lt;p&gt;In this case, we may be interested in the data distribution, so we can plot a histogram. A histogram breaks all the possible values into bins and counts how many values fall into each bin. In our example, we can plot a histogram of the top 10 salaries, so we build a mask as follows:&lt;/p&gt;
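&lt;p&gt;To make the binning idea concrete, here is a small sketch using NumPy on synthetic salary values (hypothetical numbers, not the survey data); plt.hist performs the same binning before drawing the bars:&lt;/p&gt;

```python
import numpy as np

# Synthetic salary values standing in for the survey's salary column
rng = np.random.default_rng(0)
salaries = rng.lognormal(mean=11, sigma=0.3, size=1000)

# np.histogram splits the value range into bins and counts how many
# values fall into each bin
counts, bin_edges = np.histogram(salaries, bins=10)
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:10.0f} .. {right:10.0f}: {count}")
```

&lt;p&gt;Each printed row is one bar of the histogram: a bin interval and the number of samples it contains.&lt;/p&gt;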

&lt;p&gt;Continue Reading on &lt;a href="https://towardsdatascience.com/an-overview-of-visual-techniques-for-exploratory-data-analysis-in-python-d35703d43faf"&gt;Towards Data Science&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>datavis</category>
    </item>
    <item>
      <title>Is a Small Dataset Risky?</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Mon, 21 Feb 2022 10:17:37 +0000</pubDate>
      <link>https://dev.to/alod83/is-a-small-dataset-risky-43p6</link>
      <guid>https://dev.to/alod83/is-a-small-dataset-risky-43p6</guid>
<description>&lt;p&gt;Recently I have written an article about the risks of using the &lt;code&gt;train_test_split()&lt;/code&gt; function provided by the scikit-learn Python package. That article raised a lot of comments, some positive and others expressing concerns. My thesis was: be careful when you use the &lt;code&gt;train_test_split()&lt;/code&gt; function, because different seeds may produce very different models.&lt;/p&gt;

&lt;p&gt;The main concern was that the &lt;code&gt;train_test_split()&lt;/code&gt; function does not behave strangely in itself; rather, the problem was that I used a small dataset to demonstrate my thesis.&lt;/p&gt;

&lt;p&gt;In this article, I investigate how the performance of a Linear Regression model varies with the dataset size. In addition, I compare it with the performance obtained by varying the random seed passed to the &lt;code&gt;train_test_split()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;I organize the article as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Possible issues with a small dataset&lt;/li&gt;
&lt;li&gt;Possible countermeasures&lt;/li&gt;
&lt;li&gt;Practical Example&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1 Possible Issues with a Small Dataset
&lt;/h2&gt;

&lt;p&gt;A small dataset is a dataset with a small number of samples. How small depends on the nature of the problem to solve. For example, if we want to analyze the average opinion about a given product, 100,000 reviews may be a lot; but if we use the same number of samples to determine the most discussed topic on Twitter, it is really small.&lt;/p&gt;

&lt;p&gt;Let us suppose that we have a small dataset, i.e. the number of samples is not sufficient to represent our problem. We could encounter at least the following issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outliers — an outlier is a sample that deviates significantly from the rest of the dataset.&lt;/li&gt;
&lt;li&gt;Overfitting — a model performs well on the training set but poorly on the test set.&lt;/li&gt;
&lt;li&gt;Sampling Bias — the dataset does not reflect reality.&lt;/li&gt;
&lt;li&gt;Missing Values — a sample is not complete; some features may be missing.&lt;/li&gt;
&lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2 Possible Countermeasures
&lt;/h2&gt;

&lt;p&gt;One obvious countermeasure to the issue of having a small dataset could be to increase the size of the dataset. We could achieve this result by collecting new data or producing new synthetic data.&lt;/p&gt;

&lt;p&gt;Another possible solution could be using an ensemble approach, where instead of using just one best model, we can train different models and then combine them to get the best model.&lt;br&gt;
Other countermeasures could include the usage of regularization, confidence intervals, and consortium approach, as described in this very interesting article entitled Problems of Small Data and How to Handle Them.&lt;/p&gt;
&lt;h2&gt;
  
  
  3 A Practical Example
&lt;/h2&gt;

&lt;p&gt;In this example, we use the Weather Conditions in World War Two dataset, available on Kaggle under the U.S. Government Works license. The experiment builds a very simple linear regression model that tries to predict the maximum temperature given the minimum temperature.&lt;/p&gt;

&lt;p&gt;We run two batteries of tests: the first varies the dataset size, the second varies the random seed provided as input to the &lt;code&gt;train_test_split()&lt;/code&gt; function.&lt;br&gt;
In the first battery of tests, we run 1190 tests with a variable number of samples (from 100 up to the full dataset size), extracted randomly, and then, for each test, we calculate the Root Mean Squared Error (RMSE).&lt;/p&gt;

&lt;p&gt;In the second battery of tests, we run another 1000 tests with a variable value of random_seed provided as input to &lt;code&gt;train_test_split()&lt;/code&gt;, and we calculate the RMSE. Finally, we compare the results of the two batteries of tests in terms of mean and standard deviation.&lt;/p&gt;
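&lt;p&gt;The first battery of tests can be sketched roughly as follows, here on synthetic data rather than the weather dataset (rmse_for_sample and the sample sizes are illustrative, not the article's code):&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather data: MaxTemp roughly linear in MinTemp
rng = np.random.default_rng(1)
min_temp = rng.uniform(-10, 25, size=5000)
max_temp = 0.9 * min_temp + 10 + rng.normal(scale=2.0, size=5000)
df = pd.DataFrame({"MinTemp": min_temp, "MaxTemp": max_temp})

def rmse_for_sample(n, seed=0):
    """Train on a random subsample of n rows and return the test RMSE."""
    sample = df.sample(n, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        sample[["MinTemp"]], sample["MaxTemp"], random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    return np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# One RMSE per sample size, from small to the full dataset
results = {n: rmse_for_sample(n) for n in [100, 500, 2000, 5000]}
print(results)
```

&lt;p&gt;Repeating this over many sample sizes and seeds, as the article does, shows how stable (or unstable) the RMSE is as the dataset shrinks.&lt;/p&gt;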
&lt;h2&gt;
  
  
  3.1 Load dataset
&lt;/h2&gt;

&lt;p&gt;First, we load the dataset as a Pandas dataframe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
df = pd.read_csv('Summary of Weather.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dataset has 119,040 rows and 31 columns. For our experiment, we use only the MinTemp and MaxTemp columns.&lt;/p&gt;

&lt;p&gt;Continue reading on &lt;a href="https://towardsdatascience.com/is-a-small-dataset-risky-b664b8569a21"&gt;Towards Data Science&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Why You Should Not Trust the train_test_split() Function</title>
      <dc:creator>Angelica Lo Duca</dc:creator>
      <pubDate>Mon, 14 Feb 2022 16:29:59 +0000</pubDate>
      <link>https://dev.to/alod83/why-you-should-not-trust-the-traintestsplit-function-2c74</link>
      <guid>https://dev.to/alod83/why-you-should-not-trust-the-traintestsplit-function-2c74</guid>
<description>&lt;p&gt;Almost all data scientists have tried the &lt;code&gt;train_test_split()&lt;/code&gt; function at least once in their lives. The &lt;code&gt;train_test_split()&lt;/code&gt; function is provided by the scikit-learn Python package. Usually, we do not care much about the effects of using this function, because a single line of code gives us the division of the dataset into two parts, a train and a test set.&lt;/p&gt;

&lt;p&gt;Indeed, using this function carelessly could be dangerous, and in this article I will try to explain why.&lt;/p&gt;

&lt;p&gt;The article is organized as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overview of the train_test_split() function&lt;/li&gt;
&lt;li&gt;Potential risks&lt;/li&gt;
&lt;li&gt;Possible countermeasures&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;1 Overview of the train_test_split() function&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;train_test_split()&lt;/code&gt; function is provided by the model_selection subpackage available under the sklearn package. The function receives as input the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;arrays — the dataset to be split;&lt;/li&gt;
&lt;li&gt;test_size — the size of the test set. It could be either a float or an integer number. If it is a float, it should be a number between 0.0 and 1.0 and represents the proportion of the dataset to include in the test set. If it is an integer, it is the total number of samples to include in the test set. If the test_size is not set, the value is set automatically to the complement of the train size;&lt;/li&gt;
&lt;li&gt;train_size — the size of the train set. Its behavior is complementary to the test_size variable;&lt;/li&gt;
&lt;li&gt;random_state — before the split is applied, the dataset is shuffled. The random_state parameter is an integer that seeds this shuffling, making the experiment reproducible;&lt;/li&gt;
&lt;li&gt;shuffle — it specifies whether to shuffle data before splitting or not. The default value is True;&lt;/li&gt;
&lt;li&gt;stratify — if not None, it specifies the array of class labels to be used for stratified splitting. This makes the split preserve the relative frequency of each class label in both the train and test sets.&lt;/li&gt;
&lt;/ul&gt;
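&lt;p&gt;As a quick sketch of these parameters in action (the toy arrays below are made up for illustration):&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)     # 10 samples, 2 features
y = np.array([0] * 6 + [1] * 4)      # imbalanced class labels (60% / 40%)

# test_size as a float: 30% of the samples go to the test set;
# stratify=y makes both splits preserve the 60/40 class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(len(X_train), len(X_test))  # 7 3
```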

&lt;p&gt;Usually, we copy the example of how to use the &lt;code&gt;train_test_split()&lt;/code&gt; from the scikit-learn documentation and we use it as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We don’t think much about the effects of this function; we just run the line of code and go ahead.&lt;br&gt;
But there are potential risks, which I will show you in the next section.&lt;/p&gt;

&lt;h2&gt;2 Potential Risks&lt;/h2&gt;

&lt;p&gt;Internally, the &lt;code&gt;train_test_split()&lt;/code&gt; function uses a seed to pseudorandomly separate the data into two groups: the training set and the test set.&lt;/p&gt;

&lt;p&gt;The split is pseudorandom because the same seed value always produces the same data subdivision. This is very useful for ensuring the reproducibility of experiments.&lt;br&gt;
Unfortunately, using one seed rather than another can lead to totally different train and test sets, and can even change the performance of the Machine Learning model that receives the training set as input.&lt;/p&gt;
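&lt;p&gt;A minimal sketch of this behavior on toy data: the same seed reproduces exactly the same split, while a different seed shuffles the data differently.&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# The same seed always reproduces exactly the same split...
_, test_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
_, test_a2, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)

# ...while a different seed shuffles the data differently.
_, test_b, _, _ = train_test_split(X, y, test_size=0.3, random_state=7)

print(np.array_equal(test_a, test_a2))  # True: same seed, same split
print(np.array_equal(test_a, test_b))   # almost certainly False: different seed
```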

&lt;p&gt;To understand the problem, let's take an example. &lt;/p&gt;

&lt;p&gt;Continue reading on &lt;a href="https://towardsdatascience.com/why-you-should-not-trust-the-train-test-split-function-47cb9d353ad2"&gt;Towards Data Science&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>scikitlearn</category>
    </item>
  </channel>
</rss>
