<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jesús Bosch Ayguadé</title>
    <description>The latest articles on DEV Community by Jesús Bosch Ayguadé (@jrbayguade).</description>
    <link>https://dev.to/jrbayguade</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3287677%2F356c797e-0bd6-4b74-a39b-0667f1723886.jpg</url>
      <title>DEV Community: Jesús Bosch Ayguadé</title>
      <link>https://dev.to/jrbayguade</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jrbayguade"/>
    <language>en</language>
    <item>
      <title>Create a Pareto chart with ggplot with R</title>
      <dc:creator>Jesús Bosch Ayguadé</dc:creator>
      <pubDate>Tue, 24 Jun 2025 09:19:17 +0000</pubDate>
      <link>https://dev.to/jrbayguade/create-a-pareto-chart-with-ggplot-with-r-2522</link>
      <guid>https://dev.to/jrbayguade/create-a-pareto-chart-with-ggplot-with-r-2522</guid>
      <description>&lt;p&gt;Vilfredo Fritz Pareto was an Italian statistician and sociologist that described the famous Pareto Principle. &lt;/p&gt;

&lt;p&gt;In short, it says that roughly 80% of the outcome is explained by 20% of the causes. That means that if you can identify that small set of root causes, you can address about 80% of your issues.&lt;/p&gt;

&lt;p&gt;In R we can do this with the built-in graphics, or we can go with the far more trendy, fashionable, and powerful library &lt;a href="https://ggplot2.tidyverse.org/" rel="noopener noreferrer"&gt;ggplot2&lt;/a&gt; (which describes itself as based on "the grammar of graphics").&lt;/p&gt;

&lt;p&gt;First of all, we will need the data to feed the chart.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# Sample data: defect types and their frequencies&lt;/code&gt;&lt;br&gt;
&lt;code&gt;defects &amp;lt;- c("A" = 50, "B" = 30, "C" = 15, "D" = 5)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# Sort in descending order&lt;/code&gt;&lt;br&gt;
&lt;code&gt;defects_sorted &amp;lt;- sort(defects, decreasing = TRUE)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# Calculate cumulative percentages&lt;/code&gt;&lt;br&gt;
&lt;code&gt;cumulative_freq &amp;lt;- cumsum(defects_sorted)&lt;/code&gt;&lt;br&gt;
&lt;code&gt;cumulative_pct &amp;lt;- cumulative_freq / sum(defects_sorted) * 100&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df_defects &amp;lt;- data.frame(&lt;/code&gt;&lt;br&gt;
&lt;code&gt;category = names(defects_sorted),&lt;/code&gt;&lt;br&gt;
&lt;code&gt;frequency = as.numeric(defects_sorted),&lt;/code&gt;&lt;br&gt;
&lt;code&gt;cumulative_freq,&lt;/code&gt;&lt;br&gt;
&lt;code&gt;cumulative_pct&lt;/code&gt;&lt;br&gt;
&lt;code&gt;)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;What happened above? We stored the different defect types and their frequencies in a named vector (A happens 50 times, B 30 times, etc.).&lt;/p&gt;

&lt;p&gt;After that, we sorted the defects in descending order. Remember that we want to find the few root causes behind 80% of the trouble, and that is easier to see if the largest ones come first.&lt;/p&gt;

&lt;p&gt;Then we completed a typical frequency table with the cumulative frequency and the cumulative relative frequency (the data we want to show in the chart later on), and put everything into a data frame.&lt;/p&gt;

&lt;p&gt;Now we have a beautiful data frame that contains the following data:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;category frequency cumulative_freq cumulative_pct&lt;br&gt;
A        A        50              50             50&lt;br&gt;
B        B        30              80             80&lt;br&gt;
C        C        15              95             95&lt;br&gt;
D        D         5             100            100&lt;/p&gt;
&lt;/blockquote&gt;
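
&lt;p&gt;One caveat before plotting: ggplot2 places character values on the x axis in alphabetical order, and that only matches the Pareto order here because A, B, C and D happen to be sorted by frequency already. Below is a minimal sketch, using made-up category names (an assumption, not data from this article), that pins the order by turning category into a factor:&lt;/p&gt;

```r
# Hypothetical defect counts whose names are NOT in frequency order
defects = c("Scratch" = 15, "Dent" = 50, "Crack" = 5, "Stain" = 30)
defects_sorted = sort(defects, decreasing = TRUE)

df_defects = data.frame(
  # Factor levels follow the sorted order, so the bars keep that order
  category = factor(names(defects_sorted), levels = names(defects_sorted)),
  frequency = as.numeric(defects_sorted),
  cumulative_freq = as.numeric(cumsum(defects_sorted)),
  cumulative_pct = as.numeric(100 * cumsum(defects_sorted) / sum(defects_sorted))
)

# Shows the order the categories will take on the x axis
print(levels(df_defects$category))
```

&lt;p&gt;Without the factor, these bars would probably be drawn as Crack, Dent, Scratch, Stain, and the chart would no longer read from largest to smallest.&lt;/p&gt;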

&lt;p&gt;If you don't have the ggplot2 package installed, simply install it with the following instruction:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;install.packages("ggplot2")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then, let the magic happen:&lt;br&gt;
&lt;code&gt;library(ggplot2)&lt;/code&gt;&lt;br&gt;
&lt;code&gt;ggplot(data = df_defects, mapping = aes(x = category, y =  frequency)) +&lt;/code&gt;&lt;br&gt;
&lt;code&gt;geom_col() +&lt;/code&gt;&lt;br&gt;
&lt;code&gt;geom_line(mapping = aes(x = category, y = cumulative_pct), group = 1,&lt;/code&gt;&lt;br&gt;
&lt;code&gt;colour = "red", size = 3) +&lt;/code&gt;&lt;br&gt;
&lt;code&gt;xlab("Category") +&lt;/code&gt;&lt;br&gt;
&lt;code&gt;ylab("Frequency")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ggplot&lt;/strong&gt; can look a bit scary at first, but believe me, it makes perfect sense once you get past the short initial learning curve. For now let's look at the (beautiful) output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxjpaenyhhpmipr9p774.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxjpaenyhhpmipr9p774.png" alt="Pareto Chart output using ggplot" width="800" height="809"&gt;&lt;/a&gt;&lt;/p&gt;

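&lt;p&gt;As you can see, the red line shows the cumulative percentage above every category. In this example, the first two columns on the left (A and B) already reach 80%, and the third takes us to 95%. Note that the line lands at the right heights only because the frequencies here add up to 100, so counts and percentages share the same scale.&lt;/p&gt;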
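
&lt;p&gt;The same reading can be done in code rather than by eye. This is only a sketch: the 80 cut-off and the names threshold, n_needed and vital_few are my own choices for illustration.&lt;/p&gt;

```r
defects = c("A" = 50, "B" = 30, "C" = 15, "D" = 5)
defects_sorted = sort(defects, decreasing = TRUE)
cumulative_pct = 100 * cumsum(defects_sorted) / sum(defects_sorted)

threshold = 80  # assumed cut-off, taken from the 80/20 rule
# Position of the first category whose cumulative share reaches the threshold
n_needed = unname(which(cumulative_pct >= threshold)[1])
# Names of the categories up to and including that position
vital_few = names(defects_sorted)[seq_len(n_needed)]

print(vital_few)
```

&lt;p&gt;With the sample data this selects A and B, the two categories that together account for 80% of the defects.&lt;/p&gt;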

&lt;p&gt;Going back to the code.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;ggplot() creates an empty canvas and tells it which dataset to use (df_defects) and how to map the data (categories go on the x axis and frequencies on the y axis). &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;geom_col() draws the actual bars based on those mappings. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;geom_line() adds a red line showing cumulative percentages, with group = 1 telling ggplot to connect all points into one continuous line, and size = 3 making it thick and visible (ggplot2 3.4.0 and later prefer linewidth = 3 for lines and warn when size is used).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;xlab() and ylab() add descriptive labels to the axes so readers know what they're looking at.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
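
&lt;p&gt;When the counts do not add up to 100, the percentage line no longer fits the frequency axis. One possible way around it, sketched here with hypothetical counts and a variable name (df_counts) that is not part of the original example, is to draw the line in count units and let a secondary axis relabel it as a percentage:&lt;/p&gt;

```r
library(ggplot2)

# Hypothetical counts that do not add up to 100
df_counts = data.frame(
  category = factor(c("A", "B", "C", "D"), levels = c("A", "B", "C", "D")),
  frequency = c(120, 60, 15, 5)
)
total = sum(df_counts$frequency)
df_counts$cumulative_freq = cumsum(df_counts$frequency)

p = ggplot(df_counts, aes(x = category)) +
  geom_col(aes(y = frequency)) +
  # The line is drawn in count units so it shares the primary axis
  geom_line(aes(y = cumulative_freq, group = 1), colour = "red") +
  geom_point(aes(y = cumulative_freq), colour = "red") +
  # The right-hand axis relabels those counts as percentages of the total
  scale_y_continuous(
    name = "Frequency",
    sec.axis = sec_axis(~ . / total * 100, name = "Cumulative %")
  ) +
  xlab("Category")

print(p)
```

&lt;p&gt;The bars and the line still share one scale; only the labels on the right differ, so the top of the line lines up with 100% whatever the total is.&lt;/p&gt;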

&lt;p&gt;In summary, and this blew my mind at the beginning because it differs from other methods I am more used to: in ggplot2 you build visualizations by adding layers with the + operator, like stacking transparent sheets.&lt;/p&gt;

&lt;p&gt;Each function adds a specific element. You start with ggplot(), which creates the foundation, geom_col() adds bars, geom_line() overlays a line, and xlab() or ylab() add text labels.&lt;/p&gt;

&lt;p&gt;This modular approach lets you combine different visual elements incrementally, reaching high levels of customization and beauty if you have some decent taste.&lt;/p&gt;
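
&lt;p&gt;To make the layering idea concrete, here is a small sketch that stores each intermediate step in its own variable. The names p, p_bars, p_line and p_final are mine, chosen only for illustration.&lt;/p&gt;

```r
library(ggplot2)

df_defects = data.frame(
  category = c("A", "B", "C", "D"),
  frequency = c(50, 30, 15, 5),
  cumulative_pct = c(50, 80, 95, 100)
)

# Foundation: data plus aesthetic mappings, no geometry yet
p = ggplot(df_defects, aes(x = category, y = frequency))
# Each + returns a new plot object with one more component
p_bars = p + geom_col()
p_line = p_bars + geom_line(aes(y = cumulative_pct), group = 1, colour = "red")
p_final = p_line + xlab("Category") + ylab("Frequency")

print(p_final)
```

&lt;p&gt;Printing p_bars or p_line instead shows the chart as it stood at that step, which can help when working out what a single layer contributes.&lt;/p&gt;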

&lt;p&gt;If you want to learn more about ggplot, I recommend &lt;a href="https://dev.to/iamdurga/r-exercise-getting-started-with-ggplot-in-r-1a4c"&gt;this article by another dev.to user&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>r</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>statistics</category>
    </item>
  </channel>
</rss>
