<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Barbara</title>
    <description>The latest articles on DEV Community by Barbara (@barbara).</description>
    <link>https://dev.to/barbara</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F139641%2F57beaebb-75c9-418a-8488-9cc8702d50d4.jpeg</url>
      <title>DEV Community: Barbara</title>
      <link>https://dev.to/barbara</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/barbara"/>
    <language>en</language>
    <item>
      <title>Data Visualisation Basics</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Fri, 06 Sep 2024 13:47:02 +0000</pubDate>
      <link>https://dev.to/barbara/data-visualisation-basics-2moa</link>
      <guid>https://dev.to/barbara/data-visualisation-basics-2moa</guid>
      <description>&lt;h1&gt;
  
  
  Why use data vis
&lt;/h1&gt;

&lt;p&gt;When you need to work with a new data source that contains a huge amount of data, data visualization can be essential for understanding the data better.&lt;br&gt;
The data analysis process typically follows five steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract - Obtain the data from a spreadsheet, SQL, the web, etc. &lt;/li&gt;
&lt;li&gt;Clean - Here we could use exploratory visuals. &lt;/li&gt;
&lt;li&gt;Explore - Here we use exploratory visuals. &lt;/li&gt;
&lt;li&gt;Analyze - Here we might use either exploratory or explanatory visuals. &lt;/li&gt;
&lt;li&gt;Share - Here is where explanatory visuals live. &lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Types of data
&lt;/h1&gt;

&lt;p&gt;To be able to choose an appropriate plot for a given measure, it is important to know what data you are dealing with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qualitative aka categorical types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Nominal qualitative data
&lt;/h3&gt;

&lt;p&gt;Labels with no order or rank associated with the items themselves.&lt;br&gt;
&lt;strong&gt;Examples&lt;/strong&gt;: Gender, marital status, menu items&lt;/p&gt;

&lt;h3&gt;
  
  
  Ordinal qualitative data
&lt;/h3&gt;

&lt;p&gt;Labels that have an order or ranking.&lt;br&gt;
&lt;strong&gt;Examples&lt;/strong&gt;: letter grades, rating&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantitative aka numeric types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Discrete quantitative values
&lt;/h3&gt;

&lt;p&gt;Numbers that cannot be split into smaller units.&lt;br&gt;
&lt;strong&gt;Examples&lt;/strong&gt;: Pages in a Book, number of trees in a park&lt;/p&gt;

&lt;h3&gt;
  
  
  Continuous quantitative values
&lt;/h3&gt;

&lt;p&gt;Numbers that can be split into smaller units.&lt;br&gt;
&lt;strong&gt;Examples&lt;/strong&gt;: Height, Age, Income, Workhours&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary Statistics
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Numerical Data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mean&lt;/strong&gt;: The average value.&lt;br&gt;
&lt;strong&gt;Median&lt;/strong&gt;: The middle value when the data is sorted.&lt;br&gt;
&lt;strong&gt;Mode&lt;/strong&gt;: The most frequently occurring value.&lt;br&gt;
&lt;strong&gt;Variance/Standard Deviation&lt;/strong&gt;: Measures of spread or dispersion.&lt;br&gt;
&lt;strong&gt;Range&lt;/strong&gt;: Difference between the maximum and minimum values.&lt;/p&gt;
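&lt;p&gt;As a quick sketch, these statistics can be computed with Python's standard &lt;code&gt;statistics&lt;/code&gt; module (the sample values here are made up):&lt;/p&gt;

```python
import statistics

# Hypothetical sample of daily work hours
values = [6, 7, 7, 8, 8, 8, 9, 10]

mean = statistics.mean(values)            # average value
median = statistics.median(values)        # middle value of the sorted data
mode = statistics.mode(values)            # most frequent value
variance = statistics.pvariance(values)   # spread around the mean
stdev = statistics.pstdev(values)         # square root of the variance
value_range = max(values) - min(values)   # max minus min

print(mean, median, mode, value_range)  # 7.875 8.0 8 4
```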

&lt;h2&gt;
  
  
  Categorical Data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frequency&lt;/strong&gt;: The count of occurrences of each category.&lt;br&gt;
&lt;strong&gt;Mode&lt;/strong&gt;: The most frequent category.&lt;/p&gt;

&lt;h1&gt;
  
  
  Visualizations
&lt;/h1&gt;

&lt;p&gt;Visualizations let you gain insights into a new data source very quickly and make connections between different data types easier to see.&lt;br&gt;
If you only use standard statistics to summarize your data, you get the min, max, mean, median and mode, but these can be misleading. Anscombe's Quartet shows this: all four datasets have nearly the same mean and deviation, but their distributions are completely different.&lt;/p&gt;

&lt;p&gt;In data visualization, we have two types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Exploratory data visualization
We use this to get insights about the data. It does not need to be visually appealing.&lt;/li&gt;
&lt;li&gt;Explanatory data visualization
These visualizations need to be accurate, insightful and visually appealing, as they are presented to an audience.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Chart Junk, Data Ink Ratio and Design Integrity
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Chart Junk
&lt;/h3&gt;

&lt;p&gt;To read the information provided by a plot without distraction, it is important to avoid chart junk, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Heavy grid lines&lt;/li&gt;
&lt;li&gt;Pictures in the visuals&lt;/li&gt;
&lt;li&gt;Shades &lt;/li&gt;
&lt;li&gt;3d components&lt;/li&gt;
&lt;li&gt;Ornaments&lt;/li&gt;
&lt;li&gt;Superfluous texts
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jzv9thonqt1w0c0s648.png" alt="Image description" width="737" height="391"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Ink Ratio
&lt;/h3&gt;

&lt;p&gt;The less chart junk a visual contains, the higher its data-ink ratio. This simply means: the more of the "ink" in the visual is used to convey the message of the data, the better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design Integrity
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Lie Factor&lt;/strong&gt; is calculated as:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
\text{Lie Factor} = \frac{\text{Size of effect shown in graphic}}{\text{Size of effect in data}}&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;The size of an effect here means the relative change, i.e. the difference (delta) divided by the initial value. So the lie factor is the relative change shown in the graphic divided by the actual relative change in the data. Ideally it should be 1. If it is not, there is a mismatch between the way the data is presented and the actual change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dzp2loid56sxw970la3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dzp2loid56sxw970la3.png" alt="Image description" width="455" height="651"&gt;&lt;/a&gt;&lt;br&gt;
In the example above, taken from the wiki, the lie factor is 3 when comparing the pixel sizes of the doctor symbols, which represent the number of doctors in California.&lt;/p&gt;
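&lt;p&gt;As a minimal sketch, the lie factor can be computed from two relative changes (the numbers below are made up, not the ones from the doctor graphic):&lt;/p&gt;

```python
def lie_factor(shown_change, actual_change):
    """Size of the effect shown in the graphic divided by the
    size of the effect in the data (both as relative changes)."""
    return shown_change / actual_change

# Hypothetical example: the data grows by 50%,
# but the bar in the graphic grows by 150%.
print(lie_factor(1.5, 0.5))  # 3.0 -> the graphic exaggerates threefold
print(lie_factor(0.5, 0.5))  # 1.0 -> faithful representation
```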

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4ql12ba4142v6pnb34f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4ql12ba4142v6pnb34f.png" alt="Image description" width="726" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tidy data
&lt;/h3&gt;

&lt;p&gt;Make sure your data is cleaned properly and ready to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;each variable is a column&lt;/li&gt;
&lt;li&gt;each observation is a row&lt;/li&gt;
&lt;li&gt;each type of observational unit is a table&lt;/li&gt;
&lt;/ul&gt;
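&lt;p&gt;A small pandas sketch of tidying (the table and column names are made up; &lt;code&gt;melt&lt;/code&gt; turns one column per year into one row per observation):&lt;/p&gt;

```python
import pandas as pd

# Untidy: one column per year, so "year" is hidden in the column names
wide = pd.DataFrame({
    "city": ["Vienna", "Graz"],
    "2022": [100, 80],
    "2023": [110, 85],
})

# Tidy: each variable (city, year, count) is a column,
# each observation is a row
tidy = wide.melt(id_vars="city", var_name="year", value_name="count")
print(tidy)
```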

&lt;h1&gt;
  
  
  Univariate Exploration of Data
&lt;/h1&gt;

&lt;p&gt;This refers to the analysis of a single variable (or feature) in a dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bar Chart
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;always start the value axis at 0 to present values in a truly comparable way.&lt;/li&gt;
&lt;li&gt;sort nominal data, e.g. by frequency&lt;/li&gt;
&lt;li&gt;don't re-sort ordinal data: keeping the inherent order of the categories is more important than putting the most frequent one first&lt;/li&gt;
&lt;li&gt;if you have a lot of categories, use a horizontal bar chart with the categories on the y-axis to keep it readable.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5h7rsvt3iwlcm0v64up.png" alt="Image description" width="638" height="416"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonrwjex4yh6pfb52updb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonrwjex4yh6pfb52updb.png" alt="Image description" width="638" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3j6j404jgkpdm23a1my.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3j6j404jgkpdm23a1my.png" alt="Image description" width="715" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fuorc1r4ejizypq0kgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fuorc1r4ejizypq0kgc.png" alt="Image description" width="705" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Histogram
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;the quantitative version of a bar chart, used to plot numeric values. &lt;/li&gt;
&lt;li&gt;values are grouped into continuous bins, and one bar is plotted for each bin
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhfjs7yph34f3i6vn70u.png" alt="Image description" width="705" height="395"&gt;
&lt;/li&gt;
&lt;/ul&gt;
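&lt;p&gt;What a histogram does under the hood can be sketched in a few lines of plain Python (the data and bin edges are made up):&lt;/p&gt;

```python
def histogram_counts(values, bins, low, high):
    """Group values into equal-width bins over [low, high] and count them."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # min() clamps the top edge so `high` itself lands in the last bin
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

data = [1.2, 1.9, 2.5, 3.1, 3.4, 3.7, 4.8]
print(histogram_counts(data, bins=4, low=1.0, high=5.0))  # [2, 1, 3, 1]
```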

&lt;h2&gt;
  
  
  KDE - Kernel Density Estimation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;smooths the data with a kernel, often a Gaussian (normal) distribution, to estimate the density at each point.&lt;/li&gt;
&lt;li&gt;KDE plots can reveal trends and the shape of the distribution more clearly, especially for data that is not uniformly distributed.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gnehnv8tclwg4qi162a.png" alt="Image description" width="435" height="261"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pie Chart and Donut Plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;data needs to be in relative frequencies&lt;/li&gt;
&lt;li&gt;pie charts work best with at most three slices. With more wedges the chart becomes unreadable and the amounts are hard to compare; in that case prefer a bar chart.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18guy4lpqyqximu90w69.png" alt="Image description" width="484" height="899"&gt;
&lt;/li&gt;
&lt;/ul&gt;
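&lt;p&gt;Turning raw counts into the relative frequencies a pie chart needs can be sketched like this (the survey answers are made up):&lt;/p&gt;

```python
from collections import Counter

# Hypothetical survey answers
answers = ["yes", "yes", "no", "yes", "undecided", "no", "yes", "yes"]

counts = Counter(answers)
total = sum(counts.values())
# Each slice of the pie is the category's share of the whole
shares = {category: count / total for category, count in counts.items()}

print(shares)  # {'yes': 0.625, 'no': 0.25, 'undecided': 0.125}
```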

&lt;h1&gt;
  
  
  Bivariate Exploration of Data
&lt;/h1&gt;

&lt;p&gt;Analyzes the relationship between two variables in a dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Clustered Bar Charts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;displays the relationship between two categorical values. The bars are organized in clusters based on the level of the first variable.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5605tuz2v6bdsoqwui48.png" alt="Image description" width="710" height="443"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scatterplots
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;each data point is plotted individually as a point, its x-position corresponding to one feature value and its y-position corresponding to the second.&lt;/li&gt;
&lt;li&gt;if the plot suffers from overplotting (too many datapoints overlap): you can use transparency and jitter (every point is moved slightly from its true value)
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghhmhosc8r2mvskl2gwh.png" alt="Image description" width="686" height="391"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Heatmaps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;the 2D version of a histogram&lt;/li&gt;
&lt;li&gt;data points are placed with their x-position corresponding to one feature value and their y-position corresponding to the second.&lt;/li&gt;
&lt;li&gt;the plotting area is divided into a grid; the points in each cell are counted, and the counts are indicated by color
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffei2gsefu1bgv1hz2m5l.png" alt="Image description" width="672" height="380"&gt;
&lt;/li&gt;
&lt;/ul&gt;
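&lt;p&gt;The grid counting behind a heatmap can be sketched as a 2D variant of the histogram idea (the points and grid are made up):&lt;/p&gt;

```python
def heatmap_counts(points, cells, low, high):
    """Divide the square [low, high] x [low, high] into a grid of
    cells x cells and count the points that fall into each cell."""
    width = (high - low) / cells
    grid = [[0] * cells for _ in range(cells)]
    for x, y in points:
        col = min(int((x - low) / width), cells - 1)
        row = min(int((y - low) / width), cells - 1)
        grid[row][col] += 1
    return grid

points = [(0.5, 0.5), (0.6, 0.4), (1.5, 1.5), (0.2, 1.8)]
print(heatmap_counts(points, cells=2, low=0.0, high=2.0))  # [[2, 0], [1, 1]]
```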

&lt;h2&gt;
  
  
  Violin plots
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;show the relationship between quantitative (numerical) and qualitative (categorical) variables on a lower level of abstraction.&lt;/li&gt;
&lt;li&gt;the distribution is plotted like a kernel density estimate, so we get a clear picture of its shape.&lt;/li&gt;
&lt;li&gt;to display the key statistics at the same time, you can embed a box plot in a violin plot.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4fwm0r6i7l8yuuarttv.png" alt="Image description" width="682" height="415"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Box plots
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;it also plots the relationship between quantitative (numerical) and qualitative (categorical) variables on a lower level of abstraction.&lt;/li&gt;
&lt;li&gt;compared to the violin plot, the box plot leans more on the summarization of the data, primarily just reporting a set of descriptive statistics for the numeric values on each categorical level.&lt;/li&gt;
&lt;li&gt;it visualizes the five-number summary of the data: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key elements of a boxplot:&lt;br&gt;
&lt;strong&gt;Box&lt;/strong&gt;: The central part of the plot represents the interquartile range (IQR), which is the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). This contains the middle 50% of the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Median Line&lt;/strong&gt;: Inside the box, a line represents the median (Q2, 50th percentile) of the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Whiskers&lt;/strong&gt;: Lines extending from the box, known as "whiskers," show the range of the data that lies within 1.5 times the IQR from Q1 and Q3. They typically extend to the smallest and largest values within this range.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outliers&lt;/strong&gt;: Any data points that fall outside 1.5 times the IQR are considered outliers and are often represented by individual dots or marks beyond the whiskers.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjqcb0zp1l87ik6b9sa7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjqcb0zp1l87ik6b9sa7.png" alt="Image description" width="682" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Combined Violin and Box Plot
&lt;/h2&gt;

&lt;p&gt;The violin plot shows the density across different categories, and the boxplot provides the summary statistics&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8a8uqgnlxu9o2ck0n8xb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8a8uqgnlxu9o2ck0n8xb.png" alt="Image description" width="707" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Faceting
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;the data is divided into disjoint subsets, most often by the levels of a categorical variable. For each subset, the same plot type is rendered on the other variables, e.g. several histograms side by side, one for each categorical value.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvejflge0gd205olgwez.png" alt="Image description" width="709" height="327"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Line plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;used to plot the trend of one numeric variable against a second variable.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplpmm7qzelacxaalm7tn.png" alt="Image description" width="710" height="345"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quantile-Quantile (Q-Q) plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;is a type of plot used to compare the distribution of a dataset with a theoretical distribution (like a normal distribution) or to compare two datasets to check if they follow the same distribution.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mirars5qgdp44myrc82.png" alt="Image description" width="404" height="397"&gt;
&lt;/li&gt;
&lt;/ul&gt;
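&lt;p&gt;The point pairs behind a Q-Q plot against a normal distribution can be sketched with &lt;code&gt;statistics.NormalDist&lt;/code&gt; (the sample values are made up):&lt;/p&gt;

```python
import statistics

def qq_points(sample):
    """Pair each sorted sample value with the standard-normal quantile
    at the same rank; points near a straight line suggest normality."""
    n = len(sample)
    normal = statistics.NormalDist()
    return [
        (normal.inv_cdf((i + 0.5) / n), value)
        for i, value in enumerate(sorted(sample))
    ]

for theoretical, observed in qq_points([4.8, 5.1, 4.9, 5.3, 5.0]):
    print(round(theoretical, 2), observed)
```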

&lt;h2&gt;
  
  
  Swarm plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Similar to a scatterplot, each data point is plotted with its position according to its value on the two variables being plotted. Instead of randomly jittering points as in a normal scatterplot, points are placed as close to their actual value as possible without allowing any overlap. 
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiy2i5orr057rsaraa73.png" alt="Image description" width="636" height="414"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Spider plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;compares multiple variables across different categories on a radial grid. Also known as a radar chart.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rr4532l84yecehatkxn.png" alt="Image description" width="616" height="552"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Useful links
&lt;/h1&gt;

&lt;h2&gt;
  
  
  My sample notebook
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/BarbaraJoebstl/data_vis/" rel="noopener noreferrer"&gt;Sample Code&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Libs used for the sample plots:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://matplotlib.org/" rel="noopener noreferrer"&gt;Matplotlib&lt;/a&gt;: a versatile library for visualizations, but it can take some code effort to put together common visualizations.&lt;/li&gt;
&lt;li&gt; &lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt;: built on top of matplotlib, adds a number of functions to make common statistical visualizations easier to generate.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;pandas&lt;/a&gt;: while this library includes some convenient methods for visualizing data that hook into matplotlib, we'll mainly be using it for its main purpose as a general tool for working with data (&lt;a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf" rel="noopener noreferrer"&gt;https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further reading:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Anscombe's Quartet: same stats for the data, but different distributions: &lt;a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Anscombe%27s_quartet&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chartjunk: &lt;a href="https://en.wikipedia.org/wiki/Chartjunk" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Chartjunk&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Data Ink Ratio: &lt;a href="https://infovis-wiki.net/wiki/Data-Ink_Ratio" rel="noopener noreferrer"&gt;https://infovis-wiki.net/wiki/Data-Ink_Ratio&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Lie factor: &lt;a href="https://infovis-wiki.net/wiki/Lie_Factor" rel="noopener noreferrer"&gt;https://infovis-wiki.net/wiki/Lie_Factor&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Tidy data: &lt;a href="https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html" rel="noopener noreferrer"&gt;https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Colorblind-friendly visualizations: &lt;a href="https://www.tableau.com/blog/examining-data-viz-rules-dont-use-red-green-together" rel="noopener noreferrer"&gt;https://www.tableau.com/blog/examining-data-viz-rules-dont-use-red-green-together&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datavis</category>
      <category>python</category>
      <category>scrollwithme</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>My K8s Cheatsheet</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Fri, 12 Jan 2024 12:48:18 +0000</pubDate>
      <link>https://dev.to/barbara/my-k8s-cheatsheet-2d8p</link>
      <guid>https://dev.to/barbara/my-k8s-cheatsheet-2d8p</guid>
      <description>&lt;p&gt;In this cheatsheet I summed up the most used commands. &lt;br&gt;
In doubt you can always consult&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubectl --help&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/home/"&gt;The K8s documentation&lt;/a&gt;
or play around out on &lt;a href="https://killercoda.com/killer-shell-ckad/"&gt;killercoda&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  General Helper
&lt;/h2&gt;

&lt;p&gt;Add aliases and functions to your .bashrc to save time and avoid repetitive typing:&lt;/p&gt;
&lt;h3&gt;
  
  
  Aliases and Functions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Alias for kubectl&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;alias k='kubectl'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do a dry-run and output it as yaml&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;export do="-o yaml --dry-run=client"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k create deployment test --image="nginx:alpine" $do &amp;gt; deployment.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do it immediately&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;export now="--force --grace-period=0"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k delete deployment test $now&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set the namespace&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kn(){
kubectl config set-context --current --namespace="$1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;call it like: &lt;code&gt;kn crazynamespace&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run a command from a temp container&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tmp(){
 kubectl run tmp --image="nginx:alpine" -i --rm --restart=Never -- sh -c "$1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;call it like: &lt;code&gt;tmp "curl http://servicename.namespace:port"&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubectl commands
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Get a configuration as .yaml
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k get deployment -o yaml &amp;gt; depl.yaml&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k get pod -o yaml &amp;gt; pod.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k create deployment depl1 --image=nginx $do &amp;gt; depl.yaml&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k run pod1 --image=nginx $do &amp;gt; pod.yaml&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pod
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Create a pod that has a command:
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k run pod1 --image=imagetouse $do --command -- sh -c "commandlinecommand" &amp;gt; pod.yaml&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Search pods in a namespace for a label
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k get pod -o yaml | grep searchitem&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Create a service for a pod
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k expose pod podname --name=servicename --port=3333 --target-port=3333&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Serviceaccount
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k create serviceaccount your-service-account&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  add to pod
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: Pod
metadata:
    name: yourpod
    namespace: yourns
spec:
    serviceAccountName: your-service-account
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Secrets
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k get secrets&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k create secret generic mysecret --from-literal=key=value&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k create secret generic mysecret --from-file=path/to/file&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k get secret mysecret -o jsonpath='{.data.yourKey}' | base64 -d &amp;gt; supersecret.txt&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Configmaps
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k create configmap myconfigmap --from-literal=key=value $do &amp;gt; configmap.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k create configmap myconfigmap --from-file=path/to/file $do &amp;gt; configmap.yaml&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Clusterrole
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k create clusterrole myclusterrole --verb=get,list,create,delete --resource=tralala&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Clusterrolebinding
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k create clusterrolebinding my-cluster-role-binding --clusterrole=my-cluster-role --serviceaccount=default:my-service-account&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k -n kubernetes-dashboard create sa admin-user&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k create clusterrolebinding admin-user --clusterrole cluster-admin --serviceaccount kubernetes-dashboard:admin-user&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k -n kubernetes-dashboard create token admin-user&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Patch
&lt;/h3&gt;

&lt;p&gt;To add a selector to an existing service:&lt;br&gt;
&lt;code&gt;k patch service old-app -p '{"spec":{"selector":{"app": "new-app"}}}'&lt;/code&gt;&lt;br&gt;
You can patch anything; you just need to know the nesting level of the field.&lt;/p&gt;

&lt;h3&gt;
  
  
  Label and Annotate
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k label pod -l type=runner another=label&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k annotate pod -l type=runner type="i am a great type"&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Expose
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k expose deployment example --port=8765 --target-port=9376 \&lt;br&gt;
        --name=example-service --type=LoadBalancer&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;k expose pod podname --name=servicename --port=3333 --target-port=3333 --type=NodePort&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Curl with temp pod to test
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k run tmp --restart=Never --rm --image=nginx:alpine -i -- curl http://servicename.namespace:port&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ROLLOUTS
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Rollouts and rollbacks
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k get deploy&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k rollout history deploy deploymentname&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k rollout undo deploy deploymentname&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Rolling update
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k scale deploy/dev-web --replicas=4&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k edit deployment yourdeployment&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Canary rollout
&lt;/h4&gt;

&lt;p&gt;Run both versions behind the same service, with the new version on a small share of the replicas, e.g.:&lt;br&gt;
depl1 (new version): replicas: 2&lt;br&gt;
depl2 (current version): replicas: 8&lt;/p&gt;

&lt;h4&gt;
  
  
  Green Blue deployment
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;deploy both versions&lt;/li&gt;
&lt;li&gt;update the service selector to switch to the new version&lt;/li&gt;
&lt;li&gt;scale down the old deployment&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Scale a deployment
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;k scale deployment/my-nginx --replicas=1&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k autoscale deployment/my-nginx --min=1 --max=3&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k get pods -l app=nginx&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;p&gt;There is no &lt;code&gt;kubectl create&lt;/code&gt; generator for PersistentVolumes, PersistentVolumeClaims or StorageClasses, so copy a manifest template from the docs and apply it:&lt;br&gt;
&lt;code&gt;k apply -f pv.yaml&lt;/code&gt;&lt;br&gt;
&lt;code&gt;k apply -f pvc.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Get pv and pvc at the same time to see if everything is working:&lt;br&gt;
&lt;code&gt;k get pv,pvc&lt;/code&gt;&lt;br&gt;
If the status is Bound (here with storageClass manual), everything is working.&lt;br&gt;
If a StorageClass is needed, write its manifest by hand as well and apply it with &lt;code&gt;k apply -f sc.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Troubleshooting
&lt;/h3&gt;

&lt;h4&gt;
  
  
  try to call outside:
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k exec frontend-789cbdc677-c9v8h -- wget -O- www.google.com&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  check if env variables exist in a pod
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k exec pod1 -- env | grep "&amp;lt;key&amp;gt;=&amp;lt;value&amp;gt;"&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  check if volume is mounted
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;k exec pod1 -- cat /path/to/mount&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  PODMAN
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;podman build -t super:v1 .&lt;/code&gt;&lt;br&gt;
&lt;code&gt;podman run --name my-container super:v1&lt;/code&gt;&lt;br&gt;
&lt;code&gt;podman save -o /path/to/output/myimage.tar super:v1&lt;/code&gt;&lt;br&gt;
(Podman uses the OCI image format by default; Docker does not)&lt;/p&gt;

&lt;h2&gt;
  
  
  HELM
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;helm repo&lt;/code&gt;&lt;br&gt;
&lt;code&gt;helm repo list&lt;/code&gt;&lt;br&gt;
&lt;code&gt;helm repo update&lt;/code&gt;&lt;br&gt;
&lt;code&gt;helm search repo whatever&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;helm -n yourns upgrade releasename chartname&lt;/code&gt;&lt;br&gt;
&lt;code&gt;helm -n yourns install releasename chartname --set replicaCount=2&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Useful links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;k --help&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/home/"&gt;https://kubernetes.io/docs/home/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://killercoda.com/killer-shell-ckad/"&gt;https://killercoda.com/killer-shell-ckad/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>cheatsheet</category>
      <category>cmd</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes Troubleshooting</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Tue, 21 Nov 2023 08:00:00 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-troubleshooting-575p</link>
      <guid>https://dev.to/barbara/kubernetes-troubleshooting-575p</guid>
      <description>&lt;p&gt;With Kubernetes large and diverse workloads can be handled.&lt;br&gt;
To keep track of all these processes, monitoring is essential.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring
&lt;/h2&gt;

&lt;p&gt;To monitor the application you need to collect metrics, like CPU, memory, disk usage and bandwidth on your nodes.&lt;/p&gt;

&lt;p&gt;Because Kubernetes is a distributed system, it needs to be monitored and traced cluster-wide. &lt;/p&gt;

&lt;p&gt;You can use external tools like &lt;strong&gt;Prometheus&lt;/strong&gt; and visualize the metrics with &lt;strong&gt;Grafana&lt;/strong&gt;. But to get started I recommend using the &lt;strong&gt;Kubernetes dashboard&lt;/strong&gt;, as it is very easy to set up and gives you a default user interface with the most important metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging
&lt;/h2&gt;

&lt;p&gt;If you have aggregated logs, you can visualize issues and search the logs for issues. &lt;/p&gt;

&lt;p&gt;In Kubernetes the kubelet writes container logs to local files. With the command &lt;code&gt;kubectl logs&lt;/code&gt; you can see these logs.&lt;/p&gt;

&lt;p&gt;If you want to perform cluster-wide logging, you can use &lt;strong&gt;Fluentd&lt;/strong&gt; to aggregate logs.&lt;br&gt;
Fluentd agents run on each node via a DaemonSet and feed the logs to an Elasticsearch instance prior to visualization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Errors in the container
&lt;/h3&gt;

&lt;p&gt;If you are not sure where to start, run&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl describe pod your-pod&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will report &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the overall status of the pod: running, pending or an error state&lt;/li&gt;
&lt;li&gt;the container configuration&lt;/li&gt;
&lt;li&gt;the container events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the pod is already running you can first look at the standard output of the container. One common issue is that there are not enough resources allocated.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs your-pod -c your-container&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can look for error messages in the logs.&lt;/p&gt;
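&lt;p&gt;If the issue turns out to be under-allocated resources, explicit requests and limits on the container can help (a sketch with placeholder values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  containers:
  - name: your-container
    image: your-image
    resources:
      requests:
        cpu: "250m"
        memory: "64Mi"
      limits:
        cpu: "500m"
        memory: "128Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;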

&lt;p&gt;If there are errors inside a container you can exec into its shell to see what is going on.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl exec -it your-pod -- /bin/sh&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Networking issues
&lt;/h3&gt;

&lt;p&gt;Networking is often the next place where issues arise.&lt;br&gt;
So you can go ahead and check the DNS, firewalls and general connectivity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security issues
&lt;/h3&gt;

&lt;p&gt;You might want to check your RBAC.&lt;br&gt;
SELinux and AppArmor are also common issues, especially with network-centric applications.&lt;/p&gt;

&lt;p&gt;If you don't know where to start, you can disable security for testing to narrow down the source of the issue. But be sure to re-enable security afterwards.&lt;/p&gt;

&lt;p&gt;Another reason - not only for security issues - could be an update. You can roll back to find out when the issue was introduced.&lt;/p&gt;

&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/"&gt;Kubernetes dashboard&lt;/a&gt;&lt;br&gt;
&lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt;&lt;br&gt;
&lt;a href=""&gt;Fluentd&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/"&gt;Troubleshoot a cluster&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/"&gt;Troubleshoot applications&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/"&gt;Debug Pods&lt;/a&gt;&lt;/p&gt;

</description>
      <category>troubleshooting</category>
      <category>monitoring</category>
      <category>logging</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Expose Applications from a K8s cluster</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Mon, 20 Nov 2023 18:30:00 +0000</pubDate>
      <link>https://dev.to/barbara/expose-applications-from-a-k8s-cluster-2i7</link>
      <guid>https://dev.to/barbara/expose-applications-from-a-k8s-cluster-2i7</guid>
      <description>&lt;p&gt;To expose applications from our Kubernetes cluster we need different service types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Service Types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ClusterIP
&lt;/h3&gt;

&lt;p&gt;The ClusterIP service type is the default and only provides access internally - within the cluster. &lt;br&gt;
If you need to expose a service to the external world, you might consider other service types such as NodePort or LoadBalancer.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kubectl proxy&lt;/code&gt; command creates a local service to access a ClusterIP. This can be useful for troubleshooting or development work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: internal-cluster-ip-service
spec:
  selector:
    app: your-app
  ports:
    - protocol: TCP
      port: 80 #exposes this port internally
      targetPort: 8080 # directs traffic to pods on that port
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  NodePort
&lt;/h3&gt;

&lt;p&gt;The NodePort type is great for debugging, or when a static IP address is necessary, such as opening a particular address through a firewall. The NodePort range is defined in the cluster configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: Service
metadata:
  name: your-nodeport-service
spec:
  type: NodePort
  selector:
    app: your-app
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80
      nodePort: 30080 # the service is reachable on every node's IP at port 30080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a service via kubectl:&lt;br&gt;
&lt;code&gt;kubectl expose deployment/nginx --port=80 --type=NodePort&lt;/code&gt;&lt;br&gt;
This command creates a NodePort service for the nginx deployment.&lt;br&gt;
&lt;code&gt;kubectl get svc&lt;/code&gt;&lt;br&gt;
&lt;code&gt;kubectl get svc nginx -o yaml&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  LoadBalancer
&lt;/h3&gt;

&lt;p&gt;LoadBalancer is a type of service that automatically provides external access to services within a cluster by distributing incoming network traffic across multiple nodes. &lt;/p&gt;

&lt;p&gt;Using a LoadBalancer service is a convenient way to expose services externally, especially in production environments, where load balancing and high availability are crucial.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: your-loadbalancer-service
spec:
  type: LoadBalancer
  selector:
    app: your-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ExternalName
&lt;/h3&gt;

&lt;p&gt;With this service you can map a Kubernetes service to a DNS Name. Use of the service returns a CNAME record.&lt;br&gt;
Working with the ExternalName service is handy when using a resource external to the cluster, perhaps prior to full integration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: geiler-service
spec:
  type: ExternalName
  externalName: geil.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ingress
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Ingress Resource
&lt;/h3&gt;

&lt;p&gt;An ingress resource is an API object containing a list of rules matched against all incoming requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: your-app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: your-app.example.com  # Replace with your desired domain or IP
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: your-app-service
            port:
              number: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kubectl apply -f ingress.yaml&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingress Controller
&lt;/h3&gt;

&lt;p&gt;An ingress controller manages all the ingress rules to route traffic to existing services.&lt;br&gt;
This is important if the number of services gets high.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Mesh
&lt;/h3&gt;

&lt;p&gt;If you need service discovery, rate limiting, traffic management and advanced metrics you can implement a service mesh.&lt;/p&gt;

&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/"&gt;Kubernetes Ingress&lt;/a&gt;&lt;br&gt;
&lt;a href="https://avinetworks.com/glossary/kubernetes-service-mesh/"&gt;What is a service mesh&lt;/a&gt;&lt;/p&gt;

</description>
      <category>nodeport</category>
      <category>loadbalancer</category>
      <category>ingresscontroller</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Kubernetes Security</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Sun, 19 Nov 2023 19:15:05 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-security-3o0j</link>
      <guid>https://dev.to/barbara/kubernetes-security-3o0j</guid>
      <description>&lt;p&gt;In this post you are going to learn about the basics of the Kubernetes security. You will see how the "admission control" of the kube-apiserver works, how to authorize with RBAC and how to set network policies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accessing the Kubernetes API
&lt;/h2&gt;

&lt;p&gt;All requests that reach the API are encrypted using TLS, therefore you need to configure SSL certificates or use &lt;code&gt;kubeadm&lt;/code&gt;. Each request then passes through three stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Authentication&lt;/li&gt;
&lt;li&gt;Authorization&lt;/li&gt;
&lt;li&gt;Admission Control&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Authentication
&lt;/h3&gt;

&lt;p&gt;This is done with certificates, tokens or a basic authentication (username and password).&lt;/p&gt;

&lt;p&gt;Users are not created by the API and should be managed by the operating system or an external server.&lt;br&gt;
System accounts (aka service accounts or service principals) are used by processes to access the API.&lt;/p&gt;
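&lt;p&gt;A service account is itself an API object; as a sketch, a minimal ServiceAccount and a Pod that uses it (all names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ServiceAccount
metadata:
  name: yoursa
---
apiVersion: v1
kind: Pod
metadata:
  name: yourpod
spec:
  serviceAccountName: yoursa
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;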

&lt;p&gt;It can also be done with Webhooks, to verify bearer tokens or a connection with an external OpenId provider.&lt;/p&gt;

&lt;p&gt;You define the type of authentication in the &lt;code&gt;kube-apiserver&lt;/code&gt; startup options and select the authenticator module:&lt;br&gt;
&lt;code&gt;--basic-auth-file&lt;/code&gt;&lt;br&gt;
&lt;code&gt;--oidc-issuer-url&lt;/code&gt;&lt;br&gt;
&lt;code&gt;--token-auth-file&lt;/code&gt;&lt;br&gt;
&lt;code&gt;--authorization-webhook-config-file&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If one or more Authenticator Modules are used, each is tried until successful, and the order is not guaranteed. &lt;br&gt;
Anonymous access can also be enabled, otherwise you will get a 401 response. &lt;/p&gt;
&lt;h3&gt;
  
  
  Authorization
&lt;/h3&gt;

&lt;p&gt;There are three main modules for Authorization:&lt;br&gt;
Node: is needed for the kubelet to communicate with the kube-apiserver&lt;br&gt;
RBAC - Role-Based Access Control: all non-kubelet traffic is checked by RBAC, if set&lt;br&gt;
Webhook: delegates the authorization decision to an external HTTP service&lt;/p&gt;

&lt;p&gt;You can configure them in the kube-apiserver startup options&lt;br&gt;
&lt;code&gt;--authorization-mode=Node,RBAC&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The attributes of the request are checked against the policies (user, group, namespace, HTTP verb).&lt;br&gt;
To see the authorization information of a cluster run&lt;br&gt;
&lt;code&gt;kubectl config get-contexts&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  RBAC - Role Based Access Control
&lt;/h4&gt;

&lt;p&gt;All resources are modelled API objects in Kubernetes.&lt;/p&gt;
&lt;h5&gt;
  
  
  API Groups
&lt;/h5&gt;

&lt;p&gt;These resources belong to API groups, like core and apps. They allow HTTP verbs like POST, GET, PUT, DELETE.&lt;br&gt;
RBAC settings are additive, with no permission allowed unless defined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules&lt;/strong&gt; - Rules can act upon an API group.&lt;br&gt;
&lt;strong&gt;Roles&lt;/strong&gt; - One or more rules scoped to a single namespace.&lt;br&gt;
&lt;strong&gt;ClusterRoles&lt;/strong&gt; - Scoped for the entire cluster.&lt;/p&gt;
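&lt;p&gt;As a sketch, a namespaced Role that allows reading Pods, and a RoleBinding granting it to a user (all names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""] # "" is the core API group
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;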
&lt;h3&gt;
  
  
  Admission Control
&lt;/h3&gt;

&lt;p&gt;Admission controllers intercept and modify requests.&lt;br&gt;
They can modify the content or validate it, and potentially deny the request.&lt;br&gt;
&lt;code&gt;--enable-admission-plugins=NamespaceLifecycle,LimitRanger&lt;/code&gt;&lt;br&gt;
&lt;code&gt;--disable-admission-plugins=PodNodeSelector&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Security Contexts
&lt;/h2&gt;

&lt;p&gt;This is a Kubernetes object that defines privileges and access control settings for a Pod or a container inside a Pod. Below you can find the most used security context options.&lt;/p&gt;
&lt;h3&gt;
  
  
  RunAsUser
&lt;/h3&gt;

&lt;p&gt;Specifies the user or group ID under which the process should run inside the container. This helps to isolate processes and restrict their access.&lt;/p&gt;
&lt;h3&gt;
  
  
  Privileged
&lt;/h3&gt;

&lt;p&gt;If set to true, the container gains access to all Linux capabilities, effectively turning off all isolation between the host and the container. Using privileged mode should be done cautiously, as it can introduce security risks.&lt;/p&gt;
&lt;h3&gt;
  
  
  ReadOnlyRootFilesystem
&lt;/h3&gt;

&lt;p&gt;When set to true, the container's root file system is mounted as read-only. This provides an additional layer of security by preventing processes within the container from writing to the root file system.&lt;/p&gt;
&lt;h3&gt;
  
  
  Capabilities:
&lt;/h3&gt;

&lt;p&gt;Allows you to add or remove specific Linux capabilities for processes within the container. This provides fine-grained control over what the processes are allowed to do.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: yourpod
spec:
  containers:
  - name: yourcontainer
    image: yourimage
    securityContext:
      runAsUser: 1000 # user id; the default is 0, which is the root user
      capabilities:
        add: ["NET_ADMIN"]
      readOnlyRootFilesystem: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the security context is set wrong, you will see a warning in the status of your pods.&lt;/p&gt;

&lt;h3&gt;
  
  
  PodSecurity Admission Controllers
&lt;/h3&gt;

&lt;p&gt;PodSecurity admission controllers are part of the built-in set of admission controllers in Kubernetes.&lt;br&gt;
You can define policies on different levels and customize them as needed.&lt;br&gt;
They are part of the Admission Control Framework.&lt;br&gt;
They are designed to be compatible with a variety of container runtimes.&lt;br&gt;
You can set it in the cluster configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    enable-admission-plugins: "PodSecurity,PodNodeSelector"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and use a policy like the following. Note that the PodSecurityPolicy API shown here was deprecated in v1.21 and removed in v1.25; on current clusters, enforce the Pod Security Standards via namespace labels instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restrictive
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Network Security Policies
&lt;/h2&gt;

&lt;p&gt;By default, all pods can reach each other: all ingress and egress traffic is allowed. This has been a high-level networking requirement in Kubernetes. But ingress and egress traffic can be controlled by a NetworkPolicy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Network Policy Sample
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ingress-egress-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - ipBlock:
            cidr: 172.17.0.0/16
            except:
              - 172.17.1.0/24
        - namespaceSelector:
            matchLabels:
              project: yourproject
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 6379
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/24
      ports:
        - protocol: TCP
          port: 5978

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Default Network Policy
&lt;/h4&gt;

&lt;p&gt;The empty braces in the example below select all Pods in the namespace; because the policy allows no ingress, any Pod not granted traffic by another NetworkPolicy will have all ingress traffic denied.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/security/controlling-access/"&gt;Controlling Access to Kubernetes API&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.internetsociety.org/deploy360/tls/basics/"&gt;What is TLS&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/"&gt;Configure Service Accounts&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/"&gt;Dynamic Admission Control&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/ahmetb/kubernetes-network-policy-recipes"&gt;Network Policy Recipes&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rbac</category>
      <category>admissioncontrol</category>
      <category>networkpolicies</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Kubernetes Volumes</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Sun, 19 Nov 2023 17:18:54 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-volumes-33e2</link>
      <guid>https://dev.to/barbara/kubernetes-volumes-33e2</guid>
      <description>&lt;h2&gt;
  
  
  Volumes
&lt;/h2&gt;

&lt;p&gt;Volumes are needed to store data within a container or share data among other containers.&lt;br&gt;
All volumes requested by a Pod must be mounted &lt;em&gt;before&lt;/em&gt; the containers within the Pod are started. This applies also to secrets and configmaps.&lt;/p&gt;
&lt;h3&gt;
  
  
  Shared Volume
&lt;/h3&gt;

&lt;p&gt;Below you can find a sample of how to create a shared volume.&lt;br&gt;
But be aware that one container can overwrite the data written by the other container.&lt;br&gt;
You can use locking or versioning to avoid this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   containers:
   - name: firstcontainer
     image: busybox
     volumeMounts:
     - mountPath: /firstdir
       name: sharevol
   - name: secondcontainer
     image: busybox
     volumeMounts:
     - mountPath: /seconddir
       name: sharevol
   volumes:
   - name: sharevol
     emptyDir: {}  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;$ kubectl exec -ti example -c secondcontainer -- touch /seconddir/bla&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ kubectl exec -ti example -c firstcontainer -- ls -l /firstdir&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Persistent Volume - PV
&lt;/h3&gt;

&lt;p&gt;This is a storage abstraction used to keep data even if the Pod is killed. In the Pod you define a volume of that type.&lt;br&gt;
&lt;code&gt;kubectl get pv&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Sample of a PV with hostPath Type&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: PersistentVolume
apiVersion: v1
metadata:
name: 10Gpv01
labels:
type: local
spec:
capacity:
        storage: 10Gi
    accessModes:
        - ReadWriteOnce
    hostPath:
        path: "/somepath/data01"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Persistent Volume Claim - PVC
&lt;/h3&gt;

&lt;p&gt;With a PVC, volumes can be accessed by multiple Pods and allow state persistence. &lt;br&gt;
The cluster attaches the Persistent Volume. &lt;/p&gt;

&lt;p&gt;There is no concurrency checking, so data corruption is probable unless locking takes place outside. &lt;/p&gt;

&lt;p&gt;There are 3 access modes for the PVC:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;RWO - ReadWriteOnce by a single node&lt;/li&gt;
&lt;li&gt;ROX - ReadOnlyMany by multiple nodes&lt;/li&gt;
&lt;li&gt;RWX - ReadWriteMany by many nodes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;kubectl get pvc&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Phases to persistent storage
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Provisioning: Can be done in advance, e.g. resources from a cloud provider&lt;/li&gt;
&lt;li&gt;Binding: Once a watch loop on the master notices a PVC, it requests the access.&lt;/li&gt;
&lt;li&gt;Using: The volume is mounted to the Pod and can now be used.&lt;/li&gt;
&lt;li&gt;Releasing: When the Pod is done, the PVC is deleted. What happens to the resident data depends on the &lt;code&gt;persistentVolumeReclaimPolicy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Reclaiming: 
You have three options: Retain, Delete, Recycle &lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Empty Dir
&lt;/h4&gt;

&lt;p&gt;The kubelet creates an &lt;code&gt;emptyDir&lt;/code&gt;. It will create the directory in the container but not mount any storage. The data written to that storage is not persistent, as it will be deleted when the Pod is deleted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
    name: sample
    namespace: default
spec:
    containers:
    - image: sample
      name: sample
      command:
        - sleep
        - "3600"
      volumeMounts:
      - mountPath: /sample-mount
        name: sample-volume
    volumes:
    - name: sample-volume
      emptyDir: {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Other Volume types
&lt;/h4&gt;

&lt;h5&gt;
  
  
  gcePersistentDisk and awsElasticBlockStore
&lt;/h5&gt;

&lt;p&gt;You can mount your GCE or your EBS into your Pods.&lt;/p&gt;

&lt;h5&gt;
  
  
  hostPath
&lt;/h5&gt;

&lt;p&gt;This mounts a resource from the host node filesystem. The resource must already exist in order to be used, unless one of these types is set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DirectoryOrCreate&lt;/li&gt;
&lt;li&gt;FileOrCreate&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  and many more
&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;NFS&lt;/strong&gt; - Network File System&lt;br&gt;
&lt;strong&gt;iSCSI&lt;/strong&gt; - Internet Small Computer System Interface&lt;br&gt;
&lt;strong&gt;RBD&lt;/strong&gt; (RADOS Block Device) - RBD is a block storage device that runs on top of the Ceph distributed storage system. It allows you to create block devices that can be mounted and used like a regular disk. RBD is often used in virtualization environments, providing storage for virtual machines.&lt;br&gt;
&lt;strong&gt;CephFS&lt;/strong&gt; - CephFS is a distributed file system built on top of the Ceph storage system.&lt;br&gt;
&lt;strong&gt;GlusterFS&lt;/strong&gt; - open-source, distributed file system that can scale out to petabytes of storage. It works by aggregating various storage resources across nodes into a single, global namespace. &lt;/p&gt;
&lt;h3&gt;
  
  
  Dynamic Provisioning
&lt;/h3&gt;

&lt;p&gt;With the kind StorageClass, a user can request a claim, which the API Server fills via auto-provisioning. Common choices for dynamic storage are AWS and GCE.&lt;/p&gt;

&lt;p&gt;Sample for gce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: storage.k8s.io/v1        
kind: StorageClass
metadata:
  name: you-name-it                        
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ConfigMaps
&lt;/h3&gt;

&lt;p&gt;This kind of storage is used for non-sensitive configuration data, which does not need to be encoded but should not be stored within the application itself. &lt;br&gt;
Using configmaps we can decouple the container image from the configuration artifacts.&lt;br&gt;
If configmaps are marked as "optional" they don't need to be mounted before a pod wants to use them.&lt;/p&gt;

&lt;p&gt;They can be consumed in various ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod environmental variables from single or multiple ConfigMaps&lt;/li&gt;
&lt;li&gt;Use ConfigMap values in Pod commands&lt;/li&gt;
&lt;li&gt;Populate Volume from ConfigMap&lt;/li&gt;
&lt;li&gt;Add ConfigMap data to a specific path in Volume&lt;/li&gt;
&lt;li&gt;Set file names and access mode in Volume from ConfigMap data&lt;/li&gt;
&lt;li&gt;Can be used by system components and controllers.&lt;/li&gt;
&lt;/ul&gt;
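&lt;p&gt;Two of these consumption styles in one Pod, as a sketch (assuming a ConfigMap named &lt;code&gt;yourcm&lt;/code&gt; with a &lt;code&gt;yoursecret&lt;/code&gt; key exists; all other names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: cm-demo
spec:
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
    env:
    - name: YOURSECRET # from a single ConfigMap key
      valueFrom:
        configMapKeyRef:
          name: yourcm
          key: yoursecret
    volumeMounts:
    - mountPath: /etc/config # each key becomes a file here
      name: config-volume
  volumes:
  - name: config-volume
    configMap:
      name: yourcm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;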

&lt;p&gt;Create a Configmap from literal:&lt;br&gt;
&lt;code&gt;kubectl create cm yourcm --from-literal yoursecret=topsecret&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Create a Configmap from a file:&lt;br&gt;
&lt;code&gt;kubectl create -f your-cm.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Sample ConfigMap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
data:
  yoursecret: topsecret
  level: "3"
kind: ConfigMap
metadata:
  name: yourcm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;read the configmap&lt;br&gt;
&lt;code&gt;kubectl get configmap yourcm -o yaml&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Secrets
&lt;/h3&gt;

&lt;p&gt;This kind of storage is used to store sensitive data, that needs to be encoded. &lt;/p&gt;

&lt;p&gt;A Secret in Kubernetes is base64-encoded by default.&lt;br&gt;
If you want to encrypt secrets, you have to create an &lt;strong&gt;EncryptionConfiguration&lt;/strong&gt;.&lt;br&gt;
There is no limit to the number of secrets, but there is a 1MB limit to their size.&lt;br&gt;
Secrets are stored in tmpfs on the host node and are only sent to nodes running a Pod that needs them.&lt;/p&gt;
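&lt;p&gt;A minimal EncryptionConfiguration sketch (the key value is a placeholder; the file is passed to the kube-apiserver via &lt;code&gt;--encryption-provider-config&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: &lt;base64-encoded 32-byte key&gt; # placeholder
      - identity: {} # fallback for reading unencrypted data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;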
&lt;h4&gt;
  
  
  Secret as an environmental variable
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;kubectl get secrets&lt;/code&gt;&lt;br&gt;
&lt;code&gt;kubectl create secret generic --help&lt;/code&gt;&lt;br&gt;
&lt;code&gt;kubectl create secret generic mysecret --from-literal=password=supersecret&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
     containers:
     -image: yourimage
      name: yourcontainername
      env:
      - name: ROOT_PASSWORD
        valueFrom: 
         secretKeyRef:
           name: yoursecret
           key: password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Mounting secrets as volumes
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
    containers:
    - image: busybox
      name: busy
      command:
        - sleep
        - "3600"
      volumeMounts:
      - mountPath: /mysqlpassword
        name: mysql
    volumes:
    - name: mysql
      secret:
        secretName: mysql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify that the secret is available in the container:&lt;br&gt;
&lt;code&gt;kubectl exec -ti busybox -- cat /mysqlpassword/password&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://trainingportal.linuxfoundation.org/learn/course/kubernetes-for-developers-lfd259/"&gt;https://trainingportal.linuxfoundation.org/learn/course/kubernetes-for-developers-lfd259/&lt;/a&gt;&lt;br&gt;
Volumes on Kubernetes: &lt;a href="https://kubernetes.io/docs/concepts/storage/volumes/"&gt;https://kubernetes.io/docs/concepts/storage/volumes/&lt;/a&gt;&lt;br&gt;
Ceph: &lt;a href="https://ubuntu.com/ceph/what-is-ceph"&gt;https://ubuntu.com/ceph/what-is-ceph&lt;/a&gt;&lt;/p&gt;

</description>
      <category>volumes</category>
      <category>configmaps</category>
      <category>secrets</category>
      <category>persistent</category>
    </item>
    <item>
      <title>Kubernetes Deployment</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Wed, 15 Nov 2023 11:48:08 +0000</pubDate>
      <link>https://dev.to/barbara/deploy-f47</link>
      <guid>https://dev.to/barbara/deploy-f47</guid>
      <description>&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;A K8s Deployment is a declarative configuration in a .yaml or .json file that defines the desired state of a containerized application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create a basic deployment.yaml
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kubectl create deploy your-deployment --image=your-image -oyaml --dry-run=client &amp;gt; deploy.yaml&lt;/code&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Modify as needed, for example you can add livenessProbes&lt;/li&gt;
&lt;li&gt;run &lt;code&gt;kubectl apply -f=deploy.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;check run &lt;code&gt;kubectl describe deploy&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
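&lt;p&gt;As an example of such a modification, a livenessProbe added to the generated container spec (a sketch; the path and port are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  containers:
  - name: your-deployment
    image: your-image
    livenessProbe:
      httpGet:
        path: /healthz # placeholder health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;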

&lt;h3&gt;
  
  
  Deployment Configuration Status
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kubectl get deployments&lt;/code&gt;&lt;br&gt;
&lt;code&gt;kubectl describe deployment yourdeploymentname&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  availableReplicas
&lt;/h4&gt;

&lt;p&gt;Indicates how many replicas were configured by the ReplicaSet. This is compared to readyReplicas.&lt;/p&gt;

&lt;h4&gt;
  
  
  readyReplicas
&lt;/h4&gt;

&lt;p&gt;Used to determine if all replicas have been fully generated and without error.&lt;/p&gt;

&lt;h4&gt;
  
  
  observedGeneration
&lt;/h4&gt;

&lt;p&gt;Shows how often the deployment has been updated. This information can be used to understand the rollout and rollback situation of the deployment.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scaling and Rolling Updates
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kubectl scale deploy/dev-web --replicas=4&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Rolling Update
&lt;/h3&gt;

&lt;p&gt;If you want to modify non-immutable values, you can change them in an editor.&lt;br&gt;
This triggers a rolling update of the deployment. While the deployment would show an older age, a review of the Pods would show a recent update and the newer version of the application deployed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl edit deployment yourdeployment&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;containers:
      - image: geile-app:1.8 #&amp;lt;&amp;lt;---Change version number
        imagePullPolicy: IfNotPresent
        name: dev-geile-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will update the deployment gradually, replacing old pods with new ones to ensure continuous availability of the service.&lt;br&gt;
It is the default update strategy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Canary Rollout
&lt;/h3&gt;

&lt;p&gt;A new version of the application is deployed to a small percentage of the pods or replicas in the Kubernetes cluster. This can be achieved using a Deployment resource with specific strategies and configurations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-app
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25% &amp;lt;--- incremental increase in the number of pods running the canary version.
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: your-app
    spec:
      containers:
      - name: your-app
        image: your-registry/your-app:canary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Blue- Green Deployment
&lt;/h3&gt;

&lt;p&gt;In a Blue-Green Deployment, two identical environments, typically referred to as "Blue" and "Green," are maintained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one for the current production version (Blue) and&lt;/li&gt;
&lt;li&gt;one for the new version being deployed (Green). &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The deployment process involves switching the traffic from the Blue environment to the Green environment once the new version is considered ready for production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: blue-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: your-app
      color: blue
  template:
    metadata:
      labels:
        app: your-app
        color: blue
    spec:
      containers:
        - name: your-app
          image: registry/your-app:blue
          ports:
            - containerPort: 80

# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: green-deployment
spec:
  replicas: 0  # Keeping replicas at 0 initially
  selector:
    matchLabels:
      app: your-app
      color: green
  template:
    metadata:
      labels:
        app: your-app
        color: green
    spec:
      containers:
        - name: your-app
          image: registry/your-app:green
          ports:
            - containerPort: 80

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: your-app-service
spec:
  selector:
    app: your-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then deploy the Blue version first:&lt;br&gt;
&lt;code&gt;kubectl apply -f blue-deployment.yaml&lt;/code&gt;&lt;br&gt;
Once the new version is validated, deploy the Green version:&lt;br&gt;
&lt;code&gt;kubectl apply -f green-deployment.yaml&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment Rollbacks
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Review what happened
&lt;code&gt;kubectl rollout history deployment/mydeploy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Check the status of the deployment
&lt;code&gt;kubectl get pods&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If you need to do a rollback
&lt;code&gt;kubectl rollout undo deployment/mydeploy&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
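
&lt;p&gt;If you need to return to a specific revision rather than just the previous one, the history can be inspected and combined with &lt;code&gt;--to-revision&lt;/code&gt; (revision 2 here is only an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout history deployment/mydeploy --revision=2 # show the details of one revision
kubectl rollout undo deployment/mydeploy --to-revision=2 # roll back to that revision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;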

&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://trainingportal.linuxfoundation.org/learn/course/kubernetes-for-developers-lfd259/"&gt;https://trainingportal.linuxfoundation.org/learn/course/kubernetes-for-developers-lfd259/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/"&gt;https://kubernetes.io/docs/concepts/workloads/controllers/deployment/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>yaml</category>
      <category>rollback</category>
      <category>deployment</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Kubernetes Design</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Mon, 09 Oct 2023 12:10:55 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-design-2m8g</link>
      <guid>https://dev.to/barbara/kubernetes-design-2m8g</guid>
      <description>&lt;p&gt;In this blogpost, you will get a crisp guide through the design concepts of Kubernetes. Let's go:&lt;/p&gt;

&lt;h2&gt;
  
  
  Decoupled resources
&lt;/h2&gt;

&lt;p&gt;Each component should be decoupled from outer resources, so that every component can be removed, replaced or rebuilt.&lt;br&gt;
Use Services for connections to other resources to provide flexibility.&lt;/p&gt;
&lt;h2&gt;
  
  
  Transience
&lt;/h2&gt;

&lt;p&gt;Each object should be built with the expectation that other components will die and be rebuilt.&lt;br&gt;
Having this in mind, we can update and scale with ease.&lt;/p&gt;
&lt;h2&gt;
  
  
  Flexible Framework
&lt;/h2&gt;

&lt;p&gt;Multiple independent resources work together, but they are decoupled and do not expect a permanent relationship to other resources.&lt;br&gt;
This framework of independent resources is not as efficient, as we have a lot of controllers or watch-loops in place to monitor the current cluster state and change things until the state matches the configuration.&lt;br&gt;
But on the other hand, this framework allows us to have more flexibility, a very high availability and scalability.&lt;/p&gt;
&lt;h2&gt;
  
  
  Resource Usage
&lt;/h2&gt;

&lt;p&gt;Kubernetes allows us to easily scale clusters and sets resource limits via configuration.&lt;/p&gt;
&lt;h3&gt;
  
  
  CPU
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec.containers[].resources.limits.cpu
spec.containers[].resources.requests.cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;1 CPU in K8s is equivalent to 1 AWS vCPU, 1 GCP Core, 1 Azure vCore, or 1 hyperthread on a bare-metal Intel processor with Hyperthreading enabled.&lt;/p&gt;
&lt;h3&gt;
  
  
  RAM
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec.containers[].resources.limits.memory
spec.containers[].resources.requests.memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;With Docker the limits.memory value is converted to an integer value to be used in the &lt;code&gt;docker run --memory &amp;lt;value&amp;gt; &amp;lt;image&amp;gt;&lt;/code&gt; command.&lt;br&gt;
If the container exceeds its memory limit, it may be restarted or the entire Pod could be evicted from the node.&lt;/p&gt;
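
&lt;p&gt;A minimal sketch of how the CPU and memory requests and limits above could look in a Pod spec (the container name and values are only assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  containers:
  - name: your-container
    image: your-image
    resources:
      requests:           # what the scheduler guarantees
        cpu: "250m"       # a quarter of one CPU
        memory: "64Mi"
      limits:             # what the container may not exceed
        cpu: "500m"
        memory: "128Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;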
&lt;h3&gt;
  
  
  Ephemeral Storage
&lt;/h3&gt;

&lt;p&gt;Container files and logs can be stored there. If the containers use more than the limit in the Pod, the Pod will be evicted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec.containers[].resources.limits.ephemeral-storage
spec.containers[].resources.requests.ephemeral-storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Label Selectors
&lt;/h2&gt;

&lt;p&gt;They provide a flexible and dynamic way to interact with your K8s cluster and help with the following points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource Organization&lt;/li&gt;
&lt;li&gt;Resource Identification&lt;/li&gt;
&lt;li&gt;Selective Resource Access&lt;/li&gt;
&lt;li&gt;Application Deployment&lt;/li&gt;
&lt;li&gt;Scaling and Load Balancing&lt;/li&gt;
&lt;li&gt;Rolling Updates and Rollbacks&lt;/li&gt;
&lt;li&gt;Monitoring and Logging&lt;/li&gt;
&lt;li&gt;Multi-Tenancy&lt;/li&gt;
&lt;li&gt;Custom Workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Selectors are namespace-scoped; you can add the &lt;code&gt;--all-namespaces&lt;/code&gt; argument to select matching objects in all namespaces.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get object-name -o yaml&lt;/code&gt;&lt;br&gt;
&lt;code&gt;kubectl get pod pod-name -o yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;labels:
  app: sample
  pod-template-hash: 0815
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kubectl -n yourns get pods --selector app=your_pod&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Container Pods
&lt;/h2&gt;

&lt;p&gt;Having multiple containers allows independent development and scaling for every container to best meet the needs of the workload.&lt;br&gt;
Every container in a POD shares a single IP address and namespace.&lt;br&gt;
Each container has equal potential access to storage given to the Pod.&lt;/p&gt;
&lt;h2&gt;
  
  
  Different types of containers
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Ambassador
&lt;/h3&gt;

&lt;p&gt;Used to communicate with outside resources, often outside the cluster. With this you don't need to implement a new service or a new entry to an ingress controller. &lt;/p&gt;
&lt;h3&gt;
  
  
  Adapter
&lt;/h3&gt;

&lt;p&gt;is used to modify the data generated by the primary container. An example would be a data stream that needs to be modified for a use case.&lt;/p&gt;
&lt;h3&gt;
  
  
  Sidecar
&lt;/h3&gt;

&lt;p&gt;you can compare it to a sidecar on a motorcycle. It often provides services that are not found in the main application. For example a logging container. So it remains decoupled and scalable. &lt;/p&gt;
&lt;h3&gt;
  
  
  initContainer
&lt;/h3&gt;

&lt;p&gt;An init container allows one or more containers to run only after one or more previous containers have run and exited successfully. &lt;br&gt;
For example, a git-sync container would be an init container for another application that always needs the latest information from a given Git repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  containers:
  - name: app
    image: app-image
  initContainers:
  - name: git-sync
    image: git-sync-image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CRD - Custom Resource Definition
&lt;/h2&gt;

&lt;p&gt;With CRDs you can extend the K8s API and create custom resources. &lt;br&gt;
With the help of a CRD you can add databases, message queues, machine learning models and many more, and create custom schemas for your custom resources.&lt;br&gt;
There are also public CRDs that can be used. For example Helm CRDs or the Prometheus Operator CRDs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Job
&lt;/h2&gt;

&lt;p&gt;It is a resource object to manage and run a task or a batch process. Jobs ensure that a given number of Pods complete successfully.&lt;/p&gt;
&lt;h3&gt;
  
  
  Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;One-Time Execution: for example a database migration or a backup&lt;/li&gt;
&lt;li&gt;Parallelism: the number of parallel Pod completions&lt;/li&gt;
&lt;li&gt;Pod Template: defines the container(s) to run&lt;/li&gt;
&lt;li&gt;Completion and Failure Handling: can be defined via &lt;code&gt;completions&lt;/code&gt; and &lt;code&gt;backoffLimit&lt;/code&gt;. The backoffLimit defines the number of retries.&lt;/li&gt;
&lt;li&gt;Garbage Collection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sample of a job manifest.yaml in K8s&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: your-job
spec:
  completions: 3  # Number of desired completions
  parallelism: 1  # Number of pods running in parallel
  template:
    spec:
      containers:
      - name: your-container
        image: your-image
        command: ["echo", "Hello World!"]
  backoffLimit: 2  # Maximum number of retries in case of failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  checklist to see if your design is good
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] The application is as decoupled as it can be&lt;/li&gt;
&lt;li&gt;[ ] Nothing can be taken out of an existing container&lt;/li&gt;
&lt;li&gt;[ ] Every container is transient and is able to react properly when other containers are transient&lt;/li&gt;
&lt;li&gt;[ ] Chaos Monkey can run without my users noticing it&lt;/li&gt;
&lt;li&gt;[ ] Every component can be scaled to meet the workload&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Further reading:&lt;br&gt;
K8s CRD: &lt;a href="https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/"&gt;https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/&lt;/a&gt;&lt;br&gt;
Resource: &lt;a href="https://training.linuxfoundation.org/training/kubernetes-for-developers/"&gt;https://training.linuxfoundation.org/training/kubernetes-for-developers/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>design</category>
      <category>kubernetes</category>
      <category>resourcemanagement</category>
      <category>framework</category>
    </item>
    <item>
      <title>Kubernetes Build</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Fri, 29 Sep 2023 15:47:34 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-build-3cll</link>
      <guid>https://dev.to/barbara/kubernetes-build-3cll</guid>
      <description>&lt;p&gt;This post sums up the steps to build a Kubernetes application.&lt;/p&gt;

&lt;h2&gt;
  
  
  CRI - Container Runtime Interface
&lt;/h2&gt;

&lt;p&gt;Kubernetes is designed to work with many different container runtimes like Docker, CRI-O, containerd, rkt and others. The CRI allows easy integration of various container runtimes with kubelet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Containerizing an application
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The more stateless and transient the better an app is suited for containerization&lt;/li&gt;
&lt;li&gt;Environmental configuration needs to be provided via configMaps and secrets&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What it is
&lt;/h3&gt;

&lt;p&gt;A Dockerfile is a list of commands from which an image can be built.&lt;br&gt;
An image is a binary file that includes everything needed to run as a container. Images are usually stored in a container registry.&lt;br&gt;
A container is a running instance of an image.&lt;/p&gt;
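
&lt;p&gt;A minimal Dockerfile sketch (the base image, file names and port are only assumptions for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM python:3.11-slim          # base image
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .                       # copy the application code into the image
EXPOSE 8080
CMD ["python", "app.py"]       # command run when the container starts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;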
&lt;h3&gt;
  
  
  Sample with Docker
&lt;/h3&gt;

&lt;p&gt;After you have written your Dockerfile, do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo docker build -t yourapp # build the container
sudo docker images # verify the image 
sudo docker run yourapp #execute the image
sudo docker push # push to the repository
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can keep your Docker images local, in a repository, or in the container registry of a cloud provider like Azure, Google or AWS.&lt;br&gt;
Every container in a Pod shares a single IP address and namespace. Every container has equal potential access to the storage given to the Pod.&lt;/p&gt;
&lt;h2&gt;
  
  
  Probes
&lt;/h2&gt;

&lt;p&gt;Three different types of probes help to ensure that applications are ready for traffic and healthy within Kubernetes.&lt;/p&gt;
&lt;h3&gt;
  
  
  readinessProbe
&lt;/h3&gt;

&lt;p&gt;If your application needs to be initialized or configured in order to accept traffic, you can use the readinessProbe. &lt;br&gt;
The container will not accept traffic until the probe returns a healthy state.&lt;/p&gt;
&lt;h3&gt;
  
  
  livenessProbe
&lt;/h3&gt;

&lt;p&gt;It checks if the container is in a healthy state, while running. If it fails, the container is terminated and a replacement would be spawned.&lt;/p&gt;
&lt;h3&gt;
  
  
  startupProbe
&lt;/h3&gt;

&lt;p&gt;This probe is used to test an application that takes a long time to start. The duration until a container is considered to have failed is determined by &lt;code&gt;failureThreshold x periodSeconds&lt;/code&gt;. If periodSeconds is set to 5 seconds and failureThreshold to 10, the probe would check every 5 seconds and fail after a total of 50 seconds.&lt;/p&gt;
&lt;h3&gt;
  
  
  Probes samples
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: your-app
spec:
  containers:
    - name: your-container
      image: your-image:latest
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /areyouready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
      # Define a custom configuration probe
      # readinessProbe:
      #  exec:
      #    command:
      #     - cat
      #     - /app/config/config.yaml
      #  initialDelaySeconds: 20
      #  periodSeconds: 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Create a POD
&lt;/h2&gt;

&lt;p&gt;With the following command, you can create a pod as defined in a file called &lt;code&gt;your-pod.yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f your-pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;To see if everything is working as expected, you can use the describe functionality or get the logs of a pod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt; -c &amp;lt;container-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the next post I will show you how to write a declarative configuration to define the desired state of a containerized piece of code - in Kubernetes it is called a DEPLOYMENT. See you there.&lt;/p&gt;

&lt;p&gt;Further reading:&lt;br&gt;
Container runtimes: &lt;a href="https://github.com/containers"&gt;https://github.com/containers&lt;/a&gt;&lt;br&gt;
Helm: &lt;a href="https://helm.sh/"&gt;https://helm.sh/&lt;/a&gt;&lt;br&gt;
ArtifactHub: &lt;a href="https://artifacthub.io/"&gt;https://artifacthub.io/&lt;/a&gt;&lt;br&gt;
Resource: &lt;a href="https://training.linuxfoundation.org/training/kubernetes-for-developers/"&gt;https://training.linuxfoundation.org/training/kubernetes-for-developers/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>docker</category>
      <category>probe</category>
      <category>test</category>
    </item>
    <item>
      <title>Delta Live Tables</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Fri, 29 Sep 2023 14:21:28 +0000</pubDate>
      <link>https://dev.to/barbara/delta-live-tables-1bi5</link>
      <guid>https://dev.to/barbara/delta-live-tables-1bi5</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR - Delta Live Tables aka DLT
&lt;/h2&gt;

&lt;p&gt;DLT is a framework on top of a Delta Lake that does magic simsalabim out of the box, so you can process big amounts of data without any knowledge of the mechanics used. But you also have the possibility to configure it in a very fine-grained way via a JSON file when creating the pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features of Delta Live Tables
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Different data sets&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset type&lt;/th&gt;
&lt;th&gt;How is the data processed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Streaming table&lt;/td&gt;
&lt;td&gt;Each record is processed exactly once. This assumes an append-only source.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Materialized views&lt;/td&gt;
&lt;td&gt;Records are processed as required to return accurate results for the current data state. Materialized views should be used for data sources with updates, deletions, or aggregations, and for change data capture processing (CDC).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Views&lt;/td&gt;
&lt;td&gt;Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;You can write DLT in &lt;strong&gt;Python or SQL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You can use &lt;strong&gt;different editions&lt;/strong&gt; "core", "pro" and "advanced".&lt;/li&gt;
&lt;li&gt;You can use it to &lt;strong&gt;orchestrate tasks&lt;/strong&gt; and build pipelines in a very fast way and with a lot less code.&lt;/li&gt;
&lt;li&gt;it takes care of the &lt;strong&gt;cluster management&lt;/strong&gt; by itself, but you can also configure it yourself with a .json if needed.&lt;/li&gt;
&lt;li&gt;You have inbuilt &lt;strong&gt;monitoring&lt;/strong&gt;. Within the delta live tables user interface, you can see Pipeline status, latency, throughput, error rates and the data quality as defined by you.&lt;/li&gt;
&lt;li&gt;you can add &lt;strong&gt;data quality benchmarking&lt;/strong&gt; in a very simple way. But it is only enabled in the "advanced" edition.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@dlt.expect("valid_user_name", "user_name IS NOT NULL")
@dlt.expect_or_fail("valid_count", "click_count &amp;gt; 0")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
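
&lt;p&gt;As a rough sketch, the pipeline settings JSON mentioned above could look like this (the names, paths and cluster size are only assumptions, not a complete reference):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "your-dlt-pipeline",
  "edition": "ADVANCED",
  "continuous": false,
  "libraries": [
    { "notebook": { "path": "/Repos/you/your-dlt-notebook" } }
  ],
  "clusters": [
    { "label": "default", "num_workers": 2 }
  ],
  "target": "your_target_schema"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;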



&lt;h3&gt;
  
  
  Sample: Medallion Architecture done with Delta Live Tables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import dlt 
# boom - magic imported
# expected to run as a part of a delta table pipeline
from pyspark.sql.functions import *

# if you want to ingest json data
json_path = "your_path"

# STEP 1 - Bronze Layer
# alias for creating table function
@dlt.table(
    comment="ingests raw data from wherever you want"
    # you could assign a different table name in here, if you don't want the table to be the function name, like
    # name= "my_bronze_layer" 
)

# function name is the name of the DLT
# this function always needs to follow after the table creation with @dlt.table
def bronze_layer():
    """
    This function ingests raw data from a given source and stores it in a table called "bronze_layer"
    """
    # df = spark. read...whatever you want, like filter data as long as you return a DataFrame
    return (spark.read.format("json").load(json_path)) # a dataframe


# STEP 2 - Silver Layer
@dlt.table(
  comment="Create a silver layer with selected, quality-checked data"
)

@dlt.expect("valid_user_name", "user_name IS NOT NULL")
@dlt.expect_or_fail("valid_count", "click_count &amp;gt; 0")

# new table creation
def silver_layer():
  return (
    # live table depending on the table built in STEP 1
    dlt.read("bronze_layer") # after this you can go ahead with spark as usual
      .withColumn("click_count", expr("CAST(n AS INT)"))
      .withColumnRenamed("user_name", "user")
      .withColumnRenamed("prev_title", "previous_page_title")
      .select("user", "click_count", "previous_page_title", "current_page_title") # keep current_page_title so the gold layer can filter on it
  )

# STEP 3 - Gold Layer
@dlt.table(
  comment="A table containing the top pages linking to the checkout page."
)
def gold_layer():
  return (
    dlt.read("silver_layer")
      .filter(expr("current_page_title == 'Checkout'"))
      .withColumnRenamed("previous_page_title", "referrer")
      .sort(desc("click_count"))
      .select("referrer", "click_count")
      .limit(10)
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Further reading:&lt;br&gt;
&lt;a href="https://docs.databricks.com/api/workspace/pipelines"&gt;Delta Live Tables&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.databricks.com/en/delta/index.html"&gt;Delta Lake&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.databricks.com/en/introduction/index.html#:~:text=Databricks%20is%20a%20unified%2C%20open,Data%20warehousing%2C%20analytics%2C%20and%20BI"&gt;Databricks&lt;/a&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>dlt</category>
      <category>dataengineering</category>
      <category>cheatsheet</category>
    </item>
    <item>
      <title>Kubernetes Architecture</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Tue, 15 Aug 2023 10:30:49 +0000</pubDate>
      <link>https://dev.to/barbara/kubernetes-architecture-1e9p</link>
      <guid>https://dev.to/barbara/kubernetes-architecture-1e9p</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; is an open-source system for automating deployment, scaling and management of containerized applications. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post, you will get an overview of the K8s architecture. If you are coming from software engineering and want to get a first understanding of how K8s works: this post is for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terminology.
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Namespace.
&lt;/h3&gt;

&lt;p&gt;A group of resources. For every group, resource quotas can be set with the &lt;strong&gt;LimitRange&lt;/strong&gt; admission controller. Also, user permissions can be applied.&lt;br&gt;
K8s resources can be created namespace-scoped or cluster-scoped.&lt;br&gt;
Two objects cannot have the same &lt;strong&gt;Name&lt;/strong&gt; value in the same namespace&lt;/p&gt;

&lt;h3&gt;
  
  
  Context.
&lt;/h3&gt;

&lt;p&gt;This consists of the user, cluster name (e.g. dev and prod) and namespace. It is used to switch between permissions and restrictions.&lt;br&gt;
The context information is stored in &lt;code&gt;~/.kube/config&lt;/code&gt;.&lt;/p&gt;
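
&lt;p&gt;For example, contexts can be listed and switched like this (the context name is only a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl config get-contexts            # list all contexts
kubectl config use-context dev-cluster # switch to another context
kubectl config current-context         # show the active context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;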

&lt;h3&gt;
  
  
  Resource limits.
&lt;/h3&gt;

&lt;p&gt;Limits can be set per namespace and per Pod. The namespace limits have priority over the Pod spec.&lt;/p&gt;
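
&lt;p&gt;A minimal sketch of a namespace-level &lt;strong&gt;LimitRange&lt;/strong&gt; (the values are only assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
  - default:            # default limit per container
      memory: 512Mi
    defaultRequest:     # default request per container
      memory: 256Mi
    type: Container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;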

&lt;h3&gt;
  
  
  Pod security admission.
&lt;/h3&gt;

&lt;p&gt;There are 3 profiles: &lt;strong&gt;privileged&lt;/strong&gt;, &lt;strong&gt;baseline&lt;/strong&gt; and &lt;strong&gt;restricted&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Network policies.
&lt;/h3&gt;

&lt;p&gt;Ingress and Egress traffic can be limited according to namespaces and labels or addresses.&lt;/p&gt;
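
&lt;p&gt;A minimal sketch of such a policy, allowing Ingress traffic to backend Pods only from frontend Pods (the labels are only assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:          # the Pods this policy applies to
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:      # only these Pods may send traffic
        matchLabels:
          app: frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;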

&lt;h2&gt;
  
  
  K8s API Flow.
&lt;/h2&gt;

&lt;p&gt;In the following sections, the parts of the control plane and the worker nodes will be explained. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zzk93ezbmww0somcw5t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zzk93ezbmww0somcw5t.jpg" alt="K8s API Flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Control plane node components
&lt;/h2&gt;

&lt;h3&gt;
  
  
  kube-apiserver
&lt;/h3&gt;

&lt;p&gt;is the central part of a K8s cluster. All calls are handled on this server. Every API call passes three steps: authentication, authorization, and admission controllers.&lt;br&gt;
Only the kube-apiserver connects to the etcd database. &lt;/p&gt;

&lt;h3&gt;
  
  
  kube-scheduler
&lt;/h3&gt;

&lt;p&gt;scans available resources (like CPU, memory utilization, node health and workload distribution) and makes informed decisions about which node will host a Pod of containers. &lt;br&gt;
It monitors the cluster and makes decisions based on the current state.&lt;/p&gt;

&lt;h3&gt;
  
  
  etcd database
&lt;/h3&gt;

&lt;p&gt;is a key-value store in which the state of the cluster, networking and other persistent information is stored.&lt;/p&gt;

&lt;h3&gt;
  
  
  kube-controller-manager
&lt;/h3&gt;

&lt;p&gt;is a core control loop daemon that interacts with the kube-api-server to determine the state of the cluster. If the state does not match, the manager contacts the necessary controllers to match the desired state. There are several controllers in use, like endpoints, namespace and replication.&lt;/p&gt;

&lt;h3&gt;
  
  
  cloud-controller-manager
&lt;/h3&gt;

&lt;p&gt;can interact with agents outside of the cloud. It allows faster changes without altering the core K8s control process (see kube-apiserver). &lt;br&gt;
Each kubelet must use the &lt;code&gt;--cloud-provider=external&lt;/code&gt; setting passed to the binary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Worker node components
&lt;/h2&gt;

&lt;h3&gt;
  
  
  kubelet
&lt;/h3&gt;

&lt;p&gt;interacts with the underlying container runtime (installed on all nodes) and ensures that all containers are running as desired.&lt;br&gt;
It accepts the API calls for Pod specifications and configures the local node until the specification has been met.&lt;br&gt;
For example, if a Pod needs access to storage, Secrets or ConfigMaps, the kubelet will make this happen.&lt;br&gt;
It sends back the status to the kube-apiserver to be persistent in the etcd.&lt;/p&gt;

&lt;h3&gt;
  
  
  kube-proxy
&lt;/h3&gt;

&lt;p&gt;manages the network connectivity to the containers via iptables (IPv4 and IPv6). A 'userspace mode' monitors Services and Endpoints.&lt;/p&gt;

&lt;h3&gt;
  
  
  logging
&lt;/h3&gt;

&lt;p&gt;Currently, there is no cluster-wide logging. &lt;a href="https://www.fluentd.org/" rel="noopener noreferrer"&gt;Fluentd&lt;/a&gt; can be used to have a unified logging layer for the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  metrics
&lt;/h3&gt;

&lt;p&gt;run &lt;code&gt;kubectl top&lt;/code&gt; to get the metrics of a K8s component.&lt;br&gt;
If needed &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; can be deployed to gather metrics from nodes and applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  container engine
&lt;/h3&gt;

&lt;p&gt;A container engine for the management of containerized applications, like containerd or cri-o.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pods
&lt;/h2&gt;

&lt;p&gt;are the smallest units we can work with on K8s. The design of a pod follows a &lt;code&gt;one-process-per-container&lt;/code&gt; architecture. A pod represents a group of co-located containers with some associated data volumes. &lt;br&gt;
Containers in a pod start in parallel by default. &lt;/p&gt;

&lt;h3&gt;
  
  
  special containers:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;initContainers: if we want one container to run and complete before another starts.&lt;/li&gt;
&lt;li&gt;sidecar: used to perform helper tasks, like logging.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  single IP per Pod
&lt;/h3&gt;

&lt;p&gt;All containers in a pod share the same network namespace. You cannot see those containers on the K8s level, only on the pod level.&lt;br&gt;
The containers use the loopback interface, write to files on a common filesystem or via inter-process communication (IPC).&lt;/p&gt;

&lt;h2&gt;
  
  
  Services
&lt;/h2&gt;

&lt;p&gt;are flexible and scalable operators that connect resources. Each service is a microservice handling a particular bit of traffic, like a &lt;strong&gt;NodePort&lt;/strong&gt; or a &lt;strong&gt;LoadBalancer&lt;/strong&gt; to distribute requests. They are also used for resource control and security.&lt;br&gt;
They use selectors to know which objects to connect. These selectors can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;equality-based: =, ==, !=&lt;/li&gt;
&lt;li&gt;set-based: in, notin, exists&lt;/li&gt;
&lt;/ul&gt;
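As a sketch (labels and names illustrative), an equality-based selector in a Service manifest, with a set-based label query shown as a comment:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-service          # illustrative
spec:
  selector:
    app: web                 # equality-based: matches pods labeled app=web
  ports:
  - port: 80
    targetPort: 8080
---
# set-based selectors (in, notin, exists) appear in label queries, e.g.:
#   kubectl get pods -l 'environment in (prod, qa)'
```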

&lt;h2&gt;
  
  
  Operators
&lt;/h2&gt;

&lt;p&gt;aka watch-loops aka controllers compare the current state against the given spec and execute code to meet the spec.&lt;br&gt;
A &lt;strong&gt;DeltaFIFO&lt;/strong&gt; queue is used. The loop only ends if the delta is of type &lt;strong&gt;Deleted&lt;/strong&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Networking Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ClusterIP
&lt;/h3&gt;

&lt;p&gt;is used for the traffic within the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  NodePort
&lt;/h3&gt;

&lt;p&gt;first creates a ClusterIP and then associates a port of the node with that new ClusterIP.&lt;/p&gt;

&lt;h3&gt;
  
  
  LoadBalancer
&lt;/h3&gt;

&lt;p&gt;first creates a ClusterIP, then a NodePort, and then makes an asynchronous request for an external load balancer. If the external load balancer is not configured to respond, the Service stays in &lt;strong&gt;Pending&lt;/strong&gt; state.&lt;/p&gt;
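A sketch of how these layers show up in one manifest (all values illustrative); switching &lt;code&gt;type&lt;/code&gt; between ClusterIP, NodePort and LoadBalancer adds the behaviors described above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-service    # illustrative
spec:
  type: LoadBalancer    # also creates a ClusterIP and a NodePort underneath
  selector:
    app: demo
  ports:
  - port: 80            # ClusterIP port inside the cluster
    targetPort: 8080    # container port the traffic is forwarded to
    nodePort: 30080     # optional; by default in the 30000-32767 range
```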

&lt;h3&gt;
  
  
  Ingress Controller
&lt;/h3&gt;

&lt;p&gt;acts as a reverse proxy to route external traffic to the assigned services based on the configuration. So its key responsibilities are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routing and Load Balancing&lt;/li&gt;
&lt;li&gt;TLS Termination&lt;/li&gt;
&lt;li&gt;Path-Based Routing&lt;/li&gt;
&lt;li&gt;Virtual Hosts&lt;/li&gt;
&lt;li&gt;Authentication and Authorization&lt;/li&gt;
&lt;/ul&gt;
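Several of these responsibilities are visible in a minimal Ingress resource (hostnames and service names are made up):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-ingress            # illustrative
spec:
  tls:
  - hosts: [demo.example.com]   # TLS termination at the controller
    secretName: demo-tls
  rules:
  - host: demo.example.com      # virtual host
    http:
      paths:
      - path: /api              # path-based routing
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
```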

&lt;p&gt;Video about the K8s API: &lt;a href="https://www.youtube.com/watch?v=YsmgB2QDaUg" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=YsmgB2QDaUg&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Container Network Interface (CNI) Configuration File
&lt;/h2&gt;

&lt;p&gt;It is the default networking interface mechanism used by kubeadm, which is the K8s cluster bootstrapping tool.&lt;/p&gt;

&lt;p&gt;It is a specification to configure container networking communications, provide a single IP per pod and remove resources when a container is deleted.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/containernetworking/cni" rel="noopener noreferrer"&gt;CNI&lt;/a&gt; is language-agnostic and there are many different plugins available.&lt;/p&gt;

&lt;p&gt;Now you have a first overview of the architecture of K8s.&lt;br&gt;
You learned about the difference between the control plane and the worker nodes and their components. With this terminology in place, you can start to get into the details and run a cluster yourself.&lt;br&gt;
If you are already working with the K8s API, remember that for now &lt;code&gt;kubectl --help&lt;/code&gt; is your best friend. As kubectl offers more than 40 commands, you can explore each of them with the &lt;code&gt;--help&lt;/code&gt; flag, for example &lt;code&gt;kubectl taint --help&lt;/code&gt;. You will often get your information faster there, because ChatGPT and Bard tend to talk a lot and say little.&lt;/p&gt;

&lt;p&gt;In the next post of this series, I will write about how to build a K8s cluster. See you there.&lt;/p&gt;

&lt;p&gt;Dig deeper:&lt;br&gt;
&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=YsmgB2QDaUg" rel="noopener noreferrer"&gt;K8s API Flow explained in a beautiful video&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/" rel="noopener noreferrer"&gt;concepts of cluster networking&lt;/a&gt;&lt;br&gt;
Resource: &lt;a href="https://training.linuxfoundation.org/training/kubernetes-for-developers/" rel="noopener noreferrer"&gt;https://training.linuxfoundation.org/training/kubernetes-for-developers/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>architecture</category>
      <category>devops</category>
      <category>learning</category>
    </item>
    <item>
      <title>Data Pipelines explained with Airflow</title>
      <dc:creator>Barbara</dc:creator>
      <pubDate>Tue, 25 Jan 2022 12:06:00 +0000</pubDate>
      <link>https://dev.to/barbara/data-pipelines-explained-with-airflow-6e7</link>
      <guid>https://dev.to/barbara/data-pipelines-explained-with-airflow-6e7</guid>
      <description>&lt;p&gt;In the following lines I am doing a write-up about everything I learned about data pipelines at the Udacity online class. It gives a general overview about data pipelines and provides also the core concepts of Airflow and some links to code examples on &lt;a href="https://github.com/BarbaraJoebstl/data-engineering-nd/tree/master/data-pipelines"&gt;github&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  WHAT - A series
&lt;/h2&gt;

&lt;p&gt;A data pipeline is a series of steps in which data is processed, mostly &lt;a href="https://www.integrate.io/blog/etl-vs-elt/"&gt;ETL or ELT&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Data pipelines provide a set of logical guidelines and a common set of terminology.&lt;br&gt;
The conceptual framework of data pipelines will help you better organize and execute everyday data engineering tasks.&lt;/p&gt;

&lt;p&gt;Examples of use cases are automated marketing emails, real-time pricing, or targeted advertising based on browsing history.&lt;/p&gt;
&lt;h2&gt;
  
  
  WHY - Data Quality
&lt;/h2&gt;

&lt;p&gt;We want to provide high quality data. &lt;br&gt;
There can be different requirements how to measure data quality based on the use case. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data must be a certain size&lt;/li&gt;
&lt;li&gt;Data must be accurate to some margin of error&lt;/li&gt;
&lt;li&gt;Data must arrive within a given timeframe from the start of the execution &lt;/li&gt;
&lt;li&gt;Pipelines must run on a particular schedule&lt;/li&gt;
&lt;li&gt;Data must not contain any sensitive information&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Data Validation
&lt;/h3&gt;

&lt;p&gt;is the process of ensuring that data is present, correct and meaningful. Ensuring the quality of your data through automated validation checks is a critical step when working with data.&lt;/p&gt;
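A minimal pure-Python sketch of such checks (function and threshold names are made up); in Airflow these would typically run inside a PythonOperator:

```python
def validate_record_count(records, min_rows=1):
    """Raise if the dataset has fewer rows than expected (data is present)."""
    if len(records) >= min_rows:
        return len(records)
    raise ValueError(f"quality check failed: {len(records)} rows, expected at least {min_rows}")

def validate_no_nulls(records, column):
    """Raise if any record is missing a value for the given column (data is correct)."""
    missing = [r for r in records if r.get(column) is None]
    if missing:
        raise ValueError(f"{len(missing)} records have no value for '{column}'")
    return True
```

Failing loudly inside the pipeline run is the point: a raised exception marks the task as failed instead of silently passing bad data downstream.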
&lt;h3&gt;
  
  
  Data Lineage
&lt;/h3&gt;

&lt;p&gt;of a dataset describes the discrete steps involved in the creation, movement and calculation of a dataset. It is important for the following points:&lt;/p&gt;
&lt;h4&gt;
  
  
  Gain Confidence
&lt;/h4&gt;

&lt;p&gt;Being able to describe the data lineage of a dataset or analysis builds confidence in our data consumers, like engineers, analysts, data scientists and stakeholders.&lt;br&gt;&lt;br&gt;
If the data lineage is unclear, it is very likely that our data consumers will not trust or want to use the data.&lt;/p&gt;
&lt;h4&gt;
  
  
  Defining Metrics
&lt;/h4&gt;

&lt;p&gt;If we can surface data lineage, everyone in the company is able to agree on the definition of how a particular metric is calculated.&lt;/p&gt;
&lt;h4&gt;
  
  
  Debugging
&lt;/h4&gt;

&lt;p&gt;If each step of the data movement and transformation process is well described, it's easy to find problems if they occur.&lt;/p&gt;

&lt;p&gt;Airflow DAGs are a natural representation for the movement and transformation of data. The components can be used to track data lineage: the rendered code tab for a task, the graph view for a DAG, historical runs under the tree view.&lt;/p&gt;
&lt;h3&gt;
  
  
  Schedules
&lt;/h3&gt;

&lt;p&gt;allow us to make assumptions about the &lt;em&gt;scope&lt;/em&gt; of the data. The scope of a pipeline run can be defined as the time from the end of the last execution until the start of the current one.&lt;/p&gt;

&lt;p&gt;Schedules improve data quality by limiting our analysis to data relevant to a time period. If we use schedules appropriately, they are also a form of &lt;em&gt;data partitioning&lt;/em&gt;, which can increase the speed of our pipeline runs.&lt;br&gt;
With the help of schedules we can also leverage already completed work. For example, we would only need to aggregate the current month and add it to the existing totals, instead of aggregating data of all time.&lt;/p&gt;
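The idea of leveraging completed work can be sketched in plain Python (names and record shapes are illustrative): each run aggregates only its own time slice and adds it to a running total.

```python
def monthly_total(records, month):
    """Aggregate only the records in the given month (the current run's scope)."""
    return sum(r["amount"] for r in records if r["month"] == month)

def update_running_total(previous_total, records, month):
    """Add the current month's aggregate to the already-computed total,
    instead of re-aggregating all historical data on every run."""
    return previous_total + monthly_total(records, month)
```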
&lt;h4&gt;
  
  
  How to schedule
&lt;/h4&gt;

&lt;p&gt;If we answer the below questions, we can find an appropriate schedule for our pipelines.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is the average size of the data for a time period? The more data we have, the more often the pipeline needs to be scheduled.&lt;/li&gt;
&lt;li&gt;How frequently is data arriving and how often do we need to perform analysis? If the company needs data on a daily basis, that is the driving factor in determining the schedule. &lt;/li&gt;
&lt;li&gt;What is the frequency of related datasets? A rule of thumb is that the frequency of a pipeline's schedule should be determined by the dataset in our pipeline that requires the most frequent analysis. &lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Data Partitioning
&lt;/h3&gt;

&lt;p&gt;This is the process of isolating data to be analyzed by one or more attributes, such as time (&lt;em&gt;schedule partitioning&lt;/em&gt;), conceptually related data in discrete groups (&lt;em&gt;logical partitioning&lt;/em&gt;), data size (&lt;em&gt;size partitioning&lt;/em&gt;) or location. &lt;br&gt;
This leads to faster and more reliable pipelines, as smaller datasets, shorter time periods and related concepts are easier to debug than big amounts of data and unrelated concepts. There will also be fewer dependencies. &lt;br&gt;
Tasks operating on partitioned data may be more easily &lt;em&gt;parallelized&lt;/em&gt;.&lt;/p&gt;
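Time-based (schedule) partitioning often shows up as partitioned storage paths; a hypothetical sketch (bucket layout and prefix names are made up):

```python
from datetime import datetime

def partition_key(execution_date: datetime) -> str:
    """Build a time-partitioned storage prefix for one pipeline run,
    so each run only reads and writes its own slice of the data."""
    return (f"events/year={execution_date.year}"
            f"/month={execution_date.month:02d}"
            f"/day={execution_date.day:02d}")
```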
&lt;h2&gt;
  
  
  HOW does a pipeline work
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf0kezke9e480sft8e7q.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf0kezke9e480sft8e7q.jpg" alt="What is a DAG" width="800" height="166"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  DAGs - Directed Acyclic Graphs
&lt;/h3&gt;

&lt;p&gt;A DAG is a collection of nodes and edges that describe the order of operations for a data pipeline.&lt;br&gt;
The conceptual framework of data pipelines help us to better organize and execute everyday data engineering tasks.&lt;/p&gt;
&lt;h3&gt;
  
  
  NODE
&lt;/h3&gt;

&lt;p&gt;A node is a step in a data pipeline process.&lt;/p&gt;
&lt;h3&gt;
  
  
  EDGE
&lt;/h3&gt;

&lt;p&gt;The dependencies or relationships between nodes.&lt;/p&gt;
&lt;h3&gt;
  
  
  GRAPH
&lt;/h3&gt;

&lt;p&gt;A graph describes entities and the relationships between them; in a DAG the edges are directed and form no cycles.&lt;/p&gt;

&lt;p&gt;In real world it is possible to model a data pipeline that is not a DAG, meaning it contains a cycle within the process. But the majority of pipelines can be described as a DAG. This makes the code more understandable and maintainable.&lt;/p&gt;
&lt;h2&gt;
  
  
  Apache Airflow
&lt;/h2&gt;

&lt;p&gt;is an open-source, DAG-based, schedulable data-pipeline tool that can run in mission-critical environments.&lt;br&gt;
It is not a data processing framework, it is a tool that coordinates the movement between other data stores and data processing tools. &lt;br&gt;
Airflow allows users to write DAGs in Python that run on a schedule and/or from an external trigger.&lt;/p&gt;

&lt;p&gt;The advantages of defining pipelines in code are that they are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maintainable&lt;/li&gt;
&lt;li&gt;versionable&lt;/li&gt;
&lt;li&gt;testable&lt;/li&gt;
&lt;li&gt;collaborative&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Airflow is simple to maintain and can run data analysis itself or trigger external tools (&lt;a href="https://dev.to/barbara/redshift-2l6h"&gt;Redshift&lt;/a&gt;, &lt;a href="https://dev.to/barbara/spark-for-beginners-and-you-24ea"&gt;Spark&lt;/a&gt;, etc.). It also provides a web-based UI for users to visualize and interact with their data pipelines.&lt;/p&gt;
&lt;h3&gt;
  
  
  Components of Airflow
&lt;/h3&gt;

&lt;p&gt;A &lt;em&gt;Scheduler&lt;/em&gt; for orchestrating the execution of jobs on a trigger or schedule. A &lt;em&gt;Work Queue&lt;/em&gt; which holds the state of the running DAGs and Tasks.&lt;br&gt;
&lt;em&gt;Worker Processes&lt;/em&gt; that execute the operations defined in each DAG. A &lt;em&gt;Database&lt;/em&gt; which saves credentials, connections, history and configuration.&lt;br&gt;
A &lt;em&gt;Web Interface&lt;/em&gt; that provides a control dashboard for users and maintainers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gcytbnprgjagf570u5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gcytbnprgjagf570u5g.png" alt="UI Airflow" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;The scheduler starts a DAG based on time or external triggers.&lt;br&gt;
If a DAG is started, the scheduler looks at the steps within the DAG and determines which steps can run by looking at their dependencies.&lt;br&gt;
The scheduler places runnable steps in the queue.&lt;br&gt;
Workers pick up those tasks and run them.&lt;br&gt;
Once the worker has finished running a step, the final status of the task is recorded and additional tasks are placed by the scheduler until all tasks are complete.&lt;br&gt;
Once all tasks have been completed, the DAG is complete.&lt;/p&gt;
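The loop above can be sketched in a few lines of plain Python (a toy simulation, not Airflow code): steps become runnable once all their dependencies have finished, and the run is complete when every task has run.

```python
def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of upstream task names}.
    Repeatedly queue tasks whose dependencies are all done, as the scheduler does."""
    done, order = set(), []
    while len(done) != len(tasks):
        runnable = [t for t in tasks
                    if t not in done and deps.get(t, set()).issubset(done)]
        if not runnable:
            raise ValueError("cycle detected: not a DAG")
        for t in runnable:      # workers pick up the queued tasks and run them
            tasks[t]()
            done.add(t)         # final status recorded; unlocks downstream tasks
            order.append(t)
    return order                # all tasks complete: the DAG run is complete
```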

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqokzxmblrze89wvkedvv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqokzxmblrze89wvkedvv.jpg" alt="How Airflow works" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Creating a DAG
&lt;/h3&gt;

&lt;p&gt;To create a &lt;em&gt;DAG&lt;/em&gt; you need a name, a description, a start date and a schedule interval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime

from airflow import DAG

my_first_dag = DAG(
  'my_first',
  description='Says hello world',
  start_date=datetime(2022, 1, 22),
  schedule_interval='@daily')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the start date is in the past, Airflow will run your DAG as many times as&lt;br&gt;
 there are schedule intervals between the start date and the current date. This is called &lt;em&gt;backfill&lt;/em&gt;. If a company has years of established data that may need to be retroactively analyzed, this is useful.&lt;/p&gt;

&lt;p&gt;Schedule intervals are optional and can be defined with cron strings or Airflow presets, like &lt;code&gt;@once&lt;/code&gt;, &lt;code&gt;@hourly&lt;/code&gt;, &lt;code&gt;@daily&lt;/code&gt;, &lt;code&gt;@weekly&lt;/code&gt;, &lt;code&gt;@monthly&lt;/code&gt;, &lt;code&gt;@yearly&lt;/code&gt; or None.&lt;/p&gt;

&lt;p&gt;End date is optional, if it is not specified, the DAG will run until it is disabled or deleted. An end date might be useful to mark the end of life or handling data bounds by two points in time.&lt;/p&gt;
&lt;h3&gt;
  
  
  Operators
&lt;/h3&gt;

&lt;p&gt;define the atomic steps of work that make up a DAG. Instantiated operators are referred to as &lt;em&gt;Tasks&lt;/em&gt;.&lt;br&gt;
Airflow comes with &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/index.html"&gt;many operators&lt;/a&gt; that can perform common operations, like &lt;code&gt;S3ToRedshiftOperator&lt;/code&gt; or &lt;code&gt;SimpleHttpOperator&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Task dependencies&lt;/em&gt; can be described programmatically using &lt;code&gt;a &amp;gt;&amp;gt; b&lt;/code&gt; or &lt;code&gt;a.set_downstream(b)&lt;/code&gt;, meaning a runs before b, or&lt;br&gt;
&lt;code&gt;a &amp;lt;&amp;lt; b&lt;/code&gt; or &lt;code&gt;a.set_upstream(b)&lt;/code&gt;, meaning a runs after b.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from airflow.operators.python_operator import PythonOperator

def hello_world():
  print('Hello World')

def second_step():
  print('Second Step')

my_first_sample_task = PythonOperator(
  task_id='hello_world',
  python_callable=hello_world,
  dag=my_first_dag)

second_step_task = PythonOperator(
  task_id='second_step',
  python_callable=second_step,
  dag=my_first_dag)

my_first_sample_task &amp;gt;&amp;gt; second_step_task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Task Boundaries
&lt;/h4&gt;

&lt;p&gt;DAG tasks should be &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;atomic, with a single, well-defined purpose. The more work a task performs, the less clear its purpose becomes. Properly scoped tasks are easy to maintain, easy to understand and fast to run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;Write programs that do one thing and do it well.&lt;/code&gt; - the Unix philosophy&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maximize parallelism: if a task is scoped properly, we can minimize dependencies and enable parallelism, which can speed up the execution of DAGs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can also create custom operators as plugins. One example of a custom operator is a particular data quality check that is needed frequently.&lt;/p&gt;

&lt;p&gt;To create a custom operator we have to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify operators that perform similar functions and can be consolidated&lt;/li&gt;
&lt;li&gt;Define a new operator in the plugins folder&lt;/li&gt;
&lt;li&gt;Replace the original operators with your new custom one, re-parameterize, and instantiate them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can find a sample for custom operators &lt;a href="https://github.com/BarbaraJoebstl/data-engineering-nd/blob/master/data-pipelines/lesson3_production_pipelines/operator_plugin.py"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  SubDAGs
&lt;/h3&gt;

&lt;p&gt;Commonly repeated series of tasks within DAGs can be captured as reusable SubDAGs. An example would be the "S3ToRedshiftSubDag".&lt;/p&gt;

&lt;h4&gt;
  
  
  Advantages
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;decrease the amount of code we need to write and maintain to create a new DAG&lt;/li&gt;
&lt;li&gt;easier to understand the high level goals of a DAG&lt;/li&gt;
&lt;li&gt;bug fixes, speedups, and other enhancements can be made more quickly and distributed to all DAGs that use that SubDAG&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Disadvantages
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;limited visibility within the Airflow UI&lt;/li&gt;
&lt;li&gt;harder to understand because of the abstraction level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want, you can also use nested SubDAGs, but keep in mind that this makes the pipeline much harder to understand and maintain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hooks
&lt;/h3&gt;

&lt;p&gt;Connections can be accessed in code via &lt;em&gt;hooks&lt;/em&gt;. Hooks provide a reusable interface to external systems and databases. Airflow comes with &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/hooks/index.html"&gt;many hooks&lt;/a&gt;, like &lt;code&gt;HttpHook&lt;/code&gt;, &lt;code&gt;PostgresHook&lt;/code&gt;, &lt;code&gt;SlackHook&lt;/code&gt; etc. We don't have to worry about how and where to store connection strings and secrets: you can store those in the Airflow user interface under &lt;code&gt;Admin - Connections&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def load():
    # Create a PostgresHook using the 'demo' connection
    db_hook = PostgresHook('demo')
    df = db_hook.get_pandas_df('SELECT * FROM my_sample')
    print(f'your sample has {len(df)} records')

load_task = PythonOperator(task_id='load_sample_data', python_callable=load, ...)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As with operators, we can also create custom hooks. &lt;br&gt;
Before creating a new plugin you might want to check &lt;a href="https://github.com/apache/airflow/tree/main/airflow/contrib"&gt;Airflow contrib&lt;/a&gt; to see if a plugin for your needs has already been created by community members. If not, you can build one and contribute it to the community.&lt;/p&gt;
&lt;h3&gt;
  
  
  Runtime variables
&lt;/h3&gt;

&lt;p&gt;Airflow also provides &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html"&gt;runtime variables&lt;/a&gt;. One example is &lt;code&gt;{{ execution_date }}&lt;/code&gt;, the execution date of the current run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def hello_date(*args, **kwargs):
    print(f"Hello {kwargs['execution_date']}")

my_first_dag = DAG(...)
task = PythonOperator(
    task_id='hello_date',
    python_callable=hello_date,
    provide_context=True,
    dag=my_first_dag)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;p&gt;DAGs can be configured to have an &lt;em&gt;SLA&lt;/em&gt; (Service Level Agreement), which is defined as a time by which a DAG must complete. &lt;br&gt;
We can email a list of missed SLAs or view it in the Airflow UI. Missed SLAs can also be early indicators of performance problems, or indicate that we need to scale up the size of our Airflow cluster.&lt;br&gt;
If you are working on a time-sensitive application, an SLA is crucial. &lt;/p&gt;

&lt;p&gt;Airflow can be configured to send emails on DAG and task state changes. These state changes may include successes, failures, or retries. Failure emails can allow you to easily trigger alerts. It is common for alerting systems to accept emails as a source of alerts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Airflow comes out of the box with the ability to send system metrics using a metrics aggregator called &lt;code&gt;statsd&lt;/code&gt;. Statsd can be coupled with metrics visualization tools like Grafana to provide you and your team high level insights into the overall performance of your DAGs, jobs, and tasks. These systems can be integrated into your alerting system. These Airflow system-level metrics allow you and your team to stay ahead of issues before they even occur by watching long-term trends.&lt;/p&gt;

&lt;p&gt;You can find code samples to all of the above mentioned topics &lt;a href="https://github.com/BarbaraJoebstl/data-engineering-nd/tree/master/data-pipelines"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dag</category>
      <category>productivity</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
