<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kimani Kanyutu</title>
    <description>The latest articles on DEV Community by Kimani Kanyutu (@kim_kanyutu).</description>
    <link>https://dev.to/kim_kanyutu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1024071%2F0bcabf09-cb72-4333-872c-4701e9cd4764.jpg</url>
      <title>DEV Community: Kimani Kanyutu</title>
      <link>https://dev.to/kim_kanyutu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kim_kanyutu"/>
    <language>en</language>
    <item>
      <title>EXPLORATORY DATA ANALYSIS ULTIMATE GUIDE</title>
      <dc:creator>Kimani Kanyutu</dc:creator>
      <pubDate>Sun, 26 Feb 2023 04:45:28 +0000</pubDate>
      <link>https://dev.to/kim_kanyutu/exploratory-data-analysis-ultimate-guide-35e1</link>
      <guid>https://dev.to/kim_kanyutu/exploratory-data-analysis-ultimate-guide-35e1</guid>
      <description>&lt;h2&gt;
  
  
  INTRODUCTION
&lt;/h2&gt;

&lt;p&gt;When buying a high-value commodity, say a car, do you make a random pick or do you take your time to assess the car? You check the engine condition, assess the tyre condition, the interior and the mileage, listen to the engine sound, check the service and maintenance history if there is one, and confirm there is proper documentation.&lt;br&gt;
The same applies to data!&lt;/p&gt;
&lt;h3&gt;
  
  
  What is EDA?
&lt;/h3&gt;

&lt;p&gt;Exploratory Data Analysis is a data analysis concept that involves an initial investigation of the dataset with the aim of establishing patterns, detecting obvious errors, anomalies and outliers, or testing hypotheses. This is often done with statistical or graphical tools. EDA facilitates a better understanding of the data, the relationships that exist within the dataset and the different variables present.&lt;br&gt;
It is often referred to as &lt;strong&gt;descriptive analytics&lt;/strong&gt;. Insights obtained from this step determine the next undertaking in the data analysis process.&lt;br&gt;
EDA can be categorized into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Graphical and non-graphical analysis&lt;/strong&gt;&lt;br&gt;
Graphical analysis involves the use of graphical tools such as box plots, bar graphs and scatter plots to assess the data. Python libraries such as Matplotlib and Seaborn can be used to perform this analysis.&lt;br&gt;
Non-graphical analysis uses summary and statistical tools to analyze data. In Python, there is a variety of functions that can be used for this. Alternatively, tools such as Power BI can perform both graphical and non-graphical analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Univariate and multivariate analysis&lt;/strong&gt;&lt;br&gt;
Univariate analysis examines a single variable to establish patterns in it. It can employ either graphical or non-graphical techniques.&lt;br&gt;
Multivariate analysis, on the other hand, assesses two or more variables and establishes patterns and relationships among them. This can also take the form of graphical or statistical analysis.&lt;/p&gt;
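&lt;p&gt;These categories can be sketched with a toy DataFrame using pandas and Matplotlib (the column names and values here are purely illustrative, not from any particular dataset):&lt;/p&gt;

```python
import pandas as pd

# Toy data for illustration only
data = pd.DataFrame({"age": [23, 31, 28, 45, 39, 52],
                     "salary": [40, 55, 50, 80, 70, 95]})

# Univariate, non-graphical: summary statistics of a single variable
print(data["age"].describe())

# Univariate, graphical: distribution of a single variable
data["age"].plot(kind="hist", title="Age distribution")

# Multivariate, non-graphical: correlation between two variables
print(data["age"].corr(data["salary"]))

# Multivariate, graphical: scatter plot of two variables
data.plot(kind="scatter", x="age", y="salary", title="Age vs Salary")
```

&lt;p&gt;The same four combinations apply to any dataset: pick the variables first, then decide between a statistical summary and a plot.&lt;/p&gt;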
&lt;h3&gt;
  
  
  Why do we need to perform EDA?
&lt;/h3&gt;

&lt;p&gt;Exploratory data analysis is crucial because it helps us to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make sense of the data&lt;/li&gt;
&lt;li&gt;Remove outliers and anomalies from the data&lt;/li&gt;
&lt;li&gt;Prepare the data for the next step of data analysis&lt;/li&gt;
&lt;li&gt;Achieve better performance and more accurate results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Remember:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Garbage In, Garbage Out (GIGO)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  The CRISP DM process
&lt;/h2&gt;

&lt;p&gt;The CRoss-Industry Standard Process for Data Mining (CRISP-DM) provides a standard methodology that is widely accepted for data mining and data science projects. It dates back to the late nineties and is still one of the most widely adopted methodologies in the analytics world, providing an approach that is technology and problem neutral.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxcwjni0ppz0kzufo86t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxcwjni0ppz0kzufo86t.png" alt="Fig 1 CRISP DM process" width="800" height="801"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The EDA concept corresponds to the second step in the CRISP-DM framework, data understanding.&lt;/p&gt;

&lt;p&gt;Now let's do some hands-on exploratory data analysis using Python. The data used here is from Kaggle: &lt;a href="https://www.kaggle.com/datasets/parulpandey/2020-it-salary-survey-for-eu-region" rel="noopener noreferrer"&gt;EDA: IT Salary Survey 2020&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  General information
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Importing libraries&lt;/strong&gt;&lt;br&gt;
pyforest is a Python library that imports all the frequently used basic Python libraries in one line of code, instead of having to import libraries such as pandas and NumPy individually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pyforest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Loading the Dataset&lt;/strong&gt;&lt;br&gt;
The data is available in CSV format. We use the pandas &lt;code&gt;read_csv&lt;/code&gt; function to load it into a DataFrame and view the first five rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv('IT Salary Survey EU  2020.csv')
df.head(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz5wjsjlq0617lwc57gx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz5wjsjlq0617lwc57gx.png" alt="First 5 rows" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shape of the Dataset&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;shape&lt;/code&gt; attribute gives the dimensions of the dataset as a tuple: the first value is the number of rows, while the second is the number of columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvno20rg3jxwtgptduu4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvno20rg3jxwtgptduu4z.png" alt="Shape of the dataset" width="157" height="50"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary information on the data set and datatypes&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;info()&lt;/code&gt; function gives general information about the columns of the dataset, such as the number of non-null values and the data types of the column variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.info()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zvx1smit0swik5ozsjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zvx1smit0swik5ozsjt.png" alt="Summary info" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Column names&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;columns&lt;/code&gt; attribute enables us to get the names of the columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.columns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nwwcdj5zi7a9wy1sp3l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nwwcdj5zi7a9wy1sp3l.png" alt="Column names" width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some of the column names are very long. We rename the columns and then replace the white spaces with underscores (_).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.columns = ["Year", "Age", "Gender","City","Position","Years of experience","Germany experience","Seniority level","Tech program language",
                       "Other Language","Yearly salary","Yearly bonus and stocks", "Salary one year ago","Bonus and stocks last year","Vacation days",
                       "Employment_status","Сontract_duration","Language","Company size","Company type","Job loss COVID","Kurzarbeit","Monetary Support"]

df.columns = df.columns.str.replace(' ', '_')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Unique Value counts for columns&lt;/strong&gt;&lt;br&gt;
Here, we get the number of unique values in each column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.nunique(axis=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2hdg9irnbvc5ys9uiby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2hdg9irnbvc5ys9uiby.png" alt="Number of unique values" width="299" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary statistics&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;describe()&lt;/code&gt; function gives summary statistics of the dataset such as the count, mean, standard deviation, min, max and the 25%, 50% and 75% percentiles.&lt;br&gt;
Note: by default this works only for columns with numerical values&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo37bmla1uqe1vwey4sf4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo37bmla1uqe1vwey4sf4.png" alt="Summary Statistics" width="516" height="270"&gt;&lt;/a&gt;&lt;/p&gt;
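&lt;p&gt;By default &lt;code&gt;describe()&lt;/code&gt; skips non-numerical columns. Passing the &lt;code&gt;include&lt;/code&gt; argument summarizes them too; a small sketch on a toy DataFrame (the values are illustrative):&lt;/p&gt;

```python
import pandas as pd

# Toy frame with one categorical and one numerical column
toy = pd.DataFrame({"city": ["Berlin", "Munich", "Berlin"],
                    "salary": [70000, 65000, 80000]})

# Default: numerical columns only
print(toy.describe())

# Object (string) columns: count, unique, top and freq
print(toy.describe(include="object"))
```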

&lt;p&gt;&lt;strong&gt;Missing values&lt;/strong&gt;&lt;br&gt;
Missing values can greatly affect the performance of a model. There are different ways of handling missing data. One of them is dropping the affected rows, which is not advisable when there is a large number of null values. The &lt;code&gt;fillna()&lt;/code&gt; approach can instead replace the missing values with the mean, median or any other value of that column.&lt;br&gt;
In this case, we first visualize all the missing values in the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.isna().sum().sort_values().plot(kind='barh')
plt.title('Distribution of null values')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqpr45vybpdrpx2jypzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqpr45vybpdrpx2jypzb.png" alt="Missing values visualization" width="722" height="433"&gt;&lt;/a&gt;&lt;br&gt;
We then drop rows with missing values in the age, gender, position, years of experience, seniority level, salary one year ago and language columns, and also remove duplicate rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = df.dropna(subset=['Age','Gender','Position','Years_of_experience','Seniority_level','Salary_one_year_ago','Language'])

df = df.drop_duplicates()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88nszy4gox0b1ayycwrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88nszy4gox0b1ayycwrg.png" alt="Dropping Missing Values" width="734" height="433"&gt;&lt;/a&gt;&lt;/p&gt;
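&lt;p&gt;As mentioned above, &lt;code&gt;fillna()&lt;/code&gt; is the alternative to dropping rows. A minimal sketch with the column median (toy values, not the survey data):&lt;/p&gt;

```python
import pandas as pd

# Toy column with two missing ages
toy = pd.DataFrame({"Age": [26.0, None, 30.0, None, 34.0]})

# Replace missing ages with the column median instead of dropping the rows
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy["Age"].tolist())  # [26.0, 30.0, 30.0, 30.0, 34.0]
```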

&lt;p&gt;&lt;strong&gt;Changing Data Types&lt;/strong&gt;&lt;br&gt;
To allow for easy analysis and computation, it is important to have the variables in the correct data types. In this case, we change the age column to an integer data type and extract the year using pandas' datetime functionality.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['Age'] = df['Age'].astype(int)

df['Year'] = pd.to_datetime(df['Year']).dt.year
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnxmugzr0bu06ighif2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnxmugzr0bu06ighif2f.png" alt="Changing Datatypes" width="447" height="79"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Graphical Visualization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Univariate analysis
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Data distribution
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Age&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.set(font_scale=1.3)
plot=sns.histplot(IT_Filtered.Age,bins=9,kde=True)
plt.title('Age Distribution')
plt.show()

print('Age Descriptive Statistics')
IT_Filtered['Age'].describe().round()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facsyts0z66d30aqqmlhf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facsyts0z66d30aqqmlhf.png" alt="Age visualization" width="587" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this it is evident that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The majority of the people surveyed are between 20 and 40 years old&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Workplace City Location&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;city10=IT_Filtered.City.value_counts().iloc[:10].to_frame().reset_index()
city10=city10.rename(columns={"index": "City", "City": "Count"})

sns.barplot(x=city10["Count"],y=city10["City"])
plt.title('Workplace City Location')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv145lb14kbu6f6fti3me.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv145lb14kbu6f6fti3me.png" alt="Location visualization" width="671" height="471"&gt;&lt;/a&gt;&lt;br&gt;
We can conclude that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Berlin is the city with the most tech opportunities in Germany.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Gender&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(15,8))

IT_Filtered['Gender'].value_counts().plot.pie(autopct="%1.2f%%",fontsize=12,startangle=90, cmap='crest',explode=[0.05] * 3,pctdistance=1.1,labeldistance=1.3,textprops={'fontsize': 15})
plt.ylabel("")
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9iedet86zlobzdy16qu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9iedet86zlobzdy16qu.png" alt="Gender" width="636" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The tech jobs in Germany are male dominated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Programming Language&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;program=IT_Filtered["Tech_program_language"].value_counts().iloc[:15].reset_index()
program=program.rename(columns={"index": "Program", "Tech_program_language": "Count"})
sns.barplot(data=program, y="Program", x="Count")
plt.title('Top 15 Program Languages')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqkqdq5of7457as1qweh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqkqdq5of7457as1qweh.png" alt="Programming Language" width="645" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is evident that Python and Java are some of the most popular programming languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multivariate analysis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Correlation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IT_Filtered.corr()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2t4gqhcdriq9mc3dd5n3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2t4gqhcdriq9mc3dd5n3.png" alt="Correlation Matrix" width="658" height="207"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(15, 12))
sns.heatmap(IT_Filtered.corr(), annot = True, cmap= 'coolwarm')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n8jodbl0gurmcqk3riq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n8jodbl0gurmcqk3riq.png" alt="Correlation matrix visual" width="800" height="697"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The correlation matrix shows the relationships between the different numerical columns in the dataset.&lt;br&gt;
Correlation is measured on a scale of -1 to 1, where -1 indicates a strong negative correlation and 1 a strong positive correlation.&lt;br&gt;
The heatmap provides a better visualization of the correlations with an added color scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top 15 programming languages&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;program_top=[]
for x in program['Program']:
     if x not in program_top:
        program_top.append(x)

program_df = IT_Filtered[IT_Filtered['Tech_program_language'].isin(program_top)] 
sns.boxplot(data=program_df, x="Tech_program_language", y="Yearly_salary")
plt.xlabel('Program Language')
plt.ylabel('Yearly Salary')
plt.gcf().set_size_inches(15, 8)
plt.suptitle('Salary by Top 15 Programming Languages')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flq4hvdlt7272t3db5n1o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flq4hvdlt7272t3db5n1o.png" alt="Programming Languages" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gender and Age&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(12,8))
sns.boxplot(x='Gender', y='Age', data=IT_Filtered, palette='Pastel2')
plt.xticks(fontsize=13)
plt.xlabel("")
plt.yticks(fontsize=13)
plt.ylabel("Age", fontsize=14)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficuu8guq45e9x9o3erhj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficuu8guq45e9x9o3erhj.png" alt="Gender and Age" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Job Loss due to COVID-19&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.histplot(IT_Filtered, x="Job_loss_COVID", stat="percent", multiple="dodge", shrink=.8, hue='Seniority_level')
plt.suptitle('Job Loss due to COVID-19',y=1.01)
plt.xlabel('')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7nxsd4gjrcwhreqg6dp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7nxsd4gjrcwhreqg6dp.png" alt="Job loss" width="576" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Years of experience vs German experience&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(20, 5))
sns.histplot(data=IT_Filtered[['Years_of_experience', 'Germany_experience']].melt(), x='value', hue='variable', kde=True)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa62yd4kbj9lrjwouuzpi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa62yd4kbj9lrjwouuzpi.png" alt="German experience" width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Martinez-Plumed, F. et al. (2021) “CRISP-DM twenty years later: 
From data mining processes to data science trajectories,” IEEE 
Transactions on Knowledge and Data Engineering, 33(8), pp. 
3048–3061. Available at: 
&lt;a href="https://doi.org/10.1109/tkde.2019.2962680" rel="noopener noreferrer"&gt;https://doi.org/10.1109/tkde.2019.2962680&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Parul Pandey. IT Salary Survey for EU region (2018-2020). Kaggle. 
Accessed February 11, 2022. 
&lt;a href="https://www.kaggle.com/datasets/parulpandey/2020-it-salary-" rel="noopener noreferrer"&gt;https://www.kaggle.com/datasets/parulpandey/2020-it-salary-&lt;/a&gt; 
survey-for-eu-region.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>career</category>
      <category>architecture</category>
      <category>buildinpublic</category>
      <category>saas</category>
    </item>
    <item>
      <title>SQL101: Introduction to SQL for Data Analysis</title>
      <dc:creator>Kimani Kanyutu</dc:creator>
      <pubDate>Tue, 21 Feb 2023 13:19:52 +0000</pubDate>
      <link>https://dev.to/kim_kanyutu/sql101-introduction-to-sql-for-data-analysis-1nfb</link>
      <guid>https://dev.to/kim_kanyutu/sql101-introduction-to-sql-for-data-analysis-1nfb</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A database is a collection of related data. A database can be used alone or together with other databases. A database management system (DBMS) is software consisting of a set of programs that facilitates the storage, modification and extraction of information from a database. DBMSs were developed to address the deficiencies of traditional file systems in areas such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data integrity&lt;/li&gt;
&lt;li&gt;Concurrent access by users&lt;/li&gt;
&lt;li&gt;Data redundancy and inconsistency&lt;/li&gt;
&lt;li&gt;Difficulty accessing the data&lt;/li&gt;
&lt;li&gt;Atomicity, consistency, isolation, Durability(ACID) properties&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Database Models
&lt;/h3&gt;

&lt;p&gt;A database model shows the structure of a database: the relationships and constraints that determine how data is processed and stored inside it.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Hierarchical database management system
&lt;/h4&gt;

&lt;p&gt;In this model, data is organized in a tree-like format. It allows for a parent-child relationship where a parent can have many children but a child can have only one parent.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Network based database management system
&lt;/h4&gt;

&lt;p&gt;The network model is an advancement of the hierarchical model, created to address its lack of flexibility.&lt;br&gt;
As a result, it allows children to have multiple parents, creating room for more complex relationships.&lt;/p&gt;
&lt;h4&gt;
  
  
  3. Relational database management system
&lt;/h4&gt;

&lt;p&gt;The relational model stores data in data tables (relations). The tables consist of rows (records) and columns (attributes).&lt;br&gt;
Relationships between data tables can be established, ranging from one-to-one and one-to-many to many-to-many. This creates room for efficient storage and retrieval of data.&lt;/p&gt;
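&lt;p&gt;A one-to-many relationship can be sketched with SQLite through Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (the table and column names here are invented for illustration):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One department has many employees: a one-to-many relationship
cur.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE employees (
                   id INTEGER PRIMARY KEY,
                   name TEXT,
                   dept_id INTEGER REFERENCES departments(id))""")

cur.execute("INSERT INTO departments (id, name) VALUES (1, 'Engineering')")
cur.executemany("INSERT INTO employees (name, dept_id) VALUES (?, ?)",
                [('Ada', 1), ('Linus', 1)])

# Join the two relations to list each employee with their department
rows = cur.execute("""SELECT e.name, d.name
                      FROM employees e
                      JOIN departments d ON e.dept_id = d.id""").fetchall()
print(rows)  # [('Ada', 'Engineering'), ('Linus', 'Engineering')]
```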
&lt;h4&gt;
  
  
  4. Object oriented database management system
&lt;/h4&gt;

&lt;p&gt;In an object oriented database management system (OODBMS), data and data relationships are stored together in single entities known as objects.&lt;/p&gt;
&lt;h3&gt;
  
  
  Database Languages
&lt;/h3&gt;

&lt;p&gt;Database languages usually consist of three main components:&lt;/p&gt;
&lt;h4&gt;
  
  
  I. Data Definition Language
&lt;/h4&gt;

&lt;p&gt;Data Definition Language is used to define the structures that hold data of specific record or object types, the relationships among them, and any integrity constraints that need to be met.&lt;br&gt;
Some of the DDL statements include:&lt;br&gt;
&lt;code&gt;DROP&lt;/code&gt;&lt;br&gt;
&lt;code&gt;CREATE&lt;/code&gt;&lt;br&gt;
&lt;code&gt;ALTER&lt;/code&gt;&lt;br&gt;
&lt;code&gt;TRUNCATE&lt;/code&gt;&lt;br&gt;
&lt;code&gt;RENAME&lt;/code&gt;&lt;br&gt;
&lt;code&gt;COMMENT&lt;/code&gt;&lt;/p&gt;
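&lt;p&gt;The DDL statements above can be tried out with SQLite via Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (SQLite supports a subset of DDL, e.g. renaming is done through ALTER TABLE, and the table name here is illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: define a new table
cur.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT)")

# ALTER: add a column to the existing table
cur.execute("ALTER TABLE staff ADD COLUMN salary REAL")

# RENAME (via ALTER in SQLite): give the table a new name
cur.execute("ALTER TABLE staff RENAME TO employees")

# DROP: remove the table entirely
cur.execute("DROP TABLE employees")
```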
&lt;h4&gt;
  
  
  II. Data Manipulation Language
&lt;/h4&gt;

&lt;p&gt;Data Manipulation Language is used to select, retrieve, store, modify, delete, insert and update entries. The part of DML used to retrieve data is called the query language.&lt;br&gt;
Some of the basic DML statements include:&lt;br&gt;
&lt;code&gt;INSERT&lt;/code&gt;&lt;br&gt;
&lt;code&gt;DELETE&lt;/code&gt;&lt;br&gt;
&lt;code&gt;SELECT&lt;/code&gt;&lt;br&gt;
&lt;code&gt;UPDATE&lt;/code&gt;&lt;/p&gt;
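&lt;p&gt;The four basic DML statements in action, again using SQLite from Python (the table and values are invented for illustration):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (name TEXT, salary REAL)")

# INSERT: add new rows
cur.executemany("INSERT INTO employees VALUES (?, ?)",
                [('Ada', 70000), ('Grace', 80000)])

# UPDATE: modify existing rows
cur.execute("UPDATE employees SET salary = 75000 WHERE name = 'Ada'")

# DELETE: remove rows
cur.execute("DELETE FROM employees WHERE name = 'Grace'")

# SELECT: retrieve the remaining rows
print(cur.execute("SELECT name, salary FROM employees").fetchall())
# [('Ada', 75000.0)]
```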
&lt;h4&gt;
  
  
  III. Data Control Language
&lt;/h4&gt;

&lt;p&gt;Data Control Language is used to control access to data in a database. This involves giving users specific privileges to access data items. Examples of DCL statements include:&lt;br&gt;
&lt;code&gt;GRANT&lt;/code&gt;&lt;br&gt;
&lt;code&gt;REVOKE&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Structured Query Language (SQL)
&lt;/h2&gt;

&lt;p&gt;Structured Query Language is one of the most widely implemented languages for relational databases.&lt;/p&gt;
&lt;h4&gt;
  
  
  Concepts in SQL
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Tables. 
A table is a combination of rows and columns used to store data.&lt;/li&gt;
&lt;li&gt;Rows. 
A row is a single record in a table.&lt;/li&gt;
&lt;li&gt;Columns. 
A column represents a single attribute of the data.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Datatypes
&lt;/h4&gt;

&lt;p&gt;Data types are attributes that specify the kind of data an object will hold, such as integers, characters, date and time data, and binary strings. Some of the common data types can be categorized into;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;String data types such as; 
&lt;code&gt;VARCHAR()&lt;/code&gt;
&lt;code&gt;CHAR()&lt;/code&gt;
&lt;code&gt;TEXT&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Numeric data types such as;
&lt;code&gt;INT&lt;/code&gt;
&lt;code&gt;FLOAT&lt;/code&gt;
&lt;code&gt;BOOL&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Date and time data types such as;
&lt;code&gt;DATE&lt;/code&gt;
&lt;code&gt;TIME&lt;/code&gt;
&lt;code&gt;TIMESTAMP&lt;/code&gt;
&lt;code&gt;DATETIME&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
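&lt;p&gt;For illustration, the categories above might appear together in a single table definition (the table and column names are hypothetical);&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE orders(
    order_note  TEXT,         -- string data
    quantity    INT,          -- numeric data
    unit_price  FLOAT,        -- numeric data
    ordered_at  TIMESTAMP     -- date and time data
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;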
&lt;h2&gt;
  
  
  Basic SQL commands
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Creating databases
&lt;/h3&gt;

&lt;p&gt;To create a database, one can use basic SQL commands. It is extremely important to first understand the type of data that will be stored and the various relationships that need to exist, and plans should be put in place for long-term storage. These statements can be used to either create a new database or drop an existing one;&lt;br&gt;
Creating a database;&lt;br&gt;
&lt;code&gt;CREATE DATABASE database_name;&lt;/code&gt;&lt;br&gt;
Dropping an existing database;&lt;br&gt;
&lt;code&gt;DROP DATABASE database_name;&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Creating Tables
&lt;/h3&gt;

&lt;p&gt;After creating a database, the next thing is to create tables. Here, the name of the table, the column names and column parameters as well as the datatype in the column are specified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE depts(
   first_name VARCHAR(50),
   department VARCHAR(50)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Considerations in creating tables&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Primary keys&lt;/strong&gt;. These are unique row identifiers (typically integers) in a table. They are crucial when performing joins on tables.&lt;br&gt;
&lt;strong&gt;Foreign key&lt;/strong&gt;. A foreign key is a field that references the primary key of another table. The table containing the foreign key is the &lt;strong&gt;referencing/child table&lt;/strong&gt; while the table to which the foreign key refers is the &lt;strong&gt;parent table&lt;/strong&gt;.&lt;br&gt;
When creating tables, constraints can be used to define a primary key or to attach a foreign key relationship to another table.&lt;br&gt;
&lt;strong&gt;Constraints&lt;/strong&gt;&lt;br&gt;
Constraints are used to enforce conditions on the data entered into columns or tables, so as to ensure the accuracy and consistency of the data. Constraints can be categorized into;&lt;br&gt;
&lt;strong&gt;Column constraints&lt;/strong&gt;, which constrain data in a single column;&lt;br&gt;
&lt;strong&gt;Table constraints&lt;/strong&gt;, which constrain data in the entire table.&lt;br&gt;
Some examples of common constraints include;&lt;br&gt;
&lt;code&gt;NOT NULL&lt;/code&gt; , &lt;code&gt;UNIQUE&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE employees(
        emp_id SERIAL PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    birthday DATE CHECK (birthdate &amp;gt; '1900-01-01'),
    hire_date DATE CHECK (hire_date &amp;gt; birthdate),
    salary INTEGER CHECK (salary &amp;gt; 0)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;INSERT&lt;/strong&gt;&lt;br&gt;
After creating a table and defining constraints, the next step is to enter data into the table using the INSERT command, which specifies the table name, the column names and the values to insert.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO employees(
first_name,
last_name,
birthdate,
hire_date,
salary
)
VALUES
('Sammy',
 'Smith',
 '1990-11-03',
 '2010-01-01',
 100
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;UPDATE&lt;/strong&gt;&lt;br&gt;
The update statement changes existing values in a table. Note that without a condition it updates every row;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE account
SET last_login   =  CURRENT_TIMESTAMP;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
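&lt;p&gt;In practice, UPDATE is usually paired with a WHERE clause so that only matching rows change. A sketch against the same account table (the account_id column is hypothetical);&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE account
SET last_login = CURRENT_TIMESTAMP
WHERE account_id = 42;   -- only this account is touched
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;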



&lt;p&gt;&lt;strong&gt;DELETE&lt;/strong&gt;&lt;br&gt;
The delete statement is used to remove rows from a table. It can be used together with a condition to delete only the rows that meet that condition;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE FROM job
WHERE job_name = 'Farmer'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ALTER&lt;/strong&gt;&lt;br&gt;
The alter table command is used to add, delete or modify columns in an existing table. It can also be used to add or drop constraints, or to rename database objects as shown;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER VIEW customer_info RENAME to c_info;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DROP&lt;/strong&gt;&lt;br&gt;
The drop table statement is used to remove an existing table from a database. An IF EXISTS clause can be added so the statement does not fail when the table is absent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP IF EXISTS customer_info;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Querying from a table
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SELECT&lt;/strong&gt;&lt;br&gt;
The select statement is used to retrieve data from a database. In general, &lt;code&gt;SELECT * FROM table_name;&lt;/code&gt; returns all the data in a particular table. To retrieve specific columns instead, use;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column1, column2
FROM table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SELECT DISTINCT&lt;/strong&gt;&lt;br&gt;
The select distinct statement returns unique values only, eliminating duplicate rows from the result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT column1, column2
FROM table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;WHERE&lt;/strong&gt;&lt;br&gt;
When you only need data that meets certain conditions, the WHERE statement is used. Only the rows that meet the condition are returned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT memid,surname, firstname, joindate 
FROM members
WHERE joindate &amp;gt;= '2012-09-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;AND, OR, NOT&lt;/strong&gt;&lt;br&gt;
The AND, OR and NOT operators are used together with the WHERE statement to filter the output to rows meeting certain conditions.&lt;br&gt;
The AND operator is used where all conditions need to be true;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column1, column2
FROM table_name
WHERE condition1 AND condition2 AND condition3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OR operator is used to separate conditions where only one of them needs to be true;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column1, column2
FROM table_name
WHERE condition1 OR condition2 OR condition3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The NOT operator is used to return rows where the condition is not true;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column1, column2
FROM table_name
WHERE NOT condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ORDER BY&lt;/strong&gt;&lt;br&gt;
The order by statement is used to sort the query result in either ascending or descending order.&lt;br&gt;
&lt;code&gt;ASC&lt;/code&gt; Ascending order&lt;br&gt;
&lt;code&gt;DESC&lt;/code&gt; Descending order&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT(surname) 
FROM members
ORDER BY surname ASC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
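&lt;p&gt;Sorting can also be descending, or applied to several columns at once. A sketch against the same members table (the firstname column is assumed);&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT surname, firstname
FROM members
ORDER BY surname DESC, firstname ASC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;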



&lt;p&gt;&lt;strong&gt;LIMIT&lt;/strong&gt;&lt;br&gt;
The limit statement is used to specify the maximum number of rows to be returned in the result. In effect, it sets an upper bound on the result size.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT(first_name), last_name 
FROM students
ORDER BY first_name ASC
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
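&lt;p&gt;LIMIT is often paired with OFFSET to page through results. A hypothetical sketch on the same students table that skips the first 10 rows and returns the next 5;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT first_name, last_name
FROM students
ORDER BY first_name ASC
LIMIT 5 OFFSET 10;   -- rows 11 through 15 of the sorted result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;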



&lt;p&gt;&lt;strong&gt;BETWEEN&lt;/strong&gt;&lt;br&gt;
The between statement is used to return values within a specified range. The first and last values are specified, and in most SQL dialects both boundaries are inclusive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM Products
WHERE Product_name BETWEEN 'Cussons' AND 'Dettol';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;IN, NOT IN&lt;/strong&gt;&lt;br&gt;
The IN and NOT IN operators work similarly to chained OR conditions: they test a column against a list of values in the WHERE clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * 
FROM Customers
WHERE Country IN ('Germany', 'France', 'UK');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * 
FROM Customers
WHERE Country NOT IN ('Germany', 'France', 'UK');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LIKE, ILIKE&lt;/strong&gt;&lt;br&gt;
The LIKE and ILIKE operators are used with the WHERE statement to search for patterns in a column. LIKE is case-sensitive, while ILIKE (a PostgreSQL extension) matches the pattern regardless of case.&lt;br&gt;
There are wildcards that are used in conjunction with these two. They include;&lt;br&gt;
% - Zero or more characters&lt;br&gt;
_ - Exactly one character.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE FROM parent
WHERE last_name ILIKE 'l%';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT actor_id, first_name, last_name
FROM actor
WHERE last_name LIKE 'B%';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GROUP BY&lt;/strong&gt;&lt;br&gt;
The group by statement is used to group rows that have the same values in certain columns. Most of the time, GROUP BY is used together with aggregate functions such as &lt;code&gt;COUNT()&lt;/code&gt;, &lt;code&gt;MAX()&lt;/code&gt;, &lt;code&gt;MIN()&lt;/code&gt;, &lt;code&gt;SUM()&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT staff_id,customer_id, sum(amount)
FROM payment
GROUP BY staff_id,customer_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
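&lt;p&gt;When the groups themselves need filtering, a HAVING clause can be added; unlike WHERE, it is applied after the grouping. A sketch on the same payment table;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT staff_id, customer_id, SUM(amount)
FROM payment
GROUP BY staff_id, customer_id
HAVING SUM(amount) &amp;gt; 100;   -- keep only groups whose total exceeds 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;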



&lt;p&gt;SQL provides a wide range of commands that can be used to query, add and manipulate data in tables and databases. This is not an exhaustive list of the commands but a summary of a few basic SQL commands that a beginner is most likely to interact with frequently.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>sql</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
