<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joy Ada Uche</title>
    <description>The latest articles on DEV Community by Joy Ada Uche (@joyadauche).</description>
    <link>https://dev.to/joyadauche</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F106713%2F80b529db-4623-42bf-9084-de3a346d5fdb.jpeg</url>
      <title>DEV Community: Joy Ada Uche</title>
      <link>https://dev.to/joyadauche</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joyadauche"/>
    <language>en</language>
    <item>
      <title>The SQL Savant: Outer Joins in SQL</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Thu, 31 Dec 2020 18:29:55 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-sql-savant-outer-joins-in-sql-1gfa</link>
      <guid>https://dev.to/joyadauche/the-sql-savant-outer-joins-in-sql-1gfa</guid>
      <description>&lt;p&gt;Amazing New Year!!! 😀 &lt;a href="https://dev.to/joyadauche/the-sql-savant-inner-joins-in-sql-37ak"&gt;So, the series of meetings with the new Javascript teacher&lt;/a&gt; went quite well and we got loads of analysis we gotta do...&lt;/p&gt;

&lt;p&gt;So right now he wants every student's academic details, whether they have a grade or not, which can easily be achieved with a Left Outer Join. Hence, let's talk about OUTER JOINS!&lt;/p&gt;

&lt;p&gt;With outer joins, all records from one table are kept even if there are no matches in the other table that it joins on. There are 3 types of outer joins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Left Joins&lt;/li&gt;
&lt;li&gt;Right Joins&lt;/li&gt;
&lt;li&gt;Full Joins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With &lt;strong&gt;LEFT JOIN&lt;/strong&gt;, all records from the &lt;strong&gt;left table&lt;/strong&gt; (i.e the left table is the one after the FROM clause) are kept even if there are no matches in the right table (i.e the table after the JOIN type). Remember that from &lt;a href="https://dev.to/joyadauche/the-sql-savant-inner-joins-in-sql-37ak"&gt;here&lt;/a&gt;, the class has a database with the &lt;strong&gt;person&lt;/strong&gt; and &lt;strong&gt;grade&lt;/strong&gt; tables as below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Applying a Left Join to the above, the table with all students, whether they have a grade or not, looks like below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;From above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can see that all records from the &lt;strong&gt;person&lt;/strong&gt; table, which is the left table, are returned,&lt;/li&gt;
&lt;li&gt;Also, the records with &lt;strong&gt;null&lt;/strong&gt; are those that exist in the left table (person table) but have no matching record in the right table (grade table),&lt;/li&gt;
&lt;li&gt;The first 4 records are the same as those returned by an inner join, while the last 3 correspond to students that do not have a grade, hence their grade values are null.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code for the result of the LEFT JOIN above is below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Unlike an INNER JOIN, which keeps just the records corresponding to the id values of &lt;strong&gt;33CC&lt;/strong&gt; and &lt;strong&gt;44DD&lt;/strong&gt;, a LEFT JOIN keeps all of the records in the left table but marks the values from the right table as null for those that don’t have a match.&lt;/p&gt;
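&lt;p&gt;If you'd like to poke at this yourself, here is a minimal sketch using Python's built-in sqlite3 module - the table layouts, column names and sample rows below are assumptions for illustration, not the article's original gist:&lt;/p&gt;

```python
import sqlite3

# In-memory database with simplified person and grade tables
# (columns, names and ids are assumed for illustration).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (id TEXT, name TEXT);
    CREATE TABLE grade  (person_id TEXT, grade TEXT);
    INSERT INTO person VALUES ('33CC', 'Ada'), ('44DD', 'Ben'), ('55EE', 'Chi');
    INSERT INTO grade  VALUES ('33CC', 'A'), ('44DD', 'B');
""")

# LEFT JOIN keeps every row of person; unmatched rows get NULL grades.
rows = con.execute("""
    SELECT p.id, p.name, g.grade
    FROM person p
    LEFT JOIN grade g ON g.person_id = p.id
""").fetchall()

for row in rows:
    print(row)  # ('55EE', 'Chi', None) is the student without a grade
```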

&lt;p&gt;Moving on to &lt;strong&gt;RIGHT JOINS&lt;/strong&gt;, which just do the reverse of LEFT JOINS: all records from the &lt;strong&gt;right table&lt;/strong&gt; are kept, matched via the key column, even if there are no matching records in the left table. Let's see the code below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;From above, the &lt;strong&gt;right&lt;/strong&gt; table is &lt;strong&gt;person&lt;/strong&gt; while the left table is &lt;strong&gt;grade&lt;/strong&gt;. Since the RIGHT JOIN is just the reverse of the LEFT JOIN, the LEFT JOIN is more commonly used.&lt;/p&gt;
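&lt;p&gt;One way to see that equivalence: SQLite did not even support RIGHT JOIN until version 3.39, precisely because a RIGHT JOIN can always be rewritten as a LEFT JOIN with the two tables swapped. A sketch (schema and rows assumed, as before):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (id TEXT, name TEXT);
    CREATE TABLE grade  (person_id TEXT, grade TEXT);
    INSERT INTO person VALUES ('33CC', 'Ada'), ('44DD', 'Ben');
    INSERT INTO grade  VALUES ('33CC', 'A'), ('99ZZ', 'C');
""")

# "grade RIGHT JOIN person" keeps every person row; rewriting it as
# "person LEFT JOIN grade" returns exactly the same result set.
rewritten = con.execute("""
    SELECT p.id, p.name, g.grade
    FROM person p
    LEFT JOIN grade g ON g.person_id = p.id
""").fetchall()

print(rewritten)  # ('44DD', 'Ben', None) is kept even without a grade
```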

&lt;p&gt;Finally, let's talk about &lt;strong&gt;FULL JOINS&lt;/strong&gt;! This type of join combines both the LEFT JOIN and RIGHT JOIN. It combines all the records from the LEFT TABLE and the RIGHT TABLE. For record values that do not match for the left and right tables, the value will be null, as seen in other types of outer joins.&lt;/p&gt;

&lt;p&gt;Note that in our example, the result of the FULL JOIN will be the same as that of the LEFT JOIN: with &lt;strong&gt;person&lt;/strong&gt; as the left table and &lt;strong&gt;grade&lt;/strong&gt; as the right table, every record in the right table matches a record in the left table, i.e there are no records in the right table that cannot be found in the left table, so the FULL JOIN returns exactly the records a Left Join would. Now, let's see the Full Join code below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As you can see from above, we just had to change the join type to FULL JOIN. Also, kindly note that we can do multiple joins with any type of outer joins just like we saw &lt;a href="https://dev.to/joyadauche/the-sql-savant-inner-joins-in-sql-37ak"&gt;here&lt;/a&gt;.&lt;/p&gt;
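&lt;p&gt;As a side note, on engines without FULL JOIN support (e.g. SQLite before 3.39), it is commonly emulated by UNION-ing a LEFT JOIN with the reversed LEFT JOIN's unmatched rows - a sketch with the same assumed schema:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (id TEXT, name TEXT);
    CREATE TABLE grade  (person_id TEXT, grade TEXT);
    INSERT INTO person VALUES ('33CC', 'Ada'), ('55EE', 'Chi');
    INSERT INTO grade  VALUES ('33CC', 'A'), ('99ZZ', 'C');
""")

# FULL JOIN emulated as: person LEFT JOIN grade, plus the grade rows
# that have no matching person (their person columns become NULL).
rows = con.execute("""
    SELECT p.id, p.name, g.grade
    FROM person p LEFT JOIN grade g ON g.person_id = p.id
    UNION ALL
    SELECT p.id, p.name, g.grade
    FROM grade g LEFT JOIN person p ON p.id = g.person_id
    WHERE p.id IS NULL
""").fetchall()

for row in rows:
    print(row)
```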

&lt;p&gt;Quite simple! So we can share our SQL analysis with the JS teacher and move on to a special kind of join called CROSS JOIN to perform more analyses! Have an amazing and fulfilled week ahead in this new year! 😉&lt;/p&gt;

</description>
      <category>sql</category>
      <category>datascience</category>
      <category>postgres</category>
      <category>dataanalysis</category>
    </item>
    <item>
      <title>The AI Alpha Geek: It starts with EDA! - Part D</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Mon, 30 Nov 2020 19:14:18 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-d-hk9</link>
      <guid>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-d-hk9</guid>
      <description>&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i"&gt;The AI Alpha Geek: It starts with EDA! - Part A&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3"&gt;The AI Alpha Geek: It starts with EDA! - Part B&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-c-2nle"&gt;The AI Alpha Geek: It starts with EDA! - Part C&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, let's see feature relationships i.e exploring 2 or more features together. Let's look at the code example below - &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Let's take a look at the &lt;strong&gt;Pclass&lt;/strong&gt; and &lt;strong&gt;Survived&lt;/strong&gt; features below produced from Line 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmr72bctabj2ufs35ykgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmr72bctabj2ufs35ykgc.png" alt="Alt Text" width="792" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the count plot output above, it seems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a lot more people in the lower class, i.e class 3, didn't survive.&lt;/li&gt;
&lt;li&gt;more people in the upper class, i.e class 1, survived.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To know the why's behind the insights above, you can ask questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;could it be that passengers in the upper class had the opportunity to escape because they were situated on the upper deck of the titanic? &lt;/li&gt;
&lt;li&gt;probably when the ship hit the iceberg, the lower deck flooded and some passengers drowned?&lt;/li&gt;
&lt;li&gt;perhaps those at the upper deck were given preferential treatment?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you must also be wondering which gender had a higher survival rate? Look below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fakm8asx4u1e8z7tyc8t9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fakm8asx4u1e8z7tyc8t9.png" alt="Alt Text" width="786" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From above, remember from &lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3"&gt;Part B&lt;/a&gt; that the total number of people who did not survive is &lt;strong&gt;549&lt;/strong&gt; - so, we can see that a lot more males didn't survive. &lt;br&gt;
The above insight brings more questions to mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;could it be that women and children were saved first before adult males?&lt;/li&gt;
&lt;li&gt;could it be more males gave their lives for their loved ones?&lt;/li&gt;
&lt;/ul&gt;
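&lt;p&gt;Under the hood, that count plot is just tallying (Sex, Survived) pairs. A plain-Python sketch of the same tally (the sample records below are made up for illustration; the real plot counts the full training set):&lt;/p&gt;

```python
from collections import Counter

# Tiny made-up sample of (sex, survived) pairs standing in for the
# Titanic columns used by the count plot.
records = [
    ("male", 0), ("male", 0), ("male", 1),
    ("female", 1), ("female", 1), ("female", 0),
    ("male", 0),
]

counts = Counter(records)
for (sex, survived), n in sorted(counts.items()):
    print(sex, survived, n)
```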

&lt;p&gt;Now, let's see how more than 2 features relate below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fprofrvx5wubfbka3bh9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fprofrvx5wubfbka3bh9f.png" alt="Alt Text" width="782" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the bar plot output above, produced from Line 7, it seems that: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;for each class, passengers with a younger average age survived.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, looking at features &lt;strong&gt;Age&lt;/strong&gt;, &lt;strong&gt;Sex&lt;/strong&gt; and &lt;strong&gt;Survived&lt;/strong&gt; below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv35z6vn49916ytv9a6a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv35z6vn49916ytv9a6a2.png" alt="Alt Text" width="786" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above, we can see that the males who survived had an average age of 27.28 years, while the females who survived averaged 28.86 years.&lt;/p&gt;

&lt;p&gt;Let's visually explore the &lt;strong&gt;Fare&lt;/strong&gt; feature output for Lines 13 and 14 below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Faj4u9wkk4hd2a3nwio17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Faj4u9wkk4hd2a3nwio17.png" alt="Alt Text" width="794" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl08c5fir5fd2cdwngm4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl08c5fir5fd2cdwngm4k.png" alt="Alt Text" width="794" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above, it seems that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;males who survived paid an average fare of 40.82,&lt;/li&gt;
&lt;li&gt;females who survived paid an average fare of 51.94, and&lt;/li&gt;
&lt;li&gt;those who did not survive paid a much lower average fare, for both males and females.&lt;/li&gt;
&lt;/ul&gt;
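&lt;p&gt;Those bar plots boil down to a grouped mean. A minimal sketch of that computation with the standard library (the fares below are invented for illustration):&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean

# Made-up (sex, survived, fare) rows; the real data comes from train_df.
rows = [
    ("male", 1, 50.0), ("male", 1, 31.6), ("male", 0, 10.0),
    ("female", 1, 60.0), ("female", 1, 43.9), ("female", 0, 12.0),
]

# Group the fares by (sex, survived), then average each group.
fares = defaultdict(list)
for sex, survived, fare in rows:
    fares[(sex, survived)].append(fare)

avg_fare = {key: round(mean(values), 2) for key, values in fares.items()}
print(avg_fare)
```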

&lt;p&gt;When you run other lines of code for other features that include &lt;strong&gt;Embarked&lt;/strong&gt;, &lt;strong&gt;Parch&lt;/strong&gt;, and &lt;strong&gt;SibSp&lt;/strong&gt;, you will draw much more valuable insights from the data.&lt;/p&gt;

&lt;p&gt;We sure got more insights, which improve how much we understand our data. So, always try to understand or ask about the &lt;strong&gt;why&lt;/strong&gt; behind insights discovered. Stay tuned for the next part, where we collate all insights and valuable patterns and dive into &lt;strong&gt;Feature Engineering&lt;/strong&gt;, still using the Titanic dataset. Have an amazing December! 😉&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datavisualization</category>
      <category>eventdriven</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The AI Alpha Geek: It starts with EDA! - Part C</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Sat, 31 Oct 2020 18:32:05 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-c-2nle</link>
      <guid>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-c-2nle</guid>
      <description>&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i"&gt;The AI Alpha Geek: It starts with EDA! - Part A&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3"&gt;The AI Alpha Geek: It starts with EDA! - Part B&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here, we will explore the numerical features in this dataset - &lt;strong&gt;Age&lt;/strong&gt; and &lt;strong&gt;Fare&lt;/strong&gt;. Let's look at the code example below - &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 

&lt;p&gt;Looking at the &lt;strong&gt;Age&lt;/strong&gt; feature, the box plot in &lt;strong&gt;Line 1&lt;/strong&gt; above outputs the below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fthb6ofvemjr644jv3zza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fthb6ofvemjr644jv3zza.png" alt="Alt Text" width="698" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above is a box plot, sometimes called a &lt;strong&gt;box and whisker&lt;/strong&gt; plot. It is a good way to visualize &lt;strong&gt;the spread of the Age variable&lt;/strong&gt;. Some interesting things to note from the plot are -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It seems most people that boarded the Titanic are between the ages of 20 and 40.&lt;/li&gt;
&lt;li&gt;The protruding lines on both sides shaped like T are the &lt;strong&gt;whiskers&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The line that divides the blue box is actually the &lt;strong&gt;Median&lt;/strong&gt; of the dataset - So by looking at the plot, it seems the median Age is close to 30.&lt;/li&gt;
&lt;li&gt;The range of this variable is the largest value in the dataset (&lt;strong&gt;the rightmost point of the whisker&lt;/strong&gt;) minus the lowest value in the dataset (&lt;strong&gt;the leftmost whisker point&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Q1, Q2, and Q3 are the 1st, 2nd, and 3rd quartiles of the &lt;em&gt;Age&lt;/em&gt; variable - note that &lt;strong&gt;Q1 is the same as the 25th percentile&lt;/strong&gt;, &lt;strong&gt;Q2 is the Median of the dataset, which is also called the 50th percentile&lt;/strong&gt;, while &lt;strong&gt;Q3 is the same as the 75th percentile&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The IQR (interquartile range) is &lt;strong&gt;b - a&lt;/strong&gt;, which is the difference between Q3 and Q1.&lt;/li&gt;
&lt;li&gt;Any points you see after the whiskers of the box plot above are referred to as &lt;strong&gt;outliers&lt;/strong&gt;. Outliers are data points that are extremely high or low, which makes them far away from other data points. Kindly note that not all outliers are bad data - they may have been entered correctly and simply be genuine extreme values. Hence, there is a need to investigate the reason behind any outliers before actually going ahead to handle them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Essentially, the box plot can be seen as the visual representation of the summary statistics table shown in &lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3"&gt;part b&lt;/a&gt;, as it graphically shows most of the statistics.&lt;/p&gt;
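&lt;p&gt;The statistics the box plot visualizes can also be computed directly with Python's statistics module - a sketch with invented ages standing in for the Age column, using the conventional 1.5 * IQR whisker rule:&lt;/p&gt;

```python
from statistics import quantiles

# Invented ages standing in for the Age column.
ages = [4, 19, 22, 24, 28, 29, 30, 33, 38, 41, 66, 80]

# Cut points at the 25th, 50th and 75th percentiles (Q1, Q2, Q3).
q1, q2, q3 = quantiles(ages, n=4)
iqr = q3 - q1                                   # the "b - a" interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits

# Points beyond the whiskers are flagged as outliers.
outliers = [a for a in ages if a > upper or lower > a]

print(q1, q2, q3, iqr)
print(outliers)
```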

&lt;p&gt;Moving on, the distribution plots for the &lt;strong&gt;Age&lt;/strong&gt; and &lt;strong&gt;Fare&lt;/strong&gt; features, produced by Lines 2 and 5 above, can be seen below - &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6d7vh7apsuio1r0hyy0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6d7vh7apsuio1r0hyy0q.png" alt="Alt Text" width="784" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fauy2usj9u2qrktqrc3r4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fauy2usj9u2qrktqrc3r4.png" alt="Alt Text" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From above, the &lt;strong&gt;Age&lt;/strong&gt; feature distribution is roughly symmetric: the left side of the distribution is approximately a mirror image of the right side. So, it has a &lt;strong&gt;roughly normal distribution&lt;/strong&gt;, which means far more data points lie near the mean than far from it - i.e &lt;strong&gt;the area under the curve is roughly equally distributed on either side of the centre line&lt;/strong&gt;.&lt;br&gt;
The &lt;strong&gt;Fare&lt;/strong&gt; feature, on the other hand, is &lt;strong&gt;skewed to the right&lt;/strong&gt;. We can see that the &lt;strong&gt;tail of the distribution&lt;/strong&gt; is on the right-hand side, hence this is called a &lt;strong&gt;right-skewed distribution&lt;/strong&gt; or a &lt;strong&gt;positively skewed distribution&lt;/strong&gt;. Here, &lt;strong&gt;the bulk of the area under the curve sits on the left side&lt;/strong&gt;. The &lt;strong&gt;tail of the Fare distribution&lt;/strong&gt; above represents outliers - now do you see how skewed distributions and outliers are related? Also, from the distribution plot for the &lt;strong&gt;Fare&lt;/strong&gt; variable, &lt;strong&gt;it seems a lot of people went for the cheaper tickets&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Understanding these distributions also plays a role in efficiently handling missing data. For example, when there are outliers, using the mean to fill in missing values is not ideal because outliers skew the mean, so we may want to go for the median statistic instead.&lt;/p&gt;
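&lt;p&gt;That point is easy to check: a single extreme value drags the mean far more than the median (the fares below are invented for illustration):&lt;/p&gt;

```python
from statistics import mean, median

fares = [7.25, 8.05, 8.46, 13.0, 15.5, 512.33]  # one extreme outlier

print(round(mean(fares), 2))    # pulled way up by the 512.33 outlier
print(round(median(fares), 2))  # barely affected, so safer for imputation
```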

&lt;p&gt;Sure hope you enjoyed this piece. Stay tuned for the next part of this series as we go ahead to explore feature relationships to squeeze out more insights from our data! 😉&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datavisualization</category>
      <category>eventdriven</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The AI Alpha Geek: It starts with EDA! - Part B</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Wed, 30 Sep 2020 20:07:43 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3</link>
      <guid>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3</guid>
      <description>&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i"&gt;The AI Alpha Geek: It starts with EDA! - Part A&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Before we start exploring each individual feature, let's take a look at some statistics for the dataset produced by &lt;code&gt;train_df.drop('PassengerId', axis=1).describe()&lt;/code&gt; below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffl0lh2knsvkz5h6ym6io.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffl0lh2knsvkz5h6ym6io.png" alt="Summary Stats" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the summary statistics above, looking at the &lt;strong&gt;Age&lt;/strong&gt; feature for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;count&lt;/strong&gt; is 714, which tells us there are 177 missing entries since the total entries are 891 - we would need to deal with this later on when handling missing values,&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;mean&lt;/strong&gt; age is 29.699, which is the average (i.e typical) age of the passengers aboard,&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;std (standard deviation)&lt;/strong&gt; of 14.526 tells us that most of the passengers fall roughly in the age range (29.699 - 14.526) to (29.699 + 14.526), i.e about 15 to 44 years,&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;min&lt;/strong&gt; age is 0.42, which tells us the youngest passenger on board was a baby (0.42 years is about 5 months),&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;25th percentile&lt;/strong&gt; of 20.125 years shows that 25% of passengers are younger than 20.125 years, &lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;50th percentile&lt;/strong&gt;, which is the median of 28 years, tells us that half of the passengers onboard are below 28 years old - it seems most of the passengers were young,&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;75th percentile&lt;/strong&gt;, which is 38, tells us that 75% of the passengers are less than 38 years old, and&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;max&lt;/strong&gt; age is 80 years, which is the age of the eldest passenger onboard - luckily, it seems there were no aliens onboard.&lt;/li&gt;
&lt;/ul&gt;
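&lt;p&gt;To see how describe() treats missing entries, here is a hand-rolled sketch of its count and mean fields (the ages below are invented; None marks a missing Age):&lt;/p&gt;

```python
from statistics import mean, stdev

# Invented ages; None plays the role of NaN in the real Age column.
ages = [22.0, None, 38.0, 26.0, None, 35.0, 29.0, 0.42, None, 80.0]

present = [a for a in ages if a is not None]
missing = len(ages) - len(present)   # describe() simply excludes these

print(len(present), missing)                         # count and missing
print(round(mean(present), 2), round(stdev(present), 2))  # mean and std
```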

&lt;p&gt;Now, it's time for some &lt;strong&gt;univariate analysis&lt;/strong&gt; - this is just descriptive analysis of one variable at a time, which helps us understand the data distribution for that variable and even detect outliers. Let's start with the &lt;strong&gt;categorical variables&lt;/strong&gt; - &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;From the code example above, take a look at the output for the target variable, &lt;strong&gt;Survived&lt;/strong&gt;, below -&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4nf3mhcbo8nwgn5cbc0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4nf3mhcbo8nwgn5cbc0o.png" alt="Output Example" width="800" height="762"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;value_counts()&lt;/strong&gt; is used to get the counts of unique values for this column - and it seems a lot more people did not survive. Note that the dataset is not perfectly balanced, but the imbalance is not severe - the number of those who didn't survive is not overwhelmingly larger than the number who survived.&lt;/li&gt;
&lt;li&gt;to get the percentages of each class (i.e survived - 1 and deceased - 0), set the &lt;strong&gt;normalize&lt;/strong&gt; parameter of value_counts() to True.&lt;/li&gt;
&lt;li&gt;to have a better view of the count for each class, we use &lt;strong&gt;count plot&lt;/strong&gt; via Seaborn. The &lt;strong&gt;label_chart()&lt;/strong&gt; is just a helper function to label the chart.&lt;/li&gt;
&lt;/ul&gt;
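&lt;p&gt;What value_counts() and its normalize option compute can be sketched without pandas (the 0/1 labels mirror the Survived column; the sample below is invented):&lt;/p&gt;

```python
from collections import Counter

survived = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]   # toy stand-in for the column

counts = Counter(survived)
total = len(survived)
proportions = {label: n / total for label, n in counts.items()}

print(counts)        # raw counts, like value_counts()
print(proportions)   # shares, like value_counts(normalize=True)
```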

&lt;p&gt;Let's see some insights gathered from the code output from &lt;strong&gt;eda_part_b.py&lt;/strong&gt; above - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the &lt;strong&gt;Pclass&lt;/strong&gt; feature, it seems a lot more people on board were in class 3. From &lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i"&gt;Part A&lt;/a&gt; of this series, we saw that these are people in the lower socio-economic class, which seems to mean most onboard got the cheap ticket,&lt;/li&gt;
&lt;li&gt;Looking at the &lt;strong&gt;Sex&lt;/strong&gt; feature, it seems more males boarded, as 64.76% of passengers are male,&lt;/li&gt;
&lt;li&gt;Most passengers boarded from the Southampton port, and it seems most passengers came alone since most have 0 siblings and/or travelled with just a nanny.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, all these give us more insights to explore further - stay tuned for the next parts of this series, where we go ahead to explore individual numerical variables for patterns. Wish you an awesome October!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datavisualization</category>
      <category>eventdriven</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The AI Alpha Geek: It starts with EDA! - Part A</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Mon, 31 Aug 2020 17:54:08 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i</link>
      <guid>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i</guid>
      <description>&lt;p&gt;...so just give me all your data and I will quickly use one of those boosting ensemble libraries like XGBoost or LightGBM or probably Catboost and then do some stacking and give you a very high performing model! That's how we roll! -  hmmm, You Wish! 😆&lt;/p&gt;

&lt;p&gt;If you really want to be a &lt;del&gt;good&lt;/del&gt; great AI engineer, you need to first understand your data and build intuition about your data and this is where &lt;strong&gt;Exploratory Data Analysis&lt;/strong&gt; (EDA) comes in!&lt;/p&gt;

&lt;p&gt;The ability to dig into data and derive trends or patterns or relationships is a superpower! 😊. EDA helps you get a better understanding of your data, validate your hypothesis,  derive insights and new features in order to get the best performing model.&lt;/p&gt;

&lt;p&gt;We will go through an example of how EDA is performed using the &lt;a href="https://www.kaggle.com/c/titanic/data" rel="noopener noreferrer"&gt;Titanic Dataset&lt;/a&gt; - &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;So, from &lt;strong&gt;lines 1 to 11&lt;/strong&gt;, some Python packages are imported; and then some display options and styling are set up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For &lt;em&gt;line 2 through 5&lt;/em&gt;, pandas and NumPy for data analysis and numerical computation respectively are imported. Also, we have Matplotlib and Seaborn for data visualization. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Lines 7 and 8&lt;/em&gt; help us see all columns and rows when using the &lt;em&gt;head()&lt;/em&gt; method while &lt;em&gt;Line 9&lt;/em&gt; controls the width of the display in characters.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Line 11&lt;/em&gt; just customizes whatever plots we are going to create. It sets the aesthetic style of our plots.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, from &lt;strong&gt;lines 14 - 20&lt;/strong&gt;, the data is read in and the process of data exploration begins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Line 15&lt;/em&gt; reads in the data while &lt;em&gt;Line 16&lt;/em&gt; gives the number of records (rows) and features (columns) in our dataset.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Line 17&lt;/em&gt; returns the top 10 rows in the data.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Lines 18 through 20&lt;/em&gt; list the features, show their data types, and give a concise summary of our dataset, respectively.&lt;/li&gt;
&lt;/ul&gt;
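&lt;p&gt;A runnable sketch of the setup described above - the display options are real pandas calls, while the tiny inline frame is an assumption standing in for reading train.csv:&lt;/p&gt;

```python
import pandas as pd

# Display options like the ones the article describes: show every
# column and row, and widen the console output.
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", 120)

# A tiny invented frame stands in for pd.read_csv("train.csv").
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

print(train_df.shape)       # (rows, columns), like Line 16
print(train_df.head(10))    # top rows, like Line 17
print(train_df.dtypes)      # feature data types
train_df.info()             # concise summary of the frame
```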

&lt;p&gt;Now, at this point, we really need to understand what each feature actually represents because it will determine the &lt;strong&gt;data type&lt;/strong&gt;, which informs &lt;strong&gt;how that feature is preprocessed&lt;/strong&gt; - and feature preprocessing plays a role in deciding the model to be used in order to gain optimal performance. Also, understanding each feature helps &lt;strong&gt;feature generation&lt;/strong&gt;. You see how all these tie in now, right?&lt;/p&gt;

&lt;p&gt;So, we have to do some research (in this case, get some domain knowledge about ship transport as regards the Titanic) to know what each feature entails. Here, we explain some features and see that -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PassengerId&lt;/strong&gt; - is the unique id that identifies a passenger &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Survived&lt;/strong&gt; - tells if the passenger survived or not. 0 is for deceased (No) while 1 is for survived (Yes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pclass&lt;/strong&gt; - is the passenger's class, which can be 1, 2 or 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SibSp&lt;/strong&gt; - gives the number of siblings and spouses aboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parch&lt;/strong&gt; - tells the number of parents and children aboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cabin&lt;/strong&gt; - is the cabin number the passenger is in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embarked&lt;/strong&gt; - gives the port of embarkation/boarding/departure which can be C (Cherbourg) or Q (Queenstown) or S (Southampton).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's dive into the top 10 rows produced by &lt;strong&gt;train_df.head(10)&lt;/strong&gt; below - &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F39ivki39oqr5ljr67kbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F39ivki39oqr5ljr67kbi.png" alt="Alt Text" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From above, we have mainly Categorical and Numerical features in our dataset - &lt;strong&gt;Categorical features&lt;/strong&gt;, also called &lt;strong&gt;qualitative features&lt;/strong&gt;, are features that can take a limited number of possible values. They can even take on numerical values, but you cannot perform math operations on them because those values have no mathematical meaning. There is, however, a kind of categorical feature whose values are ordered meaningfully, called an &lt;strong&gt;ordinal feature&lt;/strong&gt;. The categorical features in our titanic dataset are &lt;em&gt;Survived&lt;/em&gt;, &lt;em&gt;Name&lt;/em&gt;, &lt;em&gt;Ticket&lt;/em&gt;, &lt;em&gt;SibSp&lt;/em&gt;, &lt;em&gt;Parch&lt;/em&gt;, &lt;em&gt;Sex&lt;/em&gt;, &lt;em&gt;Cabin&lt;/em&gt;, &lt;em&gt;Embarked&lt;/em&gt; and &lt;em&gt;Pclass&lt;/em&gt; - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Survived&lt;/strong&gt; and &lt;strong&gt;Sex&lt;/strong&gt; are binary categorical features,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name&lt;/strong&gt; is a categorical feature which has text values,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embarked&lt;/strong&gt;, &lt;strong&gt;SibSp&lt;/strong&gt; and &lt;strong&gt;Parch&lt;/strong&gt; are categorical features with more than 2 values,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ticket&lt;/strong&gt; is a categorical feature with a mix of numeric and alphanumeric values. This feature might actually mean something: from my little research, I think it can be used to find potential family members or nannies for each passenger - which could perhaps become a newly generated feature, &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cabin&lt;/strong&gt; is a categorical feature that is alphanumeric,&lt;/li&gt;
&lt;li&gt;Although &lt;strong&gt;Pclass&lt;/strong&gt; has a numeric datatype, it is actually an &lt;strong&gt;ordered categorical feature&lt;/strong&gt; i.e an ordinal feature which is ordered in a meaningful way - 1 is for 1st class; 2 is for 2nd class and 3 is for 3rd class. During your feature description research, you would learn that these classes reflect the passenger's socio-economic status.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Numerical features, on the other hand, also referred to as &lt;strong&gt;quantitative features&lt;/strong&gt;, are features that have meaning in terms of measurement (continuous data) or counting (discrete data). The numerical features are &lt;em&gt;Fare&lt;/em&gt; (continuous), &lt;em&gt;Age&lt;/em&gt; (continuous), and &lt;em&gt;PassengerId&lt;/em&gt; (actually just an ID feature that identifies each passenger).&lt;/p&gt;
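&lt;p&gt;This split can be double-checked quickly with pandas. Below is a minimal, self-contained sketch that rebuilds just the first 2 rows of the training data with a subset of the columns as a stand-in for the full &lt;strong&gt;train_df&lt;/strong&gt;:&lt;/p&gt;

```python
import pandas as pd

# Stand-in for train_df: the first 2 rows of the training data, subset of columns
train_df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.2833],
})

# Text columns are stored as 'object'; numeric columns as int/float
categorical_cols = train_df.select_dtypes(include="object").columns.tolist()
numerical_cols = train_df.select_dtypes(include="number").columns.tolist()

print(categorical_cols)  # ['Name', 'Sex']
print(numerical_cols)    # ['PassengerId', 'Survived', 'Pclass', 'Age', 'Fare']
```

&lt;p&gt;Note that &lt;em&gt;Survived&lt;/em&gt; and &lt;em&gt;Pclass&lt;/em&gt; land on the numeric side here purely because of how they are stored - their meaning, as discussed above, is still categorical.&lt;/p&gt;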

&lt;p&gt;Now, let's take a look at the concise summary of the dataset produced by &lt;strong&gt;train_df.info()&lt;/strong&gt; below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh0ct27qc34yc6jeouxil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh0ct27qc34yc6jeouxil.png" alt="Alt Text" width="674" height="584"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the summary of our dataset above, getting some kind of domain knowledge via research on the data would help us in a number of ways -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;to know if the feature has the correct data type, which, when rightly converted, helps in feature generation and also helps in saving memory. For example, from above, the output tells us that all categorical types are stored as &lt;strong&gt;object&lt;/strong&gt; data types. So, converting some of these to &lt;strong&gt;categorical&lt;/strong&gt; data types would help save some more memory.&lt;/li&gt;
&lt;li&gt;to understand why there are missing values - from above, the &lt;em&gt;Age&lt;/em&gt;, &lt;em&gt;Cabin&lt;/em&gt; and &lt;em&gt;Embarked&lt;/em&gt; features have 714, 204, and 889 entries respectively, short of the expected 891 non-null entries - 🤔 so, we can try to find out why this is the case - Yeah! have a curious mindset &lt;/li&gt;
&lt;li&gt;to know if the values of a feature are intuitive and actually contain the expected values. For example, we all know, at least in these times (unlike in the days of old), that humans hardly live up to 200 years. So, &lt;strong&gt;when we do further data exploration in part B of this article series&lt;/strong&gt; and see a passenger aged above 200 - it is one of 2 things - &lt;strong&gt;either it is an error or perhaps the titanic had some vampires or aliens onboard!&lt;/strong&gt; 😂 - do not rule anything out - anything is possible, so do your research and be sure!&lt;/li&gt;
&lt;/ul&gt;
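&lt;p&gt;The first two points above can be sketched with pandas - an illustrative example on a tiny made-up frame (not the real &lt;strong&gt;train_df&lt;/strong&gt;), converting an &lt;strong&gt;object&lt;/strong&gt; column to &lt;strong&gt;category&lt;/strong&gt; and counting missing entries:&lt;/p&gt;

```python
import pandas as pd

# Made-up frame: 1,000 rows, 2 text columns stored as 'object'
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"] * 250,
    "Embarked": ["S", "C", "Q", None] * 250,
})

# 'category' stores small integer codes instead of repeated Python strings
before = df["Sex"].memory_usage(deep=True)
df["Sex"] = df["Sex"].astype("category")
after = df["Sex"].memory_usage(deep=True)
print(before > after)  # True - the converted column uses less memory

# Count missing entries per column, like info() hinted at
print(df.isnull().sum()["Embarked"])  # 250
```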

&lt;p&gt;So, this is just a tiny little bit of the process involved when you are starting out the exploration of your datasets! Stay tuned for the next parts on this topic, in this same series, where we look at what statisticians call the &lt;strong&gt;Five-number summary&lt;/strong&gt;, then we will start &lt;strong&gt;analyzing each individual feature&lt;/strong&gt; and then go ahead to understand feature relationships and more in order to squeeze out all the insights from our data! Now, you make sure you have an amazing week ahead! 😉&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datavisualization</category>
      <category>eventdriven</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Math ML Maestro: Introducing Linear Algebra Applications</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Fri, 31 Jul 2020 16:55:21 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-math-ml-maestro-introducing-linear-algebra-applications-1imb</link>
      <guid>https://dev.to/joyadauche/the-math-ml-maestro-introducing-linear-algebra-applications-1imb</guid>
<description>&lt;p&gt;Do you remember back then during school days when you did some vector or matrix operations? Or did some probability and statistics? Or performed some differentiation and integration and then got confused about what the math question you had just finished solving was really asking 😆? Haha! You know right?&lt;/p&gt;

&lt;p&gt;Ermm! hope you were pretty attentive during those Math classes 👀 because when we do Machine Learning, &lt;strong&gt;we are essentially solving a Math problem&lt;/strong&gt;, and one of the Math topics with enormous applications in Machine learning is &lt;strong&gt;Linear Algebra&lt;/strong&gt;!!!&lt;/p&gt;

&lt;p&gt;Linear Algebra plays a huge role in machine learning! Back then, we modelled real-world problems into systems of linear equations and solved these equations via substitution, via elimination or via the graphing method. Take a look at a linear system with 2 unknowns below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;A system is linear when its variables, e.g x and y above, are raised to a power or exponent of 1 i.e they are first-degree variables. In &lt;em&gt;1.md&lt;/em&gt; above, there are just 2 unknowns, so it can easily be solved via the substitution, elimination or graphing methods. So, Yeah! we can solve that with pen and paper, but machines can't solve it in that form! Just imagine each equation as a row or observation in a dataset, where the right side of the equation is the target (or response or dependent variable) and the left side represents the features (or predictors or independent/input variables). Then, think of a system of 1000 equations with 1000 unknowns 🤔 - it would be time- and energy-consuming to solve these using the methods mentioned earlier. This challenge makes room for &lt;strong&gt;Matrices&lt;/strong&gt; - a key data structure in linear algebra that helps handle much more data than just the 2 observations in &lt;em&gt;1.md&lt;/em&gt; above. To represent the above equations in matrix form, see below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fegfoopv4x935lsfg5ohj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fegfoopv4x935lsfg5ohj.jpg" alt="Alt Text" width="395" height="136"&gt;&lt;/a&gt;&lt;/p&gt;
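&lt;p&gt;Once the system is in matrix form, a machine can solve it directly. A small sketch with NumPy, using a made-up 2-unknown system (not the one in &lt;em&gt;1.md&lt;/em&gt;):&lt;/p&gt;

```python
import numpy as np

# Hypothetical 2-unknown linear system (for illustration only):
#   2x + 3y = 8
#   1x - 1y = -1
A = np.array([[2.0, 3.0],
              [1.0, -1.0]])  # coefficient matrix (the "features" side)
b = np.array([8.0, -1.0])    # right-hand side (the "targets")

# The machine-friendly equivalent of substitution/elimination
x = np.linalg.solve(A, b)
print(x)  # [1. 2.]  i.e x = 1, y = 2
```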

&lt;p&gt;There are numerous examples of Linear Algebra in Machine Learning. To do Machine learning (ML), data needs to be in such a way that ML models or ML algorithms can ingest it - Most ML models take numeric inputs - Datasets usually contain a number of observations (rows) characterized by features (columns). The Dataset is a &lt;strong&gt;matrix&lt;/strong&gt; with each column representing a &lt;strong&gt;vector&lt;/strong&gt; - a collection of vectors is a matrix. Matrix Operations, such as addition, subtraction, multiplication, transpose etc are applied in Machine Learning! &lt;/p&gt;

&lt;p&gt;Also, in this light, for an image classification problem, the inputs to the neural network are &lt;strong&gt;tensors&lt;/strong&gt;. For a classic Artificial Neural Network (ANN), a set of grayscale images is a 3D tensor, where the 1st dimension is the index of the image and the other 2 dimensions are the dimensions of the array that contains the image pixels. To feed these images into an ANN, this 3D tensor is flattened into a 2D tensor, where the 1st dimension tells which row an image corresponds to and the 2nd dimension is a single vector that contains the pixels of that image. Images are represented as tensors so computers can process them and guess what? &lt;strong&gt;A tensor is just a generalization of vectors and matrices to potentially higher dimensions&lt;/strong&gt;!&lt;/p&gt;
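&lt;p&gt;That flattening step is a one-liner with NumPy - a sketch with made-up all-zero "images":&lt;/p&gt;

```python
import numpy as np

# 5 grayscale images of 28x28 pixels: a 3D tensor (index, height, width)
images = np.zeros((5, 28, 28))

# Flatten each image into a single pixel vector: a 2D tensor (index, pixels)
flat = images.reshape(5, 28 * 28)
print(flat.shape)  # (5, 784)
```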

&lt;p&gt;Furthermore, depending on the prediction task, when we do some data preprocessing using the popular one-hot encoding technique to convert categorical variables, we get binary vectors, which are a better data format for ML algorithms to be trained on in order to give better predictions.&lt;/p&gt;
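&lt;p&gt;For example, one-hot encoding an &lt;em&gt;Embarked&lt;/em&gt;-like column with pandas (illustrative values only):&lt;/p&gt;

```python
import pandas as pd

ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Each category becomes its own binary vector column
one_hot = pd.get_dummies(ports["Embarked"], prefix="Embarked")
print(one_hot.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```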

&lt;p&gt;So, Linear Algebra has lots and lots of applications. See some examples below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Want to implement &lt;strong&gt;Principal Component Analysis (PCA)&lt;/strong&gt;, which is a popular dimensionality reduction technique to avert the curse of dimensionality? Linear Algebra is used here!&lt;/li&gt;
&lt;li&gt;What about &lt;strong&gt;Word Embeddings&lt;/strong&gt; which is in the field of Natural Language Processing (NLP) that seems like the hottest field in Machine Learning right now? Linear Algebra is applied here too! 😎&lt;/li&gt;
&lt;li&gt;Did you just say &lt;strong&gt;Optimizing Deep Learning Models&lt;/strong&gt;? still Linear Algebra!&lt;/li&gt;
&lt;li&gt;Even in &lt;strong&gt;Computer Vision&lt;/strong&gt;, &lt;strong&gt;Encoding Data&lt;/strong&gt; as I briefly explained above and lots more that I have not even mentioned, Linear Algebra shows itself! Yeah! We are so stuck with it!  😂&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now! Will you keep running away from Math or will you rather just embrace it whole-heartedly? 😄&lt;/p&gt;

&lt;p&gt;So, I just shed some light on how Linear Algebra is used in Machine Learning. Stay tuned on this series where we start dissecting one application at a time, going in a little deeper, starting with &lt;strong&gt;The Maths behind Linear Regression&lt;/strong&gt; in order to understand the ins and outs of it. Have an amazing week ahead!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>maths</category>
      <category>datascience</category>
      <category>linearalgebra</category>
    </item>
    <item>
      <title>The Big Data Bravura: Introducing Apache Spark</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Tue, 30 Jun 2020 18:55:10 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-big-data-bravura-introducing-apache-spark-2od</link>
      <guid>https://dev.to/joyadauche/the-big-data-bravura-introducing-apache-spark-2od</guid>
      <description>&lt;p&gt;Did you just say you need to handle a minimum of 100 TB of data (&lt;strong&gt;volume&lt;/strong&gt;) that is generated at high speed (&lt;strong&gt;velocity&lt;/strong&gt;) from different sources consisting of structured data like CSV’s, semistructured data like log files and unstructured data like video files (&lt;strong&gt;variety&lt;/strong&gt;) that are also trustworthy and representative (&lt;strong&gt;veracity&lt;/strong&gt;) and can give insights that can lead to groundbreaking discoveries and reduce costs (&lt;strong&gt;value&lt;/strong&gt;)? 😲 Good Gracious! This is Big Data!&lt;/p&gt;

&lt;p&gt;We would need a cluster of machines and not just a single machine to process big data, and this is where Spark comes into play. With Spark, you can distribute data and its computation among the nodes in a cluster - with each node holding a subset of the data, data processing is done in parallel over the nodes. Spark does all this in memory, which makes it lightning fast!!!⚡&lt;/p&gt;

&lt;p&gt;Spark is made up of several components. One of them is the &lt;strong&gt;Spark Core&lt;/strong&gt;, which is the heart of Apache Spark and the basis for the other components. The &lt;em&gt;Spark Core&lt;/em&gt; uses a data structure called the &lt;strong&gt;R&lt;/strong&gt;esilient &lt;strong&gt;D&lt;/strong&gt;istributed &lt;strong&gt;D&lt;/strong&gt;ataset (RDD), which is the fundamental data structure in Apache Spark. &lt;/p&gt;

&lt;p&gt;To develop Spark solutions, we can use Scala, Python, Java or R. Here, I would develop an introductory Spark application using Python via &lt;a href="https://spark.apache.org/docs/latest/api/python/index.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Pyspark&lt;/strong&gt;&lt;/a&gt;, which is the Python API for Apache Spark. We would take a look at an introductory example using an RDD - The &lt;em&gt;.csv&lt;/em&gt; file used in this example is &lt;a href="https://gist.github.com/joyadauche/a4d1b03cafd224ae2644f26f19ede126#file-student_subject-csv" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Without further ado, as Spark does it ⚡, let's jump right in below -&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;From &lt;em&gt;1.py&lt;/em&gt; above - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we compute the average number of subjects by class. In the 
&lt;a href="https://gist.github.com/joyadauche/a4d1b03cafd224ae2644f26f19ede126#file-student_subject-csv" rel="noopener noreferrer"&gt;&lt;strong&gt;student_subject.csv&lt;/strong&gt;&lt;/a&gt;, an example entry 
&lt;strong&gt;s400&lt;/strong&gt;,&lt;strong&gt;c204&lt;/strong&gt;,&lt;strong&gt;10&lt;/strong&gt; represents the &lt;strong&gt;student_id&lt;/strong&gt;, 
&lt;strong&gt;class_id&lt;/strong&gt; and &lt;strong&gt;number of subjects&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lines 1 to 4&lt;/strong&gt; - we import the needed pyspark classes 
and setup the configuration and use it to instantiate the 
SparkContext class. &lt;strong&gt;local&lt;/strong&gt; creates a local cluster with 
only 1 core on your local machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Line 14&lt;/strong&gt; - we read in the csv file, which now 
becomes an RDD (&lt;strong&gt;student_subject_rdd&lt;/strong&gt;), where every line 
entry is a value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Line 16&lt;/strong&gt; - we transform the &lt;strong&gt;student_subject_rdd&lt;/strong&gt; 
into an RDD of key-value pairs of &lt;strong&gt;class_id&lt;/strong&gt; and 
&lt;strong&gt;number_of_subjects&lt;/strong&gt; e.g &lt;strong&gt;('c204', 10)&lt;/strong&gt;.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lines 7 to 11&lt;/strong&gt; - the &lt;strong&gt;get_class_and_subject&lt;/strong&gt; 
function, splits each line entry by a comma, gets the 
needed fields by indexing and then returns them.&lt;/li&gt;
&lt;li&gt;Note that on &lt;strong&gt;Line 10&lt;/strong&gt;, the &lt;strong&gt;number_of_subjects&lt;/strong&gt; is 
cast explicitly into an int.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Lines 18 to 19&lt;/strong&gt; - transforms the RDD further by adding 
1 as part of the values. One key difference between &lt;strong&gt;map&lt;/strong&gt; 
and &lt;strong&gt;mapValues&lt;/strong&gt; is that with &lt;strong&gt;mapValues&lt;/strong&gt;, the keys 
cannot be modified, so it is not even passed in. i.e key- 
value pairs of &lt;strong&gt;('c204', 10)&lt;/strong&gt; passes in just &lt;strong&gt;10&lt;/strong&gt;. So, 
&lt;strong&gt;modified_class_subject_rdd&lt;/strong&gt; will contain something like 
&lt;strong&gt;('c204', (10, 1))&lt;/strong&gt;, where &lt;strong&gt;c204&lt;/strong&gt; is the key and &lt;strong&gt;(10,1)&lt;/strong&gt; is the value.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Lines 21 to 23&lt;/strong&gt; - &lt;strong&gt;reduceByKey&lt;/strong&gt; combines items 
together for the same key.

&lt;ul&gt;
&lt;li&gt;remember &lt;strong&gt;modified_class_subject_rdd&lt;/strong&gt;, can return 
multiple items for the same key. e.g &lt;strong&gt;[('c204', 
(10, 1)), ('c204', (8, 1)), ('c204', (7, 1)), ('c204', 
(6, 1)), ('c204', (7, 1))]&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;with the reduceByKey transformation, it becomes &lt;strong&gt;[('c204', (38, 
5))]&lt;/strong&gt;, where &lt;strong&gt;c204&lt;/strong&gt; is the key and &lt;strong&gt;(38, 
5)&lt;/strong&gt; is the value representing &lt;strong&gt;sum total of subjects 
done&lt;/strong&gt; and &lt;strong&gt;frequency count&lt;/strong&gt; respectively for 
class_id &lt;strong&gt;c204&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Lines 25 to 26&lt;/strong&gt; - compute the average by class while 
&lt;strong&gt;Lines 28 to 32&lt;/strong&gt; produce an array and print the 
results.&lt;/li&gt;

&lt;/ul&gt;
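&lt;p&gt;To see the data flow the bullets above describe without needing a cluster, here is a plain-Python sketch of the same average-by-key logic - an illustrative rewrite that mirrors &lt;em&gt;1.py&lt;/em&gt;'s steps, not the gist itself:&lt;/p&gt;

```python
from collections import defaultdict

# Tiny stand-in for student_subject.csv: student_id,class_id,number_of_subjects
lines = ["s400,c204,10", "s401,c204,8", "s402,c205,7"]

# map step: split each line and keep (class_id, number_of_subjects)
pairs = []
for line in lines:
    fields = line.split(",")
    pairs.append((fields[1], int(fields[2])))  # e.g ('c204', 10)

# mapValues step: attach a count of 1 to each value -> ('c204', (10, 1))
with_counts = [(k, (v, 1)) for k, v in pairs]

# reduceByKey step: sum totals and counts per key -> ('c204', (18, 2))
totals = defaultdict(lambda: (0, 0))
for k, (subjects, count) in with_counts:
    t, c = totals[k]
    totals[k] = (t + subjects, c + count)

# final mapValues step: total / count gives the average per class
averages = {k: total / count for k, (total, count) in totals.items()}
print(averages)  # {'c204': 9.0, 'c205': 7.0}
```

&lt;p&gt;In the real Spark version, each of these loops would be a &lt;strong&gt;map&lt;/strong&gt;, &lt;strong&gt;mapValues&lt;/strong&gt; or &lt;strong&gt;reduceByKey&lt;/strong&gt; call distributed over the nodes of the cluster.&lt;/p&gt;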

&lt;p&gt;Awesome! so from above, we cooked up an example Spark application using the RDD. Methods called off an RDD can either be a &lt;strong&gt;transformation&lt;/strong&gt; or an &lt;strong&gt;action&lt;/strong&gt;. A transformation like &lt;strong&gt;mapValues&lt;/strong&gt; just produces another RDD while an action like &lt;strong&gt;collect&lt;/strong&gt; produces a result. Essentially, &lt;strong&gt;transformations&lt;/strong&gt; on an RDD are only executed when an &lt;strong&gt;action&lt;/strong&gt; is called. This concept of &lt;strong&gt;Lazy Evaluation&lt;/strong&gt; increases speed since execution will not start until an &lt;strong&gt;action&lt;/strong&gt; is triggered.&lt;/p&gt;

&lt;p&gt;Spark is amazing and I know you are being Sparked up in becoming a Big Data Bravura. Stay tuned on this series for my next article on &lt;strong&gt;Introducing Spark Dataframes&lt;/strong&gt;, which is a data structure built off the RDD and is much easier to use than the core RDD data structure. Have an amazing Sparked up Week! 😉&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>spark</category>
      <category>pyspark</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The ML Maven: Introducing the Confusion Matrix</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Sun, 31 May 2020 19:30:26 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-ml-maven-introducing-the-confusion-matrix-1de7</link>
      <guid>https://dev.to/joyadauche/the-ml-maven-introducing-the-confusion-matrix-1de7</guid>
      <description>&lt;p&gt;&lt;strong&gt;Random Friend&lt;/strong&gt;: OMG! you won’t believe this - I got a high accuracy value of 88%!!!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me&lt;/strong&gt;: oh really? Sounds interesting!!! From the said metric, it seems to be a classification problem. So, what specific problem is your classifier trying to solve?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random Friend&lt;/strong&gt;: so, my model helps predict whether we are going to have an earthquake or not in Wakanda using data sourced from the Government of Wakanda’s website on earthquake happenings over a period of time. Here are some visual EDA on the training dataset. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me&lt;/strong&gt;: Wawuu! Can I take a look at your confusion matrix?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random Friend&lt;/strong&gt;: Does that really matter? Besides I think I got a high accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me&lt;/strong&gt;: Smiles! It definitely matters especially with the fact that it looks like you are dealing with imbalanced classes as shown by the visualization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random Friend&lt;/strong&gt;: oh I see! Then, let me quickly generate that for you… Here it is below -&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
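&lt;p&gt;A confusion matrix like this can be generated with scikit-learn's &lt;strong&gt;confusion_matrix&lt;/strong&gt; - a minimal sketch on toy labels (not the actual Wakanda predictions):&lt;/p&gt;

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = NO earthquake, 1 = YES earthquake (illustrative only)
y_true = [0, 0, 1, 1, 1, 0]  # what actually happened
y_pred = [0, 1, 1, 1, 0, 0]  # what the model predicted

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]     <- TN, FP
#  [1 2]]    <- FN, TP
```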


&lt;p&gt;&lt;strong&gt;Me&lt;/strong&gt;: Alrightee! Let’s see what we have here and explain certain useful metrics below - &lt;/p&gt;

&lt;p&gt;So, having an accuracy of 88% means that your model is correct 88% of the time and incorrect 12% of the time. Well, since this sounds like a life-and-death situation, that does not seem good enough. Just imagine the number of lives that can be lost in the 12% of cases where the model makes an incorrect prediction 😨!&lt;/p&gt;

&lt;p&gt;It is a very common scenario for one class to outnumber the other - like in your case, the class of &lt;em&gt;NO earthquake occurrences&lt;/em&gt; is more frequent, i.e has more instances, than the class of &lt;em&gt;YES earthquake occurrences&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So, Accuracy is not enough to evaluate the performance of your model - hence the need for a &lt;strong&gt;Confusion Matrix&lt;/strong&gt;. It summarizes a model’s predictive performance, and we can use it to describe the performance of your model - &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Note that the Positive class is usually &lt;em&gt;1&lt;/em&gt; or a &lt;em&gt;YES&lt;/em&gt; case or &lt;em&gt;it is what we are trying to detect&lt;/em&gt;. The negative class is usually &lt;em&gt;0&lt;/em&gt; or a &lt;em&gt;NO case&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;As you can see from &lt;em&gt;1.md&lt;/em&gt;,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;True Negatives (TN)&lt;/strong&gt;: &lt;em&gt;60&lt;/em&gt; gives the number of NO earthquake occurrences &lt;em&gt;correctly predicted&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True Positives (TP)&lt;/strong&gt;: &lt;em&gt;150&lt;/em&gt; gives the number of YES earthquake occurrences &lt;em&gt;correctly predicted&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False Negatives (FN)&lt;/strong&gt;: &lt;em&gt;10&lt;/em&gt; gives the number of YES earthquake occurrences &lt;em&gt;incorrectly predicted as NO earthquake occurrences&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False Positives (FP)&lt;/strong&gt;: &lt;em&gt;20&lt;/em&gt; gives the number of NO earthquake occurrences &lt;em&gt;incorrectly predicted as YES earthquake occurrences&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;strong&gt;False Positive&lt;/strong&gt; is also known as a &lt;strong&gt;Type 1 error&lt;/strong&gt;. This is when the model predicts &lt;em&gt;that there will be an earthquake but actually there is not&lt;/em&gt;. This is a False alarm! So, this will make the people of Wakanda panic which will cause the government of Wakanda to do all that it can to save lives from the possible earthquake. This can lead to a waste of resources - Perhaps the government had to move people to a different geographic location where they will be catered for by the government.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;False Negative&lt;/strong&gt; is referred to as a &lt;strong&gt;Type 2 error&lt;/strong&gt;. This is when the model predicts &lt;em&gt;that there will be no earthquake but actually there is&lt;/em&gt;. This is catastrophic! The people of Wakanda will be chilling and suddenly an earthquake will greet them 😭!&lt;/p&gt;

&lt;p&gt;We can also calculate some useful metrics like - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt;: This is also called &lt;strong&gt;True Positive Rate&lt;/strong&gt;, or &lt;strong&gt;Sensitivity&lt;/strong&gt; or &lt;strong&gt;Hit Rate&lt;/strong&gt;. It is the probability that an actual positive would be predicted positive i.e It tells us how often an actual YES earthquake occurrence will be predicted a YES earthquake occurrence - what proportion of YES earthquake occurrences are correctly classified or predicted. From &lt;em&gt;1.md&lt;/em&gt;, it is calculated as below - 
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;High recall&lt;/em&gt; means that this model has a &lt;strong&gt;low false-negative rate&lt;/strong&gt; i.e not many actual YES earthquake occurrences were classified or predicted as NO earthquake occurrences - the classifier predicted most YES earthquake occurrences correctly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Specificity&lt;/strong&gt;: This is also called the &lt;strong&gt;True Negative Rate&lt;/strong&gt;.  It is the probability that an actual negative would be predicted negative i.e It tells us how often an actual NO earthquake occurrence will be predicted a NO earthquake occurrence - what proportion of NO earthquake occurrences are correctly classified or predicted. From &lt;em&gt;1.md&lt;/em&gt;, it is calculated as below -
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positive Predictive Value&lt;/strong&gt;: This is also called the &lt;strong&gt;Precision&lt;/strong&gt;. It is the probability that a predicted YES is correct or true i.e It tells us how often the prediction of a YES earthquake occurrence is correct. From &lt;em&gt;1.md&lt;/em&gt;, it is calculated as below -
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;High precision&lt;/em&gt; means that this model has a &lt;strong&gt;low false-positive rate&lt;/strong&gt; i.e not many actual NO earthquake occurrences were classified or predicted as YES earthquake occurrences.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Negative Predictive value&lt;/strong&gt;: It is the probability that a predicted NO is correct or true i.e It tells us how often the prediction of a NO earthquake occurrence is correct. From &lt;em&gt;1.md&lt;/em&gt;, it is calculated as below -
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1 Score&lt;/strong&gt;: this is the &lt;strong&gt;harmonic mean&lt;/strong&gt; of precision and recall. From &lt;em&gt;1.md&lt;/em&gt;, it is calculated as below -
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
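&lt;p&gt;Plugging the counts from &lt;em&gt;1.md&lt;/em&gt; (TP = 150, TN = 60, FP = 20, FN = 10) into these formulas:&lt;/p&gt;

```python
TP, TN, FP, FN = 150, 60, 20, 10

recall = TP / (TP + FN)       # True Positive Rate / Sensitivity
specificity = TN / (TN + FP)  # True Negative Rate
precision = TP / (TP + FP)    # Positive Predictive Value
npv = TN / (TN + FN)          # Negative Predictive Value
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(round(recall, 3))       # 0.938
print(round(specificity, 3))  # 0.75
print(round(precision, 3))    # 0.882
print(round(npv, 3))          # 0.857
print(round(f1, 3))           # 0.909
print(round(accuracy, 3))     # 0.875 - the ~88% accuracy from earlier
```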

&lt;p&gt;With the problem we are trying to solve, perhaps we should be much more concerned with reducing the &lt;strong&gt;False Negatives&lt;/strong&gt; or &lt;strong&gt;Type 2 errors&lt;/strong&gt; i.e when the model predicts there will be no earthquake but there actually is one. This is much more dangerous than the type 1 error in this case. The model incorrectly classified 10 cases in which earthquakes occurred by saying they did not occur. &lt;strong&gt;Just imagine chilling with some fresh orange juice and watching Black Panther on Netflix and suddenly the ground starts shaking 😱!!!&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The metric you choose to optimize depends on the problem being solved.   Let us take a look at some scenarios below - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the &lt;strong&gt;occurrence of false negatives is unacceptable&lt;/strong&gt;, then choose to optimize &lt;strong&gt;Recall&lt;/strong&gt; - like we would want to do for the earthquake occurrence problem. Here, we won’t mind getting extra false positives just to reduce the number of false negatives i.e we would rather say that an earthquake will occur when it will not, RATHER than say an earthquake will not occur when it does.&lt;/li&gt;
&lt;li&gt;If the &lt;strong&gt;occurrence of false positives is unacceptable&lt;/strong&gt;, then choose to optimize &lt;strong&gt;Specificity&lt;/strong&gt;. Let me give an example where false positives should not be overlooked - say I am trying to predict if a patient has coronavirus. Since I am trying to detect coronavirus, having it (a yes or 1) would represent the positive class while being healthy would represent the negative class. So, if I carry out a test where a patient detected or predicted as positive (i.e as having coronavirus) would be quarantined, I would want to make sure a healthy person is not detected as having coronavirus. In this case, we would not accept any false positives.&lt;/li&gt;
&lt;li&gt;If you want to be &lt;strong&gt;extra sure about the true positives&lt;/strong&gt;, choose to optimize &lt;strong&gt;Precision&lt;/strong&gt;. For example, if we are detecting coronavirus, testing centres would want to be very confident that a patient classified or predicted as having the virus truly has it.&lt;/li&gt;
&lt;li&gt;Choose to optimize &lt;strong&gt;F1 score&lt;/strong&gt; if you need a balance between &lt;em&gt;Precision&lt;/em&gt; and &lt;em&gt;Recall&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Random Friend&lt;/strong&gt;: Amazinggg! With these explanations, I will definitely work on improving my model’s performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me&lt;/strong&gt;: You are always welcome! Excited you are becoming a Machine Learning Maven! Stay tuned on this series on &lt;strong&gt;Introducing the ROC Curve!&lt;/strong&gt; Have an amazing and fulfilled week ahead!&lt;/p&gt;

</description>
      <category>modelevaluation</category>
      <category>machinelearning</category>
      <category>classification</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The Data Viz Wiz: Introducing Matplotlib</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Thu, 30 Apr 2020 18:20:56 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-data-viz-wiz-introducing-matplotlib-54g5</link>
      <guid>https://dev.to/joyadauche/the-data-viz-wiz-introducing-matplotlib-54g5</guid>
      <description>&lt;p&gt;A picture is worth a thousand words 😀! You can &lt;em&gt;derive valuable insights&lt;/em&gt; from data and also &lt;em&gt;communicate these insights&lt;/em&gt; via data visualization.&lt;/p&gt;

&lt;p&gt;We would clearly see trends and derive insights via Python’s &lt;strong&gt;Matplotlib&lt;/strong&gt; library, which is the foundational library used by many visualization tools. There are 3 main layers in Matplotlib’s architecture and, from the highest-level interface to the lowest, they are - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;matplotlib.pyplot&lt;/strong&gt; module  - the &lt;em&gt;scripting layer&lt;/em&gt; which is often called &lt;strong&gt;procedural plotting&lt;/strong&gt; and is used when you want to quickly create plots and get done with it. This layer is designed to work like a MATLAB script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;matplotlib.artist&lt;/strong&gt; module  - the &lt;em&gt;artist layer&lt;/em&gt; which is often called &lt;strong&gt;object-oriented plotting&lt;/strong&gt; and with which you can do a lot more customizations because you have much more control. &lt;strong&gt;&lt;em&gt;Note that this layer also uses the pyplot module for a few functions like creating the figure - we would see in the examples below that even in the object-oriented approach, pyplot is still used to create the figure, which holds anything plotted&lt;/em&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;matplotlib.backend_bases&lt;/strong&gt; module  - the &lt;em&gt;backend layer&lt;/em&gt; - Matplotlib can be used in many ways and also has different output formats e.g Matplotlib can be run from the python shell, where plotting windows pop up; or it can be run via Jupyter notebooks, where plots are drawn inline. So, the backend layer exists to support these several use cases and outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, with what we have above, there are essentially 2 ways to create plots in Matplotlib - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The procedural way - this is where we mostly do &lt;em&gt;plt.xxx&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The object-oriented way - this is where we mostly do &lt;em&gt;ax.xxx&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything we plot in Matplotlib is contained in a &lt;em&gt;figure object&lt;/em&gt; which can contain one or more &lt;em&gt;axes&lt;/em&gt;. Take a look at the &lt;em&gt;figure anatomy&lt;/em&gt; image below -&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F86qv27z6t7ys5ok9br9q.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F86qv27z6t7ys5ok9br9q.jpg" alt="Alt Text" width="499" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We would focus on the object-oriented way in this piece since we can do a lot more customization with it. Also, note that there are different ways to create an axes - &lt;em&gt;an axes is contained in a figure&lt;/em&gt; as seen above - and these different ways all produce the same result; I will highlight them below. Now, let’s kick off some Matplotlib plotting by taking a look at &lt;em&gt;1.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;1.py&lt;/em&gt; produces a &lt;em&gt;figure with 1 axes&lt;/em&gt; image below -&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0uki7m4zgfp5jcgw47t0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0uki7m4zgfp5jcgw47t0.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;1.py&lt;/em&gt; above - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line 1 = pyplot is a module in Matplotlib which will help us in plotting. It is conventionally imported with the alias &lt;em&gt;plt&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Line 3 = when &lt;em&gt;plt.subplots()&lt;/em&gt; is called without any arguments, it creates 2 objects - a &lt;strong&gt;Figure&lt;/strong&gt; object and an &lt;strong&gt;Axes&lt;/strong&gt; object. 
The &lt;em&gt;Figure&lt;/em&gt; object is like a container that holds the axes and a &lt;em&gt;figure can contain multiple axes&lt;/em&gt;.
The &lt;em&gt;Axes&lt;/em&gt; object is where we plot our data to visualize it&lt;/li&gt;
&lt;li&gt;Line 4 = displays the plot - which is a &lt;em&gt;figure with empty axes&lt;/em&gt; because no data has been added yet&lt;/li&gt;
&lt;/ul&gt;
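A minimal sketch, assuming 1.py looks roughly like the explanation above describes (a Figure and one Axes created, then shown) - the non-interactive backend line is my own addition so the sketch runs anywhere:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this sketch runs anywhere
import matplotlib.pyplot as plt

# create two objects at once: a Figure (the container) and an Axes (where data is plotted)
fig, ax = plt.subplots()

# display the plot - a figure with empty axes, since no data has been added yet
plt.show()
```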

&lt;p&gt;There are different ways of creating an axes in Matplotlib - &lt;em&gt;plt.subplot()&lt;/em&gt;, &lt;em&gt;plt.subplots()&lt;/em&gt; and &lt;em&gt;plt.axes()&lt;/em&gt; are all from the &lt;em&gt;scripting layer&lt;/em&gt; and they correspond to &lt;em&gt;fig.add_subplot()&lt;/em&gt;, &lt;em&gt;fig.subplots()&lt;/em&gt; and &lt;em&gt;fig.add_axes()&lt;/em&gt; from the &lt;em&gt;artist layer&lt;/em&gt;. &lt;em&gt;Lines 6-44&lt;/em&gt; show other ways of creating an axes which produce the same result.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line 16 - &lt;strong&gt;fig.add_subplot(1, 1, 1)&lt;/strong&gt; means 1 row, 1 column and the last argument gives the position of the subplot, which is the 1st subplot in this case - &lt;em&gt;the last argument has to be less than or equal to the product of the 1st and 2nd arguments&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Line 28 - &lt;strong&gt;fig.subplots(1, 1)&lt;/strong&gt; means 1 row and 1 column&lt;/li&gt;
&lt;li&gt;Lines 39 and 43 - the common argument list &lt;strong&gt;[0.1, 0.1, 0.8, 0.8]&lt;/strong&gt; places the axes 10% from the left of the figure and 10% from the bottom, with a width of 80% and a height of 80% of the figure.&lt;/li&gt;
&lt;/ul&gt;
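The artist-layer alternatives above can be sketched as follows - this is an assumption of what the corresponding gist lines contain, not the exact code:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# 1 row, 1 column, 1st subplot - the last argument must be <= rows * columns
fig1 = plt.figure()
ax1 = fig1.add_subplot(1, 1, 1)

# 1 row and 1 column
fig2 = plt.figure()
ax2 = fig2.subplots(1, 1)

# [left, bottom, width, height] as fractions of the figure:
# 10% from the left, 10% from the bottom, 80% wide, 80% tall
fig3 = plt.figure()
ax3 = fig3.add_axes([0.1, 0.1, 0.8, 0.8])
```

Each pair produces the same result: one figure holding one axes.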

&lt;p&gt;Now let’s add some &lt;strong&gt;fictional data&lt;/strong&gt; to our figure! See &lt;em&gt;2.py&lt;/em&gt; below - &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;2.py&lt;/em&gt; produces a &lt;strong&gt;&lt;em&gt;line plot of Lagos average monthly temperature&lt;/em&gt;&lt;/strong&gt; below -&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh13gxwwd3blpsol16d2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh13gxwwd3blpsol16d2m.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the line plot above, we can clearly see the temperature pattern: it increases from &lt;em&gt;Jan&lt;/em&gt; to &lt;em&gt;Jul&lt;/em&gt;, decreases from &lt;em&gt;Aug&lt;/em&gt; to &lt;em&gt;Nov&lt;/em&gt;, then starts increasing again. Imagine you have lots of data - would you rather go through the pain of reading off average temperatures from a table or use the line plot, which shows the trends in the data much more clearly?&lt;/p&gt;
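A sketch of the kind of code 2.py holds - the temperature values here are made up to follow the pattern just described, not the article's actual numbers:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# fictional Lagos averages: rising to Jul, falling to Nov, rising again
lagos_temps = [25, 26, 27, 28, 29, 30, 31, 30, 28, 27, 26, 28]

fig, ax = plt.subplots()
ax.plot(months, lagos_temps)  # plot the data on the Axes object
plt.show()
```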

&lt;p&gt;We can even add more &lt;strong&gt;fictional data&lt;/strong&gt; like in &lt;em&gt;3.py&lt;/em&gt; below -&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;3.py&lt;/em&gt; above produces a &lt;strong&gt;&lt;em&gt;line plot of Lagos and Abuja average monthly temperature&lt;/em&gt;&lt;/strong&gt; we see below - &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxesynhe1rkpndioqh3ue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxesynhe1rkpndioqh3ue.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the image above, we can clearly see -  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Abuja is warmer than Lagos for the first 8 months&lt;/strong&gt; - perhaps you might prefer chilling in Lagos for the first 8 months of the year 😉?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abuja has a drop in temperature in September which is even lower than that of Lagos for the same month&lt;/strong&gt; - seems you might want to travel back to Abuja this time perhaps to meet with family 😉?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;But hey chill! looks like for the rest of the year, there is a rise in temperature which is higher than that of Lagos&lt;/strong&gt; - Ermmm! I think you might want to just chill in Lagos for a bit and monitor the trends for a while 😉?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The line plots we have seen so far show us the monthly trends for the average temperature across different cities, but &lt;strong&gt;they do not communicate the data in a way that can be easily understood&lt;/strong&gt;. This is where we have to &lt;strong&gt;&lt;em&gt;customize&lt;/em&gt;&lt;/strong&gt; our plot in order to communicate the information more clearly. Let’s see &lt;em&gt;4.py&lt;/em&gt; below for some customizations -&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;4.py&lt;/em&gt; above produces a &lt;strong&gt;&lt;em&gt;customized line plot of Lagos average monthly temperature values&lt;/em&gt;&lt;/strong&gt; below - &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fboiiuqenco7mfvsg1glt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fboiiuqenco7mfvsg1glt.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;4.py&lt;/em&gt; above - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In lines 12-14, we added these arguments:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;marker&lt;/strong&gt; which shows the actual data points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;markersize&lt;/strong&gt;, &lt;strong&gt;markerfacecolor&lt;/strong&gt;, &lt;strong&gt;markeredgewidth&lt;/strong&gt;, &lt;strong&gt;markeredgecolor&lt;/strong&gt; which customize the marker by &lt;em&gt;increasing its size&lt;/em&gt;, &lt;em&gt;adding colour to its fill&lt;/em&gt; and giving the marker outline a &lt;em&gt;width&lt;/em&gt; and &lt;em&gt;colour&lt;/em&gt; respectively&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;linestyle&lt;/strong&gt;, &lt;strong&gt;linewidth&lt;/strong&gt; and &lt;strong&gt;color&lt;/strong&gt; which give the line in the plot its style, width and color. &lt;em&gt;linestyle&lt;/em&gt; can be shortened to &lt;em&gt;ls&lt;/em&gt; while &lt;em&gt;linewidth&lt;/em&gt; can be shortened to &lt;em&gt;lw&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Line 16 - sets the label for the x-axis&lt;/li&gt;
&lt;li&gt;Line 17 - sets the label for the y-axis&lt;/li&gt;
&lt;li&gt;Line 18 - sets the title for the line plot and this provides &lt;em&gt;context for our visualization&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Note that every &lt;em&gt;customization&lt;/em&gt; is done before a call to &lt;em&gt;plt.show()&lt;/em&gt; is made.&lt;/li&gt;
&lt;/ul&gt;
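Putting those arguments together, a 4.py-style customized plot might look like the sketch below - the data, colours and label strings are illustrative assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
lagos_temps = [25, 26, 27, 28, 29, 30, 31, 30, 28, 27, 26, 28]  # fictional

fig, ax = plt.subplots()
ax.plot(months, lagos_temps,
        marker="o", markersize=8,                      # show and size the data points
        markerfacecolor="white",                       # fill colour of each marker
        markeredgewidth=1.5, markeredgecolor="teal",   # marker outline width and colour
        linestyle="--", linewidth=2, color="teal")     # ls, lw and colour of the line

ax.set_xlabel("Month")
ax.set_ylabel("Average Temperature")
ax.set_title("Lagos Average Monthly Temperature")  # context for the visualization
plt.show()  # every customization happens before this call
```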

&lt;p&gt;To customize the &lt;strong&gt;&lt;em&gt;line plot of Lagos and Abuja average monthly temperature&lt;/em&gt;&lt;/strong&gt;, see &lt;em&gt;5.py&lt;/em&gt; below - &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;5.py&lt;/em&gt; above produces a &lt;em&gt;&lt;strong&gt;customized line plot of Lagos and Abuja average monthly temperature values&lt;/strong&gt;&lt;/em&gt; below -&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1i9a1iqdq7zxamomljd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1i9a1iqdq7zxamomljd2.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sometimes, when we add more data to a plot like in the &lt;em&gt;customized line plot of Lagos and Abuja average monthly temperature values&lt;/em&gt; above, it looks so busy that it becomes a big mess which conceals patterns or trends in the data rather than conveying them. The solution to this is to use &lt;strong&gt;subplots&lt;/strong&gt;. Subplots are several small plots which show the same kind of data under different conditions e.g. temperature values for different cities. Let’s see &lt;em&gt;6.py&lt;/em&gt; below for an example -&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;6.py&lt;/em&gt; above produces &lt;strong&gt;&lt;em&gt;subplots of the average monthly temperature across cities&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F95974z0nl3nur9srq0lc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F95974z0nl3nur9srq0lc.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;6.py&lt;/em&gt; above - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line 16 - &lt;em&gt;sharey = True&lt;/em&gt; ensures subplots have the same range on the y-axis based on the data from both datasets&lt;/li&gt;
&lt;li&gt;Line 22 - since the subplots are on top of each other, we can just add the x-axis label to the bottom plot&lt;/li&gt;
&lt;/ul&gt;
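A sketch in the spirit of 6.py, with both cities' values made up - note how &lt;em&gt;sharey=True&lt;/em&gt; and the single bottom x label from the notes above appear here:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
lagos_temps = [25, 26, 27, 28, 29, 30, 31, 30, 28, 27, 26, 28]  # fictional
abuja_temps = [27, 28, 29, 30, 31, 32, 33, 31, 26, 28, 29, 30]  # fictional

# 2 rows, 1 column; sharey=True gives both subplots the same y-axis range
fig, ax = plt.subplots(2, 1, sharey=True)

ax[0].plot(months, lagos_temps)
ax[0].set_ylabel("Lagos")

ax[1].plot(months, abuja_temps)
ax[1].set_ylabel("Abuja")
ax[1].set_xlabel("Month")  # only the bottom subplot needs the x label

plt.show()
```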

&lt;p&gt;In the example given in &lt;em&gt;6.py&lt;/em&gt; above, we got a 1-dimensional axes object since one of the dimensions is 1. For a 2-dimensional array, we can access the object in several ways, as we see in &lt;em&gt;lines 29-69&lt;/em&gt; - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1a, 1b and 1c show different ways we can access the axes object: via &lt;strong&gt;regular indexing we do in Python&lt;/strong&gt;, via &lt;strong&gt;flattening the 2D array&lt;/strong&gt; and via &lt;strong&gt;tuple unpacking&lt;/strong&gt; respectively.&lt;/li&gt;
&lt;li&gt;The rest shows the different usage patterns when &lt;em&gt;ax&lt;/em&gt; is an array of axes objects.&lt;/li&gt;
&lt;/ul&gt;
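The three access patterns for a 2-dimensional grid of axes (regular indexing, flattening, tuple unpacking) can be sketched like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# a 2x2 grid: ax is a 2-dimensional NumPy array of Axes objects
fig, ax = plt.subplots(2, 2)

top_left = ax[0, 0]              # 1a: regular indexing we do in Python/NumPy
also_top_left = ax.flatten()[0]  # 1b: flatten the 2D array, then index it

# 1c: tuple unpacking straight into four named axes
fig2, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
```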

&lt;p&gt;As seen above, you can create axes in different ways. For example, an axes added to the figure via &lt;em&gt;fig.add_axes()&lt;/em&gt; is not a subplot, but an axes which is an object of &lt;em&gt;matplotlib.axes._axes.Axes&lt;/em&gt;. An axes created via the subplot way is a &lt;em&gt;matplotlib.axes._subplots.AxesSubplot&lt;/em&gt;. This class &lt;em&gt;derives&lt;/em&gt; from &lt;em&gt;matplotlib.axes._axes.Axes&lt;/em&gt;, thus this &lt;em&gt;subplot is an axes&lt;/em&gt;. Hence, &lt;strong&gt;every subplot is an Axes object but not every Axes object is an AxesSubplot object&lt;/strong&gt;. An axes contains the x-axis and the y-axis. Be it singular or plural, it is still called &lt;em&gt;axes&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;Alrightee!!! Glad we are gradually demystifying the mystical Matplotlib library. I know you will eventually get the hang of it and begin your tour in becoming a Data Viz Wiz! Stay tuned to this series for my next article on &lt;strong&gt;Visualizing categorical and quantitative variables via Matplotlib&lt;/strong&gt;! Have an amazing and fulfilled week ahead!&lt;/p&gt;

</description>
      <category>python</category>
      <category>matplotlib</category>
      <category>datavisualization</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The Pandas Pundit: Accessing Data in DataFrames</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Tue, 31 Mar 2020 18:56:10 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-pandas-pundit-accessing-data-in-dataframes-4164</link>
      <guid>https://dev.to/joyadauche/the-pandas-pundit-accessing-data-in-dataframes-4164</guid>
      <description>&lt;p&gt;Congratulations -  You have just landed a new job as a Data Scientist 😀! In your first month, you need to start analyzing &lt;em&gt;tons of data&lt;/em&gt;. But before you start unlocking insights and predicting future trends, you need to &lt;em&gt;access&lt;/em&gt; these data in order to explore it. Yikes! Did you wish you just started predicting future trends right away? Smiles, You are not Doctor Fate!&lt;/p&gt;

&lt;p&gt;These tons of data can vary greatly in form and they are commonly seen in a tabular structure, where we have rows (also known as records, observations etc) and columns (also known as features, variables, fields etc) - like in &lt;em&gt;1.md&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Data in .csv and .xlsx files have a tabular-like structure, and in order to work efficiently with this kind of data in &lt;em&gt;Python&lt;/em&gt;, we need to use the &lt;em&gt;Pandas&lt;/em&gt; package. In Pandas, there is a data structure that can handle a tabular-like structure of data - this data structure is called the &lt;strong&gt;DataFrame&lt;/strong&gt;. Look at &lt;em&gt;2.md&lt;/em&gt; below to see the DataFrame version of &lt;em&gt;1.md&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;2.md&lt;/em&gt;, you can see a similar structure like in &lt;em&gt;1.md&lt;/em&gt; - we also have rows and columns - each row has a unique row label - &lt;em&gt;NG&lt;/em&gt;, &lt;em&gt;CA&lt;/em&gt;, &lt;em&gt;BR&lt;/em&gt;, &lt;em&gt;CH&lt;/em&gt;, &lt;em&gt;FR&lt;/em&gt;. The columns also have labels - &lt;em&gt;country&lt;/em&gt;, &lt;em&gt;capital&lt;/em&gt;, &lt;em&gt;population_millions&lt;/em&gt;. So, how do you put this data in a DataFrame to start exploring? Also for you to explore it well, &lt;strong&gt;what are the different ways to access the data in this DataFrame&lt;/strong&gt;? Cheers! You are about to start your journey on becoming a Pandas Pundit!&lt;/p&gt;
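One way to put that table into a DataFrame is to build it by hand - a hypothetical sketch using the row and column labels above; the population figures are purely illustrative:

```python
import pandas as pd

# hypothetical countries DataFrame; values are illustrative
countries = pd.DataFrame(
    {"country": ["Nigeria", "Canada", "Brazil", "China", "France"],
     "capital": ["Abuja", "Ottawa", "Brasilia", "Beijing", "Paris"],
     "population_millions": [200, 37, 209, 1433, 67]},
    index=["NG", "CA", "BR", "CH", "FR"])  # unique row labels

print(countries)
```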

&lt;p&gt;You are given tons of data in a CSV file as seen below in &lt;em&gt;3.csv&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;First of all, to get the data in &lt;em&gt;3.csv&lt;/em&gt; into a DataFrame, look at &lt;em&gt;4.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;which returns the DataFrame in &lt;em&gt;5.txt&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Now that we have our data in a DataFrame, it is time to access it. There are several ways to access or select or index or subset or slice data in DataFrames - Data can be accessed via:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;square brackets: [ ]&lt;/li&gt;
&lt;li&gt;loc: label-based &lt;/li&gt;
&lt;li&gt;iloc: position-based&lt;/li&gt;
&lt;li&gt;at: label-based&lt;/li&gt;
&lt;li&gt;iat: position-based&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s see how you can access data in &lt;em&gt;columns only&lt;/em&gt;, &lt;em&gt;rows only&lt;/em&gt; and &lt;em&gt;both rows and columns&lt;/em&gt; from the DataFrame in &lt;strong&gt;5.txt&lt;/strong&gt; using the 5 ways above:&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;square brackets [ ]&lt;/strong&gt; - 1
&lt;/h1&gt;

&lt;p&gt;Let's look at &lt;strong&gt;column access&lt;/strong&gt; and &lt;strong&gt;row access&lt;/strong&gt; using []:&lt;br&gt;
&lt;strong&gt;1.1&lt;/strong&gt;: &lt;strong&gt;Column Access&lt;/strong&gt;:&lt;br&gt;
We have &lt;em&gt;single column access&lt;/em&gt; and &lt;em&gt;multiple column access&lt;/em&gt;.&lt;br&gt;
&lt;strong&gt;1.1.1&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;single column access&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
To access data in the &lt;em&gt;Country&lt;/em&gt; column in &lt;em&gt;5.txt&lt;/em&gt; above, for example, we do:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;which returns:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;7.txt&lt;/em&gt; above, the dtype (datatype) of what is returned is an object. The type of object returned can be known using:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;8.py&lt;/em&gt; above, it is a pandas &lt;em&gt;Series&lt;/em&gt; object. A &lt;em&gt;pandas series&lt;/em&gt; is a 1D (1-dimensional) labelled array - just like the DataFrame, a series has row labels/indexes. So, with this, it shows that &lt;strong&gt;a collection of series creates a DataFrame&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;This series object returned can also be accessed using the square brackets. For example, to grab the value &lt;em&gt;Nigeria&lt;/em&gt; in &lt;em&gt;7.txt&lt;/em&gt; above, see &lt;em&gt;9.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Also, note that I can use the &lt;em&gt;dot notation&lt;/em&gt; as seen in &lt;em&gt;lines 19-20&lt;/em&gt; above. Use the dot notation only when &lt;strong&gt;the column name does not contain any special characters or spaces, is not a Python keyword and does not clash with an existing DataFrame attribute or method&lt;/strong&gt;.&lt;/p&gt;
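On a hypothetical countries DataFrame shaped like 5.txt (illustrative values), single column access, the dot notation and indexing the returned Series look like this:

```python
import pandas as pd

# hypothetical data shaped like the article's countries DataFrame
countries = pd.DataFrame(
    {"country": ["Nigeria", "Canada", "Brazil", "China", "France"],
     "capital": ["Abuja", "Ottawa", "Brasilia", "Beijing", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"])

col = countries["country"]    # single square brackets return a Series
same_col = countries.country  # dot notation - the same Series
value = col["NG"]             # index the Series by row label
```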

&lt;p&gt;However, if you want a DataFrame returned and not a series - use double square brackets as seen in &lt;em&gt;10.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;1.1.2&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;multiple column access&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
To access more than one column in &lt;em&gt;5.txt&lt;/em&gt;, we do:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;which returns:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;1.2&lt;/strong&gt;: &lt;strong&gt;Row Access&lt;/strong&gt;:&lt;br&gt;
The only way to access rows in a DataFrame using square brackets is by specifying a slice on the rows. A slicing index takes the form - &lt;strong&gt;start:stop:step/stride&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Indices are either &lt;strong&gt;numeric&lt;/strong&gt;, which is the default, or &lt;strong&gt;labelled&lt;/strong&gt;. Let's dive deeper below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.2.1&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;default numeric indices&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
Given &lt;strong&gt;&lt;em&gt;[x:y:z]&lt;/em&gt;&lt;/strong&gt; as a slicing index, it means &lt;strong&gt;count in increments of &lt;em&gt;z&lt;/em&gt; starting at &lt;em&gt;x&lt;/em&gt; inclusive, up to &lt;em&gt;y&lt;/em&gt; exclusive&lt;/strong&gt; - for numeric indexes, &lt;em&gt;the stop index is always exclusive&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Take a look at this figure below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fks6435xuikd2xzs5bwff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fks6435xuikd2xzs5bwff.png" alt="Alt Text" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the figure above, the direction in which my rows are returned is determined by the &lt;strong&gt;&lt;em&gt;sign of the step/stride&lt;/em&gt;&lt;/strong&gt; i.e z, given &lt;em&gt;[x:y:z]&lt;/em&gt;. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the step/stride is positive, start from the specified index position of the DataFrame and go in the &lt;strong&gt;&lt;em&gt;downward/forward&lt;/em&gt;&lt;/strong&gt; direction when returning rows.&lt;/li&gt;
&lt;li&gt;If it is negative, start from the specified index position of the DataFrame and move &lt;strong&gt;&lt;em&gt;upwards/backwards&lt;/em&gt;&lt;/strong&gt; when returning rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Given [:y:-z] as a slicing index, start from the last row in the DataFrame and go backwards/upwards, but if [:y:z] is given, start from the first row in the DataFrame and go forward/downward&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Whether moving forward/downward or backward/upward, the start index should always come before the stop index in the direction of traversal, else no rows will be returned.&lt;/p&gt;

&lt;p&gt;Let us take a look below at how &lt;em&gt;positive&lt;/em&gt; and &lt;em&gt;negative&lt;/em&gt; strides work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.2.1.1&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;positive step(s)/stride(s)&lt;/em&gt;&lt;/strong&gt; :&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;13.py, lines 3-9&lt;/em&gt; above, we use the default numeric index. It has this structure &lt;em&gt;start:stop:[step or stride]&lt;/em&gt; - [step or stride] in square brackets means it is optional. Given &lt;em&gt;countries[1:3]&lt;/em&gt;, &lt;em&gt;1&lt;/em&gt; is the start while &lt;em&gt;3&lt;/em&gt; is the stop. When the step or stride is not specified, as is the case here, it has a default value of 1.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 11-18&lt;/em&gt; above - given &lt;em&gt;countries[2:]&lt;/em&gt; - note that the stop index is omitted i.e the explicit end index position is omitted. &lt;em&gt;countries[2:]&lt;/em&gt; returns rows starting with the row at index position 2 up to the last row in the DataFrame inclusive, as seen above. &lt;strong&gt;:&lt;/strong&gt; is a universal slice. If its left endpoint (start) is omitted, the rows returned start from the very first row in the DataFrame, but if the right endpoint (stop) is omitted, the rows returned run through to the very last row in the DataFrame inclusive.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 20-27&lt;/em&gt; above - given &lt;em&gt;countries[:3]&lt;/em&gt; - note that the start index is omitted i.e the explicit start index position is omitted.&lt;br&gt;
&lt;em&gt;countries[:3]&lt;/em&gt; returns rows starting with the first row till the row with index position 2 inclusive - remember that a &lt;em&gt;numeric stop index&lt;/em&gt; is exclusive: so, the row at index position 3 is not returned as seen above.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 29-38&lt;/em&gt; above, given &lt;em&gt;countries[:]&lt;/em&gt; - note that both the start and stop indices are omitted i.e the explicit start and stop index positions are omitted. &lt;em&gt;countries[:]&lt;/em&gt; returns rows from the first row through the last row inclusive i.e it returns every row.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 40-47&lt;/em&gt; above, given &lt;em&gt;countries[::2]&lt;/em&gt;, using the formula above, this means &lt;strong&gt;count in &lt;em&gt;increments/steps/strides&lt;/em&gt; of 2 &lt;em&gt;starting&lt;/em&gt; from the first row &lt;em&gt;up to&lt;/em&gt; the last row inclusive&lt;/strong&gt; i.e it returns every 2nd row. So, this is how it works - it returns rows&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;starting from the first row which has index 0, &lt;/li&gt;
&lt;li&gt;add steps of 2 i.e index 0 + 2 = index 2; then index 2 + 2 = index 4 &lt;/li&gt;
&lt;li&gt;so, we have rows with index positions 0, 2, 4 returned&lt;/li&gt;
&lt;/ul&gt;
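These positive-stride rules can be checked quickly on a hypothetical countries DataFrame (same row labels as above, illustrative values):

```python
import pandas as pd

# hypothetical data with the article's row labels
countries = pd.DataFrame(
    {"country": ["Nigeria", "Canada", "Brazil", "China", "France"]},
    index=["NG", "CA", "BR", "CH", "FR"])

print(countries[1:3])    # positions 1 and 2, stop exclusive: CA, BR
print(countries[::2])    # every 2nd row: NG, BR, FR
print(countries[-2:])    # last two rows: CH, FR
print(countries[-1:-1])  # start excludes its own stop: empty DataFrame
```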

&lt;p&gt;In &lt;em&gt;13.py, lines 49-54&lt;/em&gt; above, given &lt;em&gt;countries[-1:]&lt;/em&gt; - note the following below -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the DataFrame, countries, has rows labelled &lt;em&gt;NG&lt;/em&gt;, &lt;em&gt;CA&lt;/em&gt;, &lt;em&gt;BR&lt;/em&gt;, &lt;em&gt;CH&lt;/em&gt;, &lt;em&gt;FR&lt;/em&gt;. These rows also have default numeric indices which can be positive &lt;em&gt;0&lt;/em&gt;, &lt;em&gt;1&lt;/em&gt;, &lt;em&gt;2&lt;/em&gt;, &lt;em&gt;3&lt;/em&gt;, &lt;em&gt;4&lt;/em&gt; or negative &lt;em&gt;-5&lt;/em&gt;, &lt;em&gt;-4&lt;/em&gt;, &lt;em&gt;-3&lt;/em&gt;, &lt;em&gt;-2&lt;/em&gt;, &lt;em&gt;-1&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;countries[-1:]&lt;/em&gt; is the same as &lt;em&gt;countries[-1::1]&lt;/em&gt; - so we start from the last row and go downwards - downwards since the step/stride is positive&lt;/li&gt;
&lt;li&gt;this returns only the last row because there is no other row downwards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 56-64&lt;/em&gt; above, given &lt;em&gt;countries[:-1]&lt;/em&gt; - it returns rows starting from the first row but excludes the last row.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 66-72&lt;/em&gt; above, given &lt;em&gt;countries[-2:]&lt;/em&gt; - it returns rows starting from the last but one row till the last row inclusive.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 74-81&lt;/em&gt; above, given &lt;em&gt;countries[:-2]&lt;/em&gt; - it returns rows starting from the first row up to, but excluding, the last two rows.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 83-88&lt;/em&gt; above, given &lt;em&gt;countries[-1:-1]&lt;/em&gt; - it returns an empty DataFrame: it starts from the last row but also excludes the last row, hence nothing is returned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.2.1.2&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;negative step(s)/stride(s)&lt;/em&gt;&lt;/strong&gt; :&lt;br&gt;
With negative steps, rows get returned backwards. Let us see some examples in 14.py below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;14.py, lines 3-8&lt;/em&gt; above, given &lt;em&gt;countries[3:-4:-1]&lt;/em&gt; - note the following below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the stride/step is negative, so rows get returned backwards/upwards - when you look at the tabular-like structure in &lt;em&gt;5.txt&lt;/em&gt; above, we start getting rows from the end depending on the specified index and then go upwards/backwards i.e &lt;em&gt;from the row with label CH and then move upwards&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;the stop has a negative value of -4, which is the row labelled &lt;em&gt;CA&lt;/em&gt;. So, this row and the ones beyond it are not included in the rows returned&lt;/li&gt;
&lt;li&gt;hence we have just two rows labelled &lt;em&gt;CH&lt;/em&gt; and &lt;em&gt;BR&lt;/em&gt; returned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;em&gt;14.py, lines 11-17&lt;/em&gt; above, given &lt;em&gt;countries[4:-4:-2]&lt;/em&gt; - note the following below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the stride/step is negative and in steps of 2&lt;/li&gt;
&lt;li&gt;so we return rows starting from index position 4, then go upwards/backwards in steps of 2 i.e &lt;em&gt;from the row with label FR and then move upwards/backwards&lt;/em&gt; but &lt;em&gt;excluding the row with index position -4 and all other rows beyond it too&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;em&gt;14.py, lines 19-24&lt;/em&gt; above, given &lt;em&gt;countries[0:-1:-1]&lt;/em&gt;, it returns an empty DataFrame. Here is why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since the step/stride is negative, we start at 0 and then go upwards or backwards - but the &lt;em&gt;stop index comes before the start index&lt;/em&gt; in that direction - so this cannot work, hence an empty DataFrame is returned&lt;/li&gt;
&lt;/ul&gt;
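The negative-stride examples above can be sketched on the same hypothetical countries DataFrame (illustrative values):

```python
import pandas as pd

# hypothetical data with the article's row labels
countries = pd.DataFrame(
    {"country": ["Nigeria", "Canada", "Brazil", "China", "France"]},
    index=["NG", "CA", "BR", "CH", "FR"])

print(countries[3:-4:-1])  # backwards from CH, stop at CA exclusive: CH, BR
print(countries[4:-4:-2])  # backwards from FR in steps of 2: FR, BR
print(countries[0:-1:-1])  # stop lies after the start going backwards: empty
```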

&lt;p&gt;&lt;strong&gt;1.2.2&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;Labelled indexes&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
With labelled indexes, slicing also takes the form &lt;em&gt;start:stop:step/stride&lt;/em&gt;. But here are some things to take note of below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;The stop label is inclusive&lt;/em&gt;&lt;/strong&gt;, unlike numeric indices where the stop index is exclusive.&lt;/li&gt;
&lt;li&gt;Pairing labelled and numeric indices in the start or stop positions is not allowed e.g countries['NG':4] would give an error&lt;/li&gt;
&lt;li&gt;The step/stride is still numeric and can also be positive or negative&lt;/li&gt;
&lt;li&gt;and of course, all rules that go with having positive or negative step/stride applies here too.&lt;/li&gt;
&lt;/ul&gt;
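The inclusive stop label can be sketched on a hypothetical countries DataFrame (illustrative values, same row labels as above):

```python
import pandas as pd

# hypothetical data with the article's row labels
countries = pd.DataFrame(
    {"country": ["Nigeria", "Canada", "Brazil", "China", "France"]},
    index=["NG", "CA", "BR", "CH", "FR"])

print(countries["CA":"CH"])  # stop label is inclusive: CA, BR, CH
print(countries["NG"::2])    # every 2nd row from NG: NG, BR, FR
print(countries[:"CA":-1])   # backwards from FR through CA: FR, CH, BR, CA
```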

&lt;p&gt;Let’s see some examples in &lt;em&gt;15.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;15.py, lines 3-10&lt;/em&gt; above, given &lt;em&gt;countries['CA':'CH']&lt;/em&gt;, we can see that the rows returned include the row with the label &lt;em&gt;CH&lt;/em&gt;, which is the stop index label specified.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;15.py, lines 12-20&lt;/em&gt; above, given &lt;em&gt;countries[:'CH']&lt;/em&gt;, the rows returned start from the first row in the DataFrame and run through the row labelled &lt;em&gt;CH&lt;/em&gt; inclusive.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;15.py, lines 22-30&lt;/em&gt; above, given &lt;em&gt;countries['CA':]&lt;/em&gt;, the rows returned start from the row labelled &lt;em&gt;CA&lt;/em&gt; in the DataFrame and run through the last row inclusive.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;15.py, lines 32-39&lt;/em&gt; above, given &lt;em&gt;countries['NG'::2]&lt;/em&gt;, the rows returned -  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;starts from the row labelled &lt;em&gt;NG&lt;/em&gt; in the DataFrame&lt;/li&gt;
&lt;li&gt;add steps of 2 - index NG has a default index position of 0, so 0 + 2 = &lt;em&gt;index 2&lt;/em&gt; which is the row labelled BR; then index 2 + 2 = &lt;em&gt;index 4&lt;/em&gt;, which is the row labelled FR&lt;/li&gt;
&lt;li&gt;so, we have rows labelled &lt;em&gt;NG&lt;/em&gt;, &lt;em&gt;BR&lt;/em&gt;, &lt;em&gt;FR&lt;/em&gt; which have default numeric index positions 0, 2, 4 returned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;em&gt;15.py, lines 41-49&lt;/em&gt; above, given &lt;em&gt;countries[:‘CA’:-1]&lt;/em&gt;, the rows &lt;br&gt;
returned -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;starts from the last row, labelled &lt;em&gt;FR&lt;/em&gt;, in the DataFrame&lt;/li&gt;
&lt;li&gt;adds steps of -1:

&lt;ul&gt;
&lt;li&gt;index &lt;em&gt;FR&lt;/em&gt; has a default negative index position of &lt;em&gt;-1&lt;/em&gt;, so -1 + -1 
= &lt;em&gt;index -2&lt;/em&gt; which is the row labelled &lt;em&gt;CH&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;then index -2 + (-1) = &lt;em&gt;index -3&lt;/em&gt;, which is the row labelled &lt;em&gt;BR&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;then index -3 + (-1) = &lt;em&gt;index -4&lt;/em&gt;, which is the row labelled &lt;em&gt;CA&lt;/em&gt; &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;so, we have rows labelled &lt;em&gt;FR&lt;/em&gt;, &lt;em&gt;CH&lt;/em&gt;, &lt;em&gt;BR&lt;/em&gt;, &lt;em&gt;CA&lt;/em&gt; which have default numeric index positions -1, -2, -3, -4 returned.&lt;/li&gt;
&lt;/ul&gt;
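&lt;p&gt;Since the embedded gists may not render here, below is a minimal runnable sketch of the slicing rules above. The &lt;em&gt;countries&lt;/em&gt; DataFrame is a hypothetical stand-in - its labels and Country/Capital columns are assumptions, not the article's actual data:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical stand-in for the article's countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

# Label slicing: the stop label is INCLUSIVE
print(countries["CA":"CH"].index.tolist())  # ['CA', 'BR', 'CH']
print(countries[:"CH"].index.tolist())      # ['NG', 'CA', 'BR', 'CH']
print(countries["NG"::2].index.tolist())    # ['NG', 'BR', 'FR']
print(countries[:"CA":-1].index.tolist())   # ['FR', 'CH', 'BR', 'CA']
```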

&lt;p&gt;Using square brackets, [ ], has its limitations, such as the inability to select several rows and columns at the same time. So, let's jump into loc and iloc to see their awesomeness!&lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;loc&lt;/strong&gt; - 2
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;2.1&lt;/strong&gt;: &lt;strong&gt;Row Access&lt;/strong&gt;:&lt;br&gt;
By default, loc accesses rows. loc is label-based, so we just need to specify the row label. Let us see some examples in 16.py below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;16.py, lines 3-16&lt;/em&gt; above, given &lt;em&gt;countries.loc[‘FR’]&lt;/em&gt; or &lt;em&gt;countries.loc[[‘FR’]]&lt;/em&gt;, the row with label &lt;em&gt;FR&lt;/em&gt; is returned as a Series or a DataFrame respectively.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;16.py, lines 18-25&lt;/em&gt; above, given &lt;em&gt;countries.loc[[‘CA’, ‘BR’, ‘CH’]]&lt;/em&gt;, the rows with the labels listed in the square brackets are returned.&lt;/p&gt;
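&lt;p&gt;A quick sketch of loc row access, using a hypothetical &lt;em&gt;countries&lt;/em&gt; DataFrame (labels and columns assumed for illustration):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

fr_series = countries.loc["FR"]    # a single label returns a Series
fr_frame = countries.loc[["FR"]]   # a list of labels returns a DataFrame
several = countries.loc[["CA", "BR", "CH"]]  # multiple rows by label
```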

&lt;p&gt;&lt;strong&gt;2.2&lt;/strong&gt;: &lt;strong&gt;Row and Column Access&lt;/strong&gt;:&lt;br&gt;
We can simultaneously access rows and columns using loc. It takes the form &lt;em&gt;countries.loc[row, column]&lt;/em&gt;. Let us see some examples below in &lt;em&gt;17.py&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;17.py, lines 3-10&lt;/em&gt; above, given &lt;em&gt;countries.loc[['CA', 'BR', 'CH'], ['Country', &lt;br&gt;
'Capital']]&lt;/em&gt; - we specify a list of row labels and also a list of column labels we want to be returned.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;17.py, lines 12-50&lt;/em&gt; above, we can see slices can also be specified. Everything we have seen so far in relation to positive or negative steps/strides also applies here as seen in the examples given.&lt;/p&gt;
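&lt;p&gt;Here is a small sketch of the &lt;em&gt;countries.loc[row, column]&lt;/em&gt; form, again with a hypothetical &lt;em&gt;countries&lt;/em&gt; DataFrame (labels and columns assumed):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

# Lists of row labels and column labels
subset = countries.loc[["CA", "BR", "CH"], ["Country", "Capital"]]

# Label slices work too - both stop labels are inclusive
sliced = countries.loc["CA":"CH", "Country":"Capital"]
```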

&lt;p&gt;&lt;strong&gt;2.3&lt;/strong&gt;: &lt;strong&gt;Column Access&lt;/strong&gt;:&lt;br&gt;
We can also select the specific columns we need while we select all rows as seen below in &lt;em&gt;18.py&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
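&lt;p&gt;A minimal sketch of selecting all rows but only specific columns with loc (the &lt;em&gt;countries&lt;/em&gt; DataFrame below is a hypothetical stand-in):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

# All rows (:), only the column(s) we need
capitals = countries.loc[:, ["Capital"]]
```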


&lt;h1&gt;
  
  
  &lt;strong&gt;iloc&lt;/strong&gt; - 3
&lt;/h1&gt;

&lt;p&gt;Just like loc, iloc is row-based by default. The difference is that iloc is position-based. Let's see some examples in &lt;em&gt;19.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;19.py&lt;/em&gt; above gives the iloc version of the examples given for loc for &lt;em&gt;row access&lt;/em&gt;, &lt;em&gt;row and column access&lt;/em&gt; and &lt;em&gt;column access&lt;/em&gt;. As I earlier said, the only difference is that iloc is position-based while loc is label-based.&lt;/p&gt;
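&lt;p&gt;Below is a position-based sketch mirroring the loc examples, using a hypothetical &lt;em&gt;countries&lt;/em&gt; DataFrame. Note that with iloc, numeric stop positions are exclusive:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

row = countries.iloc[4]           # position 4: the row labelled FR, as a Series
rows = countries.iloc[[1, 2, 3]]  # rows CA, BR, CH by position
block = countries.iloc[1:4, 0:2]  # numeric stops are EXCLUSIVE
cols = countries.iloc[:, [1]]     # all rows, second column by position
```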

&lt;p&gt;Kindly note that &lt;em&gt;ix&lt;/em&gt;, which is also a way of accessing data in a DataFrame, is deprecated in favour of &lt;em&gt;loc&lt;/em&gt; and &lt;em&gt;iloc&lt;/em&gt;, so it is not advisable to use the &lt;em&gt;ix&lt;/em&gt; indexer.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;at&lt;/strong&gt; - 4
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;at&lt;/em&gt; is label-based like loc, but it is used only to access a single value in a DataFrame - unlike &lt;em&gt;loc&lt;/em&gt;, which can access not just a single value but also several values, as we have seen above.&lt;/p&gt;

&lt;p&gt;See some examples below in &lt;em&gt;20.py&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
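&lt;p&gt;A one-liner sketch of &lt;em&gt;at&lt;/em&gt; - one row label plus one column label - on a hypothetical &lt;em&gt;countries&lt;/em&gt; DataFrame:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

# at takes exactly one row label and one column label
capital = countries.at["FR", "Capital"]
print(capital)  # Paris
```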


&lt;h1&gt;
  
  
  &lt;strong&gt;iat&lt;/strong&gt; - 5
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;iat&lt;/em&gt; is position-based like iloc but also accesses a single value like &lt;em&gt;at&lt;/em&gt;. See some examples below in &lt;em&gt;21.py&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
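&lt;p&gt;And the &lt;em&gt;iat&lt;/em&gt; version - one row position plus one column position - on the same kind of hypothetical &lt;em&gt;countries&lt;/em&gt; DataFrame:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

# iat takes exactly one row position and one column position
capital = countries.iat[4, 1]  # row position 4, column position 1
print(capital)  # Paris
```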


&lt;p&gt;Now, let's summarize the key points together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accessing data using indices takes the form, &lt;em&gt;[start:stop:step]&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Numeric stop indices are exclusive&lt;/li&gt;
&lt;li&gt;Label stop indices are inclusive&lt;/li&gt;
&lt;li&gt;A step can be positive or negative - when it is positive, start at the specified index and move forward/downwards; when it is negative, start at the specified index and move backwards/upwards.&lt;/li&gt;
&lt;li&gt;Given &lt;em&gt;[:y:-z]&lt;/em&gt; as a slicing index, start from the last row in the DataFrame and go backwards/upwards.
&lt;/li&gt;
&lt;li&gt;Given &lt;em&gt;[:y:z]&lt;/em&gt; as a slicing index, start from the first row in the DataFrame and go forward/downwards.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;loc&lt;/em&gt; - label-based&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;iloc&lt;/em&gt; - position-based&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;at&lt;/em&gt; - label-based but returns a single value&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;iat&lt;/em&gt; - position-based but returns a single value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wowww! That was so much to take in. But you are well on your way to becoming a Pandas Pundit! Stay tuned to this series for my next article on &lt;strong&gt;Filtering Data in DataFrames&lt;/strong&gt;! Have an amazing and fulfilled week ahead!&lt;/p&gt;

</description>
      <category>python</category>
      <category>pandas</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The Proficient Pythonista: List Comprehensions</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Sat, 29 Feb 2020 18:06:16 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-proficient-pythonista-list-comprehensions-3c3</link>
      <guid>https://dev.to/joyadauche/the-proficient-pythonista-list-comprehensions-3c3</guid>
      <description>&lt;p&gt;&lt;em&gt;For&lt;/em&gt; loops are a thing! But if you could sometimes use a construct that is way more concise and efficient won’t you go for it? Hell Yeah!!!&lt;/p&gt;

&lt;p&gt;List comprehensions give us a succinct way to create lists based on existing lists.&lt;/p&gt;

&lt;p&gt;To create a new list of numbers from an existing list using a &lt;em&gt;for loop&lt;/em&gt; construct:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;What if I told you that what we have in the loop above can be done in just a single line of code! - This is the power of list comprehensions. Take a look below in &lt;em&gt;2.py&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The structure for the code written in &lt;em&gt;2.py&lt;/em&gt; above is &lt;strong&gt;[ output expression &lt;em&gt;for&lt;/em&gt; iterator variable &lt;em&gt;in&lt;/em&gt; iterable ]&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;number + 3&lt;/em&gt; is the output expression - the result returned&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;number&lt;/em&gt; is the iterator variable&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;numbers&lt;/em&gt; is the iterable&lt;/li&gt;
&lt;/ul&gt;
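&lt;p&gt;A minimal sketch of that structure - the &lt;em&gt;numbers&lt;/em&gt; list below is a hypothetical stand-in for the one in the gist:&lt;/p&gt;

```python
numbers = [1, 2, 3, 4]  # hypothetical input list

# the for-loop version
new_numbers = []
for number in numbers:
    new_numbers.append(number + 3)

# the same thing as a one-line list comprehension
new_numbers = [number + 3 for number in numbers]
print(new_numbers)  # [4, 5, 6, 7]
```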

&lt;p&gt;List comprehension can be written over any iterable like a range object and not just lists as below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
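&lt;p&gt;For instance, a sketch of a comprehension over a range object (the squaring expression is an assumed example, not necessarily the gist's):&lt;/p&gt;

```python
# A list comprehension over a range object instead of a list
squares = [num ** 2 for num in range(5)]
print(squares)  # [0, 1, 4, 9, 16]
```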


&lt;p&gt;We can also have conditionals in list comprehensions, which help filter what gets returned - they control which items from an existing list are returned. This conditional logic can be on the &lt;strong&gt;iterator variable&lt;/strong&gt; or the &lt;strong&gt;output expression&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;4.py&lt;/em&gt; below is an example of conditional logic used on the iterator variable, where it returns only items that are not equal to the string &lt;em&gt;cisco&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;4.py&lt;/em&gt; above, the structure for the code written is &lt;strong&gt;[output expression &lt;em&gt;for&lt;/em&gt; iterator variable &lt;em&gt;in&lt;/em&gt; iterable &lt;em&gt;if&lt;/em&gt; predicate expression ]&lt;/strong&gt; -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;if flash != ‘cisco’&lt;/em&gt; - the conditional logic, which is an &lt;strong&gt;if predicate expression&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
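&lt;p&gt;A sketch of that filter - the &lt;em&gt;flashes&lt;/em&gt; list is a made-up example standing in for the gist's data:&lt;/p&gt;

```python
# Hypothetical list; the if predicate filters out 'cisco'
flashes = ["jay", "cisco", "wally", "barry"]
not_cisco = [flash for flash in flashes if flash != "cisco"]
print(not_cisco)  # ['jay', 'wally', 'barry']
```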

&lt;p&gt;Let’s take it further by having nested if predicate expressions within a list comprehension:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;5.py&lt;/em&gt; above, it checks whether the number &lt;em&gt;num&lt;/em&gt; is divisible by 2 and then by 4, and outputs it only if it satisfies both conditions.&lt;/p&gt;
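&lt;p&gt;That nested-predicate pattern can be sketched like this (the range is an assumed example):&lt;/p&gt;

```python
# Two chained if predicates: keep numbers divisible by 2 AND by 4
nums = [num for num in range(1, 21) if num % 2 == 0 if num % 4 == 0]
print(nums)  # [4, 8, 12, 16, 20]
```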

&lt;p&gt;Moving on to &lt;em&gt;conditionals on the output expression&lt;/em&gt;, let us see an example below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The structure for the code written in &lt;em&gt;6.py&lt;/em&gt; above is  &lt;strong&gt;[conditional on output expression  &lt;em&gt;for&lt;/em&gt; iterator variable &lt;em&gt;in&lt;/em&gt; iterable]&lt;/strong&gt; - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;hero if len(hero)&amp;gt;=8 else ''&lt;/em&gt; - the conditional logic - outputs &lt;em&gt;hero&lt;/em&gt; if the length of the string hero is greater than or equal to 8, else it outputs an empty string.&lt;/li&gt;
&lt;/ul&gt;
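&lt;p&gt;A sketch of a conditional on the output expression - the hero names here are made-up stand-ins:&lt;/p&gt;

```python
# Hypothetical list; the conditional sits in the output expression
heroes = ["superman", "flash", "aquaman", "wonderwoman"]
names = [hero if len(hero) >= 8 else "" for hero in heroes]
print(names)  # ['superman', '', '', 'wonderwoman']
```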

&lt;p&gt;Like list comprehensions, we can also have a &lt;strong&gt;dictionary comprehension&lt;/strong&gt; or even a &lt;strong&gt;set comprehension&lt;/strong&gt;. This is getting quite interesting right?&lt;/p&gt;

&lt;p&gt;With set comprehensions, unlike list comprehensions, the output returned contains no duplicates. Let’s see an example in &lt;em&gt;7.py&lt;/em&gt; below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;7.py&lt;/em&gt;, no duplicates are returned. Also, note that sets, unlike lists, are inherently unordered - the order of items is not preserved - this is why &lt;em&gt;o&lt;/em&gt; can come first in the set even though &lt;em&gt;e&lt;/em&gt; is the first character in &lt;em&gt;cool_quote&lt;/em&gt;.&lt;/p&gt;
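&lt;p&gt;A set-comprehension sketch - the quote below is a hypothetical stand-in for &lt;em&gt;cool_quote&lt;/em&gt;:&lt;/p&gt;

```python
# Hypothetical quote; a set comprehension drops duplicate characters
cool_quote = "excellence over everything"
unique_letters = {char for char in cool_quote if char != " "}
# each letter appears only once; iteration order is not guaranteed
print(unique_letters)
```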

&lt;p&gt;Talking about dictionaries, let’s see an example of how we can use a dictionary comprehension to create a dictionary below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;8.py&lt;/em&gt; above, we have &lt;em&gt;key and value&lt;/em&gt; pairs separated by a colon (:) - the keys are the items of the list while the values are the lengths of those items.&lt;/p&gt;
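&lt;p&gt;A dictionary-comprehension sketch of that key: value pattern (the &lt;em&gt;languages&lt;/em&gt; list is an assumed example):&lt;/p&gt;

```python
# Keys are the list items; values are the lengths of those items
languages = ["python", "sql", "go"]
name_lengths = {lang: len(lang) for lang in languages}
print(name_lengths)  # {'python': 6, 'sql': 3, 'go': 2}
```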

&lt;p&gt;Nested loops are also something we do sometimes. See an example below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The code above is multiplying the items in the first list by the items in the second list. To write this in a list comprehension, it is written as below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;10.py&lt;/em&gt; above, the outer list comprehension &lt;strong&gt;[&lt;em&gt;... for i in range(5, 8)&lt;/em&gt;]&lt;/strong&gt; creates 3 rows, while the inner list comprehension &lt;strong&gt;[&lt;em&gt;i*j for j in range(1,3)&lt;/em&gt;]&lt;/strong&gt; fills these rows with values i.e i*j.&lt;/p&gt;
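&lt;p&gt;Putting the two forms side by side, using the exact ranges mentioned above:&lt;/p&gt;

```python
# for-loop version
matrix = []
for i in range(5, 8):
    row = []
    for j in range(1, 3):
        row.append(i * j)
    matrix.append(row)

# nested list comprehension version - same result
matrix = [[i * j for j in range(1, 3)] for i in range(5, 8)]
print(matrix)  # [[5, 10], [6, 12], [7, 14]]
```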

&lt;p&gt;A very good use of list comprehensions is to flatten a list consisting of multiple lists, as seen below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
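&lt;p&gt;For example, a flattening sketch (the nested list is made up):&lt;/p&gt;

```python
# Hypothetical list of lists; the two for clauses read left to right
list_of_lists = [[1, 2], [3, 4], [5]]
flat = [item for sub_list in list_of_lists for item in sub_list]
print(flat)  # [1, 2, 3, 4, 5]
```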


&lt;p&gt;List comprehensions are superb! But you can still use for loops in places where readability is key, because it is important to write code your team can read easily. I hope you are excited about your journey to becoming a Proficient Pythonista!!! 😉&lt;/p&gt;

&lt;p&gt;Stay tuned to this series for my next article on &lt;strong&gt;GENERATORS&lt;/strong&gt;! Have an amazing and fulfilled week ahead! &lt;/p&gt;

</description>
      <category>python</category>
      <category>codequality</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The SQL Savant: Inner Joins in SQL</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Sun, 05 Jan 2020 05:15:09 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-sql-savant-inner-joins-in-sql-37ak</link>
      <guid>https://dev.to/joyadauche/the-sql-savant-inner-joins-in-sql-37ak</guid>
      <description>&lt;p&gt;Ever tried retrieving the data you need from just one table but suddenly realised you need more detail or information about these data which you must get from another table? Joins to the rescue!!! &lt;br&gt;
You can get that additional detail you need using the power of Joins. With Joins in SQL, you can retrieve or access the information you need from two or more tables. &lt;/p&gt;

&lt;p&gt;Let’s say we are at a Javascript college that just hired a new teacher who wishes to carry every student along in his class, so he requested the academic details of every student - and these details are located in different tables. This certainly sounds like a task for Joins, right? &lt;/p&gt;

&lt;p&gt;Assume this class maintains a database consisting of 3 tables: &lt;em&gt;person&lt;/em&gt;, &lt;em&gt;grade&lt;/em&gt;, and &lt;em&gt;activity&lt;/em&gt;. We would most likely get the information we need by looking up data in the &lt;strong&gt;person&lt;/strong&gt; and &lt;strong&gt;grade&lt;/strong&gt; tables using joins.&lt;/p&gt;

&lt;p&gt;But here comes a pool of questions. How do we get the needed information using joins? What if there are students without grades? Does the teacher want to see only students with grades or all students whether there is a grade or not? Well, it actually depends on how this new Javascript teacher wants it right? So, we had a meeting with him and he wants to see &lt;em&gt;only students with grades&lt;/em&gt;. It is in this light we introduce the main types of Joins. They are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The inner joins and&lt;/li&gt;
&lt;li&gt;The outer joins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This class has a database with the &lt;strong&gt;person&lt;/strong&gt; and &lt;strong&gt;grade&lt;/strong&gt; tables as below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;THE INNER JOIN&lt;/strong&gt; (which is the most common type of join), also referred to as JOIN, is a type of JOIN that returns all rows from both participating tables where the key field of one table matches the key field of the other table. Using an inner join on the &lt;strong&gt;person&lt;/strong&gt; and &lt;strong&gt;grade&lt;/strong&gt; tables will return &lt;em&gt;only students with grades&lt;/em&gt;, like below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;2.md&lt;/em&gt; above, you can see that the inner join has combined both tables ON key columns whose values are common to both tables. It then returns records that contain the columns selected in the SELECT clause - you can see in &lt;em&gt;1.md&lt;/em&gt; that the &lt;em&gt;&lt;strong&gt;id&lt;/strong&gt;&lt;/em&gt; field of the &lt;em&gt;&lt;strong&gt;person table&lt;/strong&gt;&lt;/em&gt; and the &lt;em&gt;&lt;strong&gt;person_id&lt;/strong&gt;&lt;/em&gt; field of the &lt;em&gt;&lt;strong&gt;grade table&lt;/strong&gt;&lt;/em&gt; match for the values &lt;em&gt;33CC&lt;/em&gt; and &lt;em&gt;44DD&lt;/em&gt; only.&lt;/p&gt;

&lt;p&gt;The code for completing an INNER JOIN from the &lt;em&gt;person&lt;/em&gt; table to the &lt;em&gt;grade&lt;/em&gt; table based on the common values of the key field is shown below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;3.sql&lt;/em&gt; and &lt;em&gt;2.md&lt;/em&gt;, please note the following:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;3.sql&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the &lt;strong&gt;SELECT&lt;/strong&gt; clause, the fields or columns you want to be returned from the tables are listed.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;person&lt;/em&gt; table is conventionally called the &lt;em&gt;left table&lt;/em&gt; because it is the table after the FROM clause.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;grade&lt;/em&gt; table is conventionally called the &lt;em&gt;right table&lt;/em&gt; because it is the table after the JOIN keyword.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;INNER JOIN&lt;/strong&gt; keyword can be replaced with the &lt;strong&gt;JOIN&lt;/strong&gt; keyword - INNER JOIN is the default if you don't specify the type when you use the word JOIN.&lt;/li&gt;
&lt;li&gt;The ON clause introduces the key fields from each table we would be joining on. The ON clause is used to specify a &lt;em&gt;join condition&lt;/em&gt; or a &lt;em&gt;join-predicate&lt;/em&gt;. The &lt;em&gt;name&lt;/em&gt; of a key field or key column can be the same or vary from one table to another - the key field name in the &lt;em&gt;grade&lt;/em&gt; table, &lt;strong&gt;person_id&lt;/strong&gt;, can also be named &lt;strong&gt;id&lt;/strong&gt; like in the &lt;em&gt;person&lt;/em&gt; table as long as it does not conflict with other column names in the grade table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;2.md&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regardless of whether the value of the key field appears multiple times in a table, as long as it appears in both tables, the record would be included in the result e.g &lt;em&gt;33CC&lt;/em&gt; is returned twice, for the years 2019 and 2020 respectively.&lt;/li&gt;
&lt;/ul&gt;
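&lt;p&gt;Since the actual tables live in the gists, here is a runnable sketch of the inner join using Python's sqlite3 with made-up rows modelled on the &lt;em&gt;person&lt;/em&gt; and &lt;em&gt;grade&lt;/em&gt; tables (the names and scores are assumptions; only the ids &lt;em&gt;33CC&lt;/em&gt; and &lt;em&gt;44DD&lt;/em&gt; appearing in both tables follows the article):&lt;/p&gt;

```python
import sqlite3

# Made-up rows modelled on the article's person and grade tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id TEXT, name TEXT);
    CREATE TABLE grade  (person_id TEXT, year INTEGER, score INTEGER);
    INSERT INTO person VALUES ('11AA', 'Ada'), ('22BB', 'Musa'),
                              ('33CC', 'Zamani'), ('44DD', 'Chi');
    INSERT INTO grade  VALUES ('33CC', 2019, 70), ('33CC', 2020, 85),
                              ('44DD', 2020, 90);
""")

rows = conn.execute("""
    SELECT person.name, grade.year, grade.score
    FROM person
    INNER JOIN grade ON person.id = grade.person_id
""").fetchall()
print(rows)  # only students with grades; 33CC appears twice (2019 and 2020)
```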

&lt;p&gt;So, we can gladly use the inner join above and give the new teacher what he asked for.&lt;/p&gt;

&lt;p&gt;An ad-hoc request comes in! The new teacher wants to see &lt;em&gt;students with grades along with their yearly activities&lt;/em&gt;. Herein lies the awesomeness of SQL - &lt;em&gt;the ability to combine multiple joins in a single query&lt;/em&gt; - Multiple Joins. Ring a bell? Let us see an example below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;4.md&lt;/em&gt; above, we have 3 tables - person, grade and activity - an additional table, activity, which shows the hobbies a student partakes in yearly. So we came up with the query below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;5.sql&lt;/em&gt; above, &lt;strong&gt;activity.year&lt;/strong&gt; i.e &lt;em&gt;table_name.column_name&lt;/em&gt; is used because the field &lt;strong&gt;year&lt;/strong&gt; is common to the &lt;em&gt;grade&lt;/em&gt; and &lt;em&gt;activity&lt;/em&gt; tables. If the &lt;em&gt;table_name&lt;/em&gt; is not specified, an error would be thrown saying it is &lt;em&gt;ambiguous&lt;/em&gt;. So, &lt;em&gt;5.sql&lt;/em&gt; returns:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Take a second look at this result! Do you observe something wrong with it? Some &lt;em&gt;score&lt;/em&gt; field values are wrongly paired with activity and year values - &lt;em&gt;Zamani&lt;/em&gt; for example, &lt;em&gt;&lt;strong&gt;the 3rd and 4th rows are incorrect&lt;/strong&gt;&lt;/em&gt;. This is so because we did not join on an extra key field, &lt;em&gt;year&lt;/em&gt;, which is common to the &lt;em&gt;activity&lt;/em&gt; and &lt;em&gt;grade&lt;/em&gt; tables. So we modify the query to be:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;which returns the desired result below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The thought of how multiple inner joins work can be confusing right? Well, what actually happens, as in &lt;em&gt;7.sql&lt;/em&gt; above, is that &lt;strong&gt;every single join produces a single derived table&lt;/strong&gt; which is then joined to the next table, and so on. Using the &lt;em&gt;7.sql&lt;/em&gt; example above:&lt;br&gt;
&lt;strong&gt;JOIN 1&lt;/strong&gt;: This is the inner join between the &lt;em&gt;person&lt;/em&gt; and &lt;em&gt;grade&lt;/em&gt; tables. Let’s call its result derived table one (DT1).&lt;br&gt;
&lt;strong&gt;JOIN 2&lt;/strong&gt;: This is another inner join, between DT1 and the activity table. The result obtained here is the final result returned by the query.&lt;/p&gt;
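&lt;p&gt;A runnable sqlite3 sketch of the two-join query, with made-up rows (names, scores and hobbies are assumptions) - note that joining on &lt;em&gt;year&lt;/em&gt; as well keeps each score paired with the activity of the same year:&lt;/p&gt;

```python
import sqlite3

# Made-up rows; joining on BOTH person_id and year avoids the mispairing
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person   (id TEXT, name TEXT);
    CREATE TABLE grade    (person_id TEXT, year INTEGER, score INTEGER);
    CREATE TABLE activity (person_id TEXT, year INTEGER, hobby TEXT);
    INSERT INTO person   VALUES ('33CC', 'Zamani'), ('44DD', 'Chi');
    INSERT INTO grade    VALUES ('33CC', 2019, 70), ('33CC', 2020, 85),
                                ('44DD', 2020, 90);
    INSERT INTO activity VALUES ('33CC', 2019, 'chess'),
                                ('33CC', 2020, 'football'),
                                ('44DD', 2020, 'swimming');
""")

rows = conn.execute("""
    SELECT person.name, grade.score, activity.hobby, activity.year
    FROM person
    INNER JOIN grade    ON person.id = grade.person_id
    INNER JOIN activity ON person.id = activity.person_id
                       AND grade.year = activity.year
""").fetchall()
print(rows)  # each score is paired with the activity of the SAME year
```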

&lt;p&gt;So having delivered what is needed in time, let us refactor what we have written for a healthy codebase. There are several ways the query in &lt;em&gt;7.sql&lt;/em&gt; could have been written. For example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The &lt;strong&gt;AS&lt;/strong&gt; keyword is used for creating an &lt;em&gt;alias&lt;/em&gt; - an alias is a temporary name that only exists for the duration of the query. Good use cases for aliases are when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to make column names or table names more readable&lt;/li&gt;
&lt;li&gt;You want to write less because the names are long - for example, when there is more than one table in your query, you can write &lt;strong&gt;p.id = a.person_id&lt;/strong&gt; instead of &lt;strong&gt;person.id = activity.person_id&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the key field in &lt;em&gt;grade&lt;/em&gt; and &lt;em&gt;activity&lt;/em&gt; tables is also &lt;strong&gt;id&lt;/strong&gt; (and not person_id), the SQL code would rather be:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;You have seen the ON clause so far, but if the key field you are joining on has the same name in both tables, you can use the &lt;strong&gt;USING&lt;/strong&gt; keyword instead.&lt;/p&gt;
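&lt;p&gt;A tiny sqlite3 sketch of USING, assuming the key column is named &lt;em&gt;id&lt;/em&gt; in both tables (the row values are made up):&lt;/p&gt;

```python
import sqlite3

# When the key column has the same name in both tables, USING replaces ON
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id TEXT, name TEXT);
    CREATE TABLE grade  (id TEXT, year INTEGER, score INTEGER);
    INSERT INTO person VALUES ('33CC', 'Zamani');
    INSERT INTO grade  VALUES ('33CC', 2020, 85);
""")

rows = conn.execute("""
    SELECT name, year, score
    FROM person
    INNER JOIN grade USING (id)
""").fetchall()
print(rows)  # [('Zamani', 2020, 85)]
```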

&lt;p&gt;So, I believe you are now becoming an SQL savant! Oh Gosh! This new Javascript teacher is so demanding! Now, he is asking for &lt;em&gt;all students' academic details regardless of whether they have a grade or not&lt;/em&gt;. Well! Body no be wood! 😁  So, we have to push this request to another sprint right? 😉 We would set up a meeting with him in the near future to talk about how he wants this because this sounds like a job for the &lt;strong&gt;OUTER JOINS&lt;/strong&gt; and there are 3 kinds.&lt;/p&gt;

&lt;p&gt;Stay tuned to this series for my next article on &lt;strong&gt;OUTER JOINS&lt;/strong&gt;! Have an amazing and fulfilled week ahead! &lt;/p&gt;

</description>
      <category>sql</category>
      <category>datascience</category>
      <category>postgres</category>
      <category>dataanalysis</category>
    </item>
  </channel>
</rss>
