<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joy Ada Uche</title>
    <description>The latest articles on DEV Community by Joy Ada Uche (@joyadauche).</description>
    <link>https://dev.to/joyadauche</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F106713%2F80b529db-4623-42bf-9084-de3a346d5fdb.jpeg</url>
      <title>DEV Community: Joy Ada Uche</title>
      <link>https://dev.to/joyadauche</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joyadauche"/>
    <language>en</language>
    <item>
      <title>The SQL Savant: Outer Joins in SQL</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Thu, 31 Dec 2020 18:29:55 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-sql-savant-outer-joins-in-sql-1gfa</link>
      <guid>https://dev.to/joyadauche/the-sql-savant-outer-joins-in-sql-1gfa</guid>
      <description>&lt;p&gt;Amazing New Year!!! 😀 &lt;a href="https://dev.to/joyadauche/the-sql-savant-inner-joins-in-sql-37ak"&gt;So, the series of meetings with the new Javascript teacher&lt;/a&gt; went quite well and we got loads of analysis we gotta do...&lt;/p&gt;

&lt;p&gt;So right now he wants every student's academic details, whether they have a grade or not, which can easily be achieved with a Left Outer Join. Hence, let's talk about OUTER JOINS!&lt;/p&gt;

&lt;p&gt;With outer joins, all records from one table are kept even if there are no matches in the other table that it joins on. There are 3 types of outer joins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Left Joins&lt;/li&gt;
&lt;li&gt;Right Joins&lt;/li&gt;
&lt;li&gt;Full Joins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With &lt;strong&gt;LEFT JOIN&lt;/strong&gt;, all records from the &lt;strong&gt;left table&lt;/strong&gt; (i.e the left table is the one after the FROM clause) are kept even if there are no matches in the right table (i.e the table after the JOIN type). Remember that from &lt;a href="https://dev.to/joyadauche/the-sql-savant-inner-joins-in-sql-37ak"&gt;here&lt;/a&gt;, the class has a database with the &lt;strong&gt;person&lt;/strong&gt; and &lt;strong&gt;grade&lt;/strong&gt; tables as below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Applying a Left Join to the above, the table with all students, whether they have a grade or not, looks like below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;From above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can see that all records from the &lt;strong&gt;person&lt;/strong&gt; table, which is the left table, are returned,&lt;/li&gt;
&lt;li&gt;Also, the records with &lt;strong&gt;null&lt;/strong&gt; are those that exist in the left table (person table) but have no matching record in the right table (grade table),&lt;/li&gt;
&lt;li&gt;The first 4 records are the same as those returned by an inner join, while the last 3 correspond to students that do not have a grade, hence their grade values are null.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code for the result of the LEFT JOIN above is below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Unlike an INNER JOIN, which keeps just the records corresponding to the id values of &lt;strong&gt;33CC&lt;/strong&gt; and &lt;strong&gt;44DD&lt;/strong&gt;, a LEFT JOIN keeps all of the records in the left table but marks the values from the right table as null for those that don’t have a match.&lt;/p&gt;
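&lt;p&gt;If you'd like to poke at this yourself, here is a minimal sketch using Python's built-in sqlite3 module - the table layouts, column names and sample rows below are assumptions for illustration, not the article's original gist:&lt;/p&gt;

```python
import sqlite3

# In-memory database with simplified person and grade tables
# (columns, names and ids are assumed for illustration).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (id TEXT, name TEXT);
    CREATE TABLE grade  (person_id TEXT, grade TEXT);
    INSERT INTO person VALUES ('33CC', 'Ada'), ('44DD', 'Ben'), ('55EE', 'Chi');
    INSERT INTO grade  VALUES ('33CC', 'A'), ('44DD', 'B');
""")

# LEFT JOIN keeps every row of person; unmatched rows get NULL grades.
rows = con.execute("""
    SELECT p.id, p.name, g.grade
    FROM person p
    LEFT JOIN grade g ON g.person_id = p.id
""").fetchall()

for row in rows:
    print(row)  # ('55EE', 'Chi', None) is the student without a grade
```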

&lt;p&gt;Moving on to &lt;strong&gt;RIGHT JOINS&lt;/strong&gt;, which just do the reverse of LEFT JOINS: all records from the &lt;strong&gt;right table&lt;/strong&gt; are kept, matched via the key column, even if there are no matching records in the left table. Let's see the code below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;From above, the &lt;strong&gt;right&lt;/strong&gt; table is &lt;strong&gt;person&lt;/strong&gt; while the left table is &lt;strong&gt;grade&lt;/strong&gt;. Since the RIGHT JOIN is just the reverse of the LEFT JOIN, the LEFT JOIN is more commonly used.&lt;/p&gt;
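&lt;p&gt;One way to see that equivalence: SQLite did not even support RIGHT JOIN until version 3.39, precisely because a RIGHT JOIN can always be rewritten as a LEFT JOIN with the two tables swapped. A sketch (schema and rows assumed, as before):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (id TEXT, name TEXT);
    CREATE TABLE grade  (person_id TEXT, grade TEXT);
    INSERT INTO person VALUES ('33CC', 'Ada'), ('44DD', 'Ben');
    INSERT INTO grade  VALUES ('33CC', 'A'), ('99ZZ', 'C');
""")

# "grade RIGHT JOIN person" keeps every person row; rewriting it as
# "person LEFT JOIN grade" returns exactly the same result set.
rewritten = con.execute("""
    SELECT p.id, p.name, g.grade
    FROM person p
    LEFT JOIN grade g ON g.person_id = p.id
""").fetchall()

print(rewritten)  # ('44DD', 'Ben', None) is kept even without a grade
```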

&lt;p&gt;Finally, let's talk about &lt;strong&gt;FULL JOINS&lt;/strong&gt;! This type of join combines both the LEFT JOIN and RIGHT JOIN. It combines all the records from the LEFT TABLE and the RIGHT TABLE. For record values that do not match for the left and right tables, the value will be null, as seen in other types of outer joins.&lt;/p&gt;

&lt;p&gt;Note that in our example, the result of the FULL JOIN will be the same as that of the LEFT JOIN: with &lt;strong&gt;person&lt;/strong&gt; as the left table and &lt;strong&gt;grade&lt;/strong&gt; as the right table, every record in the right table matches a record in the left table, i.e there are no records in the right table that cannot be found in the left table, so the FULL JOIN returns exactly the records a Left Join would. Now, let's see the Full Join code below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As you can see from above, we just had to change the join type to FULL JOIN. Also, kindly note that we can do multiple joins with any type of outer joins just like we saw &lt;a href="https://dev.to/joyadauche/the-sql-savant-inner-joins-in-sql-37ak"&gt;here&lt;/a&gt;.&lt;/p&gt;
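&lt;p&gt;As a side note, on engines without FULL JOIN support (e.g. SQLite before 3.39), it is commonly emulated by UNION-ing a LEFT JOIN with the reversed LEFT JOIN's unmatched rows - a sketch with the same assumed schema:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (id TEXT, name TEXT);
    CREATE TABLE grade  (person_id TEXT, grade TEXT);
    INSERT INTO person VALUES ('33CC', 'Ada'), ('55EE', 'Chi');
    INSERT INTO grade  VALUES ('33CC', 'A'), ('99ZZ', 'C');
""")

# FULL JOIN emulated as: person LEFT JOIN grade, plus the grade rows
# that have no matching person (their person columns become NULL).
rows = con.execute("""
    SELECT p.id, p.name, g.grade
    FROM person p LEFT JOIN grade g ON g.person_id = p.id
    UNION ALL
    SELECT p.id, p.name, g.grade
    FROM grade g LEFT JOIN person p ON p.id = g.person_id
    WHERE p.id IS NULL
""").fetchall()

for row in rows:
    print(row)
```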

&lt;p&gt;Quite simple! So we can share our SQL analysis with the JS teacher and move on to a special kind of join called CROSS JOIN to perform more analyses! Have an amazing and fulfilled week ahead in this new year! 😉&lt;/p&gt;

</description>
      <category>sql</category>
      <category>datascience</category>
      <category>postgres</category>
      <category>dataanalysis</category>
    </item>
    <item>
      <title>The AI Alpha Geek: It starts with EDA! - Part D</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Mon, 30 Nov 2020 19:14:18 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-d-hk9</link>
      <guid>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-d-hk9</guid>
      <description>&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i"&gt;The AI Alpha Geek: It starts with EDA! - Part A&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3"&gt;The AI Alpha Geek: It starts with EDA! - Part B&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-c-2nle"&gt;The AI Alpha Geek: It starts with EDA! - Part C&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, let's see feature relationships i.e exploring 2 or more features together. Let's look at the code example below - &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Let's take a look at the &lt;strong&gt;Pclass&lt;/strong&gt; and &lt;strong&gt;Survived&lt;/strong&gt; features below produced from Line 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmr72bctabj2ufs35ykgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmr72bctabj2ufs35ykgc.png" alt="Alt Text" width="792" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the count plot output above, it seems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a lot more people in the lower class, i.e class 3, didn't survive.&lt;/li&gt;
&lt;li&gt;more people in the upper class, i.e class 1, survived.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To know the why's behind the insights above, you can ask questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;could it be that passengers in the upper class had the opportunity to escape because they were situated on the upper deck of the titanic? &lt;/li&gt;
&lt;li&gt;probably when the ship hit the iceberg, the lower deck flooded and some passengers drowned?&lt;/li&gt;
&lt;li&gt;perhaps those at the upper deck were given preferential treatment?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you must also be wondering which gender had a higher survival rate? Look below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fakm8asx4u1e8z7tyc8t9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fakm8asx4u1e8z7tyc8t9.png" alt="Alt Text" width="786" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From above, remember from &lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3"&gt;Part B&lt;/a&gt; that the total number of people who did not survive is &lt;strong&gt;549&lt;/strong&gt; - so, we can see that a lot more males didn't survive. &lt;br&gt;
The above insight brings more questions to mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;could it be that women and children were saved first before adult males?&lt;/li&gt;
&lt;li&gt;could it be more males gave their lives for their loved ones?&lt;/li&gt;
&lt;/ul&gt;
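&lt;p&gt;Under the hood, that count plot is just tallying (Sex, Survived) pairs. A plain-Python sketch of the same tally (the sample records below are made up for illustration; the real plot counts the full training set):&lt;/p&gt;

```python
from collections import Counter

# Tiny made-up sample of (sex, survived) pairs standing in for the
# Titanic columns used by the count plot.
records = [
    ("male", 0), ("male", 0), ("male", 1),
    ("female", 1), ("female", 1), ("female", 0),
    ("male", 0),
]

counts = Counter(records)
for (sex, survived), n in sorted(counts.items()):
    print(sex, survived, n)
```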

&lt;p&gt;Now, let's see how more than 2 features relate below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fprofrvx5wubfbka3bh9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fprofrvx5wubfbka3bh9f.png" alt="Alt Text" width="782" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the bar plot output above, produced from Line 7, it seems that: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;for each class, passengers with a younger average age survived.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, looking at features &lt;strong&gt;Age&lt;/strong&gt;, &lt;strong&gt;Sex&lt;/strong&gt; and &lt;strong&gt;Survived&lt;/strong&gt; below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv35z6vn49916ytv9a6a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv35z6vn49916ytv9a6a2.png" alt="Alt Text" width="786" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above, we can see that the males who survived had an average age of 27.28 years, while the females who survived averaged 28.86 years.&lt;/p&gt;

&lt;p&gt;Let's visually explore the &lt;strong&gt;Fare&lt;/strong&gt; feature output for Lines 13 and 14 below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Faj4u9wkk4hd2a3nwio17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Faj4u9wkk4hd2a3nwio17.png" alt="Alt Text" width="794" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl08c5fir5fd2cdwngm4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl08c5fir5fd2cdwngm4k.png" alt="Alt Text" width="794" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above, it seems that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;males who survived paid an average fare of 40.82,&lt;/li&gt;
&lt;li&gt;females who survived paid an average fare of 51.94, and&lt;/li&gt;
&lt;li&gt;those who did not survive paid a much lower average fare, for both males and females.&lt;/li&gt;
&lt;/ul&gt;
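&lt;p&gt;Those bar plots boil down to a grouped mean. A minimal sketch of that computation with the standard library (the fares below are invented for illustration):&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean

# Made-up (sex, survived, fare) rows; the real data comes from train_df.
rows = [
    ("male", 1, 50.0), ("male", 1, 31.6), ("male", 0, 10.0),
    ("female", 1, 60.0), ("female", 1, 43.9), ("female", 0, 12.0),
]

# Group the fares by (sex, survived), then average each group.
fares = defaultdict(list)
for sex, survived, fare in rows:
    fares[(sex, survived)].append(fare)

avg_fare = {key: round(mean(values), 2) for key, values in fares.items()}
print(avg_fare)
```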

&lt;p&gt;When you run other lines of code for other features that include &lt;strong&gt;Embarked&lt;/strong&gt;, &lt;strong&gt;Parch&lt;/strong&gt;, and &lt;strong&gt;SibSp&lt;/strong&gt;, you will draw much more valuable insights from the data.&lt;/p&gt;

&lt;p&gt;We sure got more insights, which improve how much we understand our data. So, always try to understand or ask about the &lt;strong&gt;why&lt;/strong&gt; behind insights discovered. Stay tuned for the next part, where we collate all insights and valuable patterns and dive into &lt;strong&gt;Feature Engineering&lt;/strong&gt;, still using the Titanic dataset. Have an amazing December! 😉&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datavisualization</category>
      <category>eventdriven</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The AI Alpha Geek: It starts with EDA! - Part C</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Sat, 31 Oct 2020 18:32:05 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-c-2nle</link>
      <guid>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-c-2nle</guid>
      <description>&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i"&gt;The AI Alpha Geek: It starts with EDA! - Part A&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3"&gt;The AI Alpha Geek: It starts with EDA! - Part B&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here, we will explore the numerical features in this dataset - &lt;strong&gt;Age&lt;/strong&gt; and &lt;strong&gt;Fare&lt;/strong&gt;. Let's look at the code example below - &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 

&lt;p&gt;Looking at the &lt;strong&gt;Age&lt;/strong&gt; feature, the box plot in &lt;strong&gt;Line 1&lt;/strong&gt; above outputs the below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fthb6ofvemjr644jv3zza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fthb6ofvemjr644jv3zza.png" alt="Alt Text" width="698" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above is a box plot, sometimes called a &lt;strong&gt;box and whisker&lt;/strong&gt; plot. It is a good way to visualize &lt;strong&gt;the spread of the Age variable&lt;/strong&gt;. Some interesting things to note from the plot are -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It seems most people that boarded the Titanic are between the ages of 20 and 40.&lt;/li&gt;
&lt;li&gt;The protruding lines on both sides shaped like T are the &lt;strong&gt;whiskers&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The line that divides the blue box is actually the &lt;strong&gt;Median&lt;/strong&gt; of the dataset - So by looking at the plot, it seems the median Age is close to 30.&lt;/li&gt;
&lt;li&gt;The range of this variable is the largest value in the dataset (&lt;strong&gt;the rightmost point of the whisker&lt;/strong&gt;) minus the lowest value in the dataset (&lt;strong&gt;the leftmost whisker point&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Q1, Q2, and Q3 are the 1st, 2nd, and 3rd quartiles of the &lt;em&gt;Age&lt;/em&gt; variable - note that &lt;strong&gt;Q1 is the same as the 25th percentile&lt;/strong&gt;, &lt;strong&gt;Q2 is the Median of the dataset, which is also called the 50th percentile&lt;/strong&gt;, while &lt;strong&gt;Q3 is the same as the 75th percentile&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The IQR (interquartile range) is &lt;strong&gt;b - a&lt;/strong&gt;, which is the difference between Q3 and Q1.&lt;/li&gt;
&lt;li&gt;Any points you see after the whiskers of the box plot above are referred to as &lt;strong&gt;outliers&lt;/strong&gt;. Outliers are data points that are extremely high or low, which makes them far away from other data points. Kindly note that not all outliers are bad data - they may have been entered correctly and simply be genuine extreme values. Hence, there is a need to investigate the reason behind any outliers before actually going ahead to handle them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Essentially, the box plot can be seen as the visual representation of the summary statistics table shown in &lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3"&gt;part b&lt;/a&gt;, as it graphically shows most of the statistics.&lt;/p&gt;
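&lt;p&gt;The statistics the box plot visualizes can also be computed directly with Python's statistics module - a sketch with invented ages standing in for the Age column, using the conventional 1.5 * IQR whisker rule:&lt;/p&gt;

```python
from statistics import quantiles

# Invented ages standing in for the Age column.
ages = [4, 19, 22, 24, 28, 29, 30, 33, 38, 41, 66, 80]

# Cut points at the 25th, 50th and 75th percentiles (Q1, Q2, Q3).
q1, q2, q3 = quantiles(ages, n=4)
iqr = q3 - q1                                   # the "b - a" interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits

# Points beyond the whiskers are flagged as outliers.
outliers = [a for a in ages if a > upper or lower > a]

print(q1, q2, q3, iqr)
print(outliers)
```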

&lt;p&gt;Moving on, the distribution plots for the &lt;strong&gt;Age&lt;/strong&gt; and &lt;strong&gt;Fare&lt;/strong&gt; features, produced by Lines 2 and 5 above, can be seen below - &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6d7vh7apsuio1r0hyy0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6d7vh7apsuio1r0hyy0q.png" alt="Alt Text" width="784" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fauy2usj9u2qrktqrc3r4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fauy2usj9u2qrktqrc3r4.png" alt="Alt Text" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From above, the &lt;strong&gt;Age&lt;/strong&gt; feature distribution is roughly symmetric: the left side of the distribution is approximately a mirror image of the right side. So, it has a &lt;strong&gt;roughly normal distribution&lt;/strong&gt;, which means far more data points lie near the mean than far from it - i.e &lt;strong&gt;the area under the curve is roughly equally distributed on either side of the centre line&lt;/strong&gt;.&lt;br&gt;
The &lt;strong&gt;Fare&lt;/strong&gt; feature, on the other hand, is &lt;strong&gt;skewed to the right&lt;/strong&gt;. We can see that the &lt;strong&gt;tail of the distribution&lt;/strong&gt; is on the right-hand side, hence this is called a &lt;strong&gt;right-skewed distribution&lt;/strong&gt; or a &lt;strong&gt;positively skewed distribution&lt;/strong&gt;. Here, &lt;strong&gt;the bulk of the area under the curve sits on the left side&lt;/strong&gt;. The &lt;strong&gt;tail of the Fare distribution&lt;/strong&gt; above represents outliers - now do you see how skewed distributions and outliers are related? Also, from the distribution plot for the &lt;strong&gt;Fare&lt;/strong&gt; variable, &lt;strong&gt;it seems a lot of people went for the cheaper tickets&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Understanding these distributions also plays a role in efficiently handling missing data. For example, when there are outliers, using the mean to fill in missing values is not ideal because outliers skew the mean, so we may want to go for the median statistic instead.&lt;/p&gt;
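&lt;p&gt;That point is easy to check: a single extreme value drags the mean far more than the median (the fares below are invented for illustration):&lt;/p&gt;

```python
from statistics import mean, median

fares = [7.25, 8.05, 8.46, 13.0, 15.5, 512.33]  # one extreme outlier

print(round(mean(fares), 2))    # pulled way up by the 512.33 outlier
print(round(median(fares), 2))  # barely affected, so safer for imputation
```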

&lt;p&gt;Sure hope you enjoyed this piece. Stay tuned for the next part of this series as we go ahead to explore feature relationships to squeeze out more insights from our data! 😉&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datavisualization</category>
      <category>eventdriven</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The AI Alpha Geek: It starts with EDA! - Part B</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Wed, 30 Sep 2020 20:07:43 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3</link>
      <guid>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-b-8l3</guid>
      <description>&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i"&gt;The AI Alpha Geek: It starts with EDA! - Part A&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Before we start exploring each individual feature, let's take a look at some statistics for the dataset produced by &lt;code&gt;train_df.drop('PassengerId', axis=1).describe()&lt;/code&gt; below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffl0lh2knsvkz5h6ym6io.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffl0lh2knsvkz5h6ym6io.png" alt="Summary Stats" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the summary statistics above, looking at the &lt;strong&gt;Age&lt;/strong&gt; feature for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;count&lt;/strong&gt; is 714, which tells us there are 177 missing entries since the total entries are 891 - we would need to deal with this later on when handling missing values,&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;mean&lt;/strong&gt; age is 29.699, which is the average (i.e typical) age of the passengers aboard,&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;std (standard deviation)&lt;/strong&gt; of 14.526 tells us that most of the passengers fall roughly in the age range (29.699 - 14.526) to (29.699 + 14.526), i.e about 15 to 44 years,&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;min&lt;/strong&gt; age is 0.42, which tells us the youngest passenger on board was a baby (0.42 years is about 5 months),&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;25th percentile&lt;/strong&gt; of 20.125 years shows that 25% of passengers are younger than 20.125 years, &lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;50th percentile&lt;/strong&gt;, which is the median of 28 years, tells us that half of the passengers onboard are below 28 years old - it seems most of the passengers were young,&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;75th percentile&lt;/strong&gt;, which is 38, tells us that 75% of the passengers are less than 38 years old, and&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;max&lt;/strong&gt; age is 80 years, which is the age of the eldest passenger onboard - luckily, it seems there were no aliens onboard.&lt;/li&gt;
&lt;/ul&gt;
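&lt;p&gt;To see how describe() treats missing entries, here is a hand-rolled sketch of its count and mean fields (the ages below are invented; None marks a missing Age):&lt;/p&gt;

```python
from statistics import mean, stdev

# Invented ages; None plays the role of NaN in the real Age column.
ages = [22.0, None, 38.0, 26.0, None, 35.0, 29.0, 0.42, None, 80.0]

present = [a for a in ages if a is not None]
missing = len(ages) - len(present)   # describe() simply excludes these

print(len(present), missing)                         # count and missing
print(round(mean(present), 2), round(stdev(present), 2))  # mean and std
```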

&lt;p&gt;Now, it's time for some &lt;strong&gt;univariate analysis&lt;/strong&gt; - this is just descriptive analysis of one variable at a time, which helps us understand the data distribution for that variable and even detect outliers. Let's start with the &lt;strong&gt;categorical variables&lt;/strong&gt; - &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;From the code example above, take a look at the output for the target variable, &lt;strong&gt;Survived&lt;/strong&gt;, below -&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4nf3mhcbo8nwgn5cbc0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4nf3mhcbo8nwgn5cbc0o.png" alt="Output Example" width="800" height="762"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;value_counts()&lt;/strong&gt; is used to get the counts of unique values for this column - and it seems a lot more people did not survive. Note that the dataset is not perfectly balanced, but the imbalance is not severe - the number of those who didn't survive is not overwhelmingly larger than the number who survived.&lt;/li&gt;
&lt;li&gt;to get the percentages of each class (i.e survived - 1 and deceased - 0), set the &lt;strong&gt;normalize&lt;/strong&gt; parameter of value_counts() to True.&lt;/li&gt;
&lt;li&gt;to have a better view of the count for each class, we use &lt;strong&gt;count plot&lt;/strong&gt; via Seaborn. The &lt;strong&gt;label_chart()&lt;/strong&gt; is just a helper function to label the chart.&lt;/li&gt;
&lt;/ul&gt;
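&lt;p&gt;What value_counts() and its normalize option compute can be sketched without pandas (the 0/1 labels mirror the Survived column; the sample below is invented):&lt;/p&gt;

```python
from collections import Counter

survived = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]   # toy stand-in for the column

counts = Counter(survived)
total = len(survived)
proportions = {label: n / total for label, n in counts.items()}

print(counts)        # raw counts, like value_counts()
print(proportions)   # shares, like value_counts(normalize=True)
```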

&lt;p&gt;Let's see some insights gathered from the code output from &lt;strong&gt;eda_part_b.py&lt;/strong&gt; above - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the &lt;strong&gt;Pclass&lt;/strong&gt; feature, it seems a lot more people on board were in class 3. From &lt;a href="https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i"&gt;Part A&lt;/a&gt; of this series, we saw that these are people in the lower socio-economic class, which seems to mean most onboard got the cheap ticket,&lt;/li&gt;
&lt;li&gt;Looking at the &lt;strong&gt;Sex&lt;/strong&gt; feature, it seems more males boarded, as 64.76% of passengers are male,&lt;/li&gt;
&lt;li&gt;Most passengers boarded from the Southampton port, and it seems most passengers came alone since most have 0 siblings and/or travelled with just a nanny.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, all these give us more insights to explore further - stay tuned for the next parts of this series, where we go ahead to explore individual numerical variables for patterns. Wish you an awesome October!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datavisualization</category>
      <category>eventdriven</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The AI Alpha Geek: It starts with EDA! - Part A</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Mon, 31 Aug 2020 17:54:08 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i</link>
      <guid>https://dev.to/joyadauche/the-ai-alpha-geek-it-starts-with-eda-part-a-2l1i</guid>
      <description>&lt;p&gt;...so just give me all your data and I will quickly use one of those boosting ensemble libraries like XGBoost or LightGBM or probably Catboost and then do some stacking and give you a very high performing model! That's how we roll! -  hmmm, You Wish! 😆&lt;/p&gt;

&lt;p&gt;If you really want to be a &lt;del&gt;good&lt;/del&gt; great AI engineer, you need to first understand your data and build intuition about your data and this is where &lt;strong&gt;Exploratory Data Analysis&lt;/strong&gt; (EDA) comes in!&lt;/p&gt;

&lt;p&gt;The ability to dig into data and derive trends or patterns or relationships is a superpower! 😊. EDA helps you get a better understanding of your data, validate your hypothesis,  derive insights and new features in order to get the best performing model.&lt;/p&gt;

&lt;p&gt;We will go through an example of how EDA is performed using the &lt;a href="https://www.kaggle.com/c/titanic/data" rel="noopener noreferrer"&gt;Titanic Dataset&lt;/a&gt; - &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;So, from &lt;strong&gt;lines 1 to 11&lt;/strong&gt;, some Python packages are imported; and then some display options and styling are set up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For &lt;em&gt;line 2 through 5&lt;/em&gt;, pandas and NumPy for data analysis and numerical computation respectively are imported. Also, we have Matplotlib and Seaborn for data visualization. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Lines 7 and 8&lt;/em&gt; help us see all columns and rows when using the &lt;em&gt;head()&lt;/em&gt; method while &lt;em&gt;Line 9&lt;/em&gt; controls the width of the display in characters.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Line 11&lt;/em&gt; just customizes whatever plots we are going to create. It sets the aesthetic style of our plots.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, from &lt;strong&gt;lines 14 - 20&lt;/strong&gt;, the data is read in and the process of data exploration begins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Line 15&lt;/em&gt; reads in the data while &lt;em&gt;Line 16&lt;/em&gt; gives the number of records (rows) and features (columns) in our dataset.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Line 17&lt;/em&gt; returns the top 10 rows in the data.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Lines 18 through 20&lt;/em&gt; list the features, show their data types, and give a concise summary of our dataset, respectively.&lt;/li&gt;
&lt;/ul&gt;
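&lt;p&gt;A runnable sketch of the setup described above - the display options are real pandas calls, while the tiny inline frame is an assumption standing in for reading train.csv:&lt;/p&gt;

```python
import pandas as pd

# Display options like the ones the article describes: show every
# column and row, and widen the console output.
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", 120)

# A tiny invented frame stands in for pd.read_csv("train.csv").
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

print(train_df.shape)       # (rows, columns), like Line 16
print(train_df.head(10))    # top rows, like Line 17
print(train_df.dtypes)      # feature data types
train_df.info()             # concise summary of the frame
```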

&lt;p&gt;Now, at this point, we really need to understand what each feature actually represents because it will determine the &lt;strong&gt;data type&lt;/strong&gt;, which informs &lt;strong&gt;how that feature is preprocessed&lt;/strong&gt; - and feature preprocessing plays a role in deciding the model to be used in order to gain optimal performance. Also, understanding each feature helps &lt;strong&gt;feature generation&lt;/strong&gt;. You see how all these tie in now, right?&lt;/p&gt;

&lt;p&gt;So, we have to do some research (in this case, get some domain knowledge about ship transport as regards the Titanic) to know what each feature entails. Here, we explain some features and see that -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PassengerId&lt;/strong&gt; - is the unique id that identifies a passenger &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Survived&lt;/strong&gt; - tells if the passenger survived or not. 0 is for deceased (No) while 1 is for survived (Yes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pclass&lt;/strong&gt; - is the passenger's class, which can be 1, 2 or 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SibSp&lt;/strong&gt; - gives the number of siblings and spouses aboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parch&lt;/strong&gt; - tells the number of parents and children aboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cabin&lt;/strong&gt; - is the cabin number the passenger is in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embarked&lt;/strong&gt; - gives the port of embarkation/boarding/departure which can be C (Cherbourg) or Q (Queenstown) or S (Southampton).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's dive into the top 10 rows produced by &lt;strong&gt;train_df.head(10)&lt;/strong&gt; below - &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F39ivki39oqr5ljr67kbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F39ivki39oqr5ljr67kbi.png" alt="Alt Text" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From above, we have mainly Categorical and Numerical features in our dataset - &lt;strong&gt;Categorical features&lt;/strong&gt;, also called &lt;strong&gt;qualitative features&lt;/strong&gt;, are features that can take a limited number of possible values. They can even take on numerical values, but you cannot perform math operations on them because those values have no mathematical meaning. There is, however, a kind of categorical feature whose values are ordered meaningfully, called an &lt;strong&gt;ordinal feature&lt;/strong&gt;. The categorical features in our titanic dataset are &lt;em&gt;Survived&lt;/em&gt;, &lt;em&gt;Name&lt;/em&gt;, &lt;em&gt;Ticket&lt;/em&gt;, &lt;em&gt;SibSp&lt;/em&gt;, &lt;em&gt;Parch&lt;/em&gt;, &lt;em&gt;Sex&lt;/em&gt;, &lt;em&gt;Cabin&lt;/em&gt;, &lt;em&gt;Embarked&lt;/em&gt; and &lt;em&gt;Pclass&lt;/em&gt; - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Survived&lt;/strong&gt; and &lt;strong&gt;Sex&lt;/strong&gt; are binary categorical features,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name&lt;/strong&gt; is a categorical feature which has text values,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embarked&lt;/strong&gt;, &lt;strong&gt;SibSp&lt;/strong&gt; and &lt;strong&gt;Parch&lt;/strong&gt; are categorical features with more than 2 values,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ticket&lt;/strong&gt; is a categorical feature with a mix of numeric and alphanumeric values. This feature might actually mean something: from my little research, I think it can be used to find potential family members or nannies for each passenger - which could perhaps become a newly generated feature, &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cabin&lt;/strong&gt; is a categorical feature that is alphanumeric,&lt;/li&gt;
&lt;li&gt;Although &lt;strong&gt;Pclass&lt;/strong&gt; has a numeric datatype, it is actually an &lt;strong&gt;ordered categorical feature&lt;/strong&gt; i.e an ordinal feature which is ordered in a meaningful way - 1 is for 1st class; 2 is for 2nd class and 3 is for 3rd class. During your feature description research, you would learn that these classes reflect the passenger's socio-economic status.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Numerical features, on the other hand, also referred to as &lt;strong&gt;quantitative features&lt;/strong&gt;, are features that have meaning in terms of measurement (continuous data) or counting (discrete data). The numerical features are &lt;em&gt;Fare&lt;/em&gt; (continuous), &lt;em&gt;Age&lt;/em&gt; (continuous), and &lt;em&gt;PassengerId&lt;/em&gt; (actually just an ID feature that identifies each passenger).&lt;/p&gt;
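&lt;p&gt;This split can be double-checked quickly with pandas. Below is a minimal, self-contained sketch that rebuilds just the first 2 rows of the training data with a subset of the columns as a stand-in for the full &lt;strong&gt;train_df&lt;/strong&gt;:&lt;/p&gt;

```python
import pandas as pd

# Stand-in for train_df: the first 2 rows of the training data, subset of columns
train_df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.2833],
})

# Text columns are stored as 'object'; numeric columns as int/float
categorical_cols = train_df.select_dtypes(include="object").columns.tolist()
numerical_cols = train_df.select_dtypes(include="number").columns.tolist()

print(categorical_cols)  # ['Name', 'Sex']
print(numerical_cols)    # ['PassengerId', 'Survived', 'Pclass', 'Age', 'Fare']
```

&lt;p&gt;Note that &lt;em&gt;Survived&lt;/em&gt; and &lt;em&gt;Pclass&lt;/em&gt; land on the numeric side here purely because of how they are stored - their meaning, as discussed above, is still categorical.&lt;/p&gt;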

&lt;p&gt;Now, let's take a look at the concise summary of the dataset produced by &lt;strong&gt;train_df.info()&lt;/strong&gt; below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh0ct27qc34yc6jeouxil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh0ct27qc34yc6jeouxil.png" alt="Alt Text" width="674" height="584"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the summary of our dataset above, getting some kind of domain knowledge via research on the data would help us in a number of ways -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;to know if the feature has the correct data type, which, when rightly converted, helps in feature generation and also helps in saving memory. For example, from above, the output tells us that all categorical types are stored as &lt;strong&gt;object&lt;/strong&gt; data types. So, converting some of these to &lt;strong&gt;categorical&lt;/strong&gt; data types would help save some more memory.&lt;/li&gt;
&lt;li&gt;to understand why there are missing values - from above, the &lt;em&gt;Age&lt;/em&gt;, &lt;em&gt;Cabin&lt;/em&gt; and &lt;em&gt;Embarked&lt;/em&gt; features have 714, 204, and 889 entries respectively, short of the expected 891 non-null entries - 🤔 so, we can try to find out why this is the case - Yeah! have a curious mindset &lt;/li&gt;
&lt;li&gt;to know if the values of a feature are intuitive and actually contain the expected values. For example, we all know, at least in these times (unlike in the days of old), that humans hardly live up to 200 years. So, &lt;strong&gt;when we do further data exploration in part B of this article series&lt;/strong&gt; and see a passenger aged above 200 - it is one of 2 things - &lt;strong&gt;either it is an error or perhaps the titanic had some vampires or aliens onboard!&lt;/strong&gt; 😂 - do not rule anything out - anything is possible, so do your research and be sure!&lt;/li&gt;
&lt;/ul&gt;
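&lt;p&gt;The first two points above can be sketched with pandas - an illustrative example on a tiny made-up frame (not the real &lt;strong&gt;train_df&lt;/strong&gt;), converting an &lt;strong&gt;object&lt;/strong&gt; column to &lt;strong&gt;category&lt;/strong&gt; and counting missing entries:&lt;/p&gt;

```python
import pandas as pd

# Made-up frame: 1,000 rows, 2 text columns stored as 'object'
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"] * 250,
    "Embarked": ["S", "C", "Q", None] * 250,
})

# 'category' stores small integer codes instead of repeated Python strings
before = df["Sex"].memory_usage(deep=True)
df["Sex"] = df["Sex"].astype("category")
after = df["Sex"].memory_usage(deep=True)
print(before > after)  # True - the converted column uses less memory

# Count missing entries per column, like info() hinted at
print(df.isnull().sum()["Embarked"])  # 250
```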

&lt;p&gt;So, this is just a tiny little bit of the process involved when you are starting out the exploration of your datasets! Stay tuned for the next parts on this topic, in this same series, where we look at what statisticians call the &lt;strong&gt;Five-number summary&lt;/strong&gt;, then we will start &lt;strong&gt;analyzing each individual feature&lt;/strong&gt; and then go ahead to understand feature relationships and more in order to squeeze out all the insights from our data! Now, you make sure you have an amazing week ahead! 😉&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datavisualization</category>
      <category>eventdriven</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Math ML Maestro: Introducing Linear Algebra Applications</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Fri, 31 Jul 2020 16:55:21 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-math-ml-maestro-introducing-linear-algebra-applications-1imb</link>
      <guid>https://dev.to/joyadauche/the-math-ml-maestro-introducing-linear-algebra-applications-1imb</guid>
<description>&lt;p&gt;Do you remember back then during school days when you did some vector or matrix operations? Or did some probability and statistics? Or performed some differentiation and integration and then got confused about what the math question you had just finished solving was really asking 😆? Haha! You know right?&lt;/p&gt;

&lt;p&gt;Ermm! hope you were pretty attentive during those Math classes 👀 because when we do Machine Learning, &lt;strong&gt;we are essentially solving a Math problem&lt;/strong&gt;, and one of the Math topics with enormous applications in Machine learning is &lt;strong&gt;Linear Algebra&lt;/strong&gt;!!!&lt;/p&gt;

&lt;p&gt;Linear Algebra plays a huge role in machine learning! Back then, we modelled real-world problems into systems of linear equations and solved these equations via substitution, via elimination or via the graphing method. Take a look at a linear system with 2 unknowns below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;A system is linear when its variables, e.g x and y above, are raised to a power or exponent of 1 i.e they are first-degree variables. In &lt;em&gt;1.md&lt;/em&gt; above, there are just 2 unknowns, so it can easily be solved via the substitution, elimination or graphing methods. So, Yeah! we can solve that with pen and paper, but machines can't solve it in that form! Just imagine each equation as a row or observation in a dataset, where the right side of the equation is the target (or response or dependent variable) and the left side represents the features (or predictors or independent/input variables). Then, think of a system of 1000 equations with 1000 unknowns 🤔 - it would be time- and energy-consuming to solve these using the methods mentioned earlier. This challenge makes room for &lt;strong&gt;Matrices&lt;/strong&gt; - a key data structure in linear algebra that helps handle much more data than just the 2 observations in &lt;em&gt;1.md&lt;/em&gt; above. To represent the above equations in matrix form, see below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fegfoopv4x935lsfg5ohj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fegfoopv4x935lsfg5ohj.jpg" alt="Alt Text" width="395" height="136"&gt;&lt;/a&gt;&lt;/p&gt;
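&lt;p&gt;Once the system is in matrix form, a machine can solve it directly. A small sketch with NumPy, using a made-up 2-unknown system (not the one in &lt;em&gt;1.md&lt;/em&gt;):&lt;/p&gt;

```python
import numpy as np

# Hypothetical 2-unknown linear system (for illustration only):
#   2x + 3y = 8
#   1x - 1y = -1
A = np.array([[2.0, 3.0],
              [1.0, -1.0]])  # coefficient matrix (the "features" side)
b = np.array([8.0, -1.0])    # right-hand side (the "targets")

# The machine-friendly equivalent of substitution/elimination
x = np.linalg.solve(A, b)
print(x)  # [1. 2.]  i.e x = 1, y = 2
```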

&lt;p&gt;There are numerous examples of Linear Algebra in Machine Learning. To do Machine learning (ML), data needs to be in such a way that ML models or ML algorithms can ingest it - Most ML models take numeric inputs - Datasets usually contain a number of observations (rows) characterized by features (columns). The Dataset is a &lt;strong&gt;matrix&lt;/strong&gt; with each column representing a &lt;strong&gt;vector&lt;/strong&gt; - a collection of vectors is a matrix. Matrix Operations, such as addition, subtraction, multiplication, transpose etc are applied in Machine Learning! &lt;/p&gt;

&lt;p&gt;Also, in this light, for an image classification problem, the inputs to the neural network are &lt;strong&gt;tensors&lt;/strong&gt;. For a classic Artificial Neural Network (ANN), a set of grayscale images is a 3D tensor, where the 1st dimension is the index of the image and the other 2 dimensions are the dimensions of the array that contains the image pixels. To feed these images into an ANN, this 3D tensor is flattened into a 2D tensor, where the 1st dimension tells which row an image corresponds to and the 2nd dimension is a single vector that contains the pixels of that image. Images are represented as tensors so computers can process them and guess what? &lt;strong&gt;A tensor is just a generalization of vectors and matrices to potentially higher dimensions&lt;/strong&gt;!&lt;/p&gt;
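&lt;p&gt;That flattening step is a one-liner with NumPy - a sketch with made-up all-zero "images":&lt;/p&gt;

```python
import numpy as np

# 5 grayscale images of 28x28 pixels: a 3D tensor (index, height, width)
images = np.zeros((5, 28, 28))

# Flatten each image into a single pixel vector: a 2D tensor (index, pixels)
flat = images.reshape(5, 28 * 28)
print(flat.shape)  # (5, 784)
```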

&lt;p&gt;Furthermore, depending on the prediction task, when we do some data preprocessing using the popular one-hot encoding technique to convert categorical variables, we get binary vectors, which are a better data format for ML algorithms to be trained on in order to give better predictions.&lt;/p&gt;
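&lt;p&gt;For example, one-hot encoding an &lt;em&gt;Embarked&lt;/em&gt;-like column with pandas (illustrative values only):&lt;/p&gt;

```python
import pandas as pd

ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Each category becomes its own binary vector column
one_hot = pd.get_dummies(ports["Embarked"], prefix="Embarked")
print(one_hot.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```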

&lt;p&gt;So, Linear Algebra has lots and lots of applications. See some examples below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Want to implement &lt;strong&gt;Principal Component Analysis (PCA)&lt;/strong&gt;, which is a popular dimensionality reduction technique to avert the curse of dimensionality? Linear Algebra is used here!&lt;/li&gt;
&lt;li&gt;What about &lt;strong&gt;Word Embeddings&lt;/strong&gt; which is in the field of Natural Language Processing (NLP) that seems like the hottest field in Machine Learning right now? Linear Algebra is applied here too! 😎&lt;/li&gt;
&lt;li&gt;Did you just say &lt;strong&gt;Optimizing Deep Learning Models&lt;/strong&gt;? still Linear Algebra!&lt;/li&gt;
&lt;li&gt;Even in &lt;strong&gt;Computer Vision&lt;/strong&gt;, &lt;strong&gt;Encoding Data&lt;/strong&gt; as I briefly explained above and lots more that I have not even mentioned, Linear Algebra shows itself! Yeah! We are so stuck with it!  😂&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now! Will you keep running away from Math or will you rather just embrace it whole-heartedly? 😄&lt;/p&gt;

&lt;p&gt;So, I just shed some light on how Linear Algebra is used in Machine Learning. Stay tuned on this series where we start dissecting one application at a time, going in a little deeper, starting with &lt;strong&gt;The Maths behind Linear Regression&lt;/strong&gt; in order to understand the ins and outs of it. Have an amazing week ahead!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>maths</category>
      <category>datascience</category>
      <category>linearalgebra</category>
    </item>
    <item>
      <title>The Big Data Bravura: Introducing Apache Spark</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Tue, 30 Jun 2020 18:55:10 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-big-data-bravura-introducing-apache-spark-2od</link>
      <guid>https://dev.to/joyadauche/the-big-data-bravura-introducing-apache-spark-2od</guid>
      <description>&lt;p&gt;Did you just say you need to handle a minimum of 100 TB of data (&lt;strong&gt;volume&lt;/strong&gt;) that is generated at high speed (&lt;strong&gt;velocity&lt;/strong&gt;) from different sources consisting of structured data like CSV’s, semistructured data like log files and unstructured data like video files (&lt;strong&gt;variety&lt;/strong&gt;) that are also trustworthy and representative (&lt;strong&gt;veracity&lt;/strong&gt;) and can give insights that can lead to groundbreaking discoveries and reduce costs (&lt;strong&gt;value&lt;/strong&gt;)? 😲 Good Gracious! This is Big Data!&lt;/p&gt;

&lt;p&gt;We would need a cluster of machines and not just a single machine to process big data, and this is where Spark comes into play. With Spark, you can distribute data and its computation among the nodes in a cluster - with each node holding a subset of the data, data processing is done in parallel over the nodes. Spark does all this in memory, which makes it lightning fast!!!⚡&lt;/p&gt;

&lt;p&gt;Spark is made up of several components. One of them is the &lt;strong&gt;Spark Core&lt;/strong&gt;, which is the heart of Apache Spark and the basis for the other components. The &lt;em&gt;Spark Core&lt;/em&gt; uses a data structure called the &lt;strong&gt;R&lt;/strong&gt;esilient &lt;strong&gt;D&lt;/strong&gt;istributed &lt;strong&gt;D&lt;/strong&gt;ataset (RDD), which is the fundamental data structure in Apache Spark. &lt;/p&gt;

&lt;p&gt;To develop Spark solutions, we can use Scala, Python, Java or R. Here, I would develop an introductory Spark application using Python via &lt;a href="https://spark.apache.org/docs/latest/api/python/index.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Pyspark&lt;/strong&gt;&lt;/a&gt;, which is the Python API for Apache Spark. We would take a look at an introductory example using an RDD - The &lt;em&gt;.csv&lt;/em&gt; file used in this example is &lt;a href="https://gist.github.com/joyadauche/a4d1b03cafd224ae2644f26f19ede126#file-student_subject-csv" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Without further ado, as Spark does it ⚡, let's jump right in below -&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;From &lt;em&gt;1.py&lt;/em&gt; above - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we compute the average number of subjects by class. In the 
&lt;a href="https://gist.github.com/joyadauche/a4d1b03cafd224ae2644f26f19ede126#file-student_subject-csv" rel="noopener noreferrer"&gt;&lt;strong&gt;student_subject.csv&lt;/strong&gt;&lt;/a&gt;, an example entry 
&lt;strong&gt;s400&lt;/strong&gt;,&lt;strong&gt;c204&lt;/strong&gt;,&lt;strong&gt;10&lt;/strong&gt; represents the &lt;strong&gt;student_id&lt;/strong&gt;, 
&lt;strong&gt;class_id&lt;/strong&gt; and &lt;strong&gt;number of subjects&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lines 1 to 4&lt;/strong&gt; - we import the needed pyspark classes 
and setup the configuration and use it to instantiate the 
SparkContext class. &lt;strong&gt;local&lt;/strong&gt; creates a local cluster with 
only 1 core on your local machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Line 14&lt;/strong&gt; - we read in the csv file, which now 
becomes an RDD (&lt;strong&gt;student_subject_rdd&lt;/strong&gt;), where every line 
entry is a value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Line 16&lt;/strong&gt; - we transform the &lt;strong&gt;student_subject_rdd&lt;/strong&gt; 
into an RDD of key-value pairs of &lt;strong&gt;class_id&lt;/strong&gt; and 
&lt;strong&gt;number_of_subjects&lt;/strong&gt; e.g &lt;strong&gt;('c204', 10)&lt;/strong&gt;.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lines 7 to 11&lt;/strong&gt; - the &lt;strong&gt;get_class_and_subject&lt;/strong&gt; 
function, splits each line entry by a comma, gets the 
needed fields by indexing and then returns them.&lt;/li&gt;
&lt;li&gt;Note that on &lt;strong&gt;Line 10&lt;/strong&gt;, the &lt;strong&gt;number_of_subjects&lt;/strong&gt; is 
cast explicitly into an int.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Lines 18 to 19&lt;/strong&gt; - transforms the RDD further by adding 
1 as part of the values. One key difference between &lt;strong&gt;map&lt;/strong&gt; 
and &lt;strong&gt;mapValues&lt;/strong&gt; is that with &lt;strong&gt;mapValues&lt;/strong&gt;, the keys 
cannot be modified, so it is not even passed in. i.e key- 
value pairs of &lt;strong&gt;('c204', 10)&lt;/strong&gt; passes in just &lt;strong&gt;10&lt;/strong&gt;. So, 
&lt;strong&gt;modified_class_subject_rdd&lt;/strong&gt; will contain something like 
&lt;strong&gt;('c204', (10, 1))&lt;/strong&gt;, where &lt;strong&gt;c204&lt;/strong&gt; is the key and &lt;strong&gt;(10,1)&lt;/strong&gt; is the value.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Lines 21 to 23&lt;/strong&gt; - &lt;strong&gt;reduceByKey&lt;/strong&gt; combines items 
together for the same key.

&lt;ul&gt;
&lt;li&gt;remember &lt;strong&gt;modified_class_subject_rdd&lt;/strong&gt;, can return 
multiple items for the same key. e.g &lt;strong&gt;[('c204', 
(10, 1)), ('c204', (8, 1)), ('c204', (7, 1)), ('c204', 
(6, 1)), ('c204', (7, 1))]&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;with the reduceByKey transformation, it becomes &lt;strong&gt;[('c204', (38, 
5))]&lt;/strong&gt;, where &lt;strong&gt;c204&lt;/strong&gt; is the key and &lt;strong&gt;(38, 
5)&lt;/strong&gt; is the value representing &lt;strong&gt;sum total of subjects 
done&lt;/strong&gt; and &lt;strong&gt;frequency count&lt;/strong&gt; respectively for 
class_id &lt;strong&gt;c204&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Lines 25 to 26&lt;/strong&gt; - compute the average by class while 
&lt;strong&gt;Lines 28 to 32&lt;/strong&gt; produce an array and print the 
results.&lt;/li&gt;

&lt;/ul&gt;
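&lt;p&gt;To see the data flow the bullets above describe without needing a cluster, here is a plain-Python sketch of the same average-by-key logic - an illustrative rewrite that mirrors &lt;em&gt;1.py&lt;/em&gt;'s steps, not the gist itself:&lt;/p&gt;

```python
from collections import defaultdict

# Tiny stand-in for student_subject.csv: student_id,class_id,number_of_subjects
lines = ["s400,c204,10", "s401,c204,8", "s402,c205,7"]

# map step: split each line and keep (class_id, number_of_subjects)
pairs = []
for line in lines:
    fields = line.split(",")
    pairs.append((fields[1], int(fields[2])))  # e.g ('c204', 10)

# mapValues step: attach a count of 1 to each value -> ('c204', (10, 1))
with_counts = [(k, (v, 1)) for k, v in pairs]

# reduceByKey step: sum totals and counts per key -> ('c204', (18, 2))
totals = defaultdict(lambda: (0, 0))
for k, (subjects, count) in with_counts:
    t, c = totals[k]
    totals[k] = (t + subjects, c + count)

# final mapValues step: total / count gives the average per class
averages = {k: total / count for k, (total, count) in totals.items()}
print(averages)  # {'c204': 9.0, 'c205': 7.0}
```

&lt;p&gt;In the real Spark version, each of these loops would be a &lt;strong&gt;map&lt;/strong&gt;, &lt;strong&gt;mapValues&lt;/strong&gt; or &lt;strong&gt;reduceByKey&lt;/strong&gt; call distributed over the nodes of the cluster.&lt;/p&gt;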

&lt;p&gt;Awesome! so from above, we cooked up an example Spark application using the RDD. Methods called off an RDD can either be a &lt;strong&gt;transformation&lt;/strong&gt; or an &lt;strong&gt;action&lt;/strong&gt;. A transformation like &lt;strong&gt;mapValues&lt;/strong&gt; just produces another RDD while an action like &lt;strong&gt;collect&lt;/strong&gt; produces a result. Essentially, &lt;strong&gt;transformations&lt;/strong&gt; on an RDD are only executed when an &lt;strong&gt;action&lt;/strong&gt; is called. This concept of &lt;strong&gt;Lazy Evaluation&lt;/strong&gt; increases speed since execution will not start until an &lt;strong&gt;action&lt;/strong&gt; is triggered.&lt;/p&gt;

&lt;p&gt;Spark is amazing and I know you are being Sparked up in becoming a Big Data Bravura. Stay tuned on this series for my next article on &lt;strong&gt;Introducing Spark Dataframes&lt;/strong&gt;, which is a data structure built off the RDD and is much easier to use than the core RDD data structure. Have an amazing Sparked up Week! 😉&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>spark</category>
      <category>pyspark</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The ML Maven: Introducing the Confusion Matrix</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Sun, 31 May 2020 19:30:26 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-ml-maven-introducing-the-confusion-matrix-1de7</link>
      <guid>https://dev.to/joyadauche/the-ml-maven-introducing-the-confusion-matrix-1de7</guid>
      <description>&lt;p&gt;&lt;strong&gt;Random Friend&lt;/strong&gt;: OMG! you won’t believe this - I got a high accuracy value of 88%!!!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me&lt;/strong&gt;: oh really? Sounds interesting!!! From the said metric, it seems to be a classification problem. So, what specific problem is your classifier trying to solve?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random Friend&lt;/strong&gt;: so, my model helps predict whether we are going to have an earthquake or not in Wakanda using data sourced from the Government of Wakanda’s website on earthquake happenings over a period of time. Here are some visual EDA on the training dataset. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me&lt;/strong&gt;: Wawuu! Can I take a look at your confusion matrix?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random Friend&lt;/strong&gt;: Does that really matter? Besides I think I got a high accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me&lt;/strong&gt;: Smiles! It definitely matters especially with the fact that it looks like you are dealing with imbalanced classes as shown by the visualization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random Friend&lt;/strong&gt;: oh I see! Then, let me quickly generate that for you… Here it is below -&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
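&lt;p&gt;A confusion matrix like this can be generated with scikit-learn's &lt;strong&gt;confusion_matrix&lt;/strong&gt; - a minimal sketch on toy labels (not the actual Wakanda predictions):&lt;/p&gt;

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = NO earthquake, 1 = YES earthquake (illustrative only)
y_true = [0, 0, 1, 1, 1, 0]  # what actually happened
y_pred = [0, 1, 1, 1, 0, 0]  # what the model predicted

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]     <- TN, FP
#  [1 2]]    <- FN, TP
```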


&lt;p&gt;&lt;strong&gt;Me&lt;/strong&gt;: Alrightee! Let’s see what we have here and explain certain useful metrics below - &lt;/p&gt;

&lt;p&gt;So, having an accuracy of 88% means that your model is correct 88% of the time and incorrect 12% of the time. Well, since this sounds like a life-and-death situation, that does not seem good enough. Just imagine the number of lives that can be lost in the 12% of cases where the model makes an incorrect prediction 😨!&lt;/p&gt;

&lt;p&gt;It is a very common scenario for one class to outnumber the other - like in your case, the class of &lt;em&gt;NO earthquake occurrences&lt;/em&gt; is more frequent, i.e has more instances, than the class of &lt;em&gt;YES earthquake occurrences&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So, Accuracy is not enough to evaluate the performance of your model - hence the need for a &lt;strong&gt;Confusion Matrix&lt;/strong&gt;. It summarizes a model’s predictive performance, and we can use it to describe the performance of your model - &lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Note that the Positive class is usually &lt;em&gt;1&lt;/em&gt; or a &lt;em&gt;YES&lt;/em&gt; case or &lt;em&gt;it is what we are trying to detect&lt;/em&gt;. The negative class is usually &lt;em&gt;0&lt;/em&gt; or a &lt;em&gt;NO case&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;As you can see from &lt;em&gt;1.md&lt;/em&gt;,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;True Negatives (TN)&lt;/strong&gt;: &lt;em&gt;60&lt;/em&gt; gives the number of NO earthquake occurrences &lt;em&gt;correctly predicted&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True Positives (TP)&lt;/strong&gt;: &lt;em&gt;150&lt;/em&gt; gives the number of YES earthquake occurrences &lt;em&gt;correctly predicted&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False Negatives (FN)&lt;/strong&gt;: &lt;em&gt;10&lt;/em&gt; gives the number of YES earthquake occurrences &lt;em&gt;incorrectly predicted as NO earthquake occurrences&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False Positives (FP)&lt;/strong&gt;: &lt;em&gt;20&lt;/em&gt; gives the number of NO earthquake occurrences &lt;em&gt;incorrectly predicted as YES earthquake occurrences&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;strong&gt;False Positive&lt;/strong&gt; is also known as a &lt;strong&gt;Type 1 error&lt;/strong&gt;. This is when the model predicts &lt;em&gt;that there will be an earthquake but actually there is not&lt;/em&gt;. This is a False alarm! So, this will make the people of Wakanda panic which will cause the government of Wakanda to do all that it can to save lives from the possible earthquake. This can lead to a waste of resources - Perhaps the government had to move people to a different geographic location where they will be catered for by the government.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;False Negative&lt;/strong&gt; is referred to as a &lt;strong&gt;Type 2 error&lt;/strong&gt;. This is when the model predicts &lt;em&gt;that there will be no earthquake but actually there is&lt;/em&gt;. This is catastrophic! The people of Wakanda will be chilling and suddenly an earthquake will greet them 😭!&lt;/p&gt;

&lt;p&gt;We can also calculate some useful metrics like - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt;: This is also called &lt;strong&gt;True Positive Rate&lt;/strong&gt;, or &lt;strong&gt;Sensitivity&lt;/strong&gt; or &lt;strong&gt;Hit Rate&lt;/strong&gt;. It is the probability that an actual positive would be predicted positive i.e It tells us how often an actual YES earthquake occurrence will be predicted a YES earthquake occurrence - what proportion of YES earthquake occurrences are correctly classified or predicted. From &lt;em&gt;1.md&lt;/em&gt;, it is calculated as below - 
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;High recall&lt;/em&gt; means that this model has a &lt;strong&gt;low false-negative rate&lt;/strong&gt; i.e not many actual YES earthquake occurrences were classified or predicted as NO earthquake occurrences - the classifier predicted most YES earthquake occurrences correctly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Specificity&lt;/strong&gt;: This is also called the &lt;strong&gt;True Negative Rate&lt;/strong&gt;.  It is the probability that an actual negative would be predicted negative i.e It tells us how often an actual NO earthquake occurrence will be predicted a NO earthquake occurrence - what proportion of NO earthquake occurrences are correctly classified or predicted. From &lt;em&gt;1.md&lt;/em&gt;, it is calculated as below -
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positive Predictive Value&lt;/strong&gt;: This is also called the &lt;strong&gt;Precision&lt;/strong&gt;. It is the probability that a predicted YES is correct or true i.e It tells us how often the prediction of a YES earthquake occurrence is correct. From &lt;em&gt;1.md&lt;/em&gt;, it is calculated as below -
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;High precision&lt;/em&gt; means that this model has a &lt;strong&gt;low false-positive rate&lt;/strong&gt; i.e not many actual NO earthquake occurrences were classified or predicted as YES earthquake occurrences.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Negative Predictive value&lt;/strong&gt;: It is the probability that a predicted NO is correct or true i.e It tells us how often the prediction of a NO earthquake occurrence is correct. From &lt;em&gt;1.md&lt;/em&gt;, it is calculated as below -
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1 Score&lt;/strong&gt;: this is the &lt;strong&gt;harmonic mean&lt;/strong&gt; of precision and recall. From &lt;em&gt;1.md&lt;/em&gt;, it is calculated as below -
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
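&lt;p&gt;Plugging the counts from &lt;em&gt;1.md&lt;/em&gt; (TP = 150, TN = 60, FP = 20, FN = 10) into these formulas:&lt;/p&gt;

```python
TP, TN, FP, FN = 150, 60, 20, 10

recall = TP / (TP + FN)       # True Positive Rate / Sensitivity
specificity = TN / (TN + FP)  # True Negative Rate
precision = TP / (TP + FP)    # Positive Predictive Value
npv = TN / (TN + FN)          # Negative Predictive Value
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(round(recall, 3))       # 0.938
print(round(specificity, 3))  # 0.75
print(round(precision, 3))    # 0.882
print(round(npv, 3))          # 0.857
print(round(f1, 3))           # 0.909
print(round(accuracy, 3))     # 0.875 - the ~88% accuracy from earlier
```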

&lt;p&gt;With the problem we are trying to solve, perhaps we should be much more concerned with reducing the &lt;strong&gt;False Negatives&lt;/strong&gt; or &lt;strong&gt;Type 2 errors&lt;/strong&gt; i.e when the model predicts there will be no earthquake but there actually is one. This is much more dangerous than the type 1 error in this case. The model incorrectly classified 10 cases in which earthquakes occurred by saying they did not occur. &lt;strong&gt;Just imagine chilling with some fresh orange juice and watching Black Panther on Netflix and suddenly the ground starts shaking 😱!!!&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The metric you choose to optimize depends on the problem being solved.   Let us take a look at some scenarios below - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the &lt;strong&gt;occurrence of false negatives is unacceptable&lt;/strong&gt;, then choose to optimize &lt;strong&gt;Recall&lt;/strong&gt; - like we would want to do for the earthquake occurrence problem. Here, we won’t mind getting extra false positives just to reduce the number of false negatives i.e we would rather say that an earthquake will occur when it will not, RATHER than say an earthquake will not occur when it does.&lt;/li&gt;
&lt;li&gt;If the &lt;strong&gt;occurrence of false positives is unacceptable&lt;/strong&gt;, then choose to optimize &lt;strong&gt;Specificity&lt;/strong&gt;. Let me give an example where false positives should not be overlooked - say I am trying to predict if a patient has coronavirus. Since I am trying to detect coronavirus, having it (a yes or 1) would represent the positive class while being healthy would represent the negative class. So, if I carry out a test where a patient detected or predicted as positive (i.e as having coronavirus) would be quarantined, I would want to make sure a healthy person is not detected as having coronavirus. In this case, we would not accept any false positives.&lt;/li&gt;
&lt;li&gt;If you want to be &lt;strong&gt;extra sure about the true positives&lt;/strong&gt;, choose to optimize &lt;strong&gt;Precision&lt;/strong&gt;. For example, if we are detecting coronavirus, testing centres would want to be very confident that a patient classified or predicted as having the virus truly has it.&lt;/li&gt;
&lt;li&gt;Choose to optimize &lt;strong&gt;F1 score&lt;/strong&gt; if you need a balance between &lt;em&gt;Precision&lt;/em&gt; and &lt;em&gt;Recall&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Random Friend&lt;/strong&gt;: Amazinggg! With these explanations, I will definitely work on improving my model’s performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me&lt;/strong&gt;: You are always welcome! Excited you are becoming a Machine Learning Maven! Stay tuned on this series on &lt;strong&gt;Introducing the ROC Curve!&lt;/strong&gt; Have an amazing and fulfilled week ahead!&lt;/p&gt;

</description>
      <category>modelevaluation</category>
      <category>machinelearning</category>
      <category>classification</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The Data Viz Wiz: Introducing Matplotlib</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Thu, 30 Apr 2020 18:20:56 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-data-viz-wiz-introducing-matplotlib-54g5</link>
      <guid>https://dev.to/joyadauche/the-data-viz-wiz-introducing-matplotlib-54g5</guid>
      <description>&lt;p&gt;A picture is worth a thousand words 😀! You can &lt;em&gt;derive valuable insights&lt;/em&gt; from data and also &lt;em&gt;communicate these insights&lt;/em&gt; via data visualization.&lt;/p&gt;

&lt;p&gt;We would clearly see trends and derive insights via Python’s &lt;strong&gt;Matplotlib&lt;/strong&gt; library, which is the foundational library used by many visualization tools. There are 3 main layers in Matplotlib’s architecture and, from the highest-level interface to the lowest, they are - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;matplotlib.pyplot&lt;/strong&gt; module  - the &lt;em&gt;scripting layer&lt;/em&gt; which is often called &lt;strong&gt;procedural plotting&lt;/strong&gt; and is used when you want to quickly create plots and get done with it. This layer is designed to work like a MATLAB script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;matplotlib.artist&lt;/strong&gt; module  - the &lt;em&gt;artist layer&lt;/em&gt; which is often called &lt;strong&gt;object-oriented plotting&lt;/strong&gt; and with which you can do a lot more customizations because you have much more control. &lt;strong&gt;&lt;em&gt;Note that this layer also uses the pyplot module for a few functions like creating the figure - we would see in the examples below that even in the object-oriented approach, pyplot is still used to create the figure, which holds anything plotted&lt;/em&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;matplotlib.backend_bases&lt;/strong&gt; module  - the &lt;em&gt;backend layer&lt;/em&gt; - Matplotlib can be used in many ways and also has different output formats e.g Matplotlib can be run from the python shell, where plotting windows pop up; or it can be run via Jupyter notebooks, where plots are drawn inline. So, the backend layer exists to support these several use cases and outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, with what we have above, there are essentially 2 ways to create plots in Matplotlib - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The procedural way - this is where we mostly do &lt;em&gt;plt.xxx&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The object-oriented way - this is where we mostly do &lt;em&gt;ax.xxx&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything we plot in Matplotlib is contained in a &lt;em&gt;figure object&lt;/em&gt; which can contain one or more &lt;em&gt;axes&lt;/em&gt;. Take a look at the &lt;em&gt;figure anatomy&lt;/em&gt; image below -&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F86qv27z6t7ys5ok9br9q.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F86qv27z6t7ys5ok9br9q.jpg" alt="Alt Text" width="499" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We would focus on the object-oriented way in this piece since we can do a lot more customization with it. Also, note that there are different ways to create an axes - &lt;em&gt;an axes is contained in a figure&lt;/em&gt; as seen above - and these different ways all produce the same result; I will highlight them below. Now, let’s kick off some Matplotlib plotting by taking a look at &lt;em&gt;1.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;1.py&lt;/em&gt; produces a &lt;em&gt;figure with 1 axes&lt;/em&gt; image below -&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0uki7m4zgfp5jcgw47t0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0uki7m4zgfp5jcgw47t0.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;1.py&lt;/em&gt; above - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line 1 = pyplot is a module in Matplotlib which will help us in plotting. It is conventionally imported with the alias &lt;em&gt;plt&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Line 3 = when &lt;em&gt;plt.subplots()&lt;/em&gt; is called without any arguments, it creates 2 objects - a &lt;strong&gt;Figure&lt;/strong&gt; object and an &lt;strong&gt;Axes&lt;/strong&gt; object. 
The &lt;em&gt;Figure&lt;/em&gt; object is like a container that holds the axes and a &lt;em&gt;figure can contain multiple axes&lt;/em&gt;.
The &lt;em&gt;Axes&lt;/em&gt; object is where we plot our data to visualize it&lt;/li&gt;
&lt;li&gt;Line 4 = displays the plot - which is a &lt;em&gt;figure with empty axes&lt;/em&gt; because no data has been added yet&lt;/li&gt;
&lt;/ul&gt;
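A minimal sketch, assuming 1.py looks roughly like the explanation above describes (a Figure and one Axes created, then shown) - the non-interactive backend line is my own addition so the sketch runs anywhere:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this sketch runs anywhere
import matplotlib.pyplot as plt

# create two objects at once: a Figure (the container) and an Axes (where data is plotted)
fig, ax = plt.subplots()

# display the plot - a figure with empty axes, since no data has been added yet
plt.show()
```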

&lt;p&gt;There are different ways of creating an axes in Matplotlib - &lt;em&gt;plt.subplot()&lt;/em&gt;, &lt;em&gt;plt.subplots()&lt;/em&gt; and &lt;em&gt;plt.axes()&lt;/em&gt; are all from the &lt;em&gt;scripting layer&lt;/em&gt; and they correspond to &lt;em&gt;fig.add_subplot()&lt;/em&gt;, &lt;em&gt;fig.subplots()&lt;/em&gt; and &lt;em&gt;fig.add_axes()&lt;/em&gt; from the &lt;em&gt;artist layer&lt;/em&gt;. &lt;em&gt;Lines 6-44&lt;/em&gt; show other ways of creating an axes which produce the same result.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line 16 - &lt;strong&gt;fig.add_subplot(1, 1, 1)&lt;/strong&gt; means 1 row, 1 column and the last argument gives the position of the subplot, which is the 1st subplot in this case - &lt;em&gt;the last argument has to be less than or equal to the product of the 1st and 2nd arguments&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Line 28 - &lt;strong&gt;fig.subplots(1, 1)&lt;/strong&gt; means 1 row and 1 column&lt;/li&gt;
&lt;li&gt;Lines 39 and 43 - the common argument list &lt;strong&gt;[0.1, 0.1, 0.8, 0.8]&lt;/strong&gt; places the axes 10% from the left of the figure and 10% from the bottom, with a width of 80% and a height of 80% of the figure.&lt;/li&gt;
&lt;/ul&gt;
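The artist-layer alternatives above can be sketched as follows - this is an assumption of what the corresponding gist lines contain, not the exact code:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# 1 row, 1 column, 1st subplot - the last argument must be <= rows * columns
fig1 = plt.figure()
ax1 = fig1.add_subplot(1, 1, 1)

# 1 row and 1 column
fig2 = plt.figure()
ax2 = fig2.subplots(1, 1)

# [left, bottom, width, height] as fractions of the figure:
# 10% from the left, 10% from the bottom, 80% wide, 80% tall
fig3 = plt.figure()
ax3 = fig3.add_axes([0.1, 0.1, 0.8, 0.8])
```

Each pair produces the same result: one figure holding one axes.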

&lt;p&gt;Now let’s add some &lt;strong&gt;fictional data&lt;/strong&gt; to our figure! See &lt;em&gt;2.py&lt;/em&gt; below - &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;2.py&lt;/em&gt; produces a &lt;strong&gt;&lt;em&gt;line plot of Lagos average monthly temperature&lt;/em&gt;&lt;/strong&gt; below -&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh13gxwwd3blpsol16d2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh13gxwwd3blpsol16d2m.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the line plot above, we can clearly see the temperature pattern: it increases from &lt;em&gt;Jan&lt;/em&gt; to &lt;em&gt;Jul&lt;/em&gt;, decreases from &lt;em&gt;Aug&lt;/em&gt; to &lt;em&gt;Nov&lt;/em&gt;, then starts increasing again. Imagine you have lots of data - would you rather go through the pain of reading off average temperatures from a table or use the line plot, which shows the trends in the data much more clearly?&lt;/p&gt;
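A sketch of the kind of code 2.py holds - the temperature values here are made up to follow the pattern just described, not the article's actual numbers:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# fictional Lagos averages: rising to Jul, falling to Nov, rising again
lagos_temps = [25, 26, 27, 28, 29, 30, 31, 30, 28, 27, 26, 28]

fig, ax = plt.subplots()
ax.plot(months, lagos_temps)  # plot the data on the Axes object
plt.show()
```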

&lt;p&gt;We can even add more &lt;strong&gt;fictional data&lt;/strong&gt; like in &lt;em&gt;3.py&lt;/em&gt; below -&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;3.py&lt;/em&gt; above produces a &lt;strong&gt;&lt;em&gt;line plot of Lagos and Abuja average monthly temperature&lt;/em&gt;&lt;/strong&gt; we see below - &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxesynhe1rkpndioqh3ue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxesynhe1rkpndioqh3ue.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the image above, we can clearly see -  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Abuja is warmer than Lagos for the first 8 months&lt;/strong&gt; - perhaps you might prefer chilling in Lagos for the first 8 months of the year 😉?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abuja has a drop in temperature in September which is even lower than that of Lagos for the same month&lt;/strong&gt; - seems you might want to travel back to Abuja this time perhaps to meet with family 😉?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;But hey chill! looks like for the rest of the year, there is a rise in temperature which is higher than that of Lagos&lt;/strong&gt; - Ermmm! I think you might want to just chill in Lagos for a bit and monitor the trends for a while 😉?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The line plots we have seen so far show us the monthly trends for the average temperature across different cities, but &lt;strong&gt;they do not communicate the data in a way that can be easily understood&lt;/strong&gt;. This is where we have to &lt;strong&gt;&lt;em&gt;customize&lt;/em&gt;&lt;/strong&gt; our plot in order to communicate the information more clearly. Let’s see &lt;em&gt;4.py&lt;/em&gt; below for some customizations -&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;4.py&lt;/em&gt; above produces a &lt;strong&gt;&lt;em&gt;customized line plot of Lagos average monthly temperature values&lt;/em&gt;&lt;/strong&gt; below - &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fboiiuqenco7mfvsg1glt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fboiiuqenco7mfvsg1glt.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;4.py&lt;/em&gt; above - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In lines 12-14, we added these arguments:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;marker&lt;/strong&gt; which shows the actual data points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;markersize&lt;/strong&gt;, &lt;strong&gt;markerfacecolor&lt;/strong&gt;, &lt;strong&gt;markeredgewidth&lt;/strong&gt;, &lt;strong&gt;markeredgecolor&lt;/strong&gt; which customize the marker by &lt;em&gt;increasing its size&lt;/em&gt;, &lt;em&gt;adding colour to its fill&lt;/em&gt; and giving the marker outline a &lt;em&gt;width&lt;/em&gt; and &lt;em&gt;colour&lt;/em&gt; respectively&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;linestyle&lt;/strong&gt;, &lt;strong&gt;linewidth&lt;/strong&gt; and &lt;strong&gt;color&lt;/strong&gt; which give the line in the plot its style, width and color. &lt;em&gt;linestyle&lt;/em&gt; can be shortened to &lt;em&gt;ls&lt;/em&gt; while &lt;em&gt;linewidth&lt;/em&gt; can be shortened to &lt;em&gt;lw&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Line 16 - sets the label for the x-axis&lt;/li&gt;
&lt;li&gt;Line 17 - sets the label for the y-axis&lt;/li&gt;
&lt;li&gt;Line 18 - sets the title for the line plot and this provides &lt;em&gt;context for our visualization&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Note that every &lt;em&gt;customization&lt;/em&gt; is done before a call to &lt;em&gt;plt.show()&lt;/em&gt; is made.&lt;/li&gt;
&lt;/ul&gt;
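Putting those arguments together, a 4.py-style customized plot might look like the sketch below - the data, colours and label strings are illustrative assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
lagos_temps = [25, 26, 27, 28, 29, 30, 31, 30, 28, 27, 26, 28]  # fictional

fig, ax = plt.subplots()
ax.plot(months, lagos_temps,
        marker="o", markersize=8,                      # show and size the data points
        markerfacecolor="white",                       # fill colour of each marker
        markeredgewidth=1.5, markeredgecolor="teal",   # marker outline width and colour
        linestyle="--", linewidth=2, color="teal")     # ls, lw and colour of the line

ax.set_xlabel("Month")
ax.set_ylabel("Average Temperature")
ax.set_title("Lagos Average Monthly Temperature")  # context for the visualization
plt.show()  # every customization happens before this call
```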

&lt;p&gt;To customize the &lt;strong&gt;&lt;em&gt;line plot of Lagos and Abuja average monthly temperature&lt;/em&gt;&lt;/strong&gt;, see &lt;em&gt;5.py&lt;/em&gt; below - &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;5.py&lt;/em&gt; above produces a &lt;em&gt;&lt;strong&gt;customized line plot of Lagos and Abuja average monthly temperature values&lt;/strong&gt;&lt;/em&gt; below -&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1i9a1iqdq7zxamomljd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1i9a1iqdq7zxamomljd2.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sometimes, when we add more data to a plot like in the &lt;em&gt;customized line plot of Lagos and Abuja average monthly temperature values&lt;/em&gt; above, it looks so busy that it becomes a big mess which conceals patterns or trends in the data rather than conveying them. The solution to this is to use &lt;strong&gt;subplots&lt;/strong&gt;. Subplots are several small plots which show the same kind of data under different conditions e.g. temperature values for different cities. Let’s see &lt;em&gt;6.py&lt;/em&gt; below for an example -&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;6.py&lt;/em&gt; above produces &lt;strong&gt;&lt;em&gt;subplots of the average monthly temperature across cities&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F95974z0nl3nur9srq0lc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F95974z0nl3nur9srq0lc.png" alt="Alt Text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;6.py&lt;/em&gt; above - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line 16 - &lt;em&gt;sharey = True&lt;/em&gt; ensures subplots have the same range on the y-axis based on the data from both datasets&lt;/li&gt;
&lt;li&gt;Line 22 - since the subplots are on top of each other, we can just add the x-axis label to the bottom plot&lt;/li&gt;
&lt;/ul&gt;
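A sketch in the spirit of 6.py, with both cities' values made up - note how &lt;em&gt;sharey=True&lt;/em&gt; and the single bottom x label from the notes above appear here:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
lagos_temps = [25, 26, 27, 28, 29, 30, 31, 30, 28, 27, 26, 28]  # fictional
abuja_temps = [27, 28, 29, 30, 31, 32, 33, 31, 26, 28, 29, 30]  # fictional

# 2 rows, 1 column; sharey=True gives both subplots the same y-axis range
fig, ax = plt.subplots(2, 1, sharey=True)

ax[0].plot(months, lagos_temps)
ax[0].set_ylabel("Lagos")

ax[1].plot(months, abuja_temps)
ax[1].set_ylabel("Abuja")
ax[1].set_xlabel("Month")  # only the bottom subplot needs the x label

plt.show()
```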

&lt;p&gt;In the example given in &lt;em&gt;6.py&lt;/em&gt; above, we got a 1-dimensional axes object since one of the dimensions is 1. For a 2-dimensional array, we can access the object in several ways, as we see in &lt;em&gt;lines 29-69&lt;/em&gt; - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1a, 1b and 1c show different ways we can access the axes object: via &lt;strong&gt;regular indexing we do in Python&lt;/strong&gt;, via &lt;strong&gt;flattening the 2D array&lt;/strong&gt; and via &lt;strong&gt;tuple unpacking&lt;/strong&gt; respectively.&lt;/li&gt;
&lt;li&gt;The rest shows the different usage patterns when &lt;em&gt;ax&lt;/em&gt; is an array of axes objects.&lt;/li&gt;
&lt;/ul&gt;
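The three access patterns for a 2-dimensional grid of axes (regular indexing, flattening, tuple unpacking) can be sketched like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# a 2x2 grid: ax is a 2-dimensional NumPy array of Axes objects
fig, ax = plt.subplots(2, 2)

top_left = ax[0, 0]              # 1a: regular indexing we do in Python/NumPy
also_top_left = ax.flatten()[0]  # 1b: flatten the 2D array, then index it

# 1c: tuple unpacking straight into four named axes
fig2, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
```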

&lt;p&gt;As seen above, you can create axes in different ways. For example, an axes added to the figure via &lt;em&gt;fig.add_axes()&lt;/em&gt; is not a subplot, but an axes which is an object of &lt;em&gt;matplotlib.axes._axes.Axes&lt;/em&gt;. An axes created via the subplot way is a &lt;em&gt;matplotlib.axes._subplots.AxesSubplot&lt;/em&gt;. This class &lt;em&gt;derives&lt;/em&gt; from &lt;em&gt;matplotlib.axes._axes.Axes&lt;/em&gt;, thus this &lt;em&gt;subplot is an axes&lt;/em&gt;. Hence, &lt;strong&gt;every subplot is an Axes object but not every Axes object is an AxesSubplot object&lt;/strong&gt;. An axes contains the x-axis and the y-axis. Be it singular or plural, it is still called &lt;em&gt;axes&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;Alrightee!!! Glad we are gradually demystifying the mystical Matplotlib library. I know you will eventually get the hang of it and begin your tour in becoming a Data Viz Wiz! Stay tuned to this series for my next article on &lt;strong&gt;Visualizing categorical and quantitative variables via Matplotlib&lt;/strong&gt;! Have an amazing and fulfilled week ahead!&lt;/p&gt;

</description>
      <category>python</category>
      <category>matplotlib</category>
      <category>datavisualization</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The Pandas Pundit: Accessing Data in DataFrames</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Tue, 31 Mar 2020 18:56:10 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-pandas-pundit-accessing-data-in-dataframes-4164</link>
      <guid>https://dev.to/joyadauche/the-pandas-pundit-accessing-data-in-dataframes-4164</guid>
      <description>&lt;p&gt;Congratulations -  You have just landed a new job as a Data Scientist 😀! In your first month, you need to start analyzing &lt;em&gt;tons of data&lt;/em&gt;. But before you start unlocking insights and predicting future trends, you need to &lt;em&gt;access&lt;/em&gt; these data in order to explore it. Yikes! Did you wish you just started predicting future trends right away? Smiles, You are not Doctor Fate!&lt;/p&gt;

&lt;p&gt;These tons of data can vary greatly in form and they are commonly seen in a tabular structure, where we have rows (also known as records, observations etc) and columns (also known as features, variables, fields etc) - like in &lt;em&gt;1.md&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Data in .csv and .xlsx files have a tabular-like structure, and in order to work efficiently with this kind of data in &lt;em&gt;Python&lt;/em&gt;, we need to use the &lt;em&gt;Pandas&lt;/em&gt; package. In Pandas, there is a data structure that can handle a tabular-like structure of data - this data structure is called the &lt;strong&gt;DataFrame&lt;/strong&gt;. Look at &lt;em&gt;2.md&lt;/em&gt; below to see the DataFrame version of &lt;em&gt;1.md&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;2.md&lt;/em&gt;, you can see a similar structure like in &lt;em&gt;1.md&lt;/em&gt; - we also have rows and columns - each row has a unique row label - &lt;em&gt;NG&lt;/em&gt;, &lt;em&gt;CA&lt;/em&gt;, &lt;em&gt;BR&lt;/em&gt;, &lt;em&gt;CH&lt;/em&gt;, &lt;em&gt;FR&lt;/em&gt;. The columns also have labels - &lt;em&gt;country&lt;/em&gt;, &lt;em&gt;capital&lt;/em&gt;, &lt;em&gt;population_millions&lt;/em&gt;. So, how do you put this data in a DataFrame to start exploring? Also for you to explore it well, &lt;strong&gt;what are the different ways to access the data in this DataFrame&lt;/strong&gt;? Cheers! You are about to start your journey on becoming a Pandas Pundit!&lt;/p&gt;
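One way to put that table into a DataFrame is to build it by hand - a hypothetical sketch using the row and column labels above; the population figures are purely illustrative:

```python
import pandas as pd

# hypothetical countries DataFrame; values are illustrative
countries = pd.DataFrame(
    {"country": ["Nigeria", "Canada", "Brazil", "China", "France"],
     "capital": ["Abuja", "Ottawa", "Brasilia", "Beijing", "Paris"],
     "population_millions": [200, 37, 209, 1433, 67]},
    index=["NG", "CA", "BR", "CH", "FR"])  # unique row labels

print(countries)
```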

&lt;p&gt;You are given tons of data in a CSV file as seen below in &lt;em&gt;3.csv&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;First of all, to get the data in &lt;em&gt;3.csv&lt;/em&gt; into a DataFrame, look at &lt;em&gt;4.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;which returns the DataFrame in &lt;em&gt;5.txt&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Now that we have our data in a DataFrame, it is time to access it. There are several ways to access or select or index or subset or slice data in DataFrames - Data can be accessed via:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;square brackets: [ ]&lt;/li&gt;
&lt;li&gt;loc: label-based &lt;/li&gt;
&lt;li&gt;iloc: position-based&lt;/li&gt;
&lt;li&gt;at: label-based&lt;/li&gt;
&lt;li&gt;iat: position-based&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s see how you can access data in &lt;em&gt;columns only&lt;/em&gt;, &lt;em&gt;rows only&lt;/em&gt; and &lt;em&gt;both rows and columns&lt;/em&gt; from the DataFrame in &lt;strong&gt;5.txt&lt;/strong&gt; using the 5 ways above:&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;square brackets [ ]&lt;/strong&gt; - 1
&lt;/h1&gt;

&lt;p&gt;Let's look at &lt;strong&gt;column access&lt;/strong&gt; and &lt;strong&gt;row access&lt;/strong&gt; using []:&lt;br&gt;
&lt;strong&gt;1.1&lt;/strong&gt;: &lt;strong&gt;Column Access&lt;/strong&gt;:&lt;br&gt;
We have &lt;em&gt;single column access&lt;/em&gt; and &lt;em&gt;multiple column access&lt;/em&gt;.&lt;br&gt;
&lt;strong&gt;1.1.1&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;single column access&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
To access data in the &lt;em&gt;Country&lt;/em&gt; column in &lt;em&gt;5.txt&lt;/em&gt; above, for example, we do:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;which returns:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;7.txt&lt;/em&gt; above, the dtype (datatype) of what is returned is an object. The type of object returned can be known using:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;8.py&lt;/em&gt; above, it is a pandas &lt;em&gt;Series&lt;/em&gt; object. A &lt;em&gt;pandas series&lt;/em&gt; is a 1D (1-dimensional) labelled array - just like the DataFrame, a series has row labels/indexes. So, with this, it shows that &lt;strong&gt;a collection of series creates a DataFrame&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;This series object returned can also be accessed using the square brackets. For example, to grab the value &lt;em&gt;Nigeria&lt;/em&gt; in &lt;em&gt;7.txt&lt;/em&gt; above, see &lt;em&gt;9.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Also, note that I can use the &lt;em&gt;dot notation&lt;/em&gt; as seen in &lt;em&gt;lines 19-20&lt;/em&gt; above. Use the dot notation only when &lt;strong&gt;the column name does not contain any special characters or spaces, is not a Python keyword and does not clash with an existing DataFrame attribute or method&lt;/strong&gt;.&lt;/p&gt;
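On a hypothetical countries DataFrame shaped like 5.txt (illustrative values), single column access, the dot notation and indexing the returned Series look like this:

```python
import pandas as pd

# hypothetical data shaped like the article's countries DataFrame
countries = pd.DataFrame(
    {"country": ["Nigeria", "Canada", "Brazil", "China", "France"],
     "capital": ["Abuja", "Ottawa", "Brasilia", "Beijing", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"])

col = countries["country"]    # single square brackets return a Series
same_col = countries.country  # dot notation - the same Series
value = col["NG"]             # index the Series by row label
```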

&lt;p&gt;However, if you want a DataFrame returned and not a series - use double square brackets as seen in &lt;em&gt;10.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;1.1.2&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;multiple column access&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
To access more than one column in &lt;em&gt;5.txt&lt;/em&gt;, we do:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;which returns:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;1.2&lt;/strong&gt;: &lt;strong&gt;Row Access&lt;/strong&gt;:&lt;br&gt;
The only way to access rows in a DataFrame using square brackets is by specifying a slice on the rows. A slicing index takes the form - &lt;strong&gt;start:stop:step/stride&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Indices are either &lt;strong&gt;numeric&lt;/strong&gt;, which is the default, or &lt;strong&gt;labelled&lt;/strong&gt;. Let's dive deeper below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.2.1&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;default numeric indices&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
Given &lt;strong&gt;&lt;em&gt;[x:y:z]&lt;/em&gt;&lt;/strong&gt; as a slicing index, it means &lt;strong&gt;count in increments of &lt;em&gt;z&lt;/em&gt; starting at &lt;em&gt;x&lt;/em&gt; inclusive, up to &lt;em&gt;y&lt;/em&gt; exclusive&lt;/strong&gt; - for numeric indexes, &lt;em&gt;the stop index is always exclusive&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Take a look at this figure below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fks6435xuikd2xzs5bwff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fks6435xuikd2xzs5bwff.png" alt="Alt Text" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the figure above, the direction in which my rows are returned is determined by the &lt;strong&gt;&lt;em&gt;sign of the step/stride&lt;/em&gt;&lt;/strong&gt; i.e z, given &lt;em&gt;[x:y:z]&lt;/em&gt;. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the step/stride is positive, start from the specified index position of the DataFrame and go in the &lt;strong&gt;&lt;em&gt;downward/forward&lt;/em&gt;&lt;/strong&gt; direction when returning rows.&lt;/li&gt;
&lt;li&gt;If it is negative, start from the specified index position of the DataFrame and move &lt;strong&gt;&lt;em&gt;upwards/backwards&lt;/em&gt;&lt;/strong&gt; when returning rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Given [:y:-z] as a slicing index, start from the last row in the DataFrame and go backwards/upwards, but if [:y:z] is given, start from the first row in the DataFrame and go forward/downward&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Whether moving forward/downward or backward/upward, the start index should always come before the stop index in the direction of traversal, else no rows will be returned.&lt;/p&gt;

&lt;p&gt;Let us take a look below at how &lt;em&gt;positive&lt;/em&gt; and &lt;em&gt;negative&lt;/em&gt; strides work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.2.1.1&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;positive step(s)/stride(s)&lt;/em&gt;&lt;/strong&gt; :&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;13.py, lines 3-9&lt;/em&gt; above, we use the default numeric index. It has this structure &lt;em&gt;start:stop:[step or stride]&lt;/em&gt; - [step or stride] in square brackets means it is optional. Given &lt;em&gt;countries[1:3]&lt;/em&gt;, &lt;em&gt;1&lt;/em&gt; is the start while &lt;em&gt;3&lt;/em&gt; is the stop. When the step or stride is not specified, as is the case here, it has a default value of 1.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 11-18&lt;/em&gt; above - given &lt;em&gt;countries[2:]&lt;/em&gt; - note that the stop index is omitted i.e the explicit end index position is omitted. &lt;em&gt;countries[2:]&lt;/em&gt; returns rows starting with the row at index position 2 up to the last row in the DataFrame inclusive, as seen above. &lt;strong&gt;:&lt;/strong&gt; is a universal slice. If its left endpoint (start) is omitted, the rows returned start from the very first row in the DataFrame, but if the right endpoint (stop) is omitted, the rows returned run through to the very last row in the DataFrame inclusive.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 20-27&lt;/em&gt; above - given &lt;em&gt;countries[:3]&lt;/em&gt; - note that the start index is omitted i.e the explicit start index position is omitted.&lt;br&gt;
&lt;em&gt;countries[:3]&lt;/em&gt; returns rows starting with the first row till the row with index position 2 inclusive - remember that a &lt;em&gt;numeric stop index&lt;/em&gt; is exclusive: so, the row at index position 3 is not returned as seen above.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 29-38&lt;/em&gt; above, given &lt;em&gt;countries[:]&lt;/em&gt; - note that both the start and stop indices are omitted i.e the explicit start and stop index positions are omitted. &lt;em&gt;countries[:]&lt;/em&gt; returns rows from the first row through the last row inclusive i.e it returns every row.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 40-47&lt;/em&gt; above, given &lt;em&gt;countries[::2]&lt;/em&gt;, using the formula above, this means &lt;strong&gt;count in &lt;em&gt;increments/steps/strides&lt;/em&gt; of 2 &lt;em&gt;starting&lt;/em&gt; from the first row &lt;em&gt;up to&lt;/em&gt; the last row inclusive&lt;/strong&gt; i.e it returns every 2nd row. So, this is how it works - it returns rows&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;starting from the first row which has index 0, &lt;/li&gt;
&lt;li&gt;add steps of 2 i.e index 0 + 2 = index 2; then index 2 + 2 = index 4 &lt;/li&gt;
&lt;li&gt;so, we have rows with index positions 0, 2, 4 returned&lt;/li&gt;
&lt;/ul&gt;
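These positive-stride rules can be checked quickly on a hypothetical countries DataFrame (same row labels as above, illustrative values):

```python
import pandas as pd

# hypothetical data with the article's row labels
countries = pd.DataFrame(
    {"country": ["Nigeria", "Canada", "Brazil", "China", "France"]},
    index=["NG", "CA", "BR", "CH", "FR"])

print(countries[1:3])    # positions 1 and 2, stop exclusive: CA, BR
print(countries[::2])    # every 2nd row: NG, BR, FR
print(countries[-2:])    # last two rows: CH, FR
print(countries[-1:-1])  # start excludes its own stop: empty DataFrame
```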

&lt;p&gt;In &lt;em&gt;13.py, lines 49-54&lt;/em&gt; above, given &lt;em&gt;countries[-1:]&lt;/em&gt; - note the following below -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the DataFrame, countries, has rows labelled &lt;em&gt;NG&lt;/em&gt;, &lt;em&gt;CA&lt;/em&gt;, &lt;em&gt;BR&lt;/em&gt;, &lt;em&gt;CH&lt;/em&gt;, &lt;em&gt;FR&lt;/em&gt;. These rows also have default numeric indices which can be positive &lt;em&gt;0&lt;/em&gt;, &lt;em&gt;1&lt;/em&gt;, &lt;em&gt;2&lt;/em&gt;, &lt;em&gt;3&lt;/em&gt;, &lt;em&gt;4&lt;/em&gt; or negative &lt;em&gt;-5&lt;/em&gt;, &lt;em&gt;-4&lt;/em&gt;, &lt;em&gt;-3&lt;/em&gt;, &lt;em&gt;-2&lt;/em&gt;, &lt;em&gt;-1&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;countries[-1:]&lt;/em&gt; is the same as &lt;em&gt;countries[-1::1]&lt;/em&gt; - so we start from the last row and go downwards - downwards since the step/stride is positive&lt;/li&gt;
&lt;li&gt;this returns only the last row because there is no other row downwards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 56-64&lt;/em&gt; above, given &lt;em&gt;countries[:-1]&lt;/em&gt; - it returns rows starting from the first row but excludes the last row.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 66-72&lt;/em&gt; above, given &lt;em&gt;countries[-2:]&lt;/em&gt; - it returns rows starting from the last but one row till the last row inclusive.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 74-81&lt;/em&gt; above, given &lt;em&gt;countries[:-2]&lt;/em&gt; - it returns rows starting from the first row up to, but excluding, the last two rows.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;13.py, lines 83-88&lt;/em&gt; above, given &lt;em&gt;countries[-1:-1]&lt;/em&gt; - it returns an empty DataFrame: it starts from the last row but also excludes the last row, hence nothing is returned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.2.1.2&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;negative step(s)/stride(s)&lt;/em&gt;&lt;/strong&gt; :&lt;br&gt;
With negative steps, rows get returned backwards. Let us see some examples in 14.py below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;14.py, lines 3-8&lt;/em&gt; above, given &lt;em&gt;countries[3:-4:-1]&lt;/em&gt; - note the following below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the stride/step is negative, so rows get returned backwards/upwards - when you look at the tabular-like structure in &lt;em&gt;5.txt&lt;/em&gt; above, we start getting rows from the end depending on the specified index and then go upwards/backwards i.e &lt;em&gt;from the row with label CH and then move upwards&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;the stop has a negative value of -4, which is the row labelled &lt;em&gt;CA&lt;/em&gt;. So, this row and the ones beyond it are not included in the rows returned&lt;/li&gt;
&lt;li&gt;hence we have just two rows labelled &lt;em&gt;CH&lt;/em&gt; and &lt;em&gt;BR&lt;/em&gt; returned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;em&gt;14.py, lines 11-17&lt;/em&gt; above, given &lt;em&gt;countries[4:-4:-2]&lt;/em&gt; - note the following below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the stride/step is negative and in steps of 2&lt;/li&gt;
&lt;li&gt;so we return rows starting from index position 4, then go upwards/backwards in steps of 2 i.e &lt;em&gt;from the row with label FR and then move upwards/backwards&lt;/em&gt; but &lt;em&gt;excluding the row with index position -4 and all other rows beyond it too&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;em&gt;14.py, lines 19-24&lt;/em&gt; above, given &lt;em&gt;countries[0:-1:-1]&lt;/em&gt;, it returns an empty DataFrame. Here is why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since the step/stride is negative, we start at 0 and then go upwards or backwards - but the &lt;em&gt;stop index comes before the start index&lt;/em&gt; in that direction - so this cannot work, hence an empty DataFrame is returned&lt;/li&gt;
&lt;/ul&gt;
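The negative-stride examples above can be sketched on the same hypothetical countries DataFrame (illustrative values):

```python
import pandas as pd

# hypothetical data with the article's row labels
countries = pd.DataFrame(
    {"country": ["Nigeria", "Canada", "Brazil", "China", "France"]},
    index=["NG", "CA", "BR", "CH", "FR"])

print(countries[3:-4:-1])  # backwards from CH, stop at CA exclusive: CH, BR
print(countries[4:-4:-2])  # backwards from FR in steps of 2: FR, BR
print(countries[0:-1:-1])  # stop lies after the start going backwards: empty
```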

&lt;p&gt;&lt;strong&gt;1.2.2&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;Labelled indexes&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
With labelled indexes, slicing also takes the form &lt;em&gt;start:stop:step/stride&lt;/em&gt;. But here are some things to take note of below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;The stop label is inclusive&lt;/em&gt;&lt;/strong&gt;, unlike numeric indices where the stop index is exclusive.&lt;/li&gt;
&lt;li&gt;Pairing labelled and numeric indices in the start or stop positions is not allowed e.g countries['NG':4] would give an error&lt;/li&gt;
&lt;li&gt;The step/stride is still numeric and can also be positive or negative&lt;/li&gt;
&lt;li&gt;and of course, all rules that go with having positive or negative step/stride applies here too.&lt;/li&gt;
&lt;/ul&gt;
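The inclusive stop label can be sketched on a hypothetical countries DataFrame (illustrative values, same row labels as above):

```python
import pandas as pd

# hypothetical data with the article's row labels
countries = pd.DataFrame(
    {"country": ["Nigeria", "Canada", "Brazil", "China", "France"]},
    index=["NG", "CA", "BR", "CH", "FR"])

print(countries["CA":"CH"])  # stop label is inclusive: CA, BR, CH
print(countries["NG"::2])    # every 2nd row from NG: NG, BR, FR
print(countries[:"CA":-1])   # backwards from FR through CA: FR, CH, BR, CA
```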

&lt;p&gt;Let’s see some examples in &lt;em&gt;15.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;15.py, lines 3-10&lt;/em&gt; above, given &lt;em&gt;countries['CA':'CH']&lt;/em&gt;, we can see that the rows returned include the row with the label &lt;em&gt;CH&lt;/em&gt;, which is the stop index label specified.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;15.py, lines 12-20&lt;/em&gt; above, given &lt;em&gt;countries[:'CH']&lt;/em&gt;, the rows returned start from the first row in the DataFrame and run through the row labelled &lt;em&gt;CH&lt;/em&gt; inclusive.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;15.py, lines 22-30&lt;/em&gt; above, given &lt;em&gt;countries['CA':]&lt;/em&gt;, the rows returned start from the row labelled &lt;em&gt;CA&lt;/em&gt; in the DataFrame and run through the last row inclusive.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;15.py, lines 32-39&lt;/em&gt; above, given &lt;em&gt;countries['NG'::2]&lt;/em&gt;, the rows returned -  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;starts from the row labelled &lt;em&gt;NG&lt;/em&gt; in the DataFrame&lt;/li&gt;
&lt;li&gt;add steps of 2 - index NG has a default index position of 0, so 0 + 2 = &lt;em&gt;index 2&lt;/em&gt; which is the row labelled BR; then index 2 + 2 = &lt;em&gt;index 4&lt;/em&gt;, which is the row labelled FR&lt;/li&gt;
&lt;li&gt;so, we have rows labelled &lt;em&gt;NG&lt;/em&gt;, &lt;em&gt;BR&lt;/em&gt;, &lt;em&gt;FR&lt;/em&gt; which have default numeric index positions 0, 2, 4 returned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;em&gt;15.py, lines 41-49&lt;/em&gt; above, given &lt;em&gt;countries[:‘CA’:-1]&lt;/em&gt;, the rows &lt;br&gt;
returned -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;starts from the last row, labelled &lt;em&gt;FR&lt;/em&gt;, in the DataFrame&lt;/li&gt;
&lt;li&gt;adds steps of -1:

&lt;ul&gt;
&lt;li&gt;index &lt;em&gt;FR&lt;/em&gt; has a default negative index position of &lt;em&gt;-1&lt;/em&gt;, so -1 + -1 
= &lt;em&gt;index -2&lt;/em&gt; which is the row labelled &lt;em&gt;CH&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;then index -2 + (-1) = &lt;em&gt;index -3&lt;/em&gt;, which is the row labelled &lt;em&gt;BR&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;then index -3 + (-1) = &lt;em&gt;index -4&lt;/em&gt;, which is the row labelled &lt;em&gt;CA&lt;/em&gt; &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;so, we have rows labelled &lt;em&gt;FR&lt;/em&gt;, &lt;em&gt;CH&lt;/em&gt;, &lt;em&gt;BR&lt;/em&gt;, &lt;em&gt;CA&lt;/em&gt; which have default numeric index positions -1, -2, -3, -4 returned.&lt;/li&gt;
&lt;/ul&gt;
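&lt;p&gt;Since the embedded gists may not render here, below is a minimal runnable sketch of the slicing rules above. The &lt;em&gt;countries&lt;/em&gt; DataFrame is a hypothetical stand-in - its labels and Country/Capital columns are assumptions, not the article's actual data:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical stand-in for the article's countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

# Label slicing: the stop label is INCLUSIVE
print(countries["CA":"CH"].index.tolist())  # ['CA', 'BR', 'CH']
print(countries[:"CH"].index.tolist())      # ['NG', 'CA', 'BR', 'CH']
print(countries["NG"::2].index.tolist())    # ['NG', 'BR', 'FR']
print(countries[:"CA":-1].index.tolist())   # ['FR', 'CH', 'BR', 'CA']
```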

&lt;p&gt;Using square brackets, [ ], has its limitations, such as the inability to select several rows and columns at the same time. So, let's jump into loc and iloc to see their awesomeness!&lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;loc&lt;/strong&gt; - 2
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;2.1&lt;/strong&gt;: &lt;strong&gt;Row Access&lt;/strong&gt;:&lt;br&gt;
By default, loc accesses rows. loc is label-based, so we just need to specify the row label. Let us see some examples in 16.py below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;16.py, lines 3-16&lt;/em&gt; above, given &lt;em&gt;countries.loc[‘FR’]&lt;/em&gt; or &lt;em&gt;countries.loc[[‘FR’]]&lt;/em&gt;, the row with label &lt;em&gt;FR&lt;/em&gt; is returned as a Series or a DataFrame respectively.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;16.py, lines 18-25&lt;/em&gt; above, given &lt;em&gt;countries.loc[[‘CA’, ‘BR’, ‘CH’]]&lt;/em&gt;, the rows with the labels listed in the square brackets are returned.&lt;/p&gt;
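&lt;p&gt;A quick sketch of loc row access, using a hypothetical &lt;em&gt;countries&lt;/em&gt; DataFrame (labels and columns assumed for illustration):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

fr_series = countries.loc["FR"]    # a single label returns a Series
fr_frame = countries.loc[["FR"]]   # a list of labels returns a DataFrame
several = countries.loc[["CA", "BR", "CH"]]  # multiple rows by label
```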

&lt;p&gt;&lt;strong&gt;2.2&lt;/strong&gt;: &lt;strong&gt;Row and Column Access&lt;/strong&gt;:&lt;br&gt;
We can simultaneously access rows and columns using loc. It takes the form &lt;em&gt;countries.loc[row, column]&lt;/em&gt;. Let us see some examples below in &lt;em&gt;17.py&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;17.py, lines 3-10&lt;/em&gt; above, given &lt;em&gt;countries.loc[['CA', 'BR', 'CH'], ['Country', &lt;br&gt;
'Capital']]&lt;/em&gt; - we specify a list of row labels and also a list of column labels we want to be returned.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;17.py, lines 12-50&lt;/em&gt; above, we can see slices can also be specified. Everything we have seen so far in relation to positive or negative steps/strides also applies here as seen in the examples given.&lt;/p&gt;
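&lt;p&gt;Here is a small sketch of the &lt;em&gt;countries.loc[row, column]&lt;/em&gt; form, again with a hypothetical &lt;em&gt;countries&lt;/em&gt; DataFrame (labels and columns assumed):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

# Lists of row labels and column labels
subset = countries.loc[["CA", "BR", "CH"], ["Country", "Capital"]]

# Label slices work too - both stop labels are inclusive
sliced = countries.loc["CA":"CH", "Country":"Capital"]
```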

&lt;p&gt;&lt;strong&gt;2.3&lt;/strong&gt;: &lt;strong&gt;Column Access&lt;/strong&gt;:&lt;br&gt;
We can also select the specific columns we need while we select all rows as seen below in &lt;em&gt;18.py&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
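&lt;p&gt;A minimal sketch of selecting all rows but only specific columns with loc (the &lt;em&gt;countries&lt;/em&gt; DataFrame below is a hypothetical stand-in):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

# All rows (:), only the column(s) we need
capitals = countries.loc[:, ["Capital"]]
```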


&lt;h1&gt;
  
  
  &lt;strong&gt;iloc&lt;/strong&gt; - 3
&lt;/h1&gt;

&lt;p&gt;Just like loc, iloc is row-based by default. The difference is that iloc is position-based. Let's see some examples in &lt;em&gt;19.py&lt;/em&gt; below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;em&gt;19.py&lt;/em&gt; above gives the iloc version of the examples given for loc for &lt;em&gt;row access&lt;/em&gt;, &lt;em&gt;row and column access&lt;/em&gt; and &lt;em&gt;column access&lt;/em&gt;. As I earlier said, the only difference is that iloc is position-based while loc is label-based.&lt;/p&gt;
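&lt;p&gt;Below is a position-based sketch mirroring the loc examples, using a hypothetical &lt;em&gt;countries&lt;/em&gt; DataFrame. Note that with iloc, numeric stop positions are exclusive:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

row = countries.iloc[4]           # position 4: the row labelled FR, as a Series
rows = countries.iloc[[1, 2, 3]]  # rows CA, BR, CH by position
block = countries.iloc[1:4, 0:2]  # numeric stops are EXCLUSIVE
cols = countries.iloc[:, [1]]     # all rows, second column by position
```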

&lt;p&gt;Kindly note that &lt;em&gt;ix&lt;/em&gt;, which is also a way of accessing data in a DataFrame, is deprecated in favour of &lt;em&gt;loc&lt;/em&gt; and &lt;em&gt;iloc&lt;/em&gt;, so it is not advisable to use the &lt;em&gt;ix&lt;/em&gt; indexer.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;at&lt;/strong&gt; - 4
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;at&lt;/em&gt; is label-based like loc, but it is used only to access a single value in a DataFrame - unlike &lt;em&gt;loc&lt;/em&gt;, which can access not just a single value but also several values, as we have seen above.&lt;/p&gt;

&lt;p&gt;See some examples below in &lt;em&gt;20.py&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
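&lt;p&gt;A one-liner sketch of &lt;em&gt;at&lt;/em&gt; - one row label plus one column label - on a hypothetical &lt;em&gt;countries&lt;/em&gt; DataFrame:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

# at takes exactly one row label and one column label
capital = countries.at["FR", "Capital"]
print(capital)  # Paris
```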


&lt;h1&gt;
  
  
  &lt;strong&gt;iat&lt;/strong&gt; - 5
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;iat&lt;/em&gt; is position-based like iloc but also accesses a single value like &lt;em&gt;at&lt;/em&gt;. See some examples below in &lt;em&gt;21.py&lt;/em&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
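&lt;p&gt;And the &lt;em&gt;iat&lt;/em&gt; version - one row position plus one column position - on the same kind of hypothetical &lt;em&gt;countries&lt;/em&gt; DataFrame:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical countries DataFrame
countries = pd.DataFrame(
    {"Country": ["Nigeria", "Canada", "Brazil", "Switzerland", "France"],
     "Capital": ["Abuja", "Ottawa", "Brasilia", "Bern", "Paris"]},
    index=["NG", "CA", "BR", "CH", "FR"],
)

# iat takes exactly one row position and one column position
capital = countries.iat[4, 1]  # row position 4, column position 1
print(capital)  # Paris
```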


&lt;p&gt;Now, let's summarize the key points together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accessing data using indices takes the form, &lt;em&gt;[start:stop:step]&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Numeric stop indices are exclusive&lt;/li&gt;
&lt;li&gt;Label stop indices are inclusive&lt;/li&gt;
&lt;li&gt;A step can be positive or negative - when it is positive, start at the specified index and move forward/downwards; when it is negative, start at the specified index and move backwards/upwards.&lt;/li&gt;
&lt;li&gt;Given &lt;em&gt;[:y:-z]&lt;/em&gt; as a slicing index, start from the last row in the DataFrame and go backwards/upwards.
&lt;/li&gt;
&lt;li&gt;Given &lt;em&gt;[:y:z]&lt;/em&gt; as a slicing index, start from the first row in the DataFrame and go forward/downwards.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;loc&lt;/em&gt; - label-based&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;iloc&lt;/em&gt; - position-based&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;at&lt;/em&gt; - label-based but returns a single value&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;iat&lt;/em&gt; - position-based but returns a single value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wowww! That was so much to take in. But you are well on your way to becoming a Pandas Pundit! Stay tuned to this series for my next article on &lt;strong&gt;Filtering Data in DataFrames&lt;/strong&gt;! Have an amazing and fulfilled week ahead!&lt;/p&gt;

</description>
      <category>python</category>
      <category>pandas</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The Proficient Pythonista: List Comprehensions</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Sat, 29 Feb 2020 18:06:16 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-proficient-pythonista-list-comprehensions-3c3</link>
      <guid>https://dev.to/joyadauche/the-proficient-pythonista-list-comprehensions-3c3</guid>
      <description>&lt;p&gt;&lt;em&gt;For&lt;/em&gt; loops are a thing! But if you could sometimes use a construct that is way more concise and efficient won’t you go for it? Hell Yeah!!!&lt;/p&gt;

&lt;p&gt;List comprehensions give us a succinct way to create lists based on existing lists.&lt;/p&gt;

&lt;p&gt;To create a new list of numbers from an existing list using a &lt;em&gt;for loop&lt;/em&gt; construct:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;What if I told you that what we have in the loop above can be done in just a single line of code! - This is the power of list comprehensions. Take a look below in &lt;em&gt;2.py&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The structure for the code written in &lt;em&gt;2.py&lt;/em&gt; above is &lt;strong&gt;[ output expression &lt;em&gt;for&lt;/em&gt; iterator variable &lt;em&gt;in&lt;/em&gt; iterable ]&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;number + 3&lt;/em&gt; is the output expression - the result returned&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;number&lt;/em&gt; is the iterator variable&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;numbers&lt;/em&gt; is the iterable&lt;/li&gt;
&lt;/ul&gt;
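&lt;p&gt;A minimal sketch of that structure - the &lt;em&gt;numbers&lt;/em&gt; list below is a hypothetical stand-in for the one in the gist:&lt;/p&gt;

```python
numbers = [1, 2, 3, 4]  # hypothetical input list

# the for-loop version
new_numbers = []
for number in numbers:
    new_numbers.append(number + 3)

# the same thing as a one-line list comprehension
new_numbers = [number + 3 for number in numbers]
print(new_numbers)  # [4, 5, 6, 7]
```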

&lt;p&gt;List comprehension can be written over any iterable like a range object and not just lists as below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
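&lt;p&gt;For instance, a sketch of a comprehension over a range object (the squaring expression is an assumed example, not necessarily the gist's):&lt;/p&gt;

```python
# A list comprehension over a range object instead of a list
squares = [num ** 2 for num in range(5)]
print(squares)  # [0, 1, 4, 9, 16]
```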


&lt;p&gt;We can also have conditionals in list comprehensions, which help filter what gets returned - they control which items from an existing list are returned. This conditional logic can be on the &lt;strong&gt;iterator variable&lt;/strong&gt; or the &lt;strong&gt;output expression&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;4.py&lt;/em&gt; below is an example of conditional logic used on the iterator variable, where it returns only items that are not equal to the string &lt;em&gt;cisco&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;4.py&lt;/em&gt; above, the structure for the code written is &lt;strong&gt;[output expression &lt;em&gt;for&lt;/em&gt; iterator variable &lt;em&gt;in&lt;/em&gt; iterable &lt;em&gt;if&lt;/em&gt; predicate expression ]&lt;/strong&gt; -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;if flash != ‘cisco’&lt;/em&gt; - the conditional logic, which is an &lt;strong&gt;if predicate expression&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
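&lt;p&gt;A sketch of that filter - the &lt;em&gt;flashes&lt;/em&gt; list is a made-up example standing in for the gist's data:&lt;/p&gt;

```python
# Hypothetical list; the if predicate filters out 'cisco'
flashes = ["jay", "cisco", "wally", "barry"]
not_cisco = [flash for flash in flashes if flash != "cisco"]
print(not_cisco)  # ['jay', 'wally', 'barry']
```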

&lt;p&gt;Let’s take it further by having nested if predicate expressions within a list comprehension:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;5.py&lt;/em&gt; above, it checks whether the number &lt;em&gt;num&lt;/em&gt; is divisible by 2 and then by 4, and outputs it only if it satisfies both conditions.&lt;/p&gt;
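&lt;p&gt;That nested-predicate pattern can be sketched like this (the range is an assumed example):&lt;/p&gt;

```python
# Two chained if predicates: keep numbers divisible by 2 AND by 4
nums = [num for num in range(1, 21) if num % 2 == 0 if num % 4 == 0]
print(nums)  # [4, 8, 12, 16, 20]
```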

&lt;p&gt;Moving on to &lt;em&gt;conditionals on the output expression&lt;/em&gt;, let us see an example below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The structure for the code written in &lt;em&gt;6.py&lt;/em&gt; above is  &lt;strong&gt;[conditional on output expression  &lt;em&gt;for&lt;/em&gt; iterator variable &lt;em&gt;in&lt;/em&gt; iterable]&lt;/strong&gt; - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;hero if len(hero)&amp;gt;=8 else ''&lt;/em&gt; - the conditional logic - outputs &lt;em&gt;hero&lt;/em&gt; if the length of the string hero is greater than or equal to 8, else it outputs an empty string.&lt;/li&gt;
&lt;/ul&gt;
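&lt;p&gt;A sketch of a conditional on the output expression - the hero names here are made-up stand-ins:&lt;/p&gt;

```python
# Hypothetical list; the conditional sits in the output expression
heroes = ["superman", "flash", "aquaman", "wonderwoman"]
names = [hero if len(hero) >= 8 else "" for hero in heroes]
print(names)  # ['superman', '', '', 'wonderwoman']
```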

&lt;p&gt;Like list comprehensions, we can also have a &lt;strong&gt;dictionary comprehension&lt;/strong&gt; or even a &lt;strong&gt;set comprehension&lt;/strong&gt;. This is getting quite interesting right?&lt;/p&gt;

&lt;p&gt;With set comprehensions, unlike list comprehensions, the output returned contains no duplicates. Let’s see an example in &lt;em&gt;7.py&lt;/em&gt; below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;7.py&lt;/em&gt;, no duplicates are returned. Also, note that sets, unlike lists, are inherently unordered - the order of items is not preserved - this is why &lt;em&gt;o&lt;/em&gt; can come first in the set even though &lt;em&gt;e&lt;/em&gt; is the first character in &lt;em&gt;cool_quote&lt;/em&gt;.&lt;/p&gt;
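&lt;p&gt;A set-comprehension sketch - the quote below is a hypothetical stand-in for &lt;em&gt;cool_quote&lt;/em&gt;:&lt;/p&gt;

```python
# Hypothetical quote; a set comprehension drops duplicate characters
cool_quote = "excellence over everything"
unique_letters = {char for char in cool_quote if char != " "}
# each letter appears only once; iteration order is not guaranteed
print(unique_letters)
```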

&lt;p&gt;Talking about dictionaries, let’s see an example of how we can use a dictionary comprehension to create a dictionary below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;8.py&lt;/em&gt; above, we have &lt;em&gt;key and value&lt;/em&gt; pairs separated by a colon (:) - the keys are the items of the list while the values are the lengths of those items.&lt;/p&gt;
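&lt;p&gt;A dictionary-comprehension sketch of that key: value pattern (the &lt;em&gt;languages&lt;/em&gt; list is an assumed example):&lt;/p&gt;

```python
# Keys are the list items; values are the lengths of those items
languages = ["python", "sql", "go"]
name_lengths = {lang: len(lang) for lang in languages}
print(name_lengths)  # {'python': 6, 'sql': 3, 'go': 2}
```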

&lt;p&gt;Nested loops are also something we do sometimes. See an example below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The code above is multiplying the items in the first list by the items in the second list. To write this in a list comprehension, it is written as below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;10.py&lt;/em&gt; above, the outer list comprehension &lt;strong&gt;[&lt;em&gt;... for i in range(5, 8)&lt;/em&gt;]&lt;/strong&gt; creates 3 rows, while the inner list comprehension &lt;strong&gt;[&lt;em&gt;i*j for j in range(1,3)&lt;/em&gt;]&lt;/strong&gt; fills these rows with values i.e i*j.&lt;/p&gt;
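&lt;p&gt;Putting the two forms side by side, using the exact ranges mentioned above:&lt;/p&gt;

```python
# for-loop version
matrix = []
for i in range(5, 8):
    row = []
    for j in range(1, 3):
        row.append(i * j)
    matrix.append(row)

# nested list comprehension version - same result
matrix = [[i * j for j in range(1, 3)] for i in range(5, 8)]
print(matrix)  # [[5, 10], [6, 12], [7, 14]]
```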

&lt;p&gt;A very good use of list comprehensions is to flatten a list consisting of multiple lists, as seen below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
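&lt;p&gt;For example, a flattening sketch (the nested list is made up):&lt;/p&gt;

```python
# Hypothetical list of lists; the two for clauses read left to right
list_of_lists = [[1, 2], [3, 4], [5]]
flat = [item for sub_list in list_of_lists for item in sub_list]
print(flat)  # [1, 2, 3, 4, 5]
```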


&lt;p&gt;List comprehensions are superb! But you can still use for loops in places where readability is key, because it is important to write code your team can read easily. I hope you are excited about your journey to becoming a Proficient Pythonista!!! 😉&lt;/p&gt;

&lt;p&gt;Stay tuned to this series for my next article on &lt;strong&gt;GENERATORS&lt;/strong&gt;! Have an amazing and fulfilled week ahead! &lt;/p&gt;

</description>
      <category>python</category>
      <category>codequality</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The SQL Savant: Inner Joins in SQL</title>
      <dc:creator>Joy Ada Uche</dc:creator>
      <pubDate>Sun, 05 Jan 2020 05:15:09 +0000</pubDate>
      <link>https://dev.to/joyadauche/the-sql-savant-inner-joins-in-sql-37ak</link>
      <guid>https://dev.to/joyadauche/the-sql-savant-inner-joins-in-sql-37ak</guid>
      <description>&lt;p&gt;Ever tried retrieving the data you need from just one table but suddenly realised you need more detail or information about these data which you must get from another table? Joins to the rescue!!! &lt;br&gt;
You can get that additional detail you need using the power of Joins. With Joins in SQL, you can retrieve or access the information you need from two or more tables. &lt;/p&gt;

&lt;p&gt;Let’s say we are at a Javascript college that just hired a new teacher who wishes to carry every student along in his class, so he requested the academic details of every student - and these details are located in different tables. This certainly sounds like a task for Joins, right? &lt;/p&gt;

&lt;p&gt;Assume this class maintains a database consisting of 3 tables: &lt;em&gt;person&lt;/em&gt;, &lt;em&gt;grade&lt;/em&gt;, and &lt;em&gt;activity&lt;/em&gt;. We would most likely get the information we need by looking up data in the &lt;strong&gt;person&lt;/strong&gt; and &lt;strong&gt;grade&lt;/strong&gt; tables using joins.&lt;/p&gt;

&lt;p&gt;But here comes a pool of questions. How do we get the needed information using joins? What if there are students without grades? Does the teacher want to see only students with grades or all students whether there is a grade or not? Well, it actually depends on how this new Javascript teacher wants it right? So, we had a meeting with him and he wants to see &lt;em&gt;only students with grades&lt;/em&gt;. It is in this light we introduce the main types of Joins. They are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The inner joins and&lt;/li&gt;
&lt;li&gt;The outer joins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This class has a database with the &lt;strong&gt;person&lt;/strong&gt; and &lt;strong&gt;grade&lt;/strong&gt; tables as below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;THE INNER JOIN&lt;/strong&gt; (which is the most common type of join), also referred to as JOIN, is a type of JOIN that returns all rows from both participating tables where the key field of one table matches the key field of the other table. Using an inner join on the &lt;strong&gt;person&lt;/strong&gt; and &lt;strong&gt;grade&lt;/strong&gt; tables will return &lt;em&gt;only students with grades&lt;/em&gt;, like below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;2.md&lt;/em&gt; above, you can see that the inner join has combined both tables ON key columns whose values are common to both tables. It then returns records that contain the columns selected in the SELECT clause - you can see in &lt;em&gt;1.md&lt;/em&gt; that the &lt;em&gt;&lt;strong&gt;id&lt;/strong&gt;&lt;/em&gt; field of the &lt;em&gt;&lt;strong&gt;person table&lt;/strong&gt;&lt;/em&gt; and the &lt;em&gt;&lt;strong&gt;person_id&lt;/strong&gt;&lt;/em&gt; field of the &lt;em&gt;&lt;strong&gt;grade table&lt;/strong&gt;&lt;/em&gt; match for the values &lt;em&gt;33CC&lt;/em&gt; and &lt;em&gt;44DD&lt;/em&gt; only.&lt;/p&gt;

&lt;p&gt;The code for completing an INNER JOIN from the &lt;em&gt;person&lt;/em&gt; table to the &lt;em&gt;grade&lt;/em&gt; table based on the common values of the key field is shown below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;3.sql&lt;/em&gt; and &lt;em&gt;2.md&lt;/em&gt;, please note the following:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;3.sql&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the &lt;strong&gt;SELECT&lt;/strong&gt; clause, the fields or columns you want to be returned from the tables are listed.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;person&lt;/em&gt; table is conventionally called the &lt;em&gt;left table&lt;/em&gt; because it is the table after the FROM clause.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;grade&lt;/em&gt; table is conventionally called the &lt;em&gt;right table&lt;/em&gt; because it is the table after the JOIN keyword.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;INNER JOIN&lt;/strong&gt; keyword can be replaced with the &lt;strong&gt;JOIN&lt;/strong&gt; keyword - INNER JOIN is the default if you don't specify the type when you use the word JOIN.&lt;/li&gt;
&lt;li&gt;The ON clause introduces the key fields from each table we would be joining on. The ON clause is used to specify a &lt;em&gt;join condition&lt;/em&gt; or a &lt;em&gt;join-predicate&lt;/em&gt;. The &lt;em&gt;name&lt;/em&gt; of a key field or key column can be the same or vary from one table to another - the key field name in the &lt;em&gt;grade&lt;/em&gt; table, &lt;strong&gt;person_id&lt;/strong&gt;, can also be named &lt;strong&gt;id&lt;/strong&gt; like in the &lt;em&gt;person&lt;/em&gt; table as long as it does not conflict with other column names in the grade table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;2.md&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regardless of whether the value of the key field appears multiple times in a table, as long as it appears in both tables, the record would be included in the result e.g &lt;em&gt;33CC&lt;/em&gt; is returned twice, for the years 2019 and 2020 respectively.&lt;/li&gt;
&lt;/ul&gt;
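&lt;p&gt;Since the actual tables live in the gists, here is a runnable sketch of the inner join using Python's sqlite3 with made-up rows modelled on the &lt;em&gt;person&lt;/em&gt; and &lt;em&gt;grade&lt;/em&gt; tables (the names and scores are assumptions; only the ids &lt;em&gt;33CC&lt;/em&gt; and &lt;em&gt;44DD&lt;/em&gt; appearing in both tables follows the article):&lt;/p&gt;

```python
import sqlite3

# Made-up rows modelled on the article's person and grade tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id TEXT, name TEXT);
    CREATE TABLE grade  (person_id TEXT, year INTEGER, score INTEGER);
    INSERT INTO person VALUES ('11AA', 'Ada'), ('22BB', 'Musa'),
                              ('33CC', 'Zamani'), ('44DD', 'Chi');
    INSERT INTO grade  VALUES ('33CC', 2019, 70), ('33CC', 2020, 85),
                              ('44DD', 2020, 90);
""")

rows = conn.execute("""
    SELECT person.name, grade.year, grade.score
    FROM person
    INNER JOIN grade ON person.id = grade.person_id
""").fetchall()
print(rows)  # only students with grades; 33CC appears twice (2019 and 2020)
```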

&lt;p&gt;So, we can gladly use the inner join above and give the new teacher what he asked for.&lt;/p&gt;

&lt;p&gt;An ad-hoc request comes in! The new teacher wants to see &lt;em&gt;students with grades along with their yearly activities&lt;/em&gt;. Herein lies the awesomeness of SQL - &lt;em&gt;the ability to combine multiple joins in a single query&lt;/em&gt; - Multiple Joins. Ring a bell? Let us see an example below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;4.md&lt;/em&gt; above, we have 3 tables - person, grade and activity - an additional table, activity, which shows the hobbies a student partakes in yearly. So we came up with the query below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In &lt;em&gt;5.sql&lt;/em&gt; above, &lt;strong&gt;activity.year&lt;/strong&gt; i.e &lt;em&gt;table_name.column_name&lt;/em&gt; is used because the field &lt;strong&gt;year&lt;/strong&gt; is common to the &lt;em&gt;grade&lt;/em&gt; and &lt;em&gt;activity&lt;/em&gt; tables. If the &lt;em&gt;table_name&lt;/em&gt; is not specified, an error would be thrown saying it is &lt;em&gt;ambiguous&lt;/em&gt;. So, &lt;em&gt;5.sql&lt;/em&gt; returns:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Take a second look at this result! Do you observe something wrong with it? Some &lt;em&gt;score&lt;/em&gt; field values are wrongly paired with activity and year values - &lt;em&gt;Zamani&lt;/em&gt; for example, &lt;em&gt;&lt;strong&gt;the 3rd and 4th rows are incorrect&lt;/strong&gt;&lt;/em&gt;. This is so because we did not join on an extra key field, &lt;em&gt;year&lt;/em&gt;, which is common to the &lt;em&gt;activity&lt;/em&gt; and &lt;em&gt;grade&lt;/em&gt; tables. So we modify the query to be:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;which returns the desired result below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The thought of how multiple inner joins work can be confusing right? Well, what actually happens, as in &lt;em&gt;7.sql&lt;/em&gt; above, is that &lt;strong&gt;every single join produces a single derived table&lt;/strong&gt; which is then joined to the next table, and so on. Using the &lt;em&gt;7.sql&lt;/em&gt; example above:&lt;br&gt;
&lt;strong&gt;JOIN 1&lt;/strong&gt;: This is the inner join between the &lt;em&gt;person&lt;/em&gt; and &lt;em&gt;grade&lt;/em&gt; tables. Let’s call its result derived table one (DT1).&lt;br&gt;
&lt;strong&gt;JOIN 2&lt;/strong&gt;: This is another inner join, between DT1 and the activity table. The result obtained here is the final result returned by the query.&lt;/p&gt;
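&lt;p&gt;A runnable sqlite3 sketch of the two-join query, with made-up rows (names, scores and hobbies are assumptions) - note that joining on &lt;em&gt;year&lt;/em&gt; as well keeps each score paired with the activity of the same year:&lt;/p&gt;

```python
import sqlite3

# Made-up rows; joining on BOTH person_id and year avoids the mispairing
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person   (id TEXT, name TEXT);
    CREATE TABLE grade    (person_id TEXT, year INTEGER, score INTEGER);
    CREATE TABLE activity (person_id TEXT, year INTEGER, hobby TEXT);
    INSERT INTO person   VALUES ('33CC', 'Zamani'), ('44DD', 'Chi');
    INSERT INTO grade    VALUES ('33CC', 2019, 70), ('33CC', 2020, 85),
                                ('44DD', 2020, 90);
    INSERT INTO activity VALUES ('33CC', 2019, 'chess'),
                                ('33CC', 2020, 'football'),
                                ('44DD', 2020, 'swimming');
""")

rows = conn.execute("""
    SELECT person.name, grade.score, activity.hobby, activity.year
    FROM person
    INNER JOIN grade    ON person.id = grade.person_id
    INNER JOIN activity ON person.id = activity.person_id
                       AND grade.year = activity.year
""").fetchall()
print(rows)  # each score is paired with the activity of the SAME year
```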

&lt;p&gt;So having delivered what is needed in time, let us refactor what we have written for a healthy codebase. There are several ways the query in &lt;em&gt;7.sql&lt;/em&gt; could have been written. For example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The &lt;strong&gt;AS&lt;/strong&gt; keyword is used for creating an &lt;em&gt;alias&lt;/em&gt; - an alias is a temporary name that only exists for the duration of the query. Good use cases for aliases are when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to make column names or table names more readable&lt;/li&gt;
&lt;li&gt;You want to write less because the names are long - for example, when there is more than one table in your query, you can write &lt;strong&gt;p.id = a.person_id&lt;/strong&gt; instead of &lt;strong&gt;person.id = activity.person_id&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the key field in &lt;em&gt;grade&lt;/em&gt; and &lt;em&gt;activity&lt;/em&gt; tables is also &lt;strong&gt;id&lt;/strong&gt; (and not person_id), the SQL code would rather be:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;You have seen the ON clause so far, but if the key field you are joining on has the same name in both tables, you can use the &lt;strong&gt;USING&lt;/strong&gt; keyword instead.&lt;/p&gt;
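&lt;p&gt;A tiny sqlite3 sketch of USING, assuming the key column is named &lt;em&gt;id&lt;/em&gt; in both tables (the row values are made up):&lt;/p&gt;

```python
import sqlite3

# When the key column has the same name in both tables, USING replaces ON
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id TEXT, name TEXT);
    CREATE TABLE grade  (id TEXT, year INTEGER, score INTEGER);
    INSERT INTO person VALUES ('33CC', 'Zamani');
    INSERT INTO grade  VALUES ('33CC', 2020, 85);
""")

rows = conn.execute("""
    SELECT name, year, score
    FROM person
    INNER JOIN grade USING (id)
""").fetchall()
print(rows)  # [('Zamani', 2020, 85)]
```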

&lt;p&gt;So, I believe you are now becoming an SQL savant! Oh Gosh! This new Javascript teacher is so demanding! Now, he is asking for &lt;em&gt;all students' academic details regardless of whether they have a grade or not&lt;/em&gt;. Well! Body no be wood! 😁  So, we have to push this request to another sprint right? 😉 We would set up a meeting with him in the near future to talk about how he wants this because this sounds like a job for the &lt;strong&gt;OUTER JOINS&lt;/strong&gt; and there are 3 kinds.&lt;/p&gt;

&lt;p&gt;Stay tuned to this series for my next article on &lt;strong&gt;OUTER JOINS&lt;/strong&gt;! Have an amazing and fulfilled week ahead! &lt;/p&gt;

</description>
      <category>sql</category>
      <category>datascience</category>
      <category>postgres</category>
      <category>dataanalysis</category>
    </item>
  </channel>
</rss>
