DEV Community: Lorraine

Cleaning and Transforming Data with SQL

Lorraine — Tue, 10 Dec 2019 00:33:36 +0000

One of the first tasks in data analytics is cleaning the dataset you're working with. The insights you draw from your data are only as good as the data itself, so it's no surprise that an estimated 80% of the time spent by analytics professionals involves preparing data for use in analysis.

SQL can help expedite this important task. In this tutorial, we will discuss functions commonly used to clean and transform data, and to remove duplicates from query output that is not in the form we would like. This means you'll learn about:

CASE WHEN
COALESCE
NULLIF
LEAST / GREATEST
Casting
DISTINCT

We will be using the following sample table, employees, throughout this tutorial to illustrate how our functions work:

id	first_name	last_name	title	age	wage	hire_date
1	Amy	Jordan	Ms	24	15	2019-04-27
2	Bill	Tibb	Mr	61	28	2012-05-02
3	Bill	Sadat		18	12	2019-11-08
4	Christine	Riveles	Mrs	36	20	2018-03-30
5	David	Guerin	Honorable	28	20	2016-11-02

This data is preloaded into a Next Tech sandbox for you to experiment and test the below queries. Connect to the database here for free!

Let’s get started!

This tutorial is adapted from Next Tech’s full SQL for Data Analysis course, which includes an in-browser sandboxed environment and interactive activities and challenges using real datasets. You can get started with this course here!

CASE WHEN

CASE WHEN is a conditional expression (often called a CASE WHEN statement) that allows a query to map various values in a column to other values. The general format of a CASE WHEN statement is:

CASE
    WHEN condition1 THEN value1
    WHEN condition2 THEN value2
    ...
    WHEN conditionX THEN valueX
    ELSE else_value
END

Here, condition1 and condition2, through conditionX, are Boolean conditions; value1 and value2, through valueX, are the values that the Boolean conditions map to; and else_value is the value returned if none of the Boolean conditions are met.

For each row, the program starts at the top of the CASE WHEN statement and evaluates the Boolean conditions in order. The statement returns the value associated with the first condition that evaluates as true, and any later conditions are skipped. If none of the conditions evaluate as true, then the value associated with the ELSE clause is returned.

As an example, let's say you wanted to return all rows from the employees table. Additionally, you would like to add a column that labels an employee as a New employee if they were hired on or after 2019-01-01. Otherwise, it will mark the employee as a Standard employee. This column will be called employee_type. We can create this table by using a CASE WHEN statement as follows:

SELECT
    *,
    CASE
        WHEN hire_date >= '2019-01-01' THEN 'New'
        ELSE 'Standard'
    END AS employee_type
FROM
    employees;

This query will give the following output:

id	first_name	last_name	title	age	wage	hire_date	employee_type
1	Amy	Jordan	Ms	24	15	2019-04-27	New
2	Bill	Tibb	Mr	61	28	2012-05-02	Standard
3	Bill	Sadat		18	12	2019-11-08	New
4	Christine	Riveles	Mrs	36	20	2018-03-30	Standard
5	David	Guerin	Honorable	28	20	2016-11-02	Standard

The CASE WHEN statement effectively mapped a hire date to a string describing the employee type. Using a CASE WHEN statement, you can map values in any way you please.
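To see the top-to-bottom evaluation order in action, here is a small sketch that runs a multi-branch CASE WHEN from Python. It assumes an in-memory SQLite database (via the standard sqlite3 module) as a stand-in for the tutorial's sandbox; the tenure labels and the 2015 cutoff are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for the sandbox (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, first_name TEXT, hire_date TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Amy", "2019-04-27"), (2, "Bill", "2012-05-02"), (4, "Christine", "2018-03-30")],
)

# Conditions are checked in order and the first true one wins.
# Amy also satisfies the second condition, but the first one already matched.
rows = conn.execute(
    """
    SELECT first_name,
           CASE
               WHEN hire_date >= '2019-01-01' THEN 'New'
               WHEN hire_date >= '2015-01-01' THEN 'Established'
               ELSE 'Veteran'
           END AS tenure
    FROM employees
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```

Because the branches are tested in order, putting the broader 2015 condition first would label Amy as Established, so the order of the WHEN clauses matters.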

COALESCE

Another useful technique is to replace NULL values with a standard value. This can be accomplished easily by means of the COALESCE function. COALESCE allows you to list any number of columns and scalar values and returns the first one that is not NULL: if the first value in the list is NULL, it moves on to the second value, and so on down the list until it hits a non-NULL value. If all values in the COALESCE function are NULL, then the function returns NULL.

To illustrate a simple usage of the COALESCE function, let's say we want a list of the names and titles of our employees. However, for those with no title, we want to instead write the value 'NO TITLE'. We can accomplish this request with COALESCE:

SELECT
    first_name,
    last_name,
    COALESCE(title, 'NO TITLE') AS title
FROM
    employees;

This query produces the following results:

first_name	last_name	title
Amy	Jordan	Ms
Bill	Tibb	Mr
Bill	Sadat	NO TITLE
Christine	Riveles	Mrs
David	Guerin	Honorable

Whenever you need default values in place of NULL, COALESCE is the function to reach for.
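The left-to-right behaviour with more than two arguments can be checked with a short sketch. It assumes an in-memory SQLite database as a stand-in for the sandbox, and the literal values are purely illustrative.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the sandbox (assumption).
conn = sqlite3.connect(":memory:")

# COALESCE walks its arguments left to right and returns the first non-NULL one.
first_non_null, all_null = conn.execute(
    "SELECT COALESCE(NULL, NULL, 'fallback', 'ignored'), COALESCE(NULL, NULL)"
).fetchone()

print(first_non_null)  # the third argument is the first non-NULL value
print(all_null)        # SQL NULL comes back to Python as None
```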

NULLIF

NULLIF is, in a sense, the opposite of COALESCE. NULLIF is a two-value function that returns NULL if the first value equals the second value; otherwise, it returns the first value.

As an example, imagine that we want a list of the names and titles of our employees. However, this time, we want to replace the title 'Honorable' with NULL. This could be done with the following query:

SELECT
    first_name,
    last_name,
    NULLIF(title, 'Honorable') AS title
FROM
    employees;

This will blot out all mentions of 'Honorable' from the title column and give the following output:

first_name	last_name	title
Amy	Jordan	Ms
Bill	Tibb	Mr
Bill	Sadat
Christine	Riveles	Mrs
David	Guerin
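NULLIF and COALESCE are often combined, for instance to treat empty strings the same way as NULL. The sketch below assumes an in-memory SQLite database and hypothetical rows in which one title is an empty string and another is NULL; it is not the tutorial's exact data.

```python
import sqlite3

# In-memory SQLite database with hypothetical rows (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Amy", "Ms"), ("Bill", ""), ("David", None)],
)

# NULLIF turns the empty string into NULL; COALESCE then fills every NULL.
rows = conn.execute(
    """
    SELECT first_name,
           COALESCE(NULLIF(title, ''), 'NO TITLE') AS title
    FROM employees
    ORDER BY first_name
    """
).fetchall()

for row in rows:
    print(row)
```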

LEAST / GREATEST

Two functions that often come in handy for data preparation are LEAST and GREATEST. Each function takes any number of values and returns the least or the greatest of the values, respectively.

A simple use of these functions is to replace a value if it's too high or too low. For example, say the minimum wage increased to $15/hour and we need to change the wages of any employee earning less than that. We can do this using the following query:

SELECT
    id,
    first_name,
    last_name,
    title,
    age,
    GREATEST(15, wage) AS wage,
    hire_date
FROM
    employees;

This query will give the following output:

id	first_name	last_name	title	age	wage	hire_date
1	Amy	Jordan	Ms	24	15	2019-04-27
2	Bill	Tibb	Mr	61	28	2012-05-02
3	Bill	Sadat		18	15	2019-11-08
4	Christine	Riveles	Mrs	36	20	2018-03-30
5	David	Guerin	Honorable	28	20	2016-11-02

As you can see, Bill Sadat’s wage has increased from $12 to $15.
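Nesting the two functions clamps a value into a range, along the lines of LEAST(GREATEST(wage, 15), 25). The sketch below is an approximation rather than the tutorial's query: it assumes an in-memory SQLite database, which has no LEAST or GREATEST, so SQLite's multi-argument MIN() and MAX() scalar functions stand in for them. The $25 cap is hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the sandbox (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, wage INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, 15), (2, 28), (3, 12), (4, 20), (5, 20)],
)

# MAX(wage, 15) raises low wages to the floor (like GREATEST);
# MIN(..., 25) lowers high wages to the hypothetical cap (like LEAST).
rows = conn.execute(
    "SELECT id, MIN(MAX(wage, 15), 25) AS wage FROM employees ORDER BY id"
).fetchall()

for row in rows:
    print(row)
```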

Casting

Another useful data transformation is to change the data type of a column within a query. This is usually done to use a function only available to one data type, such as text, while working with a column that is in a different data type, such as numeric.

To change the data type of a column, you simply need to use the column::datatype format, where column is the column name, and datatype is the data type you want to change the column to. The :: operator is PostgreSQL syntax; the standard SQL equivalent is CAST(column AS datatype). For example, to change the age in the employees table to a text column in a query, use the following query:

SELECT
    first_name,
    last_name,
    age::TEXT
FROM
    employees;

This will convert the age column from an integer to text. You can now apply text functions to this transformed column. There is one final catch: not every data type can be cast to every other data type. For instance, datetime cannot be cast to float types. Your SQL client will throw an error if you attempt a conversion that is not supported.
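The same conversion can be written with the standard CAST(... AS ...) form. The sketch below assumes an in-memory SQLite database as a stand-in for the sandbox and uses SQLite's typeof() function to show the type before and after the cast.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the sandbox (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, age INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [("Amy", 24), ("Bill", 61)])

# CAST(age AS TEXT) is the portable spelling of age::TEXT.
rows = conn.execute(
    """
    SELECT first_name,
           typeof(age)               AS original_type,
           CAST(age AS TEXT)         AS age_text,
           typeof(CAST(age AS TEXT)) AS cast_type
    FROM employees
    ORDER BY rowid
    """
).fetchall()

for row in rows:
    print(row)
```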

DISTINCT

Often, when looking through a dataset, you may be interested in determining the unique values in a column or group of columns. This is the primary use case of the DISTINCT keyword. For example, if you wanted to know all the unique first names in the employees table, you could use the following query:

SELECT
    DISTINCT first_name
FROM
    employees;

This gives the following result:

first_name
Amy
Bill
Christine
David

You can also use DISTINCT with multiple columns to get all distinct column combinations present.
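As a rough illustration of the multi-column case, the sketch below loads the first_name and wage values from the sample table into an in-memory SQLite database (an assumption standing in for the sandbox) and compares the two forms.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the sandbox (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, wage INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Amy", 15), ("Bill", 28), ("Bill", 12), ("Christine", 20), ("David", 20)],
)

# One column: the two Bills collapse into a single row.
names = conn.execute(
    "SELECT DISTINCT first_name FROM employees ORDER BY first_name"
).fetchall()

# Two columns: rows are duplicates only if BOTH values match,
# so Bill appears twice because his two wages differ.
pairs = conn.execute(
    "SELECT DISTINCT first_name, wage FROM employees ORDER BY first_name, wage"
).fetchall()

print(names)
print(pairs)
```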

I hope you enjoyed this tutorial on data cleaning and transformation with SQL. This is just the beginning of what you can use SQL for in data analysis. If you’d like to learn more, Next Tech’s SQL for Data Analysis course covers:

More functions used for data preparation and cleaning
Aggregate functions and window functions
Importing and exporting data
Analytics using complex data types
Writing performant queries

You can get started for free here!

Deep Learning Basics: A Crash Course

Lorraine — Thu, 21 Nov 2019 00:56:11 +0000

Machine learning (ML) and deep learning (DL) techniques are being applied in a variety of fields, and data scientists are being sought after in many different industries.

With machine learning, we identify the processes through which we gain knowledge that is not readily apparent from data in order to make decisions. Applications of machine learning techniques may vary greatly, and are found in disciplines as diverse as medicine, finance, and advertising.

Deep learning makes use of more advanced neural networks than those used during the 1980s. This is not only a result of recent developments in the theory, but also advancements in computer hardware.

In this crash course, we will learn about deep learning and deep neural networks (DNNs), that is, neural networks with multiple hidden layers. We will cover the following topics:

Introduction to deep learning
Deep learning algorithms
Applications of deep learning

Let’s get started!

This crash course is adapted from Next Tech's Python Deep Learning Projects (Part 1: The Fundamentals) course, which explores the basics of deep learning, setting up a DL environment, and building an MLP. It includes an in-browser sandboxed environment with all the necessary software and libraries pre-installed. You can get started here!

Introduction to Deep Learning

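The universal approximation theorem says that a network with a single hidden layer can, given enough neurons, approximate any continuous function, so you may wonder what the point of using more than one hidden layer is. This is in no way a naive question, and for a long time neural networks were used with just one hidden layer.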

One reason for multiple hidden layers is that approximating a complex function might require a huge number of neurons in the hidden layer, making it impractical to use. A more important reason for using deep networks, which is not directly related to the number of hidden layers but to the level of learning, is that a deep network does not simply learn to predict output Y given input X: it also understands basic features of the input.

Let’s take a look at an example.

In a paper published in the Proceedings of the International Conference on Machine Learning (ICML) in 2009, H. Lee, R. Grosse, R. Ranganath, and A. Ng train a neural network with pictures of different categories of either objects or animals. In the following image we can see how the different layers of the network learn different characteristics of the input data. In the first layer the network learns to detect some basic features, such as lines and edges, which are common to all images in all categories:

The first layer weights (top) and the second layer weights (bottom) after training

In the next layers, shown in the image below, it combines those lines and edges to compose more complex features that are specific to each category.

Columns 1-4 represent the second layer (top) and third layer (bottom) weights learned for a specific object category (class). Column 5 represents the weights learned for a mixture of four object categories (faces, cars, airplanes, and motorbikes).

In the top row we can see how the network detects different features of each category: eyes, noses, and mouths for human faces, doors and wheels for cars, and so on. These features are abstract. That is, the network has learned the generic shape of a feature, such as a mouth or a nose, and can detect this feature in the input data despite variations it might have.

In the second row of the preceding image, we can see how the deeper layers of the network combine these features into even more complex ones, such as faces and whole cars. A strength of deep neural networks is that they can learn these high-level abstract representations themselves by deducing them from the training data.

Deep Learning Algorithms

We could define deep learning as a class of machine learning techniques where information is processed in hierarchical layers to understand representations and features from data in increasing levels of complexity. In practice, all deep learning algorithms are neural networks, which share some common basic properties. They all consist of interconnected neurons that are organized in layers. Where they differ is network architecture (the way neurons are organized in the network), and sometimes the way they are trained.

With that in mind, let's look at the main classes of neural networks. The following list is not exhaustive, but it represents the vast majority of algorithms in use today.

Multi-Layer Perceptrons (MLPs)

A neural network with feedforward propagation, fully-connected layers, and at least one hidden layer.

The above diagram demonstrates a 3-layer fully connected neural network with two hidden layers. The input layer has k input neurons, the first hidden layer has n hidden neurons, and the second hidden layer has m hidden neurons. The output, in this example, is the two classes y1 and y2. On top is the always-on bias neuron. A unit from one layer is connected to all units from the previous and following layers (hence fully connected).
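The forward pass through such a network can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the layer sizes (k=3, n=2, m=2), the sigmoid activation, and all weight values are arbitrary choices, not trained parameters or anything taken from the diagram.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer: every output unit sees every input unit."""
    return [
        sum(w * x for w, x in zip(unit_weights, inputs)) + b
        for unit_weights, b in zip(weights, biases)
    ]

def sigmoid(values):
    """Element-wise logistic activation (an assumed choice of activation)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

# Hypothetical sizes: k=3 inputs, n=2 and m=2 hidden units, 2 outputs.
# The bias lists play the role of the always-on bias neuron.
x = [0.5, -1.0, 2.0]
w1, b1 = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.2]], [0.0, 0.1]
w2, b2 = [[0.5, -0.5], [0.3, 0.8]], [0.1, -0.1]
w3, b3 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]

h1 = sigmoid(dense(x, w1, b1))   # first hidden layer
h2 = sigmoid(dense(h1, w2, b2))  # second hidden layer
y = sigmoid(dense(h2, w3, b3))   # output layer: scores for y1 and y2

print(y)
```

Each call to dense() feeds the previous layer's full output into every unit of the next layer, which is the feedforward, fully connected structure described above.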

Convolutional Neural Networks (CNNs)

A CNN is a feedforward neural network with several types of special layers. For example, convolutional layers apply a filter to the input image (or sound) by sliding that filter all across the incoming signal, to produce an n-dimensional activation map. There is some evidence that neurons in CNNs are organized similarly to how biological cells are organized in the visual cortex of the brain. Today, they outperform all other ML algorithms on a large number of computer vision and natural language processing tasks.
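The sliding-filter idea can be illustrated on a one-dimensional signal. The sketch below is a simplified assumption: no padding, a stride of 1, and a hand-picked two-element filter rather than a learned one. Strictly speaking it computes a cross-correlation, which is what deep learning libraries usually call convolution.

```python
def convolve_1d(signal, kernel):
    """Slide the kernel across the signal (no padding, stride 1)."""
    width = len(kernel)
    return [
        sum(k * s for k, s in zip(kernel, signal[i:i + width]))
        for i in range(len(signal) - width + 1)
    ]

signal = [0, 0, 1, 1, 1, 0, 0]
edge_filter = [-1, 1]  # hand-picked: responds where the signal steps up or down

activation_map = convolve_1d(signal, edge_filter)
print(activation_map)
```

The non-zero entries of the activation map mark the positions where the signal changes, which is the same kind of edge detection the first layer of an image network tends to learn.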

Recurrent Neural Networks (RNNs)

This type of network has an internal state (or memory), which is based on all or part of the input data already fed to the network. The output of a recurrent network is a combination of its internal state (memory of previous inputs) and the latest input sample. At the same time, the internal state changes, to incorporate newly input data. Because of these properties, recurrent networks are good candidates for tasks that work on sequential data, such as text or time-series data.
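A minimal sketch of the state update, assuming a single recurrent unit with a tanh activation and arbitrary, untrained weights, shows how earlier inputs keep influencing later outputs:

```python
import math

def rnn_step(state, x, w_state, w_input, bias):
    """One step of a single-unit recurrent cell: mix the old state with the new input."""
    return math.tanh(w_state * state + w_input * x + bias)

def run(sequence, w_state=0.5, w_input=1.0, bias=0.0):
    """Feed a sequence through the cell and record the state after each step."""
    state = 0.0
    states = []
    for x in sequence:
        state = rnn_step(state, x, w_state, w_input, bias)
        states.append(state)
    return states

# Both sequences end with the same input (0.0), but the first one
# started with 1.0, and the cell still "remembers" it two steps later.
states_a = run([1.0, 0.0, 0.0])
states_b = run([0.0, 0.0, 0.0])

print(states_a)
print(states_b)
```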

Autoencoders

A class of unsupervised learning algorithms in which the output has the same shape as the input, which allows the network to better learn basic representations. An autoencoder consists of input, hidden (or bottleneck), and output layers. Although it's a single network, we can think of it as a virtual composition of two components:

Encoder: Maps the input data to the network's internal representation.
Decoder: Tries to reconstruct the input from the network's internal data representation.

Reinforcement Learning (RL)

Reinforcement learning algorithms learn how to achieve a complex objective over many steps by receiving penalties when they make a wrong decision and rewards when they make a correct decision. It is a method often used to teach a machine how to interact with an environment, similar to the way human behaviour is shaped by negative and positive feedback. RL is often used in building computer games and autonomous vehicles.

Applications of Deep Learning

Machine learning, particularly deep learning, is producing more and more astonishing results in terms of the quality of predictions, feature detection, and classification. Many of these recent results have even made the news! Here are just a few of the ways that these techniques can be applied today or in the near future:

Autonomous vehicles

Nowadays, new cars have a suite of safety and convenience features that aim to make the driving experience safer and less stressful. One such feature is automated emergency braking if the car sees an obstacle. Another one is lane-keeping assist, which allows the vehicle to stay in its current lane without the driver needing to make corrections with the steering wheel. To recognize lane markings, other vehicles, pedestrians, and cyclists, these systems use a forward-facing camera. We can speculate that future autonomous vehicles will also use deep networks for computer vision.

Image and text recognition

Both Google's Vision API and Amazon's Rekognition service use deep learning models to provide various computer vision capabilities. These include recognizing and detecting objects and scenes in images, text recognition, face recognition, and so on.

Medical imaging

Medical imaging is an umbrella term for various non-invasive methods of creating visual representations of the inside of the body. Some of these include magnetic resonance imaging (MRI), ultrasound, computed axial tomography (CAT) scans, X-rays, and histology images. Typically, such an image is analyzed by a medical professional to determine the patient's condition. Machine learning, and computer vision in particular, is enabling computer-aided diagnosis, which can help specialists by detecting and highlighting important features of images.

For example, to determine the degree of malignancy of colon cancer a pathologist would have to analyze the morphology of the glands using histology imaging. This is a challenging task because morphology can vary greatly. A deep neural network could segment the glands from the image automatically, leaving the pathologist to verify the results. This would reduce the time needed for analysis, making it cheaper and more accessible.

Medical history analysis

Another medical area that could benefit from deep learning is the analysis of medical history records. Before a doctor diagnoses a condition and prescribes treatment they consult the patient's medical history for additional insight. A deep learning algorithm could extract the most relevant and important information from those extensive records, even if they are handwritten. In this way the doctor's job can be made easier while also reducing the risk of errors.

Language translation

Google's Neural Machine Translation API uses – you guessed it – deep neural networks for machine translation.

Speech recognition and generation

Google Duplex is another impressive real-world demonstration of deep learning. It's a new system that can carry out natural conversations over the phone. For example, it can make restaurant reservations on a user's behalf. It uses deep neural networks both to understand the conversation and to generate realistic, human-like replies.

Siri, Google Assistant, and Amazon Alexa also rely on deep networks for speech recognition.

Gaming

Finally, AlphaGo is an artificial intelligence (AI) system based on deep learning that made the news in March 2016 for beating the world Go champion, Lee Sedol. AlphaGo had already made the news in January 2016, when it beat the European champion, Fan Hui, although at the time it seemed unlikely that it could go on to beat the world champion. Fast-forward a couple of months and AlphaGo achieved this remarkable feat, winning the five-game series 4-1.

This was an important milestone because Go has many more possible game variations than other games, such as chess, and it's impossible to consider every possible move in advance. Also, unlike chess, in Go it's very difficult to even judge the current position or value of a single stone on the board. In 2017, DeepMind released an updated version of AlphaGo called AlphaZero.

In this crash course, we explained what deep learning is and how it's related to deep neural networks. We discussed the different types of networks and some real-world applications of deep learning.

This is only the beginning — if you’d like to learn more about deep learning, Next Tech has a Python Deep Learning Projects series that explores real-world deep learning projects across computer vision, natural language processing (NLP), and image processing. Part 1 of the series covers setting up a DL environment and building an MLP. You can get started here.

Immutable vs Mutable Data Types in Python

Lorraine — Thu, 17 Oct 2019 20:07:56 +0000

By now you may have heard the phrase "everything in Python is an object". Objects are Python's abstraction for data, and Python has an amazing variety of data structures that you can use to represent data, or combine to create your own custom data.

A first fundamental distinction that Python makes on data is about whether or not the value of an object changes. If the value can change, the object is called mutable, while if the value cannot change, the object is called immutable.

In this crash course, we will explore:

The difference between mutable and immutable types
Different data types and how to find out whether they are mutable or immutable

It is very important that you understand the distinction between mutable and immutable because it affects the code you write.

Let’s get started!

This crash course is adapted from Next Tech’s Learn Python Programming course that uses a mix of theory and practicals to explore Python and its features, and progresses from beginner to being skilled in Python. It includes an in-browser sandboxed environment with all the necessary software and libraries pre-installed. You can get started for free here!

Mutable vs Immutable

To get started, it's important to understand that every object in Python has an ID (or identity), a type, and a value, as shown in the following snippet:

age = 42
print(id(age))     # id
print(type(age))   # type
print(age)         # value

10966208
<class 'int'>
42

Once created, the ID of an object never changes. It is a unique identifier for it, and it is used behind the scenes by Python to retrieve the object when we want to use it.

The type also never changes. The type tells what operations are supported by the object and the possible values that can be assigned to it.

The value can either change or not. If it can, the object is said to be mutable, while when it cannot, the object is said to be immutable.

Let's take a look at an example:

age = 42
print(id(age))
print(type(age))
print(age)

age = 43
print(age)
print(id(age))

10966208
<class 'int'>
42
43
10966240

Has the value of age changed? Well, no. 42 is an integer number, of the type int, which is immutable. So, what happened is really that on the first line, age is a name that is set to point to an int object, whose value is 42.

When we type age = 43, what happens is that another object is created, of the type int and value 43 (also, the id will be different), and the name age is set to point to it. So, we didn't change that 42 to 43. We actually just pointed age to a different location.

As you can see from printing id(age) before and after the reassignment, the two IDs are different.

Now, let's see the same example using a mutable object.

x = [1, 2, 3]
print(x)
print(id(x))

x.pop()
print(x)
print(id(x))

[1, 2, 3]
139912816421064
[1, 2]
139912816421064

For this example, we created a list named x that contains 3 integers, 1, 2, and 3. After we change x by “popping” off the last value 3, the ID of x stays the same!

So, objects of type int are immutable and objects of type list are mutable. Now let’s discuss other immutable and mutable data types!
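One place where this distinction affects everyday code is passing arguments to functions. The sketch below uses two hypothetical helper functions, add_item and add_one, to contrast the two behaviours:

```python
def add_item(items, item):
    """Appends to the list it was given, so the caller's list changes too."""
    items.append(item)
    return items

def add_one(number):
    """Rebinds the local name only; the caller's int cannot be changed."""
    number += 1
    return number

basket = ['apple']
same_basket = add_item(basket, 'banana')

count = 1
new_count = add_one(count)

print(basket)                 # the original list now has two items
print(same_basket is basket)  # both names point to the same list object
print(count, new_count)       # count is unchanged; new_count is a different int
```

If a function should not modify the caller's list, a common approach is to work on a copy, for example list(items), inside the function.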

Mutable Data Types

Mutable objects can be changed after creation. Some of Python's mutable data types are: lists, byte arrays, sets, and dictionaries.

Lists

As you saw earlier, lists are mutable. Here's another example using the append() method:

a = list(('apple', 'banana', 'clementine'))
print(id(a))

a.append('dates')
print(id(a))

140372445629448
140372445629448

Byte Arrays

Byte arrays represent the mutable version of bytes objects. They expose most of the usual methods of mutable sequences as well as most of the methods of the bytes type. Items are integers in the range [0, 256).

Let's see a quick example with the bytearray type to show that it is mutable:

b = bytearray(b'python')
print(id(b))

b[0] = ord('P')    # item assignment changes b in place (b.replace() would return a new copy)
print(id(b))

139963525979808
139963525979808

Sets

Python provides two set types, set and frozenset. They are unordered collections of immutable objects.

c = set(('San Francisco', 'Sydney', 'Sapporo'))
print(id(c))

c.pop()
print(id(c))

140494031990344
140494031990344

As you can see, sets are indeed mutable. Later, in the Immutable Data Types section, we will see that frozensets are immutable.

Dictionaries

d = {
    'a': 'alpha',
    'b': 'bravo',
    'c': 'charlie',
    'd': 'delta',
    'e': 'echo'
}
print(id(d))

d.update({
    'f': 'foxtrot'
})
print(id(d))

140071114319408
140071114319408

Immutable Data Types

Immutable data types differ from their mutable counterparts in that they cannot be changed after creation. Some immutable types include numeric data types, strings, bytes, frozen sets, and tuples.

Numeric Data Types

You have already seen that integers are immutable; similarly, Python’s other built-in numeric data types such as booleans, floats, and complex numbers, along with the standard library's fractions and decimals, are also immutable!

Strings and Bytes

Textual data in Python is handled with str objects, more commonly known as strings. They are immutable sequences of Unicode code points, each of which typically represents a character.

When it comes to storing textual data though, or sending it on the network, you may want to encode it, using an appropriate encoding for the medium you're using. Encoding produces a bytes object, whose syntax and behavior are similar to those of strings.

Both strings and bytes are immutable. In the following snippet, assigning a new value to a name creates a new object with a different ID rather than changing the existing one:

# string
e = 'Hello, World!'
print(id(e))

e = 'Hello, Mars!'
print(id(e))

140595675113648
140595675113776

# bytes
unicode = 'This is üŋíc0de'     # unicode string: code points
print(type(unicode))
f = unicode.encode('utf-8')     # utf-8 encoded version of unicode string
print(type(f))
print(id(f))

f = b'A bytes object'           # a bytes object
print(id(f))

<class 'str'>
<class 'bytes'>
140595675068152
140595675461360

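In the bytes section, we first defined f as an encoded version of our unicode string. As you can see from print(type(f)), this is a bytes type. We then bound the name f to another bytes object whose value is b'A bytes object'. The two objects have different IDs: the original object was not modified, and the name simply points to a new one. Strings and bytes offer no way to change their contents in place, and an attempt such as f[0] = 65 raises a TypeError.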
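A more direct check of immutability is to attempt an in-place change. In this small sketch (the variable names are illustrative), item assignment fails for both str and bytes, while the mutable bytearray accepts it:

```python
text = 'Hello'
data = b'Hello'
rejected = []

# Item assignment is not supported by immutable sequences.
try:
    text[0] = 'J'
except TypeError:
    rejected.append('str')

try:
    data[0] = ord('J')
except TypeError:
    rejected.append('bytes')

# A bytearray built from the same bytes can be changed in place.
buffer = bytearray(data)
buffer[0] = ord('J')

print(rejected)      # both immutable types refused the assignment
print(text, data)    # the originals are untouched
print(buffer)        # only the mutable copy changed
```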

Frozen Sets

As discussed in the previous section, frozensets are similar to sets. However, frozenset objects are more limited than their mutable counterparts since they cannot be changed. Nevertheless, they still prove very effective for membership tests and for union, intersection, and difference operations, as well as for performance reasons.
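The following sketch, reusing the city names from the set example, illustrates these points: a frozenset supports membership tests and set operations but has no mutating methods, and because it is hashable it can serve as a dictionary key.

```python
cities = frozenset(('San Francisco', 'Sydney', 'Sapporo'))

# Membership tests work just as they do for sets.
has_sydney = 'Sydney' in cities

# There are no mutating methods such as add() or pop().
can_add = hasattr(cities, 'add')

# Set operations return a new frozenset and leave the original unchanged.
more_cities = cities | {'Seoul'}

# Unlike a set, a frozenset is hashable and can be used as a dictionary key.
trips = {cities: 'first trip'}

print(has_sydney, can_add)
print(len(cities), len(more_cities))
print(trips[cities])
```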

Tuples

The last immutable sequence type we're going to see is the tuple. A tuple is a sequence of arbitrary Python objects. In a tuple, items are separated by commas. These, too, are immutable, as shown in the following example:

g = (1, 3, 5)
print(id(g))

g = (42, )
print(id(g))

139952252343784
139952253457184
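Rebinding a name produces a new ID for any type, so a more direct test is to try changing a tuple in place. The sketch below (with illustrative variable names) also shows a common gotcha: a tuple cannot be reassigned item by item, but a mutable object stored inside it can still change.

```python
point = (1, 3, 5)

# Item assignment on a tuple raises TypeError.
try:
    point[0] = 42
    assignment_allowed = True
except TypeError:
    assignment_allowed = False

# The tuple itself is fixed, but the list it holds is still mutable.
record = ('fruits', ['apple'])
id_before = id(record)
record[1].append('banana')

print(assignment_allowed)
print(record)
print(id(record) == id_before)  # same tuple object throughout
```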

I hope you enjoyed this crash course on the difference between immutable and mutable objects and how to find out which kind an object is! Now that you understand this fundamental concept of Python programming, you can explore the methods available for each data type.

If you’d like to learn about this and continue to advance your Python skills, Next Tech has a full Learn Python Programming course that covers:

Functions
Conditional programming
Comprehensions and generators
Decorators, object-oriented programming, and iterators
File data persistence
Testing, including a brief introduction to test-driven development
Exception handling
Profiling and performances

You can get started here for free!

Database Normalization Explained

Lorraine — Mon, 01 Jul 2019 19:29:45 +0000

Normalization is a technique for organizing data in a database. It is important that a database is normalized to minimize redundancy (duplicate data) and to ensure only related data is stored in each table. It also helps prevent issues stemming from database modifications such as insertions, deletions, and updates.

The stages of organization are called normal forms. In this tutorial we will be redesigning a database for a construction company and ensuring that it satisfies the three normal forms:

First Normal Form (1NF):

Data is stored in tables with rows uniquely identified by a primary key
Data within each table is stored in individual columns in its most reduced form
There are no repeating groups

Second Normal Form (2NF):

Everything from 1NF
Only data that relates to a table’s primary key is stored in each table

Third Normal Form (3NF):

Everything from 2NF
There are no in-table dependencies between the columns in each table

Note that there are actually six levels of normalization; however, the third normal form is considered the highest level necessary for most applications so we will only be discussing the first three forms.

Let's get started!

This tutorial is adapted from Next Tech's Database Fundamentals course which comes with an in-browser MySQL database and interactive tasks and projects to complete. You can get started here for free!

Our Database: Codey's Construction

Codey's Construction's database schema with a new table that causes the database to violate the rules of normalization.

The database we will be working with in this tutorial is for Codey's Construction company (Codey is a helpful coding bot that works with you in the course mentioned earlier). As you can see from the schema above, the database contains the tables projects, job_orders, employees, project_employees. Recently, they have decided to add the table customers to store customer data.

Unfortunately, this table has not been designed in a way that satisfies the three forms of normalization... Let's fix that!

First Normal Form

First normal form relates to the duplication and over-grouping of data in tables and columns.

Codey’s Construction's table customers violates all three rules of 1NF.

There is no primary key! A user of the database would be forced to look up companies by their name, which is not guaranteed to be unique (since unique company names are registered on a state-by-state basis).
The data is not in its most reduced form. The column contact_person_and_role can be further divided into two columns, such as contact_person and contact_role.
There are two repeating groups of columns - (project1_id, project1_feedback) and (project2_id, project2_feedback).

The following SQL statement was used to create the customers table:

CREATE TABLE customers (
    name                          VARCHAR(255),
    industry                      VARCHAR(255),
    project1_id                   INT(6),
    project1_feedback             TEXT,
    project2_id                   INT(6),
    project2_feedback             TEXT,
    contact_person_id             INT(6),
    contact_person_and_role       VARCHAR(300),
    phone_number                  VARCHAR(12),
    address                       VARCHAR(255),
    city                          VARCHAR(255),
    zip                           VARCHAR(5)
);

Example data for `customers` table.

By modifying some columns, we can help redesign this table so that it satisfies 1NF.

First, we need to add a primary key column called id with data type INT(6):

ALTER TABLE customers
    ADD COLUMN id INT(6) AUTO_INCREMENT PRIMARY KEY FIRST;

With this statement, we added an automatically incrementing primary key as the first column in the table.

To satisfy the second condition, we need to split the contact_person_and_role column:

ALTER TABLE customers
    CHANGE COLUMN contact_person_and_role contact_person VARCHAR(300);

ALTER TABLE customers
    ADD COLUMN contact_person_role VARCHAR(300) AFTER contact_person;

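Here, we simply renamed it as contact_person, and added a column contact_person_role immediately after it. Note that renaming a column does not change its contents, so any existing values still need to be split between the two columns.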

To satisfy the third condition, we need to move the columns containing project IDs and project feedback to a new table called project_feedbacks. First, let's drop these columns from the customers table:

ALTER TABLE customers
    DROP COLUMN project1_id,
    DROP COLUMN project1_feedback,
    DROP COLUMN project2_id,
    DROP COLUMN project2_feedback;

And then create the project_feedbacks table:

CREATE TABLE project_feedbacks (
    id                  INT(6) AUTO_INCREMENT PRIMARY KEY,
    project_id          INT(6),
    customer_id         INT(6),
    project_feedback    TEXT
);

Here's what the database schema looks like now:

Modified schema that now satisfies 1NF.

As you can see, there are no more repeating groups in either the project_feedbacks table or the customers table. We still know which customer said what since project_feedbacks.customer_id refers back to the customers table.
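To see how the customer_id reference is used in practice, here is a minimal sketch that uses Python's built-in sqlite3 module instead of the MySQL-style statements above. The column types are simplified to SQLite's, and the sample rows are invented purely for illustration.

```python
import sqlite3

# In-memory SQLite database; types are simplified compared with the tutorial's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,
        name  TEXT
    );
    CREATE TABLE project_feedbacks (
        id                INTEGER PRIMARY KEY AUTOINCREMENT,
        project_id        INTEGER,
        customer_id       INTEGER,
        project_feedback  TEXT
    );
""")

# Hypothetical sample rows, not taken from the original dataset.
conn.execute("INSERT INTO customers (name) VALUES ('Acme Ltd')")
conn.executemany(
    "INSERT INTO project_feedbacks (project_id, customer_id, project_feedback) "
    "VALUES (?, ?, ?)",
    [(101, 1, "On time"), (102, 1, "Over budget")],
)

# One row per (customer, project) pair replaces the project1_*/project2_* columns.
rows = conn.execute("""
    SELECT c.name, f.project_id, f.project_feedback
    FROM project_feedbacks AS f
    JOIN customers AS c ON c.id = f.customer_id
    ORDER BY f.project_id
""").fetchall()
print(rows)
```

A customer with a third or fourth project simply gets more rows in project_feedbacks, with no change to the table structure.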

Now our customers table satisfies 1NF! Let's move on to second normal form.

Second Normal Form

To achieve second normal form, a database must first satisfy all the conditions for 1NF. After this, satisfying 2NF requires that all data in each table relates directly to the record that the primary key of the table identifies.

We are in violation of 2NF because the contact_person, contact_person_role and phone_number columns track data that relate to the contact person, not the customer. If the contact person for a customer changes, we would have to edit all of these columns, running the risk that we will change the values in one of the columns but forget to change another.

To help Codey's Construction fix this table to satisfy 2NF, these columns should be moved to a table containing data on the contact person. First, let's remove the columns in customers that are not related to our primary key:

ALTER TABLE customers
    DROP COLUMN contact_person,
    DROP COLUMN contact_person_role,
    DROP COLUMN phone_number;

Note that we kept the contact_person_id so we still know who to contact. Now, let's create our new table contact_persons so we have somewhere to store data about each contact.

CREATE TABLE contact_persons (
  id            INT(6) PRIMARY KEY,
  name          VARCHAR(300),
  role          VARCHAR(300),
  phone_number  VARCHAR(15)
);

Codey's Construction's database schema now looks like this:

Modified schema that now satisfies 2NF.

Now, if the contact person for a customer changes, the construction company just has to insert a record into the contact_persons table and change the contact_person_id in the customers table.
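As a rough illustration of that workflow, the following sketch again uses Python's sqlite3 module with simplified column types. The names, roles, and phone numbers are made-up placeholder values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact_persons (
        id            INTEGER PRIMARY KEY,
        name          TEXT,
        role          TEXT,
        phone_number  TEXT
    );
    CREATE TABLE customers (
        id                 INTEGER PRIMARY KEY,
        name               TEXT,
        contact_person_id  INTEGER
    );
    -- Hypothetical starting data.
    INSERT INTO contact_persons VALUES (1, 'Ana Ruiz', 'Site manager', '555-0100');
    INSERT INTO customers VALUES (1, 'Acme Ltd', 1);
""")

# The contact changes: insert the new person, then repoint the customer.
conn.execute(
    "INSERT INTO contact_persons VALUES (2, 'Ben Cole', 'Buyer', '555-0101')"
)
conn.execute("UPDATE customers SET contact_person_id = 2 WHERE id = 1")

current = conn.execute("""
    SELECT c.name, p.name, p.role, p.phone_number
    FROM customers AS c
    JOIN contact_persons AS p ON p.id = c.contact_person_id
""").fetchone()
print(current)
```

Only one value in customers changes, so the name, role, and phone number can no longer drift out of sync with each other.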

Third Normal Form

For a database to be in third normal form, it must first satisfy all the criteria for 2NF (and therefore, also 1NF).

Then, each column must be non-transitively dependent on the table’s primary key. This means that all columns in a table should rely on the primary key and no other column. If column_a relies on the primary key and also on column_b, then column_a is transitively dependent on the primary key, so the table does not satisfy 3NF.

Does your brain hurt from reading that? Don't worry! It's explained more below.

This is how the customers table looks after we have satisfied 1NF and 2NF:

Example data for modified `customers` table.

The table currently has transitively dependent columns. The transitively dependent relationship is between city and zip. The city in which a customer is located relies on the customer, so this satisfies 2NF; however, the city also depends on the zip code. If a customer relocates, there may be a chance we update one column but not the other. Because this relationship exists, the database is not in 3NF.

To fix our database to satisfy 3NF, we need to drop the city column from customers, and create a new table zips to store this data:

ALTER TABLE customers
    DROP COLUMN city;

CREATE TABLE zips (
  zip   VARCHAR(5) PRIMARY KEY, 
  city  VARCHAR(255)
);

Modified schema that now satisfies 3NF.
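With this layout the city is looked up through the zip code rather than stored twice. Here is a small sketch using Python's sqlite3 module; the helper city_of and the sample rows are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zips (zip TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, zip TEXT);
    -- Hypothetical rows for illustration only.
    INSERT INTO zips VALUES ('10001', 'New York'), ('60601', 'Chicago');
    INSERT INTO customers VALUES (1, 'Acme Ltd', '10001');
""")

def city_of(customer_id):
    """Look up a customer's city through the zips table."""
    row = conn.execute(
        "SELECT z.city FROM customers AS c JOIN zips AS z ON z.zip = c.zip "
        "WHERE c.id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]

before = city_of(1)
# The customer relocates: only the zip column changes.
conn.execute("UPDATE customers SET zip = '60601' WHERE id = 1")
after = city_of(1)
print(before, "->", after)
```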

That's it! Finding issues that violate 3NF can be difficult, but it's worth it to ensure that your database is resilient to errors caused by only partially updating data.

I hope you enjoyed this tutorial on database normalization! Codey's Construction's database now satisfies the three forms of normalization.

If you'd like to continue learning about databases, Next Tech's Database Fundamentals course covers all you need to know to get started with databases and SQL. By helping an interactive coding bot named Codey, you will learn how to create and design databases, modify data, and how to write SQL queries to answer business problems. You can get started for free here!

Introduction to Multilayer Neural Networks with TensorFlow’s Keras API

Lorraine — Tue, 11 Jun 2019 20:26:43 +0000

The development of Keras started in early 2015. As of today, it has evolved into one of the most popular and widely used libraries built on top of Theano and TensorFlow. One of its prominent features is that it has a very intuitive and user-friendly API, which allows us to implement neural networks in only a few lines of code.

Keras has also been integrated into TensorFlow since version 1.1.0, where it was part of the contrib module (which contains packages developed by contributors to TensorFlow and is considered experimental code). Later TensorFlow releases expose it as tf.keras.

In this tutorial we will look at this high-level TensorFlow API by walking through:

The basics of feedforward neural networks
Loading and preparing the popular MNIST dataset
Building an image classifier
Training a neural network and evaluating its accuracy

Let's get started!

This tutorial is adapted from Part 4 of Next Tech’s Python Machine Learning series, which takes you through machine learning and deep learning algorithms with Python from 0 to 100. It includes an in-browser sandboxed environment with all the necessary software and libraries pre-installed, and projects using public datasets. You can get started for free here!

Multilayer Perceptrons

Multilayer feedforward neural networks are a special type of fully connected network with multiple single neurons. They are also called Multilayer Perceptrons (MLP). The following figure illustrates the concept of an MLP consisting of three layers:

The MLP depicted in the preceding figure has one input layer, one hidden layer, and one output layer. The units in the hidden layer are fully connected to the input layer, and the output layer is fully connected to the hidden layer. If such a network has more than one hidden layer, we also call it a deep artificial neural network.

We can add an arbitrary number of hidden layers to the MLP to create deeper network architectures. Practically, we can think of the number of layers and units in a neural network as additional hyperparameters that we want to optimize for a given problem task.

As shown in the preceding figure, we denote the i^th activation unit in the l^th layer as a_i^(l). To make the math and code implementations a bit more intuitive, we will use the in superscript for the input layer, the h superscript for the hidden layer, and the out superscript for the output layer.

For instance, a_i^(in) refers to the i^th value in the input layer, a_i^(h) refers to the i^th unit in the hidden layer, and a_i^(out) refers to the i^th unit in the output layer. Here, the activation units a_0^(in) and a_0^(h) are the bias units, which we set equal to 1. The activation of the units in the input layer is just its input plus the bias unit; that is, a_0^(in) = 1 is followed by the input values.

Each unit in layer l is connected to all units in layer l + 1 via a weight coefficient. For example, the connection between the k^th unit in layer l and the j^th unit in layer l + 1 will be written as w_{k,j}^(l). Referring back to the previous figure, we denote the weight matrix that connects the input to the hidden layer as W^(h), and we write the matrix that connects the hidden layer to the output layer as W^(out).

We summarize the weights that connect the input and hidden layers by a matrix W^(h) ∈ ℝ^{m × d}, where d is the number of hidden units and m is the number of input units including the bias unit. Since it is important to internalize this notation to follow the concepts later in this lesson, let's summarize what we have just learned in a descriptive illustration of a simplified 3-4-3 multilayer perceptron:
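The same bookkeeping can also be checked numerically. The following is a minimal NumPy sketch of one forward pass through a network with three input features, four hidden units, and three output units. The random weights, the tanh hidden activation, and the helper name add_bias are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_outputs = 3, 4, 3

def add_bias(a):
    """Prepend the bias unit a_0 = 1 to an activation vector."""
    return np.concatenate(([1.0], a))

# W_h has shape (m, d): m input units including the bias unit, d hidden units.
W_h = rng.normal(size=(n_inputs + 1, n_hidden))
# W_out connects the hidden layer (plus its bias unit) to the output layer.
W_out = rng.normal(size=(n_hidden + 1, n_outputs))

x = np.array([0.5, -1.2, 3.0])   # one sample with three features
a_in = add_bias(x)               # shape (4,)
a_h = np.tanh(a_in @ W_h)        # shape (4,)
z_out = add_bias(a_h) @ W_out    # shape (3,), one net input per output unit

print(a_in.shape, W_h.shape, a_h.shape, W_out.shape, z_out.shape)
```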

The MNIST dataset

To see what neural network training via the tensorflow.keras (tf.keras) high-level API looks like, let's implement a multilayer perceptron to classify the handwritten digits from the Modified National Institute of Standards and Technology (MNIST) dataset, a popular benchmark for machine learning algorithms.

To follow along with the code snippets in this tutorial, you can use this Next Tech sandbox, which has the MNIST dataset and all necessary packages installed. Otherwise, you can use your local environment and download the dataset here.

The MNIST dataset comes in four parts, as listed here:

Training set images: train-images-idx3-ubyte.gz — 60,000 samples
Training set labels: train-labels-idx1-ubyte.gz — 60,000 labels
Test set images: t10k-images-idx3-ubyte.gz — 10,000 samples
Test set labels: t10k-labels-idx1-ubyte.gz — 10,000 labels

The training set consists of handwritten digits from 250 different people (50% high school students, 50% employees from the Census Bureau). The test set contains handwritten digits from different people.

Note that TensorFlow 1.x also provides the same dataset as follows:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

However, we work with the MNIST dataset as an external dataset to learn all the steps of data preprocessing separately. This way, you learn what you need to do with your own dataset.

The first step is to unzip the four parts of the MNIST dataset by running the following commands in your Terminal:

cd mnist/
gzip -d *ubyte.gz

Our images are stored in byte format, and we will read them into NumPy arrays that we will use to train and test our MLP implementation. In order to do that, we will define the following helper function:

import os
import struct
import numpy as np

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(
        path, f'{kind}-labels-idx1-ubyte'
    )
    images_path = os.path.join(
        path, f'{kind}-images-idx3-ubyte'
    )

    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)

    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16))
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)
        images = ((images / 255.) - .5) * 2

    return images, labels

The load_mnist function returns two arrays, the first being an n x m dimensional NumPy array (images), where n is the number of samples and m is the number of features (here, pixels). The images in the MNIST dataset consist of 28 x 28 pixels, and each pixel is represented by a grayscale intensity value. Here, we unroll the 28 x 28 pixels into one-dimensional row vectors, which represent the rows in our images array (784 per row or image). Inside the function, the pixel values are also rescaled from the original 0-255 range to the range -1 to 1. The second array (labels) returned by the load_mnist function contains the corresponding target variable, the class labels (integers 0-9) of the handwritten digits.
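To make the byte layout more concrete, here is a small sketch that builds a tiny fake image file in memory and parses it with the same steps as load_mnist. The two dummy images are synthetic, and np.frombuffer stands in for np.fromfile because the data lives in an in-memory buffer rather than on disk.

```python
import io
import struct

import numpy as np

# Build a tiny fake idx3 image file in memory: 16-byte header + 2 images of 28x28 bytes.
n_images, rows, cols = 2, 28, 28
header = struct.pack('>IIII', 2051, n_images, rows, cols)  # 2051 = magic number of image files
pixels = np.arange(n_images * rows * cols, dtype=np.uint32) % 256
payload = pixels.astype(np.uint8).tobytes()
fake_file = io.BytesIO(header + payload)

# Same parsing steps as load_mnist, applied to the in-memory buffer.
magic, num, r, c = struct.unpack('>IIII', fake_file.read(16))
images = np.frombuffer(fake_file.read(), dtype=np.uint8).reshape(num, r * c)
scaled = ((images / 255.) - .5) * 2

print(magic, num, r, c)
print(images.shape, scaled.min(), scaled.max())
```

The '>IIII' format string reads four big-endian unsigned 32-bit integers, which is why exactly 16 bytes are consumed before the pixel data starts.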

Then, the dataset is loaded and prepared as follows:

# loading the data
X_train, y_train = load_mnist('./mnist/', kind='train')
print(f'Rows: {X_train.shape[0]},  Columns: {X_train.shape[1]}')

X_test, y_test = load_mnist('./mnist/', kind='t10k')
print(f'Rows: {X_test.shape[0]},  Columns: {X_test.shape[1]}')

# mean centering and normalization:
mean_vals = np.mean(X_train, axis=0)
std_val = np.std(X_train)

X_train_centered = (X_train - mean_vals)/std_val
X_test_centered = (X_test - mean_vals)/std_val

del X_train, X_test

print(X_train_centered.shape, y_train.shape)
print(X_test_centered.shape, y_test.shape)

[Out:]
Rows: 60000,  Columns: 784
Rows: 10000,  Columns: 784
(60000, 784) (60000,)
(10000, 784) (10000,)

To get an idea of how those images in MNIST look, let's visualize examples of the digits 0-9 via Matplotlib's imshow function:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=2, ncols=5,
                       sharex=True, sharey=True)
ax = ax.flatten()
for i in range(10):
    img = X_train_centered[y_train == i][0].reshape(28, 28)
    ax[i].imshow(img, cmap='Greys')

ax[0].set_yticks([])
ax[0].set_xticks([])
plt.tight_layout()
plt.show()

We should now see a plot of the 2 x 5 subfigures showing a representative image of each unique digit:

Now let’s start building our model!

Building an MLP using TensorFlow's Keras API

First, let's set the random seed for NumPy and TensorFlow so that we get consistent results:

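import tensorflow as tf
# TensorFlow 1.x import path; from TensorFlow 1.4 on, `from tensorflow import keras` also works
import tensorflow.contrib.keras as keras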

np.random.seed(123)
tf.set_random_seed(123)

To continue with the preparation of the training data, we need to convert the class labels (integers 0-9) into the one-hot format. Fortunately, Keras provides a convenient tool for this:

y_train_onehot = keras.utils.to_categorical(y_train)

print('First 3 labels: ', y_train[:3])
print('\nFirst 3 labels (one-hot):\n', y_train_onehot[:3])

[Out:]
First 3 labels:  [5 0 4]

First 3 labels (one-hot):
 [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]
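If you are curious what this conversion does under the hood, the same encoding can be sketched in a few lines of NumPy. The helper name one_hot is hypothetical; this is an illustration of the idea, not Keras's actual implementation.

```python
import numpy as np

def one_hot(labels, n_classes=None):
    """Return a (n_samples, n_classes) float array with a single 1 per row."""
    labels = np.asarray(labels)
    if n_classes is None:
        # Infer the number of classes from the largest label.
        n_classes = labels.max() + 1
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(n_classes)[labels]

print(one_hot([5, 0, 4], n_classes=10))
```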

Now, let's implement our neural network! Briefly, we will have three layers, where the first two layers (the input and hidden layers) each have 50 units with the tanh activation function and the last layer (the output layer) has 10 units for the 10 class labels and uses softmax to give the probability of each class. Keras makes these tasks very simple:

# initialize model
model = keras.models.Sequential()

# add input layer
model.add(keras.layers.Dense(
    units=50,
    input_dim=X_train_centered.shape[1],
    kernel_initializer='glorot_uniform',
    bias_initializer='zeros',
    activation='tanh') 
)
# add hidden layer
model.add(
    keras.layers.Dense(
        units=50,
        input_dim=50,
        kernel_initializer='glorot_uniform',
        bias_initializer='zeros',
        activation='tanh')
    )
# add output layer
model.add(
    keras.layers.Dense(
        units=y_train_onehot.shape[1],
        input_dim=50,
        kernel_initializer='glorot_uniform',
        bias_initializer='zeros',
        activation='softmax')
    )

# define SGD optimizer
sgd_optimizer = keras.optimizers.SGD(
    lr=0.001, decay=1e-7, momentum=0.9
)
# compile model
model.compile(
    optimizer=sgd_optimizer,
    loss='categorical_crossentropy'
)

First, we initialize a new model using the Sequential class to implement a feedforward neural network. Then, we can add as many layers to it as we like. However, since the first layer that we add is the input layer, we have to make sure that the input_dim attribute matches the number of features (columns) in the training set (784 features or pixels in the neural network implementation).

Also, we have to make sure that the number of output units (units) and input units (input_dim) of two consecutive layers match. Our first two layers have 50 units plus one bias unit each. The number of units in the output layer should be equal to the number of unique class labels — the number of columns in the one-hot-encoded class label array.

Note that we used glorot_uniform as the initialization algorithm for weight matrices. Glorot initialization is a more robust way of initialization for deep neural networks. The biases are initialized to zero, which is common, and in fact the default setting in Keras.

Before we can compile our model, we also have to define an optimizer. We chose stochastic gradient descent optimization. Furthermore, we can set values for the learning rate decay constant (decay) and the momentum term to adjust the learning rate during training. Lastly, we set the cost (or loss) function to categorical_crossentropy.

The binary cross-entropy is just a technical term for the cost function in logistic regression, and the categorical cross-entropy is its generalization for multiclass predictions via softmax.
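To make that relationship concrete, here is a minimal NumPy sketch of softmax and both cost functions. The function names, the small eps constant added for numerical safety, and the example numbers are all assumptions chosen for illustration; Keras's own implementation differs in its details.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_onehot, probs, eps=1e-12):
    """Mean over samples of -sum_k y_k * log(p_k)."""
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

def binary_crossentropy(y, p, eps=1e-12):
    """Mean logistic-regression cost for labels y in {0, 1}."""
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Hypothetical net inputs for two samples and three classes.
z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
p = softmax(z)
print(p.sum(axis=1))
print(categorical_crossentropy(y, p))

# With two classes, the categorical form reduces to the binary one.
p1 = np.array([0.9, 0.2])
y1 = np.array([1.0, 0.0])
two_class_p = np.column_stack([1 - p1, p1])
two_class_y = np.column_stack([1 - y1, y1])
print(binary_crossentropy(y1, p1),
      categorical_crossentropy(two_class_y, two_class_p))
```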

After compiling the model, we can now train it by calling the fit method. Here, we are using mini-batch stochastic gradient descent with a batch size of 64 training samples per batch. We train the MLP over 50 epochs, and we can follow the optimization of the cost function during training by setting verbose=1.

The validation_split parameter is especially handy since it will reserve 10% of the training data (here, 6,000 samples) for validation after each epoch so that we can monitor whether the model is overfitting during training:

# train model
history = model.fit(
    X_train_centered, y_train_onehot,
    batch_size=64, epochs=50,
    verbose=1, validation_split=0.1
)

Printing the value of the cost function during training is extremely useful: we can quickly spot whether the cost is decreasing and, if it is not, stop the algorithm early and tune the hyperparameter values.

To predict the class labels, we can then use the predict_classes method to return the class labels directly as integers:

y_train_pred = model.predict_classes(X_train_centered, verbose=0)
print('First 3 predictions: ', y_train_pred[:3])

[Out:]
First 3 predictions: [5 0 4]

Finally, let's print the model accuracy on training and test sets:

# calculate training accuracy
y_train_pred = model.predict_classes(X_train_centered, verbose=0)
correct_preds = np.sum(y_train == y_train_pred, axis=0)
train_acc = correct_preds / y_train.shape[0]

print(f'Training accuracy: {(train_acc * 100):.2f}')

# calculate testing accuracy
y_test_pred = model.predict_classes(X_test_centered, verbose=0)
correct_preds = np.sum(y_test == y_test_pred, axis=0)
test_acc = correct_preds / y_test.shape[0]

print(f'Test accuracy: {(test_acc * 100):.2f}')

[Out:]
Training accuracy: 98.81
Test accuracy: 96.27

I hope you enjoyed this tutorial on using TensorFlow's Keras API to build and train a multilayer neural network for image classification! Note that this is just a very simple neural network without optimized tuning parameters.

In practice you need to know how to optimize the model by tweaking learning rate, momentum, weight decay, and number of hidden units. You also need to learn how to deal with the vanishing gradient problem, wherein error gradients become increasingly small as more layers are added to a network.

We cover these topics in Next Tech's Python Machine Learning (Part 4) course, as well as:

Breaking down the mechanics of TensorFlow, such as tensors, activation functions, computation graphs, variables, and placeholders
Low-level TensorFlow and another high-level API, Layers
Modeling sequential data using recurrent neural networks (RNN) and long short-term memory (LSTM) networks
Classifying images with deep convolutional neural networks (CNN).

You can get started here for free!

K-Means Clustering with scikit-learn

Lorraine — Thu, 30 May 2019 18:59:06 +0000

Clustering (or cluster analysis) is a technique that allows us to find groups of similar objects, objects that are more related to each other than to objects in other groups. Examples of business-oriented applications of clustering include the grouping of documents, music, and movies by different topics, or finding customers that share similar interests based on common purchase behaviors as a basis for recommendation engines.

In this tutorial, we will learn about one of the most popular clustering algorithms, k-means, which is widely used in academia as well as in industry. We will cover:

The basic concepts of k-means clustering
The mathematics behind the k-means algorithm
The advantages and disadvantages of k-means
How to implement the algorithm on a sample dataset using scikit-learn
How to visualize clusters
How to choose the optimal k using the elbow method

Let’s get started!

This tutorial is adapted from Part 3 of Next Tech’s Python Machine Learning series, which takes you through machine learning and deep learning algorithms with Python from 0 to 100. It includes an in-browser sandboxed environment with all the necessary software and libraries pre-installed, and projects using public datasets. You can get started for free here!

Fundamentals of K-Means Clustering

As we will see, the k-means algorithm is extremely easy to implement and is also computationally very efficient compared to other clustering algorithms, which might explain its popularity. The k-means algorithm belongs to the category of prototype-based clustering.

Prototype-based clustering means that each cluster is represented by a prototype, which can either be the centroid (average) of similar points with continuous features, or the medoid (the most representative or most frequently occurring point) in the case of categorical features.

While k-means is very good at identifying clusters with a spherical shape, one of the drawbacks of this clustering algorithm is that we have to specify the number of clusters, k, a priori. An inappropriate choice for k can result in poor clustering performance — we will discuss later in this tutorial how to choose k.

Although k-means clustering can be applied to data in higher dimensions, we will walk through the following examples using a simple two-dimensional dataset for the purpose of visualization.

You can follow along with the code in this tutorial by using a Next Tech sandbox, which has all the necessary libraries pre-installed, or if you’d prefer, you can run the snippets in your own local environment.

Once your sandbox loads, let’s import the toy dataset from scikit-learn and visualize the datapoints:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# create dataset
X, y = make_blobs(
    n_samples=150, n_features=2,
    centers=3, cluster_std=0.5,
    shuffle=True, random_state=0
)

# plot
plt.scatter(
    X[:, 0], X[:, 1],
    c='white', marker='o',
    edgecolor='black', s=50
)
plt.show()

The dataset that we just created consists of 150 randomly generated points that are roughly grouped into three regions with higher density, which is visualized via a two-dimensional scatterplot.

In real-world applications of clustering, we do not have any ground truth category information (information provided as empirical evidence as opposed to inference) about those samples; otherwise, it would fall into the category of supervised learning. Thus, our goal is to group the samples based on their feature similarities, which can be achieved using the k-means algorithm that can be summarized by the following four steps:

Randomly pick k centroids from the sample points as initial cluster centers.
Assign each sample to the nearest centroid μ^(j), j ∈ {1, ..., k}.
Move each centroid to the center of the samples that were assigned to it.
Repeat steps 2 and 3 until the cluster assignments do not change or a user-defined tolerance or maximum number of iterations is reached.
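The four steps above can be sketched in plain NumPy. This is a minimal illustration rather than scikit-learn's implementation; the function name kmeans, its parameters, and the synthetic two-blob data are assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, max_iter=300, seed=0):
    """Plain k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid
        # (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned samples;
        # a centroid without any samples is left where it is.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Synthetic data: two blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
])
demo_labels, demo_centroids = kmeans(X_demo, k=2)
print(np.bincount(demo_labels))
print(demo_centroids.round(2))
```

This sketch uses only the "assignments do not change" and maximum-iteration stopping rules; a tolerance-based check is left out to keep it short.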

Now, the next question is how do we measure similarity between objects? We can define similarity as the opposite of distance, and a commonly used distance for clustering samples with continuous features is the squared Euclidean distance between two points x and y in m-dimensional space:
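In LaTeX notation, with j running over the m feature dimensions, this squared distance can be written as follows (a reconstruction based on the surrounding definitions):

```latex
d(\mathbf{x}, \mathbf{y})^2 = \sum_{j=1}^{m} (x_j - y_j)^2 = \lVert \mathbf{x} - \mathbf{y} \rVert_2^2
```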

Note that in the preceding equation, the index j refers to the j^th dimension (feature column) of the sample points x and y. We will use the superscripts i and j to refer to the sample index and cluster index, respectively.

Based on this Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem, an iterative approach for minimizing the within-cluster Sum of Squared Errors (SSE), which is sometimes also called cluster inertia:
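Written out in LaTeX, and assuming n denotes the number of samples and k the number of clusters, the SSE takes the following form (a reconstruction that uses the symbols defined right after it):

```latex
SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \, \lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(j)} \rVert_2^2
```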

Here, μ^(j) is the centroid for cluster j, and

w^(i, j) = 1 if the sample x^(i) is in cluster j, and w^(i, j) = 0 otherwise.
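As a quick sanity check of this definition, the SSE can be computed by hand for a tiny example. The helper name within_cluster_sse and the four sample points are made up for illustration; indexing the centroids by each sample's label plays the role of the weights w^(i, j).

```python
import numpy as np

def within_cluster_sse(X, labels, centroids):
    """Sum of squared distances from each sample to its assigned centroid."""
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())

X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
toy_labels = np.array([0, 0, 1, 1])
toy_centroids = np.array([[0.0, 1.0], [4.0, 1.0]])

# Every point lies exactly one unit from its centroid, so the SSE is 4.0.
print(within_cluster_sse(X_toy, toy_labels, toy_centroids))
```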

Note that when we are applying k-means to real-world data using a Euclidean distance metric, we want to make sure that the features are measured on the same scale and apply z-score standardization or min-max scaling if necessary.

K-means clustering using `scikit-learn`

Now that we have learned how the k-means algorithm works, let's apply it to our sample dataset using the KMeans class from scikit-learn's cluster module:

from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=3, init='random',
    n_init=10, max_iter=300, 
    tol=1e-04, random_state=0
)
y_km = km.fit_predict(X)

Using the preceding code, we set the number of desired clusters to 3. We set n_init=10 to run the k-means clustering algorithm 10 times independently with different random centroids to choose the final model as the one with the lowest SSE. Via the max_iter parameter, we specify the maximum number of iterations for each single run (here, 300).

Note that the k-means implementation in scikit-learn stops early if it converges before the maximum number of iterations is reached. However, it is possible that k-means does not reach convergence for a particular run, which can be problematic (computationally expensive) if we choose relatively large values for max_iter.

One way to deal with convergence problems is to choose larger values for tol, which is a parameter that controls the tolerance with regard to the changes in the within-cluster sum-squared-error to declare convergence. In the preceding code, we chose a tolerance of 1e-04 (= 0.0001).

A problem with k-means is that one or more clusters can be empty. However, this problem is accounted for in the current k-means implementation in scikit-learn. If a cluster is empty, the algorithm will search for the sample that is farthest away from the centroid of the empty cluster. Then it will reassign the centroid to be this farthest point.

Now that we have predicted the cluster labels y_km, let's visualize the clusters that k-means identified in the dataset together with the cluster centroids. These are stored under the cluster_centers_ attribute of the fitted KMeans object:

# plot the 3 clusters
plt.scatter(
    X[y_km == 0, 0], X[y_km == 0, 1],
    s=50, c='lightgreen',
    marker='s', edgecolor='black',
    label='cluster 1'
)

plt.scatter(
    X[y_km == 1, 0], X[y_km == 1, 1],
    s=50, c='orange',
    marker='o', edgecolor='black',
    label='cluster 2'
)

plt.scatter(
    X[y_km == 2, 0], X[y_km == 2, 1],
    s=50, c='lightblue',
    marker='v', edgecolor='black',
    label='cluster 3'
)

# plot the centroids
plt.scatter(
    km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
    s=250, marker='*',
    c='red', edgecolor='black',
    label='centroids'
)
plt.legend(scatterpoints=1)
plt.grid()
plt.show()

In the resulting scatterplot, we can see that k-means placed the three centroids at the center of each sphere, which looks like a reasonable grouping given this dataset.

The Elbow Method

Although k-means worked well on this toy dataset, it is important to reiterate that a drawback of k-means is that we have to specify the number of clusters, k, before we know what the optimal k is. The number of clusters to choose may not always be so obvious in real-world applications, especially if we are working with a higher dimensional dataset that cannot be visualized.

The elbow method is a useful graphical tool to estimate the optimal number of clusters k for a given task. Intuitively, we can say that, if k increases, the within-cluster SSE (“distortion”) will decrease. This is because the samples will be closer to the centroids they are assigned to.

The idea behind the elbow method is to identify the value of k at which the distortion stops dropping sharply and begins to level off (the "elbow" of the curve), which will become clearer if we plot the distortion for different values of k:

# calculate distortion for a range of cluster counts
distortions = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(X)
    distortions.append(km.inertia_)

# plot
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()

As we can see in the resulting plot, the elbow is located at k = 3, which is evidence that k = 3 is indeed a good choice for this dataset.

I hope you enjoyed this tutorial on the k-means algorithm! We explored the basic concepts and mathematics behind the k-means algorithm, how to implement k-means, and how to select an optimal number of clusters, k.

If you’d like to learn more, Next Tech’s Python Machine Learning (Part 3) course further explores clustering algorithms and techniques such as:

Silhouette plots, another method used to select the optimal k
k-means++, a variant of k-means that improves clustering results through more clever seeding of the initial cluster centers.
Other categories of clustering algorithms, such as hierarchical and density-based clustering, that do not require us to specify the number of clusters upfront or assume spherical structures in our dataset.

The course also explores regression analysis, sentiment analysis, and how to deploy a dynamic machine learning model to a web application. You can get started here!

Principal Component Analysis for Dimensionality Reduction

Lorraine — Fri, 24 May 2019 17:37:17 +0000

In the modern age of technology, increasing amounts of data are produced and collected. In machine learning, however, too much data can be a bad thing. At a certain point, more features or dimensions can decrease a model’s accuracy since there is more data that needs to be generalized — this is known as the curse of dimensionality.

Dimensionality reduction is a way to reduce the complexity of a model and avoid overfitting. There are two main categories of dimensionality reduction: feature selection and feature extraction. Via feature selection, we select a subset of the original features, whereas in feature extraction, we derive information from the feature set to construct a new feature subspace.

In this tutorial we will explore feature extraction. In practice, feature extraction is not only used to improve storage space or the computational efficiency of the learning algorithm, but can also improve the predictive performance by reducing the curse of dimensionality—especially if we are working with non-regularized models.

Specifically, we will discuss the Principal Component Analysis (PCA) algorithm used to compress a dataset onto a lower-dimensional feature subspace with the goal of maintaining most of the relevant information. We will explore:

The concepts and mathematics behind PCA
How to execute PCA step-by-step from scratch using Python
How to execute PCA using the Python library scikit-learn

Let’s get started!

This tutorial is adapted from Part 2 of Next Tech’s Python Machine Learning series, which takes you through machine learning and deep learning algorithms with Python from 0 to 100. It includes an in-browser sandboxed environment with all the necessary software and libraries pre-installed, and projects using public datasets. You can get started for free here!

Introduction to Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised linear transformation technique that is widely used across different fields, most prominently for feature extraction and dimensionality reduction. Other popular applications of PCA include exploratory data analyses and de-noising of signals in stock market trading, and the analysis of genome data and gene expression levels in the field of bioinformatics.

PCA helps us to identify patterns in data based on the correlation between features. In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions than the original one.

The orthogonal axes (principal components) of the new subspace can be interpreted as the directions of maximum variance given the constraint that the new feature axes are orthogonal to each other, as illustrated in the following figure:

In the preceding figure, x1 and x2 are the original feature axes, and PC1 and PC2 are the principal components.

If we use PCA for dimensionality reduction, we construct a d x k–dimensional transformation matrix W that allows us to map a sample vector x onto a new k–dimensional feature subspace that has fewer dimensions than the original d–dimensional feature space:
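In LaTeX notation this mapping can be written as follows, where the symbol z for the transformed sample is an assumption introduced here:

```latex
\mathbf{z} = \mathbf{x}\mathbf{W}, \qquad
\mathbf{x} \in \mathbb{R}^{1 \times d}, \quad
\mathbf{W} \in \mathbb{R}^{d \times k}, \quad
\mathbf{z} \in \mathbb{R}^{1 \times k}
```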

As a result of transforming the original d-dimensional data onto this new k-dimensional subspace (typically k ≪ d), the first principal component will have the largest possible variance, and each subsequent principal component will have the largest variance possible given the constraint that it is uncorrelated with (orthogonal to) the other principal components. Even if the input features are correlated, the resulting principal components will be mutually orthogonal (uncorrelated).

Note that the PCA directions are highly sensitive to data scaling, and we need to standardize the features prior to PCA if the features were measured on different scales and we want to assign equal importance to all features.

Before looking at the PCA algorithm for dimensionality reduction in more detail, let’s summarize the approach in a few simple steps:

Standardize the d-dimensional dataset.
Construct the covariance matrix.
Decompose the covariance matrix into its eigenvectors and eigenvalues.
Sort the eigenvalues by decreasing order to rank the corresponding eigenvectors.
Select k eigenvectors which correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
Construct a projection matrix W from the “top” k eigenvectors.
Transform the d-dimensional input dataset X using the projection matrix W to obtain the new k-dimensional feature subspace.
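Before working through the Wine data, the seven steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch on synthetic data: the helper name pca_project and the random demo matrix are assumptions made for this example, and it uses np.linalg.eigh (which is intended for symmetric matrices) rather than the np.linalg.eig call used later in this tutorial.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples x d) onto its top-k principal components."""
    # Step 1: standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: construct the d x d covariance matrix
    cov = np.cov(X_std.T)
    # Step 3: decompose it into eigenvalues and eigenvectors
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # Step 4: rank the eigenpairs by decreasing eigenvalue
    order = np.argsort(eig_vals)[::-1]
    # Steps 5-6: build the d x k projection matrix W from the top k eigenvectors
    W = eig_vecs[:, order[:k]]
    # Step 7: transform the data onto the k-dimensional subspace
    return X_std.dot(W), eig_vals[order]

# Synthetic demo data (made up for illustration): 50 samples, 4 features,
# with the second feature strongly correlated with the first
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
X_demo[:, 1] = 2 * X_demo[:, 0] + 0.1 * rng.normal(size=50)

X_proj, sorted_vals = pca_project(X_demo, k=2)
print(X_proj.shape)
print(sorted_vals)
```

The projected array has one row per sample and k columns, and the returned eigenvalues are already sorted from largest to smallest.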

Let’s perform a PCA step by step, using Python as a learning exercise. Then, we will see how to perform a PCA more conveniently using scikit-learn.

Extracting the Principal Components Step By Step

We will be using the Wine dataset from The UCI Machine Learning Repository in our example. This dataset consists of 178 wine samples with 13 features describing their different chemical properties. You can find out more here.

In this section we will tackle the first four steps of a PCA; later we will go over the last three. You can follow along with the code in this tutorial by using a Next Tech sandbox, which has all the necessary libraries pre-installed, or if you’d prefer, you can run the snippets in your own local environment.

Once your sandbox loads, we will start by loading the Wine dataset directly from the repository:

import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)
df_wine.head()

Next, we will process the Wine data into separate training and test sets — using a 70:30 split — and standardize it to unit variance:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# split into training and testing sets
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3,
    stratify=y, random_state=0
)
# standardize the features
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

After completing the mandatory preprocessing, let’s advance to the second step: constructing the covariance matrix. The symmetric d x d-dimensional covariance matrix, where d is the number of dimensions in the dataset, stores the pairwise covariances between the different features. For example, the covariance between two features xj and xk on the population level can be calculated via the following equation:

Here, μj and μk are the sample means of features j and k, respectively.

Note that the sample means are zero if we standardized the dataset. A positive covariance between two features indicates that the features increase or decrease together, whereas a negative covariance indicates that the features vary in opposite directions. For example, the covariance matrix of three features can then be written as follows (note that Σ stands for the Greek uppercase letter sigma, which is not to be confused with the sum symbol):

Σ = [ σ1²  σ12  σ13 ]
    [ σ21  σ2²  σ23 ]
    [ σ31  σ32  σ3² ]

The eigenvectors of the covariance matrix represent the principal components (the directions of maximum variance), whereas the corresponding eigenvalues will define their magnitude. In the case of the Wine dataset, we would obtain 13 eigenvectors and eigenvalues from the 13 x 13-dimensional covariance matrix.
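As a quick check of the covariance description above, the following sketch computes a covariance matrix by hand on made-up standardized data and compares it with np.cov. The demo array is an assumption for illustration only. Note that np.cov divides by n - 1 by default, so the manual version does the same.

```python
import numpy as np

# Made-up data for illustration: 100 samples, 3 features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

n = X_std.shape[0]
# After standardization the feature means are zero, so each covariance
# reduces to a sum of products of the feature values
cov_manual = X_std.T.dot(X_std) / (n - 1)
cov_numpy = np.cov(X_std.T)

matches = np.allclose(cov_manual, cov_numpy)
is_symmetric = np.allclose(cov_numpy, cov_numpy.T)
print(matches, is_symmetric)
```

The matrix is symmetric because the covariance between features j and k is the same as the covariance between k and j.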

Now, for our third step, let’s obtain the eigenpairs of the covariance matrix. An eigenvector v satisfies the following condition:

Σv = λv

Here, λ is a scalar: the eigenvalue. Since the manual computation of eigenvectors and eigenvalues is a somewhat tedious and elaborate task, we will use the linalg.eig function from NumPy to obtain the eigenpairs of the Wine covariance matrix:

import numpy as np

cov_mat = np.cov(X_train_std.T)
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)

Using the numpy.cov function, we computed the covariance matrix of the standardized training dataset. Using the linalg.eig function, we performed the eigendecomposition, which yielded a vector (eigen_vals) consisting of 13 eigenvalues and the corresponding eigenvectors stored as columns in a 13 x 13-dimensional matrix (eigen_vecs).
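To make the eigenvector condition concrete, here is a small sketch that verifies it numerically on a made-up covariance matrix. It uses np.linalg.eigh, a NumPy routine for symmetric matrices that returns real eigenvalues in ascending order; this is a swap from the np.linalg.eig call above, chosen here because a covariance matrix is always symmetric. The demo data is an assumption for illustration.

```python
import numpy as np

# Made-up data for illustration: 20 samples, 4 features
rng = np.random.default_rng(2)
A = rng.normal(size=(20, 4))
cov = np.cov(A.T)

# eigh returns eigenvalues in ascending order, eigenvectors as columns
vals, vecs = np.linalg.eigh(cov)

# Every eigenpair should satisfy cov @ v == lambda * v
pairs_ok = all(np.allclose(cov.dot(v), lam * v)
               for lam, v in zip(vals, vecs.T))

# The eigenvalues sum to the trace, i.e. the total variance
trace_ok = np.isclose(vals.sum(), np.trace(cov))
print(pairs_ok, trace_ok)
```

Because the eigenvalues add up to the total variance, each one can be read as the share of variance along its eigenvector.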

Total and Explained Variance

Since we want to reduce the dimensionality of our dataset by compressing it onto a new feature subspace, we only select the subset of the eigenvectors (principal components) that contains most of the information (variance). The eigenvalues define the magnitude of the eigenvectors, so we have to sort the eigenvalues by decreasing magnitude; we are interested in the top k eigenvectors based on the values of their corresponding eigenvalues.

But before we collect those k most informative eigenvectors, let’s plot the variance explained ratios of the eigenvalues. The variance explained ratio of an eigenvalue λj is simply the fraction of an eigenvalue λj and the total sum of the eigenvalues:

λj / (λ1 + λ2 + … + λd)

Using the NumPy cumsum function, we can then calculate the cumulative sum of explained variances, which we will then plot via matplotlib’s step function:

import matplotlib.pyplot as plt

# calculate cumulative sum of explained variances
tot = sum(eigen_vals)
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

# plot explained variances
plt.bar(range(1,14), var_exp, alpha=0.5,
        align='center', label='individual explained variance')
plt.step(range(1,14), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.show()

The resulting plot indicates that the first principal component alone accounts for approximately 40% of the variance. Also, we can see that the first two principal components combined explain almost 60% of the variance in the dataset.
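The same calculation is easy to follow on a toy example. The sketch below uses four made-up eigenvalues (not the Wine values) and also shows one possible way to pick the smallest k that reaches a chosen variance threshold; the 80% threshold is an arbitrary assumption for illustration.

```python
import numpy as np

# Made-up eigenvalues for illustration only
toy_eigen_vals = np.array([1.0, 4.0, 2.0, 1.0])

# Sort in decreasing order and divide by the total
sorted_vals = np.sort(toy_eigen_vals)[::-1]
var_exp = sorted_vals / sorted_vals.sum()
cum_var_exp = np.cumsum(var_exp)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.8
k = int(np.searchsorted(cum_var_exp, threshold) + 1)

print(var_exp)
print(cum_var_exp)
print(k)
```

Here the ratios are 0.5, 0.25, 0.125, and 0.125, so three components are needed to explain at least 80% of the variance.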

Feature Transformation

After we have successfully decomposed the covariance matrix into eigenpairs, let’s now proceed with the last three steps of PCA to transform the Wine dataset onto the new principal component axes.

We will sort the eigenpairs by descending order of the eigenvalues, construct a projection matrix from the selected eigenvectors, and use the projection matrix to transform the data onto the lower-dimensional subspace.

We start by sorting the eigenpairs by decreasing order of the eigenvalues:

# Make a list of (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i]) for i in range(len(eigen_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eigen_pairs.sort(key=lambda k: k[0], reverse=True)

Next, we collect the two eigenvectors that correspond to the two largest eigenvalues, to capture about 60% of the variance in this dataset. Note that we only chose two eigenvectors for the purpose of illustration, since we are going to plot the data via a two-dimensional scatter plot later in this subsection. In practice, the number of principal components has to be determined by a trade-off between computational efficiency and the performance of the classifier:

w = np.hstack((eigen_pairs[0][1][:, np.newaxis], eigen_pairs[1][1][:, np.newaxis]))
print('Matrix W:\n', w)

[Out:]
Matrix W:
 [[-0.13724218  0.50303478]
 [ 0.24724326  0.16487119]
 [-0.02545159  0.24456476]
 [ 0.20694508 -0.11352904]
 [-0.15436582  0.28974518]
 [-0.39376952  0.05080104]
 [-0.41735106 -0.02287338]
 [ 0.30572896  0.09048885]
 [-0.30668347  0.00835233]
 [ 0.07554066  0.54977581]
 [-0.32613263 -0.20716433]
 [-0.36861022 -0.24902536]
 [-0.29669651  0.38022942]]

By executing the preceding code, we have created a 13 x 2-dimensional projection matrix W from the top two eigenvectors.

Using the projection matrix, we can now transform a sample x (represented as a 1 x 13-dimensional row vector) onto the PCA subspace (the principal components one and two) obtaining x′, now a two-dimensional sample vector consisting of two new features:

X_train_std[0].dot(w)

Similarly, we can transform the entire 124 x 13-dimensional training dataset onto the two principal components by calculating the matrix dot product:

X_train_pca = X_train_std.dot(w)

Lastly, let’s visualize the transformed Wine training set, now stored as a 124 x 2-dimensional matrix, in a two-dimensional scatterplot:

colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']
for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train==l, 0], 
                X_train_pca[y_train==l, 1], 
                c=c, label=l, marker=m) 
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()

As we can see in the resulting plot, the data is more spread along the x-axis — the first principal component — than the second principal component (y-axis), which is consistent with the explained variance ratio plot that we created previously. However, we can intuitively see that a linear classifier will likely be able to separate the classes well.

Although we encoded the class label information for the purpose of illustration in the preceding scatter plot, we have to keep in mind that PCA is an unsupervised technique that does not use any class label information.

PCA in `scikit-learn`

Although the verbose approach in the previous subsection helped us to follow the inner workings of PCA, we will now discuss how to use the PCA class implemented in scikit-learn. The PCA class is another one of scikit-learn’s transformer classes, where we first fit the model using the training data before we transform both the training data and the test dataset using the same model parameters.

Let’s use the PCA class on the Wine training dataset and classify the transformed samples via logistic regression:

from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# initialize pca and logistic regression model
pca = PCA(n_components=2)
lr = LogisticRegression(multi_class='auto', solver='liblinear')

# fit and transform data
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
lr.fit(X_train_pca, y_train)

Now, using a custom plot_decision_regions function, we will visualize the decision regions:

from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], 
                    y=X[y == cl, 1],
                    alpha=0.6, 
                    c=[cmap(idx)],
                    edgecolor='black',
                    marker=markers[idx], 
                    label=cl)

# plot decision regions for training set
plot_decision_regions(X_train_pca, y_train, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()

By executing the preceding code, we should now see the decision regions for the training data reduced to two principal component axes.

For the sake of completeness, let’s plot the decision regions of the logistic regression on the transformed test dataset as well to see if it can separate the classes well:

# plot decision regions for test set
plot_decision_regions(X_test_pca, y_test, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()

After we plotted the decision regions for the test set by executing the preceding code, we can see that logistic regression performs quite well on this small two-dimensional feature subspace and only misclassifies very few samples in the test dataset.
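To put a number on "very few samples", one option is to compute the accuracy on both sets. The sketch below is self-contained and makes a few assumptions: it loads the Wine data through scikit-learn's bundled load_wine function instead of the UCI URL, chains the steps with make_pipeline, and uses LogisticRegression with its default solver rather than the liblinear setting shown above, so the exact numbers may differ slightly from the plots.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the Wine data: 178 samples, 13 features
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Standardize, reduce to two principal components, then classify
model = make_pipeline(StandardScaler(),
                      PCA(n_components=2),
                      LogisticRegression())
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f'Training accuracy: {train_acc:.3f}')
print(f'Test accuracy: {test_acc:.3f}')
```

Wrapping the steps in a pipeline ensures the scaler and PCA are fitted on the training data only and then reused unchanged on the test data.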

If we are interested in the explained variance ratios of the different principal components, we can simply initialize the PCA class with the n_components parameter set to None, so all principal components are kept and the explained variance ratio can then be accessed via the explained_variance_ratio_ attribute:

pca = PCA(n_components=None)
X_train_pca = pca.fit_transform(X_train_std)
pca.explained_variance_ratio_

Note that we set n_components=None when we initialized the PCA class so that it will return all principal components in a sorted order instead of performing a dimensionality reduction.
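As a sanity check, the ratios reported by scikit-learn should agree with the ones obtained from the eigenvalues of the covariance matrix, as in the step-by-step approach. The following sketch compares the two on made-up standardized data; the demo array is an assumption for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data for illustration: 60 samples, 5 features, two of them correlated
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(60, 5))
X_demo[:, 2] += 1.5 * X_demo[:, 0]
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Manual route: eigenvalues of the covariance matrix, largest first
eigen_vals = np.linalg.eigvalsh(np.cov(X_std.T))[::-1]
manual_ratio = eigen_vals / eigen_vals.sum()

# scikit-learn route
pca_all = PCA(n_components=None).fit(X_std)
same = np.allclose(manual_ratio, pca_all.explained_variance_ratio_)
print(same)
```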

I hope you enjoyed this tutorial on principal component analysis for dimensionality reduction! We covered the mathematics behind the PCA algorithm, how to perform PCA step-by-step with Python, and how to implement PCA using scikit-learn. Other techniques for dimensionality reduction are Linear Discriminant Analysis (LDA) and Kernel PCA (used for non-linearly separable data).

These other techniques and more topics to improve model performance, such as data preprocessing, model evaluation, hyperparameter tuning, and ensemble learning techniques are covered in Next Tech’s Python Machine Learning (Part 2) course.

You can get started here for free!

Tracking System Metrics with collectd

Lorraine — Wed, 22 May 2019 17:32:38 +0000

Collecting system metrics allows system administrators to monitor available resources, detect bottlenecks, and make informed decisions about servers and projects.

At Next Tech, we use system metrics we gather for many things, such as:

Detecting an issue, like a server that doesn't have enough resources.
Conversely, identifying areas where we could save money by reducing a server's resources.
Ensuring that all services (like a web server or database) are running.

In this tutorial we will go over how to install and configure collectd, which is an open source daemon that collects system performance statistics and provides ways to store and publish them. Then, using collectd's write_http plugin, we will send our metrics data to a Flask application using HTTP POST requests.

Let's get started!

Step 1: Load a Python environment

The easiest way to jump into a Python sandbox is using the Next Sandbox, which gives you access to an online computing environment in a couple of seconds. You can click here to launch one, or here to read about how you can install Python on your computer.

Step 2: Installing collectd

The first step is to install collectd. You can do this by entering the following commands in your terminal:

apt-get update
apt-get install collectd

Once collectd is installed successfully, move on to the next step — configuring the daemon!

Step 3: Configuring collectd

We need to configure collectd so that it knows what data to collect and how to send the values collected.

The collectd configuration file can be found at /etc/collectd/collectd.conf. If you know Vim, you can modify the configuration file directly by running:

vim /etc/collectd/collectd.conf

Otherwise, we will link the config file to one that is easily editable in the Sandbox. To do so, run the following three commands in your terminal to link the original configuration file to our new one:

cp /etc/collectd/collectd.conf /root/sandbox/collectd.conf
rm /etc/collectd/collectd.conf
ln -s /root/sandbox/collectd.conf /etc/collectd/collectd.conf

Now, open the file and take a look at the default collectd configuration. There are four sections in this file: Global, Logging, LoadPlugin Section, and Plugin Configuration.

Global

The first part of the file displays the Global Settings. The lines beginning with a hash (#) are commented out — we will remove some of these hashes so that these settings are as follows:

Hostname "localhost"
FQDNLookup true
BaseDir "/var/lib/collectd"
PluginDir "/usr/lib/collectd"
#TypesDB "/usr/share/collectd/types.db" "/etc/collectd/my_types.db"

AutoLoadPlugin false

CollectInternalStats false

Interval 10

Logging

The next section of the configuration file displays plugins used for logging messages generated by the daemon when it is initialized and when loading or configuring other plugins.

For each plugin in this section (and the next), there is a LoadPlugin line in the configuration, followed by the plugin's options. Almost all of these lines are commented out in order to keep the default configuration lean.

Only one log plugin should be enabled. We will be using logfile.

Remove the hash before the LoadPlugin logfile line and edit the plugin configuration to write the output to this file by changing the File parameter. This section should look like this:

LoadPlugin logfile

<Plugin logfile>
    LogLevel "info"
    File "/root/sandbox/collectd.log"
    PrintSeverity true
</Plugin>

Make sure the default logging plugin syslog is either removed or commented out.

LoadPlugin section

The next section displays a list of features. By default the following plugins are enabled:

Plugin	Description
battery	collects the battery's charge, the drawn current and the battery's voltage.
cpu	collects the amount of time spent by the CPU in various states, e.g. executing user code, executing system code, waiting for IO-operations and being idle.
df	collects file system usage information, i.e. how much space on a mounted partition is used and how much is available.
disk	collects performance statistics of hard-disks and, where supported, partitions.
entropy	collects the available entropy on a system.
interface	collects information about the traffic (octets per second), packets per second and errors of interfaces (number of errors during one second).
irq	collects the number of times each interrupt has been handled by the operating system.
load	collects the system load (the number of runnable tasks in the run-queue). These numbers give a rough overview over the utilization of a machine.
memory	collects physical memory utilization - Used, Buffered, Cached, Free.
processes	collects the number of processes, grouped by their state (e. g. running, sleeping, zombies, etc.). It can also gather detailed statistics about selected processes.
swap	collects the amount of memory currently written onto hard disk or whatever the system calls “swap”.
users	counts the number of users currently logged into the system.

Our collectd daemon will automatically collect data using these plugins. A full list of the available plugins and a short description of each can be found here.

We also want to enable the write_http plugin so that collectd will know where to send the data it collects:

Plugin	Description
write_http	sends values collected by collectd to a web-server using HTTP POST requests.

Find this plugin in the list and remove the hash to enable it.

Plugin configuration

The final section shows the configuration for all the listed plugins. You can see all the plugin options in the collectd.conf(5) manual.

Find the plugin configuration for write_http. We will modify this to look like the following:

<Plugin "write_http">
    <Node "example">
        URL "http://127.0.0.1:5000"
        Format JSON
    </Node>
</Plugin>

Note that we specified the format of our output to be in JSON.

We will not be modifying any of the other plugin configurations but feel free to do so on your own.

Step 4: Verifying configuration

It is important to restart collectd whenever the configuration file is changed. Run the following command in your terminal to do so:

systemctl restart collectd

You can also run this command to check the status of collectd:

systemctl status collectd

If all is working, you should see Active: active (running) in the output.

You should also be able to see that collectd was initialized and plugins were loaded in your collectd.log file.

Finally, we can verify whether there are issues in the configuration file by running the following:

collectd -t ; echo $?

This command tests the configuration and then exits; echo $? prints its exit status, which should be 0.

Step 5: Creating a Flask app

We've gotten collectd to track our metrics properly...now let's create a web application using Flask so collectd can send this data via HTTP POST requests!

Flask is a lightweight microframework for creating web applications with Python. It comes with a built-in development server and is well suited to building RESTful web services. Its route decorator binds a function to a URL and accepts the allowed HTTP methods as an argument, which makes it straightforward to build APIs.

First, let's install Flask by running the following:

pip3 install Flask

Now, create a new file called flask_app.py.

There are three steps to write this program:

Create a WSGI application instance, as every application in Flask needs one to handle requests.
Define a route method which associates a URL and the function which handles it.
Activate the application's server.

Copy the following code into the flask_app.py file in your directory:

from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def get_data():
    print(request.data)
    return 'This is working!'

This snippet covers the first two steps: we created a WSGI application instance using Flask's Flask class, and then we used the app.route() decorator to define a route that maps the path '/' to the function get_data, which processes the request.

Within the app.route() decorator, we specified the request methods as GET and POST. Then, our get_data function prints the incoming request data using request.data.
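Since the write_http plugin was configured with Format JSON, the raw bytes in request.data can be decoded into Python objects. The sketch below shows one possible way to do that with the standard library. The helper name summarize_metrics and the sample payload are assumptions made for illustration: the field names follow collectd's JSON output as we understand it, but the values are invented, so check them against what your own daemon sends.

```python
import json

# Hypothetical request body for illustration; the values are made up
sample_body = b'''[
  {"values": [0.1, 0.2, 0.3],
   "dstypes": ["gauge", "gauge", "gauge"],
   "dsnames": ["shortterm", "midterm", "longterm"],
   "time": 1558546358.0, "interval": 10.0, "host": "localhost",
   "plugin": "load", "plugin_instance": "",
   "type": "load", "type_instance": ""}
]'''

def summarize_metrics(body):
    """Return (plugin, data source name, value) tuples from a JSON body."""
    rows = []
    for metric in json.loads(body):
        # Each metric carries parallel lists of names and values
        for name, value in zip(metric['dsnames'], metric['values']):
            rows.append((metric['plugin'], name, value))
    return rows

rows = summarize_metrics(sample_body)
for row in rows:
    print(row)
```

Inside get_data, a call such as summarize_metrics(request.data) could replace the plain print if you want one line per value instead of the raw payload.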

Continue on to the final step in our lesson to activate the application's server to see our data!

Step 6: Running a Flask app

To run your Flask application, you need to first tell the terminal what application to work with by exporting the FLASK_APP environment variable:

export FLASK_APP=flask_app.py

Then, execute the following to enable the development environment, including the interactive debugger and reloader (FLASK_ENV was removed in Flask 2.3; on newer versions, use export FLASK_DEBUG=1 instead):

export FLASK_ENV=development

Finally, we can run the application with the following:

flask run

After the app starts running, you should see your collectd data arriving in your terminal as JSON output!

Summary

In this tutorial we covered:

Installing collectd
Configuring multiple collectd plugins
Building a basic Flask application
Receiving data from collectd inside of Flask

We hope you enjoyed this tutorial and learned something new! If you have any comments or questions, don't hesitate to drop us a note below.

This tutorial was extracted from a lesson on Next Tech. If you're interested in exploring the other courses we have, come take a look!

Classification and Regression Analysis with Decision Trees

Lorraine — Wed, 15 May 2019 16:11:53 +0000

A decision tree is a supervised machine learning model used to predict a target by learning decision rules from features. As the name suggests, we can think of this model as breaking down our data by making a decision based on asking a series of questions.

Let's consider the following example in which we use a decision tree to decide upon an activity on a particular day:

Based on the features in our training set, the decision tree model learns a series of questions to infer the class labels of the samples. As we can see, decision trees are attractive models if we care about interpretability.

Although the preceding figure illustrates the concept of a decision tree based on categorical variables (classification), the same concept applies if our features are real numbers (regression).

In this tutorial, we will discuss how to build a decision tree model with Python’s scikit-learn library. We will cover:

The fundamental concepts of decision trees
The mathematics behind the decision tree learning algorithm
Information gain and impurity measures
Classification trees
Regression trees

Let’s get started!

This tutorial is adapted from Next Tech’s Python Machine Learning series which takes you through machine learning and deep learning algorithms with Python from 0 to 100. It includes an in-browser sandboxed environment with all the necessary software and libraries pre-installed, and projects using public datasets. You can get started here!

The Fundamentals of Decision Trees

A decision tree is constructed by recursive partitioning — starting from the root node (known as the first parent), each node can be split into left and right child nodes. These nodes can then be further split and they themselves become parent nodes of their resulting children nodes.

For example, looking at the image above, the root node is Work to do? and splits into the child nodes Stay in and Outlook based on whether or not there is work to do. The Outlook node further splits into three child nodes.

So, how do we know what the optimal splitting point is at each node?

Starting from the root, the data is split on the feature that results in the largest Information Gain (IG) (explained in more detail below). In an iterative process, we then repeat this splitting procedure at each child node until the leaves are pure — i.e. samples at each node all belong to the same class.

In practice, this can result in a very deep tree with many nodes, which can easily lead to overfitting. Thus, we typically want to prune the tree by setting a limit for the maximal depth of the tree.
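The effect of a depth limit is easy to see in code. The following sketch, which assumes the Iris dataset bundled with scikit-learn (introduced later in this tutorial) and an arbitrary limit of 2, compares an unrestricted tree with a depth-limited one.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Unrestricted tree: keeps splitting until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

# Depth-limited tree: stops splitting after two levels
shallow_tree = DecisionTreeClassifier(max_depth=2,
                                      random_state=1).fit(X_demo, y_demo)

print('unrestricted:', full_tree.get_depth(), 'levels,',
      full_tree.get_n_leaves(), 'leaves')
print('max_depth=2: ', shallow_tree.get_depth(), 'levels,',
      shallow_tree.get_n_leaves(), 'leaves')
```

The limited tree has fewer leaves and therefore fewer decision rules, which usually makes it less prone to memorizing the training data.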

Maximizing Information Gain

In order to split the nodes at the most informative features, we need to define an objective function that we want to optimize via the tree learning algorithm. Here, our objective function is to maximize the information gain at each split, which we define as follows:

IG(Dp, f) = I(Dp) − (Nleft / Np) · I(Dleft) − (Nright / Np) · I(Dright)

Here, f is the feature to perform the split, Dp, Dleft, and Dright are the dataset of the parent and child nodes, I is the impurity measure, Np is the total number of samples at the parent node, and Nleft and Nright are the number of samples in the child nodes.

We will discuss impurity measures for classification and regression decision trees in more detail in our examples below. But for now, just understand that information gain is simply the difference between the impurity of the parent node and the sum of the child node impurities — the lower the impurity of the child nodes, the larger the information gain.

Note that the above equation is for binary decision trees, where each parent node is split into two child nodes only. If a parent node is split into more than two child nodes, you would subtract the weighted impurity of every child node in the same way.
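The information gain of a binary split can be computed directly from this definition. The sketch below is a minimal illustration using entropy (covered in the next section) as the impurity measure; the function names and the toy label arrays are assumptions made for this example.

```python
import numpy as np

def entropy(labels):
    """Entropy of a set of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * np.log2(1 / p))

def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the weighted impurities of the two children."""
    n_p = len(parent)
    return (impurity(parent)
            - len(left) / n_p * impurity(left)
            - len(right) / n_p * impurity(right))

# Toy parent node with four samples of each class
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A split that separates the classes perfectly
perfect = information_gain(parent, parent[:4], parent[4:])

# A split that leaves both children as mixed as the parent
mixed = information_gain(parent, parent[[0, 1, 4, 5]], parent[[2, 3, 6, 7]])

print(perfect, mixed)
```

The perfect split yields a gain of 1 bit, while the uninformative split yields a gain of 0, which is why the tree prefers the former.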

Classification Trees

We will start by talking about classification decision trees (also known as classification trees). For this example, we will be using the Iris dataset, a classic in the field of machine learning. It contains the measurements of 150 Iris flowers from three different species: Setosa, Versicolor, and Virginica. These will be our targets. Our goal is to predict which category an Iris flower belongs to. The petal length and width in centimeters are stored as columns, which we also call the features of the dataset.

Let’s first import the dataset and assign the features as X and the target as y:



from sklearn import datasets

iris = datasets.load_iris()                        # Load iris dataset

X = iris.data[:, [2, 3]]                           # Assign matrix X
y = iris.target                                    # Assign vector y

Using scikit-learn, we will now train a decision tree with a maximum depth of 4. The code is as follows:



from sklearn.tree import DecisionTreeClassifier    # Import decision tree classifier model

tree = DecisionTreeClassifier(criterion='entropy', # Initialize and fit classifier
    max_depth=4, random_state=1)
tree.fit(X, y)

Notice that we set the criterion as ‘entropy’. This criterion is known as the impurity measure (mentioned in the previous section). In classification, entropy is the most common impurity measure or splitting criterion. It is defined by:

I_H(t) = − Σi p(i|t) · log2 p(i|t)

Here, p(i|t) is the proportion of the samples that belong to class i for a particular node t. The entropy is therefore 0 if all samples at a node belong to the same class, and the entropy is maximal if we have a uniform class distribution.

For a more visual understanding of entropy, let’s plot the impurity index for the probability range [0, 1] for class 1. The code is as follows:



import numpy as np
import matplotlib.pyplot as plt

def entropy(p):
    return - p * np.log2(p) - (1 - p) * np.log2(1 - p)

x = np.arange(0.0, 1.0, 0.01)                      # Create dummy data
e = [entropy(p) if p != 0 else None for p in x]    # Calculate entropy

plt.plot(x, e, label='entropy', color='r')         # Plot impurity indices
for y_line in [0.5, 1.0]:                          # Avoid overwriting the target vector y
    plt.axhline(y=y_line, linewidth=1,
                color='k', linestyle='--')
plt.xlabel('p(i=1)')
plt.ylabel('Impurity Index')
plt.legend()
plt.show()

As you can see, entropy is 0 if p(i=1|t) = 1. If the classes are distributed uniformly with p(i=1|t) = 0.5, entropy is 1.
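The plot covers the two-class case, but the same definition works for any number of classes. As a small sketch, the hypothetical helper below computes entropy from a list of class proportions; for three classes, as in the Iris data, the maximum is log2(3), or about 1.585, rather than 1.

```python
import numpy as np

def entropy_from_proportions(p):
    """Entropy in bits for a list of class proportions that sum to 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # Classes with zero share contribute nothing
    # Equivalent to -sum(p * log2(p))
    return np.sum(p * np.log2(1 / p))

pure = entropy_from_proportions([1.0, 0.0, 0.0])
two_class_uniform = entropy_from_proportions([0.5, 0.5])
three_class_uniform = entropy_from_proportions([1/3, 1/3, 1/3])

print(pure, two_class_uniform, three_class_uniform)
```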

Now, returning to our Iris example, we will visualize our trained classification tree and see how entropy decides each split.

A nice feature in scikit-learn is that it allows us to export the decision tree as a .dot file after training, which we can visualize using GraphViz, for example. In addition to GraphViz, we will use a Python library called pydotplus, which has capabilities similar to GraphViz and allows us to convert .dot data files into a decision tree image file.

You can install pydotplus and graphviz by executing the following commands in your Terminal:



pip3 install pydotplus
apt install graphviz

The following code will create an image of our decision tree in PNG format:



from pydotplus.graphviz import graph_from_dot_data
from sklearn.tree import export_graphviz

dot_data = export_graphviz(                           # Create dot data
    tree, filled=True, rounded=True,
    class_names=['Setosa', 'Versicolor','Virginica'],
    feature_names=['petal length', 'petal width'],
    out_file=None
)

graph = graph_from_dot_data(dot_data)                 # Create graph from dot data
graph.write_png('tree.png')                           # Write graph to PNG image

Looking at the resulting decision tree figure saved in the image file tree.png, we can now nicely trace back the splits that the decision tree determined from our training dataset. We started with 150 samples at the root and split them into two child nodes with 50 and 100 samples, using the petal measurement cut-off shown in the root node. After the first split, we can see that the left child node is already pure and only contains samples from the setosa class (entropy = 0). The further splits on the right are then used to separate the samples from the versicolor and virginica class.

Looking at the final entropy we see that the decision tree with a depth of 4 does a very good job of separating the flower classes.
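If GraphViz is not available, the learned rules can also be inspected as plain text. The sketch below retrains the same kind of tree in a self-contained way and prints it with scikit-learn's export_text function; the variable names are chosen for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = load_iris()
X_petal = iris_data.data[:, [2, 3]]        # Petal length and petal width

clf = DecisionTreeClassifier(criterion='entropy',
                             max_depth=4, random_state=1)
clf.fit(X_petal, iris_data.target)

# Text rendering of the splits, one line per node
rules = export_text(clf, feature_names=['petal length', 'petal width'])
print(rules)

train_acc = clf.score(X_petal, iris_data.target)
print('training accuracy:', train_acc)
```

Keep in mind that this accuracy is measured on the same samples the tree was trained on, so it says little about performance on unseen flowers.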

Regression Trees

We will be using the Boston Housing dataset for our regression example. This is another very popular dataset which contains information about houses in the suburbs of Boston. There are 506 samples and 14 attributes. For simplicity and visualization purposes, we will only use two — MEDV (median value of owner-occupied homes in $1000s) as the target and LSTAT (percentage of lower status of the population) as the feature.

Let’s first import the necessary attributes from scikit-learn into a pandas DataFrame.



import pandas as pd
from sklearn import datasets

boston = datasets.load_boston()            # Load Boston Dataset (load_boston was removed in scikit-learn 1.2)
df = pd.DataFrame(boston.data[:, 12])      # Create DataFrame using only the LSTAT feature
df.columns = ['LSTAT']
df['MEDV'] = boston.target                 # Create new column with the target MEDV
df.head()

Let’s use the DecisionTreeRegressor implemented in scikit-learn to train a regression tree:



from sklearn.tree import DecisionTreeRegressor    # Import decision tree regression model

X = df[['LSTAT']].values                          # Assign matrix X
y = df['MEDV'].values                             # Assign vector y

sort_idx = X.flatten().argsort()                  # Sort X and y by ascending values of X
X = X[sort_idx]
y = y[sort_idx]

tree = DecisionTreeRegressor(criterion='mse',     # Initialize and fit regressor ('mse' was renamed
                             max_depth=3)         # 'squared_error' in scikit-learn 1.0)
tree.fit(X, y)

Notice that our criterion is different from the one we used for our classification tree. Entropy as a measure of impurity is a useful criterion for classification. To use a decision tree for regression, however, we need an impurity metric that is suitable for continuous variables, so we define the impurity of a node t as the mean squared error (MSE) of the samples at that node, and score a split by the weighted MSE of its child nodes:

$$I(t) = MSE(t) = \frac{1}{N_t} \sum_{i \in D_t} \left( y^{(i)} - \hat{y}_t \right)^2$$

Here, N_t is the number of training samples at node t, D_t is the training subset at node t, y^{(i)} is the true target value, and \hat{y}_t is the predicted target value (sample mean):

$$\hat{y}_t = \frac{1}{N_t} \sum_{i \in D_t} y^{(i)}$$

Now, let’s model the relationship between MEDV and LSTAT to see what the line fit of a regression tree looks like:



import matplotlib.pyplot as plt                   # Import plotting library

plt.figure(figsize=(16, 8))
plt.scatter(X, y, c='steelblue',                  # Plot actual target against features
            edgecolor='white', s=70)
plt.plot(X, tree.predict(X),                      # Plot predicted target against features
         color='black', lw=2)
plt.xlabel('% lower status of the population [LSTAT]')
plt.ylabel('Price in $1000s [MEDV]')
plt.show()

As we can see in the resulting plot, the decision tree of depth 3 captures the general trend in the data.

I hope you enjoyed this tutorial on decision trees! We discussed the fundamental concepts of decision trees, the algorithms for minimizing impurity, and how to build decision trees for both classification and regression.

In practice, it is important to know how to choose an appropriate value for a depth of a tree to not overfit or underfit the data. Knowing how to combine decision trees to form an ensemble random forest is also useful as it usually has a better generalization performance than an individual decision tree due to randomness, which helps to decrease the model's variance. It is also less sensitive to outliers in the dataset and doesn't require much parameter tuning.

We cover these techniques in our Python Machine Learning series, as well as diving into other machine learning models such as perceptrons, Adaline, linear and polynomial regression, logistic regression, SVMs, kernel SVMs, k-nearest-neighbors, models for sentiment analysis, k-means clustering, DBSCAN, convolutional neural networks, and recurrent neural networks.

We also look at other topics such as regularization, data processing, feature selection and extraction, dimensionality reduction, model evaluation, ensemble learning techniques, and deploying a machine learning model.

You can get started here!

Learn Swift Basics in 5 Minutes

Lorraine — Mon, 25 Mar 2019 16:20:08 +0000

Swift is a relatively new programming language designed by Apple Inc., and was initially made available to Apple developers in 2014. It was primarily intended as a replacement for the aging Objective-C language that was the foundation of OS X and iOS software development at the time. It was made open source in December 2015, and while it remains primarily used by developers targeting the Apple macOS and iOS platforms, Swift is also fully supported on Linux, and there are unofficial ports under development for Windows as well.

Unlike many object-oriented languages, which are based on older procedural languages — for example, C++ and Objective-C are based on C — Swift was designed from the ground up as a new, modern, object-oriented language that makes programming faster and easier, and helps developers produce expressive code that's less prone to errors than many languages.

While not based on an older language, Swift, in the words of its chief architect, Chris Lattner, "was inspired by drawing ideas from Ruby, Python, C#, CLU, and far too many others to list".

In this quick crash course, we will cover the fundamentals of using the Swift programming language. You'll learn:

Basic Swift syntax
Swift program structure
Variables and constants
Type inference
Variable and constant naming conventions
Printing and string interpolation

Let's get started!

This crash course is adapted from Next Tech's full Beginning Swift course, which includes an in-browser sandboxed environment with Swift pre-installed. It also includes numerous activities for you to complete. You can check it out for free here!

Swift Syntax

In this first section, we'll look at the basic language syntax for Swift.

Like many modern programming languages, Swift draws its most basic syntax from the programming language C. If you have previous programming experience in other C-inspired languages, many aspects of Swift will seem familiar, for example:

Programs are made up of statements, executed sequentially.
More than one statement is allowed per editor line when separated by a semicolon (;).
Units of work in Swift are modularized using functions and organized into types.
Functions accept zero or more parameters, and can return values.
Single and multiline comments follow the same syntax as in C++ and Java.
Swift data type names and usage are similar to that in Java, C#, and C++.
Swift has the concept of named variables, which are mutable, and named constants, which are immutable.
Swift has both struct and class semantics, as do C++ and C#.

However, Swift has some improvements and differences from C-inspired languages that you may have to become accustomed to, such as:

Semicolons are not required at the end of statements — except when used to separate multiple statements typed on the same line in a source file.
Swift has no main() method to serve as the program's starting point when the operating system loads the application. Swift programs begin at the first line of code of the program's source file — as is the case in most interpreted languages.
Functions in Swift place the function return type at the right-hand side of the function declaration, rather than the left.
Function parameter declaration syntax is inspired by Objective-C, which is quite different and often at first confusing for Java, C#, and C++ developers.
The difference between a struct and a class in Swift is similar to what we have in C# (value type versus reference type), but not the same as in C++ (both are the same, except struct members are public by default).

Swift Program Structure — `Hello, World`!

To illustrate the basic structure of a Swift program, let's create a simple Swift program to display the string Hello, World to the console:

let message = "Hello, World"
print(message)

[Out:]
Hello, World

If you are using Next Tech's sandbox, you can follow along with the code snippets in this crash course by simply typing in the editor. Otherwise, you can follow along with your own IDE — just make sure that Swift is installed!

Congratulations! In two lines of code, you've just written your first fully-functional Swift program.

Now, let's move on to learning about and using the Swift language — and break down each part of your Hello World program!

Swift Variables

Virtually all programming languages include the ability for programmers to store values in memory using an associated name chosen by the programmer. Variables allow programs to operate on data values that change during the run of the program.

A Swift variable declaration uses the following basic syntax:

var <variable name> : <type> = <value>

Given this syntax, a legal declaration for a variable called pi would be:

var pi : Double = 3.14159

This declaration means: "create a variable named pi, which stores a Double data type, and assign it an initial value of 3.14159".

Swift Constants

You may want to store a named value in your program that will not change during the life of the program. How can we ensure that, once defined, this named value can never be accidentally changed by our code? By declaring a constant!

In our earlier Hello, World program, we declared message using let instead of var — therefore, message is a constant.

Since message was declared as a constant, if we added the following line of code to the end of our program, we would receive a compile-time error, since changing a let constant is illegal:

message = "Hello, Earth."

[Out:]
error: cannot assign to value: 'message' is a 'let' constant

Generally, any time you create a named value that will never be changed during the run of your program, you should use the let keyword to create a constant. The Swift compiler reinforces this recommendation by emitting a compile-time warning whenever a var is created that is not subsequently changed.

Other than the restriction on mutating the value of a constant once declared, Swift variables and constants are used in virtually identical ways.

Type Inference

In our Hello World example, we created the constant message without specifying its data type. We took advantage of a Swift compiler feature called type inference.

When you assign the value of a variable or constant as you create it, the Swift compiler will analyze the right-hand side of the assignment, infer the data type, and assign that data type to the variable or constant you're creating. For example, in the following declaration, the compiler will create the variable name as a String data type:

var name = "George Smith"

Because Swift is a type-safe language, once a data type is inferred by the compiler, it remains fixed for the life of the variable or constant. Attempting to assign a non-string value to the name variable declared above would result in a compile-time error:

name = 3.14159

[Out:]
error: cannot assign value of type 'Double' to type 'String'

While Swift is a type-safe language, where variable types are fixed at declaration and do not change, it is possible to create Swift code that behaves like a dynamically typed language using the Swift Any data type. For example, the following code is legal in Swift:

var anyType: Any
anyType = "Hello, world"
anyType = 3.14159

While this is legal, it's not a good Swift programming practice. The Any type is mainly provided to allow bridging between Objective-C and Swift code. To keep your code as safe and error-free as possible, you should use explicit types wherever possible.

Variable Naming Conventions

Swift variables and constants have the same naming rules as most C-inspired programming languages:

Must not start with a digit
After the first character, digits are allowed
Can begin with and include an underscore character
Symbol names are case sensitive
Reserved language keywords may be used as variable names if enclosed in backticks. For example:

  var `class` : Int = 5

When creating variable and constant names in Swift, the generally accepted naming convention is to use a camelCase naming convention, beginning with a lowercase letter. Following generally accepted naming conventions makes code easier for others to read and understand.

For example, the following would be a conventional variable declaration:

var postalCode = "48108"

However, the following would not be conventional, and would be considered incorrect by many other Swift developers:

var PostalCode = "48108"
var postal_code  = "48108"
var POSTALCODE = "48108"

Unlike many other programming languages, Swift is not restricted to the Western alphabet for its variable name characters. You may use almost any Unicode character as part of your variable declarations (whitespace and a few reserved symbol ranges are excluded). The following variable declarations are legal in Swift:

var helloWorld = "Hello, World"
var 你好世界 = "Hello World"
var 😊 = "Smile!"

Note that just because you can use almost any Unicode character within a variable name, and can use reserved words as variables when enclosed in backticks, it doesn't mean you should. Always consider other developers who may need to read and maintain your code in the future. The priority for variable names is that they should make code easier to read, understand, and maintain.

Printing and String Interpolation

In Swift, you can print a variable or a constant to your console using the print() function. Let’s create a variable and a constant and print them out.

Execute this snippet in your Code Editor to create a constant named name, and a variable named address:

let name = "John Doe"
var address = "201 Main Street"
print("\(name) lives at \(address)")

[Out:]
John Doe lives at 201 Main Street

Both name and address store string text. By wrapping the variable or constant name in a pair of parentheses, prefixed by a backslash (\), we are able to print their stored values in a print statement — this is called string interpolation.

I hope you enjoyed this quick crash course on the basics of Swift! We learned about basic syntax and program structure, how to declare and use Swift variables and constants, type inference, printing, and string interpolation.

If you are interested in learning more about Swift, we have a full Beginning Swift course at Next Tech that you can start for free! In this course we cover:

Other basic programming concepts such as: optionals, tuples, enums, conditionals and loops, methods, structs, and classes.
Creating scripts and command line applications in Swift
Using Swift outside of iOS and macOS development lifecycles

Happy learning!

What tech skill do you want to learn next? 🤓

Lorraine — Mon, 11 Mar 2019 15:53:08 +0000

Last week we announced an exciting partnership with technology publisher Packt. We're working with them to build a library of hands-on courses for learning tech skills, and we'd love to hear what you're interested in learning!

We've already released a number of courses covering topics like:

Software engineering
Web development
Data science
Machine learning
Databases & SQL

Here's a screenshot from a machine learning course that contains a Jupyter Notebook:

Here's another one of a SQL course:

These courses all provide browser-based environments where you can get real-world experience with what you're learning. Getting to actually build something is an important part of the learning process, but too often performing hours of setup gets in the way of mastering a new skill.

We built the Next Tech IDE to solve exactly this problem, and we're very excited about this partnership with Packt, as it means we'll be able to release more hands-on courses than ever before!

What do you want to learn?

Okay, enough about us, tell us about you! What are you interested in learning? Is it a specific language, a general area of technology, or something else?

If you'd like, you can pick a title from the Packt website and just reply with it below! We'll respond and let you know if it's a course we're already working on or let you know when we can release it!

Your input is really valuable to us as Packt has SIX THOUSAND eBooks and videos, so there's a lot for us to pick from! 🥴

In the meantime, want to see what the current courses look like? Head over to our website to check 'em out!

Introduction to Object-Oriented Programming with Ruby

Lorraine — Tue, 26 Feb 2019 22:10:51 +0000

Object-oriented programming (OOP) is a programming paradigm organized around objects. At a high level, OOP is all about being able to structure code so that its functionality can be shared throughout the application. If done properly, OOP can lead to very elegantly written programs that have minimal code duplication.

This is opposed to procedural programming (PP), in which you build programs in sequential order and call methods when you want shared behavior between pages in the application. Common procedural programming languages include C and Go.

In this tutorial, you’ll learn the fundamental concepts of OOP for Ruby, an object-oriented programming language wherein everything is an object. We will be using Ruby since one of its defining attributes — in addition to its elegant syntax and readability — is how it implements OOP techniques. This makes it a great language to start learning OOP with.

We will cover:

Creating classes
Instantiating objects
Initializing arguments
Working with inheritance, and
Private and public methods.

In learning these concepts, we will build out our own application: an API connector that communicates dynamically with an application that sends a text message. This will include walking through how to leverage concepts such as inheritance and object instantiation to make our code more scalable and reusable.

This brief tutorial is adapted from Next Tech’s Introduction to Ruby course, which includes an in-browser sandboxed environment and auto-checked interactive tasks to complete.

Creating Classes

Before we begin, let’s define what an object is. At its core, an object is a self-contained piece of code that contains data (“attributes”) and behavior (“methods”) and can communicate with other objects. Objects of the same type are created from classes, which act as blueprints that define properties and behavior.

Creating a class in Ruby is fairly easy. To define a class, simply type the class keyword followed by the name of the class, and end it with the end keyword. Anything contained between class and end belongs to this class.

Class names in Ruby have a very specific style requirement. They need to start with an uppercase letter, and if they represent multiple words, each new word also needs to start with an uppercase letter (i.e. “CamelCase”).

We’ll start by creating a class called ApiConnector:

class ApiConnector
end

Classes in Ruby can store both data and methods. In many traditional OOP languages such as Java, you need to create two methods for each data element you want to be included in the class. One method, the setter, sets the value in the class. The other method, the getter, allows you to retrieve the value.

The process of creating setter and getter methods for every data attribute can be tiresome and leads to incredibly long class definitions. Thankfully Ruby has a set of tools called attribute accessors.

Let’s implement some setters and getters for some new data elements for our class. Since it’s an API connector, it would make sense to have data elements such as title, description, and url. We can add these elements with the following code:

class ApiConnector
  attr_accessor :title, :description, :url
end
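To make it clearer what attr_accessor saves us from writing, here is a minimal sketch of the getter and setter it generates for a single attribute, written out by hand. The class name ManualApiConnector is hypothetical and used only for this illustration; only title is shown, but description and url would follow the same pattern.

```ruby
# ManualApiConnector is a hypothetical class used only for this sketch.
# It spells out by hand roughly what `attr_accessor :title` provides.
class ManualApiConnector
  # Getter: returns the current value of the @title instance variable.
  def title
    @title
  end

  # Setter: runs when you write `object.title = value`.
  def title=(value)
    @title = value
  end
end

manual = ManualApiConnector.new
manual.title = "My title"   # calls the title= setter
p manual.title              # calls the title getter
```

Multiply those two methods by every attribute in the class and the appeal of the one-line attr_accessor call becomes obvious.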

When you merely create a class, it doesn't do anything — it is simply a definition. In order to work with the class, we need to create an instance of it…we’ll cover that next!

Instantiation

To understand what instantiation is, let’s consider a real-world analogy. Let’s imagine that you’re building a house. The first task is to build a blueprint for the house. This blueprint would contain attributes and features of the house, such as the dimensions for each room, how the plumbing will flow, and so on.

Is the blueprint of the house the actual house? Of course not, it simply lists out the attributes and design elements for how the home will be created. So after the blueprint is completed, the actual home can be built — or, “instantiated”.

As explained in the previous section, in OOP, a class is the blueprint for an object. It simply describes what an object will look like and how it will behave. Therefore, instantiation is the process of taking a class definition and creating an object that you can use in a program.

Let’s create a new instance of our ApiConnector class and store it in a variable called api:

api = ApiConnector.new

Now that we have an object created, we can use the api variable to work with the class attributes. For example, we can run the code:

api.url = "https://next.tech/"
p api.url

[Out:]
"https://next.tech/"

In addition to creating attributes, you can also create methods within a class:

def test_method
  p "testing class call"
end

To access this method, we can use the same syntax that we utilized with the attribute accessors:

api.test_method

Putting this all together, running the full class code below will print both the url and the test_method message:

class ApiConnector
  attr_accessor :title, :description, :url

  def test_method
    p "testing class call"
  end
end

api = ApiConnector.new

api.url = "https://next.tech/"
p api.url

api.test_method

[Out:]
"https://next.tech/"
"testing class call"

Initializer Method

One thing you may find handy in Ruby development is the ability to create an initializer method. This is simply a method called initialize that will run every time you create an instance of your class. In this method, you can give values to your variables, call other methods, and do just about anything that you think should happen when a new instance of that class is created.

Let’s update our ApiConnector to utilize an initializer method:

class ApiConnector
  def initialize(title, description, url)
    @title = title
    @description = description
    @url = url
  end
end

Within the initialize method, we created an instance variable for each of the parameters so that we can use these variables in other parts of the application as well.

We also removed the attr_accessor call, since this example only uses the values inside the class. The initialize method sets the instance variables but does not create getter or setter methods, so if you need to read or change the data elements from outside the class, you would still need to have the attr_accessor call in place.
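As a rough sketch of how the two features combine, the class below keeps attr_accessor alongside initialize, so the values set at creation time can also be read and changed from outside the object. The class name ReadableApiConnector is hypothetical and exists only for this example.

```ruby
# ReadableApiConnector is a hypothetical class used only for this sketch.
class ReadableApiConnector
  # Generates getters and setters so the values are reachable from outside.
  attr_accessor :title, :description, :url

  # Runs automatically on ReadableApiConnector.new and sets the initial state.
  def initialize(title, description, url)
    @title = title
    @description = description
    @url = url
  end
end

readable = ReadableApiConnector.new("My title", "My cool description", "https://next.tech")
p readable.title                  # read a value that initialize stored
readable.url = "https://next.xyz" # change a value after creation
p readable.url
```

Without the attr_accessor line, both readable.title and readable.url= would raise a NoMethodError, even though the instance variables exist inside the object.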

To test if the initialize method is working, let’s create another method within the class that prints these values out:

def testing_initializer
  p @title
  p @description
  p @url
end

Finally, we’ll instantiate the class and test the initialize method:

api = ApiConnector.new("My title", "My cool description", "https://next.tech")
api.testing_initializer

[Out:]
"My title"
"My cool description"
"https://next.tech"

Working with optional values

Now, what happens when we want to make one of these values optional? For example, what if we want to give a default value to the URL? To do that, we can update our initialize method with the following syntax:

def initialize(title, description, url = "https://next.tech")

Now our program will have the same output even if we don’t pass the url value while creating a new instance of the class:

api = ApiConnector.new("My title", "My cool description")
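Here is a small self-contained sketch of the default value in action. DefaultUrlConnector is a hypothetical class name, and the attr_reader is added only so the stored URL can be inspected from outside.

```ruby
# DefaultUrlConnector is a hypothetical class used only for this sketch.
class DefaultUrlConnector
  attr_reader :url   # read-only access so we can inspect the result

  # url falls back to "https://next.tech" when the caller omits it.
  def initialize(title, description, url = "https://next.tech")
    @title = title
    @description = description
    @url = url
  end
end

with_default = DefaultUrlConnector.new("My title", "My cool description")
with_custom  = DefaultUrlConnector.new("My title", "My cool description", "https://next.xyz")

p with_default.url   # the default value is used
p with_custom.url    # the third argument replaces the default
```

Note that optional positional parameters are normally placed at the end of the parameter list, so that callers can simply leave them off.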

Using named arguments

Though this looks simple, passing arguments can get complex in real-world Ruby applications because some methods may take a large number of arguments. In such cases, it becomes difficult to know the order of arguments and what values to assign to them.

To avoid this confusion, you can utilize named arguments, like this:

class ApiConnector
  def initialize(title:, description:, url: "https://next.tech")
    ...
  end
  ...
end

api = ApiConnector.new(title: "My title", description: "My cool description")
api.testing_initializer

You can enter the arguments without having to look at the order in the initialize method, and even change the order of the arguments without causing an error:

api = ApiConnector.new(description: "My cool description", title: "My title")
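The sketch below fills in the elided parts with one possible implementation, to show both sides of named arguments: the order no longer matters, but a required keyword (one declared without a default) must still be supplied. KeywordConnector is a hypothetical class name and the attr_reader is an addition for inspection.

```ruby
# KeywordConnector is a hypothetical class used only for this sketch.
class KeywordConnector
  attr_reader :title, :description, :url

  # title: and description: are required; url: has a default.
  def initialize(title:, description:, url: "https://next.tech")
    @title = title
    @description = description
    @url = url
  end
end

# Keywords can be passed in any order.
reordered = KeywordConnector.new(description: "My cool description", title: "My title")
p reordered.title
p reordered.url

# Leaving out a required keyword raises an ArgumentError.
missing_keyword_error = nil
begin
  KeywordConnector.new(title: "My title")
rescue ArgumentError => e
  missing_keyword_error = e
end
p missing_keyword_error.class
```

The error message names the missing keyword, which makes this kind of mistake much easier to track down than a misplaced positional argument.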

Overriding default values

What happens if we want to override a default value? We simply update our instantiation call like this:

api = ApiConnector.new(title: "My title", description: "My cool description", url: "https://next.xyz")

This update will override our default value of https://next.tech, and calling api.testing_initializer will now print https://next.xyz as the URL.

Inheritance

Now, we are going to learn about an important object-oriented principle called inheritance. Before going into how it is executed in Ruby, let’s see why it’s important for building applications.

To start with, inheritance means your classes can have a hierarchy. It is best used when different classes have some shared responsibilities, since it would be a poor practice to duplicate code in each class for identical or even similar behavior.

Take our ApiConnector class. Let's say we have different API classes for various platforms, but each class shares a number of common data or processes. Instead of duplicating code in each of the API connector classes, we can have one parent class with the shared data and methods. From there, we can create child classes from this parent class. With the way that inheritance works, each of the child classes will have access to the components provided from the parent class.

For example, say we have three APIs: SmsConnector, PhoneConnector, and MailerConnector. If we wrote code individually for each of these classes, it would look like this:

class SmsConnector
  def initialize(title:, description:, url: "https://next.tech")
    @title = title
    @description = description
    @url = url
  end

  def send_sms
    p "Sending SMS message with the title '#{@title}' and description '#{@description}'"
  end
end

class MailerConnector
  def initialize(title:, description:, url: "https://next.tech")
    @title = title
    @description = description
    @url = url
  end

  def send_mail
    p "Sending mail message with the title '#{@title}' and description '#{@description}'"
  end
end

class PhoneConnector
  def initialize(title:, description:, url: "https://next.tech")
    @title = title
    @description = description
    @url = url
  end

  def place_call
    p "Sending phone call with the title '#{@title}' and description '#{@description}'"
  end
end

As you can see, we are simply repeating the same code across different classes. This is considered a poor programming practice that violates the DRY (Don’t Repeat Yourself) principle of development. Instead, we can make an ApiConnector parent class, and each of the other classes can inherit the common functionality from this class:

class ApiConnector
  def initialize(title:, description:, url: "https://next.tech")
    @title = title
    @description = description
    @url = url
  end
end

class SmsConnector < ApiConnector
  def send_sms
    p "Sending SMS message with the title '#{@title}' and description '#{@description}'"
  end
end

class MailerConnector < ApiConnector
  def send_mail
    p "Sending mail message with the title '#{@title}' and description '#{@description}'"
  end
end

class PhoneConnector < ApiConnector
  def place_call
    p "Sending phone call with the title '#{@title}' and description '#{@description}'"
  end
end

By leveraging inheritance, we were able to cut all of the duplicate code throughout our classes.

The syntax for using inheritance is to define the child class name, followed by the < symbol, then the parent class name. In our example, the SmsConnector, MailerConnector, and PhoneConnector classes each inherit from the ApiConnector class.

Each of these child classes now has access to the full set of elements provided in the parent ApiConnector class. For example, if we create a new instance of SmsConnector with the following parameters, we can call the send_sms method:

sms = SmsConnector.new(title: "Hi there!", description: "I'm an SMS message")
sms.send_sms

[Out:]
"Sending SMS message with the title 'Hi there!' and description 'I'm an SMS message'"
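If you want to confirm the parent/child relationship for yourself, Ruby lets you inspect it at runtime. The sketch below repeats trimmed-down versions of the two classes so it runs on its own. As a simplification for this example, send_sms returns the string instead of printing it.

```ruby
class ApiConnector
  def initialize(title:, description:, url: "https://next.tech")
    @title = title
    @description = description
    @url = url
  end
end

class SmsConnector < ApiConnector
  # Simplified for this sketch: returns the message instead of printing it.
  def send_sms
    "Sending SMS message with the title '#{@title}' and description '#{@description}'"
  end
end

sms = SmsConnector.new(title: "Hi there!", description: "I'm an SMS message")
plain_api = ApiConnector.new(title: "Plain", description: "Parent instance")

p SmsConnector.superclass              # the direct parent of SmsConnector
p sms.is_a?(ApiConnector)              # a child instance also counts as an ApiConnector
p sms.respond_to?(:send_sms)           # the child defines send_sms
p plain_api.respond_to?(:send_sms)     # the parent does not gain the child's methods
```

Inheritance flows in one direction only: the child receives everything the parent defines, while the parent knows nothing about methods added by its children.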

A rule of thumb in OOP is to ensure that a class performs a single responsibility. For example, the ApiConnector class should not send SMS messages, make phone calls, or send emails since that would be three core responsibilities.

Private and Public Methods

Before we dive into private and public methods, let’s first go back to our original ApiConnector class and create a SmsConnector class that inherits from ApiConnector. In this class, we will create a method called send_sms that will run a script that contacts an API:

class ApiConnector
  def initialize(title:, url: 'https://next.tech')
    @title = title
    @url = url
  end
end

class SmsConnector < ApiConnector
  def send_sms
    `curl -X POST \
    -d "notification[title]=#{@title}" \
    -d "notification[url]=#{@url}" \
    "http://edutechional-smsy.herokuapp.com/notifications"`
  end
end

This SMS API was created by J. Hudgens (2017).

This method will send a title and url to an API, which will in turn send an SMS message. Now we can instantiate the SmsConnector class and call the send_sms method:

sms = SmsConnector.new(
  title: "Hey there!",
  url: "https://next.tech/xyz/introduction-to-ruby"
  )
sms.send_sms

Running this code will contact the SMS API and send the message. You can go to the bottom of this page to see your message!

Now, using this example, let’s discuss the types of methods provided by classes.

The send_sms method is a public method. This means that anyone working on our class can communicate with this method. This may not seem like a big deal if you are working on an application that no one else is working on. However, if you build an API or code library that is open sourced for others to use, it's vital that your public methods represent elements of functionality that you actually want other developers to use.

Public methods should rarely, if ever, be altered. This is because other developers may be relying on your public methods to be consistent, and a change to a public method may break components of their programs.

So, if you can’t change public methods, how can you work on a production application? That’s where private methods come in. A private method can only be called from inside the object itself, never by outside code. This means that you can alter its behavior, assuming that these changes don’t have a domino effect and alter the public methods that call it.

Usually private methods are placed at the end of the class, after all the public methods. To designate private methods, we place the private keyword above the list of methods. Let’s add a private method to our ApiConnector class:

class ApiConnector
  def initialize(title:, url:)
    @title = title
    @url = url
    secret_method
  end

  private

  def secret_method
    p "A secret message from the parent class"
  end
end

api = ApiConnector.new(title: "My Title", url: "https://next.tech")

Notice how we're calling this method from the inside of the initialize method of the ApiConnector class? If we run this code, it will give the following output:

[Out:]
"A secret message from the parent class"

Now, can we call this private method on an instance of a child class from outside the object? Let’s remove the secret_method call from the initialize method in ApiConnector and try to call it on an instance of our SmsConnector child class, as shown here:

class ApiConnector
  def initialize(title:, url:)
    @title = title
    @url = url
  end

  private

  def secret_method
    p "A secret message from the parent class"
  end
end

class SmsConnector < ApiConnector
  def send_sms
    `curl -X POST \
    -d "notification[title]=#{@title}" \
    -d "notification[url]=#{@url}" \
    "http://edutechional-smsy.herokuapp.com/notifications"`
  end
end

sms = SmsConnector.new(
  title: "Hey there!",
  url: "https://next.tech/xyz/introduction-to-ruby"
  )
sms.secret_method

[Out:]
Traceback (most recent call last):
main.rb:29:in `<main>': private method `secret_method' called for #<SmsConnector:0x000056188cfe19b0> (NoMethodError)

This is because private methods can only be called from inside the object, without an explicit receiver. SmsConnector does inherit secret_method from ApiConnector and could call it from within its own methods, but calling sms.secret_method from outside the object raises a NoMethodError.
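In Ruby, private restricts how a method is called rather than which class in the hierarchy may call it. The minimal sketch below uses two hypothetical classes, BaseConnector and ChildConnector, to show an inherited private method working when called from inside the child and failing when called from outside.

```ruby
# BaseConnector and ChildConnector are hypothetical classes for this sketch.
class BaseConnector
  private

  def secret_method
    "A secret message from the parent class"
  end
end

class ChildConnector < BaseConnector
  # Public method that calls the inherited private method
  # from inside the object, with no explicit receiver.
  def reveal
    secret_method
  end
end

child = ChildConnector.new
p child.reveal            # works: the call happens inside the object

outside_call_error = nil
begin
  child.secret_method     # fails: called from outside with an explicit receiver
rescue NoMethodError => e
  outside_call_error = e
end
p outside_call_error.class
```

This is why private methods are safe to refactor: outside code can never reach them directly, only through the public methods that wrap them.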

So a good rule of thumb is to create private methods when they should not be used outside the class and public methods when they have to be available throughout the application or used by outside services.

I hope you enjoyed this quick tutorial on the fundamental concepts of object-oriented programming in Ruby! We covered creating classes, attribute accessors, instantiation, initialization, inheritance, and private and public methods.

Ruby is a powerful object-oriented language used by popular applications, including our own here at Next Tech. With this foundational knowledge of OOP, you’re well on your way to developing your own Ruby apps!

If you’re interested in learning more about programming with Ruby, check out our Introduction to Ruby course here! In this course we cover core programming skills, such as variables, strings, loops, and conditionals, more advanced OOP topics, and error handling.
