DEV Community: Next Tech

Free coding courses through May!

Saul Costa — Tue, 28 Apr 2020 18:26:26 +0000

While you are home staying safe and keeping others safe, we want to make sure you have everything you need to keep learning coding and working on your projects. That's why a couple of weeks ago, we decided to make all of Next Tech's interactive courses and sandboxes free through April. Now, we're extending that through May!

Staying home is incredibly important, but so is your ability to continue learning new skills. With Next Tech courses, you can learn coding, web development, data science, and other skills directly from your browser with hands-on projects. No downloads required!

Here are some of the top courses on Next Tech right now that other dev.to readers have enjoyed:

Just need a spot to write some code right now? Here are a few:

Python

Launch a Python sandbox here.

Java

Launch a Java sandbox here.

Swift

Launch a Swift sandbox here.

And of course, if you're a student, you can apply for the GitHub Student Developer Pack and get one year of Next Tech access completely free!

We hope that this free month of access will help you keep learning and building from home. Stay safe!

Best wishes from Next Tech.

Cleaning and Transforming Data with SQL

Lorraine — Tue, 10 Dec 2019 00:33:36 +0000

One of the first tasks performed when doing data analytics is to clean the dataset you're working with. The insights you draw from your data are only as good as the data itself, so it's no surprise that an estimated 80% of the time spent by analytics professionals involves preparing data for use in analysis.

SQL can help expedite this important task. In this tutorial, we will discuss different functions commonly used to clean, transform, and remove duplicate data from query outputs that may not be in the form we would like. This means you'll learn about:

CASE WHEN
COALESCE
NULLIF
LEAST / GREATEST
Casting
DISTINCT

We will be using the following sample table, employees, throughout this tutorial to illustrate how our functions work:

id	first_name	last_name	title	age	wage	hire_date
1	Amy	Jordan	Ms	24	15	2019-04-27
2	Bill	Tibb	Mr	61	28	2012-05-02
3	Bill	Sadat		18	12	2019-11-08
4	Christine	Riveles	Mrs	36	20	2018-03-30
5	David	Guerin	Honorable	28	20	2016-11-02

This data is preloaded into a Next Tech sandbox so you can experiment with and test the queries below. Connect to the database here for free!

Let’s get started!

This tutorial is adapted from Next Tech’s full SQL for Data Analysis course, which includes an in-browser sandboxed environment and interactive activities and challenges using real datasets. You can get started with this course here!

`CASE WHEN`

CASE WHEN is a function that allows a query to map various values in a column to other values. The general format of a CASE WHEN statement is:

CASE
    WHEN condition1 THEN value1
    WHEN condition2 THEN value2
    ...
    WHEN conditionX THEN valueX
    ELSE else_value
END

Here, condition1 and condition2, through conditionX, are Boolean conditions; value1 and value2, through valueX, are the values those conditions map to; and else_value is the value returned if none of the Boolean conditions are met.

For each row, the program starts at the top of the CASE WHEN statement and evaluates each Boolean condition in order. The statement returns the value associated with the first condition that evaluates as true. If none of the conditions evaluate as true, the value associated with the ELSE clause is returned (or NULL if the ELSE clause is omitted).

As an example, let's say you wanted to return all rows for employees from the employees table. Additionally, you would like to add a column that labels an employee as a New employee if they were hired on or after 2019-01-01. Otherwise, it will mark the employee as a Standard employee. This column will be called employee_type. We can create this table by using a CASE WHEN statement as follows:

SELECT
    *,
    CASE
        WHEN hire_date >= '2019-01-01' THEN 'New'
        ELSE 'Standard'
    END AS employee_type
FROM
    employees;

This query will give the following output:

id	first_name	last_name	title	age	wage	hire_date	employee_type
1	Amy	Jordan	Ms	24	15	2019-04-27	New
2	Bill	Tibb	Mr	61	28	2012-05-02	Standard
3	Bill	Sadat		18	12	2019-11-08	New
4	Christine	Riveles	Mrs	36	20	2018-03-30	Standard
5	David	Guerin	Honorable	28	20	2016-11-02	Standard

The CASE WHEN statement effectively mapped a hire date to a string describing the employee type. Using a CASE WHEN statement, you can map values in any way you please.
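Because conditions are checked from top to bottom, their order matters when more than one could be true. The following is a minimal sketch of that behavior, assuming Python's built-in sqlite3 module as a stand-in for the tutorial's database, a cut-down version of the employees table, and a hypothetical third label, 'Veteran', that is not part of the original example:

```python
import sqlite3

# In-memory SQLite database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, hire_date TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Amy", "2019-04-27"), ("Bill", "2012-05-02"), ("Christine", "2018-03-30")],
)

# Amy satisfies both WHEN conditions, but only the first match is returned.
query = """
SELECT first_name,
       CASE
           WHEN hire_date >= '2019-01-01' THEN 'New'
           WHEN hire_date >= '2015-01-01' THEN 'Standard'
           ELSE 'Veteran'  -- hypothetical label for everyone else
       END AS employee_type
FROM employees
ORDER BY first_name;
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('Amy', 'New'), ('Bill', 'Veteran'), ('Christine', 'Standard')]
```

Swapping the two WHEN lines would label Amy as 'Standard', since the broader condition would then match first.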

COALESCE

Another useful technique is to replace NULL values with a standard value. This can be accomplished easily by means of the COALESCE function. COALESCE allows you to list any number of columns and scalar values, and, if the first value in the list is NULL, it will try to fill it in with the second value. The COALESCE function will keep continuing down the list of values until it hits a non-NULL value. If all values in the COALESCE function are NULL, then the function returns NULL.

To illustrate a simple usage of the COALESCE function, let's say we want a list of the names and titles of our employees. However, for those with no title, we want to instead write the value 'NO TITLE'. We can accomplish this request with COALESCE:

SELECT
    first_name,
    last_name,
    COALESCE(title, 'NO TITLE') AS title
FROM
    employees;

This query produces the following results:

first_name	last_name	title
Amy	Jordan	Ms
Bill	Tibb	Mr
Bill	Sadat	NO TITLE
Christine	Riveles	Mrs
David	Guerin	Honorable

When dealing with creating default values and avoiding NULL, COALESCE will always be helpful.
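COALESCE is not limited to two arguments. As a rough sketch, again assuming Python's built-in sqlite3 module rather than the tutorial's own database, the left-to-right search for the first non-NULL value looks like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The first non-NULL argument wins, so later arguments are ignored.
kept = conn.execute("SELECT COALESCE(?, ?)", ("Ms", "NO TITLE")).fetchone()[0]

# NULLs are skipped until a usable value is found.
filled = conn.execute(
    "SELECT COALESCE(?, ?, ?)", (None, None, "NO TITLE")
).fetchone()[0]

# If every argument is NULL, the result is NULL (None in Python).
all_null = conn.execute("SELECT COALESCE(?, ?)", (None, None)).fetchone()[0]

print(kept)      # Ms
print(filled)    # NO TITLE
print(all_null)  # None
```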

NULLIF

NULLIF is, in a sense, the opposite of COALESCE. NULLIF is a two-value function that returns NULL if the first value equals the second value; otherwise, it returns the first value.

As an example, imagine that we want a list of the names and titles of our employees. However, this time, we want to replace the title 'Honorable' with NULL. This could be done with the following query:

SELECT
    first_name,
    last_name,
    NULLIF(title, 'Honorable') AS title
FROM
    employees;

This will blot out all mentions of 'Honorable' from the title column and give the following output:

first_name	last_name	title
Amy	Jordan	Ms
Bill	Tibb	Mr
Bill	Sadat
Christine	Riveles	Mrs
David	Guerin
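NULLIF and COALESCE are often combined, for example to treat a placeholder such as an empty string as missing and then substitute a default. The sketch below assumes Python's built-in sqlite3 module as a stand-in database; the nullif helper is a hypothetical wrapper written only for this illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def nullif(first, second):
    """Hypothetical helper: run SQL NULLIF on two Python values."""
    return conn.execute("SELECT NULLIF(?, ?)", (first, second)).fetchone()[0]


matched = nullif("Honorable", "Honorable")  # equal values become NULL
unmatched = nullif("Mrs", "Honorable")      # otherwise the first value is kept

# Turn an empty string into NULL, then let COALESCE supply a default.
cleaned = conn.execute(
    "SELECT COALESCE(NULLIF(?, ''), 'NO TITLE')", ("",)
).fetchone()[0]

print(matched)    # None
print(unmatched)  # Mrs
print(cleaned)    # NO TITLE
```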

LEAST / GREATEST

Two functions that often come in handy for data preparation are the LEAST and GREATEST functions. Each function takes any number of values and returns the least or the greatest of the values, respectively.

A simple use of these functions is to replace a value if it's too high or too low. For example, say the minimum wage increased to $15/hour and we need to change the wages of any employee earning less than that. We can do this using the following query:

SELECT
    id,
    first_name,
    last_name,
    title,
    age,
    GREATEST(15, wage) as wage,
    hire_date
FROM
    employees;

This query will give the following output:

id	first_name	last_name	title	age	wage	hire_date
1	Amy	Jordan	Ms	24	15	2019-04-27
2	Bill	Tibb	Mr	61	28	2012-05-02
3	Bill	Sadat		18	15	2019-11-08
4	Christine	Riveles	Mrs	36	20	2018-03-30
5	David	Guerin	Honorable	28	20	2016-11-02

As you can see, Bill Sadat’s wage has increased from $12 to $15.
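Nesting the two functions, as in LEAST(25, GREATEST(15, wage)), clamps a value into a range. The arithmetic can be checked in Python, where the built-in max and min play the roles of GREATEST and LEAST. This is only a sketch: the upper limit of 25 is a hypothetical cap that does not appear in the original example.

```python
MIN_WAGE = 15  # the new minimum wage from the example above
MAX_WAGE = 25  # hypothetical cap, assumed only for this illustration


def clamp_wage(wage, low=MIN_WAGE, high=MAX_WAGE):
    """Python equivalent of LEAST(high, GREATEST(low, wage))."""
    return min(high, max(low, wage))


wages = [15, 28, 12, 20, 20]  # the wage column of the sample table
clamped = [clamp_wage(wage) for wage in wages]
print(clamped)  # [15, 25, 15, 20, 20]
```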

Casting

Another useful data transformation is to change the data type of a column within a query. This is usually done to use a function only available to one data type, such as text, while working with a column that is in a different data type, such as a numeric.

To change the data type of a column, you can use the column::datatype format, where column is the column name and datatype is the data type you want to change the column to. This shorthand is PostgreSQL syntax; the standard SQL equivalent is CAST(column AS datatype). For example, to change the age in the employees table to a text column in a query, use the following query:

SELECT
    first_name,
    last_name,
    age::TEXT
FROM
    employees;

This will convert the age column from an integer to text. You can now apply text functions to this transformed column. There is one final catch: not every data type can be cast to every other data type. For instance, datetime cannot be cast to float types. Your SQL client will throw an error if you attempt an unsupported conversion.
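The same conversion can also be written with the standard CAST(value AS datatype) form. The sketch below assumes Python's built-in sqlite3 module, which supports CAST but not the :: shorthand. SQLite is also far more permissive about conversions than the database used in this tutorial, so it illustrates the syntax only, not the error behavior described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# typeof() reports the storage type before and after the cast;
# length() is a text function applied to the converted value.
before, after, digits = conn.execute(
    "SELECT typeof(24), typeof(CAST(24 AS TEXT)), length(CAST(61 AS TEXT))"
).fetchone()

print(before)  # integer
print(after)   # text
print(digits)  # 2
```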

DISTINCT

Often, when looking through a dataset, you may be interested in determining the unique values in a column or group of columns. This is the primary use case of the DISTINCT keyword. For example, if you wanted to know all the unique first names in the employees table, you could use the following query:

SELECT
    DISTINCT first_name
FROM
    employees;

This gives the following result:

first_name
Amy
Bill
Christine
David

You can also use DISTINCT with multiple columns to get all distinct column combinations present.
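As a rough illustration of the multi-column case, the sketch below assumes Python's built-in sqlite3 module and a cut-down employees table with only the first_name and wage columns. With two columns, a row is dropped only when both values repeat:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, wage INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Amy", 15), ("Bill", 28), ("Bill", 12), ("Christine", 20), ("David", 20)],
)

names = conn.execute(
    "SELECT DISTINCT first_name FROM employees ORDER BY first_name"
).fetchall()

# Both Bills survive here because their wages differ.
pairs = conn.execute(
    "SELECT DISTINCT first_name, wage FROM employees ORDER BY first_name, wage"
).fetchall()

print(len(names))  # 4
print(len(pairs))  # 5
```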

I hope you enjoyed this tutorial on data cleaning and transformation with SQL. This is just the beginning of what you can use SQL for in data analysis. If you’d like to learn more, Next Tech’s SQL for Data Analysis course covers:

More functions used for data preparation and cleaning
Aggregate functions and window functions
Importing and exporting data
Analytics using complex data types
Writing performant queries

You can get started for free here!

Deep Learning Basics: A Crash Course

Lorraine — Thu, 21 Nov 2019 00:56:11 +0000

Machine learning (ML) and deep learning (DL) techniques are being applied in a variety of fields, and data scientists are being sought after in many different industries.

With machine learning, we identify the processes through which we gain knowledge that is not readily apparent from data in order to make decisions. Applications of machine learning techniques may vary greatly, and are found in disciplines as diverse as medicine, finance, and advertising.

Deep learning makes use of more advanced neural networks than those used during the 1980s. This is not only a result of recent developments in the theory, but also advancements in computer hardware.

In this crash course, we will learn about deep learning and deep neural networks (DNNs), that is, neural networks with multiple hidden layers. We will cover the following topics:

Introduction to deep learning
Deep learning algorithms
Applications of deep learning

Let’s get started!

This crash course is adapted from Next Tech's Python Deep Learning Projects (Part 1: The Fundamentals) course, which explores the basics of deep learning, setting up a DL environment, and building an MLP. It includes an in-browser sandboxed environment with all the necessary software and libraries pre-installed. You can get started here!

Introduction to Deep Learning

Given the universal approximation theorem, you may wonder what the point of using more than one hidden layer is. This is in no way a naive question, and for a long time neural networks were used in this way.

One reason for multiple hidden layers is that approximating a complex function might require a huge number of neurons in the hidden layer, making it impractical to use. A more important reason for using deep networks, which is not directly related to the number of hidden layers but to the level of learning, is that a deep network does not simply learn to predict output Y given input X: it also understands basic features of the input.

Let’s take a look at an example.

In a paper published in the Proceedings of the International Conference on Machine Learning (ICML) (2009), H. Lee, R. Grosse, R. Ranganath, and A. Ng train a neural network with pictures of different categories of either objects or animals. In the following image we can see how the different layers of the network learn different characteristics of the input data. In the first layer the network learns to detect some basic features, such as lines and edges, which are common to all images in all categories:

The first layer weights (top) and the second layer weights (bottom) after training

In the next layers, shown in the image below, it combines those lines and edges to compose more complex features that are specific to each category.

Columns 1-4 represent the second layer (top) and third layer (bottom) weights learned for a specific object category (class). Column 5 represents the weights learned for a mixture of four object categories (faces, cars, airplanes, and motorbikes).

In the top row we can see how the network detects different features of each category. Eyes, noses, and mouths for human faces, doors and wheels for cars, and so on. These features are abstract. That is, the network has learned the generic shape of a feature, such as a mouth or a nose, and can detect this feature in the input data despite variations it might have.

In the second row of the preceding image, we can see how the deeper layers of the network combine these features into even more complex ones, such as faces and whole cars. A strength of deep neural networks is that they can learn these high-level abstract representations themselves by deducing them from the training data.

Deep Learning Algorithms

We could define deep learning as a class of machine learning techniques where information is processed in hierarchical layers to understand representations and features from data in increasing levels of complexity. In practice, all deep learning algorithms are neural networks, which share some common basic properties. They all consist of interconnected neurons that are organized in layers. Where they differ is network architecture (the way neurons are organized in the network), and sometimes the way they are trained.

With that in mind, let's look at the main classes of neural networks. The following list is not exhaustive, but it represents the vast majority of algorithms in use today.

Multi-Layer Perceptrons (MLPs)

A neural network with feedforward propagation, fully-connected layers, and at least one hidden layer.

The above diagram demonstrates a 3-layer fully connected neural network with two hidden layers. The input layer has k input neurons, the first hidden layer has n hidden neurons, and the second hidden layer has m hidden neurons. The output, in this example, is the two classes y1 and y2. On top is the always-on bias neuron. A unit from one layer is connected to all units from the previous and following layers (hence fully connected).
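To make the idea of feedforward, fully connected layers concrete, here is a minimal pure-Python sketch of a forward pass through a network of this shape. Everything in it is an assumption made for illustration: the layer sizes (k = 3, n = 2, m = 2), the sigmoid activation, and the weight and bias values, which are arbitrary rather than trained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: every unit sees every input value."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


# Arbitrary, untrained parameters chosen only to show the data flow.
x = [0.5, -1.0, 2.0]                                        # k = 3 inputs
W1, b1 = [[0.1, 0.2, -0.1], [0.4, -0.3, 0.2]], [0.0, 0.1]   # n = 2 hidden units
W2, b2 = [[0.3, -0.2], [0.1, 0.5]], [0.0, 0.0]              # m = 2 hidden units
W3, b3 = [[0.7, -0.4], [-0.6, 0.2]], [0.1, -0.1]            # outputs y1 and y2

h1 = dense(x, W1, b1)   # input layer -> first hidden layer
h2 = dense(h1, W2, b2)  # first hidden layer -> second hidden layer
y = dense(h2, W3, b3)   # second hidden layer -> output layer
print(y)
```

Each call to dense feeds the previous layer's outputs forward, which is all that feedforward propagation means here.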

Convolutional Neural Networks (CNNs)

A CNN is a feedforward neural network with several types of special layers. For example, convolutional layers apply a filter to the input image (or sound) by sliding that filter all across the incoming signal, to produce an n-dimensional activation map. There is some evidence that neurons in CNNs are organized similarly to how biological cells are organized in the visual cortex of the brain. Today, they outperform all other ML algorithms on a large number of computer vision and natural language processing tasks.
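The sliding-filter idea can be sketched in one dimension with plain Python. The signal and the two-value edge filter below are made-up values for illustration; a real convolutional layer learns its filter values during training and usually works on 2D or 3D inputs.

```python
def slide_filter(signal, kernel):
    """Slide the kernel across the signal, taking a weighted sum at each step.

    This is the cross-correlation form commonly used in deep learning,
    with no padding and a stride of 1.
    """
    width = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


signal = [0, 0, 0, 5, 5, 5, 0, 0]  # a flat signal with one raised section
edge_kernel = [-1, 1]              # responds to changes between neighbors

activation_map = slide_filter(signal, edge_kernel)
print(activation_map)  # [0, 0, 5, 0, 0, -5, 0]
```

The nonzero entries mark where the signal rises and falls, which is the 1D analogue of detecting an edge in an image.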

Recurrent Neural Networks (RNNs)

This type of network has an internal state (or memory), which is based on all or part of the input data already fed to the network. The output of a recurrent network is a combination of its internal state (memory of previous inputs) and the latest input sample. At the same time, the internal state changes, to incorporate newly input data. Because of these properties, recurrent networks are good candidates for tasks that work on sequential data, such as text or time-series data.
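A single recurrent unit is enough to show how the internal state carries information from earlier inputs. The sketch below is a deliberately tiny assumption-laden example: one scalar state, a tanh activation, and fixed weights (0.5 and 1.0) picked by hand rather than learned.

```python
import math


def rnn_step(state, x, w_state=0.5, w_input=1.0):
    """New state = tanh of a weighted mix of the old state and the new input."""
    return math.tanh(w_state * state + w_input * x)


def run(sequence):
    state = 0.0  # the network starts with an empty memory
    states = []
    for x in sequence:
        state = rnn_step(state, x)
        states.append(state)
    return states


early = run([1.0, 0.0, 0.0])  # the signal arrives first, then fades
late = run([0.0, 0.0, 1.0])   # the same signal arrives last

print(early)
print(late)
```

Both sequences contain the same values, but the final states differ because the state depends on when each input arrived.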

Autoencoders

Autoencoders are a class of unsupervised learning algorithms in which the output shape is the same as the input, which allows the network to better learn basic representations. An autoencoder consists of input, hidden (or bottleneck), and output layers. Although it's a single network, we can think of it as a virtual composition of two components:

Encoder: Maps the input data to the network's internal representation.
Decoder: Tries to reconstruct the input from the network's internal data representation.

Reinforcement Learning (RL)

Reinforcement algorithms learn how to achieve a complex objective over many steps by using penalties when they make a wrong decision and rewards when they make a correct decision. It is a method often used to teach a machine how to interact with an environment, similar to the way human behaviour is shaped by negative and positive feedback. RL is often used in building computer games and autonomous vehicles.

Applications of Deep Learning

Machine learning, particularly deep learning, is producing more and more astonishing results in terms of the quality of predictions, feature detection, and classification. Many of these recent results have even made the news! Here are just a few of the ways that these techniques can be applied today or in the near future:

Autonomous vehicles

Nowadays, new cars have a suite of safety and convenience features that aim to make the driving experience safer and less stressful. One such feature is automated emergency braking if the car sees an obstacle. Another one is lane-keeping assist, which allows the vehicle to stay in its current lane without the driver needing to make corrections with the steering wheel. To recognize lane markings, other vehicles, pedestrians, and cyclists, these systems use a forward-facing camera. We can speculate that future autonomous vehicles will also use deep networks for computer vision.

Image and text recognition

Both Google's Vision API and Amazon's Rekognition service use deep learning models to provide various computer vision capabilities. These include recognizing and detecting objects and scenes in images, text recognition, face recognition, and so on.

Medical imaging

Medical imaging is an umbrella term for various non-invasive methods of creating visual representations of the inside of the body. Some of these include Magnetic resonance images (MRIs), ultrasound, Computed Axial Tomography (CAT) scans, X-rays, and histology images. Typically, such an image is analyzed by a medical professional to determine the patient's condition. Machine learning, computer vision in particular, is enabling computer-aided diagnosis which can help specialists by detecting and highlighting important features of images.

For example, to determine the degree of malignancy of colon cancer a pathologist would have to analyze the morphology of the glands using histology imaging. This is a challenging task because morphology can vary greatly. A deep neural network could segment the glands from the image automatically, leaving the pathologist to verify the results. This would reduce the time needed for analysis, making it cheaper and more accessible.

Medical history analysis

Another medical area that could benefit from deep learning is the analysis of medical history records. Before a doctor diagnoses a condition and prescribes treatment they consult the patient's medical history for additional insight. A deep learning algorithm could extract the most relevant and important information from those extensive records, even if they are handwritten. In this way the doctor's job can be made easier while also reducing the risk of errors.

Language translation

Google's Neural Machine Translation API uses – you guessed it – deep neural networks for machine translation.

Speech recognition and generation

Google Duplex is another impressive real-world demonstration of deep learning. It's a new system that can carry out natural conversations over the phone. For example, it can make restaurant reservations on a user's behalf. It uses deep neural networks both to understand the conversation and to generate realistic, human-like replies.

Siri, Google Assistant, and Amazon Alexa also rely on deep networks for speech recognition.

Gaming

Finally, AlphaGo is an artificial intelligence (AI) machine based on deep learning that made the news in March 2016 for beating the world Go champion, Lee Sedol. AlphaGo had already made the news in January 2016, when it beat the European champion, Fan Hui, although at the time it seemed unlikely that it could go on to beat the world champion. Fast-forward a couple of months and AlphaGo achieved this remarkable feat, winning the five-game series 4-1.

This was an important milestone because Go has many more possible game variations than other games, such as chess, and it's impossible to consider every possible move in advance. Also, unlike chess, in Go it's very difficult to even judge the current position or value of a single stone on the board. In 2017, DeepMind released an updated version of AlphaGo called AlphaZero.

In this crash course, we explained what deep learning is and how it's related to deep neural networks. We discussed the different types of networks and some real-world applications of deep learning.

This is only the beginning — if you’d like to learn more about deep learning, Next Tech has a Python Deep Learning Projects series that explores real-world deep learning projects across computer vision, natural language processing (NLP), and image processing. Part 1 of the series covers setting up a DL environment and building an MLP. You can get started here.

Immutable vs Mutable Data Types in Python

Lorraine — Thu, 17 Oct 2019 20:07:56 +0000

By now you may have heard the phrase "everything in Python is an object". Objects are Python's abstraction for data, and Python has an amazing variety of data structures that you can use to represent data, or combine to create your own custom data.

A first fundamental distinction that Python makes on data is about whether or not the value of an object changes. If the value can change, the object is called mutable, while if the value cannot change, the object is called immutable.

In this crash course, we will explore:

The difference between mutable and immutable types
Different data types and how to find out whether they are mutable or immutable

It is very important that you understand the distinction between mutable and immutable because it affects the code you write.

Let’s get started!

This crash course is adapted from Next Tech’s Learn Python Programming course that uses a mix of theory and practicals to explore Python and its features, and progresses from beginner to being skilled in Python. It includes an in-browser sandboxed environment with all the necessary software and libraries pre-installed. You can get started for free here!

Mutable vs Immutable

To get started, it's important to understand that every object in Python has an ID (or identity), a type, and a value, as shown in the following snippet:

age = 42
print(id(age))     # id
print(type(age))   # type
print(age)         # value

10966208
<class 'int'>
42

Once created, the ID of an object never changes. It is a unique identifier for it, and it is used behind the scenes by Python to retrieve the object when we want to use it.

The type also never changes. The type tells what operations are supported by the object and the possible values that can be assigned to it.

The value can either change or not. If it can, the object is said to be mutable, while when it cannot, the object is said to be immutable.

Let's take a look at an example:

age = 42
print(id(age))
print(type(age))
print(age)

age = 43
print(age)
print(id(age))

10966208
<class 'int'>
42
43
10966240

Has the value of age changed? Well, no. 42 is an integer number, of the type int, which is immutable. So, what happened is really that on the first line, age is a name that is set to point to an int object, whose value is 42.

When we type age = 43, what happens is that another object is created, of the type int and value 43 (also, the id will be different), and the name age is set to point to it. So, we didn't change that 42 to 43. We actually just pointed age to a different location.

As printing id(age) before and after the reassignment shows, the two IDs are different.

Now, let's see the same example using a mutable object.

x = [1, 2, 3]
print(x)
print(id(x))

x.pop()
print(x)
print(id(x))

[1, 2, 3]
139912816421064
[1, 2]
139912816421064

For this example, we created a list named x that contains 3 integers, 1, 2, and 3. After we change x by “popping” off the last value 3, the ID of x stays the same!

So, objects of type int are immutable and objects of type list are mutable. Now let’s discuss other immutable and mutable data types!
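One practical consequence shows up when two names point to the same object. The short sketch below (variable names are arbitrary) contrasts a shared list with a shared integer:

```python
# Two names, one mutable object: a change through either name is visible to both.
x = [1, 2, 3]
y = x
y.append(4)
print(x)       # [1, 2, 3, 4]
print(x is y)  # True

# With an immutable int, "changing" b just points b at a different object.
a = 42
b = a
b += 1
print(a, b)    # 42 43

# Copying the list first keeps the original untouched.
z = list(x)
z.append(5)
print(x)       # [1, 2, 3, 4]
```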

Mutable Data Types

Mutable objects can be changed after creation. Some of Python's mutable data types are: lists, byte arrays, sets, and dictionaries.

Lists

As you saw earlier, lists are mutable. Here's another example using the append() method:

a = list(('apple', 'banana', 'clementine'))
print(id(a))

a.append('dates')
print(id(a))

140372445629448
140372445629448

Byte Arrays

Byte arrays represent the mutable version of bytes objects. They expose most of the usual methods of mutable sequences as well as most of the methods of the bytes type. Items are integers in the range [0, 256).

Let's see a quick example with the bytearray type to show that it is mutable:

b = bytearray(b'python')
print(id(b))

b[0] = ord('P')    # change the first byte in place; b.replace() would return a new bytearray
print(id(b))

139963525979808
139963525979808

Sets

Python provides two set types, set and frozenset. They are unordered collections of immutable objects.

c = set(('San Francisco', 'Sydney', 'Sapporo'))
print(id(c))

c.pop()
print(id(c))

140494031990344
140494031990344

As you can see, sets are indeed mutable. Later, in the Immutable Data Types section, we will see that frozensets are immutable.

Dictionaries

d = {
    'a': 'alpha',
    'b': 'bravo',
    'c': 'charlie',
    'd': 'delta',
    'e': "echo"
}
print(id(d))

d.update({
    'f': 'foxtrot'
})
print(id(d))

140071114319408
140071114319408

Immutable Data Types

Immutable data types differ from their mutable counterparts in that they cannot be changed after creation. Some immutable types include numeric data types, strings, bytes, frozen sets, and tuples.

Numeric Data Types

You have already seen that integers are immutable; similarly, Python’s other built-in numeric data types, such as booleans, floats, and complex numbers, are immutable, as are the standard library's fractions and decimals!

Strings and Bytes

Textual data in Python is handled with str objects, more commonly known as strings. They are immutable sequences of Unicode code points. Unicode code points can represent a character.

When it comes to storing textual data, though, or sending it over the network, you may want to encode it, using an appropriate encoding for the medium you're using. Encoding produces a bytes object, whose syntax and behavior are similar to those of strings.

Both strings and bytes are immutable, as shown in the following snippet:

# string
e = 'Hello, World!'
print(id(e))

e = 'Hello, Mars!'
print(id(e))

140595675113648
140595675113776

# bytes
unicode = 'This is üŋíc0de'     # unicode string: code points
print(type(unicode))
f = unicode.encode('utf-8')     # utf-8 encoded version of unicode string
print(type(f))
print(id(f))

f = b'A bytes object'           # a bytes object
print(id(f))

<class 'str'>
<class 'bytes'>
140595675068152
140595675461360

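In the bytes section, we first defined f as an encoded version of our unicode string. As you can see from print(type(f)), this is a bytes type. We then bind the name f to another bytes object whose value is b'A bytes object'. The two objects have different IDs: assigning to f did not alter the original bytes object but pointed the name at a new one. Strings and bytes offer no operation that changes an existing object in place, which is what makes them immutable.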
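A more direct way to see this immutability is to attempt an in-place change. In the sketch below (variable names are arbitrary), item assignment fails for str and bytes but succeeds for a bytearray built from the same data:

```python
text = 'Hello'
try:
    text[0] = 'J'          # strings do not support item assignment
except TypeError as error:
    str_error = error
    print('str:', error)

data = b'Hello'
try:
    data[0] = 74           # neither do bytes objects
except TypeError as error:
    bytes_error = error
    print('bytes:', error)

# The mutable counterpart accepts the same change (74 is the code for 'J').
mutable = bytearray(data)
mutable[0] = 74
print(mutable)             # bytearray(b'Jello')
print(data)                # b'Hello'
```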

Frozen Sets

As discussed in the previous section, frozensets are similar to sets. However, frozenset objects are quite limited compared with their mutable counterpart, since they cannot be changed. Nevertheless, they still prove very effective for membership tests, union, intersection, and difference operations, and for performance reasons.
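As a small sketch of what this means in practice (the city names reuse the earlier set example, and 'Seoul' and 'Pacific' are made-up values), a frozenset has no mutating methods, set operations return new objects, and, unlike a set, it can be used as a dictionary key:

```python
cities = frozenset(('San Francisco', 'Sydney', 'Sapporo'))

# There is no add(), pop(), or remove() on a frozenset.
has_add = hasattr(cities, 'add')
print(has_add)                # False

# Union builds a new frozenset and leaves the original unchanged.
more = cities | {'Seoul'}
print(len(cities), len(more))  # 3 4

# Being immutable (and hashable), a frozenset can serve as a dictionary key.
regions = {cities: 'Pacific'}

try:
    {set(cities): 'Pacific'}   # a mutable set cannot
except TypeError as error:
    set_key_error = error
    print(error)
```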

Tuples

The last immutable sequence type we're going to see is the tuple. A tuple is a sequence of arbitrary Python objects. In a tuple, items are separated by commas. These, too, are immutable, as shown in the following example:

g = (1, 3, 5)
print(id(g))

g = (42, )
print(id(g))

139952252343784
139952253457184
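One subtlety is worth a quick sketch: a tuple's immutability covers which objects it holds, not the contents of those objects. In the example below (values chosen arbitrarily), the tuple rejects item assignment, yet a list stored inside it can still change:

```python
g = (1, [3, 5])

try:
    g[0] = 42              # the tuple itself cannot be modified
except TypeError as error:
    tuple_error = error
    print(error)

tuple_id = id(g)
g[1].append(7)             # but the list it contains is still mutable
print(g)                   # (1, [3, 5, 7])
print(id(g) == tuple_id)   # True
```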

I hope you enjoyed this crash course on the difference between immutable and mutable objects and how to find out which an object is! Now that you understand this fundamental concept of Python programming, you can explore the methods available for each data type.

If you’d like to learn about this and continue to advance your Python skills, Next Tech has a full Learn Python Programming course that covers:

Functions
Conditional programming
Comprehensions and generators
Decorators, object-oriented programming, and iterators
File data persistence
Testing, including a brief introduction to test-driven development
Exception handling
Profiling and performances

You can get started here for free!

Free coding environments and interactive courses!

Saul Costa — Tue, 20 Aug 2019 16:56:20 +0000

Hello Dev.to readers! 👋

We have some exciting news to share with you: we’ve partnered with the GitHub Student Developer Pack to make it easier for students everywhere to learn programming and write their own programs!

What does this mean for you as a student? Well, students now get access to Next Tech’s course library and sandboxed computing environments for free for a whole year! Typically this would cost $228, but with our offer, you can save that money for a ☔ day (or a ☀️ one, for that matter!).

As former and current computing students ourselves, we know that setting up your computer for programming can be tough. It’s especially frustrating when you can see the solution to your homework or next project in your head, but your computer just isn’t working.

This is the exact problem Next Tech was founded to solve! Instead of having to do this setup yourself, you can launch a real programming environment in a couple of seconds and use it directly from your browser. 🤯

With these programming environments, you can learn programming, web development, data science, and more with our library of courses. Or, you can start from scratch and build an entire project using some coding sandboxes. All with no downloads, fast access from any computer anywhere, and now, free for you. 😊

Sound like fun? Head over to the GitHub Student Developer Pack page to claim yours, or if you already have the Pack, come check out Next Tech!

Build a Blackjack Command Line Game

Saul Costa — Tue, 23 Jul 2019 16:16:21 +0000

In this tutorial, we'll cover how to build a command line game for playing Blackjack using Python! You'll get to build the game from start to finish, and when you're done, you'll have a fully functioning game to play from the command line.

While building the game, we'll explore a few handy Python concepts, such as object-oriented programming using classes and how to manage a game loop. This tutorial is also extracted from an entire course on building a Blackjack game using a graphical user interface (GUI), which you can check out here if you're interested.

Sound fun? Let's do it!

What is Blackjack?

Blackjack is a gambling game that requires only a deck of cards. The goal of the game is to get as close as possible to a hand worth 21 points as the dealer flips over your cards – but go over and you're out!

In Blackjack, numbered cards (2 through 10) are worth their face value, picture cards (jack, queen, and king) are worth 10, and an ace is worth either 1 or 11 depending on your other cards. To start a hand, players place their bets and are dealt two cards face up. They can choose to "hit" (receive another card) or "stick" (stay with their current hand) as they attempt to get as close as possible to 21. If they choose to hit and go over 21, they "bust" and lose the hand (and the money they bet!).

Players face off against the dealer, who starts with one card face down and one face up. When all players have chosen to stick or have busted, the dealer then flips over their hidden card and either hits or sticks, their goal being to get a higher hand than any of the players.

If the dealer busts, they pay out the value of each player's wager to that player, provided that the player hasn't already busted. They also need to pay out if they don't get a higher hand than a player.

There are a lot of other rules (of course!) that you can read up on if you're interested, but the above is everything you need to know to build this game.

Okay, let's get started with some coding!

Installing Python

If you don't already have Python installed on your computer, you'll need to do so based on the instructions here. If you'd rather avoid that, you can grab an online coding sandbox with Python and other necessary libraries pre-installed here (sign in required).

Defining Classes

Before we begin coding our blackjack game, it's important we cover how we'll use object-oriented programming, since we will need to utilize classes for our game.

We will begin by defining the classes that will be used in order to separate out different aspects of the game of blackjack. We will model three of the components of the game:

Card: A basic playing card. The card belongs to a suit (hearts ♥, diamonds ♦, spades ♠, or clubs ♣) and is worth a certain value.
Deck: A collection of cards. The deck starts with 52 unique cards and shrinks as cards are drawn.
Hand: Each player's assigned cards. A hand is what defines each player's score and thus who wins.

Let's begin with the simplest concept: the Card.

The `Card` class

The Card class will be the first class we define, as both of our other classes will need to use it. Create a Python file called blackjack.py, then add the following code:

import random

class Card:
    def __init__(self, suit, value):
        self.suit = suit
        self.value = value

    def __repr__(self):
        return " of ".join((self.value, self.suit))

The only import we will need for our game is the random module. This will allow us to shuffle our virtual deck of cards at the beginning of every game.

Our first class will be one representing the playing cards. Each card will have a suit (hearts, diamonds, spades, and clubs) and a value (ace through king). We define the __repr__ function in order to change how the card is displayed when we call print on it. Our function will return the value and the suit, for example, K of Spades. That's all we need to do for a Card!
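
As a quick check of the __repr__ behavior described above, the short sketch below repeats the Card class so that it runs on its own. The card chosen is arbitrary, and the card_text variable is only there for illustration.

```python
class Card:
    def __init__(self, suit, value):
        self.suit = suit
        self.value = value

    def __repr__(self):
        return " of ".join((self.value, self.suit))


card = Card("Spades", "K")
print(card)    # print falls back to __repr__ because no __str__ is defined
print([card])  # lists also use __repr__ for the items they contain
card_text = repr(card)
```

Because both value and suit are stored as strings, join can combine them directly without any conversion.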

Next up, we need to create a Deck of these Card instances.

The `Deck` class

The Deck will need to contain 52 unique cards and must be able to shuffle itself. It will also need to be able to deal cards and decrease in size as cards are removed. Create the Deck class in the blackjack.py file using the below code:

class Deck:
    def __init__(self):
        self.cards = [Card(s, v) for s in ["Spades", "Clubs", "Hearts",
                      "Diamonds"] for v in ["A", "2", "3", "4", "5", "6", 
                      "7", "8", "9", "10", "J", "Q", "K"]]

    def shuffle(self):
        if len(self.cards) > 1:
            random.shuffle(self.cards)

    def deal(self):
        # Deal the top card whenever at least one card remains.
        if len(self.cards) > 0:
            return self.cards.pop(0)

When creating an instance of the Deck, we simply need to have a collection of every possible card. We achieve this by using a list comprehension containing lists of every suit and value. We pass each combination over to the initialization for our Card class to create 52 unique Card instances.

Our Deck will need to be able to be shuffled so that every game is different. We use the shuffle function in the random library to do this for us (how fitting). To avoid any potential errors, we will only shuffle a deck which still has two or more cards in it, since shuffling one or zero cards is pointless.

After shuffling, we will need to deal cards too. We utilize the pop function of a list (which is the data structure holding our cards) to return the top card and remove it from the deck so that it cannot be dealt again.
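
To see the deck shrink as cards are dealt, here is a small self-contained sketch. It repeats the Card and Deck classes so it can run on its own, and its deal method guards with `if self.cards`, so the very last card can be dealt too. The variable names at the bottom are only for illustration.

```python
import random


class Card:
    def __init__(self, suit, value):
        self.suit = suit
        self.value = value

    def __repr__(self):
        return " of ".join((self.value, self.suit))


class Deck:
    def __init__(self):
        suits = ["Spades", "Clubs", "Hearts", "Diamonds"]
        values = ["A", "2", "3", "4", "5", "6", "7",
                  "8", "9", "10", "J", "Q", "K"]
        self.cards = [Card(s, v) for s in suits for v in values]

    def shuffle(self):
        if len(self.cards) > 1:
            random.shuffle(self.cards)

    def deal(self):
        # Returns None when the deck is empty.
        if self.cards:
            return self.cards.pop(0)


deck = Deck()
size_before = len(deck.cards)
deck.shuffle()
first = deck.deal()
size_after = len(deck.cards)
unique_cards = len({repr(c) for c in Deck().cards})
print(size_before, size_after, unique_cards)  # 52 51 52
```

Since pop removes the card from the list, a dealt card can never turn up again in the same game.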

That's it for the Deck class! The final utility class to be created for our game to work is the Hand. All players have a hand of cards, and each hand is worth a numerical value based on the cards it contains.

The `Hand` class

A Hand class will need to contain cards just like the Deck class does. It will also be assigned a value by the rules of the game based on which cards it contains. Since the dealer's hand should only display one card, we also keep track of whether the Hand belongs to the dealer to accommodate this rule.

Start with the below to create the Hand class in the blackjack.py file:

class Hand:
    def __init__(self, dealer=False):
        self.dealer = dealer
        self.cards = []
        self.value = 0

    def add_card(self, card):
        self.cards.append(card)

Much like the Deck, a Hand will hold its cards as a list of Card instances. When adding a card to the hand, we simply add the Card instance to our cards list.

Within the Hand class, calculating the value of the currently held cards is where the rules of the game come into play the most:

    def calculate_value(self):
        self.value = 0
        has_ace = False
        for card in self.cards:
            if card.value.isnumeric():
                self.value += int(card.value)
            else:
                if card.value == "A":
                    has_ace = True
                    self.value += 11
                else:
                    self.value += 10

        if has_ace and self.value > 21:
            self.value -= 10

    def get_value(self):
        self.calculate_value()
        return self.value

You may note that the above code is already indented. This is intentional and done below too! This way, you don't need to perform the indents yourself and can focus on reading the instructions and code instead of chasing down whitespace errors.

In this code, we first initialize the value of the hand to 0 and assume the player does not have an ace (since this is a special case).

Then, we loop through the Card instances and try to add their value as a number to the player's total, using the following logic:

If the card's value is numerical, we add its value to the value of this hand (self.value).
If it is not numerical, we check to see whether the card is an ace. If it is, we add 11 to the hand's value and set the has_ace flag to True.
If it is not an ace, we simply add 10 to the value of the hand.

Once this is done, we check to see if there was an ace and the increase of 11 points brought the hand's value over 21. If so, we make the ace worth 1 point instead by subtracting 10 from the hand's value.
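
One thing to be aware of: the has_ace flag lowers the total by 10 at most once, so a hand holding two aces and a king would be scored as 22 rather than 12. If you'd like to handle that case, one possible variant (a sketch, not part of the original tutorial) counts the aces and demotes them one at a time. The hand_value function below is a hypothetical standalone helper that takes a list of card value strings rather than Card instances.

```python
def hand_value(values):
    """Return the best Blackjack total for a list of card value strings."""
    total = 0
    aces = 0
    for value in values:
        if value.isnumeric():
            total += int(value)
        elif value == "A":
            aces += 1
            total += 11
        else:
            total += 10

    # Demote aces from 11 to 1, one at a time, while the hand is over 21.
    while total > 21 and aces > 0:
        total -= 10
        aces -= 1
    return total


print(hand_value(["A", "K"]))        # 21
print(hand_value(["A", "5", "10"]))  # 16
print(hand_value(["A", "A", "K"]))   # 12
```

The same loop could be dropped into calculate_value by replacing has_ace with an ace counter.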

Now, we need some way for the game to display each hand's cards, so we use a simple function to print each card in the hand, and the value of the player's hand too. The dealer's first card is face down, so we print hidden instead:

    def display(self):
        if self.dealer:
            print("hidden")
            print(self.cards[1])
        else:
            for card in self.cards:
                print(card)
            print("Value:", self.get_value())

Now that we have all of our underlying data structures written, it's time for the main game loop!

The Game Loop

We will define the game's main loop within its play method, so that to start a game, you will simply need to create an instance of the Game class and call .play() on it:

class Game:
    def __init__(self):
        pass

    def play(self):
        playing = True

        while playing:
            self.deck = Deck()
            self.deck.shuffle()

            self.player_hand = Hand()
            self.dealer_hand = Hand(dealer=True)

            for i in range(2):
                self.player_hand.add_card(self.deck.deal())
                self.dealer_hand.add_card(self.deck.deal())

            print("Your hand is:")
            self.player_hand.display()
            print()
            print("Dealer's hand is:")
            self.dealer_hand.display()

The above code is pretty lengthy, so let's break it down:

We start off our loop with a Boolean (playing) which will be used to track whether or not we are still playing the game.
If we are, we need a shuffled Deck and two Hand instances—one for the dealer and one for the player.
We use the range function to deal two cards each to the player and the dealer. Our deal method will return a Card instance, which is passed to the add_card method of our Hand instances.
Finally, we display the hands to our player. We can use the display method on our Hand instances to print this to the screen.

This marks the end of the code that needs to run at the beginning of every new game. Now, we enter a loop that will run until a winner is decided. We again control this with a Boolean (game_over):

            game_over = False

            while not game_over:
                player_has_blackjack, dealer_has_blackjack = self.check_for_blackjack()

Before continuing, we first need to check for blackjack. If either player has been dealt an ace and a ten-valued card (a 10 or a picture card), their hand will total 21, so they automatically win. Let's create the method to do this (under the play method):

    def check_for_blackjack(self):
        player = False
        dealer = False
        if self.player_hand.get_value() == 21:
            player = True
        if self.dealer_hand.get_value() == 21:
            dealer = True

        return player, dealer

We need to keep track of which player may have blackjack, so we keep a Boolean for the player (player) and the dealer (dealer).

We check whether either hand totals 21 using two if statements. If either has a hand value of 21, its Boolean is changed to True. Next, go back to the while not game_over loop inside the play() method.

If either of the Booleans is True, then we have a winner, so we set game_over to True, print the winner to the screen, and continue. Because game_over is now True, the while not game_over condition fails on its next check and we leave the inner loop. To accomplish this, add the below directly underneath the player_has_blackjack, dealer_has_blackjack = self.check_for_blackjack() line of code:

                if player_has_blackjack or dealer_has_blackjack:
                    game_over = True
                    self.show_blackjack_results(
                        player_has_blackjack, dealer_has_blackjack)
                    continue

We must once again pause to create the method show_blackjack_results(), which will print the winner to the screen. We do this by adding the code below underneath the check_for_blackjack method:

    def show_blackjack_results(self, player_has_blackjack, dealer_has_blackjack):
        if player_has_blackjack and dealer_has_blackjack:
            print("Both players have blackjack! Draw!")

        elif player_has_blackjack:
            print("You have blackjack! You win!")

        elif dealer_has_blackjack:
            print("Dealer has blackjack! Dealer wins!")

If neither player had blackjack, the game loop will continue.
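
Note that check_for_blackjack only looks at a hand's total and runs on every pass through the loop, so a player who hits their way to 21 is also reported as having blackjack. In the usual rules, "blackjack" means 21 on the first two cards only. If you want to make that distinction, one option (a sketch, not part of the original game) is a helper like the hypothetical is_natural_blackjack below, which takes a list of card value strings.

```python
TEN_VALUED = {"10", "J", "Q", "K"}


def is_natural_blackjack(values):
    """True only for a two-card hand made of an ace and a ten-valued card."""
    if len(values) != 2:
        return False
    return "A" in values and any(v in TEN_VALUED for v in values)


print(is_natural_blackjack(["A", "K"]))       # True
print(is_natural_blackjack(["7", "7", "7"]))  # False
```

A three-card 21 such as 7, 7, 7 is still a strong hand, but with this check it would be compared against the dealer's total instead of winning outright.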

The player can now make a choice—whether or not to add more cards to their hand (hit) or submit their current hand (stick). To do this, add the below to the play method:

                choice = input("Please choose [Hit / Stick] ").lower()
                while choice not in ["h", "s", "hit", "stick"]:
                    choice = input("Please enter 'hit' or 'stick' (or H/S) ").lower()

We use the input function to collect a choice from the user. This will always return us a string containing the text the user typed into the command line.

Since we have a string, we can convert the user's input to lowercase using the lower function to avoid having to check combinations of upper case and lower case when parsing their reply.

If their input is not recognized, the while loop simply keeps asking for it again until it is. Next, we handle the case where the player chooses to hit:

                if choice in ['hit', 'h']:
                    self.player_hand.add_card(self.deck.deal())
                    self.player_hand.display()

Should the player choose to hit, they will need to add an extra card to their hand. This is done in the same way as before with the deal() and add_card() methods.

Since their total has changed, we will now need to check whether they are over the allowed limit of 21. Let's define a method that does this now:

    def player_is_over(self):
        return self.player_hand.get_value() > 21

This method simply checks whether the player's hand value is over 21 and returns the information as a Boolean.

Now, back in the play method, add the following inside the if choice in ['hit', 'h'] block:

                    if self.player_is_over():
                        print("You have lost!")
                        game_over = True

If the player’s hand has a value over 21, they have lost, so the game loop needs to break and we set game_over to True (indicating that the dealer has won).

Okay, now let's handle when the player decides to stick with their hand. If they do this, it's time for their score to be compared with the dealer's. To do this, add the below aligned with the if choice in ['hit', 'h'] statement:

                else:
                    player_hand_value = self.player_hand.get_value()
                    dealer_hand_value = self.dealer_hand.get_value()

                    print("Final Results")
                    print("Your hand:", player_hand_value)
                    print("Dealer's hand:", dealer_hand_value)

                    if player_hand_value > dealer_hand_value:
                        print("You Win!")
                    elif player_hand_value == dealer_hand_value:
                        print("Tie!")
                    else:
                        print("Dealer Wins!")
                    game_over = True

We use the else statement here because we have already established that the user's answer was either hit or stick, and we have just checked hit. This means we will only get into this block when the user wants to stick.

The values of both the player's and the dealer's hands are printed to the screen to give the final results. We then compare the values of each hand to see which is higher.

If the player's hand is a higher value than the dealer's, we print You Win!. If the scores are equal, then we have a tie, so we print Tie!. Otherwise, the dealer must have a higher hand than the player, so we show Dealer Wins!.

That completes the logic required for a user to play a single game. Now, let's make it possible for them to play another game by adding the following at the end of the play method, after the inner while not game_over loop but still inside the outer while playing loop:

            again = input("Play Again? [Y/N] ")
            while again.lower() not in ["y", "n"]:
                again = input("Please enter Y or N ")
            if again.lower() == "n":
                print("Thanks for playing!")
                playing = False
            else:
                game_over = False

We once again use the combination of lower and a while loop to ensure our answer is a y or n. If the player answers with n, we thank them for playing and set our playing Boolean to False, thus breaking us out of the main game loop and ending the program. If not, they must have answered y, so we set game_over to False and let our main loop run again. This will take us right back to the top at self.deck = Deck() to set up a brand new game.
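
The hit/stick prompt and the play-again prompt both follow the same ask, validate, and retry pattern. If you'd like to avoid repeating it, one option is a small helper such as the hypothetical ask function below (not part of the original game). Its read parameter defaults to input and is only there so the example can be run with scripted answers.

```python
def ask(prompt, retry_prompt, allowed, read=input):
    """Keep prompting until the lowercased answer is one of `allowed`."""
    answer = read(prompt).lower()
    while answer not in allowed:
        answer = read(retry_prompt).lower()
    return answer


# Scripted answers stand in for a person typing, so this runs unattended.
scripted = iter(["maybe", "", "H"])
choice = ask(
    "Please choose [Hit / Stick] ",
    "Please enter 'hit' or 'stick' (or H/S) ",
    ["h", "s", "hit", "stick"],
    read=lambda _prompt: next(scripted),
)
print(choice)  # h
```

In the game itself you would leave read at its default and call ask once for each prompt.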

Running the Game

We've completed the game! Now, it's time to run this code. To do this, we simply create an instance of the Game class at the end of the file and call the play() method:

if __name__ == "__main__":
    game = Game()
    game.play()

Now that we have a game, give it a play. You can start the game by typing python3 blackjack.py into your command line (or pressing the blue "Run" button, if you're using the sandbox mentioned earlier).

You should see something like the following printed onto your screen:

workspace $ python3 blackjack.py
Your hand is:
A of Diamonds
5 of Clubs
Value: 16

Dealer's hand is:
hidden
A of Clubs
Please choose [Hit / Stick] H
A of Diamonds
5 of Clubs
10 of Hearts
Value: 16
Please choose [Hit / Stick] H
A of Diamonds
5 of Clubs
10 of Hearts
2 of Clubs
Value: 18
Please choose [Hit / Stick] S
Final Results
Your hand: 18
Dealer's hand: 16
You Win!
Play Again? [Y/N] N
Thanks for playing!

Wrapping Up

Congrats on working your way through this tutorial! In it, we covered handy concepts like object-oriented programming and game flow design, and even the basics of Blackjack.

If you got stuck, the complete solution for this project can be found here. You can also launch an online coding sandbox with it preloaded here.

Two limitations of this game are that the dealer will never hit and there is no concept of betting. Feel free to add these features yourself if you'd like! Because a dealer is required to hit or stick at certain hand values, you can develop a program that mimics the dealer exactly.
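
As a starting point for the dealer, here is a minimal sketch. It assumes a common house rule that the tutorial itself does not specify: the dealer hits on 16 or less and sticks on 17 or more. The dealer_turn and hand_value functions are hypothetical helpers that work on lists of card value strings, so you would need to adapt them to the Hand and Deck classes.

```python
def hand_value(values):
    """Best Blackjack total, counting each ace as 11 unless that busts."""
    total = 0
    aces = 0
    for value in values:
        if value.isnumeric():
            total += int(value)
        elif value == "A":
            aces += 1
            total += 11
        else:
            total += 10
    while total > 21 and aces > 0:
        total -= 10
        aces -= 1
    return total


def dealer_turn(hand, deck, stand_on=17):
    """Draw from the front of `deck` until the hand reaches `stand_on`."""
    hand = list(hand)  # copy so the caller's lists are left untouched
    deck = list(deck)
    while hand_value(hand) < stand_on and deck:
        hand.append(deck.pop(0))
    return hand, deck


final_hand, remaining = dealer_turn(["5", "A"], ["10", "2", "K"])
print(final_hand, hand_value(final_hand))  # ['5', 'A', '10', '2'] 18
```

Some tables also require the dealer to hit on a "soft" 17 (a 17 that counts an ace as 11). This sketch sticks on any 17.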

You can also check out the full course behind this tutorial, if you'd like!

Building Your First React Website

Andrew Sverdrup — Wed, 17 Jul 2019 14:03:35 +0000

React is one of the most popular web frameworks out there. It has been growing steadily in popularity for years, passing Angular for the first time in the 2019 Stack Overflow developer survey.

This post will show you how to create your own React website in just a few minutes. If you're interested in learning more after completing this tutorial, check out the Beginning React course I just created on Next Tech to further improve your React skills.

For now, let's dive right into building a website with React!

Prerequisites

To complete these steps you'll need to have the Node Package Manager (npm) installed. If you don't have it installed yet, head on over to https://www.npmjs.com/get-npm to download and install npm.

This will also install npx which we'll use to run create-react-app.

Create React App

Create React App is an excellent way to quickly get a React website up and running. Create React App was created by Facebook (the same company that created React!). In their docs, they describe it as:

Create React App is an officially supported way to create single-page React applications. It offers a modern build setup with no configuration.

Knowing that Create React App is supported by the creators of React is a huge plus. Let's use it to get started with our website!

Run the following command to create your site:

npx create-react-app hello-react

Note that it may take a couple minutes for this command to complete.

Viewing the React website

Next, run the following commands to start the React development server:

cd hello-react
npm start

At this point a browser tab should open showing your React site. If it doesn't, visit http://localhost:3000 in your favorite browser to see your React site!

Updating the site

Now, let's make a change to update the site. Open the hello-react/src/App.js file, then replace the following line:

Edit <code>src/App.js</code> and save to reload.

with

My first React website!

If you open the web page again you'll see that it updated without you having to refresh the page! Live reloading is one of the awesome features that Create React App configures for you.

Creating a React Component

Next, we'll create a new React component. First, create a folder in the src folder named components. Then create a file called HomepageImage.js in the src/components folder. This file will hold our new homepage image component.

We'll create this component by adding the following code to the HomepageImage.js file:

import React from 'react';

function HomepageImage() {
  const url = 'https://cdn.filestackcontent.com/XYrHCaFGRSaq0EPKY1S6';
  return (
    <img src={url} style={{width: 650}} alt='Image of Golden Gate Bridge' />
  );
}

export default HomepageImage;

Then, in App.js, replace

<img src={logo} className="App-logo" alt="logo" />

with

<HomepageImage />

We also need to import the component at the top of App.js by adding the following code to the top of the file:

import HomepageImage from './components/HomepageImage'

Since we removed the image of the React logo, you can then remove this import for the logo as well:

import logo from './logo.svg';

The final App.js file should look like this:

import React from 'react';
import './App.css';
import HomepageImage from './components/HomepageImage'

function App() {
  return (
    <div className="App">
      <header className="App-header">
        <HomepageImage />
        <p>
          My first React website!
        </p>
        <a
          className="App-link"
          href="https://reactjs.org"
          target="_blank"
          rel="noopener noreferrer"
        >
          Learn React
        </a>
      </header>
    </div>
  );
}

export default App;

Now, open http://localhost:3000 again in your browser. If everything is working, you should see the new image above the text My first React website! and the Learn React link.

Congratulations on creating your first website using React 🎉!

Next steps

This tutorial was a quick introduction to creating web pages with React. If you want to gain a better understanding of React so you can build awesome sites using it, check out the course I just released that teaches React!

Have you built a site with React? Feel free to share your URL or a link to your project on GitHub in the comments below to show it off!

Thanks for reading,

Andrew, Software Engineer @ Next Tech

Special thanks to Maarten van den Heuvel for taking the photo of the Golden Gate Bridge used in this post!

Database Normalization Explained

Lorraine — Mon, 01 Jul 2019 19:29:45 +0000

Normalization is a technique for organizing data in a database. It is important that a database is normalized to minimize redundancy (duplicate data) and to ensure only related data is stored in each table. It also prevents any issues stemming from database modifications such as insertions, deletions, and updates.

The stages of organization are called normal forms. In this tutorial we will be redesigning a database for a construction company and ensuring that it satisfies the three normal forms:

First Normal Form (1NF):

Data is stored in tables with rows uniquely identified by a primary key
Data within each table is stored in individual columns in its most reduced form
There are no repeating groups

Second Normal Form (2NF):

Everything from 1NF
Only data that relates to a table’s primary key is stored in each table

Third Normal Form (3NF):

Everything from 2NF
There are no in-table dependencies between the columns in each table

Note that there are actually six levels of normalization; however, the third normal form is considered the highest level necessary for most applications so we will only be discussing the first three forms.

Let's get started!

This tutorial is adapted from Next Tech's Database Fundamentals course which comes with an in-browser MySQL database and interactive tasks and projects to complete. You can get started here for free!

Our Database: Codey's Construction

Codey's Construction's database schema with a new table that causes the database to violate the rules of normalization.

The database we will be working with in this tutorial is for Codey's Construction company (Codey is a helpful coding bot that works with you in the course mentioned earlier). As you can see from the schema above, the database contains the tables projects, job_orders, employees, project_employees. Recently, they have decided to add the table customers to store customer data.

Unfortunately, this table has not been designed in a way that satisfies the three forms of normalization... Let's fix that!

First Normal Form

First normal form relates to the duplication and over-grouping of data in tables and columns.

Codey’s Construction's table customers violates all three rules of 1NF.

There is no primary key! A user of the database would be forced to look up companies by their name, which is not guaranteed to be unique (since unique company names are registered on a state-by-state basis).
The data is not in its most reduced form. The column contact_person_and_role can be further divided into two columns, such as contact_person and contact_role.
There are two repeating groups of columns - (project1_id, project1_feedback) and (project2_id, project2_feedback).

The following SQL statement was used to create the customers table:

CREATE TABLE customers (
    name                          VARCHAR(255),
    industry                      VARCHAR(255),
    project1_id                   INT(6),
    project1_feedback             TEXT,
    project2_id                   INT(6),
    project2_feedback             TEXT,
    contact_person_id             INT(6),
    contact_person_and_role       VARCHAR(300),
    phone_number                  VARCHAR(12),
    address                       VARCHAR(255),
    city                          VARCHAR(255),
    zip                           VARCHAR(5)
  );

Example data for `customers` table.

By modifying some columns, we can help redesign this table so that it satisfies 1NF.

First, we need to add a primary key column called id with data type INT(6):

ALTER TABLE customers
    ADD COLUMN id INT(6) AUTO_INCREMENT PRIMARY KEY FIRST;

With this statement, we added an automatically incrementing primary key as the first column in the table.

To satisfy the second condition, we need to split the contact_person_and_role column:

ALTER TABLE customers
    CHANGE COLUMN contact_person_and_role contact_person VARCHAR(300);

ALTER TABLE customers
    ADD COLUMN contact_person_role VARCHAR(300) AFTER contact_person;

Here, we simply renamed it as contact_person, and added a column contact_person_role immediately after it.

To satisfy the third condition, we need to move the columns containing project IDs and project feedback to a new table called project_feedbacks. First, let's drop these columns from the customers table:

ALTER TABLE customers
    DROP COLUMN project1_id,
    DROP COLUMN project1_feedback,
    DROP COLUMN project2_id,
    DROP COLUMN project2_feedback;

And then create the project_feedbacks table:

CREATE TABLE project_feedbacks (
    id                  INT(6) AUTO_INCREMENT PRIMARY KEY,
    project_id          INT(6),
    customer_id         INT(6),
    project_feedback    TEXT
);

Here's what the database schema looks like now:

Modified schema that now satisfies 1NF.

As you can see, there are no more repeating groups in either the project_feedbacks table or the customers table. We still know which customer said what since project_feedbacks.customer_id refers back to the customers table.
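
To see how project_feedbacks.customer_id ties each piece of feedback back to a customer, here is a small sketch. It uses Python's built-in sqlite3 module in place of the MySQL database used in this tutorial, so the column types are simplified, and the sample rows are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        name  TEXT
    );
    CREATE TABLE project_feedbacks (
        id                INTEGER PRIMARY KEY,
        project_id        INTEGER,
        customer_id       INTEGER,
        project_feedback  TEXT
    );
    -- Invented sample rows, for illustration only.
    INSERT INTO customers (id, name)
    VALUES (1, 'Example Co'), (2, 'Sample LLC');
    INSERT INTO project_feedbacks (project_id, customer_id, project_feedback)
    VALUES (10, 1, 'On time'), (11, 1, 'Over budget'), (12, 2, 'Great crew');
""")

# Join each feedback row back to the customer who gave it.
rows = conn.execute("""
    SELECT c.name, f.project_id, f.project_feedback
    FROM project_feedbacks AS f
    JOIN customers AS c ON c.id = f.customer_id
    ORDER BY f.project_id
""").fetchall()

for row in rows:
    print(row)
```

With this layout a customer can have any number of feedback rows, rather than being limited to the two project columns the original table allowed.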

Now our customers table satisfies 1NF! Let's move on to second normal form.

Second Normal Form

To achieve second normal form, a database must first satisfy all the conditions for 1NF. After this, satisfying 2NF requires that all data in each table relates directly to the record that the primary key of the table identifies.

We are in violation of 2NF because the contact_person, contact_person_role and phone_number columns track data that relate to the contact person, not the customer. If the contact person for a customer changes, we would have to edit all of these columns, running the risk that we will change the values in one of the columns but forget to change another.

To help Codey's Construction fix this table to satisfy 2NF, these columns should be moved to a table containing data on the contact person. First, let's remove the columns in customers that are not related to our primary key:

ALTER TABLE customers
    DROP COLUMN contact_person,
    DROP COLUMN contact_person_role,
    DROP COLUMN phone_number;

Note that we kept the contact_person_id so we still know who to contact. Now, let's create our new table contact_persons so we have somewhere to store data about each contact.

CREATE TABLE contact_persons (
  id            INT(6) PRIMARY KEY,
  name          VARCHAR(300),
  role          VARCHAR(300),
  phone_number  VARCHAR(15)
);

Codey's Construction's database schema now looks like this:

Modified schema that now satisfies 2NF.

Now, if the contact person for a customer changes, the construction company just has to insert a record into the contact_persons table and change the contact_person_id in the customers table.
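
The update described above can be sketched as follows. Again, this uses Python's built-in sqlite3 module rather than MySQL, with simplified column types, and the names and phone numbers are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact_persons (
        id            INTEGER PRIMARY KEY,
        name          TEXT,
        role          TEXT,
        phone_number  TEXT
    );
    CREATE TABLE customers (
        id                 INTEGER PRIMARY KEY,
        name               TEXT,
        contact_person_id  INTEGER
    );
    -- Invented sample rows, for illustration only.
    INSERT INTO contact_persons VALUES (1, 'Old Contact', 'Manager', '555-0100');
    INSERT INTO customers VALUES (1, 'Example Co', 1);
""")

# The contact changes: add the new person, then repoint the customer.
conn.execute(
    "INSERT INTO contact_persons VALUES (?, ?, ?, ?)",
    (2, "New Contact", "Director", "555-0101"),
)
conn.execute("UPDATE customers SET contact_person_id = ? WHERE id = ?", (2, 1))

current = conn.execute("""
    SELECT c.name, p.name, p.role, p.phone_number
    FROM customers AS c
    JOIN contact_persons AS p ON p.id = c.contact_person_id
""").fetchone()
print(current)  # ('Example Co', 'New Contact', 'Director', '555-0101')
```

Only one value in customers changes, so there is no way to update the contact's name but forget their phone number.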

Third Normal Form

For a database to be in third normal form, it must first satisfy all the criteria for 2NF (and therefore, also 1NF).

Then, each column must be non-transitively dependent on the table’s primary key. This means that all columns in a table should rely on the primary key and no other column. If column_a relies on the primary key and also on column_b then column_a is transitively dependent on the primary key so the table does not satisfy 3NF.

Does your brain hurt from reading that? Don't worry! It's explained more below.

This is how the customers table looks after we have satisfied 1NF and 2NF:

Example data for modified `customers` table.

The table currently has transitively dependent columns. The transitively dependent relationship is between city and zip. The city in which a customer is located relies on the customer, so this satisfies 2NF; however, the city also depends on the zip code. If a customer relocates, there may be a chance we update one column but not the other. Because this relationship exists, the database is not in 3NF.

To fix our database to satisfy 3NF, we need to drop the city column from customers, and create a new table zips to store this data:

ALTER TABLE customers
    DROP COLUMN city;

CREATE TABLE zips (
  zip   VARCHAR(5) PRIMARY KEY, 
  city  VARCHAR(255)
);

Modified schema that now satisfies 3NF.

That's it! Finding issues that violate 3NF can be difficult, but it's worth it to ensure that your database is resilient to errors caused by only partially updating data.
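
Here is a short sketch of what a relocation looks like once city lives in the zips table. It uses Python's built-in sqlite3 module instead of MySQL, with simplified column types, and the zip codes and city names are placeholders rather than real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zips (
        zip   TEXT PRIMARY KEY,
        city  TEXT
    );
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        name  TEXT,
        zip   TEXT
    );
    -- Placeholder rows, for illustration only.
    INSERT INTO zips VALUES ('00001', 'First City'), ('00002', 'Second City');
    INSERT INTO customers VALUES (1, 'Example Co', '00001');
""")

lookup = """
    SELECT c.name, c.zip, z.city
    FROM customers AS c
    JOIN zips AS z ON z.zip = c.zip
    WHERE c.id = ?
"""
before = conn.execute(lookup, (1,)).fetchone()

# The customer relocates: only the zip column needs to change.
conn.execute("UPDATE customers SET zip = ? WHERE id = ?", ("00002", 1))
after = conn.execute(lookup, (1,)).fetchone()

print(before)  # ('Example Co', '00001', 'First City')
print(after)   # ('Example Co', '00002', 'Second City')
```

Because the city is looked up from the zip code, the two values can no longer drift out of sync.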

I hope you enjoyed this tutorial on database normalization! Codey's Construction's database now satisfies the three forms of normalization.

If you'd like to continue learning about databases, Next Tech's Database Fundamentals course covers all you need to know to get started with databases and SQL. By helping an interactive coding bot named Codey, you will learn how to create and design databases, modify data, and how to write SQL queries to answer business problems. You can get started for free here!

Introduction to Multilayer Neural Networks with TensorFlow’s Keras API

Lorraine — Tue, 11 Jun 2019 20:26:43 +0000

The development of Keras started in early 2015. As of today, it has evolved into one of the most popular and widely used libraries built on top of Theano and TensorFlow. One of its prominent features is that it has a very intuitive and user-friendly API, which allows us to implement neural networks in only a few lines of code.

Keras is also integrated into TensorFlow from version 1.1.0. It is part of the contrib module (which contains packages developed by contributors to TensorFlow and is considered experimental code).

In this tutorial we will look at this high-level TensorFlow API by walking through:

The basics of feedforward neural networks
Loading and preparing the popular MNIST dataset
Building an image classifier
Training a neural network and evaluating its accuracy

Let's get started!

This tutorial is adapted from Part 4 of Next Tech’s Python Machine Learning series, which takes you through machine learning and deep learning algorithms with Python from 0 to 100. It includes an in-browser sandboxed environment with all the necessary software and libraries pre-installed, and projects using public datasets. You can get started for free here!

Multilayer Perceptrons

Multilayer feedforward neural networks are a special type of fully connected network with multiple single neurons. They are also called Multilayer Perceptrons (MLP). The following figure illustrates the concept of an MLP consisting of three layers:

The MLP depicted in the preceding figure has one input layer, one hidden layer, and one output layer. The units in the hidden layer are fully connected to the input layer, and the output layer is fully connected to the hidden layer. If such a network has more than one hidden layer, we also call it a deep artificial neural network.

We can add an arbitrary number of hidden layers to the MLP to create deeper network architectures. Practically, we can think of the number of layers and units in a neural network as additional hyperparameters that we want to optimize for a given problem task.

As shown in the preceding figure, we denote the i^th activation unit in the l^th layer as ai^(l). To make the math and code implementations a bit more intuitive, we will use the in superscript for the input layer, the h superscript for the hidden layer, and the out superscript for the output layer.

For instance, ai⁽ⁱⁿ⁾ refers to the i^th value in the input layer, ai^(h) refers to the i^th unit in the hidden layer, and ai^(out) refers to the i^th unit in the output layer. Here, the activation units a0⁽ⁱⁿ⁾ and a0^(h) are the bias units, which we set equal to 1. The activation of the units in the input layer is just its input plus the bias unit:

Each unit in layer l is connected to all units in layer l + 1 via a weight coefficient. For example, the connection from the k^th unit in layer l to the j^th unit in layer l + 1 will be written as wk, j^(l). Referring back to the previous figure, we denote the weight matrix that connects the input to the hidden layer as W^(h), and we write the matrix that connects the hidden layer to the output layer as W^(out).

We summarize the weights that connect the input and hidden layers by a matrix W^(h) ∈ ℝ ^{m × d}, where d is the number of hidden units and m is the number of input units including the bias unit. Since it is important to internalize this notation to follow the concepts later in this lesson, let's summarize what we have just learned in a descriptive illustration of a simplified 3-4-3 multilayer perceptron:
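The shapes in this notation can also be checked numerically. The following is a minimal NumPy sketch, not part of the original tutorial, that assumes a 3-4-3 layout with one bias unit prepended to the input and hidden layers, random weights, a tanh activation in the hidden layer, and a linear output. All variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_input, n_hidden, n_output = 3, 4, 3   # assumed 3-4-3 layout

# One sample with the bias unit a_0 = 1 prepended: shape (1, m), m = 3 + 1
x = rng.normal(size=(1, n_input))
a_in = np.hstack([np.ones((1, 1)), x])

# W_h connects input -> hidden: shape (m, d) = (4, 4)
W_h = rng.normal(size=(n_input + 1, n_hidden))
a_h = np.tanh(a_in @ W_h)                 # shape (1, 4); tanh is an assumption

# Prepend the hidden bias unit, then map hidden -> output
a_h = np.hstack([np.ones((1, 1)), a_h])   # shape (1, 5)
W_out = rng.normal(size=(n_hidden + 1, n_output))
a_out = a_h @ W_out                       # shape (1, 3)

print(W_h.shape, W_out.shape, a_out.shape)
```

Each matrix product only works because the number of columns on the left matches the number of rows of the weight matrix, which is exactly what the m × d notation encodes.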

The MNIST dataset

To see what neural network training via the tensorflow.keras (tf.keras) high-level API looks like, let's implement a multilayer perceptron to classify the handwritten digits from the Modified National Institute of Standards and Technology (MNIST) dataset, which serves as a popular benchmark dataset for machine learning algorithms.

To follow along with the code snippets in this tutorial, you can use this Next Tech sandbox, which has the MNIST dataset and all necessary packages installed. Otherwise, you can use your local environment and download the dataset here.

The MNIST dataset comes in four parts, as listed here:

Training set images: train-images-idx3-ubyte.gz — 60,000 samples
Training set labels: train-labels-idx1-ubyte.gz — 60,000 labels
Test set images: t10k-images-idx3-ubyte.gz — 10,000 samples
Test set labels: t10k-labels-idx1-ubyte.gz — 10,000 labels

The training set consists of handwritten digits from 250 different people (50% high school students, 50% employees from the Census Bureau). The test set contains handwritten digits from different people.

Note that TensorFlow also provides the same dataset as follows:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

However, we work with the MNIST dataset as an external dataset to learn all the steps of data preprocessing separately. This way, you learn what you need to do with your own dataset.

The first step is to unzip the four parts of the MNIST dataset by running the following commands in your Terminal:

cd mnist/
gzip *ubyte.gz -d

Our images are stored in byte format, and we will read them into NumPy arrays that we will use to train and test our MLP implementation. In order to do that, we will define the following helper function:

import os
import struct

import numpy as np  # needed for np.fromfile below

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(
        path, f'{kind}-labels-idx1-ubyte'
    )
    images_path = os.path.join(
        path, f'{kind}-images-idx3-ubyte'
    )

    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)

    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16))
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)
        images = ((images / 255.) - .5) * 2

    return images, labels

The load_mnist function returns two arrays, the first being an n x m dimensional NumPy array (images), where n is the number of samples and m is the number of features (here, pixels). The images in the MNIST dataset consist of 28 x 28 pixels, and each pixel is represented by a gray scale intensity value. Here, we unroll the 28 x 28 pixels into one-dimensional row vectors, which represent the rows in our images array (784 per row or image). The second array (labels) returned by the load_mnist function contains the corresponding target variable, the class labels (integers 0-9) of the handwritten digits.
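To make the two less obvious parts of load_mnist more concrete, here is a small sketch that does not need the real dataset. It writes a tiny fake label file (the three labels are made up; 2049 is the magic number used by MNIST label files), reads the big-endian header back with the same '>II' format string, and then shows that the pixel scaling maps the range 0-255 onto -1 to 1. The file name is hypothetical.

```python
import os
import struct
import tempfile

import numpy as np

# Fake IDX label file: magic number 2049, item count 3, then 3 label bytes
payload = struct.pack('>II', 2049, 3) + bytes([5, 0, 4])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'fake-labels-idx1-ubyte')  # hypothetical name
    with open(path, 'wb') as f:
        f.write(payload)
    with open(path, 'rb') as f:
        # '>II' = two big-endian unsigned 32-bit integers (8 bytes)
        magic, n = struct.unpack('>II', f.read(8))
        # everything after the header is one unsigned byte per label
        labels = np.fromfile(f, dtype=np.uint8)

print(magic, n, labels)

# The same scaling as in load_mnist: 0 -> -1.0 and 255 -> 1.0
pixels = np.array([0, 128, 255], dtype=np.uint8)
scaled = ((pixels / 255.) - .5) * 2
print(scaled)
```

The image files work the same way, except that their header holds four integers (magic number, image count, rows, columns), which is why load_mnist reads 16 bytes with '>IIII' there.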

Then, the dataset is loaded and prepared as follows:

# loading the data
X_train, y_train = load_mnist('./mnist/', kind='train')
print(f'Rows: {X_train.shape[0]},  Columns: {X_train.shape[1]}')

X_test, y_test = load_mnist('./mnist/', kind='t10k')
print(f'Rows: {X_test.shape[0]},  Columns: {X_test.shape[1]}')

# mean centering and normalization:
mean_vals = np.mean(X_train, axis=0)
std_val = np.std(X_train)

X_train_centered = (X_train - mean_vals)/std_val
X_test_centered = (X_test - mean_vals)/std_val

del X_train, X_test

print(X_train_centered.shape, y_train.shape)
print(X_test_centered.shape, y_test.shape)

[Out:]
Rows: 60000,  Columns: 784
Rows: 10000,  Columns: 784
(60000, 784) (60000,)
(10000, 784) (10000,)

To get an idea of how those images in MNIST look, let's visualize examples of the digits 0-9 via Matplotlib's imshow function:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=2, ncols=5,
                       sharex=True, sharey=True)
ax = ax.flatten()
for i in range(10):
    img = X_train_centered[y_train == i][0].reshape(28, 28)
    ax[i].imshow(img, cmap='Greys')

ax[0].set_yticks([])
ax[0].set_xticks([])
plt.tight_layout()
plt.show()

We should now see a plot of the 2 x 5 subfigures showing a representative image of each unique digit:

Now let’s start building our model!

Building an MLP using TensorFlow's Keras API

First, let's set the random seed for NumPy and TensorFlow so that we get consistent results:

import numpy as np
import tensorflow as tf
import tensorflow.contrib.keras as keras

np.random.seed(123)
tf.set_random_seed(123)

To continue with the preparation of the training data, we need to convert the class labels (integers 0-9) into the one-hot format. Fortunately, Keras provides a convenient tool for this:

y_train_onehot = keras.utils.to_categorical(y_train)

print('First 3 labels: ', y_train[:3])
print('\nFirst 3 labels (one-hot):\n', y_train_onehot[:3])

First 3 labels:  [5 0 4]

First 3 labels (one-hot):
 [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]
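For intuition, the same encoding can be reproduced with plain NumPy. This is only a sketch of the idea, not how Keras implements to_categorical: indexing the rows of a 10 × 10 identity matrix with the integer labels yields one one-hot row per label.

```python
import numpy as np

y = np.array([5, 0, 4])   # the first three training labels shown above
n_classes = 10

# Row i of the identity matrix is the one-hot vector for class i
y_onehot = np.eye(n_classes)[y]
print(y_onehot)
```

Taking the argmax of each row recovers the original integer labels, which is how predictions are turned back into class labels later on.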

Now, let's implement our neural network! Briefly, we will have three layers, where the first two layers (the input and hidden layers) each have 50 units with the tanh activation function and the last layer (the output layer) has 10 units for the 10 class labels and uses softmax to give the probability of each class. Keras makes these tasks very simple:

# initialize model
model = keras.models.Sequential()

# add input layer
model.add(keras.layers.Dense(
    units=50,
    input_dim=X_train_centered.shape[1],
    kernel_initializer='glorot_uniform',
    bias_initializer='zeros',
    activation='tanh') 
)
# add hidden layer
model.add(
    keras.layers.Dense(
        units=50,
        input_dim=50,
        kernel_initializer='glorot_uniform',
        bias_initializer='zeros',
        activation='tanh')
    )
# add output layer
model.add(
    keras.layers.Dense(
        units=y_train_onehot.shape[1],
        input_dim=50,
        kernel_initializer='glorot_uniform',
        bias_initializer='zeros',
        activation='softmax')
    )

# define SGD optimizer
sgd_optimizer = keras.optimizers.SGD(
    lr=0.001, decay=1e-7, momentum=0.9
)
# compile model
model.compile(
    optimizer=sgd_optimizer,
    loss='categorical_crossentropy'
)

First, we initialize a new model using the Sequential class to implement a feedforward neural network. Then, we can add as many layers to it as we like. However, since the first layer that we add is the input layer, we have to make sure that the input_dim attribute matches the number of features (columns) in the training set (784 features or pixels in the neural network implementation).

Also, we have to make sure that the number of output units (units) and input units (input_dim) of two consecutive layers match. Our first two layers have 50 units plus one bias unit each. The number of units in the output layer should be equal to the number of unique class labels — the number of columns in the one-hot-encoded class label array.

Note that we used glorot_uniform as the initialization algorithm for the weight matrices. Glorot initialization is a more robust way of initializing deep neural networks. The biases are initialized to zero, which is common, and in fact the default setting in Keras.

Before we can compile our model, we also have to define an optimizer. We chose stochastic gradient descent optimization. Furthermore, we can set values for the decay constant (which, in Keras' SGD optimizer, shrinks the learning rate as training proceeds) and the momentum term. Lastly, we set the cost (or loss) function to categorical_crossentropy.

The binary cross-entropy is just a technical term for the cost function in logistic regression, and the categorical cross-entropy is its generalization for multiclass predictions via softmax.
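To see what this loss measures, here is a small NumPy sketch of softmax followed by categorical cross-entropy, computed as the mean over samples of -Σ_k y_k log(p_k). The net inputs and labels are made up for illustration, and the helper functions are written for this example rather than taken from Keras.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_onehot, probs, eps=1e-12):
    """Mean over samples of -sum_k y_k * log(p_k)."""
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

# Made-up net inputs for two samples and three classes
z = np.array([[2.0, 0.5, 0.1],
              [0.2, 0.1, 3.0]])
y_onehot = np.array([[1., 0., 0.],
                     [0., 0., 1.]])

probs = softmax(z)
loss = categorical_crossentropy(y_onehot, probs)
print(probs.sum(axis=1), loss)
```

Because the labels are one-hot, only the predicted probability of the true class contributes to each sample's loss, so the loss shrinks as that probability approaches 1.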

After compiling the model, we can now train it by calling the fit method. Here, we are using mini-batch stochastic gradient descent with a batch size of 64 training samples per batch. We train the MLP over 50 epochs, and we can follow the optimization of the cost function during training by setting verbose=1.

The validation_split parameter is especially handy since it will reserve 10% of the training data (here, 6,000 samples) for validation after each epoch so that we can monitor whether the model is overfitting during training:

# train model
history = model.fit(
    X_train_centered, y_train_onehot,
    batch_size=64, epochs=50,
    verbose=1, validation_split=0.1
)
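The numbers behind this call are easy to work out by hand. The sketch below is plain arithmetic rather than Keras code: it computes how many samples are held out, how many remain for training, and how many mini-batch updates one epoch performs. As far as the Keras documentation describes it, the held-out samples are taken from the end of the arrays before any shuffling.

```python
import math

n_samples = 60000
validation_split = 0.1
batch_size = 64

n_val = int(n_samples * validation_split)   # samples held out for validation
n_train = n_samples - n_val                 # samples used for weight updates
updates_per_epoch = math.ceil(n_train / batch_size)

print(n_val, n_train, updates_per_epoch)    # 6000 54000 844
```

Over 50 epochs this amounts to 50 × 844 = 42,200 weight updates, with the last batch of each epoch being smaller than 64 samples.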

Printing the value of the cost function during training is extremely useful: we can quickly spot whether the cost is decreasing and, if it is not, stop the algorithm early and tune the hyperparameter values.

To predict the class labels, we can then use the predict_classes method to return the class labels directly as integers:

y_train_pred = model.predict_classes(X_train_centered, verbose=0)
print('First 3 predictions: ', y_train_pred[:3])

[Out:]
First 3 predictions: [5 0 4]
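For a softmax output layer, returning class labels amounts to picking the index of the largest predicted probability in each row. The following sketch illustrates that with made-up probabilities in NumPy; it is an illustration of the idea and does not call the model.

```python
import numpy as np

# Made-up softmax outputs for three samples and ten classes
probs = np.full((3, 10), 0.02)
probs[0, 5] = 0.82
probs[1, 0] = 0.82
probs[2, 4] = 0.82

# The predicted class is the column with the highest probability
y_pred = np.argmax(probs, axis=1)
print(y_pred)   # [5 0 4]
```

The same pattern, np.argmax applied to the output of model.predict, is a common substitute in Keras versions that no longer provide predict_classes.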

Finally, let's print the model accuracy on training and test sets:

# calculate training accuracy
y_train_pred = model.predict_classes(X_train_centered, verbose=0)
correct_preds = np.sum(y_train == y_train_pred, axis=0)
train_acc = correct_preds / y_train.shape[0]

print(f'Training accuracy: {(train_acc * 100):.2f}')

# calculate testing accuracy
y_test_pred = model.predict_classes(X_test_centered, verbose=0)
correct_preds = np.sum(y_test == y_test_pred, axis=0)
test_acc = correct_preds / y_test.shape[0]

print(f'Test accuracy: {(test_acc * 100):.2f}')

[Out:]
Training accuracy: 98.81
Test accuracy: 96.27

I hope you enjoyed this tutorial on using TensorFlow's Keras API to build and train a multilayer neural network for image classification! Note that this is just a very simple neural network without optimized tuning parameters.

In practice you need to know how to optimize the model by tweaking learning rate, momentum, weight decay, and number of hidden units. You also need to learn how to deal with the vanishing gradient problem, wherein error gradients become increasingly small as more layers are added to a network.

We cover these topics in Next Tech's Python Machine Learning (Part 4) course, as well as:

Breaking down the mechanics of TensorFlow, such as tensors, activation functions, computation graphs, variables, and placeholders
Low-level TensorFlow and another high-level API, Layers
Modeling sequential data using recurrent neural networks (RNN) and long short-term memory (LSTM) networks
Classifying images with deep convolutional neural networks (CNN).

You can get started here for free!

K-Means Clustering with scikit-learn

Lorraine — Thu, 30 May 2019 18:59:06 +0000

Clustering (or cluster analysis) is a technique that allows us to find groups of similar objects, objects that are more related to each other than to objects in other groups. Examples of business-oriented applications of clustering include the grouping of documents, music, and movies by different topics, or finding customers that share similar interests based on common purchase behaviors as a basis for recommendation engines.

In this tutorial, we will learn about one of the most popular clustering algorithms, k-means, which is widely used in academia as well as in industry. We will cover:

The basic concepts of k-means clustering
The mathematics behind the k-means algorithm
The advantages and disadvantages of k-means
How to implement the algorithm on a sample dataset using scikit-learn
How to visualize clusters
How to choose the optimal k using the elbow method

Let’s get started!

This tutorial is adapted from Part 3 of Next Tech’s Python Machine Learning series, which takes you through machine learning and deep learning algorithms with Python from 0 to 100. It includes an in-browser sandboxed environment with all the necessary software and libraries pre-installed, and projects using public datasets. You can get started for free here!

Fundamentals of K-Means Clustering

As we will see, the k-means algorithm is extremely easy to implement and is also computationally very efficient compared to other clustering algorithms, which might explain its popularity. The k-means algorithm belongs to the category of prototype-based clustering.

Prototype-based clustering means that each cluster is represented by a prototype, which can either be the centroid (average) of similar points with continuous features, or the medoid (the most representative or most frequently occurring point) in the case of categorical features.

While k-means is very good at identifying clusters with a spherical shape, one of the drawbacks of this clustering algorithm is that we have to specify the number of clusters, k, a priori. An inappropriate choice for k can result in poor clustering performance — we will discuss later in this tutorial how to choose k.

Although k-means clustering can be applied to data in higher dimensions, we will walk through the following examples using a simple two-dimensional dataset for the purpose of visualization.

You can follow along with the code in this tutorial by using a Next Tech sandbox, which has all the necessary libraries pre-installed, or if you’d prefer, you can run the snippets in your own local environment.

Once your sandbox loads, let’s import the toy dataset from scikit-learn and visualize the datapoints:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# create dataset
X, y = make_blobs(
    n_samples=150, n_features=2,
    centers=3, cluster_std=0.5,
    shuffle=True, random_state=0
)

# plot
plt.scatter(
    X[:, 0], X[:, 1],
    c='white', marker='o',
    edgecolor='black', s=50
)
plt.show()

The dataset that we just created consists of 150 randomly generated points that are roughly grouped into three regions with higher density, which is visualized via a two-dimensional scatterplot.

In real-world applications of clustering, we do not have any ground truth category information (information provided as empirical evidence as opposed to inference) about those samples; otherwise, it would fall into the category of supervised learning. Thus, our goal is to group the samples based on their feature similarities, which can be achieved using the k-means algorithm that can be summarized by the following four steps:

1. Randomly pick k centroids from the sample points as initial cluster centers.
2. Assign each sample to the nearest centroid μ^(j), j ∈ {1, ..., k}.
3. Move each centroid to the center of the samples that were assigned to it.
4. Repeat steps 2 and 3 until the cluster assignments do not change or a user-defined tolerance or maximum number of iterations is reached.

Now, the next question is how do we measure similarity between objects? We can define similarity as the opposite of distance, and a commonly used distance for clustering samples with continuous features is the squared Euclidean distance between two points x and y in m-dimensional space:
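Written out under the standard definition, with x_j and y_j denoting the j^th coordinates of the two points, the squared Euclidean distance is:

```latex
d(\mathbf{x}, \mathbf{y})^2 = \sum_{j=1}^{m} (x_j - y_j)^2 = \lVert \mathbf{x} - \mathbf{y} \rVert_2^2
```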

Note that in the preceding equation, the index j refers to the j^th dimension (feature column) of the sample points x and y. We will use the superscripts i and j to refer to the sample index and cluster index, respectively.

Based on this Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem, an iterative approach for minimizing the within-cluster Sum of Squared Errors (SSE), which is sometimes also called cluster inertia:
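In the notation of this section, and assuming n denotes the number of samples, the objective can be written as:

```latex
SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i, j)} \, \lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(j)} \rVert_2^2
```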

Here, μ^(j) is the centroid for cluster j, and

w^{(i, j)} = 1 if the sample x⁽ⁱ⁾ is in cluster j, and w^{(i, j)} = 0 otherwise.

Note that when we are applying k-means to real-world data using a Euclidean distance metric, we want to make sure that the features are measured on the same scale and apply z-score standardization or min-max scaling if necessary.
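The four steps and the SSE objective described above can be sketched in a few lines of NumPy. This is a minimal illustration under simple assumptions (a single random initialization, no tolerance check, and an empty cluster simply keeps its previous centroid), not the scikit-learn implementation. The function name and the toy data are made up for this example.

```python
import numpy as np

def kmeans(X, k, max_iter=300, seed=0):
    """Plain k-means; returns (labels, centroids, sse)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: squared Euclidean distance of every sample to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Step 4: stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned samples
        for j in range(k):
            if np.any(labels == j):   # an empty cluster keeps its centroid
                centroids[j] = X[labels == j].mean(axis=0)
    # Within-cluster sum of squared errors for the final assignment
    sse = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, sse

# Toy data: three made-up groups of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

labels, centroids, sse = kmeans(X, k=3)
print(centroids.round(2), sse)
```

Because the result depends on the random starting centroids, a single run can end in a poor local optimum, which is one reason to repeat the procedure with several initializations and keep the run with the lowest SSE.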

K-means clustering using scikit-learn

Now that we have learned how the k-means algorithm works, let's apply it to our sample dataset using the KMeans class from scikit-learn's cluster module:

from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=3, init='random',
    n_init=10, max_iter=300, 
    tol=1e-04, random_state=0
)
y_km = km.fit_predict(X)

Using the preceding code, we set the number of desired clusters to 3. We set n_init=10 to run the k-means clustering algorithm 10 times independently with different random centroids and choose the final model as the one with the lowest SSE. Via the max_iter parameter, we specify the maximum number of iterations for each single run (here, 300).

Note that the k-means implementation in scikit-learn stops early if it converges before the maximum number of iterations is reached. However, it is possible that k-means does not reach convergence for a particular run, which can be problematic (computationally expensive) if we choose relatively large values for max_iter.

One way to deal with convergence problems is to choose larger values for tol, which is a parameter that controls the tolerance with regard to the changes in the within-cluster sum-squared-error to declare convergence. In the preceding code, we chose a tolerance of 1e-04 (= 0.0001).
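Whether a fit actually hit the iteration limit can be checked on the fitted estimator. The sketch below rebuilds the same toy dataset and model so that it runs on its own, then reads the n_iter_ and inertia_ attributes of the fitted KMeans object; treat the printed interpretation as a rough diagnostic rather than a formal convergence test.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

km = KMeans(n_clusters=3, init='random', n_init=10,
            max_iter=300, tol=1e-04, random_state=0)
km.fit(X)

# n_iter_ reports the number of iterations of the run that was kept
print('iterations:', km.n_iter_, 'of at most', km.max_iter)
print('stopped before the limit:', km.n_iter_ < km.max_iter)
print('SSE (inertia):', km.inertia_)
```

If n_iter_ equals max_iter, the run was cut off, and either a larger max_iter or a larger tol is worth trying.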

A problem with k-means is that one or more clusters can be empty. However, this problem is accounted for in the current k-means implementation in scikit-learn. If a cluster is empty, the algorithm will search for the sample that is farthest away from the centroid of the empty cluster. Then it will reassign the centroid to be this farthest point.

Now that we have predicted the cluster labels y_km, let's visualize the clusters that k-means identified in the dataset together with the cluster centroids. These are stored under the cluster_centers_ attribute of the fitted KMeans object:

# plot the 3 clusters
plt.scatter(
    X[y_km == 0, 0], X[y_km == 0, 1],
    s=50, c='lightgreen',
    marker='s', edgecolor='black',
    label='cluster 1'
)

plt.scatter(
    X[y_km == 1, 0], X[y_km == 1, 1],
    s=50, c='orange',
    marker='o', edgecolor='black',
    label='cluster 2'
)

plt.scatter(
    X[y_km == 2, 0], X[y_km == 2, 1],
    s=50, c='lightblue',
    marker='v', edgecolor='black',
    label='cluster 3'
)

# plot the centroids
plt.scatter(
    km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
    s=250, marker='*',
    c='red', edgecolor='black',
    label='centroids'
)
plt.legend(scatterpoints=1)
plt.grid()
plt.show()

In the resulting scatterplot, we can see that k-means placed the three centroids at the center of each sphere, which looks like a reasonable grouping given this dataset.

The Elbow Method

Although k-means worked well on this toy dataset, it is important to reiterate that a drawback of k-means is that we have to specify the number of clusters, k, before we know what the optimal k is. The number of clusters to choose may not always be so obvious in real-world applications, especially if we are working with a higher dimensional dataset that cannot be visualized.

The elbow method is a useful graphical tool to estimate the optimal number of clusters k for a given task. Intuitively, we can say that, if k increases, the within-cluster SSE (“distortion”) will decrease. This is because the samples will be closer to the centroids they are assigned to.

The idea behind the elbow method is to identify the value of k at which the distortion stops falling sharply and begins to level off, which will become clearer if we plot the distortion for different values of k:

# calculate distortion for a range of cluster counts
distortions = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(X)
    distortions.append(km.inertia_)

# plot
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()

As we can see in the resulting plot, the elbow is located at k = 3, which is evidence that k = 3 is indeed a good choice for this dataset.

I hope you enjoyed this tutorial on the k-means algorithm! We explored the basic concepts and mathematics behind the k-means algorithm, how to implement k-means, and how to select an optimal number of clusters, k.

If you’d like to learn more, Next Tech’s Python Machine Learning (Part 3) course further explores clustering algorithms and techniques such as:

Silhouette plots, another method used to select the optimal k
k-means++, a variant of k-means, that improves clustering results through more clever seeding of the initial cluster centers.
Other categories of clustering algorithms, such as hierarchical and density-based clustering, that do not require us to specify the number of clusters upfront or assume spherical structures in our dataset.

The course also explores regression analysis, sentiment analysis, and how to deploy a dynamic machine learning model to a web application. You can get started here!

Launch VS Code in online computing environments

Saul Costa — Tue, 28 May 2019 21:26:40 +0000

I'm excited to share that Next Tech has released alpha support for Visual Studio Code!

VS Code is hands-down one of the best code editors out there, if not the best. It's trounced others (like Sublime and Atom) in popularity in recent years, thanks to the incredible features Microsoft has added and the vast library of extensions tens of thousands of developers have created.

Installing VS Code is pretty straightforward and you can do it yourself on your computer if you'd like. However, we've found that our hosted environments for Python, Node, Go, Haskell, and many other programming languages make it super easy to get started with a new programming project in just a few seconds.

So we thought, wouldn't it be great if you could use these environments… and VS Code?!

Well now, you can. This includes the ability to install extensions, change your settings, debug programs, push to GitHub, deploy to Azure, and much more.

This guide walks you through how to get started with VS Code on Next Tech. If you just want to jump into a sandbox with VS Code, click here. In a few seconds you'll have VS Code running in your browser:

Or, read on for the details!

Why VS Code?

Over the years we've received many requests for innovative features like code collaboration, debuggers, version control integration, and much more, but our focus is increasingly on our infrastructure offerings. As such, we see VS Code as being able to address a number of these requests as we continue to develop innovative infrastructure offerings.

VS Code is also the most popular code editor by far. Here's Stack Overflow's 2019 developer survey results for the most popular development environments:

This is also coming at a cost to other editors. Here's the Google Trends data for VS Code (blue) versus Sublime (red) and Atom (yellow):

So we feel that our investment into integrating VS Code will be well worth it as we'll be able to provide a well-loved tool inside our infrastructure.

This feature is currently in a very early alpha state (current limitations are documented here). However, in the coming months we'll be rolling out a more tightly integrated version and many other related improvements.

Here's what's currently supported:

Intelligent code completion (IntelliSense).
Powerful debugging functionality.
Numerous different ways to configure your interface (multiple tabs, zen mode, etc.).
Integrated terminals.
Multiple themes.
VS Code's command line interface.
…and many other extensions you can install.

Very soon, we'll also be adding support for:

An integrated web browser.
VS Code's Live Share feature, which allows you to collaborate with others in real-time.

For now, I hope you'll take a look and share your feedback!

Getting Started

To get started, head over to the sandbox launchpad. Once you're there, pick the language you'd like to use (this guide uses Golang), then check Use Visual Studio Code, as shown below:

You'll be shown a dialog that contains an explanation of the current limitations of this feature (also detailed at the end of this page). Just click the Sounds fun, let's go! button and your sandbox will load with the VS Code interface:

You may notice that VS Code is just another tab type in the sandbox interface. If you click the + to create a new tab, you may notice that some options are now hidden:

Eventually these will all be hidden as we integrate them directly inside of the VS Code interface.

For the best experience, you can click the square in the top right corner of the VS Code interface to make VS Code full screen:

To get started, you can click File, then New File:

(note that Ctrl+N will not work in the browser)

Save your file as main.go, then put this code in it:

Now, head over to the extensions marketplace and install the Go extension:

(you may need to press Ctrl+Shift+P, type “reload”, then select Developer: Reload Window to get things working correctly)

You'll be prompted in the bottom right to install several packages. Go for it!

You can try typing b in the code editor to see the intelligent autocomplete kicking in. Here, it sees that b is actually an array with 5 ints in it:

You can run code using the software normally installed in your sandbox. To run this file, click Terminal, then New Terminal:

Then you can use the already installed go program from your sandbox:

And there you have it! You've just used VS Code in the cloud to write a Go program.

Feedback

If you try this new feature, I'd love to hear what you think. Feel free to respond to this post or submit a ticket here.

Thanks for reading!

Shout-out

This integration uses an adapted version of this awesome open source project by Coder!

Principal Component Analysis for Dimensionality Reduction

Lorraine — Fri, 24 May 2019 17:37:17 +0000

In the modern age of technology, increasing amounts of data are produced and collected. In machine learning, however, too much data can be a bad thing. At a certain point, more features or dimensions can decrease a model’s accuracy since there is more data that needs to be generalized — this is known as the curse of dimensionality.

Dimensionality reduction is a way to reduce the complexity of a model and avoid overfitting. There are two main categories of dimensionality reduction: feature selection and feature extraction. Via feature selection, we select a subset of the original features, whereas in feature extraction, we derive information from the feature set to construct a new feature subspace.

In this tutorial we will explore feature extraction. In practice, feature extraction is not only used to improve storage space or the computational efficiency of the learning algorithm, but can also improve the predictive performance by reducing the curse of dimensionality—especially if we are working with non-regularized models.

Specifically, we will discuss the Principal Component Analysis (PCA) algorithm used to compress a dataset onto a lower-dimensional feature subspace with the goal of maintaining most of the relevant information. We will explore:

The concepts and mathematics behind PCA
How to execute PCA step-by-step from scratch using Python
How to execute PCA using the Python library scikit-learn

Let’s get started!

This tutorial is adapted from Part 2 of Next Tech’s Python Machine Learning series, which takes you through machine learning and deep learning algorithms with Python from 0 to 100. It includes an in-browser sandboxed environment with all the necessary software and libraries pre-installed, and projects using public datasets. You can get started for free here!

Introduction to Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised linear transformation technique that is widely used across different fields, most prominently for feature extraction and dimensionality reduction. Other popular applications of PCA include exploratory data analyses and de-noising of signals in stock market trading, and the analysis of genome data and gene expression levels in the field of bioinformatics.

PCA helps us to identify patterns in data based on the correlation between features. In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions than the original one.

The orthogonal axes (principal components) of the new subspace can be interpreted as the directions of maximum variance given the constraint that the new feature axes are orthogonal to each other, as illustrated in the following figure:

In the preceding figure, x1 and x2 are the original feature axes, and PC1 and PC2 are the principal components.

If we use PCA for dimensionality reduction, we construct a d x k–dimensional transformation matrix W that allows us to map a sample vector x onto a new k–dimensional feature subspace that has fewer dimensions than the original d–dimensional feature space:
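Assuming x is treated as a row vector, the mapping can be written as:

```latex
\mathbf{x} = [x_1, x_2, \dots, x_d], \quad \mathbf{x} \in \mathbb{R}^{d}
\;\longrightarrow\;
\mathbf{z} = \mathbf{x}\mathbf{W}, \quad \mathbf{W} \in \mathbb{R}^{d \times k},
\quad \mathbf{z} = [z_1, z_2, \dots, z_k], \quad \mathbf{z} \in \mathbb{R}^{k}
```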

As a result of transforming the original d-dimensional data onto this new k-dimensional subspace (typically k ≪ d), the first principal component will have the largest possible variance, and all subsequent principal components will have the largest variance given the constraint that these components are uncorrelated (orthogonal) to the other principal components. Even if the input features are correlated, the resulting principal components will be mutually orthogonal (uncorrelated).

Note that the PCA directions are highly sensitive to data scaling, and we need to standardize the features prior to PCA if the features were measured on different scales and we want to assign equal importance to all features.

Before looking at the PCA algorithm for dimensionality reduction in more detail, let’s summarize the approach in a few simple steps:

1. Standardize the d-dimensional dataset.
2. Construct the covariance matrix.
3. Decompose the covariance matrix into its eigenvectors and eigenvalues.
4. Sort the eigenvalues by decreasing order to rank the corresponding eigenvectors.
5. Select k eigenvectors which correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
6. Construct a projection matrix W from the “top” k eigenvectors.
7. Transform the d-dimensional input dataset X using the projection matrix W to obtain the new k-dimensional feature subspace.
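As a compact preview, the steps above can be sketched in NumPy on made-up data. The dataset here is random (100 samples, d = 5 features on deliberately different scales, k = 2), so the numbers carry no meaning. The point is only to show how each step maps to an array operation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 100 samples, d = 5 features on very different scales
X = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 0.1, 5.0, 2.0])
k = 2

# 1. Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the features, shape (d, d)
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition; eigh is suited to symmetric matrices
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort eigenvalues (and their eigenvectors) in decreasing order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5.-6. The top-k eigenvectors form the d x k projection matrix W
W = eigvecs[:, :k]

# 7. Project the data onto the k-dimensional subspace
Z = X_std @ W
print(W.shape, Z.shape)
```

The columns of W are orthonormal, so the projected features in Z are uncorrelated, and the first column of Z carries at least as much variance as the second.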

Let’s perform a PCA step by step, using Python as a learning exercise. Then, we will see how to perform a PCA more conveniently using scikit-learn.

Extracting the Principal Components Step By Step

We will be using the Wine dataset from The UCI Machine Learning Repository in our example. This dataset consists of 178 wine samples with 13 features describing their different chemical properties. You can find out more here.

In this section we will tackle the first four steps of a PCA; later we will go over the last three. You can follow along with the code in this tutorial by using a Next Tech sandbox, which has all the necessary libraries pre-installed, or if you’d prefer, you can run the snippets in your own local environment.

Once your sandbox loads, we will start by loading the Wine dataset directly from the repository:

import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)
df_wine.head()

Next, we will split the Wine data into separate training and test sets, using a 70:30 split, and standardize the features to zero mean and unit variance:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# split into training and testing sets
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3,
    stratify=y, random_state=0
)
# standardize the features
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

After completing the mandatory preprocessing, let’s advance to the second step: constructing the covariance matrix. The symmetric d x d-dimensional covariance matrix, where d is the number of dimensions in the dataset, stores the pairwise covariances between the different features. For example, the covariance between two features xj and xk on the population level can be calculated via the following equation:

Here, μj and μk are the sample means of features j and k, respectively.
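As a quick numerical check of the covariance definition, the sketch below computes a single covariance entry by hand and compares it with NumPy's result. It runs on hypothetical random data, and it assumes the 1/(n − 1) normalization, which is the default of numpy.cov; a formula that divides by n instead would differ only by a constant factor.

```python
import numpy as np

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 3))  # hypothetical data: 50 samples, 3 features
n = X_demo.shape[0]
mu = X_demo.mean(axis=0)           # per-feature sample means

# Covariance between features j and k, written out from the definition
j, k = 0, 1
sigma_jk = np.sum((X_demo[:, j] - mu[j]) * (X_demo[:, k] - mu[k])) / (n - 1)

# np.cov expects features in rows, hence the transpose
cov_mat_demo = np.cov(X_demo.T)
print(np.isclose(sigma_jk, cov_mat_demo[j, k]))  # True
```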

Note that the sample means are zero if we standardized the dataset. A positive covariance between two features indicates that the features increase or decrease together, whereas a negative covariance indicates that the features vary in opposite directions. For example, the covariance matrix of three features can then be written as follows (note that Σ stands for the Greek uppercase letter sigma, which is not to be confused with the sum symbol):

The eigenvectors of the covariance matrix represent the principal components (the directions of maximum variance), whereas the corresponding eigenvalues will define their magnitude. In the case of the Wine dataset, we would obtain 13 eigenvectors and eigenvalues from the 13 x 13-dimensional covariance matrix.

Now, for our third step, let’s obtain the eigenpairs of the covariance matrix. An eigenvector v of the covariance matrix Σ satisfies the following condition:

Σv = λv

Here, λ is a scalar: the eigenvalue. Since the manual computation of eigenvectors and eigenvalues is a somewhat tedious and elaborate task, we will use the linalg.eig function from NumPy to obtain the eigenpairs of the Wine covariance matrix:

import numpy as np

cov_mat = np.cov(X_train_std.T)
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)

Using the numpy.cov function, we computed the covariance matrix of the standardized training dataset. Using the linalg.eig function, we performed the eigendecomposition, which yielded a vector (eigen_vals) consisting of 13 eigenvalues and the corresponding eigenvectors stored as columns in a 13 x 13-dimensional matrix (eigen_vecs).
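If you want to convince yourself that the returned eigenpairs really satisfy the eigenvector condition, the following sketch checks it on hypothetical random data. It also compares the result with np.linalg.eigh, a NumPy function designed for symmetric matrices such as a covariance matrix, which always returns real eigenvalues in ascending order. Using eigh here is a suggestion of this sketch, not something the tutorial code depends on.

```python
import numpy as np

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(60, 4))   # hypothetical data: 60 samples, 4 features
cov_demo = np.cov(X_demo.T)

vals, vecs = np.linalg.eig(cov_demo)
# Each column v of vecs should satisfy cov @ v = lambda * v
checks = [np.allclose(cov_demo @ v, lam * v) for lam, v in zip(vals, vecs.T)]
print(all(checks))

# eigh exploits the symmetry of the matrix; eigenvalues come back sorted ascending
vals_h, vecs_h = np.linalg.eigh(cov_demo)
print(np.allclose(np.sort(vals.real), vals_h))
```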

Total and Explained Variance

Since we want to reduce the dimensionality of our dataset by compressing it onto a new feature subspace, we only select the subset of the eigenvectors (principal components) that contains most of the information (variance). The eigenvalues define the magnitude of the eigenvectors, so we have to sort the eigenvalues by decreasing magnitude; we are interested in the top k eigenvectors based on the values of their corresponding eigenvalues.

But before we collect those k most informative eigenvectors, let’s plot the variance explained ratios of the eigenvalues. The variance explained ratio of an eigenvalue λj is simply the ratio of λj to the total sum of the eigenvalues: λj / (λ1 + λ2 + … + λd).

Using the NumPy cumsum function, we can then calculate the cumulative sum of explained variances, which we will then plot via matplotlib’s step function:

import matplotlib.pyplot as plt

# calculate cumulative sum of explained variances
tot = sum(eigen_vals)
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

# plot explained variances
plt.bar(range(1,14), var_exp, alpha=0.5,
        align='center', label='individual explained variance')
plt.step(range(1,14), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.show()

The resulting plot indicates that the first principal component alone accounts for approximately 40% of the variance. Also, we can see that the first two principal components combined explain almost 60% of the variance in the dataset.
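One common heuristic for choosing k, which this tutorial does not use, is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below illustrates the idea with made-up eigenvalues; both the eigenvalues and the 90% threshold are assumptions for illustration only.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in decreasing order
eigen_vals_demo = np.array([4.8, 2.5, 1.5, 1.0, 0.8, 0.6, 0.5, 0.3])
var_ratio = eigen_vals_demo / eigen_vals_demo.sum()
cum_ratio = np.cumsum(var_ratio)

threshold = 0.90  # assumed target; pick a value that suits the application
# Position of the first component at which the cumulative ratio reaches the threshold
k = int(np.argmax(cum_ratio >= threshold)) + 1
print(k, round(cum_ratio[k - 1], 3))  # 6 components are needed here
```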

Feature Transformation

After we have successfully decomposed the covariance matrix into eigenpairs, let’s now proceed with the last three steps of PCA to transform the Wine dataset onto the new principal component axes.

We will sort the eigenpairs by descending order of the eigenvalues, construct a projection matrix from the selected eigenvectors, and use the projection matrix to transform the data onto the lower-dimensional subspace.

We start by sorting the eigenpairs by decreasing order of the eigenvalues:

# Make a list of (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i]) for i in range(len(eigen_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eigen_pairs.sort(key=lambda k: k[0], reverse=True)

Next, we collect the two eigenvectors that correspond to the two largest eigenvalues, to capture about 60% of the variance in this dataset. Note that we only chose two eigenvectors for the purpose of illustration, since we are going to plot the data via a two-dimensional scatter plot later in this subsection. In practice, the number of principal components has to be determined by a trade-off between computational efficiency and the performance of the classifier:

w = np.hstack((eigen_pairs[0][1][:, np.newaxis], eigen_pairs[1][1][:, np.newaxis]))
print('Matrix W:\n', w)

[Out:]
Matrix W:
 [[-0.13724218  0.50303478]
 [ 0.24724326  0.16487119]
 [-0.02545159  0.24456476]
 [ 0.20694508 -0.11352904]
 [-0.15436582  0.28974518]
 [-0.39376952  0.05080104]
 [-0.41735106 -0.02287338]
 [ 0.30572896  0.09048885]
 [-0.30668347  0.00835233]
 [ 0.07554066  0.54977581]
 [-0.32613263 -0.20716433]
 [-0.36861022 -0.24902536]
 [-0.29669651  0.38022942]]

By executing the preceding code, we have created a 13 x 2-dimensional projection matrix W from the top two eigenvectors.

Using the projection matrix, we can now transform a sample x (represented as a 1 x 13-dimensional row vector) onto the PCA subspace (the principal components one and two) obtaining x′, now a two-dimensional sample vector consisting of two new features:

X_train_std[0].dot(w)

Similarly, we can transform the entire 124 x 13-dimensional training dataset onto the two principal components by calculating the matrix dot product:

X_train_pca = X_train_std.dot(w)
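Two properties of this projection are worth checking: the columns of the projection matrix are orthonormal, and the projected features are uncorrelated, with variances equal to the selected eigenvalues. Here is a minimal sketch on hypothetical synthetic data; the names ending in _demo are made up so they do not clash with the tutorial's variables, and np.linalg.eigh is used in place of linalg.eig.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical correlated data, standardized in the same way as X_train_std
X_raw = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_std_demo = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

vals, vecs = np.linalg.eigh(np.cov(X_std_demo.T))
order = np.argsort(vals)[::-1]
w_demo = vecs[:, order[:2]]            # 5 x 2 projection matrix

X_pca_demo = X_std_demo.dot(w_demo)    # 200 x 2 projected data

# Columns of W are orthonormal: W^T W = I
print(np.allclose(w_demo.T @ w_demo, np.eye(2)))
# Projected features are uncorrelated; their variances are the top two eigenvalues
print(np.allclose(np.cov(X_pca_demo.T), np.diag(vals[order[:2]])))
```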

Lastly, let’s visualize the transformed Wine training set, now stored as a 124 x 2-dimensional matrix, in a two-dimensional scatterplot:

colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']
for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train==l, 0], 
                X_train_pca[y_train==l, 1], 
                c=c, label=l, marker=m) 
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()

As we can see in the resulting plot, the data is more spread out along the first principal component (x-axis) than along the second principal component (y-axis), which is consistent with the explained variance ratio plot that we created previously. We can also see intuitively that a linear classifier will likely be able to separate the classes well.

Although we encoded the class label information for the purpose of illustration in the preceding scatter plot, we have to keep in mind that PCA is an unsupervised technique that does not use any class label information.

PCA in `scikit-learn`

Although the verbose approach in the previous subsection helped us to follow the inner workings of PCA, we will now discuss how to use the PCA class implemented in scikit-learn. The PCA class is another one of scikit-learn’s transformer classes, where we first fit the model using the training data before we transform both the training data and the test dataset using the same model parameters.

Let’s use the PCA class on the Wine training dataset and classify the transformed samples via logistic regression:

from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# initialize pca and logistic regression model
pca = PCA(n_components=2)
lr = LogisticRegression(multi_class='auto', solver='liblinear')

# fit and transform data
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
lr.fit(X_train_pca, y_train)

Now, using a custom plot_decision_regions function, we will visualize the decision regions:

from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], 
                    y=X[y == cl, 1],
                    alpha=0.6, 
                    c=[cmap(idx)],
                    edgecolor='black',
                    marker=markers[idx], 
                    label=cl)


# plot decision regions for training set
plot_decision_regions(X_train_pca, y_train, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()

By executing the preceding code, we should now see the decision regions for the training data reduced to two principal component axes.

For the sake of completeness, let’s plot the decision regions of the logistic regression on the transformed test dataset as well to see if it can separate the classes well:

# plot decision regions for test set
plot_decision_regions(X_test_pca, y_test, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()

After we plotted the decision regions for the test set by executing the preceding code, we can see that logistic regression performs quite well on this small two-dimensional feature subspace and only misclassifies very few samples in the test dataset.
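To put a number on "very few", we can compute the accuracy directly. The sketch below is one possible way to do this, not the tutorial's own code. It assumes scikit-learn's bundled load_wine copy of the Wine data instead of the CSV download, chains the steps with make_pipeline, and uses LogisticRegression with its default settings rather than the liblinear solver chosen above, so the numbers may differ slightly from the plots.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# load_wine ships with scikit-learn, so no download is needed
X_w, y_w = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_w, y_w, test_size=0.3, stratify=y_w, random_state=0
)

# Chain standardization, PCA, and the classifier so that the test data is
# transformed with parameters learned from the training data only
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
pipe.fit(X_tr, y_tr)

train_acc = pipe.score(X_tr, y_tr)
test_acc = pipe.score(X_te, y_te)
print(f'train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}')
```

Wrapping the steps in a pipeline also guards against accidentally fitting the scaler or the PCA on the test set.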

If we are interested in the explained variance ratios of the different principal components, we can simply initialize the PCA class with the n_components parameter set to None, so all principal components are kept and the explained variance ratio can then be accessed via the explained_variance_ratio_ attribute:

pca = PCA(n_components=None)
X_train_pca = pca.fit_transform(X_train_std)
pca.explained_variance_ratio_

Note that we set n_components=None when we initialized the PCA class so that it will return all principal components in a sorted order instead of performing a dimensionality reduction.
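These ratios should agree with the ones we computed from the eigenvalues in the step-by-step section. scikit-learn's PCA is documented as being based on a singular value decomposition (SVD) of the centered data rather than an explicit eigendecomposition of the covariance matrix, and the two routes are mathematically equivalent. The NumPy-only sketch below illustrates the equivalence on hypothetical synthetic data; note that the component directions match only up to sign, which is why a scatter plot from scikit-learn can look mirrored compared to the manual version.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical data: 4 features with clearly different variances, then rotated
scales = np.array([4.0, 2.0, 1.0, 0.5])
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal rotation
X_raw = (rng.normal(size=(150, 4)) * scales) @ q
X_c = X_raw - X_raw.mean(axis=0)               # center the data
n = X_c.shape[0]

# Route 1: eigendecomposition of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(X_c.T))
order = np.argsort(vals)[::-1]
ratio_eig = vals[order] / vals.sum()

# Route 2: singular value decomposition of the centered data
_, s, vt = np.linalg.svd(X_c, full_matrices=False)
eig_from_svd = s ** 2 / (n - 1)                # squared singular values -> eigenvalues
ratio_svd = eig_from_svd / eig_from_svd.sum()

print(np.allclose(ratio_eig, ratio_svd))
# The principal directions agree up to the sign of each vector
print(np.allclose(np.abs(vecs[:, order].T), np.abs(vt)))
```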

I hope you enjoyed this tutorial on principal component analysis for dimensionality reduction! We covered the mathematics behind the PCA algorithm, how to perform PCA step-by-step with Python, and how to implement PCA using scikit-learn. Other techniques for dimensionality reduction are Linear Discriminant Analysis (LDA) and Kernel PCA (used for non-linearly separable data).

These techniques, along with other topics for improving model performance, such as data preprocessing, model evaluation, hyperparameter tuning, and ensemble learning, are covered in Next Tech’s Python Machine Learning (Part 2) course.

You can get started here for free!
