Ziad Alezzi

Answering Some Common Questions In Deep Learning Foundations

Based On FastAI Book Chapter 4 Questionnaire

At the end of every FastAI Chapter, there's a series of questions to encourage further research. Even without taking FastAI, some of these answers may surprise you!

Q1: How is a grayscale image represented on a computer? How about a color image?

A1: A grayscale image, for computer vision models, is represented as a matrix of values ranging from 0 to 255. Each element in the matrix represents a pixel, and its value is that pixel's intensity.
The closer to 0, the darker the pixel.
The closer to 255, the lighter the pixel.

Here's an example:
(Image: a matrix of pixel values that forms the digit 3)
Can you guess what number that is? ;]

To explain how a color image is represented on a computer, we first need to understand how a color image even works. Any color can be made by mixing shades of red, green, and blue. That's why a pixel has those three channels. Therefore, each pixel of a color image is simply a combination of three numbers representing how much red, green, and blue is in that pixel.

Now how is this represented in computer vision? Well, you might have a hint already. It is represented not as one matrix, but as 3 matrices stacked on top of each other: the first matrix represents how much red is in the image, the second how much green, and the third how much blue. And again, each element of each matrix contains a number from 0 to 255 for the intensity.

Here's an example:
(Image: the 3 stacked R, G, and B matrices)
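
To make the shapes concrete, here's a small PyTorch sketch (the tensors are random stand-ins, not a real photo):

import torch

# Grayscale: a single matrix of pixel intensities from 0 to 255
gray = torch.randint(0, 256, (28, 28))
print(gray.shape)    # torch.Size([28, 28])

# Color: three stacked matrices, one per channel (R, G, B)
color = torch.randint(0, 256, (3, 28, 28))
print(color.shape)   # torch.Size([3, 28, 28])
print(color[0])      # the red channel, itself a 28x28 matrix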

Q2: Explain how the "pixel similarity" approach to classifying digits works.

In the book, instead of taking the traditional approach to computer vision (training data --> train a model), a simpler method was used to introduce how computer vision works.

They took hundreds of grayscale images of, say, the number 3, and averaged the pixel intensities at each pixel location across every image to construct a new image. This new image is kind of an "average" of all the others.
The "classification" was just comparing the pixel intensities of the input image to the pixel intensities of the "average" image, and picking whichever average it was closest to.
So there are no parameters, and no optimization.
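
Here's a minimal sketch of that idea in PyTorch (stacked_threes and my_digit are hypothetical tensors standing in for the book's data):

# stacked_threes: shape (n_images, 28, 28), one slice per training image of a "3"
mean3 = stacked_threes.float().mean(0)               # the "average 3"

# "classify" by distance: how far is my_digit from the average 3?
dist_to_3 = (my_digit.float() - mean3).abs().mean()  # mean absolute difference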

Q3: What is a list comprehension? Create one now that selects odd numbers from a list and doubles them.

A list comprehension is an efficient and concise way to iterate over data and build a new list from it.
While this would usually require a couple of lines of code, a list comprehension is like merging a for loop and append() into one line.
Here's how it looks:

doubled_list = [i*2 for i in numbers_array]

You can also optionally add an if clause at the end to filter items.
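
And here's one that actually answers the question, selecting the odd numbers and doubling them (numbers is just an example list):

numbers = [1, 2, 3, 4, 5, 6]
doubled_odds = [i * 2 for i in numbers if i % 2 == 1]  # [2, 6, 10]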

Q4: What is a "rank-3 tensor"?

A tensor with three axes, i.e. a 3D cube of numbers.
Rank 1: Vector
Rank 2: Matrix
Rank 3: Cube (3D array)

Q5: What is the difference between tensor rank and shape? How do you get the rank from the shape?

A tensor's rank is simply its number of dimensions (axes). An image, as previously stated, is a matrix of pixel intensities. A matrix is 2-dimensional (2D), thus it is rank-2.
The shape is the number of elements along each axis of the data. For a matrix, that'd be its rows and columns.
The rank is the number of axes in the shape, i.e. the length of the shape.
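
For example, in PyTorch:

import torch

t = torch.zeros(3, 28, 28)
print(t.shape)       # torch.Size([3, 28, 28])  <-- the shape
print(len(t.shape))  # 3                        <-- the rank (number of axes)
print(t.ndim)        # 3, same thing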

Q6: What are RMSE and L1 norm?

Both of these are ways of measuring how far the model's predictions are from the actual labels. RMSE (root mean squared error) is closely related to MSE (mean squared error), which is the average of the squared differences between predictions and labels; RMSE just takes the square root of that:
((pred - label)**2).mean().sqrt()

The L1 norm (mean absolute difference) takes the mean of the absolute differences instead: (pred - label).abs().mean(). Both are useful for telling how "off" the model is.
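
A small sketch comparing the two in PyTorch (pred and target are made-up tensors):

import torch

pred   = torch.tensor([0.9, 0.2, 0.7])
target = torch.tensor([1.0, 0.0, 1.0])

l1   = (pred - target).abs().mean()          # L1 norm / mean absolute difference
rmse = ((pred - target) ** 2).mean().sqrt()  # root mean squared error
print(l1, rmse)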

Q7: How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?

Using vectorized computation: instead of a Python loop, use the built-in whole-array operations in NumPy or PyTorch. These are implemented in optimized C (and CUDA for GPU acceleration), so they apply the calculation to thousands of numbers at once rather than one element at a time.
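
For example, a rough sketch of the difference (exact timings will vary by machine):

import torch

x = torch.rand(1_000_000)

# Python loop: interpreted, one element at a time -- slow
total = 0.0
for v in x:
    total += v.item() ** 2

# Vectorized: one call into optimized C/CUDA code -- many times faster
total_fast = (x ** 2).sum()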

Q8: Create a 3×3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.

import torch

tensor_numbers = torch.tensor([[1, 2, 3],
                               [4, 5, 6],
                               [7, 8, 9]])
doubled = tensor_numbers * 2
doubled[-2:, -2:]

[-2:, -2:] means the final 2 rows and the final 2 columns, selecting the (doubled) numbers [10, 12, 16, 18]

Q9: What is broadcasting?

When we did tensor_numbers * 2 on the 3x3 matrix, what PyTorch did was (conceptually) expand the number 2 into a 3x3 matrix of twos, without actually allocating that matrix in memory:

2 --> [2, 2, 2]
      [2, 2, 2]
      [2, 2, 2]

That's broadcasting in a nutshell.
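
Broadcasting also works between tensors of different (but compatible) shapes, not just with scalars. A small sketch:

import torch

matrix = torch.ones(3, 3)
row    = torch.tensor([1., 2., 3.])

print(matrix + row)  # the row is broadcast across all 3 rows of the matrix
# tensor([[2., 3., 4.],
#         [2., 3., 4.],
#         [2., 3., 4.]])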

Q10: Are metrics generally calculated using the training set, or the validation set? Why?

Metrics are calculated using the validation set, so as to avoid falsely thinking the model will do well on new data when it's actually just overfitting the training set.

Q11: What is SGD?

Stochastic Gradient Descent. A very simple optimization method used to update the parameters of a model according to a loss function.

Q12: Why does SGD use mini-batches?

During training, you can use a batch the size of the entire training set, a batch of size 1, or anything in between.

What's the difference? Well, using the entire training set would take very long for a single step of gradient descent. Sometimes a single epoch (training over the entire dataset) could take a very long time. This is impractical for any changes you might want to make (changing the learning rate, finding mislabeled data, testing different optimization techniques or model architectures), since you'd have to wait that long for every experiment.

Training on a batch of size 1 (a single training example) would give too little information to properly run gradient descent, making the updates noisy and inconsistent, so the model may take longer to converge (successfully train).

Thus, it's best to go in the middle and use mini-batches of the training data.

Q13: What are the seven steps in SGD for machine learning?

Initialize Parameters --> Select Mini-Batch --> Compute Predictions --> Compute the Loss --> Compute Gradients --> Update Parameters --> Repeat From Step 2
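
In very simplified PyTorch-style code, assuming a dataloader and a loss_func already exist (both hypothetical names here), those steps look roughly like this:

import torch

params = [torch.randn(28*28, 1, requires_grad=True),  # 1. initialize parameters
          torch.randn(1, requires_grad=True)]
lr = 0.01

for x, y in dataloader:                    # 2. select a mini-batch
    preds = x @ params[0] + params[1]      # 3. compute predictions
    loss = loss_func(preds, y)             # 4. compute the loss
    loss.backward()                        # 5. compute gradients
    for p in params:
        p.data -= lr * p.grad              # 6. update parameters
        p.grad.zero_()                     #    (and zero the gradients)
                                           # 7. repeat with the next mini-batch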

Q14: How do we initialize the weights in a model?

We initialize the weights to random values. In PyTorch, a layer like nn.Linear does this for you when it is created:

import torch
import torch.nn as nn

linear_layer = nn.Linear(28*28, 1)  # weights and bias are initialized randomly

Q15: Why can't we always use a high learning rate?

If the learning rate is too high, each step overshoots: instead of steadily decreasing, the loss bounces around or even diverges (gets worse and worse). This image illustrates it:

(Image: effect of too high a learning rate)

Q16: Do you need to know how to calculate gradients yourself?

If you're using NumPy, yes, you'd have to do the math yourself. If you're using PyTorch, then no: PyTorch tracks your gradients automatically, which you activate with requires_grad=True when creating a tensor. You then perform some calculation (for example, a loss function) and call .backward() on the result, like:

x = torch.tensor(arr, requires_grad=True)
loss = ((x - y) ** 2).mean()
loss.backward()

You can access the gradients using x.grad and zero them with x.grad.zero_()

Q17: Why can't we use accuracy as a loss function?

Accuracy shows the performance of the model to a human; a loss function shows the performance to the computer (the thing doing the optimizing).
Accuracy only changes when a prediction actually flips class: the loss can go from 0.1 to 0.2 (the model getting worse), but the accuracy would still show 60% (for example).
Thus, the gradients will be zero, and the model will be stuck.

Basically: the loss is a finer, smoother measure of performance that gradient descent can actually follow.

Q18: What does the DataLoader class do?

Say you have some data: data = [[1, 2], [4, 5], [6, 7], [8, 9]]
A DataLoader gives you an easy way to organize this data into batches for your model, offering features like mini-batching and shuffling.

data_set = DataLoader(data, batch_size=2, shuffle=True)

Q19: Create a function that, if passed two arguments [1,2,3,4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure?

Alright, so first I want to show you a bit more of what a DataLoader looks like:

data = range(10)
loader = DataLoader(data, batch_size=5)
print(list(loader))

--> [tensor([0, 1, 2, 3, 4]), 
     tensor([5, 6, 7, 8, 9]),]

That's the basic idea of a DataLoader. However, when training a model, we also need labels for the data (x = data, y = labels).
Therefore we introduce a very basic dataset, which is basically just a collection of (x, y) tuples:

import string

dataset = L(enumerate(string.ascii_lowercase))
print(dataset)

--> [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), (6, 'g')...

Now, passing this into the DataLoader, we get mini-batches of these tuples with their labels attached:

data_loader = DataLoader(dataset, batch_size=3)
print(list(data_loader))

--> [(tensor([0, 1, 2]), ('a', 'b', 'c')), 
     (tensor([3, 4, 5]), ('d', 'e', 'f')), 
     (tensor([6, 7, 8]), ('g', 'h', 'i')),]

What's special about this output data structure is that it's a collection of (x, y) tuples: each input paired with its label, which is exactly the structure a training dataset needs.
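
As for the function the question actually asks for, Python's zip does exactly this (pair_up is just a name I made up):

def pair_up(xs, ys):
    return list(zip(xs, ys))

print(pair_up([1, 2, 3, 4], 'abcd'))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]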

Q20: What does view do in PyTorch?

view allows you to reshape your tensors.

Example:

tensors = torch.arange(6)  # tensor([0, 1, 2, 3, 4, 5])
tensors.shape  # torch.Size([6])

tensors.view(2, 3)  # Reshaping to 2 rows and 3 columns

Output:

tensor([[0, 1, 2],
        [3, 4, 5]])

Q21: Why do we have to zero the gradients?

In PyTorch, when your gradients are saved in .grad, they are not overwritten but rather accumulate.
Meaning that the gradients from training step 1 get added to the gradients from training step 2, and so on.

Why is that the default? One reason is limited GPU memory. Part of why mini-batches exist is that you often don't have enough RAM/VRAM to process the entire dataset at once, so you break it into mini-batches.
By accumulating the gradients of several mini-batches before taking one large training step, you can simulate a bigger batch than your GPU could handle in one go.

But if you don't want to accumulate gradients across batches, then after every training step you must zero your gradients, e.g. with the optimizer's zero_grad() or the tensor's .grad.zero_().

Q22: What information do we have to pass to Learner?

To create a Learner that allows you to call easy functions like .fit(), you must pass the following information:

  1. Your DataLoader
  2. Your Model Architecture (e.g. nn.Linear or a small neural net)
  3. Your Optimizer Function (e.g. SGD)
  4. Your Loss Function
  5. Optionally: A Metric

Here's an example:

learn = Learner(dataloader, nn.Linear(28*28,1), opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)

Each of these (mnist_loss, batch_accuracy) is a function defined earlier in the chapter.

We can then easily run:

learn.fit(10, lr=0.01)

The 10 is the number of epochs (training rounds), and lr is the learning rate.

Q23: The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?

Wowie, praise the jargon. Let's break down this mouthful.
The Universal Approximation Theorem (how fancy) simply states: "You can approximate any function using just one reaaally big layer and a nonlinearity."

Basically, instead of having a neural network that's 100 layers deep, you'd have a neural network with just 1 really big layer with a lot of neurons/units, plus a nonlinearity (sigmoid, tanh, etc.)

Here's the issue with this: it is EXTREMELY inefficient. To "approximate" (solve the problem), that single layer might need millions of neurons, whereas a deeper neural network can solve the same problem with far fewer neurons.
This is especially true in computer vision, where each layer detects different features, with more complex features the deeper you go.
Here's an example of the different features that a deep neural network with multiple layers can learn:

(Image: features extracted from successive layers of a deep network)

As you can see, the deeper the neural network goes, the more complex features you can extract.

So while yes, a neural network with 1 absurdly big layer can indeed approximate any function, it is so inefficient that you're better off just slapping in a couple more layers.

It'll save you time, space, and sanity.

Conclusion

Thank you so much for reading this far into my nerdy little blog!! At the end of every fastai chapter, there's a questionnaire, so I decided to solve it in a blog post, since maybe it can help teach or simply intrigue someone out there!!

As a wise man from CS50 once said: "This Was Deep Learning"

Check out my github, where you'll see many more of my nerdy projects. Like the time I made a Cancer Classification Neural Network from scratch with just Numpy and Math! ;]

lucirie (Ziad Alezzi) · GitHub
