Lesson 5 notes — Part 1 v3


Image classification — Currently our algorithms do a very good job in this area.
NLP — Deep learning is starting to get better results
Tabular and collaborative filtering — Deep learning is getting as good results as other approaches without needing to do feature engineering.

In the following lessons, we are going to learn the theory behind these things. We have seen how they work; next, we learn to implement them ourselves.

Review from the last lesson:

Parameters/Weights = things that your model learns.
Activations = numbers that are calculated.

People often ask Jeremy where a particular number came from. His answer is to first ask whether the number is a parameter or an activation. Every number in the model is one of the two, except the inputs.

When our model outputs predictions, we calculate the loss between them and the actual values. Then we use that loss to calculate the gradient with respect to all parameters; this step is called backpropagation. Finally, we subtract the gradient times the learning rate from the parameters, which is the gradient descent update.
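The loop described above can be sketched in plain PyTorch. This is only a minimal sketch on made-up data: the toy linear model, the tensor names, and the learning rate are assumptions for illustration, not code from the lesson.

```python
import torch

torch.manual_seed(0)

# Toy data (assumed for illustration): y = 3*x + 2
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 3 * x + 2

# Parameters: the numbers the model learns
w = torch.zeros(1, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1

def sgd_step():
    y_hat = x @ w + b                  # activations: numbers that are calculated
    loss = ((y_hat - y) ** 2).mean()   # loss between predictions and targets
    loss.backward()                    # backpropagation: gradient w.r.t. w and b
    with torch.no_grad():
        for p in (w, b):
            p -= lr * p.grad           # subtract gradient times learning rate
            p.grad.zero_()             # reset so gradients don't accumulate
    return loss.item()

losses = [sgd_step() for _ in range(100)]
```

Repeating the step should make the loss shrink as the parameters move toward the values that generated the data.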


ResNet is trained using ImageNet data and there are 1000 classes in ImageNet. That is why there are 1000 columns in the last parameter matrix.

There are also 1000 outputs.

Now if you want to use these parameters and this model for another task, the problem is that the last layer is useless because it predicts ImageNet categories. That is why we throw this weight matrix away. In Fastai, create_cnn does this for us automatically when a pretrained model is given.

Fastai deletes this last parameter matrix and replaces it with two new parameter matrices with a ReLU between them. The number of columns in the first matrix is a default value, but the number of columns in the second is the number of classes in your task. You can check that number with data.c.

Those new layers are randomly initialized, but because we only replaced the last layer, the model is still much better than random. If the model recognizes images, the first layer might recognize simple shapes. The further we go, the more specific the things a layer recognizes, and that is why the later layers need more training. The parameter matrix that we deleted recognized full objects from ImageNet, and now we need to train its replacement to recognize the objects we care about.

Freezing means that we don’t backpropagate gradients into those layers, so the frozen parameters don’t change. When we use transfer learning, by default all layers are frozen except the two new matrices we added.
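In plain PyTorch, one way to get the same effect as freezing is to turn off gradient tracking for the layers you want to keep fixed. The tiny two-layer model below is a made-up stand-in for a pretrained body plus a new head, so treat it as a sketch of the idea rather than what Fastai does internally.

```python
import torch
from torch import nn

# Stand-in for a pretrained body plus a newly added head (both assumed)
body = nn.Linear(8, 4)
head = nn.Linear(4, 2)
model = nn.Sequential(body, nn.ReLU(), head)

# Freeze the body: no gradients are computed for its parameters
for p in body.parameters():
    p.requires_grad_(False)

loss = model(torch.randn(16, 8)).sum()
loss.backward()

frozen_has_grad = any(p.grad is not None for p in body.parameters())
head_has_grad = all(p.grad is not None for p in head.parameters())
```

Only the head ends up with gradients, so an optimizer step would leave the body untouched.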

After training the newest layers for a while with all other layers frozen, we unfreeze all layers. Unfreezing means that we train all layers as we normally would with a new model. But because we already have pretty good weights, we don’t want to change them much, and that is why we decrease our learning rates. We also often use multiple learning rates: the first layers should be trained with smaller learning rates than the newer ones. When we pass multiple learning rates, Fastai automatically splits the layers into groups, and each group is trained with its own learning rate.

1e-3 = train all layers using this learning rate
slice(1e-3) = last layers are trained with 1e-3 and all the other layers with (1e-3)/3
slice(1e-5,1e-3) = first layers are trained with 1e-5, last layers with 1e-3, and the layer groups in between get learning rates spread evenly (by a constant multiplier) between these numbers. For example, if there are three layer groups, the learning rates will be 1e-5, 1e-4, 1e-3
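The spreading of learning rates between two endpoints can be sketched in a few lines. The helper name spread_lrs is hypothetical and written only for this note; it assumes the rates are spaced by a constant multiplier, which matches the 1e-5, 1e-4, 1e-3 example.

```python
def spread_lrs(start, stop, n_groups):
    """Return n_groups learning rates spaced by a constant multiplier."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

lrs = spread_lrs(1e-5, 1e-3, 3)  # first group, middle group, last group
```

With three groups the middle one lands at roughly 1e-4, a factor of ten away from each neighbor.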

matrix multiplication = linear layer = affine function

In Excel, if a formula such as a matrix multiplication returns a result that spans more than one cell, you need to press Ctrl + Shift + Enter instead of just Enter.

These two techniques give exactly the same result, but the second one is faster. Why? The first technique performs a lot of multiplications whose result is zero, wasting time on work that doesn’t matter.

The slower version is called one-hot encoding. In that version, if you have, say, 10 different classes, you create for every item an array with 9 zeros and a single one. The faster version is called embedding. An embedding is simply an array lookup: you replace a class index with its row of weights.

For example, suppose the input to our neural network has 10 classes. Some people would encode that input as a row vector with a one at the index of the class and multiply it by the weight matrix. The better way is to keep a row vector of length n for every class and look it up directly. It does exactly the same thing, but the latter version is much faster.
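To see that the two versions agree, here is a small PyTorch sketch. The sizes and the random weight matrix are assumptions chosen only for illustration.

```python
import torch

torch.manual_seed(0)
n_classes, n_factors = 10, 4
weights = torch.randn(n_classes, n_factors)   # one row of weights per class
idx = torch.tensor([3, 0, 7])                 # three example class indices

# Slower: build one-hot rows, then multiply by the weight matrix
one_hot = torch.zeros(len(idx), n_classes)
one_hot[torch.arange(len(idx)), idx] = 1.0
via_matmul = one_hot @ weights

# Faster: look the rows up directly (what an embedding does)
via_lookup = weights[idx]

same = torch.allclose(via_matmul, via_lookup)
```

The lookup skips all the multiplications by zero, which is where the speedup comes from.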

After training for a while, you notice that individual entries in the row vectors start to correspond to different kinds of things.

This could be the row vector for some movie. The weights might tell, for example, how much comedy or how much George Clooney there is in the movie.

We don’t define these hidden features; the model creates them by itself. That way we give it the freedom to choose the most important features. These are also called latent factors.

The problem now is what happens if a movie is just bad overall, for example if George Clooney is the main character but the movie is boring. We need a way to say: if a user likes George Clooney, recommend Clooney movies, except this particular bad one. We can solve this by adding a bias. A bias is just a single number that is added to the dot product. Every movie and every user gets its own bias.
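The dot product plus biases can be written out in a few lines of plain Python. The function name and all the numbers below are made up for illustration.

```python
def predict_rating(user_factors, movie_factors, user_bias, movie_bias):
    """Dot product of the latent factors plus one bias per user and per movie."""
    dot = sum(u * m for u, m in zip(user_factors, movie_factors))
    return dot + user_bias + movie_bias

# Made-up numbers: the user likes what the movie's factors describe...
user = [0.9, 0.1]
movie = [1.0, 0.2]
good = predict_rating(user, movie, user_bias=0.0, movie_bias=0.5)
# ...but a strongly negative movie bias still pulls the prediction down
bad = predict_rating(user, movie, user_bias=0.0, movie_bias=-2.0)
```

The factors are identical in both calls; only the movie bias separates the good movie from the bad one.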

Jeremy said that everyone will face the error below at some point in their life.

import pandas as pd

# Raises UnicodeDecodeError because u.item is not UTF-8 encoded
movies = pd.read_csv(path/'u.item', delimiter='|', header=None,
                     names=['movieId','title','date','N','url',
                            *[f'g{i}' for i in range(19)]])

This error means that the CSV file isn’t UTF-8 encoded. We solve this by adding encoding='latin-1'.

movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     header=None,
                     names=['movieId','title','date','N','url',
                            *[f'g{i}' for i in range(19)]])

Sometimes the encoding is something other than latin-1, so you need to try a few candidates or look it up in the dataset documentation.

One more thing about embeddings is their size. How long should an embedding be if the input has n different values? There is no easy answer, but Jeremy’s rule of thumb is the number of unique values in that input divided by two, with a maximum of 50. It is still worth testing different sizes, because there is no established technique for choosing it yet.
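The rule of thumb above fits in a one-line function. The function name is hypothetical, and rounding down and never going below 1 are my own assumptions; the notes only say "divided by two, max 50".

```python
def embedding_size(n_unique, max_size=50):
    """Half the number of unique values, capped at max_size, at least 1."""
    return max(1, min(max_size, n_unique // 2))

sizes = [embedding_size(n) for n in (7, 10, 1000)]
```

Small inputs get a size proportional to their number of values, while large ones are capped at 50.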

We can reduce the number of columns in a torch tensor using PCA.

movie_w = learn.weight(top_movies, is_item=True)
movie_w.shape
OUTPUT: torch.Size([1000, 40])

movie_pca = movie_w.pca(3)
movie_pca.shape
OUTPUT: torch.Size([1000, 3])

Weight decay

Many people think that the more parameters a model has, the more complex it is. That is not true: if most of the parameters are very close to zero, it is the same as not having them at all. By adding more parameters, we give the model a chance to adjust more widely if needed. This is why we want to have a lot of parameters but at the same time penalize complexity.

One way to penalize complexity is to sum up the squares of the parameters and add that number to the loss. The problem is that the sum can be so big that the best thing the model can do is set all parameters to zero. That is why we multiply the sum by a small hyperparameter. In Fastai it is called wd (weight decay), and it should generally be 1e-1. People have tested other values, but 1e-1 seems to work best.

By default in the Fastai library, the weight decay is 1e-2. It is lower than it should be because in rare cases too large a weight decay stops the model from learning, and that can be a hard problem for beginners to diagnose. Jeremy recommends using 1e-1 instead of the default, because now you understand that if the parameters collapse toward zero, the weight decay is too high. Too small a weight decay just means the model overfits earlier, so it doesn’t break the model right away.

**kwargs = parameters that are going to get passed up the chain to the next thing we call.

Sometimes wd is not listed as a parameter in a learner function’s signature even though the function accepts it. It is just hidden behind **kwargs.
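A tiny plain-Python example shows how an argument can be accepted without being named in the signature. Both function names are hypothetical and have nothing to do with the actual Fastai functions.

```python
def fit(lr, wd=1e-2):
    # Innermost call: this is where wd is actually used
    return f"lr={lr}, wd={wd}"

def make_learner(lr, **kwargs):
    # wd is not named here, but it is accepted and passed along
    return fit(lr, **kwargs)

default = make_learner(1e-3)
custom = make_learner(1e-3, wd=1e-1)
```

Reading only the signature of make_learner, you would not know that wd is a valid argument; you have to follow the chain to fit.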

Weight decay explained using math notation

Jeremy used a function called map when he wanted to turn his data into torch tensors.

x_train,y_train,x_valid,y_valid = map(torch.tensor,(x_train,y_train,x_valid,y_valid))

The idea of map is that the first argument is the function you want to apply, and the second is a collection of values that are passed to that function one by one. The code above does the same thing as the code below.

x_train = torch.tensor(x_train)
y_train = torch.tensor(y_train)
x_valid = torch.tensor(x_valid)
y_valid = torch.tensor(y_valid)

In PyTorch, we split data into mini-batches by first creating a dataset.

train_ds = TensorDataset(x_train,y_train)
valid_ds = TensorDataset(x_valid,y_valid)

A dataset is just something where x and y values are paired with each other, so we can get the nth x and y value using an index.

Once our data is in a dataset, we create a dataloader.

data = DataBunch.create(train_ds, valid_ds, bs=64)

This puts our data into mini-batches, which means we can iterate through the data one batch at a time.

x,y = next(iter(data.train_dl))
x.shape,y.shape
OUTPUT: (torch.Size([64,784]), torch.Size([64]))
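The notes use Fastai's DataBunch.create for the dataloader step. The same dataset-then-batches idea can be sketched with PyTorch's own DataLoader instead; the random tensors below are stand-ins for the real data.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random stand-in data shaped like flattened 28x28 images (assumed)
x_train = torch.randn(256, 784)
y_train = torch.randint(0, 10, (256,))

train_ds = TensorDataset(x_train, y_train)
x0, y0 = train_ds[0]                      # index into the dataset: one (x, y) pair

train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)
xb, yb = next(iter(train_dl))             # one mini-batch
n_batches = len(train_dl)
```

With 256 rows and a batch size of 64, one pass over the dataloader yields four batches.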

If you want to create your own model, you need to subclass nn.Module.

class Mnist_Logistic(nn.Module):
    def __init__(self):
        # This line is important to remember.
        super().__init__()
        # Inputs, outputs
        self.lin = nn.Linear(784, 10, bias=True)

    def forward(self, xb): return self.lin(xb)

Every model needs to have a forward function. In that function, we tell the computer how to use our layers.

As homework, try to create the nn.Linear layer from scratch.
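One possible starting point for that homework is sketched below. It is only one way to do it: the class name is made up, the initialization scheme is an assumption, and the weight is stored as (inputs, outputs), which is the transpose of how nn.Linear stores it.

```python
import math
import torch
from torch import nn

class MyLinear(nn.Module):
    """A hand-written affine layer: x @ weight + bias."""
    def __init__(self, n_in, n_out):
        super().__init__()
        # nn.Parameter registers the tensors so model.parameters() finds them
        self.weight = nn.Parameter(torch.randn(n_in, n_out) / math.sqrt(n_in))
        self.bias = nn.Parameter(torch.zeros(n_out))

    def forward(self, xb):
        return xb @ self.weight + self.bias

layer = MyLinear(784, 10)
out = layer(torch.randn(64, 784))
shapes = [tuple(p.shape) for p in layer.parameters()]
```

Wrapping the tensors in nn.Parameter is what lets an optimizer, or a manual update loop, find and update them.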

We need to move the parameters onto the GPU manually when we are using our own model.

model = Mnist_Logistic().cuda()

We can look at our parameters in the model using the following code.

[p.shape for p in model.parameters()]

Then Jeremy showed how to implement weight decay.

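loss_func = nn.CrossEntropyLoss()

def update(x, y, lr):
    wd = 1e-5
    y_hat = model(x)
    # weight decay: sum of squared parameters
    w2 = 0.
    for p in model.parameters(): w2 += (p**2).sum()
    # add to regular loss
    loss = loss_func(y_hat, y) + w2*wd
    # compute gradients of the loss with respect to every parameter
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p.sub_(lr * p.grad)
            # reset the gradient so it does not accumulate across calls
            p.grad.zero_()
    # This is just turning PyTorch tensor into normal Python number
    return loss.item()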

We start by looking at this equation.

mse(m(x,w),y) + wd * w²

When we calculate the gradient of this expression, it is the same as calculating the gradient of each term separately and adding them together. The gradient of wd * w² with respect to w is:

2 * wd * w

And because wd is a constant we choose ourselves, we can absorb the factor of two into it:

wd * w

In the update step, this term subtracts lr * wd * w from each weight. That is the same as multiplying the weights by a number slightly less than 1, which shrinks (decays) them.

wd * w² (added to the loss) = L2 regularization
wd * w (added to the gradient) = Weight decay

Although we have different names for these two, most of the time they are mathematically the same. Jeremy will explain later in which cases they are not.
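The hand-derived gradient can be checked against PyTorch's autograd. The weight values and wd below are arbitrary numbers picked for the check.

```python
import torch

w = torch.tensor([0.5, -2.0, 3.0], requires_grad=True)
wd = 0.1

penalty = wd * (w ** 2).sum()   # the L2 term that is added to the loss
penalty.backward()

expected = 2 * wd * w.detach()  # the derivative worked out by hand
matches = torch.allclose(w.grad, expected)
```

Autograd and the hand calculation agree on 2 * wd * w; dropping the factor of two only rescales wd.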

The idea of momentum in math notation:

S_t = alpha * g + (1-alpha) * S_(t-1)

Each time, we multiply the current gradient by a small number alpha and add the previous value S_(t-1) times (1-alpha). The term (1-alpha) * S_(t-1) is important to understand: because every step builds on the last value, the values before it still matter, but less and less. This is called an exponentially weighted moving average of the past gradients.
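The recurrence is short enough to run by hand. Here is a plain-Python sketch that assumes the average starts at zero; the notes do not specify a starting value.

```python
def ewma(gradients, alpha=0.1):
    """Return S_t = alpha * g + (1 - alpha) * S_(t-1) for each step, with S_0 = 0."""
    s, history = 0.0, []
    for g in gradients:
        s = alpha * g + (1 - alpha) * s
        history.append(s)
    return history

steps = ewma([1.0, 1.0, 1.0], alpha=0.1)
```

With a constant gradient of 1.0, the average climbs slowly toward 1.0 instead of jumping there, which is the smoothing effect momentum relies on.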

RMSprop is very similar to momentum, but instead of keeping an exponentially weighted moving average of the gradients, we keep one of the squared gradients. Then we calculate the new weights by subtracting from the old weights the gradient times the learning rate, divided by the square root of that average.

RMSprop = old_RMSprop * 0.9 + 0.1 * gradient²
weights = old_weights - (gradient * learning_rate) / sqrt(RMSprop)

Adam does both momentum and RMSprop at the same time.

adam = (momentum * learning_rate) / sqrt(RMSprop)
weights = old_weights - adam
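Putting the two moving averages together gives something like the scalar sketch below. It is a simplified illustration and not the full Adam algorithm: it leaves out bias correction, and the function name, the beta values, and the small eps added to avoid division by zero are all assumptions.

```python
import math

def adam_like_step(w, g, mom, rms, lr=0.01, beta1=0.9, beta2=0.9, eps=1e-8):
    """One scalar update combining momentum and RMSprop (no bias correction)."""
    mom = beta1 * mom + (1 - beta1) * g          # moving average of gradients
    rms = beta2 * rms + (1 - beta2) * g ** 2     # moving average of squared gradients
    w = w - lr * mom / (math.sqrt(rms) + eps)    # step scaled by gradient size
    return w, mom, rms

# Minimize f(w) = w**2, whose gradient is 2*w, starting from w = 1.0
w, mom, rms = 1.0, 0.0, 0.0
for _ in range(200):
    w, mom, rms = adam_like_step(w, 2 * w, mom, rms)
```

Because the step is divided by the root of the squared-gradient average, its size stays close to the learning rate no matter how large or small the raw gradient is.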
