Lesson 12 notes — Part 2 v3


Today’s topics are transfer learning and NLP.

Today we will see how a state-of-the-art ImageNet model is built.

Mixup / Label smoothing


Mixup might replace all of the other augmentation techniques. It generalizes well across domains and is definitely something everyone should pay attention to.

The idea is to combine two images by taking some amount of one image and some amount of another. We do the same for the labels. So, for example, we might take 30% of a plane image and 70% of a dog image, and then the label for that combination will be 30% plane and 70% dog.
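The blending described above can be sketched in a few lines of plain Python. This is only an illustration: the toy "images" are short lists of made-up pixel values, `mixup` is a hypothetical helper name, and the alpha value of 0.4 used for the random draw is an assumption, not something stated in these notes.

```python
import random

def mixup(x1, y1, x2, y2, lam):
    """Blend two inputs and their one-hot labels with the same weight lam."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Toy data (hypothetical): flat pixel lists, labels ordered as [plane, dog].
plane_img, plane_lbl = [1.0, 1.0, 0.0, 0.0], [1.0, 0.0]
dog_img, dog_lbl = [0.0, 0.5, 1.0, 1.0], [0.0, 1.0]

# 30% plane and 70% dog, for both the image and the label.
mixed_img, mixed_lbl = mixup(plane_img, plane_lbl, dog_img, dog_lbl, lam=0.3)
print(mixed_img, mixed_lbl)

# During training the weight would be drawn at random for every pair.
# alpha = 0.4 is an assumed value.
rng = random.Random(0)
lam = rng.betavariate(0.4, 0.4)
print(lam)
```

The key point is that the same weight is used for the pixels and for the labels, so the target always matches what is actually in the picture.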

Combination of dog and gas pump

How do we decide how much of each picture to take? And when should we do this?

Beta distribution

The graph above is used for deciding how much of each image to take. It shows that in most cases we take close to 0% of one image and close to 100% of the other, but occasionally we take something in between.

To use the beta distribution we need the gamma function. Let’s look at how it works.

Gamma function. We use a Greek letter (Γ) for this because it is just a math function.

To type Greek letters easily you need to find the compose key on your keyboard. I have Ubuntu 18 and setting up the compose key was pretty time-consuming. This conversation will take you closer to the result.

Factorial = n! For positive integers, Γ(n) = (n-1)!, so the gamma function extends the factorial to non-integer values.
This is what beta distributions look like. It is important to have parameters to tune, and it is also important to plot different options to check that the result is what you expected.
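To make the connection between the gamma function and the beta distribution concrete, here is a small sketch using only Python’s math module. The function name `beta_pdf` and the alpha values in the loop are my own choices for illustration.

```python
import math

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

# The gamma function matches the factorial, shifted by one.
print(math.gamma(5), math.factorial(4))

# Density near the edges (0.05, 0.95) and in the middle (0.5)
# for a few symmetric settings.
for alpha in (0.1, 0.4, 0.8, 1.0):
    row = [round(beta_pdf(x, alpha, alpha), 3) for x in (0.05, 0.5, 0.95)]
    print(alpha, row)
```

For alpha below 1 the density is much higher near 0 and 1 than in the middle, which is why most mixed images are dominated by one of the two pictures. With alpha equal to 1 the distribution is flat.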

linear combination = exponentially weighted moving average

After softmax, one number tends to be a lot higher than the others. This is not good for mixup. With label smoothing we don’t use one-hot encoding but something like 0.9-hot encoding: instead of asking the model to output 1 for the correct class, we ask for 0.9 for the correct class and share the remaining 0.1 among the other classes.

This is a simple but very effective technique for noisy data. You actually want to use this almost always unless you are certain that there is only one right label.
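A minimal sketch of label smoothing, assuming the simple variant where the correct class gets 1 - eps and the rest is split evenly among the other classes. Libraries may spread eps slightly differently, and the helper names here are hypothetical.

```python
import math

def smooth_labels(target, n_classes, eps=0.1):
    """Soft target: 1 - eps for the true class, eps shared by the others."""
    soft = [eps / (n_classes - 1)] * n_classes
    soft[target] = 1.0 - eps
    return soft

def cross_entropy(logits, soft_target):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    return -sum(t * lp for t, lp in zip(soft_target, log_probs))

soft = smooth_labels(target=2, n_classes=5)
print(soft)
print(cross_entropy([0.1, 0.2, 3.0, 0.0, -1.0], soft))
```

Because the target is never exactly 1, the model is not pushed to make its logits infinitely large for one class, which helps when some labels are wrong.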

Mixed Precision Training


Mixed precision training is a technique where we use 16-bit floats instead of 32-bit floats for most of the work. This can speed up training by about 3x. It only works on modern Nvidia GPUs.

We can’t do everything in 16-bit because it is not accurate enough. From the graph above, you can see which things are done in 16-bit and which should be done in 32-bit. As you can see, the forward pass and backward pass, which are the most time-consuming parts, are done in 16-bit.
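To see why 16-bit alone is not accurate enough, here is a small experiment that needs no GPU. It uses the half-precision format from Python’s struct module to round numbers to 16 bits. The loss scaling step at the end is one common remedy; it is not spelled out in these notes, and the scale of 1024 is an assumed value.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

# A small update disappears when it is added to a weight stored in 16 bits.
weight, update = 1.0, 0.0001
print(to_fp16(weight + update))  # 1.0

# A tiny gradient underflows to zero in 16 bits.
grad = 1e-8
print(to_fp16(grad))  # 0.0

# Loss scaling (assumed remedy): scale up before the 16-bit step,
# then divide again once the value is back in higher precision.
scale = 1024.0
scaled = to_fp16(grad * scale)
recovered = scaled / scale
print(scaled, recovered)
```

This is why the weight update is kept in 32-bit while the forward and backward passes run in 16-bit.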

The notebook shows how to use the APEX library to do this in practice.

In fastai there is a callback for this that can be added easily.


This is like ResNet but with a few tweaks. The first tweak is called ResNet-C: instead of starting with one 7x7 convolution, we use three 3x3 convolutions. The second tweak is that we initialize some batch norm weights to 0 (the last batch norm of each ResBlock) and the others to 1. The idea is that at the start every ResBlock can be ignored, because its convolution path is multiplied by zero and only the identity path remains. This way we can train very deep models with high learning rates: if the model doesn’t need a layer, it can skip it by keeping the batch norm weights at zero. The third tweak is to move the stride of two one convolution up.
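The zero-initialized batch norm trick is easy to see in a toy version of a residual block. The sketch below is not a real convolutional block: `toy_conv_path` is a made-up stand-in for the convolutions, and `bn_weight` plays the role of the weight of the last batch norm.

```python
def res_block(x, conv_path, bn_weight):
    """Residual block: identity path plus a scaled convolution path."""
    return [xi + bn_weight * ci for xi, ci in zip(x, conv_path(x))]

def toy_conv_path(x):
    # Hypothetical stand-in for the convolutions inside the block.
    return [3.0 * xi - 1.0 for xi in x]

x = [0.5, -1.0, 2.0]

# With the weight at zero the block returns its input unchanged.
out_zero = res_block(x, toy_conv_path, bn_weight=0.0)

# With the weight at one the convolution path contributes fully.
out_one = res_block(x, toy_conv_path, bn_weight=1.0)

print(out_zero, out_one)
```

At initialization the whole network therefore behaves like a much shallower one, and each block only starts to contribute once training moves its weight away from zero.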

The paper Jeremy is talking about: Bag of Tricks for Image Classification with Convolutional Neural Networks.

Big companies like to brag about how big a batch they can train at once. For us normal people, being able to increase the learning rate is what we want. That way we can speed up training and generalize better.

Jeremy showed how, using these techniques, he built the third-best ImageNet model. The two models above it are much bigger and require a lot more computation power.


Using the same techniques we can also make good models for transfer learning.

To save a model we call learn.model.state_dict(), which returns a dictionary of the model’s parameters. We didn’t write this ourselves, but according to Jeremy it is just three lines of code.

Keys of the dictionary.
With key we can get values.
This is how we save the parameters with PyTorch. torch.save is about the same as using pickle.

Then, to use these pretrained weights, we need to remove the last linear layer and replace it with another layer that has the right number of outputs.

The important thing to notice when fine-tuning is that frozen batch norm layers will hurt the accuracy. The solution is to freeze only the layers that don’t contain batch norm.

Discriminative LR and param groups

Discriminative LR means using different learning rates for different layers.

With this function we can split the parameters into two groups.
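The notes don’t show the function itself, so here is a hypothetical sketch of the idea in plain Python. The parameter names, the `split_params` helper, and the two learning rates are all made up for illustration. The list of dictionaries with "params" and "lr" keys is the shape that PyTorch optimizers accept for parameter groups.

```python
def split_params(named_params, is_head):
    """Split (name, param) pairs into a body group and a head group."""
    body, head = [], []
    for name, param in named_params:
        (head if is_head(name) else body).append(param)
    return body, head

# Stand-in parameters (hypothetical names, lists instead of tensors).
named_params = [
    ("body.conv1.weight", [0.1, 0.2]),
    ("body.conv2.weight", [0.3]),
    ("head.linear.weight", [0.4, 0.5]),
    ("head.linear.bias", [0.0]),
]

body, head = split_params(named_params, is_head=lambda n: n.startswith("head."))

# Lower learning rate for the pretrained body, higher for the new head.
param_groups = [
    {"params": body, "lr": 1e-4},
    {"params": head, "lr": 1e-2},
]
print([(len(g["params"]), g["lr"]) for g in param_groups])
```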

Question: Why are you against using a cross-validation set? It’s good if you don’t have enough data to build a proper validation set, but if you have even a thousand data points it is often pointless. So Jeremy is not against it, but in most cases there is enough data to make it unnecessary.

Question: What are the best practices for debugging a deep learning model? Don’t make mistakes in the first place. Try to build a model so simple that there can’t be mistakes. Test every step during training to see that everything works the way you expected. Keep track of everything you do so that it is easier to see later what might have gone wrong. Debugging is really hard, and that is why you should be constantly testing that everything works. Tests inside the code are great because you notice right away if something is broken, before it is too late.

Natural Language Processing

ULMFiT is transfer learning applied to AWD-LSTM.

Although many people think that ULMFiT is just for text, it can actually be used for many different tasks that involve sequential data.

Language model = predict the next item in a sequence

Next, we are going to see how ULMFiT works from scratch. There are four different notebooks that show steps of building the final model.


First, we start by importing the IMDb dataset, which contains 50,000 labeled reviews and 50,000 unlabeled reviews. That gives us train, test, and unsup (unsupervised = no labels) datasets.

We can’t give raw text to the model, and that is why we tokenize and numericalize the text.
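Here is a rough sketch of what tokenizing and numericalizing means, using a whitespace tokenizer as a simplifying assumption. Real tokenizers are much smarter, and the function names are hypothetical. The special tokens xxunk and xxpad follow the fastai naming style.

```python
from collections import Counter

def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def build_vocab(token_lists, min_freq=1):
    """Map every token seen at least min_freq times to an integer id."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = ["xxunk", "xxpad"] + [t for t, c in counts.most_common() if c >= min_freq]
    stoi = {t: i for i, t in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace tokens with ids; unknown tokens become the xxunk id."""
    return [stoi.get(t, stoi["xxunk"]) for t in tokens]

texts = ["The movie was great", "The plot was thin"]
tokens = [tokenize(t) for t in texts]
itos, stoi = build_vocab(tokens)
ids = [numericalize(t, stoi) for t in tokens]
print(itos)
print(ids)
```

The model only ever sees the lists of integers. The itos list is kept around so that predictions can be turned back into words.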

There are two things happening before normal tokenization. First of all, we change things like <br /> into \n to make the text consistent. We also define some special tokens that we can use in text to make it easier for the model to understand.

Repeated characters are represented using the xxrep token.

This kind of preprocessing makes it easier for the model to understand the text. When there is a token for repetitive characters it is easier for the model to understand the similarity of 5 question marks and 6 question marks.
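A minimal sketch of these two preprocessing rules with regular expressions. The threshold of four or more repeated characters and the exact output format are assumptions modeled on the fastai style, not taken from these notes.

```python
import re

def fix_html(text):
    """Turn HTML line breaks into plain newlines."""
    return text.replace("<br />", "\n")

def replace_rep(text):
    """Replace a character repeated 4+ times with ' xxrep <count> <char> '."""
    def _sub(match):
        char = match.group(1)
        count = len(match.group(0))
        return f" xxrep {count} {char} "
    return re.sub(r"(\S)\1{3,}", _sub, text)

print(repr(fix_html("Great film.<br />Loved it")))
print(repr(replace_rep("why?????")))
print(repr(replace_rep("why??????")))
```

Five and six question marks now differ only in the number that follows xxrep, instead of being two unrelated tokens.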

As always, we need a way to feed the text in batches. Let’s look at an example to better understand how the batching happens.

Let’s use this text
This is what it looks like when we have a batch size of 6
If bptt is 5 we have three mini-batches

This is where people get confused. Normally in vision tasks a batch contains many images, but in NLP one text document might be split across many batches. The batches are fed in order so that the text stays continuous from one batch to the next.
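The batching can be sketched like this. I use numbers 0 to 90 instead of real tokens so that the layout is easy to follow, and `lm_batches` is a hypothetical helper, not the notebook’s code. The target is simply the input shifted by one token, because a language model predicts the next item.

```python
def lm_batches(tokens, bs, bptt):
    """Yield (x, y) mini-batches; y is x shifted by one token."""
    n = (len(tokens) - 1) // bs  # tokens per row
    xs = [tokens[i * n:(i + 1) * n] for i in range(bs)]
    ys = [tokens[i * n + 1:(i + 1) * n + 1] for i in range(bs)]
    for start in range(0, n, bptt):
        x = [row[start:start + bptt] for row in xs]
        y = [row[start:start + bptt] for row in ys]
        yield x, y

tokens = list(range(91))  # stand-in for one long stream of token ids
batches = list(lm_batches(tokens, bs=6, bptt=5))

print(len(batches))
for x, y in batches:
    print(x[0], y[0])  # first row of every mini-batch
```

Each of the 6 rows is a contiguous piece of the stream that is 15 tokens long, and every mini-batch takes the next 5 columns. Row 0 of the second mini-batch continues exactly where row 0 of the first one stopped, which is what lets the hidden state carry over.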

Question: Why didn’t you use normal NLP preprocessing techniques like removing stop words, stemming, or lemmatization? These are useful in traditional NLP, but with neural networks they just take important information away from the model.

Question: How to choose the ratio of bptt to bs? Try different things. There is no research about this.

We also need a way to handle texts of different sizes. The first thing is to sort the texts by length, so the first batch contains the texts with the most tokens and the last batch the texts with the fewest. There is some amount of randomness, but this is what it looks like on average. Then, for the shorter texts in a batch, we add “padding”, which is just a special token used to fill the blank positions.
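Sorting and padding can be sketched as follows. The randomness mentioned above is left out to keep the example short, and using id 1 for the padding token is an assumption.

```python
def pad_batch(seqs, pad_id=1):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

def sorted_batches(seqs, bs, pad_id=1):
    """Sort texts from longest to shortest, then cut into padded batches."""
    ordered = sorted(seqs, key=len, reverse=True)
    return [pad_batch(ordered[i:i + bs], pad_id) for i in range(0, len(ordered), bs)]

# Four numericalized texts of different lengths (made-up ids).
seqs = [[5, 6], [7, 8, 9, 10], [11], [12, 13, 14]]
padded = sorted_batches(seqs, bs=2)
for batch in padded:
    print(batch)
```

Because texts of similar length end up in the same batch, only a small amount of padding is needed.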


Now we have imported and preprocessed the data. Next, we are going to create AWD-LSTM. First, we start by creating RNN from scratch.

The basic RNN idea is that there is a layer for every input.
There are a lot of different variations of the basic RNN, and LSTM is one of those.

Sigmoid function (σ): output between 0 and 1
tanh: output between -1 and 1
Xₜ = input
hₜ₋₁ = hidden state
cₜ₋₁ = cell state (like the hidden state, just a rank one tensor)

First, we multiply the input and the hidden state by their weight matrices as usual and add the results together. We split the result into four tensors of equal size, which we then feed into four different paths. The first path is the forget gate. It goes through a sigmoid and is multiplied with the cell state, and because its values are between zero and one it can remove some of the cell values. Then come the input gate and the cell gate. The input gate is multiplied with the cell gate, and the result is added to the forget gate result. This sum is the new cell state that goes on to the next step. It is also run through tanh and then multiplied with the fourth part, the output gate, which gives the new hidden state.

The code is pretty simple.
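The original code is not reproduced in these notes, so here is my own plain Python sketch of one LSTM step, written with lists instead of tensors. The order in which the four chunks are assigned to the gates is an assumption, and so are the helper names.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(w, v):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def lstm_cell(x, h, c, w_ih, w_hh, bias):
    """One LSTM step. w_ih has 4n rows, w_hh is 4n x n, bias has 4n entries."""
    n = len(h)
    pre = [a + b + d for a, b, d in zip(matvec(w_ih, x), matvec(w_hh, h), bias)]
    # Split into four chunks of size n (the gate order is an assumption).
    forget = [sigmoid(v) for v in pre[0:n]]
    inp = [sigmoid(v) for v in pre[n:2 * n]]
    cell = [math.tanh(v) for v in pre[2 * n:3 * n]]
    out = [sigmoid(v) for v in pre[3 * n:4 * n]]
    # New cell state: keep part of the old one, add part of the candidate.
    c_new = [f * ck + i * g for f, ck, i, g in zip(forget, c, inp, cell)]
    # New hidden state: output gate times tanh of the new cell state.
    h_new = [o * math.tanh(ck) for o, ck in zip(out, c_new)]
    return h_new, c_new

n_in, n_hid = 3, 2
w_ih = [[0.0] * n_in for _ in range(4 * n_hid)]
w_hh = [[0.0] * n_hid for _ in range(4 * n_hid)]
bias = [0.0] * (4 * n_hid)
h_new, c_new = lstm_cell([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], w_ih, w_hh, bias)
print(h_new, c_new)
```

With all weights at zero every sigmoid gate is 0.5 and the cell gate is 0, so the new cell state is simply half of the old one. That makes the example easy to check by hand.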

Dropout. With p = 0.5, dropout just creates a mask of 0s and 2s. The value is two because then the mask has a mean of 1, which means that dropout can be removed at inference time without affecting the scale of the activations. In an RNN it is important to apply the same dropout mask to the full sequence at a time. There is also something called Weight Dropout, which is just dropout for weights. The third kind of dropout is Embedding Dropout, which drops a full word embedding at a time.

All these things combined are called AWD-LSTM. In the language model, we put one linear layer on top of the AWD-LSTM. It takes the output, which has the size of a word embedding, and turns it into the prediction of the next word.
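A short sketch of the dropout mask and of dropout applied across a whole sequence, using plain lists. The helper names are hypothetical.

```python
import random

def dropout_mask(size, p, rng):
    """Mask of 0s and 1/(1-p)s; with p = 0.5 the values are 0 and 2."""
    keep = 1.0 - p
    return [(1.0 / keep if rng.random() < keep else 0.0) for _ in range(size)]

def rnn_dropout(seq, p, rng):
    """Apply the same mask to every time step (seq is time x features)."""
    mask = dropout_mask(len(seq[0]), p, rng)
    return [[v * m for v, m in zip(step, mask)] for step in seq]

rng = random.Random(0)
mask = dropout_mask(1000, 0.5, rng)
print(sorted(set(mask)), sum(mask) / len(mask))

seq = [[1.0] * 6 for _ in range(4)]  # 4 time steps, 6 features
dropped = rnn_dropout(seq, 0.5, rng)
for step in dropped:
    print(step)
```

The mean of the mask is close to 1, so the activations keep the same scale on average. In the sequence version the same features are zeroed at every time step.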


This notebook literally combines the previous two notebooks. At a high level there are two steps:

  • import and preprocess the data
  • create the model we made in the previous notebook


Now we use the model we trained in the previous notebook. An important thing to notice is that the IMDb dataset uses a different vocabulary than WikiText-103. To solve this we match the two vocabularies: words that appear in both keep their embeddings from WikiText-103, and the embeddings of the other words are overwritten.
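Matching the vocabularies can be sketched like this. Filling the missing words with the mean of the old embeddings is an assumption on my part, since the notes only say that the others are overwritten. The words and vectors are made up.

```python
def match_embeddings(old_vocab, old_emb, new_vocab):
    """Build an embedding table for new_vocab from one trained on old_vocab."""
    old_index = {word: i for i, word in enumerate(old_vocab)}
    dim = len(old_emb[0])
    # Fallback for unseen words: the mean embedding (assumed choice).
    mean = [sum(row[d] for row in old_emb) / len(old_emb) for d in range(dim)]
    new_emb = []
    for word in new_vocab:
        if word in old_index:
            new_emb.append(list(old_emb[old_index[word]]))
        else:
            new_emb.append(list(mean))
    return new_emb

old_vocab = ["the", "film", "river"]           # stand-in for the WikiText-103 vocab
old_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
new_vocab = ["film", "the", "spoiler"]         # stand-in for the IMDb vocab

new_emb = match_embeddings(old_vocab, old_emb, new_vocab)
print(new_emb)
```

The rows are reordered to follow the new vocabulary, so the same word id can point to a different row than it did in the pretrained model.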

This is the whole thing. There is nothing special in ULMFiT. It is just using transfer learning to get better results.


In the end, Jeremy wanted to say a few words about Swift.

Question: What is the best way to learn Swift and TensorFlow for someone who doesn’t have any knowledge of these? We don’t require you to understand Swift in the course.

A quick summary of this part: Swift is a great language to learn. It will probably be important in the future, and if you don’t learn it now you may need to at some point. The community is still really small, so this is a great chance to be part of something that might become really big. This might be one of those moments that you later look back on and regret choosing the wrong answer, so spend some time thinking about it. Watch at least the following lessons and decide whether to learn more once you have some level of understanding.

This has nothing to do with the course, but I just want to add a link to the project I have been working on since early 2019. TrimmedNews is the fastest way to discuss news. You can join or create private groups where users can comment on interesting news. I’m using deep learning almost everywhere, and one cool feature is summarized news. I realized how much time I spend reading news that isn’t what I expected. With this app, I can find interesting news much more easily than by just reading the headlines.
