fast.ai 2020 — Lesson 8
Language model = a model that tries to predict the next word of a sentence.
A language model works well as the base model for transfer learning because, in order to predict the next word of a sentence, it has to learn a lot about the language.

The pretrained language model should first be fine-tuned into a task-specific language model (e.g., on the target corpus) instead of being used directly in the classifier.
Language model from scratch
- Create a list of all possible unique words (the vocab)
- Replace each word with its index in the vocab
- Create an embedding vector for each word (an embedding matrix)
- Feed the embeddings as the input to a neural network
- Concatenate the documents into one long string
- The independent variable is the sequence of words from the first word to the second-to-last, and the dependent variable is the same sequence from the second word to the last (i.e., shifted by one)
Tokenization = convert the text into a list of words (or characters, or substrings, depending on the model)
Numericalization = give each unique word an index and replace words with these indexes
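A minimal, library-free sketch of these two steps plus the shift-by-one split from the list above (the text here is made up for illustration):

text = "one . two . three . four ."
tokens = text.split(' ')                        # tokenization (word level)
vocab = list(dict.fromkeys(tokens))             # unique tokens, in the order seen
word2idx = {w: i for i, w in enumerate(vocab)}
nums = [word2idx[t] for t in tokens]            # numericalization
x, y = nums[:-1], nums[1:]                      # dependent variable = independent shifted by one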

fastai has a Tokenizer class that adds special tags, for example a tag marking that the next word's first character was capitalized. This way every word ends up in the same lowercased form, so the capitalized and non-capitalized versions share one vocab entry, but the model still gets the capitalization information from the tag placed before the word.

tkn = Tokenizer(spacy)
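Here spacy is the underlying word tokenizer (spacy = WordTokenizer() in the notebook). A hedged sketch of what the output looks like; the exact tokens depend on the text:

tkn("This movie was great!")
# -> roughly: ['xxbos', 'xxmaj', 'this', 'movie', 'was', 'great', '!']
# 'xxbos' marks the beginning of a text, 'xxmaj' marks that the next word
# started with a capital letter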

Subword
In languages like Chinese there are no spaces or clear rules for what counts as a word, which is why subword tokenization is needed there: the most commonly occurring groups of characters are added to the vocab as tokens.
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])
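Hedged usage (the exact pieces depend on the corpus in txts; '▁' marks the start of an original word):

subword(1000)   # larger vocab: most common words stay whole tokens
subword(200)    # tiny vocab: words get split into many short groups of letters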



Jeremy predicted that subword tokenization would become the most popular approach.
For example, a stream of 90 tokens with batch size 6 is cut into 6 mini-streams of 15 tokens each.

With a much longer text those mini-streams would be far too long to feed to the model at once, so each mini-stream is further chopped into fixed-length pieces (the sequence length), arranged so that row i of one batch continues exactly where row i of the previous batch stopped (see the sketch below).
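A rough sketch of that layout in PyTorch (sequence length 5 here just to keep it small):

import torch

stream = torch.arange(90)            # stand-in for a stream of 90 token ids
bs, sl = 6, 5
rows = stream.view(bs, -1)           # 6 mini-streams of 15 tokens each
batches = [rows[:, i:i+sl] for i in range(0, rows.shape[1], sl)]
# batches[0][0] is tokens 0-4, batches[1][0] is tokens 5-9:
# row i of each batch continues where row i of the previous batch stopped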






Classifier
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

In classification the independent variable (the document) can be any length, unlike in the language model where the sequence length is fixed. As in vision, these different-sized inputs are handled by padding them so every sequence in a batch has the same length.
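A minimal sketch of the idea (not fastai's actual collation code), assuming a pad index of 1:

import torch

def pad_batch(seqs, pad_idx=1):
    # pad every sequence in the batch to the length of the longest one
    max_len = max(len(s) for s in seqs)
    return torch.stack([torch.cat([s, s.new_full((max_len - len(s),), pad_idx)])
                        for s in seqs])

pad_batch([torch.tensor([5, 8, 2]), torch.tensor([7, 3])])
# -> tensor([[5, 8, 2],
#            [7, 3, 1]])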
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()
learn = learn.load_encoder('finetuned')
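Fine-tuning then proceeds with gradual unfreezing, roughly like this (the learning rates are just the ones used in the lesson notebook, so treat them as indicative):

learn.fit_one_cycle(1, 2e-2)                            # train only the new classifier head
learn.freeze_to(-2)                                     # unfreeze the last two parameter groups
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))
learn.unfreeze()                                        # finally train the whole model
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))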

Data augmentation is starting to become a big thing in text. There are techniques like translating the text into another language and back (back-translation) to get different wordings of the same content. More on this in the paper referenced in the lesson.
Often it's better to experiment with a small dataset first to get results quickly.
Next we see how a language model built from scratch works. The dataset is fastai's Human Numbers dataset (the first 10,000 numbers written out in English).





After loading the data, the lines are joined into one long stream and split into tokens; each unique token gets an index (the vocab) and the stream is numericalized. The independent variable is then a few tokens and the dependent variable is the token that follows, and these (x, y) pairs are put into DataLoaders.
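Roughly what the notebook does (bs=64 and the 80/20 cut are the values used in the lesson; treat them as indicative):

from fastai.text.all import *

path = untar_data(URLs.HUMAN_NUMBERS)
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())

text = ' . '.join([l.strip() for l in lines])   # one long stream, '.' between numbers
tokens = text.split(' ')                        # tokenization
vocab = L(*tokens).unique()
word2idx = {w: i for i, w in enumerate(vocab)}
nums = L(word2idx[t] for t in tokens)           # numericalization

# x = three tokens, y = the token that follows
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0, len(nums)-4, 3))

bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)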

Language model.
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input to hidden (embedding)
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output

    def forward(self, x):
        # the three input words are fed in one after another,
        # reusing the same embedding and hidden layers each time
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)
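Training it would look roughly like this, assuming the dls and vocab built above (the epoch count and learning rate are just the ones used in the lesson):

learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)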

The same thing as above, refactored with a loop. A network refactored this way, where the same layers are applied repeatedly in a loop, is called a recurrent neural network (RNN).
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

To keep the state between batches, the hidden state is stored in self.h. Calling detach throws away the stored gradient history (so memory and computation don't grow with every batch) but keeps the activation values.
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0.

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()
        return out

    def reset(self): self.h = 0
Back-propagation through time (BPTT) means that we calculate gradients through the whole loop.
During training the hidden state needs to be reset at the beginning of each training and validation epoch; the ModelResetter callback (cbs=ModelResetter) calls reset at the right times.
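Roughly how this is wired up for LMModel3 (the hyperparameters are just the ones used in the lesson notebook):

learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3)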

Previously each x was three words (1–3, then 5–7, and so on) and y was only the single word that followed, so we got one prediction per three inputs. It's better to create more signal and predict the next word after every word: x is words 1 to sl and y is words 2 to sl+1, the same sequence shifted one step.
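A sketch of that layout, reusing the numericalized stream nums from before (sl is the sequence length):

sl = 16
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
         for i in range(0, len(nums)-sl-1, sl))
# each x is sl tokens and each y is the same sl tokens shifted one step forward,
# so there is a target (the next word) at every position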

An RNN reuses the same weight matrices at every step of the loop, so even though it is deep when unrolled it doesn't have many parameters, which limits how complex a function it can learn.
Multilayer RNN
One way to add capacity is to stack RNN layers so the output sequence of one layer becomes the input sequence of the next (nn.RNN does this via its number-of-layers argument).


class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(res)

    def reset(self): self.h.zero_()
Very deep models are hard to train because the activations can vanish or explode: multiplying numbers larger than one many times gives a huge number, and multiplying numbers smaller than one many times gives a number close to zero. The same happens to the gradients, which destroys training.
This happens not just in RNNs but in all very deep neural networks. To train them you either need a really small learning rate (a bad approach) or techniques that keep activations and gradients in a reasonable range.
Long Short-Term Memory (LSTM)
Solves the exploding and vanishing gradient problem. The idea is that there are mini neural networks (gates) inside the layer that decide how much of the previous state is kept, how much is thrown away, and how much of the new input is added.

class LSTMCell(Module):
    def __init__(self, ni, nh):
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate = nn.Linear(ni + nh, nh)
        self.cell_gate = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)

    def forward(self, input, state):
        h, c = state
        h = torch.cat([h, input], dim=1)
        forget = torch.sigmoid(self.forget_gate(h))
        c = c * forget                        # decide how much of the old cell state to keep
        inp = torch.sigmoid(self.input_gate(h))
        cell = torch.tanh(self.cell_gate(h))
        c = c + inp * cell                    # decide how much new information to add
        out = torch.sigmoid(self.output_gate(h))
        h = out * torch.tanh(c)               # decide how much of the cell state to expose
        return h, (h, c)

class LMModel6(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(res)

    def reset(self):
        for h in self.h: h.zero_()
One common way to initialize the hidden-to-hidden weights is with the identity matrix, because then at the start the hidden state passes through unchanged (the identity matrix times a vector returns the same vector).
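A hedged sketch of what that looks like for the hidden-to-hidden layer used above:

import torch
import torch.nn as nn

n_hidden = 64
h_h = nn.Linear(n_hidden, n_hidden)
with torch.no_grad():
    h_h.weight.copy_(torch.eye(n_hidden))   # identity weights: h_h(h) starts out as (roughly) h
    h_h.bias.zero_()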
Regularization helps reduce overfitting.
Dropout

In each mini-batch, dropout randomly zeroes some activations. By randomly deleting activations it makes sure no single activation becomes overly specialized for one specific thing; the model learns to spread the work across all of them.
class Dropout(Module):
    def __init__(self, p): self.p = p
    def forward(self, x):
        if not self.training: return x
        mask = x.new(*x.shape).bernoulli_(1 - self.p)
        # dividing by (1-p) keeps the expected scale of the activations
        # the same at training and inference time
        return x * mask.div_(1 - self.p)
Activation Regularization (AR)
loss += alpha * activations.pow(2).mean()
Temporal Activation Regularization (TAR)
loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()
Weight Tying
class LMModel7(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, p):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.drop = nn.Dropout(p)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h_o.weight = self.i_h.weight   # weight tying
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        raw, h = self.rnn(self.i_h(x), self.h)
        out = self.drop(raw)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(out), raw, out

    def reset(self):
        for h in self.h: h.zero_()
This is the same
learn = Learner(dls, LMModel7(len(vocab), 64, 2, 0.5),
                loss_func=CrossEntropyLossFlat(), metrics=accuracy,
                cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)])
as this, because TextLearner adds the ModelResetter and RNNRegularizer callbacks automatically:
learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, 0.4),
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)
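Training then runs roughly like this (the values are just the ones used in the lesson notebook):

learn.fit_one_cycle(15, 1e-2, wd=0.1)   # wd adds weight decay as a further regularizer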
Originally posted: https://www.notion.so/lankinen/Folder-Fast-ai-Part-1-2020-e6bc5e0f9bce4d4d9f494ec8259b1119