fast.ai 2020 — Lesson 7
Weight decay (L2 regularization)
The idea is to add the sum of all the weights squared to the loss function. This makes the model keep the weights as small as possible, because bigger weights increase the final loss.
loss_with_wd = loss + wd * (parameters**2).sum()
It’s often inefficient to compute that big sum, so in practice the equivalent term is added directly to the gradient.
weight.grad += wd * 2 * weight
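A minimal sketch of folding weight decay into a manual SGD step (the wd and lr values and the model variable are just illustrative, not from the lesson code):
import torch

wd, lr = 0.1, 1e-3
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p.grad += wd * 2 * p      # same effect as adding wd * (p**2).sum() to the loss
            p -= lr * p.grad          # plain SGD step
            p.grad.zero_()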
Creating our own Embedding module
def create_params(size):
    # a trainable tensor of the given size, initialized from N(0, 0.01)
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
How Google Play decides what to recommend (a combination of a tabular model and a collaborative filtering model):

Ordinal columns (strings that have a natural ordering) should be ordered so that the order makes sense, e.g. “small”, “medium”, “big”.
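A minimal sketch of doing this in pandas, assuming a hypothetical size column in df:
import pandas as pd

sizes = ['small', 'medium', 'big']
df['size'] = df['size'].astype('category')                        # make it categorical
df['size'] = df['size'].cat.set_categories(sizes, ordered=True)   # give it the right order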
Decision tree = a model that asks binary (yes/no) questions and arrives at an answer by following the path of answers.

To build a decision tree automatically:
- Try splitting every column into two groups using some condition.
- Find the average target value for each group (so if the target is price, it’s the average price for each group).
- Now there is a simple model for each candidate split, and the idea is to pick the one that gives the smallest RMSE as the first question.
- Then go back to the start and do the same thing for both leaves separately (a minimal sketch of scoring one candidate split is shown after this list).
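A minimal sketch of scoring one candidate split by RMSE, assuming a DataFrame df with a numeric column col, a threshold thr, and a target column named in dep_var (all illustrative names):
import numpy as np

def split_rmse(df, col, thr, dep_var):
    mask = df[col] <= thr
    sq_err = 0.0
    for group in (df[dep_var][mask], df[dep_var][~mask]):
        if len(group) == 0:
            return np.inf                         # useless split: everything on one side
        # each group is "predicted" by its own mean; accumulate the squared errors
        sq_err += ((group - group.mean())**2).sum()
    return np.sqrt(sq_err / len(df))

# the best first question is the (col, thr) pair with the smallest split_rmse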
dep_var = 'SalePrice'
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
procs = [Categorify, FillMissing]
splits = (list(train_idx), list(valid_idx))
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
The decision tree stops splitting a leaf either when no column in it has two different values left, or when it hits a stopping criterion such as a minimum number of rows per leaf; it’s usually useful to set that minimum to something bigger than one.
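For example, with scikit-learn (the min_samples_leaf value is illustrative; to.train.xs and to.train.y come from the TabularPandas object above):
from sklearn.tree import DecisionTreeRegressor

xs, y = to.train.xs, to.train.y
m = DecisionTreeRegressor(min_samples_leaf=25)   # stop splitting below 25 rows per leaf
m.fit(xs, y)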
Categorical columns can be turned into numbers, and if they are not ordinal it doesn’t matter much in what order they are labeled.
Random forest = multiple decision trees, each trained on a random subset of the data, where the final prediction is the average of the trees’ predictions.
Out-of-bag error (OOB) = error calculated for each row of the training set using only the trees that didn’t see that row during training.
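A sketch with scikit-learn (hyperparameter values are illustrative; xs and y are the training data from above):
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40,       # number of trees
                          max_features=0.5,      # each split considers half of the columns
                          min_samples_leaf=5,
                          oob_score=True,        # compute the out-of-bag score
                          n_jobs=-1)
m.fit(xs, y)
print(m.oob_score_)                              # R^2 estimated from out-of-bag rows only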
With a random forest it’s possible to see the following things about the data (see the sketch after this list):
- Confidence of the predictions, by taking the standard deviation of the predictions across the trees.
- Feature importance, which tells how important a certain column is for the predictions. If the feature importance of a column is near zero, the model might do just as well without it.
- Correlated columns. One of these can again be removed without affecting the model accuracy too much.
- Partial dependence: plots that show the prediction against one column. “Partial” instead of just the average target value, because other columns might change too and we want to see the effect of this column without them. It is calculated by replacing all values in YearMade with 1950 and computing the predictions, then repeating the same thing for 1951, 1952, and so on.
- Tree interpreter, which shows how individual columns change the prediction for a single row.
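A minimal sketch of the first two, assuming a fitted RandomForestRegressor m and a validation set valid_xs (names are assumptions):
import numpy as np

# per-tree predictions, shape (n_trees, n_rows)
preds = np.stack([t.predict(valid_xs.values) for t in m.estimators_])
preds_std = preds.std(0)            # high std = the trees disagree = low confidence

# feature importance per column, largest first
fi = sorted(zip(m.feature_importances_, valid_xs.columns), reverse=True)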

A random forest can’t predict values outside the range of the targets it saw during training, which is a problem with time series data. More generally, a random forest doesn’t extrapolate well to data unlike anything it has seen.
To check for out-of-domain data, combine the training and validation sets and train a model to predict whether each row comes from the validation set. If the model can tell which rows are from the validation set, the two sets are different, and feature importance shows exactly which columns differentiate them.
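A sketch of that check, assuming train_xs and valid_xs DataFrames with the same columns (names are assumptions):
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df_dom = pd.concat([train_xs, valid_xs])
is_valid = np.array([0]*len(train_xs) + [1]*len(valid_xs))   # the new target

m_dom = RandomForestRegressor(n_estimators=40, min_samples_leaf=15, n_jobs=-1)
m_dom.fit(df_dom, is_valid)

# columns with high importance are the ones that differ between the two sets
print(sorted(zip(m_dom.feature_importances_, df_dom.columns), reverse=True)[:6])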
Neural networks and random forests have their own advantages and disadvantages, and Jeremy got better results by averaging the predictions of both models.
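A sketch of that ensemble, assuming m is the fitted forest, learn the fastai tabular learner, and valid_xs / dls.valid the matching validation data (names are assumptions):
rf_preds = m.predict(valid_xs)                                   # random forest predictions
nn_preds = learn.get_preds(dl=dls.valid)[0].squeeze().numpy()    # neural net predictions
ens_preds = (rf_preds + nn_preds) / 2                            # simple average of the two models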
Boosting
- Train a small model which underfits the dataset.
- Subtract the predictions from the targets (= residuals).
- Go back to step one, but instead of the targets use the residuals. This loop continues until it reaches some stopping criterion like a maximum number of trees (a minimal sketch is shown after this list).
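A minimal from-scratch sketch of this loop with small trees (the function names and the max_leaf_nodes value are illustrative):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(xs, y, n_trees=10, max_leaf_nodes=4):
    trees, residuals = [], y.copy()
    for _ in range(n_trees):
        t = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(xs, residuals)
        trees.append(t)
        residuals = residuals - t.predict(xs)    # the next tree fits what is left over
    return trees

def boost_predict(trees, xs):
    # the prediction is the sum (not the average) of the trees' predictions
    return np.sum([t.predict(xs) for t in trees], axis=0)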
Some research showed that first creating embeddings with a neural network and then using them in a random forest and other models gave much better results.
Originally posted: https://www.notion.so/lankinen/Folder-Fast-ai-Part-1-2020-e6bc5e0f9bce4d4d9f494ec8259b1119