Training a State-of-the-Art Model

  • the dataset you are given is not necessarily the dataset you want.
  • aim to have an iteration speed of no more than a couple of minutes
    • think about how you can cut down your dataset, or simplify your model to improve your experimentation speed
  • the more experiments you can do, the better

Imagenette

from fastai.vision.all import *

path = untar_data(URLs.IMAGENETTE)
path
Path('/home/innom-dt/.fastai/data/imagenette2')

parent_label
<function fastai.data.transforms.parent_label(o)>
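fastai's parent_label simply reads an item's parent directory name and uses it as the label. A rough plain-Python equivalent (the helper name here is hypothetical, just for illustration):

```python
from pathlib import Path

# Minimal sketch of what parent_label does: label a file
# by the name of the folder that contains it.
def parent_label_sketch(o):
    return Path(o).parent.name

print(parent_label_sketch('/data/imagenette2/train/n03028079/img_001.jpg'))
# -> n03028079
```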

dblock = DataBlock(blocks=(
    # TransformBlock for images
    ImageBlock(), 
    # TransformBlock for single-label categorical target
    CategoryBlock()),
                   # recursively load image files from path
                   get_items=get_image_files,
                   # label images using the parent folder name
                   get_y=parent_label,
                   # presize images to 460px
                   item_tfms=Resize(460),
                   # Batch resize to 224 and perform data augmentations
                   batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = dblock.dataloaders(path, bs=64, num_workers=8)

xresnet50
<function fastai.vision.models.xresnet.xresnet50(pretrained=False, **kwargs)>

CrossEntropyLossFlat
fastai.losses.CrossEntropyLossFlat

# Initialize the model without pretrained weights
model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
epoch train_loss valid_loss accuracy time
0 1.672769 3.459394 0.301718 00:59
1 1.224001 1.404229 0.552651 01:00
2 0.968035 0.996460 0.660941 01:00
3 0.699550 0.709341 0.771471 01:00
4 0.578120 0.571692 0.820388 01:00

# Initialize the model without pretrained weights
model = xresnet50(n_out=dls.c)
# Use mixed precision
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy).to_fp16()
learn.fit_one_cycle(5, 3e-3)
epoch train_loss valid_loss accuracy time
0 1.569645 3.962554 0.329724 00:33
1 1.239950 2.608771 0.355489 00:33
2 0.964794 0.982138 0.688200 00:34
3 0.721289 0.681677 0.791636 00:33
4 0.606473 0.581621 0.824122 00:33

Normalization

  • normalized data: has a mean value of 0 and a standard deviation of 1
  • it is easier to train models with normalized data
  • normalization is especially important when using pretrained models
    • make sure to use the same normalization stats the pretrained model was trained on
x,y = dls.one_batch()
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])
(TensorImage([0.4498, 0.4448, 0.4141], device='cuda:0'),
 TensorImage([0.2893, 0.2792, 0.3022], device='cuda:0'))
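The normalization math itself is just a per-channel shift and scale. A minimal NumPy sketch on fake data (not the actual batch above):

```python
import numpy as np

# A fake batch of 64 RGB "images", pixel values in [0, 1]
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(64, 3, 224, 224))

# Compute per-channel mean and std across batch, height, and width
mean = x.mean(axis=(0, 2, 3), keepdims=True)
std = x.std(axis=(0, 2, 3), keepdims=True)

# Normalized data has mean 0 and standard deviation 1
x_norm = (x - mean) / std
print(x_norm.mean(), x_norm.std())  # approximately 0 and 1
```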

Normalize

Normalize
fastai.data.transforms.Normalize

Normalize.from_stats
<bound method Normalize.from_stats of <class 'fastai.data.transforms.Normalize'>>

def get_dls(bs, size):
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       get_items=get_image_files,
                       get_y=parent_label,
                       item_tfms=Resize(460),
                       batch_tfms=[*aug_transforms(size=size, min_scale=0.75),
                                   Normalize.from_stats(*imagenet_stats)])
    return dblock.dataloaders(path, bs=bs)

dls = get_dls(64, 224)

x,y = dls.one_batch()
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])
(TensorImage([-0.2055, -0.0843,  0.0192], device='cuda:0'),
 TensorImage([1.1835, 1.1913, 1.2377], device='cuda:0'))

model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy).to_fp16()
learn.fit_one_cycle(5, 3e-3)
epoch train_loss valid_loss accuracy time
0 1.545518 3.255928 0.342046 00:35
1 1.234556 1.449043 0.560866 00:35
2 0.970857 1.310043 0.617252 00:35
3 0.736170 0.770678 0.758402 00:36
4 0.619965 0.575979 0.822629 00:36

Progressive Resizing

  • start training with smaller images and end training with larger images
    • gradually using larger and larger images as you train
  • used by a team of fast.ai students to win the DAWNBench competition in 2018
  • smaller images help training complete much faster
  • larger images help make the final accuracy much higher
  • progressive resizing serves as another form of data augmentation
    • should result in better generalization
  • progressive resizing might hurt performance when using transfer learning
    • most likely to happen if your pretrained model was very similar to your target task and the dataset it was trained on had similar-sized images
dls = get_dls(128, 128)
learn = Learner(dls, xresnet50(n_out=dls.c), loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy).to_fp16()
learn.fit_one_cycle(4, 3e-3)
epoch train_loss valid_loss accuracy time
0 1.627504 2.495554 0.393951 00:21
1 1.264693 1.233987 0.613518 00:21
2 0.970736 0.958903 0.707618 00:21
3 0.740324 0.659166 0.794996 00:21

learn.dls = get_dls(64, 224)
learn.fine_tune(5, 1e-3)
epoch train_loss valid_loss accuracy time
0 0.828744 1.024683 0.669529 00:36
epoch train_loss valid_loss accuracy time
0 0.670041 0.716627 0.776326 00:36
1 0.689798 0.706051 0.768857 00:36
2 0.589789 0.519608 0.831217 00:35
3 0.506784 0.436529 0.870426 00:36
4 0.453270 0.401451 0.877147 00:36

Test Time Augmentation

  • during inference or validation, create multiple augmented versions of each image, then take the average or maximum of the predictions across those versions
  • can result in dramatic improvements in accuracy, depending on the dataset
  • does not change the time required to train
  • will increase the amount of time required for validation or inference
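The averaging step can be sketched in plain NumPy (hypothetical per-augmentation predictions, not fastai's `Learner.tta` implementation):

```python
import numpy as np

# Hypothetical predictions from 4 augmented versions of 2 images,
# over 3 classes: shape (n_augmentations, n_images, n_classes)
aug_preds = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]],
    [[0.8, 0.1, 0.1], [0.1, 0.6, 0.3]],
    [[0.5, 0.4, 0.1], [0.2, 0.6, 0.2]],
])

# Test-time augmentation: average predictions across augmented versions
tta_preds = aug_preds.mean(axis=0)
final_classes = tta_preds.argmax(axis=1)
print(final_classes)  # class 0 for the first image, class 1 for the second
```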

Learner.tta

learn.tta
<bound method Learner.tta of <fastai.learner.Learner object at 0x7f75b4be5f40>>

preds,targs = learn.tta()
accuracy(preds, targs).item()
0.882001519203186

Mixup

  • a powerful data augmentation technique that can provide dramatically higher accuracy, especially when you don’t have much data and don’t have a pretrained model
  • introduced in the 2017 paper mixup: Beyond Empirical Risk Minimization
    • “While data augmentation consistently leads to improved generalization, the procedure is dataset-dependent, and thus requires the use of expert knowledge”
  • Mixup steps
    1. Select another image from your dataset at random
    2. Pick a weight at random
    3. Take a weighted average of the selected image with your image, to serve as your independent variable
    4. Take a weighted average of this image’s labels with your image’s labels, to serve as your dependent variable
  • target needs to be one-hot encoded
  • $\tilde{x} = \lambda x_{i} + (1 - \lambda) x_{j} \text{, where } x_{i} \text{ and } x_{j} \text{ are raw input vectors}$
  • $\tilde{y} = \lambda y_{i} + (1 - \lambda) y_{j} \text{, where } y_{i} \text{ and } y_{j} \text{ are one-hot label encodings}$
  • more difficult to train
  • less prone to overfitting
  • requires far more epochs of training to get better accuracy
  • can be applied to types of data other than photos
  • can even be used on activations inside a model
  • resolves the issue where it is not typically possible to achieve a perfect loss score
    • our labels are 1s and 0s, but the outputs of softmax and sigmoid can never equal 1 or 0
    • with Mixup our labels will only be exactly 1 or 0 if two images from the same class are mixed
  • Mixup is “accidentally” making the labels bigger than 0 or smaller than 1
    • can be resolved with Label Smoothing
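The two equations above can be sketched in plain NumPy. This is only an illustration of the math for a single pair; in the paper λ is drawn at random from a Beta distribution, and fastai's MixUp callback handles this for whole batches:

```python
import numpy as np

def mixup_pair(x_i, x_j, y_i, y_j, lam):
    """Mix two inputs and their one-hot labels with weight lam."""
    x_tilde = lam * x_i + (1 - lam) * x_j
    y_tilde = lam * y_i + (1 - lam) * y_j
    return x_tilde, y_tilde

# Two toy "images" (flattened) with one-hot labels over 3 classes
x_i, y_i = np.ones(4), np.array([1., 0., 0.])
x_j, y_j = np.zeros(4), np.array([0., 1., 0.])

x_mix, y_mix = mixup_pair(x_i, x_j, y_i, y_j, lam=0.3)
print(y_mix)  # weighted labels: 0.3 for class 0, 0.7 for class 1
```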
# Get two images from different classes
church = PILImage.create(get_image_files_sorted(path/'train'/'n03028079')[0])
gas = PILImage.create(get_image_files_sorted(path/'train'/'n03425413')[0])
# Resize images
church = church.resize((256,256))
gas = gas.resize((256,256))

# Scale pixel values to the range [0,1]
tchurch = tensor(church).float() / 255.
tgas = tensor(gas).float() / 255.

_,axs = plt.subplots(1, 3, figsize=(12,4))
# Show the first image
show_image(tchurch, ax=axs[0]);
# Show the second image
show_image(tgas, ax=axs[1]);
# Take the weighted average of the two images
show_image((0.3*tchurch + 0.7*tgas), ax=axs[2]);

(Figure: the church image, the gas station image, and their 0.3/0.7 weighted average)


model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy, cbs=MixUp).to_fp16()
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 2.332906 1.680691 0.431292 00:21
1 1.823880 1.699880 0.481329 00:21
2 1.660909 1.162998 0.650112 00:21
3 1.520751 1.302749 0.582524 00:21
4 1.391567 1.256566 0.595967 00:21
5 1.308175 1.193670 0.638163 00:21
6 1.224825 0.921357 0.706871 00:21
7 1.190292 0.846658 0.733383 00:21
8 1.124314 0.707856 0.780807 00:21
9 1.085013 0.701829 0.778193 00:21
10 1.028223 0.509176 0.851008 00:21
11 0.992827 0.518169 0.845780 00:21
12 0.945492 0.458248 0.864078 00:21
13 0.923450 0.418989 0.871546 00:21
14 0.904607 0.416422 0.876400 00:21

Label Smoothing

  • Rethinking the Inception Architecture for Computer Vision
  • in classification problems, the targets in the theoretical expression of the loss are one-hot encoded
    • the model is trained to return 0 for all categories but one, for which it is trained to return 1
    • this encourages overfitting and gives you a model that does not produce meaningful probabilities at inference time
    • this can be harmful if your data is not perfectly labeled
  • label smoothing: replace all our 1s with a number that is a bit less than 1, and our 0s with a number that is a bit more than 0
    • encourages your model to be less confident
    • makes your training more robust, even if there is mislabeled data
    • results in a model that generalizes better at inference
  • Steps
    1. start with one-hot encoded labels
    2. replace all 0s with $\frac{\epsilon}{N}$ where $N$ is the number of classes and $\epsilon$ is a parameter (usually 0.1)
    3. replace all 1s with $1 - \epsilon + \frac{\epsilon}{N}$ to make sure the labels add up to 1
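The steps above can be sketched in a few lines of NumPy (an illustration of the label transform, not fastai's LabelSmoothingCrossEntropy):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: replace 0s with eps/N and 1s with 1 - eps + eps/N."""
    n = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / n

# One-hot label for class 3 out of 10 classes
y = np.eye(10)[3]
print(smooth_labels(y))  # 0.91 at index 3, 0.01 everywhere else; sums to 1
```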

model = xresnet50()
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy).to_fp16()
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 2.796061 2.399328 0.513443 00:21
1 2.335293 2.222970 0.584391 00:21
2 2.125152 2.478721 0.490291 00:21
3 1.967522 1.977260 0.690441 00:21
4 1.853788 1.861635 0.715459 00:21
5 1.747451 1.889759 0.699776 00:21
6 1.683000 1.710128 0.770351 00:21
7 1.610975 1.672254 0.780807 00:21
8 1.534964 1.691175 0.769231 00:21
9 1.480721 1.490685 0.842420 00:21
10 1.417200 1.463211 0.852502 00:21
11 1.360376 1.395671 0.867812 00:21
12 1.312882 1.360292 0.887603 00:21
13 1.283740 1.346170 0.890217 00:21
14 1.264030 1.339298 0.892830 00:21

Label Smoothing, Mixup and Progressive Resizing

dls = get_dls(128, 128)
model = xresnet50()
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy, cbs=MixUp).to_fp16()
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 3.045166 2.561215 0.449589 00:21
1 2.642317 2.906508 0.405900 00:21
2 2.473271 2.389416 0.516804 00:21
3 2.356234 2.263084 0.557506 00:21
4 2.268788 2.401770 0.544436 00:21
5 2.181318 2.040797 0.650485 00:21
6 2.122742 1.711615 0.761762 00:21
7 2.068317 1.961520 0.688200 00:21
8 2.022716 1.751058 0.743839 00:21
9 1.980203 1.635354 0.792009 00:21
10 1.943118 1.711313 0.758028 00:21
11 1.889408 1.454949 0.854742 00:21
12 1.853412 1.433971 0.862584 00:21
13 1.847395 1.412596 0.867438 00:22
14 1.817760 1.409608 0.875280 00:23

learn.dls = get_dls(64, 224)
learn.fine_tune(10, 1e-3)
epoch train_loss valid_loss accuracy time
0 1.951753 1.672776 0.789395 00:36
epoch train_loss valid_loss accuracy time
0 1.872399 1.384301 0.892457 00:36
1 1.860005 1.441491 0.864078 00:36
2 1.876859 1.425859 0.867438 00:36
3 1.851872 1.460640 0.863331 00:36
4 1.840423 1.413441 0.880508 00:36
5 1.808990 1.444332 0.863704 00:36
6 1.777755 1.321098 0.910754 00:36
7 1.761589 1.312523 0.912621 00:36
8 1.756679 1.302988 0.919716 00:36
9 1.745481 1.304583 0.918969 00:36

Papers and Math

References