Sunday, December 17, 2017

Week 2 - Lesson 1 v2: CNNs, Differential Learning Rates, Data Augmentation, State-of-the-Art Image Classifiers

Jeremy emphasized that a learning rate that is too low is a better problem to have than one that is too high. If the learning rate is too high, each step can overshoot the minimum and the loss may diverge.

Following the Cyclical Learning Rates paper, we gradually increase the learning rate on each mini-batch iteration until we reach a point where the loss starts to worsen. From the resulting plot of loss vs. learning rate, the optimal learning rate is usually one or two orders of magnitude smaller than the learning rate at the minimum point of the curve (in the graph below, learning_rate = 0.01).
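The idea can be sketched on a toy 1-D problem (this is not the library's `lr_find`; the quadratic loss and growth factor are illustrative assumptions):

```python
# Toy sketch of the learning-rate range test: increase the LR
# multiplicatively each step and stop once the loss clearly worsens.

def loss(w):
    return (w - 3.0) ** 2          # toy loss with minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)

def lr_range_test(w=0.0, lr=1e-4, factor=1.3, max_steps=100):
    history = []                    # (lr, loss) pairs
    best = loss(w)
    for _ in range(max_steps):
        history.append((lr, loss(w)))
        w -= lr * grad(w)           # one SGD step at the current LR
        cur = loss(w)
        if cur > 4 * best:          # loss blowing up -> stop the test
            break
        best = min(best, cur)
        lr *= factor                # exponentially increase the LR
    return history

history = lr_range_test()
# pick a rate roughly an order of magnitude below the lowest-loss point
best_lr = min(history, key=lambda p: p[1])[0] / 10
```

Plotting `history` would reproduce the loss-vs-learning-rate curve described above.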

Data Augmentation

The most important strategy for improving your model is to get more data, which also helps prevent overfitting. One way to do this is data augmentation. The library supports multiple augmentation strategies; simple transforms include horizontal and vertical flips. Which transforms make sense depends on the orientation of your images. Data augmentation also trains the model to recognize images from different angles.
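As a minimal illustration of the two flip transforms mentioned above (real augmentation in the library operates on image tensors; here an "image" is just a list of pixel rows):

```python
# Horizontal and vertical flips on an image stored as a list of rows.

def hflip(img):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Reverse the order of rows (top-to-bottom mirror)."""
    return img[::-1]

img = [[1, 2],
       [3, 4]]
hflip(img)  # [[2, 1], [4, 3]]
vflip(img)  # [[3, 4], [1, 2]]
```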
  • It's always a good idea to rerun the learning rate finder after you make significant changes to your architecture, such as unfreezing early layers.
  • Precomputed activations are the cached outputs of the frozen earlier layers, computed once from the original images. While precompute=True, augmented images never actually reach the network, so to take advantage of data augmentation in the library we set learn.precompute = False, giving the system a chance to compute new activations for the augmented data.
  • We typically want to decrease the learning rate as we get closer to the minimum (learning rate annealing); the simplest version is stepwise annealing. A better way is to follow a functional form such as one half of a cosine curve, known as cosine annealing.
  • In the diagram below, the image on the left uses a regular learning rate schedule and converges to a single minimum at the end of training, whereas the image on the right shows multiple jumps. Each jump starts a new annealing cycle, whose length is controlled by the cycle_len parameter in the library; with cosine annealing, cycle_len=1 means one annealing cycle per epoch. The image on the right is using stochastic gradient descent with restarts, described in depth in the Snapshot Ensembles paper. The restarts help you get unstuck from poor local minima and give the optimizer many opportunities to find a good minimum.
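The cosine-annealing-with-restarts schedule can be sketched as a small function (cycle length is measured in iterations here for simplicity; the library works in epochs):

```python
import math

# SGDR schedule: within each cycle the LR falls from lr_max toward 0
# along half a cosine curve, then jumps back up at the restart.

def sgdr_lr(iteration, lr_max=0.1, cycle_len=100):
    t = iteration % cycle_len                  # position within the cycle
    return lr_max * 0.5 * (1 + math.cos(math.pi * t / cycle_len))

sgdr_lr(0)    # 0.1  (start of a cycle: full learning rate)
sgdr_lr(50)   # 0.05 (halfway: cosine has decayed to lr_max / 2)
sgdr_lr(100)  # 0.1  (restart: jumps back up to lr_max)
```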

Courtesy of Huang et al., from the Snapshot Ensembles paper


  • When fine-tuning a pre-trained network, the early layers usually need little or no refitting, since they capture general-purpose features (e.g. lines and edges). The deeper (later) layers capture higher-level representations and typically do need to be fine-tuned on your dataset, especially if your images are very different from those the network was pre-trained on, e.g. training on satellite images with a network pre-trained on ImageNet. We can also define an array of learning rates (differential learning rates) and apply a different rate to each part of the architecture, including the unfrozen pre-trained layers.
  • We can also vary the length of each cycle when using SGD with restarts via the cycle_mult parameter in the library. You can think of this as a way to adjust the model's explore/exploit trade-off, helping it find a good minimum faster.
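The effect of cycle_mult on the schedule can be sketched as follows (the parameter names mirror the library's, but the function itself is illustrative):

```python
# With cycle_len=1 and cycle_mult=2, successive SGDR cycles last
# 1, 2, then 4 epochs (7 epochs total for 3 cycles): short early
# cycles explore, long later cycles settle into a minimum.

def cycle_lengths(n_cycles, cycle_len=1, cycle_mult=2):
    lengths = []
    for _ in range(n_cycles):
        lengths.append(cycle_len)
        cycle_len *= cycle_mult    # each cycle is cycle_mult x longer
    return lengths

cycle_lengths(3)  # [1, 2, 4]
```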

  • We also used Test Time Augmentation (TTA) to improve our model further. When making predictions on the validation dataset, we take 4 random augmentations of each individual image, predict labels for those augmentations plus the original validation image, and then average the predictions. This technique typically reduces validation error by 10%-20%.
  • As a general rule of thumb when setting differential learning rates: for a dataset very similar to the data your pre-trained network was trained on, you'd want a 10x difference between the learning rates of successive layer groups; otherwise a 3x difference is recommended.
  • One way to avoid overfitting is to start your model training on smaller images for a few epochs and then switch to larger images to continue training.
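The TTA averaging step described above can be sketched like this (the `model` and augmentations here are toy stand-ins, not the library's `learn.TTA()`):

```python
# Test-time augmentation: run the model on the original image plus
# a few augmented copies and average the class-probability vectors.

def tta_predict(model, image, augments):
    inputs = [image] + [aug(image) for aug in augments]
    preds = [model(x) for x in inputs]
    n = len(preds)
    # element-wise average of the probability vectors
    return [sum(p[i] for p in preds) / n for i in range(len(preds[0]))]

# toy 2-class "model": probabilities depend only on the pixel sum
def model(img):
    s = sum(sum(row) for row in img)
    p = min(max(s / 10.0, 0.0), 1.0)
    return [p, 1.0 - p]

img = [[1, 2], [3, 4]]                              # pixel sum 10
augments = [lambda im: [row[::-1] for row in im]]   # horizontal flip
tta_predict(model, img, augments)  # flip keeps the sum -> [1.0, 0.0]
```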

Summary of steps to build a state-of-the-art image classifier using the library
  • Enable data augmentation, and precompute=True
  • Use lr_find() to find highest learning rate where loss is still clearly improving
  • Train last layer from precomputed activations for 1-2 epochs
  • Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
  • Unfreeze all layers
  • Set earlier layers to 3x-10x lower learning rate than next higher layer
  • Use lr_find() again
  • Train full network with cycle_mult=2 until over-fitting
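The checklist above roughly maps onto the 0.7-era fastai API as sketched below; the exact names (ConvLearner, lr_find, TTA, etc.) are recalled from the course rather than verified against a current release, and the architecture, paths, and learning rates are placeholder assumptions.

```python
from fastai.conv_learner import *  # 0.7-era import; assumption

arch = resnet34
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on)
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)  # step 1

learn.lr_find()                      # step 2: read the LR off the plot
learn.fit(1e-2, 2)                   # step 3: last layer, precomputed
learn.precompute = False
learn.fit(1e-2, 3, cycle_len=1)      # step 4: with data augmentation
learn.unfreeze()                     # step 5: unfreeze all layers
lrs = np.array([1e-4, 1e-3, 1e-2])   # step 6: differential LRs
learn.lr_find()                      # step 7: re-check after unfreezing
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)  # step 8: SGDR
log_preds, y = learn.TTA()           # test-time augmentation
```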
