Sunday, December 31, 2017

Year in Review

To say 2017 was an interesting year would be quite the understatement.

We saw the explosion of Bitcoin, blockchain and cryptocurrencies. Too bad if you didn't get in on the crypto craze before the wave hit.

We also witnessed the growth and maturity of several Deep Learning frameworks.

We were wowed once again by the Model 3 launch and the Tesla Semi / Roadster v2 unveil. We also got more insight into the Model Y and the Falcon Heavy launch.

We also witnessed more products and services powered by Deep Learning / AI and more maturity in the different DL / ML frameworks out there. I really do feel that a lot of the worthwhile work in this space going forward will be about two things: applying these tools and techniques to real problems that have a real impact on our lives, rather than producing things that only end up in research papers (even though that's important too), and making these tools more accessible.

I'll leave you with a summary of the awesome things other folks did in 2017.

Week 3 : - 1 v2 - More CNN Internals, Structured Data

Jeremy did a more in-depth run-through of the ConvNet pipeline in Excel.

Architectures that make use of fully connected layers, like VGG, have a lot of weights, which means they are prone to overfitting and can also be very slow to train.

The library automatically switches into multi-label mode for multi-label classification problems by changing the final activation from softmax to sigmoid.
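
To see why that switch matters, here is a minimal sketch in plain Python (not the library's code). Softmax makes the labels compete for a single probability budget, while sigmoid scores each label on its own, so several labels can be "on" at once. The logits and the 0.5 threshold are made-up values for illustration.

```python
import math

def softmax(logits):
    """Normalize logits into probabilities that sum to 1 (single-label case)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Squash each logit independently into (0, 1) (multi-label case)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

logits = [2.0, 1.5, -3.0]                  # made-up scores for three labels
single_label = softmax(logits)             # probabilities compete and sum to 1
multi_label = sigmoid(logits)              # each label is judged on its own
present = [p > 0.5 for p in multi_label]   # 0.5 threshold is an assumption
```

With these scores the first two labels both pass the threshold under sigmoid, which softmax could never express since its outputs have to share a total of 1.
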

  • Using differential learning rates applies your array of learning rates across different layer groups, which the library handles automatically. The last learning rate is typically applied to your fully connected layers, while the other n-1 learning rates are spread evenly over the remaining layer groups in your network.
  • data.resize() reduces the input image sizes which helps to speed up your training especially if you're working with large images
  • You can define custom metrics within the library
  • You typically want to fine-tune your fully connected layers first if you're using a pre-trained network before you unfreeze the pre-trained weights in your convolution layers.
  • learn.summary() shows you the model architecture summary
  • add_datepart() pulls interesting features from time series data
  • pandas feather data format is a really efficient way to dump data in binary format  
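
To give a feel for the kind of features you can pull out of a date column, here is a rough sketch using only the standard library. This is not the library's add_datepart(); date_parts is a hypothetical helper and the choice of fields is my own assumption.

```python
from datetime import date

def date_parts(d):
    """Expand a date into simple features a model can use directly."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "day_of_week": d.weekday(),            # Monday == 0, Sunday == 6
        "day_of_year": d.timetuple().tm_yday,
        "is_month_start": d.day == 1,
        "is_weekend": d.weekday() >= 5,
    }

features = date_parts(date(2017, 12, 31))
```

A single date turns into several columns, which lets a model pick up on weekly or yearly patterns it could not see in the raw timestamp.
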

Some Useful links

Sunday, December 17, 2017

Week 2 : - 1 v2 - CNNs, Differential Learning Rates, Data Augmentation, State of the art Image Classifiers

Jeremy emphasized that a learning rate that is too low is a better problem to have than one that is too high. If the learning rate is too high, you'll probably overshoot the global minimum.
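
A toy example makes the difference concrete. The sketch below runs plain gradient descent on loss(w) = w**2, which is my own stand-in for a real network; the learning rates are arbitrary values picked for illustration.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize loss(w) = w**2 with plain gradient descent; return the final |w|."""
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad
    return abs(w)

too_low = gradient_descent(lr=0.01)    # slow but steady progress toward 0
too_high = gradient_descent(lr=1.1)    # every step overshoots and lands farther away
```

With the low rate you end up closer to the minimum than where you started, just slowly. With the high rate each update jumps past the minimum and the distance keeps growing.
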

Following the Cyclical Learning Rate paper, we gradually increase the learning rate with each mini-batch iteration until we get to a point where the loss starts to worsen. From the graph of loss vs learning rate below, we find that the optimal learning rate is usually one or two orders of magnitude smaller than the learning rate at the minimum point on the graph (in the graph below learning_rate = 0.01).
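
Here is a rough sketch of that procedure on a toy problem. It is not the library's lr_find(); lr_range_test is a hypothetical helper, the loss is again loss(w) = w**2, and the stopping rule (loss more than 4x the best seen so far) is an assumption of mine.

```python
def lr_range_test(start_lr=1e-5, end_lr=10.0, num_iters=100, w=1.0):
    """Raise the learning rate geometrically each iteration and record the loss."""
    mult = (end_lr / start_lr) ** (1.0 / (num_iters - 1))
    lr, best_loss, history = start_lr, float("inf"), []
    for _ in range(num_iters):
        w = w - lr * 2.0 * w           # one gradient step on loss(w) = w**2
        loss = w ** 2
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > 4 * best_loss:       # loss is clearly getting worse: stop early
            break
        lr *= mult
    return history

history = lr_range_test()
lr_at_min, _ = min(history, key=lambda pair: pair[1])
suggested_lr = lr_at_min / 10          # one order of magnitude below the minimum
```

On a real network the loss curve is noisy, so you would read the value off the plot rather than trust a single minimum like this.
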

Data Augmentation

The most important strategy to improve your model is to get more data which also helps prevent overfitting. One way to do this is via data augmentation. The library supports multiple data augmentation strategies. Some simple transforms include horizontal and vertical flips. Which transform you use depends on the orientation of your images. Data augmentation also helps us train our model to recognize images from different angles.
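
As a minimal sketch of the two simple transforms mentioned above, here are flips written for an image stored as a list of pixel rows. These are my own helpers for illustration, not the library's transforms.

```python
def horizontal_flip(image):
    """Mirror an image left-to-right (each row is reversed)."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Mirror an image top-to-bottom (the row order is reversed)."""
    return image[::-1]

image = [
    [1, 2, 3],
    [4, 5, 6],
]
augmented = [image, horizontal_flip(image), vertical_flip(image)]
```

A horizontal flip is usually safe for photos of cats and dogs, while a vertical flip only makes sense for images with no natural "up", such as satellite imagery.
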
  • It's always a good idea to rerun the learning rate finder when you make significant changes to your architecture like unfreezing early layers.
  • Precomputed activations are the cached outputs of the frozen earlier layers. To take advantage of data augmentation in the library we set learn.precompute = False so that the activations are recomputed for the augmented data.
  • We typically want to decrease the learning rate (learning rate annealing / stepwise annealing) as we get closer to the global minimum. A better way to do learning rate annealing is to use a functional form such as one half of a cosine curve (cosine annealing).
  • In the diagram below the image on the left uses a regular learning rate schedule and converges to a minimum at the end of training, whereas the image on the right has multiple jumps. Each jump marks the start of a new annealing cycle, and the cycle length is controlled by the cycle_len parameter in the library. The schedule on the right also uses cosine annealing, where cycle_len=1 denotes one annealing cycle per epoch. This is described in depth in the Snapshot Ensembles paper. The strategy helps you get unstuck from local minima and gives you lots of opportunities to find the global minimum. The image on the right is using Stochastic Gradient Descent with restarts.

Courtesy of Huang et al from Snapshot Ensemble paper
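
The restart schedule can be sketched in a few lines. This is an illustration of cosine annealing with restarts, not the library's implementation; sgdr_learning_rate, iters_per_epoch, max_lr and min_lr are names I made up for the example.

```python
import math

def sgdr_learning_rate(iteration, iters_per_epoch, max_lr, cycle_len=1, min_lr=0.0):
    """Cosine-annealed learning rate that restarts every cycle_len epochs."""
    iters_per_cycle = cycle_len * iters_per_epoch
    progress = (iteration % iters_per_cycle) / iters_per_cycle   # 0.0 at each restart
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Three epochs of 100 mini-batches each, one annealing cycle per epoch.
schedule = [sgdr_learning_rate(i, iters_per_epoch=100, max_lr=0.01) for i in range(300)]
```

Within each epoch the rate slides from max_lr down toward min_lr along half a cosine curve, then jumps back up at the start of the next epoch.
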

  • When fine-tuning a pre-trained network, the early layers usually need little or no refitting since they capture general purpose features (e.g. lines and edges), whereas the deeper (later) layers typically need to be fine-tuned on your dataset (since they capture the higher level representations), especially if the images in your dataset are very different from the images the pre-trained network was trained on, e.g. training on satellite images using a pre-trained ImageNet network. We can also define an array of learning rates (differential learning rates) and then use these different learning rates on different parts of the network architecture, including the unfrozen pre-trained layers.
  • We can also vary the length of a cycle when using SGD with restarts with the cycle_mult parameter in the library. You can think of this as a way to adjust the explore/exploit balance of the model, which helps it find the global minimum faster.
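
As a quick sketch of how cycle_mult stretches the cycles, assuming (as I understand the parameter) that each cycle is cycle_mult times longer than the one before it. cycle_lengths is a hypothetical helper, not a library function.

```python
def cycle_lengths(n_cycles, cycle_len=1, cycle_mult=2):
    """Length in epochs of each successive annealing cycle."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]

lengths = cycle_lengths(3)      # three cycles with cycle_len=1 and cycle_mult=2
total_epochs = sum(lengths)     # 1 + 2 + 4 epochs
```

Short early cycles let the model jump around and explore, while the longer later cycles give it time to settle into a good spot.
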

  • We also utilized Test Time Augmentation (TTA) to improve our model further. When we make predictions on each image in the validation data set, we also take 4 random augmentations of that image, make label predictions on those augmentations plus the original validation image, and then average the predictions. This technique typically reduces validation error by 10% - 20%.
  • As a general rule of thumb, when setting differential learning rates for a dataset very similar to the data your pre-trained network was trained on, you'd want a 10x difference between the learning rates; otherwise a 3x difference is recommended.
  • One way to avoid overfitting is to start your model training on smaller images over a few epochs and then switch to larger images to continue training.
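
The averaging step in TTA is simple enough to sketch. The probabilities below are made up, and average_predictions is a hypothetical helper rather than the library's TTA call.

```python
def average_predictions(predictions):
    """Average class probabilities across several views of one image."""
    n_views = len(predictions)
    n_classes = len(predictions[0])
    return [sum(view[c] for view in predictions) / n_views for c in range(n_classes)]

# Made-up probabilities for [cat, dog]: the original image plus 4 augmentations.
views = [
    [0.40, 0.60],
    [0.70, 0.30],
    [0.80, 0.20],
    [0.65, 0.35],
    [0.75, 0.25],
]
averaged = average_predictions(views)
predicted_class = averaged.index(max(averaged))
```

In this example the original image alone would have been labeled a dog, but the augmented views outvote it and the averaged prediction is cat.
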

Summary of steps to build a state of the art image classifier using the library
  • Enable data augmentation, and precompute=True
  • Use lr_find() to find highest learning rate where loss is still clearly improving
  • Train last layer from precomputed activations for 1-2 epochs
  • Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
  • Unfreeze all layers
  • Set earlier layers to 3x-10x lower learning rate than next higher layer
  • Use lr_find() again
  • Train full network with cycle_mult=2 until over-fitting
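
For the "3x-10x lower" step, here is a small sketch of how you might build the array of learning rates. differential_lrs is a hypothetical helper, and the default of three layer groups is an assumption for illustration.

```python
def differential_lrs(last_lr, n_groups=3, factor=3.0):
    """Learning rates per layer group, earliest group first, last group highest."""
    return [last_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

similar_data = differential_lrs(1e-2, factor=10.0)    # roughly [1e-4, 1e-3, 1e-2]
different_data = differential_lrs(1e-2, factor=3.0)   # smaller gap between groups
```

The last value goes to the layers you just added, and each earlier group gets a smaller rate so the pre-trained weights are only nudged.
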

Some Useful Links

Sunday, December 10, 2017

Week 1 : - 1 v2 - CNNs, Kernels, Optimal Learning Rate

Deep Learning is essentially a particular way of doing Machine Learning where you give your system a bunch of examples and then it learns the rules and representations vs manually programming the rules

We have interesting applications today like Cancer diagnosis, Language Translation, Inbox by Google, Style Transfer, Data Center Costs optimization, Playing the Game Of Go among others that are powered by Deep Learning. Jeremy also emphasizes all the negatives that come with the growth of Deep learning like algorithmic bias, societal implications, job automation, etc

With Deep Learning we need an infinitely flexible mathematical function that can solve any problem. Such a function would be quite large with lots of parameters and we have to fit those parameters in a fast and scalable way using GPUs.

The philosophy is closely modeled after some of the concerns Paul Lockhart voiced in his essay A Mathematician's Lament which pushes you to start by doing right away and then gradually peel back the layers, modify and look under the hood. The general feeling out there is that there is a survival bias problem in the Deep Learning space which is typified by this Hacker News post. The only currency that should matter is how well you're able to use these tools to solve problems and generate value.


CNNs are the most important architecture for Deep Neural Networks. They're the state of the art for solving problems in many areas of Image Processing, NLP, Speech Processing, etc

The basic structure of a CNN is the convolution which on its own is a relatively straightforward process. A convolution is a linear operation that finds interesting features in an image. Performing one instance of the convolution operation (element-wise multiplication and addition) requires the following steps

  • Identify kernel matrix - this is typically a 3 x 3 matrix or in some cases a 1 x 1 matrix
  • Pass kernel matrix over image (see figure below)
  • Perform element-wise multiplication between the kernel and the image pixels underneath it (see red box in image below)
  • Sum all the elements in the resulting matrix (in the figure below, the sum is 297)
  • Assign the sum as the new value for the center pixel of the overlaid image crop in the activation map
  • This operation is repeated until you've completed passes over the entire image
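
The steps above can be sketched in plain Python. The 4 x 4 image is made up, the kernel is a common form of the sharpen kernel, and the sketch assumes no padding and a stride of 1. Like most deep learning libraries it does not flip the kernel, which makes no difference for a symmetric kernel like this one.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1); return the activation map."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    activation_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Element-wise multiply the kernel with the pixels under it, then sum.
            total = sum(
                kernel[i][j] * image[top + i][left + j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        activation_map.append(row)
    return activation_map

sharpen = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]
image = [
    [10, 10, 10, 10],
    [10, 50, 50, 10],
    [10, 50, 50, 10],
    [10, 10, 10, 10],
]
activation_map = convolve2d(image, sharpen)   # 2 x 2 map, every value is 130
```

Each bright pixel (50) has two bright and two dark neighbours, so the kernel gives 5 * 50 - (50 + 50 + 10 + 10) = 130, pushing the bright region further away from its dark border.
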

There are other parameters like kernel stride and padding that determine the dimensions of the activation maps. I'll be doing a more in-depth post on Convolutional Neural Networks to discuss these and the full CNN pipeline.
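
Until then, here is the usual formula for the size of an activation map along one dimension, written as a small helper. conv_output_size is my own name for it, and the example sizes are just illustrations.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Width (or height) of the activation map along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

no_padding = conv_output_size(224, 3)               # the map shrinks by 2 pixels
same_size = conv_output_size(224, 3, padding=1)     # padding of 1 keeps the size
halved = conv_output_size(224, 7, stride=2, padding=3)
```

A larger stride skips positions and shrinks the map, while padding adds a border so the kernel can also be centered on edge pixels.
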
In the figure above, we used the sharpen kernel. There are also a few other predefined kernels like sobel, emboss, etc. In a typical CNN pipeline, we start with randomly initialized convolution filters, apply a non-linear ReLU activation (remove negatives) and then use SGD + backpropagation to find the best convolution filters. If we do this with enough filters and data we end up with a state of the art image recognizer. CNNs are able to learn filters that detect edges and other simple image characteristics in the lower layers and then use those to detect more complex image features and objects in the deeper layers.

To train an image classifier using the library you need to
  • Select a starting architecture: resnet34 is a good option to start with.
  • Determine the number of epochs: start with 1, observe results and then run multiple epochs if needed
  • Determine the learning rate: Using the strategy in the Cyclical Learning Rate paper, we keep increasing the learning rate until the loss stops decreasing. This will probably take less than one epoch if you have a small batch size. From the figure below, we want to pick the largest learning rate at which the loss is still clearly decreasing (in this case learning_rate = 0.01)
  • Train learn object
  • We used a pre-trained ResNet34 to train a CNN on data from the Cats vs Dogs Kaggle competition and obtained > 99% accuracy 
  • Used a new method (Cyclical Learning Rates for Training Neural Networks) to determine the optimal learning rate which determines how quickly we update our weights. 

Some Useful Links

Sunday, December 3, 2017

Week 0 : - 1 v2 - Setting Up, GPU instances and some housekeeping

I think fast.ai has been one of the many exciting developments in the Deep Learning / Machine Learning education space over the past 1.5 years. fast.ai really champions the top-down approach where the goal is to make Deep Learning more accessible and get you to become productive using these tools to solve a variety of problems without spending years working on a Ph.D.

I've had some false starts with Deep Learning. I literally have stacks of books, papers and PDFs on the subject and a multitude of unfinished Coursera courses and MOOCs but I feel what's usually missing is how you distill all that knowledge to help you become productive right away.

I had the opportunity to watch several videos from the previous MOOC and I really liked the teaching style, so I decided to sign up for the second iteration of the MOOC and I must say I haven't been disappointed. Another added benefit is that the course is taught by Jeremy Howard. The next series of posts will highlight and summarize learnings from the class sessions. My goal with this is to help spread fast.ai's message of making state-of-the-art Deep Learning research / tools more accessible.

This is really powerful stuff and what you want is for people to use these Deep Learning techniques on some of the problems they're working on which may help push the envelope in many ways. 

Hardware Setup

A lot has already been written on why GPUs and specialized hardware are great for Deep Learning so I'll cut to the chase and run you through several setup options you could use for the course

AWS : This is a great guide to set up an AWS instance. A p2.xlarge GPU instance is probably okay for the course and it runs about 90 cents an hour. If your p2 instance limit is 0, you may have to request an instance limit increase through the AWS support center, which usually takes a day or two to be approved. Jeremy also set up a fastai Community AMI in the Oregon and N. Virginia regions and a few other regions around the world (Mumbai, Sydney, Ireland) with all the needed software pre-installed.

Once you're done setting up your instance, the next steps would be to get your ssh keys and then finish the ssh config. My current setup is shown below.  Notice the config file at  ~/.ssh/config and its contents.

The setup allows you to ssh into your AWS instance by using the command ssh fast-ai. Notice that the Hostname is blank; you'll have to fill that in with your instance's IPv4 Public IP, which you can find on your EC2 dashboard. Do keep in mind that this changes whenever you stop/start your instance. You can either update this manually or write a script to replace the hostname each time.
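
One possible way to script the hostname update is sketched below. set_hostname is a hypothetical helper, and every value in the sample config (user, key file name, port 8888, the IP address) is a placeholder I made up, not my actual setup.

```python
import re

def set_hostname(config_text, host_alias, new_ip):
    """Return config_text with the HostName of the given Host block set to new_ip."""
    lines = config_text.splitlines()
    in_block = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if re.match(r"(?i)^host\s", stripped):
            # A new Host block starts; check whether it is the one we want.
            in_block = stripped.split()[1:] == [host_alias]
        elif in_block and re.match(r"(?i)^hostname\b", stripped):
            indent = line[: len(line) - len(line.lstrip())]
            lines[i] = f"{indent}HostName {new_ip}"
    return "\n".join(lines) + "\n"

# Placeholder config: all values here are assumptions for illustration.
sample = """Host fast-ai
    HostName
    User ubuntu
    IdentityFile ~/.ssh/aws-key.pem
    LocalForward 8888 localhost:8888
"""
updated = set_hostname(sample, "fast-ai", "203.0.113.10")
```

To use it for real you would read ~/.ssh/config, pass in the new public IP from the EC2 dashboard, and write the result back, ideally after keeping a backup copy of the file.
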

Also, notice the last line in the config file. LocalForward sets up a tunnel between your system and the AWS instance, allowing you to start the Jupyter server on the AWS instance and open the notebook in a browser on your local system. Running the command jupyter notebook gives the following output. We can then paste the url in a browser on our system to launch the notebook.

Crestle : Crestle is the easiest way to get a GPU enabled notebook. It's built on AWS so you get access to an AMI via ssh. It also has the datasets loaded in by default and it was built by a former student. The setup process is quick compared to AWS and Paperspace.

Paperspace: Paperspace is a cloud platform that enables you to spin up a GPU enabled fully-fledged Ubuntu desktop in a browser. They have faster / cheaper GPUs than AWS and you're charged for compute (by the hour) and storage 

They also have a public template so it's a really quick way to set things up for the course.

Python / Anaconda : I also finally switched my Python version from 2.7 to 3.6. This is trivial to do via Anaconda and you can easily switch between versions. The PyTorch library requires Python 3.6.

Personal Workstation : It could make economic sense to build your own Deep Learning workstation. An NVIDIA 1080 Ti GPU would cost you around $700 or so and you can supercharge your build with multiple GPU cards. I have a box with a 1060 card but I'm still running into some issues setting it up.