Deep Learning Learning Plan
This is my plan for getting up to speed with recent deep learning practice (as of this post's publication date). Comments and recommendations via GitHub issues are welcome and appreciated! The plan assumes some background in probability, linear algebra, and machine learning theory; if you're following along, Part 1 of the Deep Learning book gives an overview of the prerequisite topics to cover.
My notes on these sources are publicly available, as are my experiments.
- Intro tutorials/posts.
- Karpathy
- Skim lectures from weeks 1-6, 9-10 of Hinton’s Coursera course
- Scalar supervised learning theory
- Read Chapters 6, 7, 8, 9, 11, 12 of Dr. Goodfellow’s Deep Learning Book and Efficient Backprop
- Scalar supervised learning practice
- Choose an environment.
- It should be TensorFlow-based, given the wealth of ecosystem tooling built around it, such as Sonnet and T2T.
- I tried TF-Slim and TensorLayer, but I still found Keras the easiest to rapidly prototype in (and extend), and it's easy to drop down into raw TensorFlow from Keras models (see the sketch below).
- Even with Keras, TF can be awkward to prototype in, so it's also worth considering PyTorch.
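For concreteness, here is a minimal sketch of what dropping down from Keras into raw TensorFlow can look like, using the tf.keras API. The clipping op is an arbitrary stand-in for custom TF logic, not something any of the tutorials require:

```python
# Hedged sketch: a small Keras model with a raw TensorFlow op mixed in.
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
x = tf.keras.layers.Dense(128, activation="relu")(inputs)
# Drop down to a plain TensorFlow op inside the Keras graph via a Lambda layer.
x = tf.keras.layers.Lambda(lambda t: tf.clip_by_value(t, -1.0, 1.0))(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```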
- Google MNIST
- Lessons 0-4 from USF
- Assignments 1-4 from Udacity
- CIFAR-10
- Extend to multiple GPUs
- Visualizations (with TensorBoard): histogram summaries for weights/biases/activations, layer-by-layer gradient norm recordings (and how batch norm affects them), graph visualization, and cost over time (a logging sketch follows the CIFAR-10 items).
- Visualizations for trained kernels: the most-activating image from the input set, direct images of the kernel weights, synthesized maximizing inputs, and direct activation images (per Yosinski et al. 2015). For the maximizing inputs, use the regularization from the Yosinski paper.
- A faster input pipeline, with timing metrics for each stage of operation (see the input pipeline notes).
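As a reference for the TensorBoard bullets above, here is a minimal logging sketch with the tf.summary API (tf 2.x style); the loss and gradient tensors are placeholders, and the log directory name is made up:

```python
# Hedged sketch of TensorBoard logging: cost over time, a weight histogram,
# and a per-layer gradient norm. Real training code would produce the values.
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/cifar10-run1")  # hypothetical logdir
kernel = tf.Variable(tf.random.normal([3, 3, 3, 64]), name="conv1/kernel")

with writer.as_default():
    for step in range(100):
        loss = tf.random.uniform([])           # stand-in for the real loss
        grad = tf.random.normal(kernel.shape)  # stand-in for the real gradient
        tf.summary.scalar("cost", loss, step=step)               # cost over time
        tf.summary.histogram("conv1/kernel", kernel, step=step)  # weight histogram
        tf.summary.scalar("conv1/grad_norm", tf.norm(grad), step=step)
writer.flush()
```

Run `tensorboard --logdir logs` to browse the recorded summaries.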
- Assignment 2 from Stanford CS 20SI
- Lab 1 from MIT 6.S191
- Stanford CS231n
- Try out slightly less common techniques: compare initialization (orthogonal vs LSUV vs uniform), weight normalization vs batch normalization vs layer normalization, Bayesian-inspired weight decay vs early stopping vs proximal regularization
- Replicate ResNet (He et al. 2015), DropConnect, Maxout, and Inception (do a fine-tuning example with Inception per this paper); see the residual-block sketch below.
- Do an end-to-end application from scratch. E.g., convert an equation image to LaTeX.
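For the ResNet replication, the core building block is small. Here is a hedged sketch of a basic (non-bottleneck) residual block in tf.keras, following He et al. 2015, with the filter count and input shape chosen arbitrarily:

```python
# Hedged sketch of a basic residual block: two 3x3 convolutions with batch
# norm, plus an identity shortcut added back in before the final ReLU.
import tensorflow as tf

def residual_block(x, filters=64):
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.Add()([y, shortcut])  # identity skip connection
    return tf.keras.layers.ReLU()(y)

inputs = tf.keras.Input(shape=(32, 32, 64))  # illustrative shape
model = tf.keras.Model(inputs, residual_block(inputs))
model.summary()
```

A full CIFAR-10 ResNet stacks these blocks, using strided convolutions and projection shortcuts where the feature-map size changes.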
- Sequence supervised learning
- Gentle introductions
- Lessons 5-7 from USF
- Assignments 5-6 from Udacity
- Karpathy RNN post
- Weeks 7-8 of Hinton’s Coursera course
- Theory
- Chapter 10 from Goodfellow
- Practice
- Lab 2 from MIT 6.S191
- End-to-end application from scratch: a Swype keyboard (Reddit tips)
- Paper recreations
- Machine translation (Sutskever et al. 2014); a minimal encoder-decoder sketch follows below.
- NLP (Vinyals et al. 2015)
- Dense captioning (Karpathy 2016)
- Pointer nets
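Here is a hedged sketch of the encoder-decoder shape behind the Sutskever et al. machine-translation paper (teacher-forced training; vocabulary sizes and dimensions are made up):

```python
# Hedged sketch of a seq2seq model: an LSTM encoder whose final state
# initializes an LSTM decoder that predicts the next target token.
import tensorflow as tf

src_vocab, tgt_vocab, dim = 8000, 8000, 256  # illustrative sizes

enc_in = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(src_vocab, dim)(enc_in)
_, h, c = tf.keras.layers.LSTM(dim, return_state=True)(enc_emb)

dec_in = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(tgt_vocab, dim)(dec_in)
dec_out, _, _ = tf.keras.layers.LSTM(dim, return_sequences=True,
                                     return_state=True)(dec_emb, initial_state=[h, c])
logits = tf.keras.layers.Dense(tgt_vocab)(dec_out)

model = tf.keras.Model([enc_in, dec_in], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```

At inference time the decoder runs one token at a time, feeding each prediction back in as the next input.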
- Attention
- Gentle introductions
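As a companion to the attention introductions, here is a hedged sketch of dot-product attention over a set of encoder states (the shapes and the dot-product scoring choice are illustrative):

```python
# Hedged sketch of dot-product attention: score each encoder state against the
# decoder query, softmax the scores, and return the weighted context vector.
import tensorflow as tf

def dot_product_attention(query, values):
    """query: [batch, dim] decoder state; values: [batch, time, dim] encoder states."""
    scores = tf.einsum("bd,btd->bt", query, values)     # alignment scores
    weights = tf.nn.softmax(scores, axis=-1)            # attention distribution
    context = tf.einsum("bt,btd->bd", weights, values)  # weighted sum of values
    return context, weights

# Toy usage with random tensors.
context, weights = dot_product_attention(tf.random.normal([2, 16]),
                                         tf.random.normal([2, 10, 16]))
print(context.shape, weights.shape)  # (2, 16) (2, 10)
```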
- Unsupervised and semi-supervised approaches
- Theory
- Weeks 11-16 of Hinton’s Coursera course
- Chapters 13, 16-20 from Goodfellow
- See also my links for VAE and RBM notes here
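Since the VAE notes above center on the reparameterization trick and the ELBO, here is a hedged sketch of just those two pieces (encoder and decoder networks omitted; dimensions illustrative):

```python
# Hedged sketch of the VAE-specific pieces: reparameterized sampling and the
# KL term of the ELBO for a diagonal-Gaussian posterior.
import tensorflow as tf

def sample_latent(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps the sample differentiable
    # with respect to the encoder outputs mu and log_var.
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL(q(z|x) || N(0, I)), the regularization term of the ELBO.
    return -0.5 * tf.reduce_sum(
        1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)

# Toy usage.
mu, log_var = tf.zeros([4, 8]), tf.zeros([4, 8])
print(sample_latent(mu, log_var).shape, kl_to_standard_normal(mu, log_var).numpy())
```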
- Practice
- Remaining deeplearning.net tutorials, based on interest.
- Notebooks 06, 11 from nlintz/TensorFlow-Tutorials.
- Paper recreations