Understanding the Difficulty of Training Deep Feedforward Neural Networks

Looking for more research papers to read, I scanned my Hands-On Machine Learning notes for the many papers that were referenced there. This is one of those papers. These papers are mainly on machine learning and deep learning topics.

Reference: Understanding the Difficulty of Training Deep Feedforward Neural Networks (paper)

The purpose of this paper is to understand why standard gradient descent from random initialization performs so poorly with deep neural networks, in order to better explain the relative success of more recent initialization and training mechanisms and to help design better algorithms in the future. The paper first examines the influence of the non-linear activation functions. The authors find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its non-zero mean value, which can drive the top hidden layer in particular into saturation. They also find that a new non-linearity that saturates less can often be beneficial. Finally, the paper studies how activations and gradients vary across layers and during training, guided by the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1.
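To make the point about mean value and saturation concrete, here is a minimal sketch comparing the logistic sigmoid, tanh, and softsign (x / (1 + |x|)). It uses only the Python standard library; the helper names and the probe point x = 5.0 are my own illustrative choices, not anything taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: outputs in (0, 1), so its value at 0 is 0.5."""
    return 1.0 / (1.0 + math.exp(-x))


def softsign(x: float) -> float:
    """Softsign: x / (1 + |x|), outputs in (-1, 1), centered on 0."""
    return x / (1.0 + abs(x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2


def softsign_grad(x: float) -> float:
    return 1.0 / (1.0 + abs(x)) ** 2


# Output at zero input: only the sigmoid is not centered on 0.
outputs_at_zero = {
    "sigmoid": sigmoid(0.0),
    "tanh": math.tanh(0.0),
    "softsign": softsign(0.0),
}

# Gradient away from zero (x = 5.0 is an arbitrary illustrative point).
# The softsign gradient shrinks polynomially, the other two exponentially.
grads_at_5 = {
    "sigmoid": sigmoid_grad(5.0),
    "tanh": tanh_grad(5.0),
    "softsign": softsign_grad(5.0),
}

for name in outputs_at_zero:
    print(f"{name:8s} f(0) = {outputs_at_zero[name]:.3f}  f'(5) = {grads_at_5[name]:.6f}")
```

The sigmoid's output of 0.5 at zero input is the non-zero mean the summary refers to, and the slower decay of the softsign gradient is what "saturates less" means here.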

Deep Learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features. They include learning methods for a wide array of deep architectures, including neural networks with many hidden layers and graphical models with many levels of hidden variables.


Initialization from unsupervised pre-training yields substantial improvements on deep architectures even with very large data sets. Two things should be avoided, and both can be revealed by the evolution of activations: excessive saturation of activation functions (gradients will then not propagate well) and overly linear units (they will not compute anything interesting).
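The two failure modes above can be watched for by tracking, per layer, how many units sit in the flat tails of the activation function and how many sit in its near-linear region. The following is a rough sketch with a small random tanh network in plain Python. The thresholds (0.99 and 0.1), the layer width, and the two weight scales are assumptions chosen for illustration, not values from the paper.

```python
import math
import random


def layer_stats(activations, sat_threshold=0.99, lin_threshold=0.1):
    """Return (fraction saturated, fraction near-linear) for tanh outputs.

    Both thresholds are illustrative assumptions.
    """
    n = len(activations)
    saturated = sum(abs(a) > sat_threshold for a in activations) / n
    near_linear = sum(abs(a) < lin_threshold for a in activations) / n
    return saturated, near_linear


def forward_tanh(x, n_layers, width, weight_scale, rng):
    """Push x through n_layers random tanh layers, collecting stats per layer."""
    stats = []
    h = x
    for _ in range(n_layers):
        # Weights drawn from U[-weight_scale, weight_scale], no biases.
        weights = [
            [rng.uniform(-weight_scale, weight_scale) for _ in h]
            for _ in range(width)
        ]
        h = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in weights]
        stats.append(layer_stats(h))
    return stats


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(50)]

# Large weights push units into the flat tails of tanh (saturation).
large_stats = forward_tanh(x, n_layers=4, width=50, weight_scale=1.0, rng=rng)
# Tiny weights keep every unit in the near-linear region around 0.
small_stats = forward_tanh(x, n_layers=4, width=50, weight_scale=0.01, rng=rng)

for i, (big, small) in enumerate(zip(large_stats, small_stats), start=1):
    print(f"layer {i}: large-scale (sat, lin) = {big}, small-scale (sat, lin) = {small}")
```

In a real experiment the same statistics would be logged at intervals during training rather than only at initialization.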


  • The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima
  • The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity
  • For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward)
  • Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding difficulties in deep nets.
  • Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer
  • Keeping the layer-to-layer transformations such that both activations and gradients flow well (i.e. with Jacobian singular values around 1) appears helpful, and eliminates a good part of the discrepancy between purely supervised deep networks and ones pre-trained with unsupervised learning
  • Many of the observations remain unexplained, suggesting further investigation to better understand gradients and training dynamics with deep architectures
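The normalized initialization mentioned in the list draws each weight from a uniform distribution on [-sqrt(6 / (n_in + n_out)), +sqrt(6 / (n_in + n_out))], where n_in and n_out are the layer's fan-in and fan-out. The sketch below compares it with the common heuristic U[-1/sqrt(n_in), 1/sqrt(n_in)]. The function names, the layer sizes, and the simplification of treating each layer as linear (so the activation variance is multiplied by n_in * Var(W) per layer) are assumptions made for illustration.

```python
import math
import random


def standard_limit(n_in: int, n_out: int) -> float:
    """Common heuristic: W ~ U[-1/sqrt(n_in), 1/sqrt(n_in)] (n_out unused)."""
    return 1.0 / math.sqrt(n_in)


def normalized_limit(n_in: int, n_out: int) -> float:
    """Normalized initialization: W ~ U[-l, l] with l = sqrt(6 / (n_in + n_out))."""
    return math.sqrt(6.0 / (n_in + n_out))


def uniform_variance(limit: float) -> float:
    """Variance of the uniform distribution on [-limit, limit]."""
    return limit ** 2 / 3.0


def init_weights(n_in, n_out, limit_fn, rng):
    """Sample an n_out x n_in weight matrix as nested lists."""
    limit = limit_fn(n_in, n_out)
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]


def variance_gain(layer_sizes, limit_fn):
    """Product of n_in * Var(W) over all layers.

    Under the linear-regime assumption this is the factor by which the
    activation variance is scaled from the input to the last layer.
    """
    gain = 1.0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        gain *= n_in * uniform_variance(limit_fn(n_in, n_out))
    return gain


# Five weight layers of equal width (hypothetical sizes).
layer_sizes = [100] * 6
standard_gain = variance_gain(layer_sizes, standard_limit)
normalized_gain = variance_gain(layer_sizes, normalized_limit)

print(f"standard heuristic gain over 5 layers: {standard_gain:.5f}")
print(f"normalized init gain over 5 layers:    {normalized_gain:.5f}")
```

For equal-width layers the heuristic gives n_in * Var(W) = 1/3, so the signal shrinks by a factor of 3 per layer, while the normalized scheme gives exactly 1. When n_in and n_out differ, the normalized limit is a compromise between keeping the forward and the backward variance at 1.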

