Understanding the Difficulty of Training Deep Feedforward Neural Networks

Looking for more research papers to read, I scanned my Hands-On Machine Learning notes for the many papers referenced there; this is one of those papers. These papers are mainly on machine learning and deep learning topics.

Reference: Understanding the Difficulty of Training Deep Feedforward Neural Networks (paper)


Introduction

The purpose of this paper is to understand why standard gradient descent from random initialization does so poorly with deep neural networks, to better understand these recent relative successes, and to help design better algorithms in the future. The paper studies the influence of non-linear activation functions. The authors find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive the top hidden layer in particular into saturation. The paper proposes a new non-linearity that saturates less. It also studies how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1.
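
The less-saturating non-linearity the paper proposes is the softsign, x / (1 + |x|), whose tails approach ±1 polynomially rather than exponentially. A quick NumPy sketch (mine, not code from the paper) comparing it with tanh and the logistic sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softsign(x):
    # Softsign: x / (1 + |x|), approaches -1/+1 polynomially rather than exponentially
    return x / (1.0 + np.abs(x))

x = np.linspace(-6.0, 6.0, 7)
print(np.round(np.tanh(x), 3))    # tanh is essentially saturated beyond |x| ~ 3
print(np.round(softsign(x), 3))   # softsign still changes noticeably at the same inputs
print(np.round(sigmoid(x), 3))    # logistic sigmoid: outputs centered near 0.5, not 0
```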

Deep Learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features. They include learning methods for a wide array of deep architectures, including neural networks with many hidden layers and graphical models with many levels of hidden variables.

Notes

Initialization from unsupervised pre-training yields substantial improvements on deep architectures, even with very large data sets. Two things to avoid, both of which the evolution of activations can reveal, are excessive saturation of the activation functions (gradients will not propagate well) and overly linear units (they will not compute anything interesting).
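
As a rough illustration of that kind of monitoring (my own sketch, not code from the paper; the layer sizes and saturation threshold are arbitrary), one can push a random batch through a small tanh network using the standard heuristic initialization U[-1/√n, 1/√n] and record, per layer, the activation statistics and the fraction of nearly saturated units:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [100, 100, 100, 100, 100]            # hypothetical 4-layer tanh net
h = rng.standard_normal((256, sizes[0]))     # a batch of random inputs

for i in range(len(sizes) - 1):
    n_in, n_out = sizes[i], sizes[i + 1]
    # "Standard" heuristic initialization: U[-1/sqrt(n_in), 1/sqrt(n_in)]
    W = rng.uniform(-1.0, 1.0, size=(n_in, n_out)) / np.sqrt(n_in)
    h = np.tanh(h @ W)
    saturated = np.mean(np.abs(h) > 0.99)    # fraction of units close to the asymptotes
    print(f"layer {i + 1}: mean={h.mean():+.3f} std={h.std():.3f} saturated={saturated:.1%}")
```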

Conclusions:

  • The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima
  • The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity
  • For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward); see the initialization sketch after this list
  • Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding difficulties in deep nets.
  • Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer
  • Keeping the layer-to-layer transformations such that both activations and gradients flow well (i.e. with a Jacobian around 1) appears helpful, and eliminates a good part of the discrepancy between purely supervised deep networks and ones pre-trained with unsupervised learning
  • Many of the observations remain unexplained, suggesting further investigations to better understand gradients and training dynamics with deep architectures
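
The normalized initialization mentioned above draws weights from U[-√6/√(n_j + n_{j+1}), √6/√(n_j + n_{j+1})], where n_j and n_{j+1} are the fan-in and fan-out of the layer. A minimal NumPy sketch (the helper name is mine):

```python
import numpy as np

def normalized_init(n_in, n_out, rng=None):
    # Normalized initialization from the paper (now often called Xavier/Glorot init):
    # W ~ U[-sqrt(6 / (n_in + n_out)), +sqrt(6 / (n_in + n_out))],
    # chosen so activation and gradient variances stay roughly constant across layers.
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = normalized_init(784, 256)
print(W.shape, float(W.min()), float(W.max()))
```

The symmetric fan-in plus fan-out term is what balances the two requirements: keeping the forward activation variance constant favors scaling by fan-in, while keeping the back-propagated gradient variance constant favors scaling by fan-out.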
