Performance Metrics in Machine Learning

I want to learn more about common loss functions used in machine / deep learning.

Notes


The activation function of a node in an artificial neural network is a function that calculates the output of the node based on its individual inputs and their weights. Nontrivial problems can be solved using only a few nodes if the activation function is nonlinear.
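
A minimal sketch of this computation, assuming a single node with a ReLU activation (the input values, weights, and bias below are arbitrary illustrative numbers):

    import numpy as np

    def node_output(inputs, weights, bias, activation):
        # Weighted sum of the node's inputs and weights, passed through the activation
        return activation(np.dot(weights, inputs) + bias)

    relu = lambda v: np.maximum(0.0, v)  # a simple nonlinear activation

    x = np.array([0.5, -1.2, 3.0])   # example inputs
    w = np.array([0.8, 0.1, -0.4])   # example weights
    print(node_output(x, w, bias=0.2, activation=relu))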

Modern activation functions include the sigmoid function, the ReLU, and the smooth version of the ReLU, the GELU; a small sketch of all three appears after the list below. Aside from empirical performance, activation functions also have different mathematical properties:

  • Nonlinear
    • When the activation function is nonlinear, a two-layer neural network can be proven to be a universal function approximator. This is known as the Universal Approximation Theorem. When multiple layers use the identity activation function, the entire network is equivalent to a single-layer model.
  • Range
    • When the range of the activation function is finite, gradient-based training methods tend to be more stable, because each pattern presentation significantly affects only a limited set of weights. When the range is infinite, training is generally more efficient because each pattern presentation significantly affects most of the weights; in the latter case, smaller learning rates are typically necessary.
  • Continuously differentiable
    • This property is desirable for enabling gradient-based optimization methods.
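
A small sketch of the three functions mentioned above, assuming NumPy and using the common tanh-based approximation for the GELU (the exact form uses the Gaussian CDF):

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))   # finite range (0, 1)

    def relu(v):
        return np.maximum(0.0, v)         # unbounded above

    def gelu(v):
        # tanh approximation of GELU, x * Phi(x)
        return 0.5 * v * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (v + 0.044715 * v**3)))

    v = np.linspace(-3.0, 3.0, 7)
    for name, f in [("sigmoid", sigmoid), ("relu", relu), ("gelu", gelu)]:
        print(name, np.round(f(v), 3))

Note that the sigmoid has a finite range, while ReLU and GELU are unbounded above, which connects to the range property in the list above.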

The most common activation functions can be divided into three categories (seen below). An activation function f is saturating if lim_{|v| → ∞} |∇f(v)| = 0, i.e. its gradient vanishes as the magnitude of the input grows; it is nonsaturating if it is not saturating. Non-saturating activation functions, such as ReLU, may be better than saturating activation functions because they are less likely to suffer from the vanishing gradient problem.

  1. Ridge Activation Functions
  2. Radial Activation Functions
  3. Fold Activation Functions
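
As an illustrative sketch of one member of each category (the particular functions, weights, and center below are arbitrary example choices): a ridge function acts on a linear combination of the inputs, a radial function depends only on the distance of the input from a center, and a fold function aggregates over the whole input, as in pooling.

    import numpy as np

    x = np.array([1.0, -2.0, 0.5])       # example input vector

    # Ridge: nonlinearity applied to a linear combination a.x + b (here, ReLU)
    a, b = np.array([0.3, 0.7, -0.2]), 0.1
    ridge = np.maximum(0.0, np.dot(a, x) + b)

    # Radial: depends only on the distance from a center c (here, a Gaussian RBF)
    c = np.array([0.0, -1.0, 1.0])
    radial = np.exp(-np.sum((x - c) ** 2))

    # Fold: aggregates over the inputs, as in mean or max pooling (here, max)
    fold = np.max(x)

    print(ridge, radial, fold)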
