Convolution
I was going over neural style transfer and I want to remind myself of how the convolution process works.
References
Definitions
- Expected Value
- In probability theory, the expected value (also called expectation, expectancy, expectation operator, mathematical expectation, mean, expectation value, or first moment) is a generalization of the weighted average. Informally, the expected value is the mean of the possible values a random variable can take, weighted by the probability of those outcomes. Since it is obtained through arithmetic, the expected value sometimes may not even be included in the sample data set; it is not the value you would expect to get in reality.
- The expected value of a random variable with a finite number of outcomes is a weighted average of all possible outcomes. In the case of a continuum of possible outcomes, the expectation is defined by integration.
- The expected value of a random variable X is often denoted E(X), E[X], or EX, with E also often stylized as 𝔼.
- Consider a random variable X with a finite list x_1, ..., x_k of possible outcomes, each of which (respectively) has probability p_1, ..., p_k of occurring. The expectation of X is defined as E[X] = x_1 p_1 + x_2 p_2 + ... + x_k p_k (see the worked example after this list).
- Cross-Correlation
- In signal processing, cross-correlation is a measure of similarity of two series as a function of the displacement of one relative to the other. This is also known as a sliding dot product or sliding inner product. It is commonly used for searching a long signal for a shorter, known feature. It has applications in pattern recognition, single particle analysis, electron tomography, averaging, cryptanalysis, and neurophysiology. The cross-correlation is similar in nature to the convolution of two functions. In an autocorrelation, which is the cross-correlation of a signal with itself, there will always be a peak at a lag of zero, and its size will be the signal energy.
- For random vectors X = (X_1, ..., X_m)^T and Y = (Y_1, ..., Y_n)^T, each containing random elements whose expected value and variance exist, the cross-correlation matrix of X and Y is defined by R_XY = E[X Y^T] and has dimensions m × n. Written component-wise, the (i, j) entry of R_XY is E[X_i Y_j].
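To make the expectation formula above concrete, here is a quick worked example (my own illustration, not part of the definition): the expected value of a fair six-sided die.

```latex
% Fair six-sided die: outcomes 1..6, each with probability 1/6.
E[X] = \sum_{i=1}^{6} x_i p_i = (1 + 2 + 3 + 4 + 5 + 6) \cdot \frac{1}{6} = 3.5
% Note that 3.5 is not itself a possible outcome, which is exactly the point
% that the expected value "may not even be included in the sample data set".
```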
Notes
In mathematics (in particular, functional analysis), a convolution is a mathematical operation on two functions f and g that produces a third function f ∗ g. The term convolution refers to both the result function and to the process of computing it. It is defined as the integral of the product of the two functions after one is reflected about the y-axis and shifted. The integral is evaluated for all values of shift, producing the convolution function. The choice of which function is reflected and shifted before the integral does not change the integral result (see commutativity). Graphically, it expresses how the 'shape' of one function is modified by the other.
Some features of convolution are similar to cross-correlation: for real-valued functions of a continuous or discrete variable, convolution differs from cross-correlation only in that either f(x) or g(x) is reflected about the y-axis in convolution; thus it is a cross-correlation of g(−x) and f(x), or of f(−x) and g(x).
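The relationship between the two operations is easy to check numerically. Below is a minimal NumPy sketch (my own, not from the source text) showing that the discrete convolution of two sequences equals the cross-correlation of one with the reflected (flipped) other:

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

# Full discrete convolution of f and g.
conv = np.convolve(f, g, mode="full")

# Cross-correlating f with the reflected g gives the same result:
# convolution is just cross-correlation with one function flipped.
corr_with_flipped_g = np.correlate(f, g[::-1], mode="full")

print(conv)                   # [0.   1.   2.5  4.   1.5]
print(corr_with_flipped_g)    # identical
assert np.allclose(conv, corr_with_flipped_g)
```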
Convolution has applications that include probability, statistics, acoustics, spectroscopy, signal processing and image processing, geophysics, engineering, physics, computer vision and differential equations.
Computing the inverse of the convolution operation is known as deconvolution.
Definition
The convolution of f and g is written f ∗ g, denoting the operator with the symbol ∗. It is defined as the integral of the product of the two functions after one is reflected about the y-axis and shifted. As such, it is a particular kind of integral transform:
(f ∗ g)(t) = ∫ f(τ) g(t − τ) dτ, with the integral taken over all τ from −∞ to ∞.
At each t, the convolution formula can be described as the area under the function f(τ) weighted by the function g(−τ) shifted by the amount t. As t changes, the weighting function g(t − τ) emphasizes different parts of the input function f(τ); if t is a positive value, then g(t − τ) is equal to g(−τ) slid along the τ-axis toward the right (toward +∞) by the amount t, while if t is a negative value, then g(t − τ) is equal to g(−τ) slid toward the left (toward −∞) by the amount |t|.
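The earlier claim that it does not matter which function is reflected and shifted follows from a change of variables. A short standard derivation (added here for completeness, not taken from the note):

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
           = \int_{-\infty}^{\infty} f(t - u)\, g(u)\, du   % substitute u = t - \tau
           = (g * f)(t)
% The substitution flips the integration limits and gives d\tau = -du;
% the two sign changes cancel, so the value of the integral is unchanged.
```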
Discrete Convolution
For complex-valued functions f and g defined on the set of integers, the discrete convolution of f and g is given by:
(f ∗ g)[n] = Σ_m f[m] g[n − m], where the sum runs over all integers m.
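As a sanity check, the sum above can be implemented directly for finite-length sequences (a minimal sketch of my own; indices outside the sequences are treated as zero):

```python
def discrete_convolution(f, g):
    """Direct implementation of (f * g)[n] = sum_m f[m] * g[n - m]
    for finite-length sequences; out-of-range indices contribute zero."""
    n_out = len(f) + len(g) - 1
    out = [0.0] * n_out
    for n in range(n_out):
        for m in range(len(f)):
            if 0 <= n - m < len(g):
                out[n] += f[m] * g[n - m]
    return out

# Matches np.convolve([1, 2, 3], [0, 1, 0.5], mode="full").
print(discrete_convolution([1, 2, 3], [0, 1, 0.5]))  # [0.0, 1.0, 2.5, 4.0, 1.5]
```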
Intuitively Understanding Convolutions for Deep Learning
The 2D convolution: you start with a kernel, which is a small matrix of weights. This kernel "slides" over the 2D input data, performing an elementwise multiplication with the part of the input it is currently on, and then summing up the results into a single output pixel.
The kernel repeats this process for every location it slides over, converting a 2D matrix of features into yet another 2D matrix of features. The output features are essentially the weighted sums (with the weights being the values of the kernel itself) of the input features located in roughly the same location as the output pixel on the input layer. The size of the kernel directly determines how many (or few) input features get combined in the production of a new output feature.
Convolutions allow us to "look at" only some input features (in contrast to a fully connected layer, where you look at every input feature).
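Here is a minimal single-channel sketch of that sliding process (my own illustration; note that deep-learning "convolution" layers typically implement cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def conv2d_single_channel(image, kernel):
    """Slide `kernel` over `image`, taking an elementwise product and sum
    at each location (no padding, stride 1)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)  # weighted sum of nearby input features
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])  # a simple vertical-edge kernel
print(conv2d_single_channel(image, kernel).shape)  # (3, 3): smaller than the 5x5 input
```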
Commonly Used Techniques
- Padding
- In the example above, the outer pixels are never centered under the kernel, so the output matrix is smaller than the input matrix. To fix this, we can "pad" the edges with extra, "fake" pixels.
- Striding
- The idea of the stride is to skip some of the slide locations of the kernel. A stride of 1 means picking slides a pixel apart, i.e. every single slide, which acts as a standard convolution. More modern networks, such as the ResNet architectures, entirely forgo pooling layers in their internal layers in favor of strided convolutions when they need to reduce their input sizes (see the sketch after this list).
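The effect of padding and stride on the output size can be captured by the standard one-dimensional formula output = (input + 2·padding − kernel) // stride + 1 (this formula is my addition, not from the post):

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Standard output-size formula for a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(5, 3))                        # 3: no padding shrinks the output
print(conv_output_size(5, 3, padding=1))             # 5: "same" padding keeps the size
print(conv_output_size(5, 3, padding=1, stride=2))   # 3: stride 2 skips every other slide
```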
The Multi-Channel Version
Most images have 3 channels (RGB). It's pretty easy to think of channels as being a "view" of the image as a whole, emphasizing some aspects and de-emphasizing others. In the case of multiple channels, the terms filter and kernel become distinct: each filter actually happens to be a collection of kernels, with there being one kernel for every single input channel to the layer, and each kernel being unique.
Each filter in a convolution layer produces one and only one output channel, and they do it like so:
- Each of the filter's kernels slides over its respective input channel, producing a processed version of it.
- The per-channel processed versions are then summed together to form one channel. The kernels of a filter each produce one version of each channel, and the filter as a whole produces one overall output channel.
- Finally, the bias term gets added to the output channel to produce the final output channel.
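Putting the three steps together, a bare-bones multi-channel sketch might look like this (my own illustration of the idea, not code from the post):

```python
import numpy as np

def conv2d_multi_channel(image, filters, biases):
    """image:   (in_channels, H, W)
    filters: (out_channels, in_channels, kh, kw) -- one kernel per input channel
    biases:  (out_channels,)
    No padding, stride 1."""
    in_c, ih, iw = image.shape
    out_c, _, kh, kw = filters.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((out_c, oh, ow))
    for f in range(out_c):                # each filter produces one output channel
        for c in range(in_c):             # each kernel slides over its own input channel
            for y in range(oh):
                for x in range(ow):
                    patch = image[c, y:y + kh, x:x + kw]
                    out[f, y, x] += np.sum(patch * filters[f, c])  # summed across channels
        out[f] += biases[f]               # bias term added to the output channel
    return out

rgb = np.random.rand(3, 8, 8)            # 3-channel (RGB-like) input
filters = np.random.rand(4, 3, 3, 3)     # 4 filters, each a collection of three 3x3 kernels
biases = np.zeros(4)
print(conv2d_multi_channel(rgb, filters, biases).shape)  # (4, 6, 6)
```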