Hands On Machine Learning Chapter 14 - Deep Computer Vision using Convolutional Neural Networks
I am re-reading Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel I got a good grasp of machine learning the first time through, and I skipped the neural network chapters entirely on my first read. These are my notes for Chapter 14.
Deep Computer Vision Using Convolutional Neural Networks
Convolutional neural networks (CNNs) emerged from the study of the brain's visual cortex, and they have been used in image recognition since the 1980s. In the last few years, thanks to the increase in computational power, the amount of available training data, and the tricks presented in Chapter 11 for training deep nets, CNNs have managed to achieve superhuman performance on some complex visual tasks. CNNs are not restricted to visual perception: they are also successful at many other tasks, such as voice recognition and natural language processing (NLP), but this chapter focuses on visual applications. It covers many topics, including object detection - classifying multiple objects in an image and placing bounding boxes around them - and semantic segmentation - classifying each pixel according to the class of the object it belongs to.
The Architecture of the Visual Cortex
David H. Hubel and Torsten Wiesel performed a series of experiments on cats in 1958 and 1959 (and a few years later on monkeys), giving crucial insights into the structure of the visual cortex (they received the Nobel Prize in Physiology or Medicine in 1981 for their work). In particular, they showed that many neurons in the visual cortex have a small local receptive field, meaning they react only to visual stimuli located in a limited region of the visual field (in the image below, the local receptive fields of five neurons are represented by dashed circles). The receptive fields of different neurons may overlap, and together they tile the whole visual field. Moreover, the authors showed that some neurons react only to images of horizontal lines, while others react only to lines with different orientations (two neurons may have the same receptive field but react to different line orientations). They also noticed that some neurons have larger receptive fields, and they react to more complex patterns that are combinations of lower-level patterns. These observations led to the idea that higher-level neurons are based on the outputs of neighboring lower-level neurons. This powerful architecture is able to detect all sorts of complex patterns in any area of the visual field.
These studies of the visual cortex inspired the neocognitron, introduced in 1980, which gradually evolved into what we now call convolutional neural networks. An important milestone was a 1998 paper that introduced the famous LeNet-5 architecture, widely used to recognize handwritten check numbers. This architecture has some building blocks that you may already know, such as fully connected layers and sigmoid activation functions, but it also introduced two new building blocks: convolutional layers and pooling layers. Note: a regular DNN with fully connected layers fails for complex image tasks because of the huge number of parameters it requires.
Convolutional Layer
The most important building block of a CNN is the convolutional layer. Neurons in the first convolutional layer are not connected to every single pixel in the input image, but only to pixels in their receptive fields (see image below). In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on small low-level features in the first hidden layer, then assemble them into larger higher-level features in the next hidden layer, and so on. This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.
Until now, all multilayer neural networks we looked at had layers composed of a long line of neurons, and we had to flatten input images to 1D before feeding them to the neural network. Now each layer is represented in 2D, which makes it easier to match neurons with their corresponding inputs.
A neuron located in row $i$, column $j$ of a given layer is connected to the outputs of the neurons in the previous layer located in rows $i$ to $i + f_h - 1$ and columns $j$ to $j + f_w - 1$, where $f_h$ and $f_w$ are the height and width of the receptive field (see image below). In order for a layer to have the same height and width as the previous layer, it is common to add zeros around the inputs, as shown below. This is called zero padding.
It is also possible to connect a large input layer to a much smaller layer by spacing out the receptive fields, as shown below. The shift from one receptive field to the next is called the stride. In the image below, a 5 x 7 input layer (plus zero padding) is connected to a 3 x 4 layer, using 3 x 3 receptive fields and a stride of 2 (the stride doesn't have to be the same in both directions). A neuron located in row $i$, column $j$ in the upper layer is connected to the outputs of the neurons in the previous layer located in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, where $s_h$ and $s_w$ are the vertical and horizontal strides.
Filters
A neuron's weights can be represented as a small image the size of the receptive field. The image below shows two possible sets of weights, called filters (or convolution kernels). The white lines in the black squares mean that neurons using these filters will ignore everything in their receptive field except for the vertical / horizontal lines. If all neurons in a layer use the same vertical line filter, the layer will output the image shown in the top left. A layer full of neurons using the same filter outputs a feature map, which highlights the areas in an image that activate the filter the most.
Stacking Multiple Feature Maps
In reality a convolutional layer has multiple filters and outputs one feature map per filter, so it is more accurately represented in 3D (see the image below). It has one neuron per pixel in each feature map, and all neurons within a given feature map share the same parameters (the same weights and bias term). However, neurons in different feature maps use different parameters. A neuron's receptive field is the same as described earlier, but it extends across all the previous layer's feature maps. In short, a convolutional layer simultaneously applies multiple trainable filters to its inputs, making it capable of detecting multiple features anywhere in its inputs. The fact that all neurons in a feature map share the same parameters dramatically reduces the number of parameters in the model. Once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location. In contrast, once a regular DNN has learned to recognize a pattern in one location, it can recognize it only in that particular location. Input images are also composed of multiple sublayers: one per color channel, typically red, green, and blue.
A neuron located in row $i$, column $j$ of feature map $k$ in a given convolutional layer $l$ is connected to the outputs of the neurons in the previous layer $l - 1$, located in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, across all feature maps. Note that all neurons located in the same row $i$ and column $j$ but in different feature maps are connected to the outputs of the exact same neurons in the previous layer. The equation below shows how to compute the output of a given neuron in a convolutional layer.
Computing the Output of a Neuron in a Convolutional Layer
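The equation itself didn't survive the copy into these notes, so here is a standard way to write it (my reconstruction, using the symbols defined in the list below; treat it as a reference rather than a verbatim quote from the book):

$$z_{i,j,k} = b_k + \sum_{u=0}^{f_h - 1} \sum_{v=0}^{f_w - 1} \sum_{k'=0}^{f_{n'} - 1} x_{i',j',k'} \cdot w_{u,v,k',k} \quad \text{with} \quad i' = i \times s_h + u, \quad j' = j \times s_w + v$$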
- $z_{i,j,k}$ is the output of the neuron located in row $i$, column $j$ in feature map $k$ of the convolutional layer (layer $l$)
- As explained earlier, $s_h$ and $s_w$ are the vertical and horizontal strides, $f_h$ and $f_w$ are the height and width of the receptive field, and $f_{n'}$ is the number of feature maps in the previous layer (layer $l - 1$)
- $x_{i',j',k'}$ is the output of the neuron located in layer $l - 1$, row $i'$, column $j'$, feature map $k'$ (or channel $k'$ if the previous layer is the input layer)
- $b_k$ is the bias term for feature map $k$ (in layer $l$). You can think of it as a knob that tweaks the overall brightness of feature map $k$
- $w_{u,v,k',k}$ is the connection weight between any neuron in feature map $k$ of layer $l$ and its input located at row $u$, column $v$ (relative to the neuron's receptive field), and feature map $k'$
TensorFlow Implementation
In TensorFlow, each input image is typically represented as a 3D tensor of shape [height, width, channels]. A mini-batch is represented as a 4D tensor of shape [mini-batch size, height, width, channels]. The weights of a convolutional layer are represented as a 4D tensor of shape $[f_h, f_w, f_{n'}, f_n]$. The bias terms of a convolutional layer are simply represented as a 1D tensor of shape $[f_n]$. The following code loads two images, creates two 7 x 7 filters (one with a vertical white line, one with a horizontal white line), and applies both of them to both images using the tf.nn.conv2d() function.
Explanation of the code below:
- images is the input mini-batch (a 4D tensor)
- filters is the set of filters to apply (also a 4D tensor)
- strides is equal to 1, but it could also be a 1D array with 4 elements, where the two central elements are the vertical and horizontal strides. The first and last elements must currently be equal to 1.
- padding must be either "VALID" or "SAME"
- If set to "VALID", the convolutional layer does not use zero passing, and may ignore rows and columns at the bottom and right of the input image, depending on the stride, as shown in the image below
- If set to "SAME", the convolutional layer uses zero passing if necessary. In this case, the number of output neurons is equal to the number of input neurons divided by the stride, rounded up. Then zeros are added as evenly as possible around the inputs.
from sklearn.datasets import load_sample_image
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
# Load Sample Images
china = load_sample_image("china.jpg") / 255
flower = load_sample_image("flower.jpg") / 255
images = np.array([china, flower])
batch_size, height, width, channels = images.shape
# Create 2 filters
filters = np.zeros(shape=(7,7,channels,2),dtype=np.float32)
filters[:,3,:,0] = 1 # vertical line
filters[3,:,:,1] = 1 # horizontal line
outputs = tf.nn.conv2d(images, filters, strides=1, padding="SAME")
plt.imshow(outputs[0, :, :, 1], cmap="gray") # plot 1st image's 2nd feature map
plt.show()
The example above manually creates the filters, but in a real CNN you would normally define filters as trainable variables, so the neural net can learn which filters work best. Instead of manually creating the variables, you can simply use the keras.layers.Conv2D layer:
from tensorflow import keras
conv = keras.layers.Conv2D(filters=32, kernel_size=3, strides=1, padding="SAME", activation="relu")
Memory Requirements
Another problem with CNNs is that the convolutional layers require a huge amount of RAM. This is especially true during training, because the reverse pass of backpropagation requires all the intermediate values computed during the forward pass. The amount of RAM needed during training is the total amount of RAM required by all layers. If training crashes because of an out-of-memory error, you can try reducing the mini-batch size.
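To make this concrete, here is a rough back-of-the-envelope example (my own numbers, for illustration): a convolutional layer that outputs 200 feature maps of size 150 x 100 produces $200 \times 150 \times 100 = 3{,}000{,}000$ values per instance, or about 12 MB of RAM with 32-bit floats, so a mini-batch of 100 instances needs roughly 1.2 GB just to hold that one layer's outputs during training.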
Pooling Layer
The goal of pooling layers is to subsample (shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters (thereby limiting the risk of overfitting). A pooling layer is just like a convolutional layer, except that a pooling layer neuron has no weights; all it does is aggregate the inputs using an aggregation function such as max or mean. The image below shows a max pooling layer, which is the most common type of pooling layer. In the example below, only the maximum input value in each receptive field makes it to the next layer; the other inputs are dropped.
Other than reducing computations, memory usage, and the number of parameters, a max pooling layer also introduces some level of invariance to small translations, as seen in the image below. As you can see, the outputs of the max pooling layer for images A and B are identical: this is what translation invariance means. By inserting a max pooling layer every few layers in a CNN, it is possible to get some level of translation invariance at a larger scale. Moreover, max pooling also offers a small amount of rotational invariance and a slight scale invariance. Such invariance can be useful in cases where the prediction should not depend on these details, such as in classification.
Max pooling does have some downsides, though: it is quite destructive (a 2 x 2 kernel with a stride of 2 keeps only one value out of every four), and invariance is not desirable for some applications, such as semantic segmentation - the task of classifying each pixel in an image depending on the object that pixel belongs to. Obviously, if the input image is translated by one pixel to the right, the output should also be translated by one pixel to the right. The goal in this case is equivariance: a small change in the inputs should lead to a corresponding small change in the outputs.
The following code creates a max pooling layer using a 2 x 2 kernel. To create an average pooling layer, just use AvgPool2D instead of MaxPool2D. Max pooling layers are used most of the time now, as they generally perform better. Max pooling and average pooling can also be performed along the depth dimension rather than the spatial dimensions (see image below); this can allow the CNN to learn to be invariant to various features. The global average pooling layer computes the mean of each entire feature map (it's like an average pooling layer using a pooling kernel with the same spatial dimensions as the inputs). This means that it just outputs a single number per feature map and per instance.
max_pool = keras.layers.MaxPool2D(pool_size=2)
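As a sketch of the two variants mentioned above (the depthwise kernel size of 3 and the Lambda wrapper are my own illustrative choices): a depthwise pooling layer can be built by wrapping tf.nn.max_pool in a Lambda layer, and global average pooling is available as a standard Keras layer.
# Depthwise max pooling: pools over groups of 3 feature maps (channel count must be divisible by 3)
depth_pool = keras.layers.Lambda(
    lambda X: tf.nn.max_pool(X, ksize=(1, 1, 1, 3), strides=(1, 1, 1, 3), padding="VALID"))
# Global average pooling: one number per feature map and per instance
global_avg_pool = keras.layers.GlobalAvgPool2D()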
CNN Architectures
Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps) thanks to the convolutional layers (see image below). At the top of the stack, a regular feedforward network is added, composed of a few fully connected layers (+ ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities). A common mistake is to use convolutional kernels that are too large; stacking smaller kernels (e.g., two layers with 3 x 3 kernels instead of one with 5 x 5) uses fewer parameters and usually works better.
Code below: implement a simple CNN to tackle the Fashion MNIST dataset
- We start by using the partial() function to define a thin wrapper around the Conv2D class, called DefaultConv2D: it simply avoids having to repeat the same hyperparameter values over and over again.
- The first layer uses a large kernel size, but no stride because the input images are not very large. It also sets input_shape=[28, 28, 1], which means the images are 28 x 28 pixels, with a single color channel
- Next, we have a max pooling layer, which divides each spatial dimension by a factor of 2
- We repeat the same structure twice: two convolutional layers followed by a max pooling layer. For larger images, we could repeat this structure several times (the number of repetitions is a hyperparameter you can tune)
- The number of filters grows as we climb up the CNN towards the output layer. It is common practice to double the number of different filters after each pooling: since a pooling layer divides each spatial dimension by a factor of 2, we can afford doubling the number of feature maps in the next layer, without fear of exploding the number of parameters, memory usage, or computational load
- Next is the fully connected network, composed of 2 hidden dense layers and a dense output layer. Note that we must flatten its inputs, since a dense network expects a 1D array of features for each instance. We also add two dropout layers, with a dropout rate of 50% each, to reduce overfitting
This CNN reaches over 92% accuracy on the test set. It's not state of the art, but it is pretty good, and clearly much better than what we achieved with dense networks.
from functools import partial

DefaultConv2D = partial(keras.layers.Conv2D,
                        kernel_size=3, activation='relu', padding="SAME")

model = keras.models.Sequential([
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(units=128, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=10, activation='softmax'),
])
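To actually train this model you would compile and fit it. Here is a minimal sketch (my own choices for the optimizer, number of epochs, and validation split; it assumes Fashion MNIST is loaded via keras.datasets and the pixel values are scaled to the 0-1 range):
# Training sketch: dataset loading and hyperparameters are illustrative assumptions
(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train = X_train[..., np.newaxis] / 255.0  # add the channel dimension and scale to [0, 1]
X_test = X_test[..., np.newaxis] / 255.0
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=10, validation_split=0.1)
model.evaluate(X_test, y_test)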
LeNet-5
The LeNet-5 architecture is perhaps the most widely known CNN architecture. As mentioned earlier, it was created by Yann LeCun in 1998 and was widely used for handwritten digit recognition. A few details worth noting:
- The average pooling layers are more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient (one per map) and adds a learnable bias term (again, one per map), then finally applies the activation function
- Most neurons in C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps)
- The output layer is a bit special: instead of computing the matrix multiplication of the inputs and the weight vector, each neuron outputs the square of the Euclidean distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. The cross entropy cost function is now preferred, as it penalizes bad predictions much more, producing larger gradients and converging faster.
AlexNet
The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge by a large margin. It is similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of one another, instead of stacking a pooling layer on top of each convolutional layer.
The authors applied two regularization techniques: first, they applied dropout; second, they performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions. AlexNet also uses a competitive normalization step, called local response normalization (LRN), immediately after the ReLU step of layers C1 and C3: the most strongly activated neurons inhibit other neurons located at the same position in the neighboring feature maps. This encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization.
Local Response Normalization
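The formula itself didn't make it into these notes; a standard way to write it (my reconstruction, using the symbols defined in the list below) is:

$$b_i = a_i \left(k + \alpha \sum_{j = j_{\text{low}}}^{j_{\text{high}}} a_j^2\right)^{-\beta} \quad \text{with} \quad j_{\text{high}} = \min\left(i + \frac{r}{2},\, f_n - 1\right), \quad j_{\text{low}} = \max\left(0,\, i - \frac{r}{2}\right)$$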
- $b_i$ is the normalized output of the neuron located in feature map $i$, at some row $u$ and column $v$ (note that in this equation we consider only neurons located at this row and column, so $u$ and $v$ are not shown)
- $a_i$ is the activation of that neuron after the ReLU step, but before normalization
- $k$, $\alpha$, $\beta$, and $r$ are hyperparameters: $k$ is called the bias, and $r$ is called the depth radius
- $f_n$ is the number of feature maps
Data Augmentation
Data augmentation artificially increases the size of the training set by generating many realistic variants of each training instance. This reduces overfitting, making it a regularization technique. The generated instances should be as realistic as possible. Moreover, simply adding white noise will not help; the modifications should be learnable (white noise is not).
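As a sketch of what this can look like in Keras (the specific transformations and ranges here are my own illustrative choices, not the book's): the ImageDataGenerator class can produce shifted, flipped, and brightness-adjusted variants of the training images on the fly.
# Illustrative augmentation setup; the ranges are arbitrary assumptions
datagen = keras.preprocessing.image.ImageDataGenerator(
    width_shift_range=0.1,         # shift images horizontally by up to 10%
    height_shift_range=0.1,        # shift images vertically by up to 10%
    horizontal_flip=True,          # randomly flip images left/right
    brightness_range=(0.8, 1.2))   # randomly change the lighting
# model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=10)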
GoogLeNet
The GoogLeNet architecture was developed by Google Research. Its great performance came in large part from the fact that the network was much deeper than previous CNNs. This was made possible by subnetworks called inception modules, which allow GoogLeNet to use parameters much more efficiently than previous architectures: GoogLeNet actually has 10 times fewer parameters than AlexNet. The image below shows the architecture of an inception module. The notation "3 x 3 + 1(S)" means that the layer uses a 3 x 3 kernel, stride 1, and SAME padding. You may wonder why the inception module contains convolutional layers with 1 x 1 kernels; they serve several purposes:
- First, although they cannot capture spatial patterns, they can capture patterns along the depth dimension
- Second, they are configured to output fewer feature maps than their inputs, so they serve as bottleneck layers, meaning they reduce dimensionality. This cuts the computational cost and the number of parameters, speeding up training and improving generalization
- Lastly, each pair of convolutional layers acts like a single, more powerful convolutional layer, capable of capturing more complex patterns. Instead of sweeping a simple linear classifier across the image (as a single convolutional layer does), this pair of convolutional layers sweeps a two-layer neural network across the image.
You can think of the whole inception module as a convolutional layer on steroids, able to output feature maps that capture complex patterns at various scales. The number of feature maps output by each convolutional layer and each pooling layer is shown before the kernel size. The architecture is so deep that it has to be represented in three columns, but GoogLeNet is actually one tall stack, including nine inception modules. The size numbers in the inception modules represent the number of feature maps output by each convolutional layer in the module. Note that all the convolutional layers use the ReLU activation function.
The network:
- The first two layers divide the image's width and height by 4, to reduce the computational load. The first layer uses a large kernel size, so that much of the information is still preserved.
- Then the local response normalization layer ensures that the previous layers learn a wide variety of features
- Two convolutional layers follow, where the first acts like a bottleneck layer. As explained earlier, you can think of this pair as a single smarter convolutional layer.
- Again, a local response normalization layer ensures that the previous layers capture a wide variety of patterns.
- Next, a max pooling layer reduces the image's width and height by 2, again to speed up computations
- The tall stack of nine inception modules, interleaved with a couple max pooling layers to reduce dimensionality and speed up the net
- The global average pooling layer simply outputs the mean of each feature map: this drops any remaining spatial information, which is fine because there was not much spatial information left at that point. GoogLeNet input images are typically expected to be 224 x 224 pixels, so after 5 max pooling layers, each dividing the width and height by 2, the feature maps are down to 7 x 7. Moreover, it is a classification task, not localization, so it does not matter where the object is. Thanks to the dimensionality reduction brought by this layer, there is no need to have several fully connected layers at the top of the CNN; this considerably reduces the number of parameters in the network and limits the risk of overfitting
- The last layers are self-explanatory: dropout for regularization, then a fully connected layer with 1,000 units, since there are 1,000 classes, and a softmax activation function to output estimated class probabilities.
VGGNet
VGGNet has a very simple architecture, with 2 or 3 convolutional layers, a pooling layer, then again 2 or 3 convolutional layers, a pooling layer, and so on (with a total of just 16 or 19 convolutional layers, depending on the variant), plus a final dense network with 2 hidden layers and the output layer. It used only 3 x 3 filters, but many of them.
ResNet
ResNet (Residual Network) is an extremely deep CNN composed of 152 layers. It confirmed the general trend: models are getting deeper and deeper, with fewer and fewer parameters. The key to being able to train such a deep network is to use skip connections (also called shortcut connections): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack. When training a neural network, the goal is to make it model a target function $h(x)$. If you add the input $x$ to the output of the network, then the network will be forced to model $f(x) = h(x) - x$ rather than $h(x)$. This is called residual learning.
When you initialize a regular neural network, its weights are close to zero, so the network just outputs values close to zero. If you add a skip connection, the resulting network just outputs a copy of its inputs; in other words, it initially models the identity function. If the target function is fairly close to the identity function (which is often the case), this will speed up training considerably.
If you add many skip connections, the network can start making progress even if several layers have not started learning yet - see the image below. Thanks to skip connections, the signal can easily make its way across the whole network. The deep residual network can be seen as a stack of residual units, where each residual unit is a small neural network with a skip connection.
The image below shows ResNet's architecture. It starts and ends like GoogLeNet (except without the dropout layer), and in between is just a very deep stack of simple residual units. Each residual unit is composed of two convolutional layers (and no pooling layer), with Batch Normalization and ReLU activation, using 3 x 3 kernels and preserving spatial dimensions.
The number of feature maps is doubled every few residual units, at the same time as their height and width are halved (using a convolutional layer with stride 2). When this happens, the inputs cannot be added directly to the outputs of the residual unit, since they don't have the same shape (this problem affects the skip connections represented by the dashed arrows in the image above). To solve this problem, the inputs are passed through a 1 x 1 convolutional layer with stride 2 and the right number of output feature maps.
Xception
Xception, which stands for Extreme Inception, is another variant of the GoogLeNet architecture, proposed in 2016. Just like Inception-v4, it merges the ideas of GoogLeNet and ResNet, but it replaces the inception modules with a special type of layer called a depthwise separable convolution (or separable convolution for short). While a regular convolutional layer uses filters that try to simultaneously capture spatial patterns and cross-channel patterns, a separable convolutional layer makes the strong assumption that spatial patterns and cross-channel patterns can be modeled separately - see image below.
Since separable convolutional layers only have one spatial filter per input channel, you should avoid using them after layers that have too few channels, such as the input layer. For this reason, the Xception architecture starts with 2 regular convolutional layers, but the rest of the architecture uses only separable convolutions, plus a few max pooling layers and the usual final layers. Separable convolutions use fewer parameters, less memory, and fewer computations than regular convolutional layers, and in general they even perform better, so you should consider using them by default.
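Keras provides this layer out of the box as keras.layers.SeparableConv2D. A minimal sketch of the pattern described above (the filter counts and input size are arbitrary choices of mine):
# Illustrative stack: regular convolutions first, separable convolutions afterwards
sep_model = keras.models.Sequential([
    keras.layers.Conv2D(32, kernel_size=3, activation="relu", padding="SAME",
                        input_shape=[224, 224, 3]),
    keras.layers.Conv2D(64, kernel_size=3, activation="relu", padding="SAME"),
    keras.layers.MaxPool2D(pool_size=2),
    keras.layers.SeparableConv2D(128, kernel_size=3, activation="relu", padding="SAME"),
])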
SENet
The Squeeze-and-Excitation Network (SENet) architecture extends existing architectures, boosting their performance. The boost comes from the fact that a SENet adds a small neural network, called an SE block, to every unit in the original architecture (every inception module or every residual unit) - as seen in the image below.
An SE block analyzes the output of the unit it is attached to, focusing exclusively on the depth dimension (it does not look for any spatial pattern), and it learns which features are usually most active together. It then uses this information to recalibrate the feature maps - as seen in the image below.
An SE block is composed of just three layers: a global average pooling layer, a hidden dense layer using the ReLU activation function, and a dense output layer using the sigmoid activation function - see the image below.
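A minimal sketch of such a block using the Keras functional API (the 16x reduction ratio in the hidden layer is a common choice but an assumption of mine, not stated above):
# Illustrative SE block; n_filters is the number of feature maps of the unit being recalibrated
def se_block(inputs, n_filters, ratio=16):
    squeeze = keras.layers.GlobalAvgPool2D()(inputs)                      # one value per feature map
    excite = keras.layers.Dense(n_filters // ratio, activation="relu")(squeeze)
    excite = keras.layers.Dense(n_filters, activation="sigmoid")(excite)  # recalibration weights in [0, 1]
    excite = keras.layers.Reshape([1, 1, n_filters])(excite)              # broadcast over spatial dimensions
    return keras.layers.Multiply()([inputs, excite])                      # scale each feature map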
Implementing a ResNet-34 CNN Using Keras
DefaultConv2D = partial(keras.layers.Conv2D, kernel_size=3, strides=1,
                        padding="SAME", use_bias=False)

class ResidualUnit(keras.layers.Layer):
    def __init__(self, filters, strides=1, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.activation = keras.activations.get(activation)
        # Main path: two 3 x 3 convolutional layers, each followed by Batch Normalization
        self.main_layers = [
            DefaultConv2D(filters, strides=strides),
            keras.layers.BatchNormalization(),
            self.activation,
            DefaultConv2D(filters),
            keras.layers.BatchNormalization()]
        # Skip path: a 1 x 1 convolution is only needed when the shape changes (strides > 1)
        self.skip_layers = []
        if strides > 1:
            self.skip_layers = [
                DefaultConv2D(filters, kernel_size=1, strides=strides),
                keras.layers.BatchNormalization()]

    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)
        skip_Z = inputs
        for layer in self.skip_layers:
            skip_Z = layer(skip_Z)
        return self.activation(Z + skip_Z)

model = keras.models.Sequential()
model.add(DefaultConv2D(64, kernel_size=7, strides=2,
                        input_shape=[224, 224, 3]))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation("relu"))
model.add(keras.layers.MaxPool2D(pool_size=3, strides=2, padding="SAME"))
prev_filters = 64
# 3 + 4 + 6 + 3 = 16 residual units, each containing 2 convolutional layers
for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
    strides = 1 if filters == prev_filters else 2  # halve spatial dims when the filter count doubles
    model.add(ResidualUnit(filters, strides=strides))
    prev_filters = filters
model.add(keras.layers.GlobalAvgPool2D())
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(10, activation="softmax"))
Using Pretrained Models from Keras
In general, you won't have to implement standard models like GoogLeNet or ResNet manually, since pretrained networks are readily available with a single line of code in the keras.applications package:
model = keras.applications.resnet50.ResNet50(weights="imagenet")
The code above creates a ResNet-50 model and downloads weights pretrained on the ImageNet dataset.
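To use it, the images must be resized to the size the model expects (224 x 224 for ResNet-50) and preprocessed with the model's own preprocess_input() function. A sketch along those lines, reusing the images array loaded earlier (note that preprocess_input() expects pixel values in the 0 to 255 range, hence the multiplication):
images_resized = tf.image.resize(images, [224, 224])          # resize to the expected input size
inputs = keras.applications.resnet50.preprocess_input(images_resized * 255)
Y_proba = model.predict(inputs)                                # one row of 1,000 class probabilities per image
top_K = keras.applications.resnet50.decode_predictions(Y_proba, top=3)
for image_index in range(len(images)):
    print("Image #{}".format(image_index))
    for class_id, name, y_proba in top_K[image_index]:
        print("  {} - {:12s} {:.2f}%".format(class_id, name, y_proba * 100))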
Classification and Localization
Localizing an object in a picture can be expressed as a regression task: to predict a bounding box around the object, a common approach is to predict the horizontal and vertical coordinates of the object's center, as well as its height and width. This means there are four numbers to predict. (The book also points to a useful paper on crowdsourcing in computer vision, relevant for labeling images with bounding boxes.) The MSE often works fairly well as a cost function to train the model, but it is not a great metric to evaluate how well the model can predict bounding boxes. The most common metric for this is the Intersection over Union (IoU): the area of overlap between the predicted bounding box and the target bounding box, divided by the area of their union - see the image below and the tf.keras.metrics.MeanIoU class.
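Here is a sketch of what such a model could look like (an illustration of my own, not the book's exact code: the Xception base, loss weights, and optimizer are arbitrary choices). The idea is to add a second dense output with four units on top of a pretrained base and train with two losses:
# Illustrative classification + localization head; n_classes and hyperparameters are assumptions
n_classes = 10
base_model = keras.applications.xception.Xception(weights="imagenet", include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
class_output = keras.layers.Dense(n_classes, activation="softmax")(avg)
loc_output = keras.layers.Dense(4)(avg)  # center x, center y, height, width
loc_model = keras.Model(inputs=base_model.input, outputs=[class_output, loc_output])
loc_model.compile(loss=["sparse_categorical_crossentropy", "mse"],
                  loss_weights=[0.8, 0.2], optimizer="nadam")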
Object Detection
The task of classifying and localizing multiple objects in an image is called object detection.
Fully Convolutional Networks (FCNs)
A fully convolutional network is a much faster way to slide a CNN across an image, and it was introduced in a 2015 paper for semantic segmentation - the task of classifying every pixel in an image according to the class of the object it belongs to. To convert a dense layer to a convolutional layer, the number of filters in the convolutional layer must be equal to the number of units in the dense layer, the filter size must be equal to the size of the input feature maps, and you must use "VALID" padding. The stride may be set to 1 or more.
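As an illustration of that conversion rule (the layer sizes are my own example): a Dense layer with 200 units sitting on top of 7 x 7 feature maps can be replaced by a Conv2D layer with 200 filters of size 7 x 7 and "VALID" padding. On a 7 x 7 input it computes the same kind of output (just shaped 1 x 1 x 200 instead of a flat vector of 200), but unlike the dense layer it can also slide over larger input images.
# Hypothetical equivalence sketch: a dense top layer vs. its convolutional counterpart
dense_top = keras.layers.Dense(200, activation="relu")   # requires flattened, fixed-size 7 x 7 inputs
conv_top = keras.layers.Conv2D(200, kernel_size=7, padding="VALID",
                               activation="relu")        # same idea, but can slide over any image size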
You Only Look Once (YOLO)
YOLO is an extremely fast and accurate object detection architecture proposed in a 2015 paper and improved in 2016 and 2018.
Semantic Segmentation
In semantic segmentation, each pixel is classified according to the class of the object it belongs to - as shown in the image below. Note that different objects of the same class are not distinguished.