Deep Computer Vision Using Convolutional Neural Networks Exercises Answers

This chapter covers what CNNs are and reviews several CNN architectures.

Exercise 1

What are the advantages of a CNN over a fully connected DNN for image classification?

The fact that all neurons in a feature map share the same parameters dramatically reduces the number of parameters in the model. Moreover, once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location. In contrast, once a regular DNN has learned to recognize a pattern in one location, it can recognize it only in that particular location.

  • A CNN's layers are only partially connected, and each layer reuses its weights across the whole image.
  • A kernel that has learned to detect a particular feature can detect it anywhere in the image.
  • A CNN's architecture embeds this prior knowledge about images.
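
To make the parameter savings concrete, here is a minimal Keras sketch; the 150 × 100 input size and the 200 output units are made-up numbers for illustration, not values from the chapter. It compares a fully connected layer with a convolutional layer producing the same number of outputs.

```python
import tensorflow as tf

# Hypothetical 150x100 RGB input, chosen only to compare parameter counts.
inputs = tf.keras.Input(shape=(150, 100, 3))

# Fully connected: every neuron connects to every input pixel and channel.
dense = tf.keras.layers.Dense(200)(tf.keras.layers.Flatten()(inputs))
dense_model = tf.keras.Model(inputs, dense)

# Convolutional: 200 feature maps, each sharing a single small 3x3 kernel.
conv = tf.keras.layers.Conv2D(200, kernel_size=3, padding="same")(inputs)
conv_model = tf.keras.Model(inputs, conv)

print(dense_model.count_params())  # (150*100*3 + 1) * 200 = 9,000,200
print(conv_model.count_params())   # (3*3*3 + 1) * 200     = 5,600
```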

Exercise 2

Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and SAME padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels. What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?

Note: Filters are another name for convolutional kernels.

  • First convolutional layer: each 3 × 3 kernel spans the 3 RGB channels, plus a bias term: 3 × 3 × 3 + 1 = 28 parameters per feature map; with 100 feature maps: 28 × 100 = 2,800 parameters.

  • Second convolutional layer: each kernel spans the 100 feature maps below, plus a bias: 3 × 3 × 100 + 1 = 901 parameters per feature map; with 200 feature maps: 901 × 200 = 180,200 parameters.

  • Third convolutional layer: each kernel spans the 200 feature maps below, plus a bias: 3 × 3 × 200 + 1 = 1,801 parameters per feature map; with 400 feature maps: 1,801 × 400 = 720,400 parameters.

  • Total parameters: 2,800 + 180,200 + 720,400 = 903,400.

  • First convolutional layer: each feature map is 100 × 150 = 15,000 values, so its total output is 15,000 × 100 = 1,500,000 floats.

  • Second convolutional layer: each feature map is 50 × 75 = 3,750 values, so its total output is 3,750 × 200 = 750,000 floats.

  • Third convolutional layer: each feature map is 25 × 38 = 950 values, so its total output is 950 × 400 = 380,000 floats.
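
To finish the RAM estimate: when making a prediction for a single instance, a layer's output can be released as soon as the next layer has been computed, so at most two consecutive layers' outputs must be held at once, i.e. 1,500,000 + 750,000 = 2,250,000 floats = 9 MB with 32-bit (4-byte) floats, plus the 903,400 parameters (≈ 3.6 MB), for a lower bound of roughly 12.6 MB (not counting the input image itself). During training, every layer's output must be kept for the reverse pass, so a mini-batch of 50 images needs at least 50 × (1,500,000 + 750,000 + 380,000) = 131,500,000 floats ≈ 526 MB for the activations alone, on top of the inputs, parameters, and gradients. The following sketch (plain Python, no framework assumed) reproduces these numbers:

```python
from math import ceil

# Input: 200x300 RGB images; three conv layers, 3x3 kernels, stride 2, SAME padding.
channels = [3, 100, 200, 400]      # input channels followed by each layer's feature maps
height, width = 200, 300
bytes_per_float = 4                # 32-bit floats

total_params = 0
activations = []                   # number of floats output by each layer
for i in range(3):
    # Parameters: (kernel_h * kernel_w * input_channels + 1 bias) per feature map.
    total_params += (3 * 3 * channels[i] + 1) * channels[i + 1]
    # SAME padding with stride 2 halves each spatial dimension, rounding up.
    height, width = ceil(height / 2), ceil(width / 2)
    activations.append(height * width * channels[i + 1])

print("parameters:", total_params)                                 # 903,400
print("param RAM (MB):", total_params * bytes_per_float / 1e6)     # ~3.6
# Inference: only two consecutive layers need to be in RAM at the same time.
inference = max(activations[i] + activations[i + 1] for i in range(2))
print("inference activations (MB):", inference * bytes_per_float / 1e6)    # 9.0
# Training: all activations for all 50 instances must be kept for backprop.
training = 50 * sum(activations)
print("training activations (MB):", training * bytes_per_float / 1e6)      # ~526
```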

Exercise 3

If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?

If training crashes because of an out-of-memory error, you can try reducing the mini-batch size. Alternatively, you can try reducing dimensionality using a larger stride, or removing a few layers. Or you can try using 16-bit floats instead of 32-bit floats. Or you could distribute the CNN across multiple devices.
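
As an example, two of these fixes (16-bit floats and a smaller mini-batch) take only a couple of lines in Keras. This is a minimal sketch assuming TensorFlow 2.4+; the tiny model and the batch size of 16 are made-up placeholders, not recommendations:

```python
import tensorflow as tf

# 16-bit floats for activations: weights stay in float32 for stability, but the
# activations (the bulk of training memory) are stored as float16.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# A tiny stand-in model, just to make the sketch runnable.
inputs = tf.keras.Input(shape=(200, 300, 3))
x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
x = tf.keras.layers.Flatten()(x)
# Keep the output layer in float32 for numerical stability of the softmax.
outputs = tf.keras.layers.Dense(10, activation="softmax", dtype="float32")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# A smaller mini-batch directly reduces the activations held in GPU memory:
# model.fit(X_train, y_train, batch_size=16)  # instead of, say, 64
```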

Exercise 4

Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?

A max pooling layer has no parameters at all, whereas a convolutional layer with the same stride has quite a few; the pooling layer also adds some invariance to small translations and is cheaper to compute.
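
A quick way to see the difference is to count the parameters of both layers in Keras. A minimal sketch, where the 100 × 150 × 100 input shape and the 3 × 3 kernel are arbitrary choices for illustration:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(100, 150, 100))  # hypothetical 100-channel feature maps

# Max pooling with stride 2: downsamples, but has zero trainable parameters.
pool = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)
# A convolution with the same stride also downsamples, but it must learn
# (3*3*100 + 1) * 100 = 90,100 parameters to produce 100 output maps.
conv = tf.keras.layers.Conv2D(100, kernel_size=3, strides=2, padding="same")

print(tf.keras.Model(inputs, pool(inputs)).count_params())  # 0
print(tf.keras.Model(inputs, conv(inputs)).count_params())  # 90,100
```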

Exercise 5

When would you want to add a local response normalization layer?

You would add a local response normalization layer when you want the model to generalize better. In local response normalization, the most strongly activated neurons inhibit other neurons located at the same position in neighboring feature maps. This encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, which ultimately improves generalization.
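
TensorFlow provides this operation as `tf.nn.local_response_normalization()`, and one way to use it in a Keras model is to wrap it in a `Lambda` layer. A minimal sketch; the surrounding layers and the hyperparameter values are illustrative placeholders rather than tuned settings:

```python
import tensorflow as tf

# Wrap TensorFlow's LRN op so it can be used as a Keras layer.
# The hyperparameter values below are illustrative; tune them for your model.
lrn_layer = tf.keras.layers.Lambda(
    lambda X: tf.nn.local_response_normalization(
        X, depth_radius=2, bias=1.0, alpha=0.0001, beta=0.75))

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu")(inputs)
x = lrn_layer(x)  # strong activations suppress the same position in neighboring maps
x = tf.keras.layers.MaxPooling2D(pool_size=3, strides=2)(x)
model = tf.keras.Model(inputs, x)
```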

Exercise 6

Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main innovations in GoogLeNet, ResNet, SENet and Xception?

AlexNet was the first architecture to stack convolutional layers directly on top of each other, instead of placing a pooling layer on top of every convolutional layer, and it used a competitive normalization step called local response normalization. GoogLeNet's architecture was much deeper than previous CNNs. This was made possible by sub-networks called inception modules, which allow GoogLeNet to use parameters much more efficiently than previous architectures: GoogLeNet actually has about 10 times fewer parameters than AlexNet. ResNet confirmed the general trend: models are getting deeper and deeper, with fewer and fewer parameters. The key to being able to train such a deep network is to use skip connections: the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack. ResNet is built from residual units, small neural networks containing such a skip connection. SENet's main innovation is the SE block, a small sub-network added to each unit that learns to recalibrate the feature maps, scaling each one according to how relevant it is to the current input. Finally, Xception replaces the inception modules with a special type of layer called a depthwise separable convolution.
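
Since skip connections are the heart of ResNet, here is a minimal Keras sketch of a simplified residual unit (real ResNet units also add a 1 × 1 convolution on the skip path whenever the unit changes the shape of its input, which this sketch omits):

```python
import tensorflow as tf

def residual_unit(inputs, filters):
    """A simplified residual unit: two 3x3 convolutions plus a skip connection."""
    z = tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False)(inputs)
    z = tf.keras.layers.BatchNormalization()(z)
    z = tf.keras.layers.Activation("relu")(z)
    z = tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False)(z)
    z = tf.keras.layers.BatchNormalization()(z)
    # The skip connection: add the unit's input to its output before the activation.
    z = tf.keras.layers.Add()([z, inputs])
    return tf.keras.layers.Activation("relu")(z)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_unit(inputs, filters=64)
model = tf.keras.Model(inputs, outputs)
```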

Exercise 7

What is a Fully Convolutional Network? How can you convert a dense layer into a convolutional layer?

The idea behind FCNs is to replace the dense layers at the top of a CNN with convolutional layers. To convert a dense layer to a convolutional layer, the number of filters in the convolutional layer must be equal to the number of units in the dense layer, the filter size must be equal to the size of the input feature maps, and you must use VALID padding. The stride may be set to 1 or more.
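
A minimal sketch of this conversion in Keras; the 7 × 7 × 512 feature map size and the 200 units are made-up numbers, the point being that the filter count matches the dense units, the kernel size matches the feature maps, and the padding is VALID:

```python
import tensorflow as tf

# Suppose the CNN's last feature maps are 7x7 with 512 channels,
# and the dense layer we want to replace has 200 units.
feature_maps = tf.keras.Input(shape=(7, 7, 512))

# Dense version: flatten, then a fully connected layer with 200 units.
dense_out = tf.keras.layers.Dense(200)(tf.keras.layers.Flatten()(feature_maps))

# Convolutional version: 200 filters, kernel size equal to the feature map
# size (7x7), VALID padding. Output shape is (1, 1, 200): the same 200 numbers.
conv_out = tf.keras.layers.Conv2D(200, kernel_size=7, padding="valid")(feature_maps)

# Both layers have (7*7*512 + 1) * 200 parameters; the convolutional one can
# also process larger images, outputting a grid of predictions instead of one.
print(tf.keras.Model(feature_maps, dense_out).count_params())  # 5,017,800
print(tf.keras.Model(feature_maps, conv_out).count_params())   # 5,017,800
```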

Exercise 8

What is the main technical difficulty of semantic segmentation?

Image segmentation is the task of partitioning an image into multiple segments. In semantic segmentation, all pixels that are part of the same object type get assigned to the same segment.

The main technical difficulty of semantic segmentation is that as the signal flows through a regular CNN it gradually loses spatial resolution (because of layers with strides greater than 1), which makes it hard to produce pixel-accurate segmentation maps. On top of that, it requires large amounts of pixel-level annotated data, which is time-consuming and expensive to acquire, and pixels belonging to the same class must be segmented consistently across very different scenes and conditions.