ImageNet Classification with Deep Convolutional Neural Networks
I am reading this paper because it was recommended as part of Ilya Sutskever's list of approximately 30 papers that he recommended to John Carmack as covering what really matters for machine learning / AI today. This paper describes the deep convolutional neural network approach taken to the ImageNet task, which achieved state-of-the-art results.
Reference Link to PDF of Paper
0.1 Abstract
A deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes. On the test data, top-1 and top-5 error rates of 37.5% and 17.0% were achieved, respectively, which is considerably better than the previous state of the art. The neural network has 60 million parameters and 650,000 neurons, and consists of five convolutional layers and three fully-connected layers with a final 1000-way softmax. Training was made faster through the use of non-saturating neurons and an efficient GPU implementation of the convolution operation. Overfitting was reduced through dropout.
0.2 Notes
With enough computation and enough data, learning beats programming for complicated tasks that require the integration of many different, noisy cues. For deep neural networks to shine, they need far more labeled data and hugely more computation. Current GPUs, paired with a highly optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs. The contributions of this paper:
- Trained one of the largest CNNs to date and achieved the best results ever reported on this dataset
- Wrote a highly optimized GPU implementation of 2D convolution and all the other operations inherent in training CNNs
- New features which improve performance
- New approach to reduce overfitting
The final network contains five convolutional and three fully-connected layers, and this depth seems to be important: removing any convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in inferior performance. The network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time the researchers were willing to tolerate.
ImageNet is a dataset of over 15 million high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. ImageNet consists of variable-resolution images, while the model requires a constant input dimensionality, so the images were downsampled to a fixed resolution. Given a rectangular image, the image was first rescaled so that the shorter side had length 256, and then the central 256 × 256 patch was cropped out of the resulting image.
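The rescale-then-crop preprocessing can be sketched in a few lines. This is a minimal NumPy sketch: the helper names are mine, and nearest-neighbor resizing stands in for whatever interpolation the authors actually used.

```python
import numpy as np

def rescale_shorter_side(img, target=256):
    # Nearest-neighbor resize so the shorter side becomes `target` pixels.
    h, w = img.shape[:2]
    scale = target / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    rows = (np.arange(new_h) * h / new_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    return img[rows][:, cols]

def center_crop(img, size=256):
    # Cut out the central size x size patch.
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

img = np.zeros((375, 500, 3), dtype=np.uint8)   # a typical variable-resolution image
out = center_crop(rescale_shorter_side(img))
print(out.shape)  # (256, 256, 3)
```

Note that rescaling first (rather than cropping directly) preserves as much of the central content as possible before the fixed-size patch is taken.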
The architecture of the network consists of 8 learned layers, 5 convolutional and 3 fully connected. The Rectified Linear Unit (ReLU) activation function used in this network served to decrease training time. The network was trained on multiple GPUs due to the memory limits of a single GPU. Current GPUs are particularly well-suited for cross-GPU parallelization, as they are able to read from and write to one another’s memory directly, without going through the host machine's memory. The parallelization scheme employed put half the neurons on each GPU, and the GPUs only communicated in certain layers. ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. However, adding a local response normalization scheme still aids generalization. Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map.
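The two ideas here, the non-saturating ReLU and response normalization, can be sketched as follows. This is a minimal NumPy sketch; the constants k = 2, n = 5, alpha = 1e-4, beta = 0.75 are the values reported in the paper, but the loop-based implementation is mine.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): the gradient is 1 for any positive input, so it
    # never saturates there, unlike tanh/sigmoid whose gradients shrink
    # toward 0 for large |x|.
    return np.maximum(0.0, x)

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a: activations with shape (channels, height, width).
    # Each channel i is divided by (k + alpha * sum of squared activations
    # over the n adjacent kernel maps centered on i) ** beta.
    C = a.shape[0]
    out = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        out[i] = a[i] / denom
    return out

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # [0.  0.  0.  1.5 3. ]
```

The normalization creates a kind of competition between activations produced by different kernels at the same spatial position.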
The network contains 8 layers with weights. The first five layers are convolutional, and the last three are fully connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. The network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution. The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU. The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully-connected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.
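The objective described here, the average log-probability of the correct label under the softmax distribution, can be written down directly. A minimal NumPy sketch on a toy 3-class batch; in the real network the logits would be the 1000-way outputs of the last fully-connected layer.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max before exponentiating for numerical stability.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_log_likelihood(logits, labels):
    # Average log-probability of the correct label across the batch --
    # the quantity the network maximizes (equivalently, the negative of
    # the cross-entropy loss it minimizes).
    probs = softmax(logits)
    return np.mean(np.log(probs[np.arange(len(labels)), labels]))

logits = np.array([[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]])  # toy 3-class batch
labels = np.array([0, 1])
print(mean_log_likelihood(logits, labels))
```

The value is always non-positive and approaches 0 as the model assigns probability 1 to every correct label.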
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations. The authors did data augmentation via horizontal reflections and by altering the intensities of the RGB channels. They also used dropout to reduce overfitting.
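The horizontal-reflection half of the augmentation is trivial to sketch in NumPy. (The paper's RGB-intensity change additionally uses PCA over the set of pixel values, which is omitted here.)

```python
import numpy as np

def horizontal_flip(img):
    # Mirror the image along its width axis; the class label is unchanged,
    # so the flipped copy is a valid extra training example.
    return img[:, ::-1]

img = np.arange(24).reshape(2, 4, 3)  # toy (height, width, channels) image
flipped = horizontal_flip(img)
print(np.array_equal(horizontal_flip(flipped), img))  # True
```

Because the flip is its own inverse, applying it twice recovers the original image, which makes it easy to sanity-check.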
The models were trained using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The small amount of weight decay was found to be important for the model to learn.
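The paper's update rule folds the weight decay directly into the momentum velocity rather than into the gradient. A minimal NumPy sketch, using the paper's momentum 0.9 and weight decay 0.0005 (the function name is mine):

```python
import numpy as np

def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    # Update rule as described in the paper:
    #   v <- momentum * v - weight_decay * lr * w - lr * grad
    #   w <- w + v
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w = np.array([1.0, -2.0])      # toy weights
v = np.zeros_like(w)           # velocity starts at zero
grad = np.array([0.5, -0.5])   # toy gradient
w, v = sgd_step(w, v, grad, lr=0.01)
print(w)
```

The weight-decay term shrinks each weight toward zero by a fraction of the learning rate every step, which is why even a small value of 0.0005 has a cumulative regularizing effect over many updates.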