ImageNet Classification with Deep Convolutional Neural Networks

I am reading this paper because it is part of the roughly 30 papers that Ilya Sutskever recommended to John Carmack to learn what really matters for machine learning / AI today. This paper describes the deep CNN approach to the ImageNet task that achieved state-of-the-art results.

Reference Link to PDF of Paper


0.1 Abstract

A deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes. On the test data, top-1 and top-5 error rates of 37.5% and 17.0% were achieved, respectively, which is considerably better than the previous state of the art. The neural network has 60 million parameters and 650,000 neurons, and consists of five convolutional layers and three fully-connected layers with a final 1000-way softmax. Training was made faster through the use of non-saturating neurons and an efficient GPU implementation of the convolution operation. Overfitting was reduced through dropout.

0.2 Notes

With enough computation and enough data, learning beats programming for complicated tasks that require the integration of many different, noisy cues. For deep neural networks to shine, they need far more labeled data and hugely more computation. Current GPUs, paired with a highly optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs. The contributions of this paper:

  • Trained one of the largest CNNs to date and achieved the best results ever reported on this dataset
  • Wrote a highly optimized GPU implementation of 2D convolution and all the other operations inherent in training CNNs
  • Introduced a number of new and unusual features which improve performance and reduce training time
  • Used several effective techniques to reduce overfitting

The final network contains five convolutional and three fully-connected layers, and this depth seems to be important: removing any convolutional layer (each of which contains no more than 1% of the model's parameters) resulted in inferior performance. The network's size is limited mainly by the amount of memory available on current GPUs and by the amount of training time the researchers were willing to tolerate.

ImageNet is a dataset of over 15 million high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon's Mechanical Turk crowd-sourcing tool. ImageNet consists of variable-resolution images, while the model requires a constant input dimensionality, so the images were downsampled to a fixed resolution. Given a rectangular image, the image was first rescaled so that the shorter side had length 256, and the central 256 x 256 patch was then cropped out of the resulting image.
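A minimal sketch of that preprocessing step in Python, assuming Pillow is available; the function name and the resampling filter are my assumptions, not details from the paper:

    from PIL import Image

    def rescale_and_center_crop(path, size=256):
        """Rescale the shorter side to `size`, then crop out the
        central size x size patch, as described in the paper."""
        img = Image.open(path).convert("RGB")
        w, h = img.size
        scale = size / min(w, h)
        img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
        w, h = img.size
        left, top = (w - size) // 2, (h - size) // 2
        return img.crop((left, top, left + size, top + size))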

The architecture of the network consists of 8 learned layers: 5 convolutional and 3 fully connected. The Rectified Linear Unit (ReLU) activation function used in this network served to decrease training time. The network was trained on multiple GPUs due to the memory limits of a single GPU. Current GPUs are particularly well-suited for cross-GPU parallelization, as they are able to read from and write to one another's memory directly, without going through host machine memory. The parallelization scheme employed put half the neurons on each GPU, and the GPUs communicated only in certain layers. ReLUs have the desirable property that they do not require input normalization to prevent them from saturating; however, adding a local response normalization scheme still aids generalization. Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map.
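The local response normalization divides each activation by a sum of squared activations over n adjacent kernel maps at the same spatial position. A rough numpy sketch using the constants the paper reports (k = 2, n = 5, alpha = 1e-4, beta = 0.75); the channel-first array layout and function name are my assumptions:

    import numpy as np

    def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
        """a: activations with shape (channels, height, width).

        Each activation is divided by a normalization term summed over
        the n neighboring channels at the same spatial position."""
        C = a.shape[0]
        b = np.empty_like(a)
        for i in range(C):
            lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
            denom = (k + alpha * np.square(a[lo:hi]).sum(axis=0)) ** beta
            b[i] = a[i] / denom
        return b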

[Figure: diagram of the network architecture, showing the split of layers across the two GPUs]

The network contains 8 layers with weights. The first five layers are convolutional, and the last three are fully connected. The output of the last fully connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. The network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution. The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU. The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully connected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully connected layer.
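A sketch of that architecture in PyTorch, collapsing the two-GPU split onto a single device (so the kernel counts below are the combined totals from both halves) and leaving the softmax to the cross-entropy loss; the padding values are my assumptions, chosen to reproduce the paper's feature-map sizes:

    import torch
    import torch.nn as nn

    class AlexNetSketch(nn.Module):
        """Single-device approximation of the paper's two-GPU network."""

        def __init__(self, num_classes=1000):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),   # conv1
                nn.ReLU(inplace=True),
                nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
                nn.MaxPool2d(kernel_size=3, stride=2),   # overlapping pooling
                nn.Conv2d(96, 256, kernel_size=5, padding=2),            # conv2
                nn.ReLU(inplace=True),
                nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(256, 384, kernel_size=3, padding=1),           # conv3
                nn.ReLU(inplace=True),
                nn.Conv2d(384, 384, kernel_size=3, padding=1),           # conv4
                nn.ReLU(inplace=True),
                nn.Conv2d(384, 256, kernel_size=3, padding=1),           # conv5
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
            )
            self.classifier = nn.Sequential(
                nn.Dropout(p=0.5),
                nn.Linear(256 * 6 * 6, 4096),     # fc6
                nn.ReLU(inplace=True),
                nn.Dropout(p=0.5),
                nn.Linear(4096, 4096),            # fc7
                nn.ReLU(inplace=True),
                nn.Linear(4096, num_classes),     # fc8, softmax applied by the loss
            )

        def forward(self, x):
            x = self.features(x)
            return self.classifier(torch.flatten(x, 1))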

The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations. They did data augmentation with horizontal reflections and by altering the intensities of the RGB channels using PCA over the training set's RGB pixel values (sketched below). They also used dropout to reduce overfitting.
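A sketch of both augmentations in numpy. The principal components and eigenvalues are assumed to be precomputed over the training set's RGB pixel values; the function names and argument layout are mine:

    import numpy as np

    def random_horizontal_flip(img, rng):
        """Reflect the image left-right with probability 0.5."""
        return img[:, ::-1] if rng.random() < 0.5 else img

    def pca_color_augment(img, eigvecs, eigvals, rng, sigma=0.1):
        """img: float array (H, W, 3). eigvecs (3, 3) holds the principal
        components of the RGB pixel values as columns, eigvals (3,) the
        corresponding eigenvalues. Adds eigvecs @ [a1*l1, a2*l2, a3*l3]
        to every pixel, with each a_i drawn from N(0, sigma^2), per the
        paper's color augmentation."""
        alphas = rng.normal(0.0, sigma, size=3)
        return img + eigvecs @ (alphas * eigvals)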

The models were trained using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The small amount of weight decay was found to be important for the model to learn.
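In PyTorch terms the configuration looks roughly like this. PyTorch's momentum formulation differs slightly from the paper's update rule, but the hyperparameters carry over; the initial learning rate of 0.01 and the divide-by-10 schedule are from the paper:

    import torch

    # `model` is assumed to be the network sketched earlier.
    # Paper's update rule: v <- 0.9*v - 0.0005*lr*w - lr*grad; w <- w + v.
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.01,          # divided by 10 when validation error stopped improving
        momentum=0.9,
        weight_decay=5e-4,
    )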
