Scaling Laws for Neural Language Models

I am reading this paper because it appears on the list of roughly 30 papers that Ilya Sutskever recommended to John Carmack as covering what really matters in machine learning / AI today. The paper studies empirical scaling laws for language model performance on the cross-entropy loss.

Reference Link to PDF of Paper

0.1 Definitions
  • Cross-Entropy Loss: In information theory, the cross-entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set when the coding scheme used is optimized for an estimated probability distribution q rather than the true distribution p (see the formula below).
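For reference, the cross-entropy of q relative to p (a standard definition, not specific to this paper) is

H(p, q) = -\sum_{x} p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)

In language modeling, p is the empirical distribution of the next token (one-hot for observed text) and q is the model's predicted distribution, so minimizing cross-entropy is equivalent to maximizing the log-likelihood of the data. With a base-2 logarithm the loss is measured in bits per token; with the natural logarithm it is measured in nats.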

0.2 Abstract

This paper studies the empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow for the determination of the optimal allocation of a fixed compute budget.
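Writing these power laws out (my reconstruction from the paper; the exponents are the approximate fitted values reported there, so treat them as rough):

L(N) = (N_c / N)^{\alpha_N}, \quad \alpha_N \approx 0.076 \quad (model size, trained to convergence on enough data)

L(D) = (D_c / D)^{\alpha_D}, \quad \alpha_D \approx 0.095 \quad (dataset size, large models with early stopping)

L(C_{\min}) = (C_c^{\min} / C_{\min})^{\alpha_C^{\min}}, \quad \alpha_C^{\min} \approx 0.050 \quad (optimally allocated compute)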

0.3 Introduction

Language provides a natural domain for the study of artificial intelligence, as the vast majority of reasoning tasks can be efficiently expressed and evaluated in language, and the world's text provides a wealth of data for unsupervised learning via generative modeling. Deep learning has recently seen rapid progress in language modeling, with state-of-the-art models approaching human-level performance on many specific tasks. This work empirically investigates the dependence of language modeling loss on model architecture, the size of neural models, the computing power used to train them, and the data available for the training process.

[Figure: test loss improves smoothly (as a power law) with compute, dataset size, and model size]

The image above shows that language modeling performance improves smoothly as we increase the model size, dataset size, and the amount of compute used for training. All three of these factors should be scaled up in tandem. Model performance depends most strongly on scale: the number of model parameters N, the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, the performance depends very weakly on other architectural hyperparameters such as depth vs. width. Performance has a power-law relationship with each of the three scale factors N, D, and C when not bottlenecked by the other two.
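To make the power-law claim concrete, here is a minimal sketch (my own illustration, not code from the paper; the measurements are made-up numbers) of how one could estimate the exponent alpha_N by a least-squares fit in log-log space:

# Estimating a power-law exponent alpha_N from hypothetical (model size, loss)
# measurements. L(N) = (N_c / N)^alpha_N implies
# log L = -alpha_N * log N + alpha_N * log N_c, i.e. a straight line in log N.

import numpy as np

# Hypothetical measurements: parameter counts and converged test losses.
N = np.array([1e6, 1e7, 1e8, 1e9])      # model sizes (parameters)
loss = np.array([5.2, 4.3, 3.6, 3.0])   # cross-entropy loss (nats/token)

# Fit log(loss) = a * log(N) + b; the power-law exponent is alpha_N = -a.
a, b = np.polyfit(np.log(N), np.log(loss), 1)
alpha_N = -a
N_c = np.exp(b / alpha_N)               # implied scale constant

print(f"alpha_N ~ {alpha_N:.3f}, N_c ~ {N_c:.2e}")

def predict(n):
    """Predicted loss under the fitted power law (hypothetical trend)."""
    return (N_c / n) ** alpha_N

print(f"predicted loss at 1e10 params: {predict(1e10):.2f}")

Because a power law is a straight line in log-log coordinates, a simple linear fit recovers the exponent; the paper fits analogous forms over N, D, and C directly to its training runs.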

  • Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem (the paper captures this with a joint L(N, D) equation; see the sketch after this list).
  • Universality of training: Training curves follow predictable power-laws whose parameters are roughly independent of the model size.
  • Transfer: Evaluating on a distribution different from the one the model was trained on incurs a roughly constant penalty, but otherwise improves roughly in line with performance on the training set.
  • Sample efficiency: Large models are more sample efficient than small models, reaching the same level of performance with fewer optimization steps.
  • Optimal performance is found by training very large models and stopping significantly short of convergence.
  • The ideal batch size for training these models is roughly a power of the loss only (the critical batch size; see below).
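Sketching the relevant equations from the paper (my notes; constants and exponents are the approximate values I recall from the paper's fits):

L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}

B_{\mathrm{crit}}(L) \approx \frac{B_*}{L^{1/\alpha_B}}, \quad B_* \approx 2 \times 10^8 \text{ tokens}, \quad \alpha_B \approx 0.21

N \propto C_{\min}^{0.73}, \quad B \propto C_{\min}^{0.24}, \quad S \propto C_{\min}^{0.03}

The first line is the joint model-size/dataset-size loss that makes overfitting predictable; the second is the critical batch size, which depends only on the loss L; the third is the approximate compute-optimal allocation, with most of a growing compute budget C_min going into a larger model, less into larger batches, and very little into more serial training steps.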

[Figure: learning curves for models of different sizes, illustrating sample efficiency]

The image above shows that larger models require fewer samples to reach the same level of performance.
