Scaling Laws for Neural Language Models
I am reading this paper because it is part of the roughly 30 papers Ilya Sutskever recommended to John Carmack as covering what really matters for machine learning / AI today. The paper studies empirical scaling laws for language model performance on the cross-entropy loss.
Reference Link to PDF of Paper
0.1 Related
- Cross-Entropy Loss: In information theory, the cross-entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set when the coding scheme used for the set is optimized for an estimated probability distribution q, rather than the true distribution p.
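As a quick refresher (my own note, not from the paper), for a true distribution p and an estimated distribution q over a discrete set of events X, the cross-entropy can be written as:

```latex
H(p, q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \parallel q)
```

In language modeling, q is the model's predicted next-token distribution, so driving the cross-entropy loss down means q is becoming a better approximation of the true distribution of text p.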
0.2 Abstract
This paper studies the empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow for the determination of the optimal allocation of a fixed compute budget.
0.3 Introduction
Language provides a natural domain for the study of artificial intelligence, as the vast majority of reasoning tasks can be efficiently expressed and evaluated in language, and the world’s text provides a wealth of data for unsupervised learning via generative modeling. Deep learning has recently seen rapid progress in language modeling, with state-of-the-art models approaching human-level performance on many specific tasks. This work empirically investigates the dependence of language modeling loss on model architecture, the size of neural models, the computing power used to train them, and the data available for the training process.
The image above shows that language modeling performance improves smoothly as we increase the model size, dataset size, and the amount of compute used for training. All three of these factors should be scaled up in tandem. Model performance depends most strongly on scale: the number of model parameters N, the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, the performance depends very weakly on other architectural hyperparameters such as depth vs. width. Performance has a power-law relationship with each of the three scale factors N, D, and C when not bottlenecked by the other two.
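Concretely, "power-law relationship" means the test loss follows trends of roughly the following form when the other two factors are not a bottleneck (N_c, D_c, and C_c^min are fitted constants; the exponents are the approximate values reported in the paper):

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095
L(C_{\mathrm{min}}) \approx \left(\frac{C_c^{\mathrm{min}}}{C_{\mathrm{min}}}\right)^{\alpha_C^{\mathrm{min}}}, \qquad \alpha_C^{\mathrm{min}} \approx 0.050
```

The small exponents are why the trends span so many orders of magnitude: each constant factor of improvement in loss requires a multiplicative increase in parameters, data, or compute.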
- Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem.
- Universality of training: Training curves follow predictable power-laws whose parameters are roughly independent of the model size (see the sketch after this list).
- Transfer: Evaluating on a distribution different from the one the model was trained on incurs a constant penalty, but performance otherwise improves roughly in line with performance on the training set.
- Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps.
- Optimal performance is found by training very large models and stopping significantly short of convergence.
- The ideal batch size for training these models is roughly a power of the loss only.
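As a minimal sketch of what fitting one of these power laws looks like in practice (my own illustration with made-up numbers, not code or data from the paper), a trend of the form L(C) = (C_c / C)^alpha_C becomes a straight line in log-log space:

```python
import numpy as np

# Hypothetical (compute, loss) measurements -- illustrative values only,
# not data from the paper.
compute = np.array([1e3, 1e4, 1e5, 1e6, 1e7])   # arbitrary compute units
loss    = np.array([5.2, 4.6, 4.1, 3.7, 3.3])   # cross-entropy loss

# A power law L(C) = (C_c / C)^alpha_C is linear in log-log space:
#   log L = alpha_C * log C_c - alpha_C * log C
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)

alpha_C = -slope                       # power-law exponent
C_c     = np.exp(intercept / alpha_C)  # scale constant

print(f"alpha_C ~= {alpha_C:.3f}, C_c ~= {C_c:.3g}")

# Extrapolate the fitted trend to a larger compute budget.
predicted = (C_c / 1e9) ** alpha_C
print(f"predicted loss at C = 1e9: {predicted:.2f}")
```

The same log-log fit applies to the trends in N and D; the paper's point is that these straight-line fits hold over many orders of magnitude, which is what makes extrapolation like the last two lines plausible.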
The image above shows:
- Larger models require fewer samples to reach the same performance