Scaling to very very large corpora for natural language disambiguation
Looking for more research papers to read, I scanned my Hands-On Machine Learning notes for the papers referenced there; this is one of them. These papers are mainly on machine learning and deep learning topics.
Reference: Scaling to Very Very Large Corpora for Natural Language Disambiguation (paper)
At the time of this paper, the internet contained hundreds of billions of words, yet most NLP systems were trained on corpora of one million words or fewer. The paper studies the effect of training-set size on machine learning for natural language disambiguation: it evaluates the performance of several learning methods on a prototypical disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than had ever been used before.
Confusion set disambiguation is the problem of choosing the correct use of a word, given a set of words with which it is commonly confused. Example confusion sets include { principle, principal }, { then, than }, { to, too, two }, and { weather, whether }. Numerous methods have been proposed for confusable disambiguation; this work was partly motivated by the desire to build an improved grammar checker. The experiments showed that the learning curves (accuracy) were log-linear with respect to the number of words trained on, and none of the learners tested came close to asymptoting in performance at the training corpus sizes commonly used in the field. These results suggest reconsidering the trade-off between spending time and money on algorithm development versus spending it on corpus development.
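To make the task concrete, here is a minimal sketch of confusion set disambiguation as a classification problem. The confusion set, the toy training sentences, and the bag-of-words feature scheme are all made up for illustration; the paper's actual learners and features differ. This just shows the shape of the problem: given the context around a blank, pick the confusion-set member that fits.

```python
import math
from collections import Counter, defaultdict

# Toy confusion set; "___" marks the position of the confusable word.
CONFUSION_SET = ("then", "than")

# Hypothetical labeled contexts, standing in for a real training corpus.
train = [
    ("better ___ ever", "than"),
    ("more ___ enough", "than"),
    ("bigger ___ before", "than"),
    ("and ___ we left", "then"),
    ("back ___ it was cheap", "then"),
    ("first eat ___ sleep", "then"),
]

def features(context):
    """Use the surrounding words as simple bag-of-words features."""
    return [w for w in context.split() if w != "___"]

# Train a tiny Naive Bayes classifier: count feature occurrences per class.
class_counts = Counter()
feat_counts = defaultdict(Counter)
vocab = set()
for context, label in train:
    class_counts[label] += 1
    for f in features(context):
        feat_counts[label][f] += 1
        vocab.add(f)

def predict(context):
    """Score each candidate with add-one-smoothed log probabilities."""
    best, best_score = None, float("-inf")
    total = sum(class_counts.values())
    for label in CONFUSION_SET:
        score = math.log(class_counts[label] / total)
        denom = sum(feat_counts[label].values()) + len(vocab)
        for f in features(context):
            score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

print(predict("stronger ___ before"))   # → than
print(predict("and ___ we left early"))  # → then
```

The paper's point is that for learners of roughly this kind, accuracy kept climbing log-linearly as the training corpus grew far beyond the sizes shown here.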
Voting has been shown to be an effective technique for improving classifier accuracy in many applications; it can reduce both the bias of a particular training corpus and the bias of a specific learner. In this paper, however, larger training sets decreased the effectiveness of voting, and in some instances voting even hurt accuracy. Active learning involves intelligently selecting a portion of samples for annotation from a pool of as-yet unannotated samples. By concentrating human annotation effort on the samples of greatest utility to the learning algorithm, it may be possible to attain better performance for a fixed annotation cost than if samples were chosen randomly. The paper shows that it is possible to benefit from the availability of extremely large corpora without incurring the full cost of annotation, training time, and representation size.
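The active-learning idea above can be sketched as pool-based uncertainty sampling. The confidence function below is a hypothetical stand-in (any classifier that reports a probability for its prediction would slot into the same loop), and the numeric "pool" is a placeholder for the paper's unannotated sentences.

```python
def model_confidence(sample):
    """Hypothetical model: confidence in [0.5, 1.0] for its prediction.
    Here confidence is lowest for samples near zero, purely for illustration."""
    return 0.5 + abs(sample) / 200.0

# Unlabeled pool; in the paper's setting these would be raw sentences.
pool = list(range(-100, 100))

def select_for_annotation(pool, k):
    """Pick the k samples the model is least sure about; annotating these
    should pay off more than annotating randomly chosen samples."""
    return sorted(pool, key=model_confidence)[:k]

print(select_for_annotation(pool, 5))  # samples nearest 0, where confidence is lowest
```

In each round, the selected samples would be labeled by a human, added to the training set, and the model retrained, which is how one can tap a very large corpus without paying to annotate all of it.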