Scaling to very very large corpora for natural language disambiguation

Looking for more research papers to read, I scanned my Hands-On Machine Learning notes for the many papers that were referenced there. This is one of those papers. These papers are mainly on machine learning and deep learning topics.

Reference: Scaling to very very large corpora for natural language disambiguation (paper)

At the time of this paper, the internet consisted of hundreds of billions of words, yet most NLP methods were developed and tuned on training corpora of one million words or fewer. This paper studies how training-set size affects machine learning for natural language disambiguation. It evaluates several learning methods on a prototypical disambiguation task, confusion set disambiguation (selecting among confusable words), when they are trained on orders of magnitude more labeled data than had previously been used.

Confusion set disambiguation is the problem of choosing the correct use of a word, given a set of words with which it is commonly confused. Example confusion sets include { principle, principal }, { then, than }, { to, too, two }, and { weather, whether }. Numerous methods have been presented for confusion set disambiguation. This work was partially motivated by the desire to develop an improved grammar checker. In the experiments, the learning curves were log-linear: accuracy rose roughly linearly with the logarithm of the number of words trained on. For this problem, none of the learners tested were close to asymptoting in performance at the training corpus sizes commonly employed by the field. These results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development.
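To make the task concrete, here is a minimal sketch of a confusion set disambiguator for { then, than }. It is not one of the learners evaluated in the paper: it is a small naive Bayes classifier over nearby words with add-one smoothing, and the window size, function names, and toy training sentences are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict

CONFUSION_SET = ("then", "than")
WINDOW = 2  # hypothetical choice: look at up to 2 words on each side


def context_features(tokens, i, window=WINDOW):
    """Return the words near position i, each tagged with its relative offset."""
    feats = []
    for off in range(-window, window + 1):
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats.append(f"{off:+d}:{tokens[j]}")
    return feats


def train(sentences, confusion_set=CONFUSION_SET):
    """Count how often each context feature appears with each confusable word."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok in confusion_set:
                label_counts[tok] += 1
                for f in context_features(tokens, i):
                    feat_counts[tok][f] += 1
                    vocab.add(f)
    return label_counts, feat_counts, vocab


def predict(model, tokens, i, confusion_set=CONFUSION_SET):
    """Pick the member of the confusion set that best fits the context at i."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label in confusion_set:
        # Add-one smoothing; the extra +1 reserves mass for unseen features.
        score = math.log((label_counts[label] + 1) / (total + len(confusion_set)))
        denom = sum(feat_counts[label].values()) + len(vocab) + 1
        for f in context_features(tokens, i):
            score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best


# Toy training data, invented for this sketch. The correctly spelled word in
# ordinary text serves as the label.
TRAIN = [
    "she is taller than her brother",
    "more than ten people came",
    "better late than never",
    "we ate and then we left",
    "first mix the flour then add water",
    "back then it was different",
]

model = train(TRAIN)
tokens = "he is smarter than me".split()
print(predict(model, tokens, 3))
```

Because the label is simply the word that already appears in well-edited text, training examples for this task can be harvested from any large corpus, which is what makes it a convenient test bed for studying very large training sets.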

Voting has been shown to be an effective technique for improving classifier accuracy in many applications. It can reduce both the bias of a particular training corpus and the bias of a specific learner. In this paper, however, the effectiveness of voting decreased as the training set grew, and voting even seemed to hurt accuracy in some instances.

Active learning involves intelligently selecting a portion of samples for annotation from a pool of as-yet unannotated samples. By concentrating human annotation efforts on the samples of greatest utility to the machine learning algorithm, it may be possible to attain better performance for a fixed annotation cost than if samples were chosen randomly for human annotation. The paper shows that it is possible to benefit from the availability of extremely large corpora without incurring the full cost of annotation, training time, and representation size.
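The two ideas above, voting and active sample selection, can be sketched together. The snippet below assumes a committee of classifiers has already labeled each sample in an unannotated pool; it combines their outputs by majority vote and ranks samples by how much the committee disagrees (vote entropy), so the most contested samples go to human annotators first. This is one common way to measure sample utility, not necessarily the exact procedure used in the paper, and the function names and example votes are hypothetical.

```python
import math
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by the most committee members."""
    return Counter(votes).most_common(1)[0][0]


def vote_entropy(votes):
    """Disagreement among committee members, in bits (0 means unanimous)."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_for_annotation(pool_votes, budget):
    """Return the ids of the `budget` samples the committee disagrees on most."""
    ranked = sorted(
        pool_votes,
        key=lambda sample_id: vote_entropy(pool_votes[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical committee output: four classifiers voting on three samples.
pool_votes = {
    "s1": ["then", "then", "then", "then"],  # unanimous, little to learn
    "s2": ["then", "than", "then", "than"],  # evenly split, most informative
    "s3": ["than", "than", "than", "then"],  # mild disagreement
}

print(majority_vote(pool_votes["s3"]))
print(select_for_annotation(pool_votes, budget=2))
```

With a fixed annotation budget, samples like "s1" are skipped because every committee member already agrees on them, which is how a very large unlabeled pool can be exploited without labeling all of it.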
