Various NLP Tasks
Why Create This Page
I created this page to test some NLP tasks that commonly need to be performed when analyzing text-based user-generated content (or any other text content). These tasks include:
- Fill-Mask Tasks
- Question Answering Tasks
- Sentiment Analysis Tasks
- Named Entity Recognition Tasks
- Classification Tasks
- Translation Tasks
- Table Question Answering Tasks
- Feature Extraction Tasks
- Text Generation Tasks
- Summarization Tasks
- Keyword Extraction Tasks
Fill-Mask Tasks
Masked language modeling is the task of masking some of the words in a sentence and predicting which words should replace those masks. These models are useful when we want a statistical understanding of the language on which the model was trained.
Masked language models do not require labelled data. They are trained by masking a couple of words in sentences and the model is expected to guess the masked word. Models trained on fill-mask tasks can then be fine-tuned to solve different tasks, such as text classification or question answering.
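To make the training setup concrete, here is a minimal sketch of how masked training examples could be built from unlabelled text. The function name `mask_tokens`, the whitespace tokenization, and the 15% default rate are assumptions for illustration; real BERT pretraining works on subword tokens and also sometimes swaps in a random token or keeps the original instead of always inserting the mask.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels) for one sentence.

    labels holds the original word at every masked position and None
    elsewhere, so no human annotation is needed to build the target.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # hide the word from the model
            labels.append(token)       # the model must predict this word
        else:
            masked.append(token)
            labels.append(None)        # nothing to predict here
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog".split()
masked_sentence, targets = mask_tokens(sentence, mask_prob=0.3)
```

The labels come straight from the text itself, which is why this kind of training counts as self-supervised.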
Enter text into the textbox below. Replace the word you want filled in with [MASK] and the AI model will attempt to fill in the masked word. If your input does not contain [MASK], then [MASK] will be appended to the text.
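The input rule described above could be sketched as follows. The helper name `ensure_mask` is hypothetical and this is not necessarily how the form on this page is implemented; it only illustrates the "append [MASK] if missing" behaviour.

```python
MASK_TOKEN = "[MASK]"


def ensure_mask(text, mask_token=MASK_TOKEN):
    """Return text unchanged if it already contains the mask token,
    otherwise append the mask token to the end."""
    text = text.strip()
    if mask_token in text:
        return text
    # No mask supplied: ask the model to predict the next word.
    return f"{text} {mask_token}".strip()
```

For example, `ensure_mask("The sky is")` yields `"The sky is [MASK]"`, while text that already contains `[MASK]` passes through untouched.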
I am using the BERT base model (cased) for fill-mask tasks. It is a model pretrained on English text using a masked language modeling (MLM) objective. It was introduced in the original BERT paper and first released in the accompanying repository.
BERT is a Transformer model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on raw text only, with no humans labelling it in any way, using an automatic process to generate inputs and labels from that text (which is why it can use lots of publicly available data). More precisely, it was pretrained with two objectives:
- Masked Language Modeling
- Next Sentence Prediction (NSP)
Question Answering Tasks
Question Answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document. Some question answering models can generate answers without context!
You can use Question Answering models to automate the response to frequently asked questions by using a knowledge base (documents) as context.
Task Variants:
- Extractive QA: The model extracts the answer from the context. This is usually solved with BERT-like models.
- Open Generative QA: The model generates free text based on the context.
- Closed Generative QA: No context is provided; the answer is completely generated.
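For the extractive variant, BERT-like models typically score every token as a possible answer start and answer end, and the answer is the highest-scoring span. Below is a minimal sketch of that selection step; the function name `best_span`, the whitespace tokens, and the scores are made up for illustration and do not come from a real model.

```python
def best_span(tokens, start_scores, end_scores, max_answer_len=15):
    """Pick the span (i..j, i <= j) that maximizes start + end score."""
    best_score, best_i, best_j = float("-inf"), 0, 0
    for i, start in enumerate(start_scores):
        # Only consider ends at or after the start, within the length cap.
        for j in range(i, min(i + max_answer_len, len(tokens))):
            score = start + end_scores[j]
            if score > best_score:
                best_score, best_i, best_j = score, i, j
    return " ".join(tokens[best_i:best_j + 1])


# Hypothetical scores for the question "Who released BERT?"
context = ["BERT", "was", "released", "by", "Google", "in", "2018"]
start = [0.1, 0.0, 0.0, 0.2, 2.5, 0.1, 1.0]
end = [0.0, 0.0, 0.1, 0.0, 2.0, 0.1, 1.5]
answer = best_span(context, start, end)  # "Google"
```

Because the answer is always a slice of the context, an extractive model cannot produce words that do not appear in the document.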
I use the distilbert/distilbert-base-cased-squad model for question answering. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
Upload a text document or insert text into the textbox below, then ask a question about it using the textarea input to get answers.
Sentiment Analysis Tasks
Sentiment Analysis (also known as opinion mining or emotion AI) is the use of natural language processing, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models, such as RoBERTa, more difficult domains can be analyzed.
I am using the SamLowe/roberta-base-go_emotions model from Hugging Face for multi-label classification. go_emotions is based on Reddit data and has 28 labels. It is a multi-label dataset where one or multiple labels may apply to any given input text, so this model is a multi-label classification model with 28 'probability' float outputs for any given input text.
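In a multi-label setup, each label gets its own independent probability, so the outputs do not need to sum to 1 and several labels can pass the cutoff at once. A minimal sketch, assuming the model returns one raw score (logit) per label; the function name, the 0.5 threshold, and the example logits are assumptions, and only four of the 28 labels are shown.

```python
import math


def sigmoid(x):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_labels(logits, labels, threshold=0.5):
    """Return (label, probability) pairs at or above the threshold,
    highest probability first."""
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    picked = [(label, p) for label, p in scored if p >= threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


labels = ["admiration", "joy", "anger", "neutral"]
logits = [2.0, 0.5, -3.0, -1.0]  # made-up scores, not real model output
emotions = select_labels(logits, labels)
```

Here both "admiration" and "joy" are returned, which a single-label softmax classifier could not express.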
Enter text into the textarea below OR upload a document, submit the form, and an analysis of the sentiment of the document will be returned.
Named Entity Recognition Tasks
Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, monetary values, percentages, etc.
NER systems have been created that use linguistic grammar-based techniques as well as statistical models such as machine learning.
I am using the dslim/bert-base-NER model for named entity recognition. This model is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance for the NER task. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).
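Token-classification models like this one usually emit one tag per token in a B-/I-/O scheme (for example B-PER for the first token of a person name and I-PER for its continuation), so the tokens have to be grouped back into whole entities. A minimal sketch of that grouping step, assuming whitespace tokens and hand-written tags rather than real model output; `group_entities` is a hypothetical helper name.

```python
def group_entities(tokens, tags):
    """Merge per-token B-/I-/O tags into (entity_text, entity_type) pairs."""
    entities, words, current_type = [], [], None

    def flush():
        # Close the entity being built, if any.
        if words:
            entities.append((" ".join(words), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and ent_type != current_type):
            flush()  # a new entity starts here
            words, current_type = [token], ent_type
        elif prefix == "I":
            words.append(token)  # continue the current entity
        else:  # "O": token is outside any entity
            flush()
            words, current_type = [], None
    flush()
    return entities


tokens = ["Angela", "Merkel", "visited", "Paris"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
found = group_entities(tokens, tags)
```

With the tags above, the two name tokens are merged into a single PER entity and "Paris" becomes a LOC entity.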
See Types of Entities Below:
Upload a document or input text into the textbox below and submit the form to see the results of named entity recognition.
Classification Tasks
Text Classification is the task of assigning a label or class to a given text. Some use cases are sentiment analysis, natural language inference, and assessing grammatical correctness.
Use cases include:
- NLI (Natural Language Inference)
- MNLI (Multi-Genre Natural Language Inference)
- QNLI (Question Natural Language Inference)
- Sentiment Analysis
- Grammatical Correctness
I am using the facebook/bart-large-mnli model for classification tasks.
Yin et al. proposed a method for using pre-trained NLI models as ready-made zero-shot sequence classifiers. The method works by posing the sequence to be classified as the NLI premise and constructing a hypothesis from each candidate label. For example, if we want to evaluate whether a sequence belongs to the class "politics", we could construct the hypothesis "This text is about politics." The model's entailment score for each premise and hypothesis pair is then read as the score for that label.
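The hypothesis construction and the final scoring step can be sketched without a model. In the sketch below, `build_hypotheses` and `normalize_entailment` are hypothetical helper names, and the entailment scores are invented numbers standing in for what an NLI model would return for each premise and hypothesis pair.

```python
import math


def build_hypotheses(labels, template="This text is about {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return {label: template.format(label) for label in labels}


def normalize_entailment(entailment_logits):
    """Softmax over the per-label entailment scores so they sum to 1."""
    peak = max(entailment_logits.values())  # subtract max for stability
    exps = {k: math.exp(v - peak) for k, v in entailment_logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


candidate_labels = ["politics", "sports", "cooking"]
hypotheses = build_hypotheses(candidate_labels)

# Made-up entailment scores; a real run would score each
# (premise, hypothesis) pair with the NLI model.
fake_logits = {"politics": 3.0, "sports": -1.0, "cooking": -2.0}
likelihoods = normalize_entailment(fake_logits)
```

Normalizing across labels like this assumes exactly one label applies; when several labels may apply at once, each label would instead be scored on its own.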
Upload a document or insert text that is to be classified using the textarea below. Enter the candidate labels - the labels that you want the text to be labeled as - in the textarea below as a comma separated list. Submit the form, and receive the likelihood that the text corresponds to each class.
Translation Tasks
Neural Machine Translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modelling entire sentences in a single integrated model.
It is the dominant approach today and can produce translations that rival human translations when translating between high-resource languages under specific conditions. However, there still remain challenges, especially with languages where less high-quality data is available.
I don't demonstrate a translation model here, but I have used some before and I recommend Google's Cloud Translation API.
Table Question Answering Tasks
Table QA is the task of answering a question about information in a given table.
Use Cases:
- SQL Execution
- Table Question Answering
I have not completed Table Question Answering yet, but I hope to create a project that involves table question answering with the Census dataset, similar to Census GPT.
Feature Extraction Tasks
Feature extraction is a process in machine learning and data analysis that involves identifying and extracting relevant features from raw data. These features are later used to create a more informative dataset, which can be further utilized for various tasks such as classification, prediction, and clustering. Feature extraction aims to reduce data complexity while retaining as much relevant information as possible.
While I have come across many different definitions of the term feature extraction, I think it is most commonly used to denote vectorization - the translation from characters or bytes into vectors with semantic / relative meaning.
I currently use OpenAI's vector embedding API to generate feature vectors for text. I will not include an example here, but you can use the API to generate vectors of variable length for text. These vectors can be used for semantic search, recommendations, data visualization, as a text feature encoder for ML models, and for zero-shot classification. Note: I chose to use this embedding model throughout the site since I don't have a lot of traffic and the API is cheap; if I had more traffic and didn't need great performance, I might choose a different model.
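Once text has been turned into vectors, semantic search usually comes down to comparing vectors with a similarity measure such as cosine similarity. The sketch below uses tiny hand-written 3-dimensional vectors instead of real API output, and the function names are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document names ordered from most to least similar."""
    return sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )


# Toy vectors standing in for real embeddings.
docs = {
    "cats": [1.0, 0.0, 0.0],
    "dogs": [0.8, 0.6, 0.0],
    "taxes": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity([1.0, 0.1, 0.0], docs)
```

Real embeddings have hundreds or thousands of dimensions, but the comparison step is the same.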
Enter some text (between 10 and 200 characters) into the input below, and see the result of OpenAI's feature extraction by submitting the form.
Text Generation Tasks
Generating text is the task of generating new text given another text. These models can, for example, fill in incomplete text or paraphrase.
I currently use OpenAI's Text Generation API for text generation tasks. APIs from large AI labs are currently cheaper than renting a GPU and running a large model on your own server. You can see a demonstration of the current chat model on the chat page.
Summarization Tasks
Transformer models can be used to condense long documents into summaries, a task known as text summarization. This is one of the most challenging NLP tasks as it requires a range of abilities, such as understanding long passages and generating coherent text that captures the main topics in a document.
I currently use the facebook/bart-large-cnn model for summarization tasks. It is a BART model pre-trained on English text and fine-tuned on CNN Daily Mail. It was introduced in the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is particularly effective when fine-tuned for text generation (e.g., summarization, translation) but also works well for comprehension tasks (e.g., text classification, question answering).
Upload a document or insert text that is to be summarized using the textarea below. Enter the max length and min length that the summary should be and submit the form to receive the summary.
Keyword Extraction Tasks
Keyword extraction is tasked with the automatic identification of terms that best describe the subject of a document.
Key phrases, key segments, or just keywords are the terms used for the words that represent the most relevant information contained in the document. Although the terminology differs, the function is the same: characterizing the topic discussed in a document.
I use the Voicelab/vlt5-base-keywords model for keyword extraction. This model generates approximately 3-5 keywords per paragraph of text. It was trained on scientific papers (POSMAC corpus). Longer pieces of text must be split into smaller chunks (1,500 characters).
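Splitting longer text into pieces of at most 1,500 characters could look like the sketch below. The helper name `chunk_text` and the choice to break only at whitespace are assumptions; this is not necessarily how this page splits its input.

```python
def chunk_text(text, max_chars=1500):
    """Split text into chunks of at most max_chars characters,
    breaking only at whitespace so words stay intact.

    A single word longer than max_chars is kept whole in its own
    (oversized) chunk rather than being cut.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate      # the word still fits
        else:
            chunks.append(current)   # close the full chunk
            current = word           # start the next one
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the keyword model separately and the resulting keywords merged.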
The vlT5 model is a keyword generation model based on the encoder-decoder architecture using Transformer blocks presented by Google. vlT5 was trained on a corpus of scientific articles to predict a given set of keyphrases based on the concatenation of the article's abstract and title. It generates precise, yet not always complete, keyphrases that describe the content of the article based only on the abstract.
Upload a document or insert text into the input below and submit the form to generate keywords for the text input. The number of keywords returned depends on the amount of text in the document / input. I think there will be about 3-5 keywords per 1,500 characters.
OpenAI Moderation API
The moderations endpoint is a tool you can use to check whether text or images are potentially harmful. Once harmful content is identified, developers can take corrective action such as filtering content or intervening with the user accounts that created the offending content. The moderation endpoint is free to use.
The models available for this endpoint are:
- omni-moderation-latest: This model and all snapshots support more categorization options and multi-modal inputs.
- text-moderation-latest: Older model that supports only text inputs and fewer input categorizations. The newer omni-moderation models will be the best choice for new applications.
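Acting on a moderation result mostly means reading which categories were flagged. The sketch below works on a hand-written dictionary that mimics the general shape of a moderation response as I understand it (a results list whose entries carry a flagged boolean and a per-category map); the values are invented, no API call is made, and `flagged_categories` is a hypothetical helper.

```python
def flagged_categories(response):
    """Return the sorted names of categories marked as violated
    in the first result, or an empty list if nothing was flagged."""
    result = response["results"][0]
    if not result["flagged"]:
        return []
    return sorted(name for name, hit in result["categories"].items() if hit)


# Hand-written stand-in for a moderation response, not real API output.
sample_response = {
    "results": [
        {
            "flagged": True,
            "categories": {"harassment": True, "hate": False, "violence": False},
            "category_scores": {"harassment": 0.91, "hate": 0.02, "violence": 0.01},
        }
    ]
}
violations = flagged_categories(sample_response)
```

A site could use the returned list to decide whether to hide a comment or send it for manual review.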
Upload a document or insert some text into the textarea below and submit the form to see what the free OpenAI moderations API says about the content.