Retrieval Augmented Generation

I want to learn about Retrieval Augmented Generation.

Notes


Wikipedia


Retrieval Augmented Generation (RAG) is a technique that grants generative artificial intelligence models information retrieval capabilities. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to augment information drawn from its own vast, static training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or giving factual information only from an authoritative source.

The RAG process is made up of four key stages. First, all the data must be prepared and indexed for use by the LLM. Thereafter, each query consists of a retrieval, augmentation, and generation phase.

RAG Process

  1. Indexing
    1. Typically, the data to be referenced is converted into LLM embeddings. These embeddings are stored in a vector database for document retrieval.
  2. Retrieval
    1. Given a user query, a document retriever is first called to select the most relevant documents that will be used to augment the query. How the comparison is done depends on the type of indexing being used.
  3. Augmentation
    1. The retrieved information is fed into the LLM by augmenting the user's original query through prompt engineering.
  4. Generation
    1. Finally, the LLM generates output based on both the query and the retrieved documents (a minimal sketch of all four stages follows this list).
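
A minimal sketch of these four stages in Python, assuming a toy hash-based `embed` function and a placeholder where a real LLM call would go; both are illustrative stand-ins, not any particular library's API:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: hash words into a fixed-size vector.
    A real system would call an embedding model instead."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# 1. Indexing: embed each document and store the vectors.
documents = [
    "RAG combines retrieval with generation.",
    "Vector databases store document embeddings.",
    "LLMs are trained on static data.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, k: int = 2) -> list[str]:
    # 2. Retrieval: rank documents by cosine similarity to the query.
    q = embed(query)
    scored = sorted(index, key=lambda pair: float(q @ pair[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

def answer(query: str) -> str:
    # 3. Augmentation: prepend retrieved context to the user's query.
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # 4. Generation: a real system would send `prompt` to an LLM here.
    return prompt  # placeholder for llm(prompt)

print(answer("How do vector databases relate to RAG?"))
```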

Improvements

Improvements to the basic process can be applied at different stages in the RAG flow:

  1. Encoder
    1. Improving how accurately the embeddings capture the meaning of the data can improve RAG, as can improving how queries are compared to stored embeddings.
  2. Retriever-centric methods
    1. These methods improve the quality of hits from the vector database:
      1. pre-train the retriever using the Inverse Cloze Task
      2. progressive data augmentation
      3. re-ranking retrieved documents
  3. Language model
    1. By redesigning the language model with the retriever in mind, a 25-times-smaller network can achieve perplexity comparable to that of its much larger counterparts
  4. Chunking
    1. Chunking involves various strategies for breaking up data into vectors so that the retriever can find details in it.
    2. Types of Chunking include:
      1. Fixed-length chunks with overlap (a minimal chunker is sketched after this list)
      2. Syntax-based chunks (e.g., breaking text up by sentences)
      3. File format-based chunking. Certain file types have natural chunks built in, and it's best to respect them. For example, HTML files should leave <table> elements and base64-encoded <img> elements intact.
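
A minimal sketch of the first strategy, fixed-length chunking with overlap; the character-based splitting and the default sizes are illustrative choices, not a prescribed implementation:

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-length character chunks that overlap,
    so details near a boundary appear whole in at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Example: 12 characters, size 6, overlap 2 -> "abcdef", "efghij", "ijkl"
print(chunk_fixed("abcdefghijkl", size=6, overlap=2))
```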

Google


RAG (Retrieval Augmented Generation) is an AI framework that combines the strengths of traditional information retrieval systems (such as search and databases) with the capabilities of generative Large Language Models (LLMs). By combining your data and world knowledge with LLM language skills, grounded generation becomes more accurate, up-to-date, and relevant to your specific needs.

RAG operates with a few main steps to help enhance generative AI outputs:

  • Retrieval and pre-processing: RAGs leverage powerful search algorithms to query external data, such as web pages, knowledge bases, and databases. Once retrieved, the relevant information undergoes pre-processing, including tokenization, stemming, and removal of stop words (a toy pre-processing pass is sketched after this list).
  • Grounded generation: The pre-processed retrieved information is then seamlessly incorporated into the pre-trained LLM. This integration enhances the LLM's context, providing it with a more comprehensive understanding of the topic. This augmented context enables the LLM to generate more precise, informative, and engaging responses.
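
A toy sketch of the pre-processing step described above; the regex tokenizer, the tiny stop-word set, and the crude suffix-stripping "stemmer" are all illustrative stand-ins for real NLP components:

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "with"}  # tiny illustrative set

def preprocess(text: str) -> list[str]:
    """Toy pre-processing pass: tokenize, drop stop words, and crudely
    strip common suffixes as a stand-in for a real stemmer."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())        # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    stemmed = []
    for t in tokens:                                       # naive "stemming"
        for suffix in ("ing", "ed", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("Searching the databases for retrieved documents"))
# -> ['search', 'databas', 'for', 'retriev', 'document']
```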

Why use RAG?

  • Access to fresh information
    • LLMs are limited to their pre-training data, which can lead to outdated and potentially inaccurate responses. RAG overcomes this by providing up-to-date information to LLMs.
  • Factual Grounding
    • LLMs can sometimes struggle with factual accuracy. Providing facts to the LLM as part of the input prompt can mitigate generative AI hallucinations. The crux of this approach is ensuring that the most relevant facts are provided to the LLM, and that the LLM output is entirely grounded in those facts while also answering the user's question and adhering to system instructions and safety constraints.
  • Search with Vector Databases and Relevancy Re-Rankers
    • RAGs usually retrieve facts via search, and modern search engines now leverage vector databases to efficiently retrieve relevant documents.
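
A sketch of this two-stage retrieve-then-re-rank pattern, using a toy bag-of-words embedding and a word-overlap scorer as a stand-in for the cross-encoder models production re-rankers typically use:

```python
import numpy as np

def bow_embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "Vector databases index embeddings for fast similarity search.",
    "Re-rankers reorder candidate documents by relevance.",
    "RAG grounds LLM output in retrieved facts.",
]
doc_vecs = [bow_embed(d) for d in docs]

def overlap_score(query: str, doc: str) -> float:
    """Stand-in re-ranker: score by word overlap with the query.
    Production systems typically use a cross-encoder model here."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def search(query: str, k: int = 2) -> list[str]:
    # First pass: cheap vector similarity over the whole collection.
    qv = bow_embed(query)
    scored = sorted(zip(docs, doc_vecs), key=lambda p: float(qv @ p[1]), reverse=True)
    candidates = [d for d, _ in scored[:k]]
    # Second pass: re-rank the small candidate set with a finer scorer.
    return sorted(candidates, key=lambda d: overlap_score(query, d), reverse=True)

print(search("how do vector databases support similarity search"))
```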


AWS


Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response. Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model. It is a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

RAG technology brings several benefits to an organization's generative AI efforts:

  • Cost-effective implementation
  • Current information
  • Enhanced user trust
  • More developer control

How does RAG work?

  1. Create external data
  2. Retrieve relevant information
  3. Augment the LLM prompt
    1. The RAG model augments the user input by adding the relevant retrieved data in context. This step uses prompt engineering techniques to communicate effectively with the LLM.
  4. Update external data
    1. To maintain current information for retrieval, asynchronously update documents and update embedding representation of the documents. You can do this through automated real-time processes or periodic batch processing.
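
A sketch of one way to implement such an update, assuming a hypothetical `refresh_index` helper that compares content hashes so only new or modified documents are re-embedded; this shape suits a periodic batch job:

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def refresh_index(documents: dict[str, str],
                  stored_hashes: dict[str, str],
                  index: dict[str, list[float]],
                  embed) -> None:
    """Batch refresh: re-embed only documents whose content changed,
    and drop index entries for documents that were deleted."""
    for doc_id, text in documents.items():
        h = content_hash(text)
        if stored_hashes.get(doc_id) != h:       # new or modified document
            index[doc_id] = embed(text)          # upsert a fresh embedding
            stored_hashes[doc_id] = h
    for doc_id in list(index):                   # remove deleted documents
        if doc_id not in documents:
            del index[doc_id]
            stored_hashes.pop(doc_id, None)

# Usage with a placeholder embedding function:
toy_embed = lambda text: [float(len(text))]      # stand-in for a real model
docs = {"a": "old text", "b": "unchanged"}
hashes, index = {}, {}
refresh_index(docs, hashes, index, toy_embed)    # initial build
docs["a"] = "new text"                           # document edited
refresh_index(docs, hashes, index, toy_embed)    # only "a" is re-embedded
```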


NVIDIA


Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.
Patrick Lewis, lead author of the 2020 paper that coined the term, apologized for the unflattering acronym that now describes a growing family of methods across hundreds of papers and dozens of commercial services he believes represent the future of generative AI.

