Retrieval Augmented Generation
I want to learn about Retrieval Augmented Generation.
References
- Retrieval-Augmented Generation Wikipedia Article
- What is Retrieval Augmented Generation Google Cloud Article
- What is Retrieval Augmented Generation NVIDIA Article
- What is Retrieval Augmented Generation AWS Article
Notes
Wikipedia
Retrieval Augmented Generation (RAG) is a technique that grants generative artificial intelligence models information retrieval capabilities. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to augment information drawn from its own vast, static training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or giving factual information only from an authoritative source.
The RAG process is made up of four key stages. First, all the data must be prepared and indexed for use by the LLM. Thereafter, each query consists of a retrieval, an augmentation, and a generation phase; a minimal end-to-end sketch follows the stage list below.
- Indexing
- Typically, the data to be referenced is converted into LLM embeddings. These embeddings are stored in a vector database for document retrieval.
- Retrieval
- Given a user query, a document retriever is first called to select the most relevant documents that will be used to augment the query. How the comparison is done depends on the type of indexing being used.
- Augmentation
- The model feeds retrieved information into the LLM via prompt engineering of the user's original query.
- Generation
- Finally, the LLM can generate output based on both the query and the retrieved documents.
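The four stages can be condensed into a few dozen lines. Below is a minimal, self-contained sketch of the pipeline; `embed` is a stand-in for a real embedding model (the letter-frequency vector is purely illustrative), and the final generation call is stubbed out as a print:

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in embedding: letter-frequency counts. A real system would
    # call an embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Indexing: embed each document and store the vectors.
documents = [
    "RAG augments prompts with retrieved documents.",
    "Vector databases store embeddings for retrieval.",
    "LLMs are trained on large static corpora.",
]
index = [(doc, embed(doc)) for doc in documents]

# 2. Retrieval: select the documents most similar to the query.
query = "How does RAG use a vector database?"
query_vec = embed(query)
top = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)[:2]

# 3. Augmentation: splice the retrieved text into the prompt.
context = "\n".join(doc for doc, _ in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# 4. Generation: hand the augmented prompt to an LLM (stubbed as a print).
print(prompt)
```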
Improvements
Improvements to the basic process can be applied at different stages in the RAG flow:
- Encoder
- Improving how accurately the embeddings capture the meaning of the data can improve RAG, as can improving how queries are compared to stored embeddings.
- Retriever-centric methods
- These methods improve the quality of hits from the vector database:
- pre-train the retriever using the Inverse Cloze Task
- progressive data augmentation
- re-ranking retrieved results
- Language model
- By redesigning the language model with the retriever in mind, a network 25 times smaller can achieve perplexity comparable to that of much larger counterparts.
- Chunking
- Chunking involves various strategies for breaking up data into chunks that are embedded as vectors, so that the retriever can find details in them.
- Types of chunking include:
- Fixed length with overlap (a sketch of this strategy follows the list)
- syntax-based chunks
- File format-based chunking. Certain file types have natural chunks built in, and it's best to respect them. HTML files should leave <table> elements or base64-encoded <img> elements intact.
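Of these strategies, fixed-length chunking with overlap is the simplest to show in code. A rough sketch with illustrative size and overlap values; real systems usually split on token or sentence boundaries rather than raw characters:

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Each chunk starts (size - overlap) characters after the previous
    # one, so adjacent chunks share `overlap` characters of context.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "word " * 100
for chunk in chunk_fixed(doc, size=60, overlap=15):
    print(repr(chunk))
```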
Google
RAG (Retrieval Augmented Generation) is an AI framework that combines the strengths of traditional information retrieval systems (such as search and databases) with the capabilities of generative Large Language Models (LLMs). By combining your data and world knowledge with LLM language skills, grounded generation is more accurate, up-to-date, and relevant to your specific needs.
RAG operates with a few main steps to help enhance generative AI outputs:
- Retrieval and pre-processing: RAGs leverage powerful search algorithms to query external data, such as web pages, knowledge bases, and databases. Once retrieved, the relevant information undergoes pre-processing, including tokenization, stemming, and removal of stop words (a pre-processing sketch follows this list).
- Grounded generation: The pre-processed retrieved information is then seamlessly incorporated into the pre-trained LLM. This integration enhances the LLM's context, providing it with a more comprehensive understanding of the topic. This augmented context enables the LLM to generate more precise, informative, and engaging responses.
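The pre-processing steps named above are easy to sketch. The stop-word list and suffix-stripping stemmer below are deliberately tiny stand-ins; a production system would use a proper stemmer such as Porter's algorithm and a full stop-word list:

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}  # tiny illustrative set

def naive_stem(token: str) -> str:
    # Crude suffix stripping, standing in for a real stemming algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [naive_stem(t) for t in tokens]               # stemming

print(preprocess("The retrieved pages are processed before indexing"))
# -> ['retriev', 'pag', 'process', 'before', 'index']
```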
Why use RAG?
- Access to fresh information
- LLMs are limited to their pre-training data, which can lead to outdated and potentially inaccurate responses. RAG overcomes this by providing up-to-date information to LLMs.
- Factual Grounding
- LLMs can sometimes struggle with factual accuracy. Providing facts to the LLM as part of the input prompt can mitigate generative AI hallucinations. The crux of this approach is ensuring that the most relevant facts are provided to the LLM, and that the LLM output is entirely grounded on those facts while also answering the user's question and adhering to system instructions and safety constraints.
- Search with Vector Databases and Relevancy Re-Rankers
- RAG systems usually retrieve facts via search, and modern search engines now leverage vector databases to efficiently retrieve relevant documents; relevancy re-rankers then rescore the candidates (a two-stage sketch follows).
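A two-stage design like the one described here can be sketched as follows: a cheap vector search narrows the index to candidates, then a re-ranker rescores them. `score_relevance` is a hypothetical stand-in for a real cross-encoder re-ranking model, and the toy vectors in the demo are hand-made:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def vector_search(query_vec, index, k=10):
    # Stage 1: cheap similarity search across the whole index.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def score_relevance(query: str, doc: str) -> float:
    # Hypothetical re-ranker: a real system would run a cross-encoder over
    # (query, doc). Here: the fraction of query words appearing in the doc.
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

def retrieve(query, query_vec, index, k=10, final_k=3):
    candidates = vector_search(query_vec, index, k)
    # Stage 2: slower but more accurate re-ranking of the candidates.
    return sorted(candidates, key=lambda d: score_relevance(query, d), reverse=True)[:final_k]

index = [
    ("Refunds are issued within 30 days.", [1.0, 0.0]),
    ("Shipping takes one week.", [0.0, 1.0]),
    ("Store credit is offered after 30 days.", [0.8, 0.2]),
]
print(retrieve("refund within 30 days", [1.0, 0.1], index, final_k=2))
```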
AWS
Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response. Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model. It is a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.
RAG technology brings several benefits to an organization's generative AI efforts:
- Cost-effective implementation
- Current information
- Enhanced user trust
- More developer control
How does RAG work?
- Create external data
- Retrieve relevant information
- Augment the LLM prompt
- The RAG model augments the user input by adding the relevant retrieved data in context. This step uses prompt engineering techniques to communicate effectively with the LLM (a sketch of this step follows the list).
- Update external data
- To maintain current information for retrieval, asynchronously update the documents and their embedding representations. You can do this through automated real-time processes or periodic batch processing.
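The prompt-augmentation step above amounts to string templating around the retrieved passages. A minimal sketch; the template wording is illustrative, not a prescribed format:

```python
def augment_prompt(user_query: str, retrieved: list[str]) -> str:
    # Number the retrieved passages so the model (and a human reader)
    # can trace which passage grounds which part of the answer.
    context = "\n\n".join(f"[{i + 1}] {passage}" for i, passage in enumerate(retrieved))
    return (
        "Use only the numbered context passages below to answer. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )

print(augment_prompt(
    "What is our refund policy?",
    ["Refunds are issued within 30 days of purchase.",
     "Store credit is offered after 30 days."],
))
```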
NVIDIA
Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.
Patrick Lewis, lead author of the 2020 paper that coined the term, apologized for the unflattering acronym that now describes a growing family of methods across hundreds of papers and dozens of commercial services he believes represent the future of generative AI.