Investigating RAG

My initial attempt at implementing Retrieval-Augmented-Generation search for this site did not go so well, and I think it is because of the chunk size that I used to create the embeddings for site-wide semantic search. In this note, I will investigate the ideal chunk size for RAG for this site, and then hopefully implement RAG search.


References



Llama Index


LlamaIndex is the leading framework for building LLM-powered agents over your data with LLMs and workflows.

Introduction

What are Agents

Agents are LLM-powered knowledge assistants that use tools to perform tasks like research, data extraction, and more. They range from simple question-answering assistants to systems that can sense, decide, and take actions in order to complete tasks.

What are Workflows

Workflows are multi-step processes that combine one or more agents, data connectors, and other tools to complete a task. They are event-driven software that allows you to combine RAG data sources and multiple agents to create a complex application that can perform a wide variety of tasks with reflection, error correction, and other hallmarks of advanced LLM applications. You can then deploy these agentic workflows as production microservices.

What is Context Augmentation

LLMs offer a natural language interface between humans and data. LLMs come pre-trained on huge amounts of publicly available data, but they are not trained on your data. Your data may be private or specific to the problem you are trying to solve. Context augmentation makes your data available to the LLM to solve the problem at hand. LlamaIndex provides the tools to build any context-augmentation use case, from prototype to production. The most popular example of context augmentation is Retrieval-Augmented Generation, or RAG, which combines context with LLMs at inference time.

LlamaIndex Tools

  • Data connectors ingest your existing data from their native source and format. These could be APIs, PDFs, SQL and more.
  • Data indexes structure your data in intermediate representations that are easy and performant for LLMs to consume.
  • Engines provide natural language access to your data:
    • Query engines are powerful interfaces for question-answering
    • Chat engines are conversational interfaces for multi-message, back and forth interactions with your data
  • Agents are LLM-powered knowledge workers augmented by tools, from simple helper functions to API integrations and more
  • Observability/Evaluation integrations that enable you to rigorously experiment, evaluate, and monitor your app in a virtuous cycle
  • Workflows allow you to combine all of the above into an event-driven system far more flexible than other, graph-based alternatives.
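
A minimal sketch of how these pieces fit together, assuming the llama-index package is installed, an OpenAI API key is available in the environment for the default models, and a local data/ directory of files to index (the directory name and the question are placeholders):

    # Data connector: read every file in ./data into Document objects.
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    documents = SimpleDirectoryReader("data").load_data()

    # Data index: chunk the documents and embed each chunk for retrieval.
    index = VectorStoreIndex.from_documents(documents)

    # Engines: question-answering and conversational access to the same index.
    query_engine = index.as_query_engine()
    chat_engine = index.as_chat_engine()

    print(query_engine.query("What chunk size does this site use for its embeddings?"))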

Popular Use Cases

  • Question Answering
  • Chatbots
  • Document Understanding and Data Extraction
  • Autonomous Agents
  • Multi-modal applications
  • Fine-tuning

High-Level Concepts

Large Language Models

LLMs are the fundamental innovation that launched LlamaIndex. An LLM is an artificial intelligence (AI) system that can understand, generate, and manipulate natural language, including answering questions based on its training data or on data provided to it at query time.

Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is a core technique for building data-backed LLM applications with LlamaIndex. It allows LLMs to answer questions about your private data by providing it to the LLM at query time, rather than training the LLM on your data. To avoid sending all of your data to the LLM every time, RAG indexes your data and selectively sends only the relevant parts along with your query.

Agents

An agent is a piece of software that semi-autonomously performs tasks by combining LLMs with other tools.

Use Cases

The use cases for LLM applications can be roughly grouped into four categories:

  1. Structured Data Extraction: Pydantic extractors allow you to specify a precise data structure to extract from your data and use LLMs to fill in the missing pieces in a type-safe way. This is useful for extracting structured data from unstructured sources like PDFs, websites, and more, and is key to automating workflows (see the sketch after this list).
  2. Query Engines: A query engine is an end-to-end flow that allows you to ask questions over your data. It takes in a natural language query and returns a response, along with the reference context retrieved and passed to the LLM.
  3. Chat Engines: A chat engine is an end-to-end flow for having a conversation with your data.
  4. Agents: An agent is an automated decision maker powered by an LLM that interacts with the world through a set of tools. Agents can take an arbitrary number of steps to complete a given task, dynamically deciding on the best course of action rather than following pre-determined steps.
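
The sketch below makes the structured data extraction use case concrete with a small Pydantic class. It assumes the llama-index-llms-openai integration is installed and that structured_predict is available on the LLM class; the schema, model name, and invoice text are invented for illustration, and the exact API may differ between LlamaIndex versions.

    from pydantic import BaseModel
    from llama_index.core import PromptTemplate
    from llama_index.llms.openai import OpenAI  # assumes llama-index-llms-openai is installed

    class Invoice(BaseModel):
        """Target schema for the extraction; the fields are illustrative."""
        vendor: str
        total: float

    llm = OpenAI(model="gpt-4o-mini")  # model name is a placeholder
    prompt = PromptTemplate("Extract the vendor and the total from this invoice text:\n{text}")

    # The LLM fills in the Pydantic fields in a type-safe way.
    invoice = llm.structured_predict(Invoice, prompt, text="ACME Corp ... Total due: $1,250.00")
    print(invoice.vendor, invoice.total)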

The LlamaIndex ecosystem is structured using a collection of namespaced packages. What this means for users is that LlamaIndex comes with a core starter bundle, and additional integrations can be installed as needed.

Learn

Using LLMs

One of the first steps when building an LLM-based application is deciding which LLM to use; you can also use more than one if you wish. LLMs are used at multiple stages of your workflow:

  • During Indexing, you may use an LLM to determine the relevance of data (whether to index it at all) or you may use an LLM to summarize the raw data and index the summaries instead
  • During Querying, LLMs can be used in two ways:
    • During Retrieval (fetching data from your index) LLMs can be given an array of options (such as multiple different indices) and make decisions about where best to find the information you are looking for. An agentic LLM can also use tools at this stage to query different data sources.
    • During Response Synthesis (turning retrieved data into an answer), an LLM can combine answers to multiple sub-queries into a single coherent answer, or it can transform data, such as from unstructured text to JSON or another programmatic output format

By default, LlamaIndex comes with a great set of built-in, battle-tested prompts that handle the tricky work of getting a specific LLM to correctly handle and format data. This is one of the biggest benefits of using LlamaIndex. If you want to, you can customize the prompts.
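
As a sketch of both of those choices, the snippet below sets a global LLM via Settings and overrides the built-in question-answering prompt. The model name is a placeholder, and the prompt key response_synthesizer:text_qa_template is the one I believe current LlamaIndex versions expose; check query_engine.get_prompts() if it differs in yours.

    from llama_index.core import PromptTemplate, Settings, SimpleDirectoryReader, VectorStoreIndex
    from llama_index.llms.openai import OpenAI  # assumes llama-index-llms-openai is installed

    # Choose the LLM used during indexing and querying.
    Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1)

    index = VectorStoreIndex.from_documents(SimpleDirectoryReader("data").load_data())
    query_engine = index.as_query_engine()

    # Customize the prompt used at response-synthesis time.
    qa_prompt = PromptTemplate(
        "Answer the question using only the context below.\n"
        "Context:\n{context_str}\n"
        "Question: {query_str}\n"
        "Answer:"
    )
    query_engine.update_prompts({"response_synthesizer:text_qa_template": qa_prompt})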

Building a RAG Pipeline

LLMs are trained on enormous amounts of data, but they aren't trained on your data. Retrieval-Augmented Generation (RAG) solves this problem by adding your data to the data LLMs already have access to. In RAG, your data is loaded and prepared for queries, or indexed. User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.

There are five key stages of RAG, which in turn will be part of most larger applications you will build. These are:

  1. Loading: this refers to getting your data from where it lives - whether it's text files, PDFs, another website, a database or an API - into your workflow. LlamaHub has hundreds of connectors to choose from.
  2. Indexing: this means creating a data structure that allows for querying the data. For LLMs, this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data
  3. Storing: once your data is indexed you will almost always want to store your index, as well as other metadata, to avoid having to re-index it.
  4. Querying: for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries, and hybrid strategies
  5. Evaluation: a critical step in any workflow is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful, and fast your responses to queries are.

Important Concepts with RAG

  • Loading Stage
    • Nodes and Documents: A Document is a container around any data source - for instance, a PDF, an API output, or data retrieved from a database. A Node is the atomic unit of data in LlamaIndex and represents a chunk of a source Document. Nodes have metadata that relate them to the document they are in and to other nodes.
    • Connectors: A data connector (often called a Reader) ingests data from different data sources and different formats into Documents and Nodes
  • Indexing Stage
    • Indexes: Once you've ingested your data, LlamaIndex will help you index the data into a structure that's easy to retrieve. This usually involves generating vector embeddings which are stored in a specialized database called a vector store. Indexes can also store a variety of metadata about your data.
    • Embeddings: LLMs generate numerical representations of data called embeddings. When filtering your data for relevance, LlamaIndex will convert queries into embeddings, and your vector store will find data that is numerically similar to the embedding of your query
  • Querying Stage
    • Retrievers: A retriever defines how to efficiently retrieve relevant context from an index when given a query. Your retrieval strategy is key to the relevancy of the data retrieved and the efficiency with which it's done.
    • Routers: A router determines which retriever will be used to retrieve relevant context from the knowledge base. More specifically, the RouterRetriever class is responsible for selecting one or multiple candidate retrievers to execute a query. It uses a selector to choose the best option based on each candidate's metadata and the query.
    • Node Postprocessors: A node postprocessor takes in a set of retrieved nodes and applies transformations, filtering, or re-ranking logic to them
    • Response Synthesizers: A response synthesizer generates a response from an LLM, using a user query and a given set of retrieved text chunks
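
To make the Embeddings bullet above concrete, here is a small sketch that embeds a chunk of text and a query and compares them with cosine similarity. It assumes the llama-index-embeddings-huggingface integration and the BAAI/bge-small-en-v1.5 model, but any embedding model exposing the same methods would work.

    import numpy as np
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding

    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

    # Documents and queries are embedded into the same vector space.
    text_vec = np.array(embed_model.get_text_embedding(
        "RAG retrieves the most relevant chunks and passes them to the LLM."))
    query_vec = np.array(embed_model.get_query_embedding(
        "How does retrieval-augmented generation find relevant context?"))

    # Cosine similarity: closer to 1 means more semantically similar.
    similarity = float(text_vec @ query_vec / (np.linalg.norm(text_vec) * np.linalg.norm(query_vec)))
    print(f"cosine similarity: {similarity:.3f}")
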
Loading and Ingestion

Before an LLM can act on your data, you first need to process the data and load it. This has parallels to data cleaning/feature engineering pipelines in the ML world, or ETL pipelines in the traditional data setting. The ingestion pipeline typically consists of three main stages:

  1. Load the data
  2. Transform the data
  3. Index and store the data

After the data is loaded, you need to process and transform it before putting it into a storage system. These transformations include chunking, extracting metadata, and embedding each chunk. This is necessary to make sure that the data can be retrieved and used optimally by the LLM.
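
A sketch of those transformations using LlamaIndex's IngestionPipeline; the chunk size and overlap are placeholders to experiment with (chunk size being exactly the knob this note is investigating), and the embedding model assumes the llama-index-embeddings-openai integration.

    from llama_index.core import SimpleDirectoryReader
    from llama_index.core.extractors import TitleExtractor
    from llama_index.core.ingestion import IngestionPipeline
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.embeddings.openai import OpenAIEmbedding

    documents = SimpleDirectoryReader("data").load_data()

    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=512, chunk_overlap=64),  # chunking
            TitleExtractor(),                                     # metadata extraction
            OpenAIEmbedding(),                                    # embed each chunk
        ]
    )
    nodes = pipeline.run(documents=documents)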

Once your data is loaded, you have a list of Document objects (or a list of Nodes). It's time to build an Index over these objects so you can start querying them. In LlamaIndex terms, an Index is a data structure composed of Document objects, designed to enable querying by an LLM.

A VectorStoreIndex is by far the most frequent type of Index you'll encounter. The VectorStoreIndex takes your Documents and splits them up into Nodes. It then creates vector embeddings of the text of every node, ready to be queried by an LLM. Vector embeddings are central to how LLM applications function. A vector embedding, often just called an embedding, is a numerical representation of the semantics, or meaning, of your text. Two pieces of text with similar meanings will have mathematically similar embeddings, even if the actual text is quite different.

This mathematical relationship enables semantic search, where a user provides query terms and LlamaIndex can locate text that is related to the meaning of the query terms rather than relying on simple keyword matching. This is a big part of how Retrieval-Augmented Generation works, and how LLMs function in general. There are many types of embeddings, and they vary in efficiency, effectiveness, and computational cost.

The VectorStoreIndex turns all of your text into embeddings using an API from your LLM provider; this is what is meant when we say it embeds your text. When you want to search your embeddings, your query itself is turned into a vector embedding, and then a mathematical operation is carried out by the VectorStoreIndex to rank all the embeddings by how semantically similar they are to your query. Once the ranking is complete, the VectorStoreIndex returns the most-similar embeddings as their corresponding chunks of text. The number of embeddings it returns is known as k, so the parameter controlling how many embeddings to return is known as top_k. This whole type of search is often referred to as top-k semantic retrieval for this reason.
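
A sketch of top-k semantic retrieval; similarity_top_k is the top_k parameter described above, and the query string is a placeholder.

    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    index = VectorStoreIndex.from_documents(SimpleDirectoryReader("data").load_data())

    # Retrieve the k most semantically similar chunks, with their similarity scores.
    retriever = index.as_retriever(similarity_top_k=3)
    for result in retriever.retrieve("What chunk size works best for site-wide semantic search?"):
        print(round(result.score, 3), result.node.get_content()[:80])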

A Summary Index is a simpler form of Index best suited to queries where, as the name suggests, you are trying to generate a summary of the text in your Documents. It simply stores all of the Documents and returns all of them to your query engine. If your data is a set of interconnected concepts, then you may be interested in the knowledge graph index.
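
For comparison with the VectorStoreIndex, a Summary Index sketch over the same documents; tree_summarize is one of the built-in response modes.

    from llama_index.core import SimpleDirectoryReader, SummaryIndex

    documents = SimpleDirectoryReader("data").load_data()
    summary_index = SummaryIndex.from_documents(documents)

    # A summary query sends every document to the LLM and summarizes them.
    summary_engine = summary_index.as_query_engine(response_mode="tree_summarize")
    print(summary_engine.query("Summarize these notes in three sentences."))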

Once you have data loaded and indexed, you will probably want to store it to avoid the time and cost of re-indexing it. By default, your indexed data is stored only in memory.
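
A sketch of persisting an index to disk and reloading it later to avoid re-indexing; the storage directory name is arbitrary.

    from llama_index.core import (
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    index = VectorStoreIndex.from_documents(SimpleDirectoryReader("data").load_data())

    # Persist the in-memory index (vectors plus metadata) to disk.
    index.storage_context.persist(persist_dir="./storage")

    # Later: rebuild the index object from the persisted files instead of re-embedding.
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)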

At its simplest, querying is just a prompt call to an LLM: it can be a question that gets an answer, a request for summarization, or a much more complex instruction. More complex querying could involve repeated/chained prompt + LLM calls, or even a reasoning loop across multiple components.

Querying consists of three distinct stages:

  • Retrieval is when you find and return the most relevant documents for your query from your Index
  • Postprocessing is when the Nodes retrieved are optionally reranked, transformed, or filtered, for instance by requiring that they have specific metadata such as keywords attached
  • Response Synthesis is when your query, your most-relevant data and your prompt are combined and sent to your LLM to return a response.
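
The three stages can also be composed explicitly. A sketch assuming an index like the ones built above; the similarity cutoff is an arbitrary example value.

    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, get_response_synthesizer
    from llama_index.core.postprocessor import SimilarityPostprocessor
    from llama_index.core.query_engine import RetrieverQueryEngine

    index = VectorStoreIndex.from_documents(SimpleDirectoryReader("data").load_data())

    query_engine = RetrieverQueryEngine(
        retriever=index.as_retriever(similarity_top_k=10),                     # retrieval
        node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],  # postprocessing
        response_synthesizer=get_response_synthesizer(),                       # response synthesis
    )
    print(query_engine.query("Which chunk size did the first RAG attempt use?"))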

In LlamaIndex, an agent is a semi-autonomous piece of software powered by an LLM that is given a task and executes a series of steps toward solving that task. It is given a set of tools, which can be anything from arbitrary functions up to full LlamaIndex query engines, and it selects the best available tool to complete each step. When each step is completed, the agent judges whether the task is now complete, in which case it returns a result to the user, or whether it needs to take another step, in which case it loops back to the start.
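
A sketch of that loop using ReActAgent from LlamaIndex core; newer releases also offer workflow-based agents, so treat this as one possible API. The tool function, tool name, and questions are invented for illustration, and the OpenAI integrations are assumed to be installed.

    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
    from llama_index.core.agent import ReActAgent
    from llama_index.core.tools import FunctionTool, QueryEngineTool
    from llama_index.llms.openai import OpenAI

    def word_count(text: str) -> int:
        """Count the words in a piece of text."""
        return len(text.split())

    index = VectorStoreIndex.from_documents(SimpleDirectoryReader("data").load_data())

    tools = [
        FunctionTool.from_defaults(fn=word_count),
        QueryEngineTool.from_defaults(
            query_engine=index.as_query_engine(),
            name="site_notes",
            description="Answers questions about the notes on this site.",
        ),
    ]

    # The agent picks a tool at each step and loops until it judges the task complete.
    agent = ReActAgent.from_tools(tools, llm=OpenAI(model="gpt-4o-mini"), verbose=True)
    print(agent.chat("What does the RAG note say about chunk size, and how many words is your answer?"))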

Workflows Introduction

A workflow is an event-driven, step-based way to control the execution flow of an application. Your application is divided into sections called Steps, which are triggered by Events and themselves emit Events that trigger further steps. By combining steps and events, you can create arbitrarily complex flows that encapsulate logic and make your application more maintainable and easier to understand. A step can be anything from a single line of code to a complex agent. Steps can have arbitrary inputs and outputs, which are passed around by events.
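
A minimal workflow sketch with a single step, assuming a recent LlamaIndex version that ships llama_index.core.workflow; the event field and the topic value are invented for illustration.

    import asyncio
    from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step

    class EchoWorkflow(Workflow):
        """One step: triggered by the StartEvent, emits a StopEvent to end the run."""

        @step
        async def echo(self, ev: StartEvent) -> StopEvent:
            # Keyword arguments passed to run() are available on the StartEvent.
            return StopEvent(result=f"Processed topic: {ev.topic}")

    async def main():
        result = await EchoWorkflow(timeout=30).run(topic="chunk sizes for RAG")
        print(result)

    asyncio.run(main())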

As generative AI applications become more complex, it becomes harder to manage the flow of data and control the execution of the application. Workflows provide a way to manage this complexity by breaking the application into smaller, more manageable pieces.


GitHub Repository on RAG Techniques


Retrieval-Augmented Generation is revolutionizing the way we combine information retrieval with generative AI. This repository showcases a curated collection of advanced techniques designed to supercharge your RAG systems, enabling them to deliver more accurate, contextually relevant, and comprehensive responses.
