A storied introduction to RAG

One of my nightmares is being on Jeopardy! and having the worst board ever.  Along with categories like Minor League Right Fielders of the 1970s and The Periodic Table in Portuguese, I’d expect to see LLM Abbreviations.  While I do know LLM stands for Large Language Model, there seems to be a new LLM-related abbreviation every day.

It’s time to face my fears and take on one of those abbreviations–RAG.  But instead of getting into technical details right away, let’s tell a story that will hopefully make RAG a lot less daunting.

What is RAG?

The abbreviation RAG itself is not that bad; it stands for Retrieval-Augmented Generation.  That’s a pretty good description.  Before generating an answer, RAG will retrieve documents that are pertinent to the question and send them along to the LLM to get the final answer.

RAG is one way to improve how LLMs generate their output.  There are several alternatives to using RAG, such as

  • retraining an entire model from scratch (a very expensive operation) 
  • working to improve the prompts sent to the model 
  • fine-tuning the model with additional examples specific to your use case

So, when is RAG a better choice than these?  

  • RAG gives the LLM access to current information.  Since retraining is so expensive, it isn’t done often, and some popular LLMs haven’t been retrained for years.
  • RAG gives the LLM access to specific information for a company.  A bakery may use a general LLM to help with planning, but by using RAG, it can add exact information about the bakery’s tools, recipes, and supplies.   
  • Since the LLM’s answer is grounded in information the user provides and believes to be true, RAG tends to reduce hallucinations.

A story

I’ve had the privilege of knowing Dr. Shannon Duvall for years and one of her research areas is storytelling in computer science[1].  Stories can make complex topics more concrete and, at first glance, RAG is a complex topic.  So, let’s start with a story.

Chris owns a travel agency and Pat works there.  

Chris finished travel agent school in 2022 and has been too busy to keep up with what’s new.

Fortunately, the travel agency gets lots of brochures from around the world that have more recent information.  Now, these brochures are in lots of foreign languages and they use different words to say the same thing (like “on the water”, “beachfront”, “walk from your room to the ocean”, “sur la mer”, “на берегу океана”, etc.).  So Pat has encoded them into a bunch of numbers, based on their attributes.  Pat uses a very involved process to do this encoding.  There’s no clear pattern between the document contents and the encoding, but it is amazingly consistent!

When Madison, a customer, comes in and asks for advice, Pat encodes the request using the same process, compares it to the brochures’ encodings, and finds the ones that match best.  Again, Pat has a fancy way of checking how well brochures match, but we trust Pat to do the matching well.

Once Pat has all the best brochures ready, they are given to Chris, who also gets Madison’s request.  Chris has a rule that any response must include information from at least one of those brochures.   Based on all of this, Chris responds to Madison.

Comments on the story

On first read, this story may seem to have nothing to do with computing or LLMs, but let’s break it down.

  • Chris represents the LLM that will ultimately generate a recommendation for Madison.
  • But Chris’s knowledge is outdated, so it will need to be augmented.
  • There are lots of travel brochures that can be used to help Chris make a current decision and Pat will decide which of these are pertinent.
  • Fortunately, the information in each of the brochures was given a numeric encoding that was stored in the travel agency’s computer before Madison ever showed up.
  • So Pat will run a simple program to encode Madison’s request and determine which of the brochures are the closest matches.  
  • These brochures are then given to Chris along with the original request.  Using this additional, pertinent material, Chris can make a better recommendation to Madison. 

So, how does that relate to doing retrieval-augmented generation?

First, RAG is not a one-step process.

  • You first need to gather a set of authoritative material and create an encoding (an embedding) for each piece of it.  This does not need to be repeated for each request to the LLM.
  • Then you need to encode the request and use it to retrieve the pertinent documents.
    • Be sure you’re using the same encoding system for both the main set of documents and the request!
  • Then you can send the query along with the retrieved documents to the “main” LLM to create a response (a minimal sketch of all three steps follows this list).
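
Here’s that minimal Python sketch.  The names and the embed and generate functions are my own toy stand-ins for illustration (a hashed bag of words and an “LLM” that just echoes its prompt); in a real system you’d swap in your embedding model and your main LLM, but the shape of the pipeline stays the same.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model: a hashed bag of words.
    In practice you would call whatever embedding model you have chosen."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec

def generate(prompt: str) -> str:
    """Toy stand-in for the "main" LLM; a real system would send the prompt to it."""
    return "LLM response to:\n" + prompt

# Step 1: encode the authoritative documents once, ahead of time.
documents = [
    "Beachfront bungalows in Fiji, renovated in 2024.",
    "Ski chalets in the Alps, open December through March.",
    "Walking food tours of Lisbon, small groups only.",
]
doc_embeddings = np.array([embed(doc) for doc in documents])

def retrieve(question: str, k: int = 2) -> list[str]:
    """Step 2: encode the request with the SAME process and find the closest documents."""
    q = embed(question)
    # Cosine similarity between the request and every document embedding.
    sims = doc_embeddings @ q / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q)
    )
    return [documents[i] for i in np.argsort(sims)[::-1][:k]]

def answer(question: str) -> str:
    """Step 3: send the retrieved documents along with the question to the main LLM."""
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer the question using only the material below.\n\n"
        f"Material:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)

print(answer("Do you have any beachfront bungalows available?"))
```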

RAG doesn’t change the underlying model.  Because of this, it has a relatively low cost per request, just needing to encode a single prompt.

And there’s no single way to do it.  You still need to pick a place to store the embeddings (a vector store) and decide how to measure how close two embeddings are (the similarity measure).  If you have a lot of embeddings (say, 5,000 or more), creating a vector index can speed up the search.  And you still have your choice of the main LLM to feed these intermediate results into.
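
To make the vector index idea concrete, here’s a sketch that uses FAISS (my choice for illustration; any vector store with an approximate-nearest-neighbor index plays the same role).  The random vectors are just stand-ins for real embeddings; the point is that the index, rather than a brute-force loop over every document, does the similarity search.

```python
import faiss              # pip install faiss-cpu
import numpy as np

d = 384                   # embedding dimension; depends on the embedding model you chose
rng = np.random.default_rng(0)
doc_embeddings = rng.random((10_000, d), dtype="float32")  # stand-ins for real document embeddings

# On L2-normalized vectors, ranking by L2 distance matches ranking by cosine similarity.
faiss.normalize_L2(doc_embeddings)

index = faiss.IndexHNSWFlat(d, 32)  # an approximate index; much faster than brute force at scale
index.add(doc_embeddings)

query = rng.random((1, d), dtype="float32")                # stand-in for the encoded request
faiss.normalize_L2(query)

distances, ids = index.search(query, 5)  # the five closest documents
print(ids[0])
```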

Why use numbers?

Why do we need to translate the documents into numbers?  Put simply, computers are very, very good at processing numbers and nowhere near as good at dealing with text.  By converting to numbers, we play into the strengths of computing.  (My apologies to any cyber-archaeologists who read this in 20 to 100 years and laugh at the claim that computers are not as good at handling text as they are at numbers.)  Converting to numbers removes differences in wording and languages.   
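
Here’s a small sketch of that idea, assuming the sentence-transformers library and one of its multilingual models (both are simply my picks for illustration): phrases that say “near the water” in different words, or different languages, should land much closer to each other than to an unrelated phrase.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# A multilingual sentence-embedding model (one of many possible choices).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

phrases = [
    "beachfront",
    "walk from your room to the ocean",
    "sur la mer",                    # French: "on the sea"
    "ski chalet in the mountains",   # something unrelated, for contrast
]
embeddings = model.encode(phrases)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# You'd expect the differently worded "near the water" phrases to score
# noticeably closer to "beachfront" than the ski chalet does.
for i in range(1, len(phrases)):
    print(f"{phrases[i]!r} vs 'beachfront': {cosine(embeddings[0], embeddings[i]):.2f}")
```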

If you look into embeddings, you’ll usually find examples like plotting foods on a “spicy” to “bland” axis (and it may go to 2 or 3 dimensions, adding in “hot” or “cold” and “healthy” or “indulgent”).  I’m not going to do that here because it can be misleading.  First, most embeddings are done in hundreds of dimensions.  Second, the axes usually do not have a clear meaning (like “temperature” or “spiciness”).  But you can see a simple example of the embeddings for individual words at the Embedding Projector, which visualizes high-dimensional embeddings for 10,000 common English words and the closest words to them.  I searched for “dog” and while I expected some of the results (like “pet”), others were more surprising (like “hat”).
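
If you’d rather poke at word embeddings in code than in a browser, here’s a sketch using gensim’s pre-trained GloVe vectors (again, just my pick for illustration; the Embedding Projector uses a different embedding, so the neighbors it shows won’t match exactly).

```python
import gensim.downloader as api  # pip install gensim

# Downloads a small set of pre-trained GloVe word vectors (a few dozen MB) on first use.
vectors = api.load("glove-wiki-gigaword-50")

# The words whose embeddings sit closest to "dog" in this vector space.
for word, similarity in vectors.most_similar("dog", topn=5):
    print(f"{word}: {similarity:.2f}")
```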

Next steps

Hopefully, this introduction has let you see that RAG is not as daunting a topic as it may have seemed.  Of course, understanding something and implementing it can be vastly different.  Come back for future posts to get into the details of finding and embedding data and passing it on to an LLM.

[1] Shannon Duvall. 2008. Computer science fairy tales. J. Comput. Sci. Coll. 24, 2 (December 2008), 98–104.