RAG - with Romeo and Juliet

Over the last week I've been learning something you may have heard mentioned in the context of large language models - Retrieval Augmented Generation (RAG).

What is RAG?

Large Language Models (LLMs) like ChatGPT are trained on vast amounts of data - most of the internet - but that doesn't include your internal company data.

Say you work in an insurance company. If you ask ChatGPT about customer behaviour, it can only give a generic response. If you ask it about claims from October in London, and that information isn't publicly available, then it can't answer your question.

RAG bridges the gap between the immense power of an LLM and the specific information that it doesn't have access to but you do. The idea is to retrieve relevant information from your data, augment your query with the relevant bits, and then send those to the LLM to generate a more informed answer.

Why am I learning RAG?

I identified RAG as a key AI-based skill with which I can contribute value to almost any company.

As you probably know, ChatGPT sparked the current AI revolution a couple of years ago. If you look at trends since then, initially everyone was rushing to train their own neural networks. Then the best models improved and people switched more to fine-tuning existing models for custom purposes. Then OpenAI, Anthropic and others developed the multimodal foundational models that we're familiar with today, models so powerful that fine-tuning them or training one's own neural network just can't compete.

For example, Bloomberg developed a proprietary model for financial data, at enormous expense, in the early part of this boom. It is now thoroughly outperformed by today's top models, even though they are general models, not specifically focused on the financial domain.

The huge foundational models (ChatGPT, Claude, Gemini and others) are phenomenal, ever more accurate and capable. I personally use them all the time for research and programming assistance and am at least four times as productive as a result. There are thousands of LLMs out there but these big boys outperform the rest by some way - and are still rapidly improving.

So the future isn't making custom neural networks, like I enjoyed learning how to do last month; it's learning to use the latest and greatest foundational LLMs. (At the time of writing, the best one is Claude 3.5 by most measures.)

So one of the key skills an AI developer can have, in my opinion, is helping companies harness this power and integrate it into their existing systems and workflows.

RAG is a key component to deliver value in this regard. Related topics are LangChain and prompt engineering, which I'm also learning this month.

RAG in Action - "How does Juliet die?"

Here is an example of RAG that I wrote in Python after working through several deep dive tutorials, notably this one on RAG and this one on LangChain.

Rather than working with insurance company data, I took 16 classic literary texts from the latter tutorial: Romeo and Juliet, Moby Dick, Ulysses, Frankenstein, Sherlock Holmes, Pride and Prejudice, etc.

I then split them into chunks and used a pre-trained 'embedding model' from HuggingFace to create what are called embeddings for each chunk. These are numerical representations of the text that cleverly capture its meaning. They take the form of multi-dimensional vectors, which are stored in a vector database.

That sounds complicated but thanks to Python libraries and the HuggingFace model I used for the embedding process, it's about ten lines of code.
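The pre-processing step can be sketched roughly as follows. This is a minimal illustration, not my actual code: the chunk size, overlap and function name are my own choices, and the embedding and storage steps (which in my case used a HuggingFace model and a vector database) are indicated in the comments.

```python
# A minimal sketch of the chunking step. Chunk size, overlap and
# the function name are illustrative choices, not the tutorial's code.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character chunks.

    The overlap means a sentence straddling a chunk boundary still
    appears whole in at least one chunk, which helps retrieval quality.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# In the real pipeline, each chunk is then passed through a pre-trained
# embedding model - for example, assuming sentence-transformers is
# installed:
#
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   vectors = model.encode(chunks)
#
# - and the resulting vectors are stored in a vector database.
```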

Then comes the fun part. I ask a simple question, like "How does Juliet die?", without specifying which of the 16 texts I'm referring to. Obviously LLMs are trained on all of the literary texts above, but note that we're not using one at all at this stage.

For the retrieval part of RAG, we're going to do a semantic search, and this is where it gets really interesting.

Semantic search

Here is the power of embeddings and vectors. If you searched through the above literary texts for "How does Juliet die?", there would be no matches because that exact phrase doesn't appear in them. But our numerical representation of each chunk of text has captured its meaning, and that's all stored in our vector database.

So the magic happens when we encode our query - "How does Juliet die?" - into the same embedding space and compare it to each stored chunk of text in terms of meaning. Each comparison produces a similarity score, and then we can just take the top one or more matches.
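The comparison step can be illustrated with a toy example. In reality the vectors come from an embedding model and are hundreds of dimensions long; here they are made-up three-dimensional stand-ins, but the scoring logic (cosine similarity) is the standard one.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=1):
    """Rank stored chunks by similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in chunk_vecs.items()]
    return sorted(scored, reverse=True)[:k]

# Made-up vectors: the chunk about Juliet's death should score highest.
chunks = {
    "juliet_death_scene": [0.9, 0.1, 0.2],
    "moby_dick_whale":    [0.1, 0.8, 0.3],
    "frankenstein_lab":   [0.2, 0.2, 0.9],
}
query = [0.85, 0.15, 0.25]  # stand-in embedding of "How does Juliet die?"

print(top_k(query, chunks, k=1))
```

A real vector database does exactly this ranking, just at scale and with clever indexing so it doesn't have to compare against every chunk.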

For "How does Juliet die?", the chunk of text whose meaning ranks as most relevant is:


'[_Noise within._]

I hear some noise. Lady, come from that nest
Of death, contagion, and unnatural sleep.
A greater power than we can contradict
Hath thwarted our intents. Come, come away.
Thy husband in thy bosom there lies dead;
And Paris too. Come, I’ll dispose of thee
Among a sisterhood of holy nuns.
Stay not to question, for the watch is coming.
Come, go, good Juliet. I dare no longer stay.

Go, get thee hence, for I will not away.

 [_Exit Friar Lawrence._]

What’s here? A cup clos’d in my true love’s hand?
Poison, I see, hath been his timeless end.
O churl. Drink all, and left no friendly drop
To help me after? I will kiss thy lips.
Haply some poison yet doth hang on them,
To make me die with a restorative.

 [_Kisses him._]

Thy lips are warm!

[_Within._] Lead, boy. Which way?

Yea, noise? Then I’ll be brief. O happy dagger.

 [_Snatching Romeo’s dagger._]

This is thy sheath. [_stabs herself_] There rest, and let me die.'

Note that it doesn't even say that Juliet dies. But that particular passage is deemed most similar to the query, more than anything in Frankenstein or Huckleberry Finn or elsewhere in Romeo and Juliet.

Augmented Queries

So far we've only done the R part of RAG. We haven't interacted with an LLM at all. We pre-processed our documents, which in the real world would probably be company data rather than literary texts. Then we used that information to retrieve relevant passages based on a given query.

Augmenting the query is then just a matter of prompt engineering. Specifically, I wrote:


"You are a helpful research assistant.

Please answer the question '{query}' briefly, using ONLY the information below.

If there is not enough information to answer the question, say that you do not know the answer, but hazard a guess if there is a possible answer. Do not use your existing knowledge.

If you are able to answer the question, please follow your reply by mentioning the associated source as shown below."


I then appended the three retrieved chunks of text that were most relevant to the query - including the passage above in this case - plus their associated sources. Then I sent the augmented query to Claude using the Anthropic API, and it replied:


Based on the information provided in the documents, I do not have enough information to definitively answer how Juliet dies. The documents mention that Juliet is found dead by Friar Lawrence, and that she takes Romeo's dagger and stabs herself, but do not explicitly state that this action leads to her death.

However, a possible inference based on the context could be that Juliet dies by stabbing herself with Romeo's dagger. This can be supported by Juliet's line "O happy dagger. This is thy sheath. [_stabs herself_] There rest, and let me die." which suggests she intentionally stabs herself, likely leading to her death.

Source: 'romeo_and_juliet.txt'

- which is an excellent answer.

Again, Claude would have known the answer without any RAG in this case, which is why I told it not to use any existing knowledge. But in a real use-case with company data, there's no way it could answer without the augmented query. Clever stuff. And the fact it can provide a source makes the response much more reliable and trustworthy.
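Putting the A and G together, the flow looks roughly like this. The prompt text is paraphrased from mine above, the `retrieved` structure is an illustrative assumption, and the model name reflects the Anthropic API at the time of writing; the call itself is guarded so the sketch runs even without an API key.

```python
import os

def build_augmented_query(query: str, retrieved: list[dict]) -> str:
    """Combine the user's question with the retrieved chunks.

    `retrieved` is assumed to be a list of {"text": ..., "source": ...}
    dicts - the top matches from the semantic search step.
    """
    context = "\n\n".join(
        f"Source: {r['source']}\n{r['text']}" for r in retrieved
    )
    return (
        "You are a helpful research assistant.\n\n"
        f"Please answer the question '{query}' briefly, "
        "using ONLY the information below.\n\n"
        "If there is not enough information to answer the question, "
        "say that you do not know the answer, but hazard a guess if "
        "there is a possible answer. Do not use your existing knowledge.\n\n"
        "If you are able to answer the question, please follow your "
        "reply by mentioning the associated source.\n\n"
        f"{context}"
    )

prompt = build_augmented_query(
    "How does Juliet die?",
    [{"text": "O happy dagger. This is thy sheath. [_stabs herself_]",
      "source": "romeo_and_juliet.txt"}],
)

# The generation step - only attempted if a key is configured:
if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic
    client = anthropic.Anthropic()
    reply = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply.content[0].text)
```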

RAG versus attachments

"ChatGPT and Claude let me upload attachments," you might say. Why bother with RAG if I can just send it the relevant text in its entirety?

In simple scenarios, this would work. But even the best LLMs have a limit on how much input you can give them - what's called the context window. Your company data might span millions of documents. There are likely also privacy issues: you don't want to push all of your corporate secrets to OpenAI every time you want to use an LLM.

A huge amount of data also means that most of it is irrelevant, and that low signal-to-noise ratio makes it harder for the LLM to extract the relevant parts. Even if the context window were large enough to accept everything, the LLM performs much better when given only the data it needs to answer the question, rather than a mass of irrelevant information.

Finally, using LLMs isn't free. In a web or app interface we can use Claude and ChatGPT for free but to access the APIs programmatically, as you would need to do in any kind of commercial production environment, costs money. The amount it costs scales with the amount of input and output. If you're maxing out the context window by sending it thousands of pages of text, and making thousands of queries a day in an enterprise environment, the cost is going to rise very quickly.

Illustration generated by Dall-E 3.