Expanding contexts: In-Context Learning vs. Retrieval-Augmented Generation (RAG) in LLM customisation
By: Devika Bhalla
General-purpose models from providers such as OpenAI and Anthropic, trained on publicly available data, have gained significant popularity amongst end users. However, enterprises are keen to incorporate business-specific context. As per a survey by Insight Partners, around 50% of enterprises are fine-tuning and customising Large Language Models (LLMs). LLMs can be customised with either training-stage techniques, such as fine-tuning, or inference-stage techniques, such as RAG or in-context learning. While fine-tuning is best suited for adding a specialised task, e.g., image recognition, or domain-specific knowledge, in-context learning and RAG are better suited to providing up-to-date knowledge and context. This article discusses the trend of expanding context windows and its implications for using RAG and in-context learning to customise LLMs for enterprise AI use.
The Retrieval-Augmented Generation (RAG) technique was first introduced in 2020 by Meta AI researcher Patrick Lewis, who described RAG as a process of optimising the output of an LLM so that it references a knowledge base outside its training data sources before generating a response. The model does not have to rely only on what it learned during training; rather, it can pull in fresh, up-to-date knowledge when needed. In-context learning, by contrast, is a technique in which the model is fed a few key examples or hints in the prompt and uses those to figure out the right output.
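To make the distinction concrete, the sketch below contrasts the two approaches in simplified Python. The keyword-overlap retriever and the call_llm placeholder are illustrative assumptions, not any particular vendor's API.

# Minimal sketch contrasting the two inference-stage approaches. The retriever
# is a toy keyword-overlap search; `call_llm` is a hypothetical stand-in for
# whatever chat/completions API an enterprise actually uses.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a real LLM API call")

def retrieve(question: str, knowledge_base: list[str], top_k: int = 3) -> list[str]:
    # Toy retrieval: rank documents by word overlap with the question.
    q_words = set(question.lower().split())
    ranked = sorted(knowledge_base,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def rag_answer(question: str, knowledge_base: list[str]) -> str:
    # RAG: fetch business-specific context first, then ground the answer in it.
    context = "\n---\n".join(retrieve(question, knowledge_base))
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return call_llm(prompt)

def in_context_answer(question: str, examples: list[tuple[str, str]]) -> str:
    # In-context learning: show the model a few worked examples in the prompt itself.
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return call_llm(f"{demos}\nQ: {question}\nA:")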
The ever-expanding context window
Since the advent of Generative AI in the enterprise, context windows have grown from a few thousand tokens to over a million. For instance, Anthropic's Claude for Enterprise offers a 500K-token context window, while Google's Gemini 1.5 Pro and Gradient, an AI startup, expanded to 1 million tokens in 2024. Further, Google Cloud, in partnership with the AI startup Magic, is developing cloud-based supercomputers capable of supporting LLMs with a 100 million-token context window, equivalent to processing ten years of human speech.
Does this spell the end of RAG?
The enthusiasm surrounding expanding context windows raises questions about the future of RAG. For example, Pin-Yu Chen, an IBM researcher who has studied in-context learning, predicts that RAG will disappear if all the external documents and information can be fed into the LLM. He argues that RAG comes with information loss, which can be minimised by feeding all the information directly.
While the use of large context windows holds a lot of promise, results on their performance have been mixed. Recent research (Li et al., June 2024) suggests that while LLMs handle shorter contexts well, they struggle with longer, complex sequences. Similarly, another line of research (Liu et al., August 2023) suggests that LLM performance is highest when relevant information occurs at the beginning or end of the input context, and degrades significantly when models must access important information in the middle of long contexts, even for explicitly long-context models. These challenges suggest that while longer context windows hold promise, they may not fully replace RAG in enterprise use cases.
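One simple way to see the position sensitivity Liu et al. describe is to place the same key fact at different points in a long prompt and compare the model's answers. The probe below is a rough, assumed setup, reusing a hypothetical call_llm helper rather than a real API client.

# Hypothetical position-sensitivity probe: the same key fact is placed at the
# start, middle, or end of a long filler context, and the model is queried each time.
# `call_llm` is a stand-in for any real chat/completions API.

FILLER = "Lorem ipsum dolor sit amet. " * 2000   # long distractor text
KEY_FACT = "The warehouse access code is 7421."
QUESTION = "What is the warehouse access code?"

def build_prompt(position: str) -> str:
    half = len(FILLER) // 2
    if position == "start":
        context = KEY_FACT + " " + FILLER
    elif position == "middle":
        context = FILLER[:half] + " " + KEY_FACT + " " + FILLER[half:]
    else:  # "end"
        context = FILLER + " " + KEY_FACT
    return f"{context}\n\nQuestion: {QUESTION}\nAnswer:"

for position in ("start", "middle", "end"):
    answer = call_llm(build_prompt(position))
    print(position, "->", answer)   # per Liu et al., accuracy often dips for "middle"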
Moreover, from a cost perspective, long context windows are computationally more expensive. Research (Li et al., July 2024) suggests that, in contrast to long-context windows, RAG significantly decreases the input length to LLMs, which in turn reduces costs, as LLM APIs are typically priced based on the number of input tokens.
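Back-of-the-envelope arithmetic illustrates the point. The per-token price and token counts below are assumed placeholders, not any provider's actual rate card or a measured workload.

# Hypothetical cost comparison: RAG sends only the retrieved chunks, while a
# long-context approach sends the whole corpus on every call.
PRICE_PER_1K_INPUT_TOKENS = 0.003   # assumed USD price per 1,000 input tokens

corpus_tokens = 800_000   # e.g. an entire policy manual stuffed into the context window
rag_tokens = 4_000        # e.g. top-k retrieved chunks plus the question

long_context_cost = corpus_tokens / 1_000 * PRICE_PER_1K_INPUT_TOKENS
rag_cost = rag_tokens / 1_000 * PRICE_PER_1K_INPUT_TOKENS

print(f"Long-context call: ${long_context_cost:.2f} per query")   # $2.40
print(f"RAG call:          ${rag_cost:.4f} per query")            # $0.0120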
What lies ahead?
In-context learning is still far from being an alternative to RAG in the customisation of LLMs. However, advancements in areas such as memory compression could accelerate the adoption of in-context learning. Recent research (Munkhdalai et al., August 2024) introduces Infini-attention, a technique that enables models to identify the right details and information to store from the long context window so that only relevant parts are considered for retrieval and compute, thereby minimising input tokens. This approach gives LLMs an efficient way to access key information without a significant increase in memory and compute usage.
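A heavily simplified sketch of the compressive-memory idea behind Infini-attention follows: keys and values from earlier segments are folded into a fixed-size matrix and read back with a linear-attention-style lookup, so memory does not grow with context length. The dimensions and activation here are illustrative choices, not the paper's exact configuration.

import numpy as np

def elu_plus_one(x):
    # Non-negative feature map commonly used for linear-attention-style reads/writes
    return np.where(x > 0, x + 1.0, np.exp(x))

d_key, d_value = 64, 64
memory = np.zeros((d_key, d_value))   # fixed-size compressive memory
norm = np.zeros(d_key)                # running normalisation term

def write_segment(keys: np.ndarray, values: np.ndarray) -> None:
    # Fold a processed segment's key/value pairs into the fixed-size memory.
    global memory, norm
    sigma_k = elu_plus_one(keys)        # (seq_len, d_key)
    memory += sigma_k.T @ values        # accumulate key-value associations
    norm += sigma_k.sum(axis=0)

def read(queries: np.ndarray) -> np.ndarray:
    # Retrieve stored context for the current segment's queries.
    sigma_q = elu_plus_one(queries)     # (seq_len, d_key)
    return (sigma_q @ memory) / (sigma_q @ norm[:, None] + 1e-6)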
Rather than viewing RAG and long-context-window models as competing solutions, it is important to recognise that these approaches are best suited to different scenarios. Some AI providers, such as Dataiku and Cohere, and recent research suggest that long-context-window models are ideal when understanding a document in its entirety provides a clear advantage, such as in multi-document summarisation or long-horizon agent tasks. By contrast, in scenarios where accurately retrieving precise details and citation quality are important, RAG is still the way to go.
Moreover, there is also merit in combining the best of RAG and long-context LLMs. A framework introduced by Jiang et al. (June 2024), called LongRAG, enhances traditional RAG by utilising long-context LLMs. In traditional RAG, the retrieval units are generally short but numerous, which increases the chances of important information being scattered across different retrieval units. LongRAG instead groups documents into longer but fewer retrieval units, reducing the burden on the retriever. The larger context window of the LLM then allows the model to generate responses based on longer retrieved text, leading to improved performance.
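A rough sketch of the grouping step that LongRAG describes is shown below: related documents are packed into fewer, longer retrieval units up to a token budget. The four-characters-per-token heuristic and the unit budget are assumptions for illustration, not the paper's exact settings.

# Rough sketch of LongRAG-style grouping: pack documents into fewer, longer
# retrieval units instead of many short chunks.

def approx_tokens(text: str) -> int:
    return len(text) // 4   # crude token estimate (assumption)

def group_into_long_units(documents: list[str], max_unit_tokens: int = 30_000) -> list[str]:
    units, current, current_tokens = [], [], 0
    for doc in documents:
        doc_tokens = approx_tokens(doc)
        if current and current_tokens + doc_tokens > max_unit_tokens:
            units.append("\n\n".join(current))   # close the current long unit
            current, current_tokens = [], 0
        current.append(doc)
        current_tokens += doc_tokens
    if current:
        units.append("\n\n".join(current))
    return units

# The retriever now ranks a handful of long units; a long-context LLM then
# reads the top unit(s) in full when generating an answer.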
The expanding context windows in LLMs offer exciting new opportunities for enterprise-level AI customisation. However, RAG remains a cost-effective and efficient solution for now, particularly in use cases requiring precise, up-to-date information retrieval. Enterprises must stay abreast of advancements in techniques, context windows and related areas to leverage the best method, or combination of methods, for customising LLMs for their specific AI use cases.
— The author, Devika Bhalla, is based in New York City, where she works in Product Strategy at a Fortune 500 Enterprise Technology company. She is also a Fellow at the American Society of AI. The views expressed are those of the author and do not represent the views of their employer.