Understanding RAGs, Multimodal RAGs and the Future of Enterprise Search

Unlock the potential of your data with Multimodal RAG. This advanced technique combines the power of large language models with diverse data types (text, images, tables) to revolutionize enterprise search and knowledge retrieval.

Thu Feb 20 2025
8 min read
I began writing this post in November 2024, but it wasn’t published until February 2025. My goal was to simplify the concept of Retrieval‑Augmented Generation (RAG) and its potential for enterprise search.
Typically, anyone adopting an AI solution at the enterprise level is interested in the benefits and challenges of RAG and how it can be extended to multimodal data for more comprehensive search capabilities. This article was inspired by two workshops I attended at the OSI 2024 conference.
First, Multimodal Retrieval-Augmented Generation (RAG) Implementation
Presented by:
  • Ashwini Kumar, Lead Software Engineer
  • Vipul Gupta, Director, Technology Solutions
  • Sanchit Balchandan, Solution Architect
Second, Building AI Application in the Cloud and Locally
Presented by:
  • Vinayak Hegde, Principal AI Advocate
Unfortunately, I could not run the hands‑on workshop exercises because my 2015 MacBook Pro could not handle the latest software requirements. Still, I came away with a good understanding of the concepts and the potential of RAG in enterprise search.

Introduction

Retrieval-Augmented Generation, or RAG, is a technique that’s gaining traction in the AI world. Particularly in enterprise environments, where privacy and proprietary data handling are paramount, RAG offers a powerful solution. Let’s explore how RAG, when combined with large language models (LLMs) and multimodal capabilities, can revolutionize search and information retrieval across different types of data.

TL;DR

Imagine a really smart robot that can answer any question. But sometimes, it doesn't know the answer or gives a wrong answer because it hasn't learned enough. RAG is like giving the robot a superpower to look up answers in books or on the internet! This way, the robot can always give the right answer and learn new things. It's like having a super smart friend who always knows the answer and can help you learn too!

Problem Statement

Enterprises often need search functionality over proprietary data without the risk of exposing it to third-party services. Until recently, most models also had small context windows, so large documents could not fit into a prompt and responses often drifted out of context or were simply hallucinated.
Options like fine-tuning models on in-house data or using pre-trained models are available, but these solutions are often:
  • Expensive: They demand high computational resources.
  • Data-Dependent: They require annotated, labeled, and clean data.
This is where RAG-based solutions step in, providing a more feasible way to integrate external knowledge retrieval with LLMs for more accurate and contextual responses.

What is RAG?

RAG combines LLMs with external knowledge retrieval, including from proprietary data sources, to enhance response accuracy and context. LLMs are like super-smart computers that can understand and talk like humans. But they can only use the information they were trained on, which can be outdated or incomplete. RAG helps LLMs by letting them access and use information from the real world, like books, articles, or websites, to answer your questions better.
Here’s how it works:
  1. User Query: The user submits a query (Step 1)
  2. Retrieval Model: The query is passed to a retrieval model that fetches relevant context, be it a paragraph, document, or set of documents (Step 2 and 3)
  3. LLM Processing: The retrieved context and the user’s query are fed into the LLM (Step 4 and 5)
  4. Response Generation: The LLM generates an accurate, context-rich response (Step 6)
[Figure: RAG flow, steps 1-6 from user query through retrieval to the generated response]
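To make the flow concrete, here is a minimal sketch of those six steps in Python. The embed, vector_search, and llm_complete helpers are hypothetical placeholders for whatever embedding model, vector database, and LLM you actually use.

```python
def answer_query(query: str) -> str:
    # Steps 1-3: embed the user's query and fetch the most relevant
    # chunks from the vector store (embed and vector_search are
    # hypothetical stand-ins for your embedding model and vector database).
    query_vector = embed(query)
    context_chunks = vector_search(query_vector, top_k=3)
    context = "\n\n".join(context_chunks)

    # Steps 4-5: feed the retrieved context plus the original query to the LLM.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    # Step 6: the LLM (llm_complete is also a placeholder) generates
    # an accurate, context-rich response.
    return llm_complete(prompt)
```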

RAG Components

1. Ingestion

To make data retrievable, we first need to process and embed it:
  • Document Splitting: Breaks down large documents into smaller chunks (500-1000 tokens each).
    • Note that if the chosen LLM has a larger context window, the chunk size can be increased.
  • Document Embedding: Converts text into numerical vectors, capturing semantic meaning. Similar vectors indicate similar meanings, enabling semantic search.
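As an illustration, here is a naive splitter that approximates tokens by words, plus embedding with a sentence-transformers model. The model name, chunk sizes, and input file are example choices, not recommendations.

```python
from sentence_transformers import SentenceTransformer

def split_document(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    # Naive fixed-size splitting with overlap, so ideas spanning a chunk
    # boundary are not cut off; real pipelines often split on sentence
    # or section boundaries instead.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Each chunk becomes a vector; chunks with similar meanings end up close
# together in vector space, which is what enables semantic search.
model = SentenceTransformer("all-MiniLM-L6-v2")        # example embedding model
chunks = split_document(open("handbook.txt").read())  # hypothetical input file
vectors = model.encode(chunks)
```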

2. Vector Database

A vector database stores these embeddings, facilitating fast retrieval.

3. Retrieval

Here, semantic search is applied to the embedded documents, fetching the most contextually relevant information.
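Continuing the ingestion sketch above, here is what storing and searching might look like with FAISS as the vector database (an illustrative choice; any vector store works). The vectors are normalized so that inner product equals cosine similarity, and the query is embedded with the same model used at ingestion.

```python
import faiss
import numpy as np

# Index the chunk vectors from the ingestion sketch for fast similarity search.
index_vectors = np.asarray(vectors, dtype="float32")
faiss.normalize_L2(index_vectors)                  # normalized: inner product == cosine
index = faiss.IndexFlatIP(index_vectors.shape[1])
index.add(index_vectors)

# Retrieval: embed the query with the SAME model used for the documents,
# then fetch the top 3 most semantically similar chunks.
query = "What is our refund policy?"               # illustrative query
query_vec = np.asarray(model.encode([query]), dtype="float32")
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, k=3)
top_chunks = [chunks[i] for i in ids[0]]
```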

4. Synthesis

The LLM combines the retrieved context with the user’s query to produce a comprehensive response.
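To finish the running example, here is one way the synthesis step might look, using the OpenAI client purely as an illustration; any chat-style LLM works here, and the model name is an example choice.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set
context = "\n\n".join(top_chunks)  # top_chunks from the retrieval sketch above
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model choice
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)
print(response.choices[0].message.content)
```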

Why RAG?

LLMs, while powerful, have limitations:
  • Static Training Data: Models often lack the latest information.
  • No Fact-Checking: LLMs can “hallucinate” or make things up without reliable sources.
  • Costly to Update: Incorporating new knowledge by retraining is expensive; RAG retrieves fresh data at query time instead.

Benefits of RAG

  1. Enhanced Accuracy: Reduced hallucinations, with links to verified information.
  2. Dynamic Knowledge Updates: Continuously updated information from external sources.
  3. Operational Efficiency: Easier maintenance with minimal retraining.

How LLMs and SLMs Help in RAG

LLMs are the brains of RAG systems. They process information and generate answers. But sometimes, LLMs can be too big and slow. That's where Small Language Models (SLMs) come in. SLMs are smaller and faster, and they can help LLMs by:
  1. Finding information quickly: SLMs can quickly find the information LLMs need from a huge amount of data.
  2. Summarizing information: SLMs can make information shorter and easier for LLMs to understand.
  3. Answering specific questions: SLMs can be trained to be experts in certain areas and answer questions in those areas.
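One hypothetical division of labor, sketched below: a cheap SLM condenses each retrieved chunk before the larger model answers, shrinking the final prompt. Here slm_summarize and llm_answer are placeholders for whatever small and large models you deploy.

```python
def answer_with_slm_assist(query: str, retrieved_chunks: list[str]) -> str:
    # The SLM does the cheap, high-volume work: condensing each chunk
    # down to the parts relevant to the query (slm_summarize is a
    # placeholder for a small local model).
    condensed = [slm_summarize(chunk, focus=query) for chunk in retrieved_chunks]

    # The LLM does the hard reasoning once, on a much smaller prompt
    # (llm_answer is likewise a placeholder).
    return llm_answer(query, context="\n".join(condensed))
```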

SLMs over LLMs?

  1. Lower computational cost: SLMs are much cheaper and faster than LLMs, but they can't do everything LLMs can.
  2. Faster inference speed: SLMs can process information much quicker, making them ideal for real-time applications like chatbots where fast responses are crucial.
  3. Better control: SLMs give you tighter control over privacy, security, and task-specific focus, since they can be trained on smaller, more targeted datasets.

Real-World Challenges with RAG

Implementation challenges:

  1. Chunking – keeping context across document chunks can be difficult.
  2. Embedding consistency – the same embedding model should be used for both initial processing and final queries.
  3. Image and tabular data – these formats often lack full support, which can cause misinterpretation.

Limitations

  1. No Real-Time Updates: A RAG system is only as fresh as its vector store; new information must be re-ingested and re-embedded before it can be retrieved.
  2. Restricted to Textual Data: Traditional models struggle with non-text formats unless using multimodal LLMs.

Multimodal Data with RAG

One of the implementation challenges of RAG is handling multimodal data, but this has become feasible with the advent of orchestrated multimodal LLMs. In simple terms, we can convert image and table information into a format an LLM can understand, and the challenge is expected to diminish further as multimodal LLMs improve.
With this approach, RAG’s potential extends beyond just text; it can handle multimodal data too:
  • Text
  • Images with Text
  • General Images
  • Tabular Data
Multimodal LLMs like LLaVa-NeXT, PaliGemma, and Pixtral 12B (Mistral AI) can process various data types, enabling RAG-based solutions that use text, images, and even tables to provide more nuanced responses.

Steps for Multimodal RAG Implementation

  1. Data Extraction: Use parsers for unstructured PDFs, obtaining separate elements, tables, and images.
  2. Embedding: For each data type, create suitable embeddings:
    • Image Summary: Use a multimodal LLM to describe each image in text, then embed the summary.
    • Table Summary: Convert tables into HTML or Markdown, then embed a summary of that representation.
    • Text: Embed text chunks directly with standard text embeddings.
  3. Querying: Use multimodal LLMs to process and generate responses based on the query and embedded data.
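Sketched below is what the ingestion side of those steps might look like. Here parse_pdf, vlm_describe, and table_to_markdown are hypothetical helpers standing in for your document parser and multimodal model, and embed and store are the same kind of embedding function and vector store used earlier.

```python
def ingest_multimodal(pdf_path: str, embed, store):
    # Step 1: Data extraction - split the PDF into text, table, and
    # image elements (parse_pdf is a hypothetical parser wrapper).
    for element in parse_pdf(pdf_path):
        # Step 2: Embedding - normalize every modality into text first.
        if element.kind == "image":
            text = vlm_describe(element.image)       # image -> text summary
        elif element.kind == "table":
            text = table_to_markdown(element.table)  # table -> Markdown text
        else:
            text = element.text                      # plain text chunks as-is

        # All modalities now share one text-embedding space, so a single
        # semantic search at query time (Step 3) covers them all.
        store.add(embed(text), payload=element)
```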

Costs of Using RAG

Using RAG can be expensive because you need to pay for:
  1. Compute and storage: You need capable hardware and enough storage to hold all the embeddings and the vector database.
  2. Model usage: Embedding documents and calling an LLM for every query costs money, whether you host models yourself or pay per API token.
  3. Experts: You need skilled people to build and maintain RAG systems.

Common Questions on RAG

Q1: What about token usage costs in RAG-based solutions?
Since RAG may require sending entire documents to the LLM, it can be costly. However, you can control costs by sending only the most relevant retrieved chunks rather than whole documents, so the prompt stays small while the response remains context-relevant.
Q2: Can GPT-4o handle embeddings in a multimodal RAG setup while a smaller LLM manages response generation?
Yes! For tasks like summarizing tables or images into text, a multimodal LLM can produce the summaries that get embedded, leaving a smaller LLM to generate the final response.
Q3: Do I need to convert images to base64 for API calls, or can they be sent directly?
Direct image transmission is possible, though LangChain might impose limitations. Alternatively, store files in S3 and pass URLs for processing.
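For reference, base64-encoding an image in Python is a one-liner. The data-URL form shown below is what many multimodal chat APIs accept, though exact formats vary by provider, so check your provider's docs.

```python
import base64

with open("chart.png", "rb") as f:  # "chart.png" is an illustrative file
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Many multimodal APIs accept images as data URLs like this one;
# others take raw bytes or an S3/HTTP URL instead.
data_url = f"data:image/png;base64,{image_b64}"
```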

Key Sub-topics in RAG

Here are some important things to know about how RAG works:
  1. Retrieval Methods: Different ways to find the right information, like searching by keywords or using smart algorithms that understand the meaning of words.
  2. Knowledge Base Construction: How to collect, organize, and store information so that it's easy to find and use.
  3. Contextualization: Making sure the LLM understands the context of your question and uses the right information to answer it.
  4. Evaluation Metrics: How to measure how well the RAG system is working, like checking if it finds the right information and gives accurate answers.
Hopefully I will be able to dive deeper into these topics in future posts.

Conclusion

RAG is a powerful tool that makes LLMs even better at understanding and answering your questions. It's like giving LLMs a superpower to access and use information from the real world. This makes them more accurate, reliable, and helpful in many different situations.
RAG is redefining how we interact with data, making information retrieval more accurate, context-driven, and responsive across modalities. It’s a promising step toward creating dynamic, real-world applications that handle data effectively in secure environments.

References

  1. RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  2. What is RAG (Retrieval-Augmented Generation)? - AWS
  3. What is RAG (Retrieval-Augmented Generation)? - Google Cloud
  4. What is RAG (retrieval augmented generation)? - IBM