Conference Talk 14: Explaining the Basics of Retrieval Augmented Generation
- Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.
- Summary
- Demystifying RAG
- The Compact MVP: A Simple RAG Implementation
- Understanding Bi-Encoders in Vector Search
- Improving Retrieval with Re-ranking
- Keyword Search
- Leveraging Metadata for Targeted Retrieval
- Putting it All Together: The Complete MVP++ Pipeline
- Beyond the Basics: Future Exploration and Resources
- Q&A Session
Summary
Ben Clavié deconstructs the concept of Retrieval-Augmented Generation (RAG) and guides the audience through building a robust, basic RAG pipeline. He emphasizes that RAG is not a standalone technology but a pipeline combining retrieval and generation, and that each component needs individual attention for optimization. Ben advocates for an “MVP++” approach, incorporating essential elements like bi-encoders, re-ranking, keyword search (TF-IDF/BM25), and metadata filtering for a well-rounded system.
Demystifying RAG
- RAG: Overused and Misunderstood
- The term “RAG” is often used incorrectly to represent an end-to-end system, creating confusion.
- RAG as a Pipeline: Retrieval + Generation
- RAG simply combines retrieval (finding relevant information) and generation (creating text) using Large Language Models (LLMs).
- It’s not a single technology, but a pipeline requiring optimization at each stage: retrieval, generation, and their connection.
- Importance of Identifying Specific RAG Issues
- “My RAG doesn’t work” is too broad. Pinpointing the failing component (retrieval, LLM utilization) is crucial for debugging.
The Compact MVP: A Simple RAG Implementation
- Basic Pipeline Components and Flow
- Query embedding
- Document embedding
- Cosine similarity search for relevant documents
- Code Example: Vector Search with NumPy
- Demonstrates a basic RAG pipeline without a vector database, emphasizing simplicity.
- Uses NumPy for cosine similarity search for demonstration purposes.
```python
# (1) Load the embedding model (bi-encoder)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Alibaba-NLP/gte-base-en-v1.5")

# (2) Fetch some text content...
from wikipediaapi import Wikipedia
wiki = Wikipedia('RAGBot/0.0', 'en')
doc = wiki.page('Hayao_Miyazaki').text
paragraphs = doc.split('\n\n')

# ...And embed it.
docs_embed = model.encode(paragraphs, normalize_embeddings=True)

# (3) Embed the query
query = "What was Studio Ghibli's first film?"
query_embed = model.encode(query, normalize_embeddings=True)

# (4) Find the 3 closest paragraphs to the query (cosine similarity reduces
# to a dot product because the embeddings are normalized)
import numpy as np
similarities = np.dot(docs_embed, query_embed.T)
top_3_idx = np.argsort(similarities)[-3:][::-1].tolist()
most_similar_documents = [paragraphs[idx] for idx in top_3_idx]
```
- (1) Load the bi-encoder
- (2) Embed the documents
- (3) Embed the query
- (4) Cosine similarity search
Understanding Bi-Encoders in Vector Search
Vector Databases: When and Why?
- Useful for efficiently searching large document sets using approximate nearest-neighbor search techniques
- Not necessary for small datasets (e.g., 500 documents)
- A modern CPU can search through hundreds of vectors in milliseconds
Bi-Encoders: Separate Encoding for Queries and Documents
- Encode documents and queries independently.
- Pre-computed document representations allow for efficient inference, as only the query needs encoding at runtime.
- Comes with retrieval performance tradeoffs
Improving Retrieval with Re-ranking
- Bi-Encoder Limitations: Context Unawareness
- Bi-encoders encode documents and queries separately, potentially missing nuanced relationships between them.
- Cross-Encoders: Joint Encoding for Better Relevance Scoring
- Encode query-document pairs together, allowing for a more context-aware relevance score.
- Effectively a binary classifier
- Uses the probability of being the positive class as the similarity score.
- Computationally expensive for large datasets.
Re-ranking in Practice: Addressing Computational Costs
rerankers: A lightweight unified API for various reranking models.
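A minimal sketch of re-ranking first-pass candidates with the rerankers library (not from the talk; the default "cross-encoder" model choice and the result-attribute names follow my reading of the library's README and may differ across versions):

```python
from rerankers import Reranker

# Load a default cross-encoder re-ranker (the model choice is illustrative;
# any supported cross-encoder name can be passed instead).
ranker = Reranker("cross-encoder")

query = "What was Studio Ghibli's first film?"
candidates = [
    "Studio Ghibli was founded in 1985 by Hayao Miyazaki and Isao Takahata.",
    "Castle in the Sky (1986) was the first film produced by Studio Ghibli.",
    "Miyazaki announced his retirement from feature films in 2013.",
]

# Score each (query, document) pair jointly; the cross-encoder assigns a
# relevance score per candidate and results come back sorted.
results = ranker.rank(query=query, docs=candidates, doc_ids=[0, 1, 2])
for result in results.top_k(2):
    print(result.document.text, result.score)
```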
Keyword Search
Also called “full-text search”
Embeddings Are Not Enough: Lossy Compression and Jargon
- Embeddings compress information, potentially losing details crucial for accurate retrieval, especially with domain-specific jargon and acronyms.
TF-IDF and BM25
- Emphasizes the importance of incorporating traditional keyword search alongside embedding-based methods.
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Assigns a weight to words or groups of words based on their rarity
- Stanford IR Book: Inverse document frequency
- BM25 (Best-Matching 25); a minimal scoring sketch follows this list
- Wikipedia: Okapi BM25
- Stanford IR Book: Okapi BM25: a non-binary model
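For illustration (not from the talk), here is a minimal BM25 sketch using the rank_bm25 package; the toy corpus and whitespace tokenization are assumptions:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "Q4 2022 financial report for the cruise division",
    "Quarterly engineering update on the RAG pipeline",
    "2021 annual report for the airline division",
]
# Naive lowercase + whitespace tokenization; a real pipeline would also
# strip punctuation, handle stemming, etc.
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "cruise division financial report q4 2022".split()
print(bm25.get_scores(query))              # one BM25 score per document
print(bm25.get_top_n(query, corpus, n=1))  # highest-scoring document
```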
BM25 Performance and Relevance in Modern Pipelines
- Highlights BM25’s continued relevance and effectiveness, often outperforming or complementing more complex methods.
- Results Table
| Dataset (↓) / Model (→) | BM25 | DeepCT | SPARTA | docT5query | DPR | ANCE | TAS-B | GenQ |
|---|---|---|---|---|---|---|---|---|
| MS MARCO | 0.228 | 0.296‡ | 0.351‡ | 0.338‡ | 0.177 | 0.388‡ | 0.408‡ | 0.408‡ |
| TREC-COVID | 0.656 | 0.406 | 0.538 | 0.713 | 0.332 | 0.654 | 0.481 | 0.619 |
| BioASQ | 0.465 | 0.407 | 0.351 | 0.431 | 0.127 | 0.306 | 0.383 | 0.398 |
| NFCorpus | 0.325 | 0.283 | 0.301 | 0.328 | 0.189 | 0.237 | 0.319 | 0.319 |
| NQ | 0.329 | 0.188 | 0.398 | 0.399 | 0.474‡ | 0.446 | 0.463 | 0.358 |
| HotpotQA | 0.603 | 0.503 | 0.492 | 0.580 | 0.391 | 0.456 | 0.584 | 0.534 |
| FiQA-2018 | 0.236 | 0.191 | 0.198 | 0.291 | 0.112 | 0.295 | 0.300 | 0.308 |
| Signal-1M (RT) | 0.330 | 0.269 | 0.252 | 0.307 | 0.155 | 0.249 | 0.289 | 0.281 |
| TREC-NEWS | 0.398 | 0.220 | 0.258 | 0.420 | 0.161 | 0.382 | 0.377 | 0.396 |
| Robust04 | 0.408 | 0.287 | 0.276 | 0.437 | 0.252 | 0.392 | 0.427 | 0.362 |
| ArguAna | 0.315 | 0.309 | 0.279 | 0.349 | 0.175 | 0.415 | 0.429 | 0.493 |
| Touché-2020 | 0.367 | 0.156 | 0.175 | 0.347 | 0.131 | 0.240 | 0.162 | 0.182 |
| CQADupStack | 0.299 | 0.268 | 0.257 | 0.325 | 0.153 | 0.296 | 0.314 | 0.347 |
| Quora | 0.789 | 0.691 | 0.630 | 0.802 | 0.248 | 0.852 | 0.835 | 0.830 |
| DBPedia | 0.313 | 0.177 | 0.314 | 0.331 | 0.263 | 0.281 | 0.384 | 0.328 |
| SCIDOCS | 0.158 | 0.124 | 0.126 | 0.162 | 0.077 | 0.122 | 0.149 | 0.143 |
| FEVER | 0.753 | 0.353 | 0.596 | 0.714 | 0.562 | 0.669 | 0.700 | 0.669 |
| Climate-FEVER | 0.213 | 0.066 | 0.082 | 0.201 | 0.148 | 0.198 | 0.228 | 0.175 |
| SciFact | 0.665 | 0.630 | 0.582 | 0.675 | 0.318 | 0.507 | 0.643 | 0.644 |
| Avg. Performance vs. BM25 | – | -27.9% | -20.3% | +1.6% | -47.7% | -7.4% | -2.8% | -3.6% |

Source: BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models
- Especially powerful on longer documents and documents containing a lot of domain-specific jargon
- Virtually unnoticeable inference-time compute overhead
The TF-IDF MVP++
Leveraging Metadata for Targeted Retrieval
- Real-World Data Has Context: Metadata Matters
- Real-world documents often possess valuable metadata (e.g., author, date, department) that can significantly improve retrieval accuracy.
- Pure Semantic or Keyword Search can struggle with metadata
- Example Query: “Can you get me the cruise division financial report for Q4 2022?”
- The model must accurately represent all of “financial report”, “cruise division”, “Q4”, and “2022” in a single vector
- Otherwise it will fetch documents that look relevant but fail to meet one or more of those criteria
- If the number of documents you retrieve (“k”) is set too high, you will be passing irrelevant financial reports to your LLM
- Entity Detection and Metadata Filtering: A Practical Example
- Use entity detection models like GLiNER to automatically extract relevant metadata (e.g., document type, time period, department); see the sketch at the end of this section
- GLiNER
- Paper: GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer
- GitHub Repository: GLiNER
- Hugging Face Spaces Demo: GLiNER-medium-v2.1, zero-shot NER
- Filter documents based on extracted metadata to ensure relevance and reduce noise.
- Storing and Using Metadata for Pre-filtering
- Store metadata alongside documents in the database.
- During retrieval, pre-filter documents based on query-specific metadata to narrow down the search space.
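A hedged sketch of the entity-detection idea; the GLiNER model name, label set, and threshold here are illustrative assumptions rather than the talk's exact setup:

```python
from gliner import GLiNER

# Load a pretrained GLiNER model (model name is illustrative).
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

query = "Can you get me the cruise division financial report for Q4 2022?"
labels = ["document type", "department", "time period"]

# Zero-shot NER: the label set is supplied at inference time.
entities = model.predict_entities(query, labels, threshold=0.5)
for ent in entities:
    print(ent["text"], "->", ent["label"])

# The extracted entities can then be turned into metadata pre-filters,
# e.g. a where("category = 'financial report'") clause at retrieval time.
```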
Putting it All Together: The Complete MVP++ Pipeline
The Final Compact MVP++
Code Example: Implementing the MVP++ with LanceDB
```python
# Fetch some text content in two different categories
from wikipediaapi import Wikipedia
wiki = Wikipedia('RAGBot/0.0', 'en')
docs = [{"text": x, "category": "person"}
        for x in wiki.page('Hayao_Miyazaki').text.split('\n\n')]
docs += [{"text": x, "category": "film"}
         for x in wiki.page('Spirited_Away').text.split('\n\n')]

# Enter LanceDB
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

# Initialise the embedding model
model_registry = get_registry().get("sentence-transformers")
model = model_registry.create(name="BAAI/bge-small-en-v1.5")

# Create a Model to store attributes for filtering
class Document(LanceModel):
    text: str = model.SourceField()
    vector: Vector(384) = model.VectorField()
    category: str

db = lancedb.connect(".my_db")
tbl = db.create_table("my_table", schema=Document)

# Embed the documents and store them in the database
tbl.add(docs)

# Generate the full-text (tf-idf) search index
tbl.create_fts_index("text")

# Initialise a reranker -- here, Cohere's API one
from lancedb.rerankers import CohereReranker
reranker = CohereReranker()

query = "What is Chihiro's new name given to her by the witch?"

results = (tbl.search(query, query_type="hybrid")      # Hybrid means text + vector
           .where("category = 'film'", prefilter=True)  # Restrict to only docs in the 'film' category
           .limit(10)                                    # Get 10 results from first-pass retrieval
           .rerank(reranker=reranker)                    # For the reranker to compute the final ranking
           )
```
Beyond the Basics: Future Exploration and Resources
- RAGatouille: Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline.
- rerankers: A lightweight unified API for various reranking models.
- Video Tutorial: A Hackers’ Guide to Language Models
- ColBERT
- Sparse Vectors: Understanding hybrid search
- Multi-vector Retrievers: Multi-Vector Retriever for RAG on tables, text, and images
Q&A Session
Fine-tuning Bi-Encoders and Cross-Encoders
Q: Does the fine-tuning approach for bi-encoder models impact the fine-tuning of cross-encoder models and vice versa?
A: The answer is domain-specific, but generally aim for complementarity: fine-tune bi-encoders for broader retrieval that captures potential candidates, and rely on cross-encoders (re-rankers) for precise filtering and ranking.
Combining Bi-Encoder and TF-IDF Scores
Q: What are the advantages and disadvantages of using a weighted average of bi-encoder and TF-IDF scores for selecting re-ranker questions compared to taking the top X from each ranker?
A: Both methods are valid and depend on the data. Weighted averages can be effective, but in domains like biomedicine, where document specificity is crucial, taking the top X from both ensures representation for potentially poorly embedded queries.
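An illustrative sketch of both strategies discussed in this answer; the score arrays and the 0.7/0.3 weighting are hypothetical:

```python
import numpy as np

# Scores for the same four candidate documents from each retriever (made up).
vector_scores = np.array([0.82, 0.75, 0.40, 0.15])
bm25_scores = np.array([2.1, 9.7, 0.3, 5.2])

def min_max(x):
    """Normalize scores to [0, 1] so the two scales are comparable."""
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

# Option 1: weighted average of normalized scores.
fused = 0.7 * min_max(vector_scores) + 0.3 * min_max(bm25_scores)
top_by_fusion = np.argsort(fused)[::-1][:2]

# Option 2: take the top X from each ranker (deduplicated), guaranteeing
# both retrievers are represented in the re-ranking pool.
top_x = 2
pool = set(np.argsort(vector_scores)[::-1][:top_x]) | set(np.argsort(bm25_scores)[::-1][:top_x])

print(top_by_fusion, pool)
```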
RAG’s Future with Million-Token Context Lengths
Q: How will the emergence of million-token context lengths impact the relevance of RAG in the future?
A: RAG remains relevant even with extended context windows. Just as RAM doesn’t replace hard drives, large context windows won’t replace the need for efficient retrieval from vast external knowledge stores. Long context windows provide more flexibility in retrieval speed and allow for incorporating longer documents.
Chunking Strategies
Q: What are your thoughts on different chunking strategies?
A: LLM-based pre-chunking is promising but currently immature; maintaining semantic continuity within chunks is vital. The recommended approach is roughly 300 tokens per chunk, avoiding mid-sentence breaks and including about 50 tokens of overlapping context between consecutive chunks.
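A rough sketch of that chunking scheme, approximating tokens with whitespace-split words and omitting the sentence-boundary handling recommended above:

```python
def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into ~chunk_size-word chunks, with `overlap` words of shared
    context between consecutive chunks. Words stand in for tokens here; a real
    implementation would use the embedding model's tokenizer and avoid
    breaking sentences."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# e.g., chunk the Wikipedia page text from the earlier MVP example
chunks = chunk_text(doc)
```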
Fine-tuning Bi-Encoders
Q: Should bi-encoders always be fine-tuned with labeled data or is it acceptable to use them off-the-shelf and rely on a re-ranker?
A: Fine-tuning encoders (both bi- and cross-) with labeled data consistently improves results. If data and resources are available, fine-tuning is highly recommended. However, for MVPs with limited resources, leveraging pre-trained models with re-ranking is a viable option.
ColBERT Clarification and Discussion
Discussion: Clarifying the role of ColBERT in RAG pipelines.
ColBERT as a First-Stage Retriever: Ideally replaces the bi-encoder in new pipelines, not used as a re-ranker.
ColBERT as a Re-Ranker: Can be used when pipeline changes are not feasible, but less optimal.
ColBERT Overview: A bi-encoder variant where documents and queries are represented as bags of embeddings (one per token). This approach enhances out-of-domain performance due to its multi-vector representation, capturing more granular information.
Tools for Fine-tuning Embeddings
Q: Recommendations for tools to fine-tune embeddings for retrieval.
A: Sentence Transformers, particularly version 3.0, is highly recommended for its user-friendliness and comprehensive implementation of essential features.
Fine-tuning Embeddings Workflow
Q: Can you describe the workflow for fine-tuning an embedding model?
A:
- Gather Data: Obtain queries and their corresponding relevant and non-relevant documents.
- Define Loss Function: Use a suitable loss function like triplet loss, which leverages positive and negative examples to guide the model.
- Consider Hard Negatives: Enhance training by retrieving hard negatives—documents similar to positive examples but irrelevant to the query.
- Data Analysis and Generation: Thoroughly analyze existing queries or generate synthetic ones using LLMs to augment training data.
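A hedged sketch of this workflow using the Sentence Transformers v3 trainer; the toy triplet dataset and model name are assumptions, not the speaker's setup:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# (query, positive, hard-negative) triplets; in practice these come from
# labeled data, mined hard negatives, or LLM-generated synthetic queries.
train_dataset = Dataset.from_dict({
    "anchor": ["What was Studio Ghibli's first film?"],
    "positive": ["Castle in the Sky (1986) was Studio Ghibli's first film."],
    "negative": ["The Studio Ghibli Museum is located in Mitaka, Tokyo."],
})

# Triplet loss pulls (anchor, positive) together and pushes the negative away.
loss = losses.TripletLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```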
Impact of Long Context Windows on RAG
Q: How do long context windows change the strategies and possibilities within RAG?
A: Long context windows enable:
Longer Documents: Incorporating longer documents or concatenated chunks into the context.
Reduced Retrieval Overhead: Relaxing the reliance on highly precise retrieval (e.g., Recall@3) as more documents can fit within the context window. This allows for faster, less resource-intensive retrieval methods.
Fine-tuning Encoder Tutorials
Q: Recommendations for tutorials on fine-tuning encoders.
A: The Sentence Transformers documentation is a valuable resource but can be challenging for beginners.
Go-to Embedding Models
Q: Go-to embedding models for different scenarios.
A:
Demos: Cohere’s embedding models due to their API accessibility, performance, and affordability.
Production: Multi-vector models like ColBERT are preferred.
General Recommendations:
Model Size: Stick to models with between 100 million and 1 billion parameters; larger LLMs used as encoders often have unfavorable latency-performance trade-offs.
Avoid Overly Large Models: Using excessively large LLMs for embedding can lead to diminishing returns in performance and increased latency.
Using Elasticsearch for RAG
Q: Can Elasticsearch, a widely used search engine, be integrated into RAG pipelines, especially for organizations already invested in it?
A:
Hybrid Approach: Use Elasticsearch’s BM25 capabilities for initial retrieval and integrate a separate re-ranking pipeline (potentially using a cross-encoder).
Vector Database Integration: Leverage Elasticsearch’s vector database offerings to incorporate semantic search capabilities.
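A hedged sketch of the hybrid approach: BM25 first-pass retrieval from an existing Elasticsearch index, followed by cross-encoder re-ranking. The index name, field name, host, and reranker choice are assumptions:

```python
from elasticsearch import Elasticsearch
from rerankers import Reranker

es = Elasticsearch("http://localhost:9200")

query = "cruise division financial report Q4 2022"

# First pass: Elasticsearch's BM25 scoring via a simple match query.
resp = es.search(index="documents", query={"match": {"text": query}}, size=20)
candidates = [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

# Second pass: re-rank the BM25 candidates with a cross-encoder.
ranker = Reranker("cross-encoder")
results = ranker.rank(query=query, docs=candidates)
```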
BM25 Score in Re-ranking
Q: Is it beneficial to incorporate BM25 similarity scores during the re-ranking stage?
A: No, BM25 scores are primarily used for candidate retrieval and are not typically required by cross-encoders during re-ranking.
Strategies for Chunks Exceeding Context Window
Q: Strategies for handling situations where document chunks exceed the context window size.
A: Solutions depend on the specific constraints:
Latency Tolerance: User experience dictates acceptable processing time.
Document Length and Diversity Requirements:
Precomputed Summaries: Maintain a separate database mapping documents to their summaries, generated offline. Retrieve relevant chunks and feed summaries into the context window to provide concise context.
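A tiny sketch of the precomputed-summaries idea; the document IDs, chunk format, and summaries are hypothetical:

```python
# Summaries are generated offline (e.g., by an LLM) and stored per document.
doc_summaries = {
    "hayao_miyazaki": "Career overview of Hayao Miyazaki and Studio Ghibli.",
    "spirited_away": "Plot and production summary of Spirited Away (2001).",
}

def build_context(retrieved_chunks):
    """Swap over-long retrieved chunks for their parent document's precomputed
    summary when assembling the LLM context."""
    parts = []
    for chunk in retrieved_chunks:
        summary = doc_summaries.get(chunk["doc_id"])
        parts.append(summary if summary is not None else chunk["text"])
    return "\n\n".join(parts)
```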