Conference Talk 14: Explaining the Basics of Retrieval Augmented Generation

Tags: notes, llms
In this talk, Ben Clavié from Answer.ai deconstructs the concept of Retrieval-Augmented Generation (RAG) and walks through building a robust, basic RAG pipeline.
Author: Christian Mills

Published: August 2, 2024

This post is part of the following series:
  • Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.

Summary

Ben Clavié deconstructs the concept of Retrieval-Augmented Generation (RAG) and guides the audience through building a robust, basic RAG pipeline. He emphasizes that RAG is not a standalone technology but a pipeline combining retrieval and generation, and that each component needs individual attention for optimization. Ben advocates for an “MVP++” approach, incorporating essential elements like bi-encoders, re-ranking, keyword search (TF-IDF/BM25), and metadata filtering for a well-rounded system.

Demystifying RAG

  • RAG: Overused and Misunderstood
    • The term “RAG” is often used incorrectly to represent an end-to-end system, creating confusion.
  • RAG as a Pipeline: Retrieval + Generation
    • RAG simply combines retrieval (finding relevant information) and generation (creating text) using Large Language Models (LLMs).
    • It’s not a single technology, but a pipeline requiring optimization at each stage: retrieval, generation, and their connection.
  • Importance of Identifying Specific RAG Issues
    • “My RAG doesn’t work” is too broad. Pinpointing the failing component (retrieval, LLM utilization) is crucial for debugging.

The Compact MVP: A Simple RAG Implementation

  • Basic Pipeline Components and Flow
    • Query embedding
    • Document embedding
    • Cosine similarity search for relevant documents

Diagram: the “bi-encoder” approach. The query and the documents are each passed through an embedding model and pooled into a single vector; a cosine similarity search between the query vector and the document vectors produces the results.

  • Code Example: Vector Search with NumPy
    • Demonstrates a basic RAG pipeline without a vector database, emphasizing simplicity.
    • Uses NumPy for cosine similarity search for demonstration purposes.
    # Load the embedding model
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("Alibaba-NLP/gte-base-en-v1.5")
    
    # Fetch some text content...
    from wikipediaapi import Wikipedia
    wiki = Wikipedia('RAGBot/0.0', 'en')
    doc = wiki.page('Hayao_Miyazaki').text
    paragraphs = doc.split('\n\n')
    # ...And embed it.
    docs_embed = model.encode(paragraphs, normalize_embeddings=True)
    
    # Embed the query
    query = "What was Studio Ghibli's first film?"
    query_embed = model.encode(query, normalize_embeddings=True)
    
    # Find the 3 closest paragraphs to the query
    import numpy as np
    similarities = np.dot(docs_embed, query_embed.T)
    top_3_idx = np.argsort(similarities)[::-1][:3].tolist()
    most_similar_documents = [paragraphs[idx] for idx in top_3_idx]
    Steps: (1) load the bi-encoder, (2) embed the documents, (3) embed the query, (4) run the cosine similarity search.

Improving Retrieval with Re-ranking

  • Bi-Encoder Limitations: Context Unawareness
    • Bi-encoders encode documents and queries separately, potentially missing nuanced relationships between them.
  • Cross-Encoders: Joint Encoding for Better Relevance Scoring
    • Encode query-document pairs together, allowing for a more context-aware relevance score.
    • Effectively a binary classifier
      • Uses the probability of being the positive class as the similarity score.
    • Computationally expensive for large datasets.

Diagram: Bi-Encoder. The query and the documents are embedded and pooled separately, then compared with a cosine similarity search.

Diagram: Cross-Encoder. Each query-document pair is fed jointly into the cross-encoder, which outputs a similarity score.

  • Re-ranking in Practice: Addressing Computational Costs

    • Leverage a powerful but computationally expensive model (like a cross-encoder) to score only a subset of your documents, previously retrieved by a more efficient model (a minimal reranking sketch follows the pipeline diagram below).
    • Examples of other re-ranking approaches:
      • RankGPT: LLMs as Re-Ranking Agent
      • RankLLM: Repository for prompt-decoding using LLMs
  • rerankers: A lightweight unified API for various reranking models.

Diagram: Compact Pipeline + Reranking. The query and the documents go through the bi-encoder (embed + pool), a cosine similarity search retrieves candidates, and a reranking step reorders them before returning the results.
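
As a concrete, hedged illustration of the reranking step (not code from the talk), the sketch below rescores the candidates retrieved by the earlier NumPy example with a cross-encoder from sentence-transformers. The checkpoint name is one commonly used choice, and `query` and `most_similar_documents` are reused from the MVP code above.

# Rerank the first-pass candidates with a cross-encoder (illustrative sketch)
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Score each (query, document) pair jointly
pairs = [(query, doc) for doc in most_similar_documents]
scores = cross_encoder.predict(pairs)

# Reorder candidates by cross-encoder score, highest first
reranked = [doc for _, doc in sorted(zip(scores, most_similar_documents),
                                     key=lambda pair: pair[0], reverse=True)]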

Leveraging Metadata for Targeted Retrieval

  • Real-World Data Has Context: Metadata Matters
    • Real-world documents often possess valuable metadata (e.g., author, date, department) that can significantly improve retrieval accuracy.
    • Pure semantic or keyword search can struggle with metadata
      • Example Query: “Can you get me the cruise division financial report for Q4 2022?”
        • The model must accurately represent all of “financial report”, “cruise division”, “Q4”, and “2022” in a single vector
          • Otherwise it will fetch documents that look relevant but fail to meet one or more of those criteria.
        • If the number of documents you retrieve (“k”) is set too high, you will be passing irrelevant financial reports to your LLM
  • Entity Detection and Metadata Filtering: A Practical Example
  • Storing and Using Metadata for Pre-filtering
    • Store metadata alongside documents in the database.
    • During retrieval, pre-filter documents based on query-specific metadata to narrow down the search space (a small sketch follows this list).
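
As a hedged sketch of entity detection plus metadata pre-filtering (not code from the talk; the document structure and field names are assumed for illustration), the snippet below pulls the reporting period out of the query with a simple regex and narrows the candidate set before any semantic or keyword search runs.

import re

# Hypothetical documents with metadata attached (fields assumed for illustration)
documents = [
    {"text": "Cruise division financial report ...", "division": "cruise", "period": "Q4 2022"},
    {"text": "Cruise division financial report ...", "division": "cruise", "period": "Q3 2022"},
    {"text": "Hotel division financial report ...", "division": "hotel", "period": "Q4 2022"},
]

query = "Can you get me the cruise division financial report for Q4 2022?"

# Naive entity detection: pull a quarter/year mention out of the query
match = re.search(r"Q[1-4]\s*20\d{2}", query)
period = match.group(0) if match else None

# Pre-filter on metadata to shrink the search space before semantic/keyword search
candidates = [doc for doc in documents if period is None or doc["period"] == period]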

Putting it All Together: The Complete MVP++ Pipeline

The Final Compact MVP++

Diagram: The Compact MVP++. The query and the documents feed two parallel paths: a bi-encoder path (embed + pool, metadata document filtering, cosine similarity search) and a tf-idf path (weighted full-text representation, metadata document filtering, BM25 full-text search). The scores from both searches are combined, reranked, and returned as the results.

Code Example: Implementing the MVP++ with LanceDB

# Fetch some text content in two different categories
from wikipediaapi import Wikipedia
wiki = Wikipedia('RAGBot/0.0', 'en')
docs = [{"text": x,
         "category": "person"}
        for x in wiki.page('Hayao_Miyazaki').text.split('\n\n')]
docs += [{"text": x,
          "category": "film"}
         for x in wiki.page('Spirited_Away').text.split('\n\n')]

# Enter LanceDB
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

# Initialise the embedding model
model_registry = get_registry().get("sentence-transformers")
model = model_registry.create(name="BAAI/bge-small-en-v1.5")

# Create a Model to store attributes for filtering
class Document(LanceModel):                    
    text: str = model.SourceField()
    vector: Vector(384) = model.VectorField()
    category: str

db = lancedb.connect(".my_db")
tbl = db.create_table("my_table", schema=Document)

# Embed the documents and store them in the database
tbl.add(docs)                                        

# Generate the full-text (tf-idf) search index
tbl.create_fts_index("text")                         

# Initialise a reranker -- here, Cohere's API one
from lancedb.rerankers import CohereReranker

reranker = CohereReranker()                        

query = "What is Chihiro's new name given to her by the witch?"

results = (tbl.search(query, query_type="hybrid") # Hybrid means text + vector
           .where("category = 'film'", prefilter=True) # Restrict to only docs in the 'film' category
           .limit(10) # Get 10 results from first-pass retrieval
           .rerank(reranker=reranker) # Let the reranker compute the final ranking
          )
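
One follow-up note on this sketch: the chained calls above only build the query. Based on LanceDB's query API (an assumption, not shown in the talk), you would typically end the chain with an output method to materialize the reranked rows, for example:

# Execute the query and collect the reranked rows (assumed LanceDB usage)
top_docs = results.to_pandas()
print(top_docs[["text", "category"]].head())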

Beyond the Basics: Future Exploration and Resources

Q&A Session

Fine-tuning Bi-Encoders and Cross-Encoders

  • Q: Does the fine-tuning approach for bi-encoder models impact the fine-tuning of cross-encoder models and vice versa?

  • A: While domain-specific, generally aim for complementarity. Fine-tune bi-encoders for broader retrieval, capturing potential candidates. Rely on cross-encoders (re-rankers) for precise filtering and ranking.

Combining Bi-Encoder and TF-IDF Scores

  • Q: What are the advantages and disadvantages of using a weighted average of bi-encoder and TF-IDF scores for selecting candidates for the re-ranker compared to taking the top X from each ranker?

  • A: Both methods are valid and depend on the data. Weighted averages can be effective, but in domains like biomedicine, where document specificity is crucial, taking the top X from both ensures representation for potentially poorly embedded queries.
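
As a hedged illustration of the weighted-average option (the scores, weights, and min-max normalization are assumptions for illustration, not from the talk):

import numpy as np

def min_max_normalize(scores):
    # Scale scores to [0, 1] so cosine and BM25 scores are comparable
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

# Hypothetical first-pass scores for the same three candidate documents
vector_scores = [0.82, 0.75, 0.40]  # cosine similarities from the bi-encoder
bm25_scores = [12.3, 2.1, 9.8]      # raw BM25 scores from full-text search

# Weighted average: 70% vector, 30% BM25 (weights are an arbitrary illustrative choice)
combined = 0.7 * min_max_normalize(vector_scores) + 0.3 * min_max_normalize(bm25_scores)
ranking = np.argsort(combined)[::-1]  # candidate indices, best first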

RAG’s Future with Million-Token Context Lengths

  • Q: How will the emergence of million-token context lengths impact the relevance of RAG in the future?

  • A: RAG remains relevant even with extended context windows. Just as RAM doesn’t replace hard drives, large context windows won’t replace the need for efficient retrieval from vast external knowledge stores. Long context windows provide more flexibility in retrieval speed and allow for incorporating longer documents.

Chunking Strategies

  • Q: What are your thoughts on different chunking strategies?

  • A: Using LLMs for pre-chunking is promising but currently immature; what matters most is maintaining semantic continuity within chunks. The recommended rule of thumb is roughly 300 tokens per chunk, avoiding mid-sentence cuts and keeping about 50 tokens of overlapping context between consecutive chunks (a small sketch follows).
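
A hedged sketch of that rule of thumb, using whitespace word counts as a rough stand-in for a real tokenizer (an assumption for illustration):

def chunk_text(text, chunk_size=300, overlap=50):
    # Split on sentences so a chunk never cuts a sentence in half
    sentences = [s.strip(" .") for s in text.split(". ") if s.strip(" .")]
    chunks, current, length = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())  # word count as a rough token proxy
        if current and length + n_words > chunk_size:
            chunks.append(". ".join(current) + ".")
            # Carry roughly `overlap` words of trailing context into the next chunk
            tail, tail_len = [], 0
            for prev in reversed(current):
                tail_len += len(prev.split())
                tail.insert(0, prev)
                if tail_len >= overlap:
                    break
            current, length = tail, tail_len
        current.append(sentence)
        length += n_words
    if current:
        chunks.append(". ".join(current) + ".")
    return chunks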

Fine-tuning Bi-Encoders

  • Q: Should bi-encoders always be fine-tuned with labeled data or is it acceptable to use them off-the-shelf and rely on a re-ranker?

  • A: Fine-tuning encoders (both bi- and cross-) with labeled data consistently improves results. If data and resources are available, fine-tuning is highly recommended. However, for MVPs with limited resources, leveraging pre-trained models with re-ranking is a viable option.

ColBERT Clarification and Discussion

  • Discussion: Clarifying the role of ColBERT in RAG pipelines.

    • ColBERT as a First-Stage Retriever: Ideally replaces the bi-encoder in new pipelines, not used as a re-ranker (a hedged usage sketch follows this list).

    • ColBERT as a Re-Ranker: Can be used when pipeline changes are not feasible, but less optimal.

    • ColBERT Overview: A bi-encoder variant where documents and queries are represented as bags of embeddings (one per token). This approach enhances out-of-domain performance due to its multi-vector representation, capturing more granular information.
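
For the first-stage-retriever usage, a hedged sketch with the RAGatouille library (one common way to run ColBERT in Python; the model name, index name, toy documents, and call signatures are assumptions based on its documented interface, not from the talk):

# Use ColBERT as the first-stage retriever via RAGatouille (illustrative sketch)
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Build a multi-vector index over the document collection
RAG.index(collection=["Hayao Miyazaki co-founded Studio Ghibli in 1985.",
                      "Spirited Away was released in 2001."],
          index_name="miyazaki_demo")

# Retrieve the top-k documents for a query
results = RAG.search(query="When was Studio Ghibli founded?", k=2)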

Tools for Fine-tuning Embeddings

  • Q: Recommendations for tools to fine-tune embeddings for retrieval.

  • A: Sentence Transformers, particularly version 3.0, is highly recommended for its user-friendliness and comprehensive implementation of essential features.

Fine-tuning Embeddings Workflow

  • Q: Can you describe the workflow for fine-tuning an embedding model?

  • A:

    1. Gather Data: Obtain queries and their corresponding relevant and non-relevant documents.
    2. Define Loss Function: Use a suitable loss function like triplet loss, which leverages positive and negative examples to guide the model (see the sketch after this list).
    3. Consider Hard Negatives: Enhance training by retrieving hard negatives—documents similar to positive examples but irrelevant to the query.
    4. Data Analysis and Generation: Thoroughly analyze existing queries or generate synthetic ones using LLMs to augment training data.
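
A minimal, hedged sketch of this workflow with Sentence Transformers (the base model and the single toy triplet are assumptions for illustration; real fine-tuning needs far more data):

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# 1. Start from a pre-trained bi-encoder
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# 2. Each training example is (query, relevant document, hard negative)
train_examples = [
    InputExample(texts=[
        "What was Studio Ghibli's first film?",                      # query
        "Studio Ghibli's first film was Castle in the Sky (1986).",  # positive
        "Hayao Miyazaki was born in Tokyo in 1941.",                 # hard negative
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)

# 3. Triplet loss pulls queries toward positives and away from negatives
train_loss = losses.TripletLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)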

Impact of Long Context Windows on RAG

  • Q: How do long context windows change the strategies and possibilities within RAG?

  • A: Long context windows enable:

    • Longer Documents: Incorporating longer documents or concatenated chunks into the context.

    • Reduced Retrieval Overhead: Relaxing the reliance on highly precise retrieval (e.g., Recall@3) as more documents can fit within the context window. This allows for faster, less resource-intensive retrieval methods.

Fine-tuning Encoder Tutorials

Go-to Embedding Models

  • Q: Go-to embedding models for different scenarios.

  • A:

    • Demos: Cohere’s embedding models due to their API accessibility, performance, and affordability.

    • Production: Multi-vector models like ColBERT are preferred.

  • General Recommendations:

    • Model Size: Stick to models with parameters between 100 million and 1 billion; larger LLMs as encoders often have unfavorable latency-performance trade-offs.

    • Avoid Overly Large Models: Using excessively large LLMs for embedding can lead to diminishing returns in performance and increased latency.

Using Elasticsearch for RAG

  • Q: Can Elasticsearch, a widely used search engine, be integrated into RAG pipelines, especially for organizations already invested in it?

  • A:

    • Hybrid Approach: Use Elasticsearch’s BM25 capabilities for initial retrieval and integrate a separate re-ranking pipeline (potentially using a cross-encoder); a rough sketch follows this list.

    • Vector Database Integration: Leverage Elasticsearch’s vector database offerings to incorporate semantic search capabilities.
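
A rough, hedged sketch of the hybrid approach (the index name, field name, and a locally running Elasticsearch 8.x cluster are all assumptions for illustration, not from the talk):

from elasticsearch import Elasticsearch
from sentence_transformers import CrossEncoder

es = Elasticsearch("http://localhost:9200")  # assumes a running cluster with data already indexed
query = "cruise division financial report for Q4 2022"

# First pass: BM25 full-text retrieval from an existing index
response = es.search(index="reports", query={"match": {"text": query}}, size=50)
candidates = [hit["_source"]["text"] for hit in response["hits"]["hits"]]

# Second pass: rerank the BM25 candidates with a cross-encoder
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates),
                                     key=lambda pair: pair[0], reverse=True)]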

BM25 Score in Re-ranking

  • Q: Is it beneficial to incorporate BM25 similarity scores during the re-ranking stage?

  • A: No, BM25 scores are primarily used for candidate retrieval and are not typically required by cross-encoders during re-ranking.

Strategies for Chunks Exceeding Context Window

  • Q: Strategies for handling situations where document chunks exceed the context window size.

  • A: Solutions depend on the specific constraints:

    • Latency Tolerance: User experience dictates acceptable processing time.

    • Document Length and Diversity Requirements:

    • Precomputed Summaries: Maintain a separate database mapping documents to their summaries, generated offline. Retrieve relevant chunks and feed summaries into the context window to provide concise context.

