Conference Talk 9: Why Fine-Tuning is Dead

In this talk, Emmanuel Ameisen from Anthropic argues that fine-tuning LLMs is often less effective and efficient than focusing on fundamentals like data quality, prompting, and Retrieval-Augmented Generation (RAG).
Author: Christian Mills

Published: July 19, 2024

This post is part of the following series:
  • Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.

Performance Observations: Fine-tuning vs. RAG

Q&A on Fine-tuning, RAG, and Knowledge

  • Fine-tuning effectiveness and model size: Fine-tuning might be more beneficial for smaller models compared to larger ones, which are already more capable of learning from context.
  • Domain-specific knowledge and RAG: Even for smaller models, RAG remains crucial for tasks involving domain-specific knowledge.
  • Evaluating fine-tuning success: The choice of evaluation metric significantly impacts the perceived effectiveness of fine-tuning. For tasks like style adherence, fine-tuning might appear more beneficial than RAG.
  • The blurry line between style and content: The distinction between “style” and “content” can be ambiguous, making it difficult to definitively determine when fine-tuning is beneficial.

Audience Questions and Examples of Fine-tuning Success

  • Complex knowledge bases and fine-tuning: When dealing with large, curated knowledge bases, it’s crucial to evaluate whether the next generation of LLMs, combined with RAG, might be sufficient without fine-tuning.
  • Adding knowledge via prompting and RAG: In many cases, knowledge can be added to the model through prompting, RAG, or a combination of both, eliminating the need for fine-tuning (see the sketch after this list).
  • Fine-tuning for multilingual models: Fine-tuning might be beneficial for improving the performance of multilingual models on languages with limited training data, as it leverages the model’s existing understanding of language mapping.
  • Fine-tuning for code generation: While fine-tuning can be used to adapt code generation models to specific styles and conventions, RAG remains highly effective for providing codebase context.
  • Contextual learning vs. fine-tuning: LLMs are demonstrating impressive abilities to learn from context, potentially replacing the need for fine-tuning in scenarios where sufficient context can be provided.
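
To make the prompting-plus-RAG approach concrete, here is a minimal sketch: embed a handful of documents, retrieve the most relevant ones for a query, and inject them into the prompt. The embedding model and the toy document store are illustrative assumptions, not details from the talk.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Toy document store; in practice this would be a vector database.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am-5pm EST, Monday through Friday.",
    "Premium plans include priority support and a dedicated manager.",
]
doc_vecs = encoder.encode(documents, normalize_embeddings=True)

def build_prompt(question: str, k: int = 2) -> str:
    """Retrieve the top-k most similar documents and inject them as context."""
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    top_k = np.argsort(doc_vecs @ q_vec)[::-1][:k]  # cosine similarity (vectors normalized)
    context = "\n".join(documents[i] for i in top_k)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("When can I get a refund?"))
```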

The Moving Target: Fine-tuning and Frontier Models

  • Rapid LLM advancements: The rapid pace of LLM development makes fine-tuning a moving target, as newer models often surpass the performance of previously fine-tuned models.
  • BloombergGPT example: BloombergGPT, a large language model pre-trained on financial data, initially outperformed existing general-purpose models on financial tasks. However, its performance was subsequently surpassed by newer general-purpose models like GPT-4.
  • The cost of keeping up: Continuously fine-tuning on new model releases can be prohibitively expensive, especially for large datasets. Prompt-based and RAG-based pipelines offer more flexibility and cost-effectiveness.
  • Fine-tuning effectiveness and model scale: Fine-tuning might become less effective as models grow larger and more capable.

The Difficulty of Fine-tuning: Prioritizing Fundamentals

  • The 80/20 rule of ML: Similar to traditional ML, most effort in LLM development should be dedicated to data work (80%), followed by engineering (18%), debugging (2%), and architecture research (0%).
  • Fine-tuning as a last resort: Fine-tuning should only be considered after thoroughly addressing fundamentals like data quality, evaluation, prompting, and RAG.
  • Hierarchy of needs: Prioritize building a solid ML system with robust evaluation, prompting, and RAG before attempting to fine-tune.
    • Book: Building Machine Learning Powered Applications: Going from Idea to Product
    • Continuous Integration
      • Model Backtesting
      • Model Evaluation
      • Experimentation Framework
    • Application Logic
      • Input Validation → Filtering Logic → Model Code → Output Validation → Displaying Logic (sketched in code after this list)
    • Monitoring
      • Monitoring Input Distribution
      • Monitoring Latency
      • Monitoring Output Distribution
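
The Application Logic chain above maps directly onto plain functions. Below is a minimal sketch of that Input Validation → Filtering Logic → Model Code → Output Validation → Displaying Logic flow; the specific validation rules and the `call_model` stub are illustrative assumptions.

```python
def validate_input(text: str) -> str:
    """Input validation: reject empty or oversized requests."""
    text = text.strip()
    if not text or len(text) > 4000:
        raise ValueError("input must be 1-4000 characters")
    return text

def filter_input(text: str) -> str:
    """Filtering logic: drop requests the system should not handle."""
    if "password" in text.lower():
        raise ValueError("request touches disallowed content")
    return text

def call_model(prompt: str) -> str:
    """Model code: stub standing in for a real LLM call."""
    return f"(model response to: {prompt!r})"

def validate_output(text: str) -> str:
    """Output validation: guard against empty responses."""
    if not text.strip():
        raise ValueError("model returned an empty response")
    return text

def display(text: str) -> str:
    """Displaying logic: final formatting before it reaches the user."""
    return text.strip()

def handle_request(user_input: str) -> str:
    """Run the full chain: validate -> filter -> model -> validate -> display."""
    text = validate_input(user_input)
    text = filter_input(text)
    raw = call_model(text)
    return display(validate_output(raw))

print(handle_request("Summarize our refund policy."))
```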

Conclusion

  • Fine-tuning:
    • is expensive and complex
    • has become less valuable
    • often underperforms simpler approaches
  • Models are continuously becoming:
    • cheaper
    • smarter
    • faster
    • capable of longer contexts
  • Always start with:
    • prompting
    • making a train/test set (a minimal sketch follows this list)
    • RAG
  • Treat fine-tuning as a niche, last-resort solution
    • analogous to running on-prem instead of in the cloud
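
To make "making a train/test set" concrete, here is a minimal sketch: split a small set of labeled examples, use the training half as a few-shot pool, and score the prompt on the held-out half. The toy arithmetic examples and the `call_model` stand-in are illustrative assumptions.

```python
import random

# Toy labeled (input, expected output) pairs; in practice, collect real ones.
examples = [("2+2", "4"), ("3+3", "6"), ("5+5", "10"), ("7+7", "14")]

random.seed(0)
random.shuffle(examples)
split = int(0.75 * len(examples))
train, test = examples[:split], examples[split:]  # train doubles as the few-shot pool

def build_prompt(question: str) -> str:
    """Format the training pairs as few-shot demonstrations."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in train)
    return f"{shots}\nQ: {question}\nA:"

def call_model(prompt: str) -> str:
    """Toy stand-in that just computes the arithmetic; replace with a real LLM call."""
    question = prompt.rsplit("Q: ", 1)[-1].removesuffix("\nA:")
    return str(eval(question))  # acceptable here only because inputs are our own toy data

correct = sum(call_model(build_prompt(q)).strip() == a for q, a in test)
print(f"exact-match accuracy: {correct}/{len(test)}")
```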

Q&A Session

Question #1

  • Context window size and computational cost: Passing large amounts of context through an LLM for every request can be computationally expensive. However, increasing LLM efficiency and advancements like prefix caching could mitigate this cost (see the sketch after this list).
  • Fine-tuning complexity: Fine-tuning increasingly complex and larger LLMs might become more challenging, potentially outweighing the benefits compared to context-based approaches.
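
For context on prefix caching: Anthropic later shipped prompt caching in its Messages API, which lets you mark a long shared prefix as cacheable so repeated requests skip most of its prefill cost. A sketch of that pattern; since the feature shipped after this talk, treat the exact field names and model string as assumptions that may have changed.

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_context = open("knowledge_base.txt").read()  # large prefix shared across requests

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": long_context,
            # Mark the prefix as cacheable: later requests sharing this exact
            # prefix skip most of its prefill cost on the server side.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key points."}],
)
print(response.content[0].text)
```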

Question #2

  • Dynamic few-shot learning: Dynamically selecting and providing relevant few-shot examples from a database is a powerful technique for improving LLM performance without fine-tuning (sketched after this list).
  • Iterative prompt and example improvement: Invest time in iteratively refining prompts and curating effective few-shot examples before considering fine-tuning.
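
A minimal sketch of dynamic few-shot selection: embed a pool of curated examples once, then pull the nearest neighbors into the prompt for each request. The encoder choice and the toy example pool are illustrative assumptions.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Pool of curated (input, output) examples; in practice this lives in a database.
pool = [
    ("Cancel my subscription", "intent: cancellation"),
    ("Where is my package?", "intent: shipping_status"),
    ("I was charged twice", "intent: billing_issue"),
    ("How do I reset my password?", "intent: account_access"),
]
pool_vecs = encoder.encode([q for q, _ in pool], normalize_embeddings=True)

def few_shot_prompt(query: str, k: int = 2) -> str:
    """Select the k most similar examples and format them as few-shot demos."""
    q_vec = encoder.encode([query], normalize_embeddings=True)[0]
    nearest = np.argsort(pool_vecs @ q_vec)[::-1][:k]
    demos = "\n".join(f"Input: {pool[i][0]}\nOutput: {pool[i][1]}" for i in nearest)
    return f"{demos}\nInput: {query}\nOutput:"

print(few_shot_prompt("My card was billed two times"))
```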
