Conference Talk 18: Fine-Tuning OpenAI Models - Best Practices

Tags: notes, llms

In this talk, Steven Heidel from OpenAI’s fine-tuning team covers best practices, use cases, and recent updates for fine-tuning OpenAI models.

Author: Christian Mills
Published: August 29, 2024

This post is part of the following series:
  • Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.

What is fine-tuning? When to use it?

  • What is fine-tuning?
    • Training a model to follow a set of input-output examples.
    • Teaching the model to output specific responses for given inputs.
  • When to use fine-tuning?
    • Following a specific format or tone: e.g., writing in a consistent style.
    • Processing input in a particular way: e.g., extracting specific information from text.
    • Following complex instructions: When the base model struggles with specific instructions.
    • Improving latency (speed) and reducing token usage (cost): Compared to multi-shot prompting, fine-tuning can be faster and cheaper.
  • When NOT to use fine-tuning?
    • Teaching the model new knowledge: Fine-tuning is not effective for adding new information to the model’s knowledge base. Use RAG (Retrieval Augmented Generation) or custom models for this purpose.
    • Performing multiple, unrelated tasks: A single fine-tuned model is best suited for one task. Create separate models for different tasks or use prompt engineering.
    • Including up-to-date information: Fine-tuning is not suitable for incorporating real-time or constantly changing data.

Custom Models and Real-World Fine-tuning Examples

  • Question: What are “custom models”?
    • Answer:
      • Custom models involve a deeper level of collaboration with OpenAI.
      • OpenAI works with select partners to train and refine models using large datasets over several months.
      • Example: OpenAI collaborated with Harvey, a legal tech company, to create a model trained on case law. This model excels in legal tasks and significantly reduces hallucinations compared to the base GPT model.
  • Question: Are there detailed examples of fine-tuning for real-world use cases, including training data and prompts?

A Combined Approach

  • Question: What does it mean to combine fine-tuning and RAG, as shown in the upper-right quadrant of the “Optimizing LLMs” chart?
    • Answer:
      • The chart depicts a general progression, but specific applications may not require all steps.
      • Combining fine-tuning and RAG involves (see the sketch after this list):
        • Using RAG to introduce new knowledge or context.
        • Using fine-tuning to modify the model’s response style or instruction-following capabilities.
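
A minimal sketch of this combination, assuming a hypothetical `retrieve()` helper backed by your own vector store and a placeholder fine-tuned model ID: RAG supplies the knowledge at request time, while the fine-tuned model handles tone and instruction following.

```python
from openai import OpenAI

client = OpenAI()

def retrieve(question: str) -> str:
    """Hypothetical retrieval step: query your own vector store
    or search index for passages relevant to the question."""
    raise NotImplementedError

def answer(question: str) -> str:
    context = retrieve(question)  # RAG introduces the knowledge
    response = client.chat.completions.create(
        model="ft:gpt-3.5-turbo:my-org::abc123",  # placeholder fine-tuned model ID
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```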

Importance of Evals (Evaluation Metrics)

  • Key Takeaway: Define your own set of evals (evaluation metrics) tailored to your application’s specific needs.
  • Why not use standard benchmarks?
    • Standard benchmarks might not accurately reflect your application’s requirements.
  • Best Practice: Create evals based on real prompts and desired responses from your application (a minimal harness is sketched below).
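
A minimal harness along these lines, assuming a hypothetical eval set of real application prompts paired with expected outputs, and a crude substring-match pass criterion (swap in whatever scoring actually fits your task):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical eval set: real prompts from your application paired
# with the responses you expect.
EVAL_SET = [
    {"prompt": "Extract the invoice date: 'Invoice issued 2024-03-01 to ACME Corp.'",
     "expected": "2024-03-01"},
    # ... add cases that cover your application's real traffic
]

def run_evals(model: str) -> float:
    passed = 0
    for case in EVAL_SET:
        output = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        ).choices[0].message.content
        passed += case["expected"] in output  # crude criterion; tailor to your task
    return passed / len(EVAL_SET)

print(run_evals("gpt-3.5-turbo"))  # establish a baseline before fine-tuning
```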

A Cautionary Tale: When Fine-tuning Goes Wrong

  • Example: A company tried to create a Slack bot using fine-tuning on their entire Slack corpus.
    • Goal: The bot was intended to answer onboarding questions from new employees.
    • Problem: The model learned the format of Slack responses but not the underlying information.
      • It often responded with deferrals like “I’ll do that tomorrow” or “Okay.”
    • Reason for Failure:
      • The model focused on replicating common responses instead of understanding the content.
      • This approach violated the principle of not using fine-tuning to add new knowledge.
    • Better Approach: Using RAG on the Slack data to provide the model with relevant information.

Preparing Data for Fine-tuning

  • Data Format: Chat-formatted JSONL, where each line is a JSON object containing a messages array of system/user/assistant turns (see the sketch after this list).
  • Recommendations:
    • Number of Examples: 50-100 examples per task.
    • Focus on a Single Task: Use a single fine-tuned model for one specific task.
    • Consistency in Prompt Structure: Maintain consistency between the prompt structure used in fine-tuning and in actual application calls.
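
A minimal sketch of that format, assuming an illustrative extraction task and file name; each line of the JSONL file is one complete example:

```python
import json

# Each line of the training file is one complete chat-formatted example.
examples = [
    {"messages": [
        {"role": "system", "content": "You extract the invoice date from text."},
        {"role": "user", "content": "Invoice issued 2024-03-01 to ACME Corp."},
        {"role": "assistant", "content": "2024-03-01"},
    ]},
    # ... aim for 50-100 examples covering one task
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```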

Handling Multi-Turn Conversations and Internal Thoughts

  • Question: How to handle multi-turn conversations with RAG, function calling, and internal thoughts (generated by tools like Langchain) during fine-tuning data preparation? Should all internal thoughts be included, even if they are not always present in production?
    • Answer:
      • Experiment with the weight parameter to control which parts of the conversation the model should focus on learning.
      • The weight parameter allows you to assign different levels of importance to different messages in the training data (see the sketch after this Q&A).
      • General Advice:
        • Avoid excessive complexity in fine-tuning. If multi-turn conversations prove too complex, consider simplifying or using alternative approaches.
  • Follow-up Question: Is a weight parameter of 0 equivalent to setting the label of those tokens to -100 (effectively ignoring them)?
    • Answer:
      • A weight of 0 means the model won’t learn how to generate those specific messages (e.g., user and system messages).
      • However, it will still use them as context when learning how to generate other messages.
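
A sketch of a multi-turn training example using the weight field, assuming the chat fine-tuning format where assistant messages accept a weight of 0 or 1; a turn with weight 0 stays in the context the model sees but is not itself a training target:

```python
import json

# Multi-turn example: the intermediate assistant turn (an internal
# thought) is kept as context (weight 0) but is not a training target;
# only the final answer (weight 1) is learned.
example = {"messages": [
    {"role": "system", "content": "Answer questions using the retrieved documents."},
    {"role": "user", "content": "What is our refund window?"},
    {"role": "assistant", "content": "Thought: check the retrieved policy document.", "weight": 0},
    {"role": "user", "content": "Retrieved: 'Refunds are accepted within 30 days.'"},
    {"role": "assistant", "content": "Refunds are accepted within 30 days of purchase.", "weight": 1},
]}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```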

Evolution of the Weight Parameter and OpenAI’s Development Process

  • Observation: The introduction of the weight parameter was a significant development. However, its rollout wasn’t widely publicized.
  • Question: How does OpenAI decide on new features and gather feedback? Could this process be more transparent?
    • Answer:
      • OpenAI gathers feedback from various sources:
        • OpenAI Developer Forum
        • Direct customer interactions
        • Account managers
        • Events, conferences, and talks
      • The weight parameter was a direct result of this feedback process.
      • OpenAI acknowledges the importance of a more prominent announcement for such features.
      • Future Direction: OpenAI plans to continue adding models, methods, and customization options to its fine-tuning platform.

OpenAI’s Fine-tuning Best Practices and Hyperparameters

Curate examples carefully

  • Datasets can be difficult to build; start small and invest intentionally. Optimize for fewer, higher-quality training examples.
    • Consider “prompt baking”: using a basic prompt to generate your initial examples (see the sketch after this list)
    • If your conversations are multi-turn, ensure your examples are representative
    • Collect examples to target issues detected in evaluation
    • Consider the balance & diversity of data
    • Make sure your examples contain all the information needed in the response
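
A sketch of prompt baking, assuming an illustrative extraction prompt and model choice: run the basic prompt over real inputs, review the outputs by hand, and keep the good ones as seed training examples.

```python
from openai import OpenAI

client = OpenAI()

BASE_PROMPT = "You extract the invoice date from text. Reply with only the date."

def bake(inputs: list[str]) -> list[dict]:
    """Run the basic prompt over real inputs; the reviewed outputs
    become seed training examples."""
    examples = []
    for text in inputs:
        completion = client.chat.completions.create(
            model="gpt-4o",  # illustrative model choice
            messages=[
                {"role": "system", "content": BASE_PROMPT},
                {"role": "user", "content": text},
            ],
        )
        examples.append({"messages": [
            {"role": "system", "content": BASE_PROMPT},
            {"role": "user", "content": text},
            {"role": "assistant", "content": completion.choices[0].message.content},
        ]})
    return examples  # review by hand before writing to train.jsonl
```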

Iterate on hyperparameters

  • Start with the defaults and adjust based on performance.
    • If the model does not appear to converge, increase the learning rate multiplier
    • If the model does not follow the training data as much as expected, increase the number of epochs
    • If the model becomes less diverse than expected, decrease the number of epochs by 1-2

Establish a baseline

  • Often users start with a zero-shot or few-shot prompt to build a baseline evaluation before graduating to fine-tuning (sketched below).
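
A minimal few-shot baseline along these lines, reusing the illustrative extraction task from earlier; score it with your evals before deciding whether fine-tuning is worth the investment:

```python
from openai import OpenAI

client = OpenAI()

# Few-shot baseline: the same task as the planned fine-tune,
# expressed as in-context examples instead of training data.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You extract the invoice date from text."},
        {"role": "user", "content": "Invoice issued 2024-03-01 to ACME Corp."},
        {"role": "assistant", "content": "2024-03-01"},
        {"role": "user", "content": "Billed on 2023-11-15, due net 30."},
    ],
)
print(response.choices[0].message.content)
```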

Optimize for latency and token efficiency

  • When using GPT-4, once you have a baseline evaluation and training examples, consider fine-tuning GPT-3.5 to get similar performance for less cost and latency.
    • Experiment with reducing or removing system instructions with subsequent fine-tuned model versions.

Automate your feedback pipeline

  • Introduce automated evaluations to highlight potential problem cases to clean up and use as training data.
    • Consider the G-Eval approach of using GPT-4 to perform automated testing against a scorecard (a minimal judge is sketched below).
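
A minimal judge along these lines, assuming an illustrative scorecard and judge model (full G-Eval adds chain-of-thought and probability-weighted scoring; this is just the skeleton):

```python
from openai import OpenAI

client = OpenAI()

SCORECARD = """Rate the answer from 1-5 on each criterion:
- factual accuracy
- follows the requested format
- conciseness
Return JSON like {"accuracy": 5, "format": 4, "conciseness": 5}."""

def judge(question: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        messages=[
            {"role": "system", "content": SCORECARD},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content  # e.g. '{"accuracy": 4, ...}'
```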

Hyperparameters

  • Epochs: The number of times the training data is iterated over.
    • Impacts training the most.
    • Higher epochs risk overfitting, lower epochs might lead to underfitting.
    • OpenAI automatically chooses a default based on dataset size.
  • Batch Size and Learning Rate Multiplier: Have less impact but can be adjusted for optimization.
    • Larger batch sizes tend to work better for larger datasets.
    • Experiment with learning rate multipliers between 0.02 and 0.2.
    • Larger learning rates often perform better with larger batch sizes
  • Recommendation: Start with defaults and adjust based on eval results (see the sketch below).
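
A sketch of setting these hyperparameters when creating a job, assuming the openai Python SDK and illustrative values; omit hyperparameters entirely to accept OpenAI's defaults:

```python
from openai import OpenAI

client = OpenAI()

# Upload the training file, then start a fine-tuning job.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file=training_file.id,
    hyperparameters={
        "n_epochs": 3,                    # raise if the model under-follows the data
        "batch_size": 8,                  # larger tends to help on larger datasets
        "learning_rate_multiplier": 0.1,  # try the 0.02-0.2 range
    },
)
print(job.id)  # poll this job until it finishes
```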

Success Stories and Performance Improvements

  • Case Study 1: Fine-tuning a larger model (GPT-4) for an Icelandic government project resulted in significant improvements in grammar correction compared to the base model and other approaches.
  • Case Study 2: Fine-tuning GPT-3.5 outperformed GPT-4 with few-shot prompting in a specific task, highlighting potential cost and latency benefits.

Function Calling and Fine-tuning Considerations

  • Use Case: Fine-tuning can improve the model’s ability to select and call the correct functions.
  • Caution: When working with complex or lengthy function definitions, it might be necessary to include them in the prompt even after fine-tuning.
    • Unlike scenarios where fine-tuning can replace lengthy prompts, function calling may still require those definitions to be present at runtime (see the sketch below).
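
A sketch of that pattern, assuming a hypothetical get_invoice function and a placeholder fine-tuned model ID: the tool definition is still passed at request time, even though the model was fine-tuned on when to call it.

```python
from openai import OpenAI

client = OpenAI()

# The full tool definition still ships with every request.
tools = [{
    "type": "function",
    "function": {
        "name": "get_invoice",  # hypothetical function
        "description": "Fetch an invoice by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:my-org::abc123",  # placeholder fine-tuned model ID
    messages=[{"role": "user", "content": "Pull up invoice INV-42."}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```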

Reducing Hallucinations

  • Example: The Q&A model in OpenAI’s cookbook demonstrates how to reduce hallucinations.
    • Approach: The model is trained on Olympics-related questions. If a question falls outside this domain, it’s trained to respond with “I don’t know” (see the sketch below).
    • Result: Fine-tuning effectively minimizes false positives (incorrect answers) in this scenario.
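
A sketch of that training pattern, with illustrative questions: in-domain questions map to real answers, while out-of-domain questions map to “I don’t know.”

```python
import json

SYSTEM = "Answer questions about the Olympics. If unsure, say 'I don't know.'"

examples = [
    {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Which city hosted the 2012 Summer Olympics?"},
        {"role": "assistant", "content": "London"},
    ]},
    {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "What is the capital of France?"},  # out of domain
        {"role": "assistant", "content": "I don't know"},
    ]},
]

with open("qa_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```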

Adoption Rate of Fine-tuning and Open Source Alternatives

  • Observation: Less than 1% of OpenAI API users utilize fine-tuning.
    • Reason:
      • Fine-tuning is a more advanced technique used for optimization or when other methods are insufficient.
      • Many users find success with base models and simpler approaches.
  • Question: What are the advantages of fine-tuning OpenAI models compared to open-source models?
    • Answer:
      • Advantages of OpenAI models:
        • Access to state-of-the-art models like GPT-4 (for select partners).
        • Advanced features like tool calling, function calling, and the Assistants API.
        • Simplified deployment and infrastructure management.
      • Consider Open Source When:
        • OpenAI’s offerings don’t meet specific needs or violate terms of service.
  • Additional Insights:
    • OpenAI’s models, even GPT-3.5, often outperform open-source alternatives, especially for complex tasks and at scale.
    • OpenAI’s platform offers greater ease of use, especially with features like tool calling and handling API rate limits.

Moving from Fine-tuning to Newer Models

  • Observation: The release of more powerful base models sometimes leads users to abandon fine-tuning in favor of the improved base models.
  • Factors Influencing the Decision:
    • Performance Difference: If the new base model offers substantial improvement, switching might be preferable.
    • Cost and Latency: If the performance difference is minimal, sticking with a fine-tuned GPT-3.5 model might be more cost-effective and faster.

Sustainability and Future of Fine-tuning

  • Concern: Given Google’s history of discontinuing products, is there a risk of OpenAI doing the same with fine-tuning?
  • Reassurance:
    • OpenAI is committed to supporting applications that are successful with fine-tuning.
    • The company recognizes the investment users make in data preparation and training, and aims to avoid disruption.
    • Key Takeaway: OpenAI understands the higher switching cost associated with fine-tuning and aims to provide continued support.

Data Ownership and Licensing for Fine-tuning

  • Question: What happens to the data used for fine-tuning in terms of IP and licensing?
  • Answer:
    • Data Privacy: OpenAI never uses customer data for training its foundation models. This is explicitly stated in their privacy policy and terms of service.
    • Data Control: Users have complete control over their data’s lifecycle. They can delete it after fine-tuning or retain it for future models.

Language Model Agents and Function Calling Internals

  • Question: Are there any impressive examples of OpenAI being used to create successful language model agents?
  • Answer:
    • Steven defers to another team specializing in agents and tool calling.
    • Personal Interest: Steven finds GitHub Copilot Workspace and its agent-like capabilities for coding tasks promising.
  • Question: Is there tension between the abstraction provided by higher-level services (like the Assistants API) and fine-tuning, especially regarding data access and transparency?
    • Answer:
      • Fine-tuned models can be used within the Assistants API.
      • The Assistants API primarily aids in context management and tool access, not necessarily replacing fine-tuning.
  • Follow-up Question: If the Assistants API modifies or truncates context, wouldn’t that impact fine-tuning?
    • Answer:
      • Steven acknowledges the potential issue but lacks specific details about the Assistants API’s internal context handling.
  • Question: Regarding function calling, is it true that JSON schema definitions are translated into a TypeScript-like format before processing? Can users leverage this for simpler function definitions?
    • Answer:
      • Steven confirms that all inputs, including JSON schemas, are converted into OpenAI’s internal token format.
      • He can’t share specifics about the internal representation or translation process.
      • General Direction: OpenAI aims to make the default function calling experience (using JSON schemas) the most effective, without requiring users to rely on workarounds or internal knowledge.

About Me:
  • I’m Christian Mills, a deep learning consultant specializing in computer vision and practical AI implementations.
  • I help clients leverage cutting-edge AI technologies to solve real-world problems.
  • Learn more about me or reach out via email at [email protected] to discuss your project.