Workshop 1: When and Why to Fine-Tune an LLM

notes

llms

Workshop #1 provides a practical overview of fine-tuning large language models, focusing on when it is and is not beneficial, emphasizing a workflow of simplification, prototyping, and iterative improvement using evaluations.

Author

Christian Mills

Published

May 31, 2024

This post is part of the following series:

Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.

Key Takeaways
Course Overview
When to Fine-Tune
Understanding Fine-Tuning
Case Study: Logistics Company Regression Problem
Case Study: Honeycomb Natural Language Query Assistant
Q&A Session #1
Chatbots
Preference Optimization
Evaluating Use Cases for Fine-Tuning
Q&A Session #2

Key Takeaways

Start simple: Focus on prompt engineering and using pre-trained models like those from OpenAI before jumping into the complexity of fine-tuning.
Fine-tune strategically: Consider fine-tuning when you need bespoke behavior, have unique data, or require data privacy.
Templating is crucial: Pay close attention to consistency in templating between training and inference to avoid unexpected model behavior.
Evaluate rigorously: Use domain-specific evaluations and metrics to measure model performance and guide fine-tuning decisions.
Preference optimization shows promise: Techniques like Direct Preference Optimization (DPO) can train models to outperform even human experts by learning from comparative feedback.

Course Overview

Focus: Actionable insights and practical guidance from real-world experience in deploying LLMs for various business needs.
Philosophy:
- Prioritize practical value over project ideas that only sound cool.
- Start with simple, straightforward solutions and progressively refine them.
- Ship prototypes quickly for rapid iteration and feedback.
Workflow:
- Start with prompt engineering before considering fine-tuning.
  - Prompt engineering provides much faster iteration and experimentation.
  - The results from prompt engineering will help inform whether fine-tuning is necessary.
- Iterate quickly with simple prototypes.
  - Build and show people concrete things, so they can provide feedback.
  - Simple prototypes almost always work well enough to start making progress.
- Incorporate evaluations (Evals) to measure and improve model performance.
  - Blog Post: Your AI Product Needs Evals

When to Fine-Tune

Don’t fine-tune for generic behavior:
- Use existing powerful models like OpenAI’s GPT or Anthropic’s models via API for tasks where they excel.
- Increasingly larger context windows allows us to fit more examples to fit into a prompt.
- You should have some minimal evaluation system that you hit a wall on with prompting alone, before considering fine-tuning.
Do fine-tune for bespoke behavior:
- When you need specific outputs or behavior not achievable through prompt engineering alone.
- When you have a narrow, well-defined problem domain and sufficient data for training.
  - Fine-tuning requires examples of desired inputs and outputs for supervised learning.
- When data privacy and model ownership are critical.
- When you need improved quality and lower latency compared to large pre-trained models.
- Requires proper operational setup and significant value use cases.
Iteration Speed & Complexity: Fine-tuning involves slower iteration cycles and operational complexities compared to using pre-trained models.

Understanding Fine-Tuning

Pre-training: Training LLMs on massive text datasets to learn language fundamentals and next-token prediction.
Building on Pre-trained Models:
- Fine-tuning adapts pre-trained models with vast general language knowledge to excel in specific domains.
- Fine-tuning harnesses the next-token prediction mechanism used in pre-training to generate desired outputs.
Importance of Input-Output Examples
- Fine-tuning requires clear examples of desired inputs and outputs.
- Documentation alone isn’t sufficient; practical examples are necessary.
- Mixed quality of training data (e.g., varied quality of human-written summaries) can lead to mediocre model performance.
Templating for Inference Control:
- Guides the model to produce specific outputs by short-circuiting pre-trained behavior.
- Inputs and outputs are placed within a consistent template to guide the model during inference.
- Crucial for aligning training and inference.
- Defines the structure of input and output text to guide the model.
- Inconsistencies in templating are a major source of errors.
  - Templates must be identical between training and inference.
  - There are many kinds of templates and it is easy to misinterpret them.
  - Many tools try to abstract away and automate building templates and something often goes wrong.
  - Blog Post: Tokenization Gotchas

Case Study: Logistics Company Regression Problem

Overview

Task: Logistics company (e.g., UPS, DHL, USPS) needed to predict item value based on an 80-character description.
Takeaways: Highlights the importance of understanding and preparing the training data, the limitations of fine-tuning for specific regression tasks, and the practical issues encountered with this approach.

Traditional NLP and ML Approaches

Classical Techniques: Initial consideration to use traditional NLP and ML methods.
Bag of Words Representation: Highlighted issue where models fail to recognize unseen words or infrequent words due to limited data.

Fine-Tuning Large Language Models

Initial Approach: Attempted to use a large language model (LLM) with and without fine-tuning for regression.
Outcome: The model learned patterns in the data that were not ideal for the task.

Key Observations from Fine-Tuning

Round Numbers: The model tended to predict round numbers frequently because past entries often used round numbers.
Mismatch in Values: Conventional ML models can predict approximate values (e.g., $97 vs. $100), which is often more useful than exact but less frequent round number predictions by the LLM.
Training Data Limitations: Training data often contained inaccuracies, such as undervalued entries to avoid insurance costs.

Data Representation and Preprocessing

Description Complexity: Corporate descriptions were often abbreviated or used acronyms, making them hard to interpret both for humans and models.
Pre-trained Model Limitations: Pre-trained models struggled with unknown abbreviations or context-specific terms not encountered during pre-training.

Conclusion

Unsuccessful Case Study: The fine-tuning approach was largely unsuccessful due to predictable data issues.

Insights and Recommendations

Data Quality: Emphasized the importance of high-quality, representative training data for desired future behavior.
Raw Data Examination: Stressed the need to carefully inspect raw data, a common yet frequently overlooked step in data science.
Practicality of ML Solutions: For this case, traditional ML and NLP techniques did not provide satisfactory results, leading to the retention of the manual workflow.

Case Study: Honeycomb Natural Language Query Assistant

Overview

Task: Building a system for Honeycomb, an observability platform that logs telemetry data about software applications, that translates natural language queries into the platform’s domain-specific query language.
Takeaways: Highlights the importance of fine-tuning in addressing domain-specific challenges, improving model performance, and meeting business requirements such as data privacy and operational efficiency.

Honeycomb Platform Overview

Honeycomb is an observability platform.
Logs telemetry data like page load times, database response times, and application bottlenecks.
Users query this data using a domain-specific query language.

Initial Solution: Natural Language Query Assistant

Problem: Users must learn a specific query language to use Honeycomb effectively.
Solution: Create a natural language query assistant that translates user queries into Honeycomb’s query language using large language models (LLMs).
Initial Approach:
- User provides a query and schema (list of column names from the user’s data).
- Prompt assembled with user input and schema sent to GPT-3/GPT-3.5.
- Generated a Honeycomb query based on the prompt.

Prompt Structure

System Message:
- “Honeycomb AI suggests queries based on user input.”
Columns Section:
- Schema from the user’s data inserted here.
Query Spec:
- Simplified programming manual for Honeycomb’s query language.
- Contains operations and comments on their usage.
Tips Section:
- Guidelines to handle different failure modes and edge cases.
- Example: Handling time ranges correctly.
Few-Shot Examples:
- Examples of natural language queries and corresponding Honeycomb query outputs.

Challenges with Initial Solution

Expressing Query Language Nuances:
- Hard to capture all idioms and best practices of the query language.
- GPT-3.5 lacks extensive exposure to Honeycomb’s specific query language.
Tips Section Complexity:
- Tips devolved into numerous if-then statements.
- Difficult for the language model to follow multiple conditionals.
Few-Shot Examples Limitations:
- Hard to cover all edge cases.
- Dynamic few-shot examples could help but were not implemented.

Business Challenges

Data Privacy:
- Need permission to send customer data to OpenAI.
- Preference to keep data within a trusted boundary.
Quality vs. Latency Tradeoff:
- GPT-4 offered higher quality but was too slow and expensive.
- Goal: Train a smaller, faster model with comparable quality.
Narrow Domain Problem:
- Honeycomb queries are a focused, narrow domain ideal for fine-tuning.
Impracticality of Extensive Prompt Engineering:
- Hard to manually encode all nuances of the query language.
- Fine-tuning with many examples is more practical.

Fine-Tuning Solution

Advantages:
- Faster, more compliant with data privacy needs.
- Higher quality responses compared to GPT-3.5.
Implementation:
- Fine-tuned a model using synthetic data provided by Honeycomb.
- The process and challenges encountered during fine-tuning will be simulated in the course.

Recommendations

Implement Fine-Tuning:
- Use synthetic data to replicate and improve the model.
- Focus on capturing edge cases and nuances in the training data.
Optimize for Performance:
- Balance model size and latency to ensure quick responses without sacrificing quality.
Ensure Data Privacy:
- Keep data within a trusted boundary to comply with customer privacy requirements.
Regularly Update Few-Shot Examples:
- Dynamically generate examples to cover new edge cases and improve model accuracy.
Monitor and Iterate:
- Continuously monitor model performance and iteratively improve based on user feedback and new data.

Q&A Session #1

This Q&A session covers various aspects of fine-tuning machine learning models, particularly focusing on fine-tuning versus retrieval-augmented generation (RAG), function calling, and synthetic data generation. It also touches upon the use of base models versus instruction-tuned models and the appropriate amount of data for fine-tuning.

Fine-Tuning vs. RAG

Definitions:
- Fine-Tuning: Adjusting a pre-trained model with additional data to improve performance in specific tasks.
- RAG (Retrieval-Augmented Generation): Combines information retrieval with generation to produce responses based on external documents.
Key Point: Fine-tuning and RAG are not mutually exclusive; they can complement each other.
Process: Validate the need for fine-tuning by ensuring good prompts and effective RAG.

Fine-Tuning for Function Calls

Capability: Models can be fine-tuned to improve at making function calls.
Examples: Open models like LLaMA 3 and LLaMA 2 have been fine-tuned for function calling.
Challenges: Identify and use good training data with successful function call examples while filtering out failures.

Data Requirements for Fine-Tuning

Amount of Data: Success with as few as 100 samples, though this varies by problem scope.
Broad Scope Problems: Require more data to cover the problem space adequately.
Narrow Scope Problems: Can often be fine-tuned with relatively little data.

Synthetic Data Generation

Importance: Helps overcome data scarcity in specific domains.
Methods: Use powerful models to generate synthetic data, perturb existing data, and create test cases.
Practical Example: Honeycomb example shows generating synthetic data to test and train models.

Base Models vs. Instruction-Tuned Models

Base Models: Not fine-tuned for specific instructions, allowing more control over fine-tuning processes.
Instruction-Tuned Models: Pre-fine-tuned to respond to instructions, useful in broader chat-based applications.
Preference: Often uses base models to avoid template conflicts and ensure specific fine-tuning needs.

Model Size for Fine-Tuning

Preferred Size: Starts with smaller models (e.g., 7 billion parameters) and scales up based on complexity and performance needs.
Trade-Offs: Larger models require more resources and justification due to higher costs and hosting difficulties.

Multimodal Fine-Tuning

Example Project: Fine-tuning models to write alt text for images to assist visually impaired users.
Tools: The LLaVA model is recommended for fine-tuning multimodal tasks.

Recommendations

Validate the Need for Fine-Tuning: Before starting, ensure you have good prompts and effective RAG if applicable.
Choose the Right Data: Use high-quality, successful examples for fine-tuning and filter out poor results.
Start Small: Begin with smaller models and incrementally increase size based on performance needs.
Leverage Synthetic Data: Generate and use synthetic data to supplement training data, especially in data-scarce domains.
Understand Model Types: Choose between base models and instruction-tuned models based on the specific use case and desired control over fine-tuning.
Explore Multimodal Capabilities: Consider multimodal fine-tuning for tasks that require handling both text and images, utilizing models like LLaVA.

Chatbots

Overview

Topic: Delves into the common pitfalls and considerations when working with LLM-powered chatbots.
Takaways: Highlights why general-purpose chatbots are often a bad idea, with unrealistic expectations and overly broad scope leading to poor user experiences and significant challenges in development.

Importance of Saying No to General-Purpose Chatbots

Prevalence of Chatbot Requests: When working with LLMs, most clients will request a chatbot.
Need for Caution: It’s often necessary to push back on these requests due to potential complications.

Case Study: Rechat Real Estate CRM Tool

Initial Concept: A CRM tool for real estate that integrated multiple functionalities (appointments, listings, social media marketing).
Initial Implementation: Started with a broad chat interface labeled “Ask Lucy anything.”
- Problems with Broad Scope:
  - Unmanageable surface area.
  - User expectations mismatched with capabilities.
  - Difficult to make progress on scoped tasks.

Lessons from Rechat Case Study

Scoped Interfaces: Guide users towards specific tasks.
Fine-Tuning Challenges: Difficult to fine-tune against a large and varied set of functions.

Managing User Expectations

High User Expectations: Users often assume chatbots can handle any request, leading to disappointment.
Setting Realistic Boundaries: Important to guide users on what the chatbot can realistically do.

Real-World Example: DPD Chatbot Incident

Background: A chatbot released for a package delivery company, DPD, faced issues on launch.
Incident: The chatbot swore in response to a user’s prompt, leading to negative publicity.
- Media Coverage: The incident was widely reported, causing significant concern within the company.
Lesson Learned:
- Expectations vs. Reality: Even harmless errors can become major issues if they attract public attention.
- Guardrails: Conventional software has clear input validation; free-form text input in chatbots is harder to manage.

Guardrails and Prompt Injections

Challenges with Guardrails: Tools to check for prompt injections are imperfect.
Importance of Reviewing Prompts: Critical to understand and review the prompts used by guardrails to ensure safety.

Recommendations

Scoped Interfaces Over General Chatbots: Focus on integrating chatbot functionalities into specific parts of the application rather than creating a general-purpose chatbot.
User Expectation Management: Clearly communicate what the chatbot can and cannot do to manage user expectations effectively.
Modular Functionality: Break down the chatbot’s functionalities into specific modules that can be fine-tuned individually.
Review Guardrails: Regularly review and understand the prompts and guardrails to ensure they are functioning correctly.
Careful Rollout: Test chatbots extensively before public release to avoid unexpected behaviors that could lead to negative publicity.

Preference Optimization

Discusses the effectiveness of Direct Preference Optimization (DPO) in fine-tuning LLMs to produce superior outputs. By leveraging human preferences in comparing two responses to the same prompt, DPO can significantly improve the quality of model outputs.

Preference Optimization Algorithms

Challenge: Human-generated data is often imperfect, and training models solely on this data can lead to suboptimal results.
Human Preference Evaluation: Humans excel at choosing between two options based on preference.
Preference Optimization Algorithms: These techniques leverage human preferences to fine-tune models.

Direct Preference Optimization (DPO)

Definition: DPO involves using human preference data to guide model fine-tuning.
Comparison to Supervised Fine-Tuning:
- Supervised Fine-Tuning: Model learns to imitate responses based on a prompt-response pair.
- DPO: Model learns from human preference data by comparing two responses to the same prompt and determining which is better.

Process of Direct Preference Optimization

Data Collection:
- Prompt: Initial input or question.
- Responses: Two different responses to the prompt.
- Human Evaluation: Determining which response is better.
Model Update: Model adjusts weights to favor better responses, potentially exceeding the quality of the best human-generated responses.

Case Study: Customer Service Email Project

Project Overview:
- Data: 200 customer service emails.
- Responses: Two responses per email from different agents.
- Manager Evaluation: Manager chose the preferred response from each pair.
Model Used: Fine-tuned on Zephyr (base model).

Performance Comparison

Methods Compared:
1. GPT-4 Response Generation: Direct use of GPT-4 for generating responses.
2. Supervised Fine-Tuning: Model fine-tuned on pairs of input-output data.
3. Human Agents: Responses generated by human customer service agents.
4. DPO Model: Model fine-tuned using direct preference optimization.
Results:
- GPT-4: Produced the lowest quality responses.
- Supervised Fine-Tuning: Better than GPT-4 but worse than human agents.
- Human Agents: Better than the supervised fine-tuned model.
- DPO Model: Outperformed human agents, producing responses preferred 2 to 1 over human responses in blind comparisons.

Advantages of Direct Preference Optimization

Superhuman Performance: DPO models can generate responses superior to those of human experts.
Flexibility with Data Quality: Effective even with imperfect or messy data.

Recommendations

Adopt DPO for Fine-Tuning: Implement DPO in model fine-tuning processes to achieve superior performance.
Leverage Human Preferences: Collect and utilize human preference data to guide model improvements.
Evaluate Model Performance: Regularly compare DPO model outputs with human-generated outputs to ensure quality.
Explore Variations of DPO: Investigate slight tweaks and alternative algorithms related to DPO to further enhance model performance.

Evaluating Use Cases for Fine-Tuning

This discussion focuses on evaluating different use cases for fine-tuning large language models (LLMs). The primary aim is to determine when fine-tuning is beneficial for the target use case compared to using a general model like ChatGPT.

1. Customer Service Automation for a Fast Food Chain

Use Case: Automating responses to most customer service emails, with unusual requests routed to a human.
Evaluation:
- Fit for Fine-Tuning: Strong fit.
- Reasoning: The company likely has a substantial dataset from past customer interactions. Fine-tuning can capture the specific nuances of the company’s customer service style and common issues.
- Example: Handling specific inquiries about menu items, store locations, or promotions that are frequently encountered.

2. Classification of Research Articles for a Medical Publisher

Use Case: Classifying new research articles into a complex ontology, facilitating trend analysis for various organizations.
Evaluation:
- Fit for Fine-Tuning: Excellent fit.
- Reasoning: The ontology is complex with many subtle distinctions that are hard to convey in a prompt. The publisher likely has extensive historical data for training.
- Example: Classifying articles into one of 10,000 categories, focusing on the most common 500 categories initially for efficiency.
- Implementation Detail: Used a JSON array output for multi-class classification.

3. Short Fiction Generation for a Startup

Use Case: Creating the world’s best short fiction writer.
Evaluation:
- Fit for Fine-Tuning: Potentially good fit.
- Reasoning:
  - General models like ChatGPT can write good short stories.
  - Fine-tuning can help the model learn specific preferences in storytelling that go beyond what a general LLM can offer. The startup can gather user preferences on generated stories to continually improve the model.
- Example: Generating two different story versions on a given topic and having users rate them to inform future fine-tuning.
- Considerations: The feedback loop involving user ratings can help refine and optimize the storytelling quality.

4. Automated News Summarization for Employees

Use Case: Providing employees with summaries of new articles on specific topics daily.
Evaluation:
- Fit for Fine-Tuning: Potentially unnecessary.
- Reasoning: General LLMs like ChatGPT can already provide high-quality summaries. The benefit of fine-tuning depends on the availability of unique internal data to improve the summarization process.
- Example: Summarizing a wide range of news articles without a significant internal dataset may not justify the effort of fine-tuning.
- Alternative: Using preference-based optimization (DPO) to gather feedback on summary quality and improve the model if news summarization is a critical business function.

Important Considerations

Data Availability: Fine-tuning is more effective when there is a large, high-quality dataset available from past interactions or classifications.
Complexity and Specificity: Use cases with complex, nuanced requirements are better candidates for fine-tuning compared to general tasks.
Resource Commitment: The decision to fine-tune should consider the resources required for collecting and annotating additional data, as well as the importance of the task within the organization.

Recommendations

Assess Data Quality and Quantity: Ensure sufficient and relevant data is available for fine-tuning.
Evaluate Task Complexity: Use fine-tuning for tasks that require specific knowledge or subtle distinctions that a general model might not capture.
Consider Cost-Benefit: Weigh the benefits of improved performance against the costs of data collection and model training.
Iterate and Improve: Continuously gather feedback to refine and improve the fine-tuned model, especially for user-preference-driven tasks.

Q&A Session #2

This Q&A session addressed various questions related to model quantization, handling hallucinations in language models, and the importance of data annotation.

Quantization

Definition: Quantization is a technique used to reduce the precision of models.
Performance Impact: Over-quantization can lead to performance degradation.
Testing: It is crucial to test the quantized models to ensure performance is not adversely affected.

Hallucination in Language Models

Issue: When classifying academic or scientific articles, ensuring that the language model (LM) only outputs valid classes is critical.
Solution: Providing enough examples with specific sets of classes to train the model effectively.
Metrics: Continuous monitoring and treating misclassifications as part of the expected process.

Fine-Tuning Large Language Models

Use Case Evaluation: The skill of evaluating use cases for fine-tuning is essential for data scientists.
Example: Fine-tuning can outperform even human experts in specific, well-defined tasks, such as customer service for companies like McDonald’s.

Optimizing Prompts

Efficiency: Static elements in prompts that don’t change should be removed in favor of more dynamic elements.
Few-Shot Examples: These should be minimized or eliminated with extensive fine-tuning.
Prompt Engineering: A critical technique in making language models more efficient and effective.

Data Annotation and Evaluation

Human in the Loop: Essential for evaluating LLMs and curating data for training and fine-tuning.
Tool Building: Custom tools are often more effective than generic ones for specific domains.

About Me:

I’m Christian Mills, an Applied AI Consultant and Educator.

Whether I’m writing an in-depth tutorial or sharing detailed notes, my goal is the same: to bring clarity to complex topics and find practical, valuable insights.

If you need a strategic partner who brings this level of depth and systematic thinking to your AI project, I’m here to help. Let’s talk about de-risking your roadmap and building a real-world solution.

Start the conversation with my Quick AI Project Assessment or learn more about my approach.