Conference Talk 19: Fine Tuning LLMs for Function Calling

In this talk, Pawel Garbacki from Fireworks.ai covers the process and best practices of fine-tuning an LLM for function/tool use.
Author: Christian Mills

Published: August 30, 2024

This post is part of the following series:
  • Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.

Introduction

Understanding Function/Tool Calling

  • Definition: Giving LLMs the ability to interact with the external world.
  • Use Cases:
    • Accessing real-time or otherwise unavailable information: e.g., retrieving current stock prices.
    • Orchestrating multi-agent systems: LLMs can access and utilize multiple tools to assist users.

Key Decisions in Fine-Tuning for Function Calling

1. Objective Selection:

  • Impact: The objective significantly impacts data preparation, training data volume, fine-tuning complexity, and model usage.
  • Recommendation: Choose the simplest objective that meets the use case requirements.
  • Common Objectives:
    • Single-Turn Forced Call (Routing Use Case):
      • User provides a single instruction.
      • Model maps the instruction to one of several pre-defined functions and fills in its parameters.
      • Forced Call: The model is constrained to respond with a function call, not natural language.
      • Example: The user requests the current stock price of Nvidia, and the model identifies the appropriate function and its parameters.
        • User: What is the stock price of Nvidia?
          
          Assistant: {
            "name": "get_stock_price",
            "arguments": {"ticker": "NVDA"}
          }
    • Parallel Function Calling:
      • Similar to single-turn forced call, but the model can call multiple independent functions in parallel.
      • Example: Retrieving stock prices for multiple companies simultaneously.
        • User: What is the stock price of Nvidia and Apple?
          
          Assistant: [
            {
              "name": "get_stock_price",
              "arguments": {"ticker": "NVDA"}
            },
            {
              "name": "get_stock_price",
              "arguments": {"ticker": "AAPL"}
            }
          ]
    • Nested Function Calls:
      • Model calls functions sequentially, with the output of one function feeding into the next.
      • Example: Retrieving stock prices and then using those prices to generate a plot.
        • User: Plot the stock price of Nvidia and Apple over the last two weeks
          
          Assistant: [
            {
              "name": "get_stock_price",
              "arguments": {"ticker": "NVDA", "start_time": "2 weeks ago"}
            },
            {
              "name": "get_stock_price",
              "arguments": {"ticker": "AAPL", "start_time": "2 weeks ago"}
            }
          ]
          
          Tool: {"NVDA": [120, 121, …],"AAPL": [192, 190, …]}
          
          Assistant: {
            "name": "plot",
            "arguments": {"NVDA": [120, 121, …],"AAPL": [192, 190, …]}
          }
      • Implementation:
        • User Role: Client interacting with the model.
        • Assistant Role: The LLM generating function calls.
        • Tool Role: Client-side component that executes function calls and returns results to the model (see the client-loop sketch after this list).
    • Multi-Turn Chat with Optional Function Calling:
      • Most complex objective, combining natural language conversation with optional function calls.
      • Example: User asks for news, model fetches trending news, summarizes them, and engages in further conversation.
        • User: What's in the news today?
          
          Assistant: {"name": "trending_news"}
          
          Tool: {
            "headlines": [
              "Nvidia market cap surpasses Apple", …
            ]
          }
          
          Assistant: Nvidia is now more valuable than Apple
          
          User: What is Nvidia stock price?
          
          Assistant: {
            "name": "get_stock_price",
            "arguments": {"ticker": "NVDA"}
          }
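
The user, assistant, and tool roles above map naturally onto a client-side loop: send the conversation, execute whatever function call the model emits, and append the result under the tool role before asking the model to continue. Below is a minimal Python sketch of that loop; the `chat(messages, tools)` client function, the reply shape, and the local `get_stock_price` implementation are illustrative assumptions, not part of the talk.

    import json

    # Hypothetical local implementation of a function exposed to the model.
    def get_stock_price(ticker, start_time=None):
        return {"ticker": ticker, "price": 120.0}

    FUNCTIONS = {"get_stock_price": get_stock_price}

    def run_conversation(chat, messages, tools):
        """Drive one exchange: execute each function call the model emits and feed
        the result back under the tool role until it answers in natural language."""
        while True:
            reply = chat(messages, tools)            # assistant turn; `chat` is a placeholder client
            messages.append(reply)
            if reply.get("function_call") is None:   # natural-language answer: done
                return reply["content"]
            call = reply["function_call"]
            result = FUNCTIONS[call["name"]](**call.get("arguments", {}))
            # Return the result to the model under the tool role.
            messages.append({"role": "tool", "content": json.dumps(result)})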

2. Function Call Token:

  • Purpose:
    • Indicate to the client when the model is switching to function call mode.
    • Enable efficient parsing of model responses, especially in mixed natural language and function call outputs.
  • Implementation: Introduce a special token to prefix function calls.
    •   Assistant: <function_call_token>{
          "name": "get_stock_price",
          "arguments": {"ticker": "NVDA"}
        }
  • Benefits:
    • Easier parsing of model responses.
    • Better handling of streamed responses: the client can detect when a function call starts and buffer tokens until the complete call has arrived before processing it.
    • Facilitates constraint generation, ensuring the model adheres to predefined function schemas.
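
Concretely, the client can split the assistant output on the special token and parse whatever follows it as a function call. A minimal sketch, assuming an illustrative `<function_call_token>` string (the real token is model-specific):

    import json

    FUNCTION_CALL_TOKEN = "<function_call_token>"  # illustrative; model-specific in practice

    def parse_assistant_message(text):
        """Split a mixed response into natural-language text and an optional function call."""
        if FUNCTION_CALL_TOKEN not in text:
            return text, None                       # plain natural-language reply
        prose, raw_call = text.split(FUNCTION_CALL_TOKEN, 1)
        return prose.strip(), json.loads(raw_call)  # function call parsed into a dict

    text, call = parse_assistant_message(
        '<function_call_token>{"name": "get_stock_price", "arguments": {"ticker": "NVDA"}}'
    )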

3. Syntax for Function Calling:

  • Options:
    • Python Syntax: Generate function calls using Python function call signature syntax.
    • JSON Schema: Generate JSON structures describing the function name and parameters.
  • Trade-offs:
    • Python Syntax:
      • Advantages: Easier for LLMs to generate due to extensive training on Python code.
      • Disadvantages: Less natural for representing complex, nested parameter structures within a single-line invocation.
    • JSON Schema:
      • Advantages: Better suited for complex, nested parameter types; easier to enforce schema with constraint generation; compatible with OpenAI APIs.
      • Disadvantages: Potentially more challenging for LLMs to generate compared to Python syntax.
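
To make the trade-off concrete, here is the same invocation in both styles; the exact serialization is a design choice of the fine-tune rather than a fixed standard:

    # Python-call syntax: compact and familiar from code pre-training, but awkward
    # once arguments become deeply nested structures on a single line:
    #     get_stock_price(ticker="NVDA", start_time="2 weeks ago")

    # JSON-style structure: more verbose, but nested parameters stay natural and the
    # output can be validated against a schema during constrained generation.
    call = {
        "name": "get_stock_price",
        "arguments": {"ticker": "NVDA", "start_time": "2 weeks ago"},
    }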

4. Preserving Existing Model Capabilities:

  • Challenge: Fine-tuning for function calling can inadvertently degrade pre-existing instruction following and general language capabilities.
  • Recommendations:
    • Fine-tune on Instruction-Tuned Models: Use the “Instruct” version of the base model instead of the “Base” version when mixing general chat with function calling.
      • Using the “Base” version is fine for forced function calling.
    • Reduce Training Data: Minimize the amount of training data to reduce the risk of overwriting existing capabilities.
    • High-Quality Data: Use a smaller volume of carefully curated, high-quality training data.

5. Full-Weight Tuning vs. LoRA Tuning:

  • Recommendation: LoRA tuning is generally sufficient and preferable for function calling, particularly in low-data regimes.
  • Advantages of LoRA:
    • Fewer parameters to converge, leading to faster training and better results with limited data.
    • Faster iteration cycles, enabling more experimentation.
    • Lower hosting and experimentation costs, especially with efficient LoRA serving solutions like Fireworks AI’s platform.
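
For reference, a minimal LoRA setup with Hugging Face PEFT might look like the sketch below; the base model name, rank, and target modules are illustrative choices, not recommendations from the talk.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base model; any instruct-tuned causal LM works the same way.
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

    lora_config = LoraConfig(
        r=16,                      # low-rank dimension: few trainable parameters
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()  # typically well under 1% of the base model's weights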

6. Constraint Generation:

  • Purpose: Reduce hallucinations in model-generated function calls by leveraging the known schema of available functions.
  • Implementation:
    • Provide the model with the schema of the functions (e.g., function name, parameter names and types).
    • Use a constraint generation mechanism (like a context-free grammar) to guide the model’s output and enforce adherence to the schema.
  • Benefits:
    • Reduced Hallucinations: Significantly minimizes or even eliminates hallucinations in function call outputs.
    • Faster Generation: Enables short-circuiting generation by autocompleting predictable tokens based on the grammar, improving inference speed.
  • Fireworks AI: Offers constraint generation support for function calling, requiring users to provide the function schemas.
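
In practice, the client supplies a machine-readable schema for each function and the serving stack compiles it into a grammar that masks invalid tokens during decoding. A sketch of such a schema in the OpenAI-compatible tools format (descriptions and field values are illustrative):

    # OpenAI-compatible tool schema for the stock-price example (illustrative values).
    # A constrained decoder can turn the JSON Schema under "parameters" into a grammar,
    # only sampling tokens that keep the output valid and autocompleting fixed spans
    # such as '{"name": "get_stock_price", "arguments":' without calling the model.
    get_stock_price_tool = {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Return the latest price for a stock ticker.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker": {"type": "string", "description": "Ticker symbol, e.g. NVDA"},
                },
                "required": ["ticker"],
            },
        },
    }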

General Recommendations and Considerations

  • Work Smart: Utilize existing open-source function-calling models whenever possible, as they are often sufficient for many use cases.
  • Fine-Tuning Effort: Be prepared for an iterative and potentially time-consuming process when fine-tuning for complex function-calling objectives.

Fireworks AI’s Fire Function Models

  • Playground: Firefunction V2

  • Blog Post: Firefunction-v2: Function calling capability on par with GPT4o at 2.5x the speed and 10% of the cost

  • Fire Function V2:

    • Based on Llama 3 70B (Instruct variant).
    • Outperforms GPT-4 on the Gorilla benchmark.
    • Designed to approximate GPT-4’s conversational capabilities mixed with function calling.
    • Addresses limitations of existing datasets by leveraging:
      • Naturally occurring function-calling conversations.
      • Open-source multi-agent system data (e.g., AutoGPT).
      • Synthetic datasets with complex instructions and system prompts.
  • Benchmark Comparison:

    Benchmark                            Firefunction v2   GPT-4o
    Gorilla simple                       0.94              0.88
    Gorilla multiple_function            0.91              0.91
    Gorilla parallel_function            0.89              0.89
    Gorilla parallel_multiple_function   0.79              0.72
    Nexus parallel                       0.53              0.47
    MT-Bench                             0.84              0.93

Challenges in Fine-tuning for Function Calling

  • Data Scarcity: Unlike general language modeling, readily available datasets for function calling are limited.
    • Existing datasets often focus on specific use cases (e.g., GPT-4 conversations or a limited number of functions).
    • Solution: Invest in building custom datasets.
  • Data Set Design:
    • Define Data Categories: Consider types of function calls (parallel, nested), number of turns, and number of functions supported.
      • Parallel function calling: Multiple functions are called simultaneously.
      • Nested function calling: The output of one function call is fed into a subsequent call.
      • Turn-based conversations: Single-turn or multi-turn interactions (e.g., exceeding 10 turns).
      • Number of functions supported: Fine-tuning for a small set of functions (e.g., 5) is different from tuning for a larger set (e.g., 50).
    • Objective Alignment: Ensure the dataset represents the model’s intended use cases and boundary conditions.
    • Leverage Existing Resources: Explore open-source datasets (e.g., Glaive) and multi-agent systems (e.g., Autogen) for inspiration and data.
      • Autogen, a multi-agent system, can be a good data source, especially for scenarios with multiple agents and complex prompts.
  • Complex System Prompts: Real-world applications often require intricate instructions for function selection, which are difficult to find in existing datasets.
    • Solution: Invest in generating synthetic datasets with complex instructions.
  • Security Concerns: Allowing arbitrary function calls raises security risks, especially with functions that modify data.
    • Mitigation:
      • Focus on read-only functions.
      • Include precise instructions in system prompts.
    • Ongoing Research: This area requires further exploration as function calling and multi-agent systems become more prevalent.

Prompt Templates for Fine-tuning

  • System prompts provide context and instructions to the model.
  • General Guidelines:
    • Preserve Instruct Model Capabilities: When fine-tuning on top of existing instruct models, maintain the prompt format to retain existing capabilities.
    • Clear Role Prefixes: Use distinct prefixes for different roles (e.g., system, user, assistant, tool) in multi-turn conversations.
  • Message Format:
    • Parsability: Ensure the format allows easy parsing of function calls by the client.
    • Mixed Output Handling: Use special tokens to delineate between natural language and function call sections in assistant responses.
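
As an illustration, a single training example might be laid out as in the sketch below; the role markers and special token are placeholders, and in practice you would reuse whatever chat template the chosen instruct model already defines.

    # Illustrative prompt layout for one training example (placeholder role markers
    # and token; keep the base instruct model's own chat template in practice).
    example = (
        "SYSTEM: You can call these functions: {function_schemas}\n"
        "USER: What is the stock price of Nvidia?\n"
        "ASSISTANT: <function_call_token>"
        '{"name": "get_stock_price", "arguments": {"ticker": "NVDA"}}\n'
        'TOOL: {"NVDA": 120}\n'
        "ASSISTANT: Nvidia is currently trading at about $120."
    )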

Successful Fine-tuning Examples

  • GPT-4 Limitations: Fine-tuning can overcome limitations in existing models, such as character limits in function descriptions.
  • Complex Instructions: Fine-tuning is particularly effective for scenarios with complex instructions on when to call specific functions, even with relatively simple functions.

Function Calling Data Sets and Evaluation

  • Datasets:
    • Glaive: High-quality but limited coverage of use cases.
    • Gorilla: Simple functions, Python syntax focus.
    • Nexus Raven: More complex parameters, Python focus.
  • Evaluation:
    • Gorilla Leaderboard: Useful for initial assessment, but consider its limitations (e.g., focus on Python syntax).
    • Nexus Benchmarks: More challenging than Gorilla.
    • MT-Bench: Evaluates general instruction following without function calling.
    • Evaluation Challenges:
      • Real-World Use Case Mismatch: Benchmarks may not fully capture the complexities of real-world scenarios, such as those requiring precise system prompting and multi-turn conversations.
  • Recommendations:
    • Benchmark Selection: Start with publicly available benchmarks to get a general idea of the model’s capabilities.
    • Real-World Testing: It’s essential to test and evaluate models on the specific use cases they are intended for.
    • Model Selection: Don’t rely solely on benchmark scores; try out the top-performing models on your own data and use case to determine the best fit.

Base Models for Fine-tuning

  • Llama 3 & Llama 3.1: Strong general-purpose models.
    • FireFunction V1: Based on Mistral.
    • FireFunction V2: Based on Llama, showed significant improvement.
  • Coding Models: Consider coding-focused models (e.g., Code Llama Python, built on Llama 2) for single-turn, Python-based function calling.
  • Qwen Models: Show promise but require further exploration.
  • Phi (Microsoft): Smaller models that perform well for their size and can potentially run without a GPU.
  • Model Selection: Consider the specific objective (e.g., forced function calling, Python syntax) when choosing a base model.

Memory Retention in Long Chains of Calls

  • Longer Context Models: Opt for models with larger context windows (e.g., beyond Llama 3’s 8K context) for extended conversations.
  • Dataset Representation: Include sufficient long conversation examples in the training data.
  • Intelligent Conversation Pruning:
    • Develop algorithms to selectively retain the most relevant messages from previous turns when the conversation exceeds the context window.
    • Explore semantic matching techniques to identify relevant past messages.
  • Whiteboard Approach:
    • Have the model summarize the key aspects of the conversation at the end of each turn.
    • Pass only the summary to the model in subsequent turns, effectively resetting the context while retaining essential information.
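
A minimal sketch of the whiteboard idea, assuming a `chat(messages)` helper that returns the assistant's text (a placeholder, not a specific API):

    def whiteboard_turn(chat, summary, user_message):
        """One turn of the 'whiteboard' approach: answer from a running summary,
        then refresh the summary so the next turn starts from a short context."""
        reply = chat([
            {"role": "system", "content": f"Conversation so far: {summary}"},
            {"role": "user", "content": user_message},
        ])
        # Ask the model to rewrite the whiteboard, keeping only what matters.
        new_summary = chat([
            {"role": "system", "content": "Summarize the key facts and open tasks."},
            {"role": "user", "content": f"{summary}\nUser: {user_message}\nAssistant: {reply}"},
        ])
        return reply, new_summary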

Multi-Agent Systems and Function Calling

  • Function as Agent: A function can be considered an agent within a multi-agent system, interacting with other agents (potentially other functions or models) to complete tasks.

  • Orchestration: Multi-agent frameworks like Autogen provide tools for defining agents, extracting function schemas, routing messages, and executing function calls based on model responses.

  • Agent Team Creation:

    • Identify Strengths and Weaknesses: Analyze individual models to understand their capabilities and limitations.
    • Define Agent Roles: Assign roles and responsibilities to each agent based on their strengths.
    • Routing Layer: Design a system for efficiently routing messages and tasks to the appropriate agents.
    • Context Management: Implement mechanisms for sharing and summarizing context between agents, especially in long conversations.
  • Merging Models: Explore techniques like MergeKit to combine layers from multiple models, potentially creating more capable composite models.

  • Cost and Latency Optimization: Consider using smaller, specialized models for specific tasks to reduce cost and latency.

Comparison with Gorilla Project

  • Gorilla:
    • Focuses on single-turn, forced function calling with Python signature generation.
    • Supports various function calling scenarios (single, parallel, nested).
    • Primarily designed for functions with simple parameters.
  • FireFunction:
    • Addresses real-world use cases involving complex system prompts and mixed conversations with function calling.
    • Handles functions with more complex parameters and instructions.
  • Benchmarks: Gorilla leaderboard lacks tasks for complex system prompts and mixed conversation scenarios.

Smallest Model for Local Smart Home Assistant

  • Challenges: Running a model locally with hundreds of functions on a resource-constrained device.
  • Potential Solutions:
    • Pre-populate KV Cache: Pre-load function definitions into the model’s KV cache to reduce inference time.
    • Function Retrieval with RAG: Use retrieval augmented generation (RAG) to dynamically select relevant functions based on user input, reducing the number of functions in the prompt.
    • Smaller Models: Explore smaller models like Phi or Qwen2 (2 billion parameters) that can potentially run without a GPU.
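
A sketch of the retrieval idea: score every function description against the user request and keep only the closest few schemas in the prompt. The `embed` function is a stand-in for whatever local embedding model is available; descriptions would normally be embedded once and cached rather than per request as done here for brevity.

    import numpy as np

    def top_k_functions(embed, user_query, function_descriptions, k=5):
        """Pick the k functions whose descriptions are most similar to the query,
        so the prompt carries a handful of schemas instead of hundreds."""
        query_vec = embed(user_query)
        scored = []
        for name, description in function_descriptions.items():
            vec = embed(description)  # cache these in a real deployment
            score = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
            scored.append((score, name))
        return [name for _, name in sorted(scored, reverse=True)[:k]]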

Function Calling with GraphQL

  • GraphQL as Structured Data: GraphQL can be treated as a structured data format similar to function call schemas.
  • Leveraging Function Calling Models: Explore using existing function calling models to generate or complete GraphQL queries by defining GraphQL operations as functions.
  • Grammar Mode: Leverage the grammar enforcement capabilities of function calling models to ensure syntactically correct GraphQL queries.
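
One way to apply this, sketched below: expose each GraphQL operation as a function whose arguments the model fills in, and render the actual query client-side so it is always syntactically valid (the operation and fields are made up for illustration).

    # The model only chooses arguments for a "function"; the client renders a valid
    # GraphQL query from them. Operation and field names are hypothetical.
    def render_stock_query(ticker: str, days: int) -> str:
        return (
            "query {\n"
            f'  stock(ticker: "{ticker}") {{\n'
            f"    prices(lastDays: {days}) {{ date close }}\n"
            "  }\n"
            "}"
        )

    print(render_stock_query("NVDA", 14))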

Handling API Changes

  • Canonical Data Format: Store data in a format that can be easily translated to different API syntaxes.
  • Client-Side Translation: Implement a wrapper around the API to handle syntax conversions, allowing the model to remain agnostic to specific API changes.
  • Prompt-Based Function Definitions: Consider defining functions within the prompt itself. This approach allows for easier updates when APIs change, eliminating the need for retraining.
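
A sketch of the client-side translation idea: the model keeps emitting one canonical call format, and a thin adapter maps it onto whichever API version is live. The v1/v2 request shapes below are hypothetical.

    def to_api_request(call, api_version="v2"):
        """Translate a canonical model-emitted call into the current API's shape.
        The concrete per-version field names are hypothetical."""
        name, args = call["name"], call.get("arguments", {})
        if api_version == "v1":
            return {"endpoint": f"/{name}", "params": args}
        # v2 expects a different envelope; the model never sees the change.
        return {"operation": name, "body": {"input": args}}

    to_api_request({"name": "get_stock_price", "arguments": {"ticker": "NVDA"}})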

Synthetic Data Generation Best Practices

  • High-Quality Prompt and Seed Data: Start with well-crafted prompts and a small, high-quality seed dataset.
  • Good Generation Model: Utilize a capable language model for generation, balancing the legal constraints of using closed-source models with the effort required for filtering outputs from open-source models.
  • Data Variety over Quantity: Prioritize diverse use cases and scenarios over a large number of examples for a single case.
  • Few-Shot Examples in Prompts: Include examples of desired outputs in the prompts to guide the generation process.
  • Temperature Variation: Experiment with different temperature settings to encourage creativity and diversity in generated samples.
  • Post-Filtering: Implement filtering mechanisms to remove low-quality or incorrect samples.
  • DPO Alignment (Optional): Use DPO to refine the model’s behavior, especially for complex system prompts, by providing examples of both desired and undesired outputs.
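
Putting those steps together, a generation loop might look like the sketch below, assuming a generic `generate(prompt, temperature)` call, a project-specific `is_valid` filter, and a prompt template with an `{examples}` slot (all placeholders):

    import random

    def synthesize_examples(generate, is_valid, seed_examples, prompt_template, n=100):
        """Generate candidate function-calling examples from a few-shot prompt,
        varying temperature for diversity and post-filtering bad samples."""
        dataset = []
        while len(dataset) < n:
            few_shot = random.sample(seed_examples, k=min(3, len(seed_examples)))
            prompt = prompt_template.format(examples="\n\n".join(few_shot))
            candidate = generate(prompt, temperature=random.uniform(0.5, 1.0))
            if is_valid(candidate):  # e.g. JSON parses and matches a known function schema
                dataset.append(candidate)
        return dataset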

Importance of Data vs. Hyperparameters vs. Base Model

  • Data Quality: As base models become more capable and fine-tuning datasets shrink, the quality of each training example becomes increasingly crucial.
  • Hyperparameter Sensitivity: Smaller datasets often lead to increased sensitivity to hyperparameters, requiring careful tuning.
  • Base Model: The choice of base model significantly impacts performance, especially for specialized tasks like Python code generation.
