Conference Talk 13: When to Fine-Tune with Paige Bailey
Tags: notes, llms
In this talk, Paige Bailey, Generative AI Developer Relations lead at Google, discusses Google’s AI landscape with a focus on Gemini models and their applications.
This post is part of the following series:
- Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.
- Google AI Landscape and Gemini
- Understanding Context Windows
- Fine-tuning vs. Prompting vs. Retrieval
- Prompting Strategies and Examples
- Retrieval Augmented Generation
- Fine-tuning Considerations and Gemma
Google AI Landscape and Gemini
Vertex AI:
- Vertex AI: A collection of APIs, compute infrastructure, and model deployment tools available through Google Cloud, geared toward enterprise use. Comparable to Azure OpenAI Service.
- Gemini Developer API (through AI Studio): Easier path for rapid prototyping and personal projects. Comparable to OpenAI APIs.
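For rapid prototyping against the Developer API, Google ships an official Python client (`google-generativeai`), but the underlying REST call is simple enough to sketch with the standard library. The snippet below is a minimal sketch, assuming the `v1beta` `generateContent` endpoint and its single-turn request shape; only the payload builder runs without a key.

```python
import json
import urllib.request

API_URL = ("https://generativelanguage.googleapis.com/v1beta/"
           "models/{model}:generateContent?key={key}")

def build_request(prompt: str) -> dict:
    # Minimal generateContent payload: one user turn with one text part.
    return {"contents": [{"parts": [{"text": prompt}]}]}

def generate(model: str, api_key: str, prompt: str) -> str:
    # POST the payload; requires an API key from AI Studio.
    req = urllib.request.Request(
        API_URL.format(model=model, key=api_key),
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["candidates"][0]["content"]["parts"][0]["text"]
```

In practice the official client wraps this same call and adds streaming, safety settings, and file uploads.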
Gemini Flash Fine-Tuning:
- Gemini 1.5 Flash: Google’s fastest, most efficient, and most cost-effective model, with a 1 million token context window (and growing).
- Supports fine-tuning and is part of an early tester program.
Gemini Nano & Gemma:
- Gemini Nano: Brief mention of its planned integration into Chrome and Pixel/Android devices (details deferred).
- Gemma:
- Open-source versions of Gemini, available on Hugging Face, Kaggle, and Ollama, making local experimentation easy.
- Kaggle hosts checkpoints, code samples, and runnable notebooks.
Generative AI and Google
- Google’s history in machine learning: TensorFlow, transformer models (BERT, AlphaFold, AlphaStar, AlphaGo, T5), and now Gemini.
- Generative AI at Google extends beyond text and code; the talk highlights:
- Gemini: Google’s flagship model (currently on version 1.5)
Gemini Model Features
- Multimodal Understanding: Processes images, audio, text, code, video, and more simultaneously.
- State-of-the-art Performance: Excels across various tasks, though the claims rest on academic benchmarks (limitations discussed later).
- Embedded Reasoning: Strong capabilities in chain-of-thought and step-by-step reasoning.
- Scalable Deployment: Optimized for both large-scale (Google products) and small-scale (edge devices) use cases.
- Efficiency and Privacy: Focus on cost-effective token analysis, reduced inference compute, and on-device processing for privacy preservation.
- Model Options:
- Gemini 1.5 Pro: High-performance, efficient model.
- Gemini Nano: Ultra-small model for edge deployments.
- Gemma: Open-sourced models (2B and 7B parameters)
- Key considerations for integration: user experience, performance, and cost trade-offs.
- Available Options:
- Gemini 1.5 Flash: Fast, 1 million token context window.
- Gemini 1.5 Pro: 2 million token context window
- Gemini Flash for Code:
- Performs well for code generation and structured outputs like JSON out-of-the-box.
- Fine-tuning and using code examples in the context window further enhance results.
- Applicable to code generation, translation, debugging, code review, etc.
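Since the talk stresses that Gemini Flash handles structured outputs like JSON well out-of-the-box, a practical pattern is to ask for JSON explicitly and then parse the reply defensively. This is a hedged sketch of that pattern, not an official API feature; the prompt wording and `{"language", "issues"}` schema are illustrative assumptions, and the fake reply stands in for a real model response.

```python
import json
import re

# Illustrative schema: ask the model for machine-readable review output.
PROMPT_TEMPLATE = (
    "Review the following function and respond ONLY with a JSON object "
    'of the form {{"language": str, "issues": [str]}}.\n\n{code}'
)

def build_review_prompt(code: str) -> str:
    return PROMPT_TEMPLATE.format(code=code)

def parse_json_reply(reply: str) -> dict:
    # Models often wrap JSON in ``` fences; strip them before parsing.
    cleaned = re.sub(r"^```(?:json)?|```$", "", reply.strip(), flags=re.M).strip()
    return json.loads(cleaned)

# Stand-in for a model reply (a real call would return something similar).
fake_reply = '```json\n{"language": "python", "issues": ["no docstring"]}\n```'
result = parse_json_reply(fake_reply)
```

Parsing failures here are a useful signal: they can trigger a retry with a stricter instruction.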
Understanding Context Windows
- Importance of Context Window Size:
- Historically limited to 2,000-8,000 tokens, hindering model capability.
- Current models: GPT-4 Turbo (128,000), Claude (200,000), Gemini (2 million).
- Impact of Larger Context Windows:
- Can handle massive amounts of data (emails, texts, videos, codebases, research papers).
- Reduces the need for fine-tuning, as more information can be provided at inference time.
- Allows for more complex and nuanced outputs.
Fine-tuning vs. Prompting vs. Retrieval
Common Questions & Trade-offs
- Key decision points when working with large language models.
- Considerations:
- Prompt Design: Simple, cost-effective, but may require detailed prompts.
- Fine-Tuning:
- Increasingly difficult to justify due to maintenance overhead and rapid release of new open-source models.
- Recommended only when other options fail or for on-premise/local data requirements.
- Recommendations:
- Start with Closed-Source APIs: Rapid iteration, prove product-market fit, focus on UX.
- Hire ML Team When Necessary: If highly specialized fine-tuning becomes essential.
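The decision flow above can be sketched as a small function. This is my own encoding of the talk's heuristic (prompt first, retrieve when you need data the model lacks, fine-tune only as a last resort or for on-prem requirements), not anything prescribed by the Gemini docs.

```python
def choose_approach(fits_in_prompt: bool,
                    needs_private_data: bool,
                    prompting_exhausted: bool) -> str:
    """Prompt first, retrieve next, fine-tune last."""
    if prompting_exhausted:
        # Only after prompt design and retrieval have failed.
        return "fine-tuning"
    if needs_private_data or not fits_in_prompt:
        return "retrieval augmented generation"
    return "prompt design"

# A task whose context fits in the window and uses no private data:
first_choice = choose_approach(True, False, False)
```

The boolean inputs are deliberately coarse; in practice each is itself a judgment call about cost, latency, and data freshness.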
Model Evaluation & Its Importance
- Limitations of Academic Benchmarks:
- Example: HumanEval
- Often misinterpreted as involving human evaluation (it doesn’t).
- Tests a narrow scope of Python function completion with simplistic tasks.
- Not representative of real-world software engineering or other programming languages.
- HumanEval X: Created to address some limitations of HumanEval, but still has limitations.
- Key Takeaways:
- Carefully consider the relevance and limitations of evaluation metrics.
- Prioritize custom evaluations tailored to your specific use case and business needs.
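A custom code-generation eval need not be elaborate. The sketch below shows the core of a HumanEval-style harness: execute each candidate completion, run assert-based tests against it, and report a pass rate. (Real harnesses sandbox the `exec` call; this simplified version does not.)

```python
def passes(candidate_code: str, tests: list[str]) -> bool:
    """Run candidate code, then each assert-style test, in a shared namespace."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True

def pass_rate(candidates: list[str], tests: list[str]) -> float:
    results = [passes(c, tests) for c in candidates]
    return sum(results) / len(results)

# Two model completions for the same task; one is wrong.
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
rate = pass_rate([good, bad], ["assert add(2, 3) == 5"])
```

The point of the talk stands: the tests you write here should reflect your actual use case (your languages, your codebase's idioms), not generic Python puzzles.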
Prompting Strategies and Examples
Power of Prompting & Video Understanding
- Detailed Example:
- Using Gemini in AI Studio to analyze a 44-minute video.
- Asking the model to find a specific event (paper removed from a pocket), identify information on the paper, and provide the timestamp.
- Demonstrates the ability to understand and extract information from lengthy video content, potentially revolutionizing video analysis workflows.
- Implications:
- Transforms how we interact with video content, making it searchable and analyzable at scale.
- Also applicable to large text documents (PDFs with images, graphs, code) for summarization, analysis, and research.
- Prefix Caching:
- Optimizes API calls for repeated analysis of the same codebase or repository.
- Improves latency and grounds responses within a consistent context.
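The idea behind prefix caching can be modeled in a few lines: the expensive work (processing a huge shared context such as a whole repository) happens once per unique prefix, and subsequent questions against the same prefix are cache hits. This toy class is an illustration of the concept only; the Gemini API exposes the real mechanism as explicit context caching.

```python
import hashlib

class PrefixCache:
    """Toy model of prefix caching: process a long shared context once,
    then reuse it across many follow-up questions."""

    def __init__(self) -> None:
        self._processed: dict[str, str] = {}

    def _key(self, prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def ask(self, prefix: str, question: str) -> tuple[str, bool]:
        key = self._key(prefix)
        cache_hit = key in self._processed
        if not cache_hit:
            # In a real API this is the expensive step (tokenizing and
            # attending over the whole codebase); here we just record it.
            self._processed[key] = prefix
        return f"{prefix}\n\n{question}", cache_hit

cache = PrefixCache()
_, hit1 = cache.ask("<entire repo>", "Where is auth handled?")
_, hit2 = cache.ask("<entire repo>", "List the public endpoints.")
```

The latency and cost win comes entirely from the second call onward, which is why it suits repeated analysis of the same codebase.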
AI Studio Overview & Examples
- Key Features:
- Adjust stop sequences, top-k configurations, and temperature.
- Toggle between Gemini models (Pro, Flash, etc.).
- Access prompt gallery, cookbook, and getting started resources.
- View past prompts and outputs.
- Examples:
- Scraping GitHub issues and Stack Overflow questions for analysis.
- Converting COBOL code to Java with specific instructions and architecture preferences.
- Key Takeaway: With detailed instructions, models can achieve impressive results, much like a skilled contractor team.
Retrieval Augmented Generation
Retrieval in Google Products
- gemini.google.com (formerly Bard):
- Example: Querying for information about the San Francisco Ferry Building and requesting recommendations.
- Results are grounded in Google Search, with an option to view source citations and confidence levels.
- Personalized Retrieval: The concept can be extended to internal corporate data and codebases.
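Extending grounding to your own data follows the same shape as the Bard/Gemini example: retrieve relevant documents, then instruct the model to answer only from them, with citations. Below is a deliberately minimal sketch using naive word-overlap retrieval (a real system would use embeddings and a vector store); the corpus and document ids are made up for illustration.

```python
def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc_id: len(q & set(corpus[doc_id].lower().split())),
        reverse=True,
    )
    return scored[:k]

def grounded_prompt(query: str, corpus: dict[str, str]) -> str:
    doc_ids = retrieve(query, corpus)
    sources = "\n".join(f"[{d}] {corpus[d]}" for d in doc_ids)
    return ("Answer using ONLY the sources below and cite them by id.\n"
            f"{sources}\n\nQuestion: {query}")

corpus = {
    "doc1": "The Ferry Building hosts a farmers market on Saturdays.",
    "doc2": "Gemma models come in 2B and 7B parameter sizes.",
}
prompt = grounded_prompt("When is the Ferry Building farmers market?", corpus)
```

The "cite them by id" instruction is what makes the answer auditable, mirroring the source citations shown in gemini.google.com.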
Fine-tuning Considerations and Gemma
- Fine-Tuning:
- Should be approached with caution and a clear understanding of the maintenance commitment.
- Consider the rapid evolution of open-source models.
- Gemma Family:
- Solid starting point for open-source fine-tuning.
- Available in 2B and 7B parameter sizes, with both instruction-tuned and non-instruction-tuned variants.
- CodeGemma: For code-related tasks.
- RecurrentGemma: For sequential data.
- PaliGemma: Open-vision language model.
- Resources:
- Deployment: Easy one-click deployment to Google Cloud.
- Model Builders: Provides automatic comparisons and prompt management.
About Me:
I’m Christian Mills, a deep learning consultant specializing in practical AI implementations. I help clients leverage cutting-edge AI technologies to solve real-world problems.
Interested in working together? Fill out my Quick AI Project Assessment form or learn more about me.