Conference Talk 8: Creating, curating, and cleaning data for LLMs

Tags: notes, llms
In this talk, Daniel van Strien from 🤗 outlines key considerations and techniques for creating high-quality datasets for fine-tuning LLMs.
Author: Christian Mills

Published: July 18, 2024

This post is part of the following series:
  • Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.

Reusing Existing Datasets

  • 🤗 Datasets: https://huggingface.co/datasets
  • What kind of existing dataset?
    • Consider the use case: Datasets like FineWeb are designed for pre-training, not fine-tuning.
    • Look beyond research datasets: Community-contributed datasets can offer unique and valuable data.
  • Browsing datasets
  • Searching for datasets
    • Full-text search: Useful if dataset names are not descriptive enough.
  • Reviewing whether a dataset is suitable: Vibe checks (see the quick load-and-inspect sketch after this list)
    • Dataset viewer: Provides a preview of the data, including metadata and example rows.
    • Analyze conversation length and content: Assess if the dataset aligns with your target application.
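For a quick vibe check without leaving Python, you can search the Hub and stream a few rows. A minimal sketch using the 🤗 `huggingface_hub` and `datasets` libraries (the dataset name is a placeholder, not one from the talk):

```python
from huggingface_hub import HfApi
from datasets import load_dataset

# Search the Hub for candidate datasets by keyword.
api = HfApi()
for info in api.list_datasets(search="customer support", limit=5):
    print(info.id)

# Stream a few rows to check format and conversation length without
# downloading the full dataset.
ds = load_dataset("some-org/some-chat-dataset", split="train", streaming=True)
for row in ds.take(3):
    print(row)
```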

Creating Your Own Dataset

  • Adapt existing NLP datasets: Restructure data from classic NLP tasks for LLM fine-tuning.
  • Leverage user feedback: Analyze existing user interactions (e.g., thumbs up/down) for preference data.
  • Synthetic data: A powerful method for jumpstarting dataset creation.
  • Getting data?
    • Data format: Ensure the data format closely resembles the intended use case for the LLM.
    • Preprocessing: The effort invested in data preparation will benefit the deployment stage.
  • What kind of dataset do you need for fine-tuning?
    • Specificity over diversity: Focus on data relevant to your specific use case, even if it means the model loses some general abilities.
    • Data diversity should reflect the target application: Don’t aim for broad diversity if your application is narrow.

Common Dataset Genres

  • SFT (Supervised Fine Tuning)
    • Structure: Question and answer pairs (see the record sketches after this list).
  • RLHF (Reinforcement Learning from Human Feedback)
    • Structure: Similar to SFT, but with additional preference information.
  • DPO (Direct Preference Optimization)
    • Structure: Input, chosen response, and rejected response.
    • Flexibility: “Chosen” and “rejected” can be generated creatively (e.g., using human-written text as “chosen” and model-generated text as “rejected”).
  • KTO (Kahneman-Tversky Optimization)
    • argilla’s Collections: Preference Datasets for KTO
    • Structure: Model response and a binary preference (thumbs up/down).
    • Easy to collect: Users can readily provide simple preference feedback.
  • SPIN/ORPO
    • HuggingFaceH4’s Collections: Awesome SFT datasets
    • SPIN: Iterative approach to reduce data requirements by synthetically generating responses.
    • ORPO: Similar to DPO but doesn’t require a separate supervised fine-tuning step, making it more efficient.
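As a rough illustration of the genres above, here are the record shapes they typically map to. Field names follow common conventions (e.g., TRL-style trainers); exact schemas vary by library:

```python
# Illustrative record shapes; field names vary across libraries.

sft_example = {
    "prompt": "What is the capital of France?",
    "completion": "The capital of France is Paris.",
}

dpo_example = {
    "prompt": "Summarize this paragraph...",
    "chosen": "A concise, accurate summary.",
    "rejected": "A rambling, partially wrong summary.",
}

kto_example = {
    "prompt": "Write a haiku about autumn.",
    "completion": "Crisp leaves drift downward...",
    "label": True,  # binary 👍/👎 instead of a paired comparison
}
```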

Synthetic Data

  • Definition: Data generated by LLMs, often used for fine-tuning other LLMs.
  • Methods:
    • Generating prompts and completions from scratch.
    • Rephrasing existing prompts to improve quality and diversity.
    • AI feedback: Using LLMs to judge the quality of other LLM outputs.
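As a minimal sketch of the AI-feedback idea, one LLM can score another model's outputs. This uses the OpenAI client; the judge model name and the rubric are illustrative placeholders:

```python
from openai import OpenAI

client = OpenAI()

def judge(prompt: str, response: str) -> int:
    """Ask a judge model to rate a response from 1 (poor) to 5 (excellent)."""
    rubric = (
        "Rate the response to the prompt on a 1-5 scale for helpfulness "
        "and accuracy. Reply with a single digit.\n\n"
        f"Prompt: {prompt}\n\nResponse: {response}"
    )
    out = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": rubric}],
    )
    return int(out.choices[0].message.content.strip()[0])
```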

Basic Taxonomy of Synthetic Data Uses

  • Blog Post: Synthetic data: Anthropic’s CAI, from fine-tuning to pretraining, OpenAI’s Superalignment, tips, types, and open examples

  • Instructions

    • Generated text for SFT / IFT
    • Completions
      • Prompt ➡️ 🤖 ➡️ instruction
  • Self-Instruct Bootstrapping

    • Process
      • Prompt ➡️ 🤖 ➡️ more prompts ➡️ 🤖 ➡️ instruction ➡️ filtering 🔁 (a minimal filtering sketch follows this taxonomy)
  • Using LLM to Generate Corrected Sample

    • Process
      • principles + instruction ➡️ 🤖 ➡️ corrected instruction ➡️ repeat to filter 🔁
  • Preferences

    • Scoring / Choosing Response for RM / RLHF Training
      • instruction-1…instruction-N ➡️ 🤖 ➡️ scores or chosen/rejected ➡️ filtering 🔁
  • Critiques

    • Initial Instruction
      • initial instruction ➡️ 🤖 ➡️ initial response (used as the rejected response)
    • Using LLM Principles to Generate Pairwise Completions
      • initial instruction + principles ➡️ 🤖 ➡️ corrected instruction (chosen response) ➡️ filtering 🔁
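The Self-Instruct bootstrapping loop above depends on its filtering step: a newly generated instruction is kept only if it differs enough from the pool collected so far (the paper uses a ROUGE-L threshold of 0.7). A minimal sketch with the `rouge_score` package:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_novel(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Keep a candidate only if its ROUGE-L overlap with every pooled
    instruction stays below the threshold."""
    return all(
        scorer.score(existing, candidate)["rougeL"].fmeasure < threshold
        for existing in pool
    )

pool = ["Write a poem about the sea."]
for candidate in ["Write a poem about the ocean.", "Explain quicksort."]:
    if is_novel(candidate, pool):
        pool.append(candidate)
print(pool)  # the near-duplicate poem prompt is filtered out
```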

Challenges

  • Alpaca 7B model: Trained on synthetic data generated by prompting a language model with instructions.
  • UltraFeedback paper: Used multiple models to generate responses and GPT-4 to judge synthetic data quality against multiple criteria.
    • Paper: UltraFeedback: Boosting Language Models with Scaled AI Feedback
    • GitHub Repository: https://github.com/OpenBMB/UltraFeedback
    • Challenges: Coding errors, API failures, and data handling issues can severely impact data quality, even with advanced models.
      • Example: Errors in the UltraFeedback dataset highlighted the importance of human review and data cleaning.
  • Scaling challenges: Generating high-quality synthetic data at scale requires careful consideration of cost, vendor lock-in, and data quality.

Tools

Improving Data

Ensuring sufficient quantity

  • More ≠ better
  • Be okay with “throwing away” annotations and data
  • Data requirements (approximate, based on numbers reported in papers)
    • SFT: 10K
    • ORPO: 7K
    • DPO: 3K
    • SPIN: 2K
  • Data requirements will be higher for more diverse inputs + outputs (would you have used BERT a few years ago to do this task?)
  • Blog Post: https://argilla.io/blog/mantisnlp-rlhf-part-9/

Deduplication

  • Blog Post: FineWeb: decanting the web for the finest text data at scale
  • Challenges: Deduplication pipelines are often not reproducible or well-documented.
  • Approaches:
    • Intuitive rules and metadata filtering.
    • Topic-wise deduplication.
    • Custom metadata and feature engineering.
    • Embedding similarity and exemplar selection.
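A minimal sketch of the embedding-similarity approach, using `sentence-transformers`; the model choice and the 0.9 threshold are illustrative:

```python
from sentence_transformers import SentenceTransformer

texts = [
    "How do I reset my password?",
    "How can I reset my password?",
    "What is your refund policy?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(texts, normalize_embeddings=True)  # unit-length vectors

# Greedy near-duplicate filter: keep a text only if its cosine similarity
# to every already-kept text is below the threshold.
kept: list[int] = []
for i in range(len(texts)):
    if all(float(emb[i] @ emb[j]) < 0.9 for j in kept):
        kept.append(i)

print([texts[i] for i in kept])  # the paraphrased password question should drop out
```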

Rule based

  • Regex: Useful for simple rule-based cleaning (e.g., removing unwanted phrases or patterns).
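For instance, a rule-based pass can drop rows containing common synthetic-data artifacts; the patterns below are illustrative, not exhaustive:

```python
import re

# Illustrative patterns for boilerplate refusals and chat artifacts.
UNWANTED = re.compile(
    r"as an ai language model|i cannot assist with|^sure,? here('s| is)",
    flags=re.IGNORECASE,
)

rows = [
    "Sure, here's the summary: ...",
    "Paris is the capital of France.",
    "As an AI language model, I cannot browse the web.",
]

clean = [r for r in rows if not UNWANTED.search(r)]
print(clean)  # only the factual row survives
```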

Quality

Human Annotation

  • Finding the right balance: Choose tools and approaches that fit your needs and budget, from simple spreadsheets to fully customized solutions.
  • Custom
    • Example: Vincent D. Warmerdam’s bulk annotation tool using Bokeh and Pandas for interactive data exploration and annotation.
  • Notebooks
    • Example: ipyannotations for in-notebook annotation with customizable callbacks for post-processing and active learning.
  • Apps
    • Example: Gradio, Streamlit, and Shiny for building custom web apps with intuitive UIs for annotation tasks (a minimal Gradio sketch follows this list).
  • Lilac
  • Argilla
    • Website: https://argilla.io/
    • Features: Similar to Lilac, with a focus on dataset management and annotation workflows.
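As a minimal sketch of the custom-app route, a small Gradio interface can collect KTO-style 👍/👎 feedback; the `generate` function below is a stand-in for a real model call:

```python
import gradio as gr

feedback_log: list[dict] = []

def generate(prompt: str) -> str:
    return f"(model response to: {prompt})"  # stub model call

def record(prompt: str, response: str, thumbs_up: bool) -> str:
    feedback_log.append(
        {"prompt": prompt, "completion": response, "label": thumbs_up}
    )
    return f"Logged {len(feedback_log)} ratings"

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    response = gr.Textbox(label="Response")
    status = gr.Textbox(label="Status")
    gr.Button("Generate").click(generate, prompt, response)
    with gr.Row():
        gr.Button("👍").click(
            lambda p, r: record(p, r, True), [prompt, response], status
        )
        gr.Button("👎").click(
            lambda p, r: record(p, r, False), [prompt, response], status
        )

demo.launch()
```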

Example Datasets

  • distilabel Orca Pairs for DPO

    • Dataset: argilla/distilabel-intel-orca-dpo-pairs
    • Why it’s cool and what you can learn:
      • Filtering: less is more
      • Always selecting the “stronger” model’s response as “chosen” doesn’t always work
      • Focused, human-in-the-loop data work can make a big impact on model performance
  • Gutenberg DPO

    • Dataset: jondurbin/gutenberg-dpo-v0.1
    • Approach: Uses human-written books and LLM-generated summaries to create a DPO dataset for improving LLM writing ability.
  • PlatVR KTO Dataset

    • Dataset: ITG/PlatVR-kto
    • Approach: Collects thumbs up/down ratings on model outputs to create a KTO dataset for a vision-language task.
    • Why it’s cool and what you can learn:
      • Example of how KTO datasets can work well as a data flywheel
      • Users submit a prompt, receive a response, and 👍/👎 that response
      • Cheap to collect and can be useful data even if you don’t end up using KTO
    • Disclaimer: The dataset was created via crowdsourcing, so the preferences in the data align with the user group that participated (i.e., these are real preference data).

Case Study: LLM Summarizer

  • Goal: Build an LLM-based summarizer for dataset cards on Hugging Face.
  • Approach: Uses a distilabel pipeline to generate and judge summaries, incorporating both model and human feedback.

LLM Summarizer flow: Load dataset card in markdown format ➡️ Parse out YAML and remove unwanted content ➡️ Pass reduced text to LLM ➡️ Get summary of dataset card ➡️ Get feedback from users about summary quality.

The distilabel pipeline

The pipeline graph, roughly:

  • 🤗 ➡️ load_dataset ➡️ card_filter ➡️ format_input_card
  • format_input_card fans out to three generators: llama_summary, mistral_summary, and zephyr_summary (llama_summary is backed by llama-3-70-B Instruct)
  • combine_columns merges the three summaries ➡️ ultrafeedback (also using llama-3-70-B Instruct as judge) ➡️ remove_bad_ratings
  • remove_bad_ratings ➡️ to_argilla (human review) and push_to_hub ➡️ 🤗

  • Key steps (sketched in plain Python after this list):
    • Data loading and filtering.
    • Prompt formatting.
    • Summary generation using multiple models.
    • LLM-based judging using UltraFeedback and Llama 3 70B.
    • Human review and annotation using Argilla.
  • Iterative process: Experiment with different prompts, models, and judging criteria to optimize the pipeline.
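A plain-Python sketch of that logic (this is not the distilabel API; `generate` and `judge` below are stubs standing in for real model calls):

```python
import random

def generate(model: str, card: str) -> str:
    return f"[{model}] summary of: {card[:40]}..."  # stub model call

def judge(card: str, summary: str) -> int:
    return random.randint(1, 5)  # stub 1-5 rating (UltraFeedback-style)

def build_rows(cards: list[str], models: list[str]) -> list[dict]:
    rows = []
    for card in cards:
        summaries = {m: generate(m, card) for m in models}
        ratings = {m: judge(card, s) for m, s in summaries.items()}
        if max(ratings.values()) >= 4:  # the remove_bad_ratings step
            rows.append(
                {"card": card, "summaries": summaries, "ratings": ratings}
            )
    return rows

rows = build_rows(
    ["A dataset card in markdown..."], ["llama-3-70b", "mistral", "zephyr"]
)
print(rows)  # rows that survive go to human review (e.g., in Argilla)
```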

Resources

Q&A

  • Question: How to generate synthetic data when fine-tuning on proprietary data and human annotation is expensive?
  • Answer: The decision to use synthetic data vs. proprietary data involves trade-offs related to data ownership, privacy, and control over the model.
