Conference Talk 8: Creating, curating, and cleaning data for LLMs
In this talk, Daniel van Strien from 🤗 outlines key considerations and techniques for creating high-quality datasets for fine-tuning LLMs.
This post is part of the following series:
- Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.
- Reusing Existing Datasets
- Creating Your Own Dataset
- Common Dataset Genres
- Synthetic Data
- Improving Data
- Human Annotation
- Example Datasets
- Case Study: LLM Summarizer
- Resources
- Q&A
Reusing Existing Datasets
- 🤗 Datasets: https://huggingface.co/datasets
- What kind of existing dataset?
- Consider the use case: Datasets like fineweb are designed for pre-training, not fine-tuning.
- 🤗 Dataset: HuggingFaceFW/fineweb
- Look beyond research datasets: Community-contributed datasets can offer unique and valuable data.
- Browsing datasets
- Use tags: Filter datasets based on specific formats or tasks (e.g., DPO datasets).
- Searching for datasets
- Full-text search: Useful if dataset names are not descriptive enough.
- Reviewing if a dataset is suitable: Vibe checks
- Dataset viewer: Provides a preview of the data, including metadata and example rows.
- Analyze conversation length and content: Assess if the dataset aligns with your target application.
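One way to run this vibe check programmatically — a minimal sketch (not from the talk) assuming the 🤗 `datasets` library and a chat-style dataset with a `messages` column, e.g. `HuggingFaceH4/ultrachat_200k`:

```python
# Quick "vibe check" on a Hub dataset: preview rows and look at conversation lengths.
# Assumes a chat-style dataset with a "messages" column; adjust names for your dataset.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1000]")

print(ds)     # column names and number of rows
print(ds[0])  # inspect one example in full

turn_counts = [len(example["messages"]) for example in ds]
print("avg turns per conversation:", sum(turn_counts) / len(turn_counts))
```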
Creating Your Own Dataset
- Adapt existing NLP datasets: Restructure data from classic NLP tasks for LLM fine-tuning (see the sketch at the end of this section).
- Leverage user feedback: Analyze existing user interactions (e.g., thumbs up/down) for preference data.
- Synthetic data: A powerful method for jumpstarting dataset creation.
- Getting data?
- Data format: Ensure the data format closely resembles the intended use case for the LLM.
- Preprocessing: The effort invested in data preparation will benefit the deployment stage.
- What kind of dataset do you need for fine-tuning?
- Specificity over diversity: Focus on data relevant to your specific use case, even if it means the model loses some general abilities.
- Data diversity should reflect the target application: Don’t aim for broad diversity if your application is narrow.
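For the "adapt existing NLP datasets" point, the restructuring can be as simple as mapping a classification dataset into chat messages. A minimal sketch (not from the talk), assuming the 🤗 `datasets` library and the IMDB sentiment dataset:

```python
# Restructure a classic NLP dataset (IMDB sentiment) into chat-style SFT records.
# The prompt wording and message schema are illustrative assumptions.
from datasets import load_dataset

imdb = load_dataset("stanfordnlp/imdb", split="train[:100]")

def to_messages(example):
    label = "positive" if example["label"] == 1 else "negative"
    return {
        "messages": [
            {"role": "user", "content": f"Classify the sentiment of this review:\n\n{example['text']}"},
            {"role": "assistant", "content": label},
        ]
    }

sft_ds = imdb.map(to_messages, remove_columns=imdb.column_names)
print(sft_ds[0]["messages"])
```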
Common Dataset Genres
- SFT (Supervised Fine Tuning)
- Structure: Question and answer pairs.
- RLHF (Reinforcement Learning from Human Feedback)
- Structure: Similar to SFT, but with additional preference information.
- DPO (Direct Preference Optimization)
- Structure: Input, chosen response, and rejected response.
- Flexibility: “Chosen” and “rejected” can be generated creatively (e.g., using human-written text as “chosen” and model-generated text as “rejected”).
- KTO (Kahneman-Tversky Optimization)
- argilla’s Collections: Preference Datasets for KTO
- Structure: Model response and a binary preference (thumbs up/down).
- Easy to collect: Users can readily provide simple preference feedback.
- SPIN/ORPO
- HuggingFaceH4’s Collections: Awesome SFT datasets
- SPIN: Iterative approach to reduce data requirements by synthetically generating responses.
- ORPO: Similar to DPO but doesn’t require a separate supervised fine-tuning step, making it more efficient.
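To make the structural differences concrete, here are illustrative record shapes for SFT, DPO, and KTO (not from the talk; field names follow common TRL-style conventions and vary between datasets):

```python
# Illustrative record shapes for the main dataset genres (field names vary in practice).

sft_record = {
    "messages": [
        {"role": "user", "content": "What is DPO?"},
        {"role": "assistant", "content": "Direct Preference Optimization is ..."},
    ]
}

dpo_record = {
    "prompt": "What is DPO?",
    "chosen": "Direct Preference Optimization is a method that ...",  # preferred response
    "rejected": "DPO is a kind of database.",                         # dispreferred response
}

kto_record = {
    "prompt": "What is DPO?",
    "completion": "Direct Preference Optimization is a method that ...",
    "label": True,  # binary thumbs up (True) / thumbs down (False)
}
```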
Synthetic Data
- Definition: Data generated by LLMs, often used for fine-tuning other LLMs.
- Methods:
- Generating prompts and completions from scratch.
- Rephrasing existing prompts to improve quality and diversity.
- AI feedback: Using LLMs to judge the quality of other LLM outputs.
Basic Taxonomy of Synthetic Data Uses
Instructions
- Generated text (completions) for SFT / IFT
- Basic flow: prompt → LLM → instruction
Self-Instruct Bootstrapping
- Process: prompt → LLM → more prompts → LLM → instructions → filtering → repeat
Using an LLM to Generate Corrected Samples
- Process: principles + instruction → LLM → corrected instruction → repeat to filter
Preferences
- Scoring / choosing responses for RM / RLHF training
- Process: instruction-1 … instruction-N → LLM → scores or chosen/rejected → filtering → repeat
Critiques
- Using LLM principles to generate pairwise completions
- Process: initial instruction (rejected response) + principles → LLM → corrected instruction (chosen response) → filtering → repeat
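To make the Self-Instruct-style bootstrapping flow above more concrete, here is a minimal sketch (my own illustration, not from the talk): it asks an LLM for new prompts seeded from a handful of examples and filters near-duplicates with a cheap lexical check. The `openai` client usage, model name, and similarity threshold are assumptions.

```python
# Minimal Self-Instruct-style bootstrapping sketch (illustrative, not from the talk).
# Assumes the `openai` Python client and an OPENAI_API_KEY in the environment.
import difflib
from openai import OpenAI

client = OpenAI()
seed_prompts = [
    "Explain the difference between SFT and DPO.",
    "Write a short summary of a dataset card.",
]

def generate_new_prompts(seeds: list[str], n: int = 5) -> list[str]:
    """Ask the model to propose new instructions similar in spirit to the seeds."""
    prompt = (
        "Here are some example instructions:\n"
        + "\n".join(f"- {s}" for s in seeds)
        + f"\nWrite {n} new, diverse instructions, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("-1234567890. ").strip() for line in lines if line.strip()]

def is_near_duplicate(candidate: str, existing: list[str], threshold: float = 0.8) -> bool:
    """Cheap lexical similarity filter; real pipelines often use ROUGE or embeddings."""
    return any(
        difflib.SequenceMatcher(None, candidate.lower(), e.lower()).ratio() > threshold
        for e in existing
    )

pool = list(seed_prompts)
for candidate in generate_new_prompts(seed_prompts):
    if not is_near_duplicate(candidate, pool):
        pool.append(candidate)

print(pool)
```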
Challenges
- Alpaca 7B model: Trained on synthetic data generated by prompting a language model with instructions.
- Blog Post: Alpaca: A Strong, Replicable Instruction-Following Model
- Challenges: Synthetic data can contain hallucinations, toxicity, and biases inherited from the generating model.
- Example: Alpaca 7B exhibiting hallucinations and inaccurate claims.
- Ultrafeedback paper: Used multiple models and GPT-4 for judging synthetic data quality based on multiple criteria.
- Paper: UltraFeedback: Boosting Language Models with Scaled AI Feedback
- GitHub Repository: https://github.com/OpenBMB/UltraFeedback
- Challenges: Coding errors, API failures, and data handling issues can severely impact data quality, even with advanced models.
- Example: Errors in the Ultrafeedback dataset highlighted the importance of human review and data cleaning.
- Scaling challenges: Generating high-quality synthetic data at scale requires careful consideration of cost, vendor lock-in, and data quality.
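A stripped-down sketch of the AI-feedback idea behind UltraFeedback (my own illustration: the prompt wording, single criterion, and judge model are assumptions; the paper scores several criteria with detailed rubrics and multiple models):

```python
# Minimal AI-feedback sketch: ask a judge model to score candidate responses 1-10,
# then keep the best/worst as chosen/rejected. All specifics here are assumptions.
import re
from openai import OpenAI

client = OpenAI()

def judge(instruction: str, response: str) -> int:
    prompt = (
        f"Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
        "Rate the helpfulness of the response from 1 to 10. Reply with only the number."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"\d+", completion.choices[0].message.content)
    return int(match.group()) if match else 0

instruction = "Explain what a DPO dataset looks like."
candidates = [
    "A DPO dataset pairs a prompt with a chosen and a rejected response.",
    "It is a database.",
]
scores = [judge(instruction, c) for c in candidates]
chosen, rejected = (candidates[0], candidates[1]) if scores[0] >= scores[1] else (candidates[1], candidates[0])
print(scores, chosen)
```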
Tools
- Outlines:
- Enables structured text generation in various formats (JSON, Regex, etc.) by modifying token sampling.
- GitHub Repository: https://github.com/outlines-dev/outlines
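A small sketch of structured generation with Outlines, based on the pre-1.0 API (the interface has changed across releases, so treat the exact calls as assumptions and check the current docs):

```python
# Structured generation with Outlines: constrain output to a Pydantic schema.
# Based on the pre-1.0 outlines API (outlines.models / outlines.generate); newer
# releases expose a different interface, so treat these calls as assumptions.
from pydantic import BaseModel
import outlines

class QAPair(BaseModel):
    question: str
    answer: str

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, QAPair)

pair = generator("Write a question and answer about dataset deduplication.")
print(pair.question, pair.answer)
```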
- DSPy:
- Focuses on “programming” LLM behavior through a defined signature, optimizing prompts and fine-tuning for specific tasks.
- GitHub Repository: https://github.com/stanfordnlp/dspy
- Blog Post: Fuck You, Show Me The Prompt.
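A minimal DSPy sketch of the "program behavior via a signature" idea (the LM configuration call and model name are assumptions; the API differs between DSPy versions):

```python
# DSPy sketch: declare the task as a signature and let DSPy manage the prompt.
# The LM configuration and model name are assumptions; check the current DSPy docs.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class SummarizeCard(dspy.Signature):
    """Summarize a Hugging Face dataset card in one sentence."""
    card_text: str = dspy.InputField()
    summary: str = dspy.OutputField()

summarize = dspy.Predict(SummarizeCard)
result = summarize(card_text="This dataset contains 10k prompt/response pairs ...")
print(result.summary)
```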
- distilabel:
- A pipeline framework for synthetic data generation and AI feedback, emphasizing scalability and dataset engineer workflows.
- GitHub Repository: https://github.com/argilla-io/distilabel
Improving Data
Ensuring sufficient quantity
- More ≠ better
- Be okay with “throwing away” annotations and data
- Data requirements (based on paper-reported numbers): SFT: ~10K, ORPO: ~7K, DPO: ~3K, SPIN: ~2K examples
- Data requirements will be higher for more diverse inputs + outputs (would you have used BERT a few years ago to do this task?)
- Blog Post: https://argilla.io/blog/mantisnlp-rlhf-part-9/
Deduplication
- Blog Post: FineWeb: decanting the web for the finest text data at scale
- Challenges: Deduplication pipelines are often not reproducible or well-documented.
- Approaches:
- Intuitive rules and metadata filtering.
- Topic-wise deduplication.
- Custom metadata and feature engineering.
- Embedding similarity and exemplar selection.
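A small-scale sketch of the embedding-similarity approach (my example, using sentence-transformers; FineWeb-scale pipelines use MinHash/LSH instead, and the model name and threshold here are assumptions):

```python
# Near-duplicate detection via embedding similarity (small-scale sketch).
# The model name and 0.9 threshold are assumptions; tune them on your data.
from sentence_transformers import SentenceTransformer

texts = [
    "How do I fine-tune an LLM on my own data?",
    "How can I fine-tune a large language model on my data?",
    "What is the capital of France?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, normalize_embeddings=True)
similarity = embeddings @ embeddings.T  # cosine similarity since vectors are normalized

keep = []
for i in range(len(texts)):
    # Drop a text if it is too similar to one we already kept.
    if all(similarity[i, j] < 0.9 for j in keep):
        keep.append(i)

print([texts[i] for i in keep])
```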
Rule based
- Regex: Useful for simple rule-based cleaning (e.g., removing unwanted phrases or patterns).
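For example, a regex pass can strip or drop common boilerplate in synthetic responses (the patterns below are assumptions; adapt them to what actually appears in your data):

```python
# Simple regex-based cleaning: drop or trim responses containing unwanted boilerplate.
import re

AI_PREAMBLE = re.compile(r"^as an ai language model[,.]?\s*", re.IGNORECASE)
REFUSAL = re.compile(r"i cannot assist with that", re.IGNORECASE)

def clean(response: str) -> str | None:
    """Return a cleaned response, or None if the row should be dropped."""
    if REFUSAL.search(response):
        return None  # refusal-style responses are usually not useful for SFT
    return AI_PREAMBLE.sub("", response).strip()

print(clean("As an AI language model, I think deduplication matters."))
print(clean("Sorry, I cannot assist with that."))
```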
Quality
- Tutorial: Filtering corpora using Quality
- Techniques:
- Basic heuristics
- Topic modeling
- Embeddings
- Classifiers (don’t dismiss these even if you have $$$)
- LLM as judge/juries/rationalizers
- Human annotation
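As a starting point, the basic heuristics from the list above are cheap to apply before any model-based filtering (the thresholds below are assumptions; always inspect what gets removed):

```python
# Basic heuristic quality filters for instruction/response pairs.
# Thresholds are assumptions; inspect what they remove before trusting them.
def passes_heuristics(example: dict) -> bool:
    response = example["response"]
    words = response.split()
    if len(words) < 5 or len(words) > 2000:      # too short or suspiciously long
        return False
    if response.count("http") > 5:               # link spam
        return False
    unique_ratio = len(set(words)) / len(words)  # repetition check
    return unique_ratio > 0.3

rows = [
    {"response": "ok"},
    {"response": "Deduplication removes near-identical rows so the model does not overfit to them."},
]
print([passes_heuristics(r) for r in rows])
```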
Human Annotation
- Finding the right balance: Choose tools and approaches that fit your needs and budget, from simple spreadsheets to fully customized solutions.
- Custom
- Example: Vincent D. Warmerdam’s bulk annotation tool using Bokeh and Pandas for interactive data exploration and annotation.
- GitHub Repository: koaning/bulk
- Notebooks
- Example: ipyannotations for in-notebook annotation with customizable callbacks for post-processing and active learning.
- Documentation: ipyannotations
- Apps
- Example: Gradio, Streamlit, and Shiny for building custom web apps with intuitive UIs for annotation tasks.
- Lilac
- Website: https://www.lilacml.com/
- Features: Dataset overview, semantic search, and integrated annotation tools.
- Argilla
- Website: https://argilla.io/
- Features: Similar to Lilac, with a focus on dataset management and annotation workflows.
Example Datasets
distilabel Orca Pairs for DPO
- Dataset: argilla/distilabel-intel-orca-dpo-pairs
- Why it’s cool and what you can learn:
- Filtering: less is more
- Always treating the “stronger” model’s response as the chosen one doesn’t always work
- Doing some data-focused work with humans can make a big impact on model performance
Gutenberg DPO
- Dataset: jondurbin/gutenberg-dpo-v0.1
- Approach: Uses human-written books and LLM-generated summaries to create a DPO dataset for improving LLM writing ability.
PlatVR KTO Dataset
- Dataset: ITG/PlatVR-kto
- Approach: Collects thumbs up/down ratings on model outputs to create a KTO dataset for a vision-language task.
- Why it’s cool and what you can learn:
- Example of how KTO datasets can work well as a data flywheel
- Users submit a prompt, get a response, and 👍/👎 that response
- Cheap to collect and can be useful data even if you don’t end up using KTO
- Disclaimer: The creation process was done using a crowdsourcing methodology. Therefore, the preferences in the data align with the user group that participated in the process (i.e., these are real preference data).
Case Study: LLM Summarizer
- Goal: Build an LLM-based summarizer for dataset cards on Hugging Face.
- Approach: Uses a Distilabel pipeline to generate and judge summaries, incorporating both model and human feedback.
The distilabel pipeline
- Key steps:
- Data loading and filtering.
- Prompt formatting.
- Summary generation using multiple models.
- LLM-based judging using UltraFeedback and Llama 3.
- Human review and annotation using Argilla.
- Iterative process: Experiment with different prompts, models, and judging criteria to optimize the pipeline.
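A rough sketch of what such a pipeline can look like in distilabel 1.x (the repo id, column names, and model are placeholders/assumptions; the actual pipeline also generates with multiple models, judges with UltraFeedback, and pushes results to Argilla for human review):

```python
# Rough distilabel 1.x pipeline sketch: load dataset cards, generate summaries.
# Repo id, column names, and model are placeholders/assumptions, not the talk's code.
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="dataset-card-summaries") as pipeline:
    load_cards = LoadDataFromHub(
        repo_id="<dataset-cards-dataset>",        # placeholder repo id
        split="train",
        output_mappings={"card": "instruction"},  # assumed source column name
    )
    summarize = TextGeneration(
        llm=OpenAILLM(model="gpt-4o-mini"),       # assumed model
    )
    load_cards >> summarize

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    print(distiset)
```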
Resources
- GitHub repository: davanstrien/data-for-fine-tuning-llms
- Contains notebooks with code examples for deduplication, data checks, and synthetic data generation.
- GitHub Repository: davanstrien/awesome-synthetic-datasets
- Organizes resources focused on helping people get started with building synthetic datasets.
Q&A
- Question: How to generate synthetic data when fine-tuning on proprietary data and human annotation is expensive?
- Answer: The decision to use synthetic data vs. proprietary data involves trade-offs related to data ownership, privacy, and control over the model.