Conference Talk 8: Creating, curating, and cleaning data for LLMs
In this talk, Daniel van Strien from 🤗 outlines key considerations and techniques for creating high-quality datasets for fine-tuning LLMs.
This post is part of the following series:
- Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.
- Reusing Existing Datasets
- Creating Your Own Dataset
- Common Dataset Genres
- Synthetic Data
- Improving Data
- Human Annotation
- Example Datasets
- Case Study: LLM Summarizer
- Resources
- Q&A
Reusing Existing Datasets
- 🤗 Datasets: https://huggingface.co/datasets
- What kind of existing dataset?
- Consider the use case: Datasets like fineweb are designed for pre-training, not fine-tuning.
- 🤗 Dataset: HuggingFaceFW/fineweb
- Look beyond research datasets: Community-contributed datasets can offer unique and valuable data.
- Browsing datasets
- Use tags: Filter datasets based on specific formats or tasks (e.g., DPO datasets).
- Searching for datasets
- Full-text search: Useful if dataset names are not descriptive enough.
- Reviewing if a dataset is suitable: Vibe checks
- Dataset viewer: Provides a preview of the data, including metadata and example rows.
- Analyze conversation length and content: Assess if the dataset aligns with your target application.
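One way to run this vibe check programmatically — a minimal sketch (not from the talk) assuming the 🤗 `datasets` library and a chat-style dataset with a `messages` column, e.g. `HuggingFaceH4/ultrachat_200k`:

```python
# Quick "vibe check" on a Hub dataset: preview rows and look at conversation lengths.
# Assumes a chat-style dataset with a "messages" column; adjust names for your dataset.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1000]")

print(ds)     # column names and number of rows
print(ds[0])  # inspect one example in full

turn_counts = [len(example["messages"]) for example in ds]
print("avg turns per conversation:", sum(turn_counts) / len(turn_counts))
```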
Creating Your Own Dataset
- Adapt existing NLP datasets: Restructure data from classic NLP tasks for LLM fine-tuning (see the sketch at the end of this section).
- Leverage user feedback: Analyze existing user interactions (e.g., thumbs up/down) for preference data.
- Synthetic data: A powerful method for jumpstarting dataset creation.
- Getting data?
- Data format: Ensure the data format closely resembles the intended use case for the LLM.
- Preprocessing: The effort invested in data preparation will benefit the deployment stage.
- What kind of dataset do you need for fine-tuning?
- Specificity over diversity: Focus on data relevant to your specific use case, even if it means the model loses some general abilities.
- Data diversity should reflect the target application: Don’t aim for broad diversity if your application is narrow.
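For the "adapt existing NLP datasets" point, the restructuring can be as simple as mapping a classification dataset into chat messages. A minimal sketch (not from the talk), assuming the 🤗 `datasets` library and the IMDB sentiment dataset:

```python
# Restructure a classic NLP dataset (IMDB sentiment) into chat-style SFT records.
# The prompt wording and message schema are illustrative assumptions.
from datasets import load_dataset

imdb = load_dataset("stanfordnlp/imdb", split="train[:100]")

def to_messages(example):
    label = "positive" if example["label"] == 1 else "negative"
    return {
        "messages": [
            {"role": "user", "content": f"Classify the sentiment of this review:\n\n{example['text']}"},
            {"role": "assistant", "content": label},
        ]
    }

sft_ds = imdb.map(to_messages, remove_columns=imdb.column_names)
print(sft_ds[0]["messages"])
```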
Common Dataset Genres
- SFT (Supervised Fine Tuning)
- Structure: Question and answer pairs.
- RLHF (Reinforcement Learning from Human Feedback)
- Structure: Similar to SFT, but with additional preference information.
- DPO (Direct Preference Optimization)
- Structure: Input, chosen response, and rejected response.
- Flexibility: “Chosen” and “rejected” can be generated creatively (e.g., using human-written text as “chosen” and model-generated text as “rejected”).
- KTO (Kahneman-Tversky Optimization)
- argilla’s Collections: Preference Datasets for KTO
- Structure: Model response and a binary preference (thumbs up/down).
- Easy to collect: Users can readily provide simple preference feedback.
- SPIN/ORPO
- HuggingFaceH4’s Collections: Awesome SFT datasets
- SPIN: Iterative approach to reduce data requirements by synthetically generating responses.
- ORPO: Similar to DPO but doesn’t require a separate supervised fine-tuning step, making it more efficient.
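To make the structural differences concrete, here are illustrative record shapes for SFT, DPO, and KTO (not from the talk; field names follow common TRL-style conventions and vary between datasets):

```python
# Illustrative record shapes for the main dataset genres (field names vary in practice).

sft_record = {
    "messages": [
        {"role": "user", "content": "What is DPO?"},
        {"role": "assistant", "content": "Direct Preference Optimization is ..."},
    ]
}

dpo_record = {
    "prompt": "What is DPO?",
    "chosen": "Direct Preference Optimization is a method that ...",  # preferred response
    "rejected": "DPO is a kind of database.",                         # dispreferred response
}

kto_record = {
    "prompt": "What is DPO?",
    "completion": "Direct Preference Optimization is a method that ...",
    "label": True,  # binary thumbs up (True) / thumbs down (False)
}
```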
Synthetic Data
- Definition: Data generated by LLMs, often used for fine-tuning other LLMs.
- Methods:
- Generating prompts and completions from scratch.
- Rephrasing existing prompts to improve quality and diversity.
- AI feedback: Using LLMs to judge the quality of other LLM outputs.
Basic Taxonomy of Synthetic Data Uses
Instructions
- Generated text (completions) for SFT / IFT
- Basic flow: prompt → LLM → instruction
Self-Instruct Bootstrapping
- Process: prompt → LLM → more prompts → LLM → instructions → filtering → repeat
Using an LLM to Generate Corrected Samples
- Process: principles + instruction → LLM → corrected instruction → repeat to filter
Preferences
- Scoring / choosing responses for RM / RLHF training
- Process: instruction-1 … instruction-N → LLM → scores or chosen/rejected → filtering → repeat
Critiques
- Using LLM principles to generate pairwise completions
- Process: initial instruction (rejected response) + principles → LLM → corrected instruction (chosen response) → filtering → repeat
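To make the Self-Instruct-style bootstrapping flow above more concrete, here is a minimal sketch (my own illustration, not from the talk): it asks an LLM for new prompts seeded from a handful of examples and filters near-duplicates with a cheap lexical check. The `openai` client usage, model name, and similarity threshold are assumptions.

```python
# Minimal Self-Instruct-style bootstrapping sketch (illustrative, not from the talk).
# Assumes the `openai` Python client and an OPENAI_API_KEY in the environment.
import difflib
from openai import OpenAI

client = OpenAI()
seed_prompts = [
    "Explain the difference between SFT and DPO.",
    "Write a short summary of a dataset card.",
]

def generate_new_prompts(seeds: list[str], n: int = 5) -> list[str]:
    """Ask the model to propose new instructions similar in spirit to the seeds."""
    prompt = (
        "Here are some example instructions:\n"
        + "\n".join(f"- {s}" for s in seeds)
        + f"\nWrite {n} new, diverse instructions, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("-1234567890. ").strip() for line in lines if line.strip()]

def is_near_duplicate(candidate: str, existing: list[str], threshold: float = 0.8) -> bool:
    """Cheap lexical similarity filter; real pipelines often use ROUGE or embeddings."""
    return any(
        difflib.SequenceMatcher(None, candidate.lower(), e.lower()).ratio() > threshold
        for e in existing
    )

pool = list(seed_prompts)
for candidate in generate_new_prompts(seed_prompts):
    if not is_near_duplicate(candidate, pool):
        pool.append(candidate)

print(pool)
```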
Challenges
- Alpaca 7B model: Trained on synthetic data generated by prompting a language model with instructions.
- Blog Post: Alpaca: A Strong, Replicable Instruction-Following Model
- Challenges: Synthetic data can contain hallucinations, toxicity, and biases inherited from the generating model.
- Example: Alpaca 7B exhibiting hallucinations and inaccurate claims.
- Ultrafeedback paper: Used multiple models and GPT-4 for judging synthetic data quality based on multiple criteria.
- Paper: UltraFeedback: Boosting Language Models with Scaled AI Feedback
- GitHub Repository: https://github.com/OpenBMB/UltraFeedback
- Challenges: Coding errors, API failures, and data handling issues can severely impact data quality, even with advanced models.
- Example: Errors in the Ultrafeedback dataset highlighted the importance of human review and data cleaning.
- Scaling challenges: Generating high-quality synthetic data at scale requires careful consideration of cost, vendor lock-in, and data quality.
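A stripped-down sketch of the AI-feedback idea behind UltraFeedback (my own illustration: the prompt wording, single criterion, and judge model are assumptions; the paper scores several criteria with detailed rubrics and multiple models):

```python
# Minimal AI-feedback sketch: ask a judge model to score candidate responses 1-10,
# then keep the best/worst as chosen/rejected. All specifics here are assumptions.
import re
from openai import OpenAI

client = OpenAI()

def judge(instruction: str, response: str) -> int:
    prompt = (
        f"Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
        "Rate the helpfulness of the response from 1 to 10. Reply with only the number."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"\d+", completion.choices[0].message.content)
    return int(match.group()) if match else 0

instruction = "Explain what a DPO dataset looks like."
candidates = [
    "A DPO dataset pairs a prompt with a chosen and a rejected response.",
    "It is a database.",
]
scores = [judge(instruction, c) for c in candidates]
chosen, rejected = (candidates[0], candidates[1]) if scores[0] >= scores[1] else (candidates[1], candidates[0])
print(scores, chosen)
```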
Tools
- Outlines:
- Enables structured text generation in various formats (JSON, Regex, etc.) by modifying token sampling.
- GitHub Repository: https://github.com/outlines-dev/outlines
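A small sketch of structured generation with Outlines, based on the pre-1.0 API (the interface has changed across releases, so treat the exact calls as assumptions and check the current docs):

```python
# Structured generation with Outlines: constrain output to a Pydantic schema.
# Based on the pre-1.0 outlines API (outlines.models / outlines.generate); newer
# releases expose a different interface, so treat these calls as assumptions.
from pydantic import BaseModel
import outlines

class QAPair(BaseModel):
    question: str
    answer: str

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, QAPair)

pair = generator("Write a question and answer about dataset deduplication.")
print(pair.question, pair.answer)
```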
- DSPy:
- Focuses on “programming” LLM behavior through a defined signature, optimizing prompts and fine-tuning for specific tasks.
- GitHub Repository: https://github.com/stanfordnlp/dspy
- Blog Post: Fuck You, Show Me The Prompt.
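A minimal DSPy sketch of the "program behavior via a signature" idea (the LM configuration call and model name are assumptions; the API differs between DSPy versions):

```python
# DSPy sketch: declare the task as a signature and let DSPy manage the prompt.
# The LM configuration and model name are assumptions; check the current DSPy docs.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class SummarizeCard(dspy.Signature):
    """Summarize a Hugging Face dataset card in one sentence."""
    card_text: str = dspy.InputField()
    summary: str = dspy.OutputField()

summarize = dspy.Predict(SummarizeCard)
result = summarize(card_text="This dataset contains 10k prompt/response pairs ...")
print(result.summary)
```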
- distilabel:
- A pipeline framework for synthetic data generation and AI feedback, emphasizing scalability and dataset engineer workflows.
- GitHub Repository: https://github.com/argilla-io/distilabel
Improving Data
Ensuring sufficient quantity
- More ≠ better
- Be okay with “throwing away” annotations and data
- Data requirements (based on paper-reported numbers): SFT: ~10K, ORPO: ~7K, DPO: ~3K, SPIN: ~2K examples
- Data requirements will be higher for more diverse inputs + outputs (would you have used BERT a few years ago to do this task?)
- Blog Post: https://argilla.io/blog/mantisnlp-rlhf-part-9/
Deduplication
- Blog Post: FineWeb: decanting the web for the finest text data at scale
- Challenges: Deduplication pipelines are often not reproducible or well-documented.
- Approaches:
- Intuitive rules and metadata filtering.
- Topic-wise deduplication.
- Custom metadata and feature engineering.
- Embedding similarity and exemplar selection.
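A small-scale sketch of the embedding-similarity approach (my example, using sentence-transformers; FineWeb-scale pipelines use MinHash/LSH instead, and the model name and threshold here are assumptions):

```python
# Near-duplicate detection via embedding similarity (small-scale sketch).
# The model name and 0.9 threshold are assumptions; tune them on your data.
from sentence_transformers import SentenceTransformer

texts = [
    "How do I fine-tune an LLM on my own data?",
    "How can I fine-tune a large language model on my data?",
    "What is the capital of France?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, normalize_embeddings=True)
similarity = embeddings @ embeddings.T  # cosine similarity since vectors are normalized

keep = []
for i in range(len(texts)):
    # Drop a text if it is too similar to one we already kept.
    if all(similarity[i, j] < 0.9 for j in keep):
        keep.append(i)

print([texts[i] for i in keep])
```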
Rule based
- Regex: Useful for simple rule-based cleaning (e.g., removing unwanted phrases or patterns).
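For example, a regex pass can strip or drop common boilerplate in synthetic responses (the patterns below are assumptions; adapt them to what actually appears in your data):

```python
# Simple regex-based cleaning: drop or trim responses containing unwanted boilerplate.
import re

AI_PREAMBLE = re.compile(r"^as an ai language model[,.]?\s*", re.IGNORECASE)
REFUSAL = re.compile(r"i cannot assist with that", re.IGNORECASE)

def clean(response: str) -> str | None:
    """Return a cleaned response, or None if the row should be dropped."""
    if REFUSAL.search(response):
        return None  # refusal-style responses are usually not useful for SFT
    return AI_PREAMBLE.sub("", response).strip()

print(clean("As an AI language model, I think deduplication matters."))
print(clean("Sorry, I cannot assist with that."))
```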
Quality
- Tutorial: Filtering corpora using Quality
- Techniques:
- Basic heuristics
- Topic modeling
- Embeddings
- Classifiers (don’t dismiss these even if you have $$$)
- LLM as judge/juries/rationalizers
- Human annotation
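As a starting point, the basic heuristics from the list above are cheap to apply before any model-based filtering (the thresholds below are assumptions; always inspect what gets removed):

```python
# Basic heuristic quality filters for instruction/response pairs.
# Thresholds are assumptions; inspect what they remove before trusting them.
def passes_heuristics(example: dict) -> bool:
    response = example["response"]
    words = response.split()
    if len(words) < 5 or len(words) > 2000:      # too short or suspiciously long
        return False
    if response.count("http") > 5:               # link spam
        return False
    unique_ratio = len(set(words)) / len(words)  # repetition check
    return unique_ratio > 0.3

rows = [
    {"response": "ok"},
    {"response": "Deduplication removes near-identical rows so the model does not overfit to them."},
]
print([passes_heuristics(r) for r in rows])
```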
Human Annotation
- Finding the right balance: Choose tools and approaches that fit your needs and budget, from simple spreadsheets to fully customized solutions.
- Custom
- Example: Vincent D. Warmerdam’s bulk annotation tool using Bokeh and Pandas for interactive data exploration and annotation.
- GitHub Repository: koaning/bulk
- Notebooks
- Example: ipyannotations for in-notebook annotation with customizable callbacks for post-processing and active learning.
- Documentation: ipyannotations
- Apps
- Example: Gradio, Streamlit, and Shiny for building custom web apps with intuitive UIs for annotation tasks.
- Lilac
- Website: https://www.lilacml.com/
- Features: Dataset overview, semantic search, and integrated annotation tools.
- Argilla
- Website: https://argilla.io/
- Features: Similar to Lilac, with a focus on dataset management and annotation workflows.
Example Datasets
distilabel Orca Pairs for DPO
- Dataset: argilla/distilabel-intel-orca-dpo-pairs
- Why it’s cool and what you can learn:
- Filtering: less is more
- Always treating the “stronger” model’s response as the chosen one doesn’t always work
- Doing some data-focused work with humans can make a big impact on model performance
Gutenberg DPO
- Dataset: jondurbin/gutenberg-dpo-v0.1
- Approach: Uses human-written books and LLM-generated summaries to create a DPO dataset for improving LLM writing ability.
PlatVR KTO Dataset
- Dataset: ITG/PlatVR-kto
- Approach: Collects thumbs up/down ratings on model outputs to create a KTO dataset for a vision-language task.
- Why it’s cool and what you can learn:
- Example of how KTO datasets can work well as a data flywheel
- Users submit a prompt, get a response, and 👍/👎 that response
- Cheap to collect and can be useful data even if you don’t end up using KTO
- Disclaimer: The creation process was done using a crowdsourcing methodology. Therefore, the preferences in the data align with the user group that participated in the process (i.e., these are real preference data).
Case Study: LLM Summarizer
- Goal: Build an LLM-based summarizer for dataset cards on Hugging Face.
- Approach: Uses a Distilabel pipeline to generate and judge summaries, incorporating both model and human feedback.
The distilabel pipeline
- Key steps:
- Data loading and filtering.
- Prompt formatting.
- Summary generation using multiple models.
- LLM-based judging using UltraFeedback and Llama 3.
- Human review and annotation using Argilla.
- Iterative process: Experiment with different prompts, models, and judging criteria to optimize the pipeline.
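A rough sketch of what such a pipeline can look like in distilabel 1.x (the repo id, column names, and model are placeholders/assumptions; the actual pipeline also generates with multiple models, judges with UltraFeedback, and pushes results to Argilla for human review):

```python
# Rough distilabel 1.x pipeline sketch: load dataset cards, generate summaries.
# Repo id, column names, and model are placeholders/assumptions, not the talk's code.
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="dataset-card-summaries") as pipeline:
    load_cards = LoadDataFromHub(
        repo_id="<dataset-cards-dataset>",        # placeholder repo id
        split="train",
        output_mappings={"card": "instruction"},  # assumed source column name
    )
    summarize = TextGeneration(
        llm=OpenAILLM(model="gpt-4o-mini"),       # assumed model
    )
    load_cards >> summarize

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    print(distiset)
```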
Resources
- GitHub repository: davanstrien/data-for-fine-tuning-llms
- Contains notebooks with code examples for deduplication, data checks, and synthetic data generation.
- GitHub Repository: davanstrien/awesome-synthetic-datasets
- Organizes resources focused on helping people get started with building synthetic datasets.
Q&A
- Question: How to generate synthetic data when fine-tuning on proprietary data and human annotation is expensive?
- Answer: The decision to use synthetic data vs. proprietary data involves trade-offs related to data ownership, privacy, and control over the model.