Conference Talk 6: Train Almost Any LLM Model Using 🤗 autotrain

In this talk, Abhishek Thakur, who leads AutoTrain at 🤗, shows how to use 🤗 AutoTrain to train/fine-tune LLMs without having to write any code.
Author: Christian Mills

Published: July 12, 2024

This post is part of the following series:
  • Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.

Introduction to AutoTrain

  • Homepage: https://huggingface.co/autotrain
  • Documentation: https://huggingface.co/docs/autotrain/index
  • GitHub Repository: autotrain-advanced
  • Simplifies model training and fine-tuning for users with varying levels of expertise, from beginners to experienced data scientists.
  • Supported Tasks:
    • NLP: Token classification, text classification, LLM tasks (e.g., SFT, ORPO, DPO, reward modeling), sentence transformer fine-tuning, etc.
    • Computer Vision: Image classification, object detection
    • Tabular Data: Classification, Regression
  • Leverages the Hugging Face ecosystem, including transformers, datasets, diffusers, and Accelerate, ensuring compatibility with the latest models and tools.

Getting Started with AutoTrain

  • Create a new project:
    • Link: Create new project
    • Optionally specify an organization and attach hardware (local or Hugging Face Spaces).
    • Choose the desired task (e.g., LLM SFT).
  • User-Friendly Interface:
    • Select a task.
    • Upload your data or use a dataset from the Hugging Face Hub.
    • Configure parameters or use default settings.
    • Monitor training progress and logs.
  • Documentation: Creating a New AutoTrain Space

Fine-tuning LLMs with AutoTrain

Supervised Fine-tuning (SFT) and Generic Fine-tuning

  • Documentation: Supervised Fine-tuning Trainer
  • Both trainers are similar, but SFT uses the TRL library’s SFT trainer.
  • Requires a “text” column in your dataset (can be mapped from a different column name).
  • Supports chat template formatting (ChatML, Zephyr, or the tokenizer’s built-in template).
  • Example datasets:
    • Salesforce/wikitext: plain text format
    • Chat format with “content” and “role” fields (requires a chat template); a sketch of both formats follows this list.
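
A minimal sketch of the two dataset shapes, assuming local JSONL files (file names, contents, and the “messages” column name are illustrative; non-default column names are handled through column mapping):

# Plain-text format: each JSONL line carries a single "text" field
cat > sft_plain.jsonl << 'EOF'
{"text": "AutoTrain fine-tunes models without requiring custom training code."}
{"text": "The quick brown fox jumps over the lazy dog."}
EOF

# Chat format: each line holds a list of messages with "role" and "content" fields;
# AutoTrain applies the selected chat template to turn these into training text
cat > sft_chat.jsonl << 'EOF'
{"messages": [{"role": "user", "content": "What is AutoTrain?"}, {"role": "assistant", "content": "A no-code tool for fine-tuning models on Hugging Face."}]}
EOF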

Reward Modeling

  • Documentation: Reward Modeling
  • Trains a custom reward model for sequence classification.
  • Dataset requires “chosen” and “rejected” text columns; see the sketch below.
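
A minimal sketch of a reward-modeling dataset, assuming a local JSONL file (file name and contents are illustrative):

# Each line pairs a preferred ("chosen") response with a less-preferred ("rejected") one
cat > reward_data.jsonl << 'EOF'
{"chosen": "Paris is the capital of France.", "rejected": "France does not have a capital."}
{"chosen": "Water boils at 100 °C at sea level.", "rejected": "Water boils at 50 °C at sea level."}
EOF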

DPO and ORPO

  • DPO - Direct Preference Optimization
  • ORPO - Odds Ratio Preference Optimization
  • ORPO is recommended over DPO as it requires less memory and compute.
  • Dataset requires “prompt,” “chosen,” and “rejected” columns (all three can be conversations); a sketch follows this list.
  • Supports chat templates.
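
A minimal sketch of a DPO/ORPO dataset using plain strings, assuming a local JSONL file (file name and contents are illustrative; with a chat template, each value can instead be a conversation, i.e., a list of role/content messages):

# Each line holds a prompt plus a preferred ("chosen") and a rejected completion
cat > orpo_data.jsonl << 'EOF'
{"prompt": "Explain gravity to a child.", "chosen": "Gravity is the pull that keeps us on the ground.", "rejected": "Gravity is a kind of magnetism."}
{"prompt": "Name a prime number.", "chosen": "7 is a prime number.", "rejected": "9 is a prime number."}
EOF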

Training Your Model

Data Format

  • Use CSV or JSON Lines (JSONL) format
    • JSONL preferred for readability and ease of use.
  • Format examples:
    • Alpaca dataset: single “text” field with the prompt template already applied (no chat template needed); a sketch follows this list.
    • Chat format: Requires chat template or offline conversion to plain text.
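
A minimal sketch of Alpaca-style data, assuming a local JSONL file (file name and contents are illustrative):

# The prompt template is already baked into the single "text" field, so no chat template is needed
cat > alpaca_style.jsonl << 'EOF'
{"text": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nList three primary colors.\n\n### Response:\nRed, blue, and yellow."}
EOF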

Training Locally

  • Documentation: Quickstart

  • Set your Hugging Face token:

    export HF_TOKEN=<your_token>

  • Run the AutoTrain app:

    autotrain app --port 8080 --host 127.0.0.1

  • Alternatively, run training from a config file or CLI command:

    autotrain --config <path_to_config_file>

Local Installation

# Quick install with pip
pip install autotrain-advanced

# Recommended: install inside a dedicated conda environment with CUDA-enabled PyTorch
conda create -n autotrain python=3.10
conda activate autotrain
pip install autotrain-advanced
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -c "nvidia/label/cuda-12.1.0" cuda-nvcc

Training on Other Platforms

  • Jarvis Labs: Provides AutoTrain templates for easy setup and training.
  • DGX Cloud: Rent high-performance GPUs for training large models.
  • Google Colab: Run AutoTrain directly in Colab using provided notebooks and UI.

Config Files and Advanced Options

  • Documentation: AutoTrain Configs
  • Config files offer more flexibility and control over training parameters.
  • Define task, base model, data paths, column mapping, hyperparameters, logging, and more.
  • Access example config files in the AutoTrain GitHub repository; a hedged sketch of one appears below.
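
A hedged sketch of an SFT config, based on the example configs in the autotrain-advanced repository (field names and defaults may change between releases, so check the repository for the current schema):

# Write a config file and launch training with it (model, dataset, and values are illustrative)
cat > llm_sft.yml << 'EOF'
task: llm-sft
base_model: meta-llama/Meta-Llama-3-8B-Instruct
project_name: my-autotrain-llm
log: tensorboard
backend: local

data:
  path: HuggingFaceH4/no_robots        # Hub dataset ID or a local folder containing train.jsonl
  train_split: train
  valid_split: null
  chat_template: tokenizer             # chatml, zephyr, tokenizer, or none
  column_mapping:
    text_column: messages

params:
  block_size: 1024
  model_max_length: 8192
  epochs: 1
  batch_size: 1
  lr: 1e-5
  peft: true
  quantization: int4
  target_modules: all-linear
  gradient_accumulation: 4
  mixed_precision: bf16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
EOF

autotrain --config llm_sft.yml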

Additional Features and Considerations

  • AutoTrain automatically handles multi-GPU training using DeepSpeed or Distributed Data Parallel (DDP).
  • QLoRA is supported with DeepSpeed for memory-efficient training.
  • Sentence Transformer fine-tuning is available for tasks like improving RAG models.

Q&A Session

  • Logging: Supports Weights & Biases (W&B) logging when using config files.
  • Mixed Precision: Supports BF16 and FP16, but not FP8.
  • Parameter Compatibility: AutoTrain ensures parameter compatibility based on the chosen base model.
  • Hyperparameter Optimization: Not currently supported for LLMs due to long training times.
  • CPU Training: Possible, but may come with performance limitations.
  • Custom Chat Templates: Can be added by modifying the tokenizer_config.json file of a cloned model (see the sketch after this list).
  • Synthetic Data Generation: Not currently supported, but users can generate their own.
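
Following up on the custom chat template answer, this is roughly what a ChatML-style chat_template entry looks like inside a cloned model’s tokenizer_config.json (Jinja syntax stored as a JSON string; the special tokens are illustrative and should match your model’s vocabulary):

# Printed for illustration only: add or edit this field in the cloned model's tokenizer_config.json
cat << 'EOF'
"chat_template": "{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
EOF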

About Me:
  • I’m Christian Mills, a deep learning consultant specializing in computer vision and practical AI implementations.
  • I help clients leverage cutting-edge AI technologies to solve real-world problems.
  • Learn more about me or reach out via email at [email protected] to discuss your project.