Conference Talk 5: Napkin Math For Fine Tuning with Johno Whitaker

In this talk, Jonathan Whitaker from shows how to build intuition around training performance with a focus on GPU-poor fine tuning.

Christian Mills


July 7, 2024

This post is part of the following series:
  • Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.
Presentation Resources


  • Goal: To provide insights into the factors influencing fine-tuning performance, enabling better decision-making in training models.
  • Target audience: Individuals new to training models, particularly by fine-tuning existing large language models.
  • Key questions addressed:
    • Factors affecting model performance.
    • Strategies for improving performance.
    • Reasons for memory limitations and slow training times.
    • Understanding and adjusting various parameters in configuration files.
  • Approach: Utilizing a “napkin math” approach to provide a general understanding of the concepts without delving into intricate mathematical details.
  • Disclaimer: Emphasizes that the information presented is simplified for clarity and may not be entirely accurate due to the constantly evolving nature of AI implementations.

Good News & Bad News

  • Good news: The mathematical operations underlying model training are well-understood, enabling analysis and experimentation.
  • Bad news:
    • Discrepancies can exist between research papers, code implementations, and framework-specific implementations.
    • The complexity of the underlying processes can be daunting.
    • Keeping track of all the details and nuances is challenging.
    • Multi-GPU setups introduce additional complexity
  • Key takeaway: While a simplified approach can be helpful, acknowledging the inherent complexities is essential for accurate analysis and problem-solving.

Training Neural Networks

Training Loop

  • The core cycle: Load data, feed it through the model, generate an answer, evaluate its quality, and update the model accordingly.
  • Language model context: The correct answer typically refers to the next word in a sequence, and predictions are represented as probabilities for potential next words.
  • Fine-tuning data: Datasets used for fine-tuning contain instruction-response pairs, allowing the model to learn from desired output patterns.

On Computers

  • Hardware components: Understanding the roles of CPU, GPU, RAM, and their interconnectivity is crucial for performance analysis.
  • Memory hierarchy: Different memory types (e.g., CPU cache, RAM, hard drive) have varying access speeds, influencing data transfer times.

Training Neural Networks

  • Factors affecting performance:
    • Data loading: Reading data from storage.
    • Model computation: Performing mathematical operations within the model’s layers.
    • Parameter storage: Memory required to store model parameters (weights and biases).
    • Gradient calculation and storage: Computing and storing gradients for model updates.
    • Optimizer operations: Additional storage and computations for parameter optimization.
  • Key takeaway: Each step in the training loop incurs computational costs and memory demands, understanding these factors is crucial for optimization.

What takes up time?

  • Computation: The number of mathematical operations performed.
  • Memory management: Data transfer and storage within the memory hierarchy.

Why are we copying data around?

  • Memory hierarchy and data transfer:
    • Different memory types: CPU cache, RAM, GPU RAM, hard drives, etc., have varying speeds.
    • Data movement: Copying data between these memory locations consumes time, impacting performance.
  • Optimizing data flow: Minimizing unnecessary data transfers and utilizing faster memory options are crucial for optimization.
  • Multi-GPU and distributed training:
    • Inter-GPU communication: Sharing data between GPUs can introduce latency.
    • Network communication: In multi-node setups, communication over the network becomes a bottleneck.

Goal: Keep the GPU Fed

  • Goal: Continuously provide the GPU with data and instructions to avoid downtime.
  • Ideal: Fit the entire model and data within the GPU RAM for fastest processing.
  • Bottleneck: Large models and limited GPU memory can cause frequent data loading from slower memory, slowing down the training process.

Tricks to improve memory efficiency

  • Techniques for keeping data closer to the GPU:
    • Flash Attention and Fused Kernels: These reduce memory footprints and data transfers by changing how computations are performed.
    • Gradient Checkpointing (Activation Checkpointing): Trades a small increase in compute time for significant memory savings by selectively recomputing activations during backpropagation.
    • CPU Offloading: Leverages larger CPU RAM by temporarily storing parts of the model or data that are not actively being used on the GPU.
    • LoRA (Low-Rank Adaptation):
      • Freeze most model parameters and train only a small set of adapter parameters.
      • Reduces memory requirements for gradients and optimizer states.
    • Quantization:
      • Represent model weights using fewer bits (e.g., 8-bit instead of 32-bit).
      • Reduces memory footprint, allowing for larger models or larger batch sizes.
      • Needs a little computation to dequantize

Napkin Math: Understanding Memory Usage

Full Fine-Tuning

  • Problem: Full fine-tuning requires storing model parameters, gradients, and optimizer states, leading to high memory consumption.
  • Explanation:
    • Each parameter in a model requires multiple bits (e.g., 32 bits) to represent its numeric value.
    • Gradients for each parameter are stored during training, consuming the same amount of memory as the parameters.
    • Optimizers like Adam store additional states (e.g., momentum), further increasing memory usage.
  • Example: A model with 100 million parameters using 32 bits per parameter consumes 400MB.
    • Considering gradients and optimizer states, the total memory usage can be 1.2GB or higher.

LoRA (Low-Rank Adaptation)

  • Solution: LoRA reduces memory usage by only training a small subset of parameters while keeping the rest frozen.
    • Example: For every 1000 x 1000 matrix, add a 1000 x 32 matrix
  • Explanation:
    • Only trainable parameters require gradients and optimizer state storage.
    • LoRA adds a small number of trainable parameters (e.g., 1% of the model size) to each large matrix in the model.
  • Benefit: LoRA significantly reduces memory requirements, enabling training on larger models without fitting the entire model and its associated data into GPU memory.


  • Solution: Quantization reduces memory usage by representing model parameters with fewer bits.
  • Explanation:
    • Instead of using 32 bits, quantization uses 8 bits or even 4 bits per parameter.
  • Benefit: Quantization shrinks the memory footprint of the model, allowing for larger model training or fitting on memory-constrained devices.
  • Combination with LoRA (QLoRA): Quantize the frozen base model weights while keeping the LoRA parameters in higher precision for accurate training.

CPU Offloading Illustration

  • Solution: Utilize CPU RAM to store model components and offload computations when GPU memory is insufficient.
  • Explanation: Load and process one layer of the model on the GPU at a time, storing the remaining layers and intermediate data in CPU memory.
  • Trade-off: While enabling larger model training, CPU offloading introduces latency due to data transfer between CPU and GPU.

LLM Context Length Considerations

  • Problem: Large context lengths lead to increased memory consumption for storing activations.
  • Explanation:
    • Each layer’s output activations need to be stored for gradient calculations, and these activations accumulate with longer input sequences.
  • Impact: Training LLMs with long context lengths requires careful memory management and might necessitate techniques like gradient checkpointing to reduce activation memory footprint.

Napkin Math Code Demo

  • Demonstrates using PyTorch’s memory tracking tools to analyze memory usage during training.

Measuring Memory

  • Challenge: Limited GPU RAM restricts the size of models and batch sizes during training.
  • Solution: Use PyTorch’s memory tracking capabilities to understand and optimize memory consumption.
  • Key Takeaway: Actively monitor memory usage to make informed decisions about model size, batch size, and other hyperparameters.
import torch, gc, inspect, transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from accelerate.utils import set_seed
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model

device = "cuda"
def cleanup():
    """Free up memory and reset stats"""
def print_memory_stats():
    """Print two different measures of GPU memory usage"""
    print(f"Max memory allocated: {torch.cuda.max_memory_allocated(device)/1e9:.2f}GB")
    # reserved (aka 'max_memory_cached') is ~the allocated memory plus pre-cached memory
    print(f"Max memory reserved: {torch.cuda.max_memory_reserved(device)/1e9:.2f}GB") 
Max memory allocated: 0.00GB
Max memory reserved: 0.00GB
# QLoRA Forward + Backward pass

cleanup() # Clean slate each time

# Quantization config for QLoRA versions
bnb_config = BitsAndBytesConfig(

# Load the model
model = AutoModelForCausalLM.from_pretrained(

# This function has caused me lots of pain!
model = prepare_model_for_kbit_training(model)

# Add LoRA
config = LoraConfig(
model = get_peft_model(model, config)

# Move to GPU

# Prepare the data
bs, ctx = 1, 1200 # batch size and context length
data = torch.randint(0, 10000, (bs, ctx), device=device)

# Forward pass
output = model(data, labels=data)

# Backward pass

# Print max memory stats

# Cleanup
del model, output, data
Max memory allocated: 6.42GB
Max memory reserved: 7.09GB

Memory History

Benefits of Memory History Analysis:

  • Identify memory spikes: pinpoint operations causing significant memory increases.
  • Understand memory allocation patterns: Visualize how memory usage evolves throughout the training process (forward pass, backward pass, gradient updates).
  • Evaluate optimization strategies: Observe the impact of techniques like gradient checkpointing and quantization on memory consumption.

Real-World Examples

  • Bug Detection: Unusually high memory spikes revealed incorrect implementations or unintended use of trainable parameters.
  • Memory Leak Identification: Gradual memory escalation over multiple training steps pointed to issues with gradient or loss clearing.
  • Kernel Optimization: Analysis highlighted inefficiencies in HuggingFace’s Transformers library, prompting a switch back to a more memory-efficient attention kernel.

# Load the model
model = AutoModelForCausalLM.from_pretrained(
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, config)

# Prep the data
data = torch.randint(0, 10000, (1, 1000), device=device)

# Start recording

# FOrward and backwards pass
output = model(data, labels=data)

# Save the snapshot and stop recording

# Clean up
del model, output, data

Results viewed in

Optimizing LLM Training for Different Hardware

Benchmarking QLoRA+FSDP

GPU Limitations & Data Transfer Bottleneck

  • Training LLMs often involves frequent data transfers between CPU memory and GPU memory.
  • In systems with limited GPU memory or slow interconnects, these transfers become a significant bottleneck.
  • Each layer of the model might require loading weights onto the GPU, performing computations, and then repeating the process for the next layer.

Optimizing for Memory-Bound Scenarios

  • Goal: Minimize data transfer cycles by maximizing computation done per load.
    • Example: Using a larger batch size allows processing more samples per load, reducing the total number of loads required.
  • Techniques:
    • Quantization: Reduces the precision of model weights, making them faster to transfer and process.
    • QLoRa: A memory-efficient training method that further reduces memory footprint compared to standard LoRa.
  • Example: The “3090 basement rig” showed significant speed improvements when using these techniques due to a larger achievable batch size.

Impact of Hardware on Optimization Strategies

  • Memory Bandwidth Constraints: On slower machines or those with limited interconnect bandwidth, optimizing for memory bandwidth is crucial. Techniques like quantization and larger batch sizes yield significant gains.

  • High-End Systems:

    • Systems like the H100 with fast NVLink interconnects and ample GPU memory minimize the data transfer bottleneck.
    • Model weights can reside in fast GPU memory, even across multiple GPUs, allowing for rapid loading.
    • Computation becomes the limiting factor, not memory bandwidth.
  • Example: In tests on an H100, increasing batch size or using quantization didn’t significantly impact runtime. The bottleneck was compute speed, not data transfer.

Adapting to Hardware Limitations

  • Understand Your Hardware: Identify whether your system is memory-bound or compute-bound.
  • Optimize Accordingly:
    • Memory-Bound: Prioritize techniques like quantization, QLoRa, and increasing batch size to minimize data transfer.
    • Compute-Bound: Focus on maximizing computational efficiency, as data transfer is less of a concern.

Q&A Session

CPU Offloading

  • Question: Is CPU offloading useful for training or just inference?

  • Answer: CPU offloading can be beneficial for training large models that exceed single GPU memory, even with multiple GPUs. It allows for larger batch sizes by storing model weights on the CPU and copying them to the GPU as needed. However, it’s less effective with smaller models or abundant GPU memory.

  • Recommendation: Start with default settings and gradually introduce optimizations like quantization, batch size increases, data parallelism, and sharding, evaluating performance gains at each step.

Quantization Sweet Spot

  • Question: How to find the optimal balance between compression and accuracy with quantization?

  • Answer: Quantization below 4 bits can negatively impact accuracy. Combining 4-bit quantization with LoRA (adapters) often maintains performance comparable to no quantization.

  • Recommendation: Use 4-bit quantization with LoRA as a starting point. Keep the LoRA in higher precision.

Gradient Accumulation

  • Question: Clarification on gradient accumulation and micro batch size.

  • Answer: Gradient accumulation simulates larger batch sizes by accumulating gradients over multiple smaller “micro batches” before updating model weights. This is useful when target batch sizes exceed GPU memory.

  • Example: To achieve an effective batch size of 32 with a GPU limited to 8 samples, run 4 micro batches of 8 samples each before updating.

  • Recommendation: Use gradient accumulation to reach a reasonable effective batch size (16, 32, 64), particularly when training with small batch sizes (1 or 2).

Multiple LoRAs

  • Question: Do multiple LoRAs update the same or different model parameters?

  • Answer: Typically, multiple LoRAs update the same weights. Each LoRA targets specific layers defined in its configuration.


  • Balancing Act: Fine-tuning involves minimizing data movement and maximizing computation efficiency.
  • Memory Management:
    • LoRA: Reduces memory by training a smaller set of parameters.
    • Quantization: Stores weights using fewer bits.
    • Experimentation: Key to finding the right trade-offs between memory, speed, and accuracy.
  • Memory Allocation:
    • Model Size: Larger models require more memory for weights and gradients.
    • Batch Size and Context Length: Increasing either requires more memory for activations and intermediate values.
  • Memory Optimization Strategies:
    • Reduce context length if possible.
    • Quantize weights.
    • Offload to CPU.
  • Practical Estimation: Instead of complex formulas, estimate memory requirements based on measurements from a single layer and extrapolate for the entire model.