Notes on Transformers Book Ch. 11

Chapter 11 explores scaling up transformers, methods to make self-attention more efficient, and multimodal transformers.

April 26, 2022

import transformers
import datasets
import accelerate

# Only print error messages
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()

transformers.__version__, datasets.__version__, accelerate.__version__
    ('4.18.0', '2.1.0', '0.5.1')

import ast
import astor
import inspect
import textwrap
def print_source(obj, exclude_doc=True):
    # Get source code
    source = inspect.getsource(obj)
    # Remove any common leading whitespace from every line
    cleaned_source = textwrap.dedent(source)
    # Parse the source into an AST node.
    parsed = ast.parse(cleaned_source)

    for node in ast.walk(parsed):
        # Skip any nodes that are not class or function definitions
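        if not isinstance(node, (ast.FunctionDef, ast.ClassDef, ast.AsyncFunctionDef)):
            continue
        # Drop the first statement of the body, assumed to be the docstring
        if exclude_doc and len(node.body) > 1: node.body = node.body[1:]

    # Convert the modified AST back to source code and print it
    print(astor.to_source(parsed))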

Scaling Transformers

  • The Bitter Lesson
    • Richard Sutton argued that general methods that leverage computation are far more effective in AI than methods that leverage domain knowledge.
    • The human knowledge approach tends to complicate things, making them less suited to taking advantage of general methods leveraging computation.
    • Search methods and learning methods seem to scale arbitrarily with computation power.
  • Large language models perform better on downstream tasks.
  • Interesting capabilities like zero-shot and few-shot learning emerge in the 10 to 100-billion parameter range.
  • Computing power and training data must also scale with parameter count.
  • Large language models like GPT-3 are estimated to cost $4.6 million to train.
  • The high cost of training large models means we need a way to estimate the model’s performance in advance.
  • Scaling Laws for Neural Language Models
    • The performance of language models appears to obey a power-law relationship with model size and other factors.

import pandas as pd
import matplotlib.pyplot as plt

Plot parameter counts over time for prominent Transformer architectures

model_data = [
    {'date': '12-06-2017', 'name': 'Transformer', 'size': 213*1e6},
    {'date': '11-06-2018', 'name': 'GPT', 'size': 110*1e6},
    {'date': '11-10-2018', 'name': 'BERT', 'size': 340*1e6},
    {'date': '14-02-2019', 'name': 'GPT-2', 'size': 1.5*1e9},
    {'date': '23-10-2019', 'name': 'T5', 'size': 11*1e9},
    {'date': '17-09-2019', 'name': 'Megatron', 'size': 8.3*1e9},
    {'date': '13-02-2020', 'name': 'Turing-NLG', 'size': 17*1e9},
    {'date': '30-06-2020', 'name': 'GShard', 'size': 600*1e9},
    {'date': '28-05-2020', 'name': 'GPT-3', 'size': 175*1e9},
    {'date': '11-01-2021', 'name': 'Switch-C', 'size': 1.571*1e12},
]

def label_point(x, y, val, ax):
    a = pd.concat({"x": x, "y": y, "val": val}, axis=1)
    for i, point in a.iterrows():
        # Write each model name next to its point
        ax.text(point["x"], point["y"], point["val"])

df_lm = pd.DataFrame.from_records(model_data)
df_lm["date"] = pd.to_datetime(df_lm["date"], dayfirst=True)

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
df_lm.plot(x="date", y="size", kind="scatter", s=15, ax=ax)
label_point(df_lm["date"], df_lm["size"], df_lm["name"], ax)
ax.set_xlabel("Release date")
ax.set_ylabel("Number of parameters")


Scaling Laws

  • Scaling laws allow us to empirically quantify the “bigger is better” paradigm for language models by studying their behavior with varying compute budgets \(C\), dataset sizes \(D\), and model sizes \(N\).
  • We measure dataset size in the number of tokens.
  • The model size excludes parameters from the embedding layers.
  • We chart the dependence of the cross-entropy loss on these three factors to determine if a relationship emerges.
  • Scaling laws imply that increasing compute budget, dataset size, and model size in tandem is more productive than architectural tweaks or hyperparameter optimization to improve performance.
  • The test loss has a power-law relationship with computation budget, dataset size, and model size across several orders of magnitude.
  • We can express \(L\left( X \right) \sim 1/X^{\alpha}\) for \(X = N, C, D\) where \(\alpha\) is a scaling exponent determined by a fit to the loss curve.
  • These power laws mean we can extrapolate the early part of a loss curve to predict the approximate loss from training longer.
  • Larger models can achieve the same performance as smaller models with fewer training steps.
  • Scaling laws are also present for other modalities like images, videos, and mathematical problem-solving.
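The extrapolation idea in the list above can be sketched with a straight-line fit in log-log space. The data below is synthetic, and the exponent and constant (`true_alpha`, `n_c`) are arbitrary values assumed for illustration, not results from these notes.

```python
import numpy as np

# Synthetic losses following L(N) = (N_c / N) ** alpha (assumed values, illustration only)
true_alpha, n_c = 0.076, 8.8e13
model_sizes = np.logspace(6, 9, num=10)            # 1M to 1B parameters
losses = (n_c / model_sizes) ** true_alpha

# A power law is a straight line in log-log space:
# log L = -alpha * log N + alpha * log N_c
slope, intercept = np.polyfit(np.log(model_sizes), np.log(losses), deg=1)
fitted_alpha = -slope

def predict_loss(n):
    """Extrapolate the fitted line to a model size outside the fitted range."""
    return float(np.exp(intercept + slope * np.log(n)))

predicted = predict_loss(1e11)
print(f"alpha = {fitted_alpha:.3f}, predicted loss at 1e11 parameters = {predicted:.3f}")
```

With real training runs the points are noisy, so the fitted exponent is only an estimate and the extrapolation should be read as approximate.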

Challenges with Scaling

  • Provisioning and managing hundreds or thousands of GPU nodes typically requires specialized engineers familiar with running large-scale, distributed experiments.
  • Most companies cannot afford the teams and resources to train models at the largest scales.
  • A recently proposed distributed deep learning framework enables smaller groups to pool their computational resources and pre-train models.
  • Large models require large, high-quality datasets.
    • It is hard to curate only high-quality training examples when the dataset contains terabytes of text.
    • We need a way to control common biases in the dataset to prevent the model from inheriting them.
    • There are potential licensing issues when using large-scale web-text corpora.
    • Large-scale text datasets might contain personal information.
  • Evaluating trained models on downstream tasks requires additional time and resources.
    • We need to probe the model for biased and toxic output, even when using a cleaned dataset.
  • Optimization approaches like distillation, pruning, and quantization might not be enough when starting with a model that is hundreds of gigabytes in size.
  • OpenAI API
  • Hugging Face Accelerated Inference API
  • BigScience is a one-year-long research workshop meant to foster discussions and reflections on the research questions surrounding large language models, the challenges of creating and sharing them, and datasets used for research.
    • The collaborative tasks involve creating, sharing, and evaluating a massive multilingual dataset and language model.
  • EleutherAI is a decentralized collective of volunteers focused on AI alignment, scaling, and open-source AI research.

Attention Please!

  • Self-attention involves performing pairwise comparisons of all the tokens in a sequence, which becomes a computational bottleneck.
  • The self-attention layer of the Transformer architecture naively scales like \(O(n^{2})\), where n is the length of the sequence.
  • A recent paper from Google shows we can reduce the memory complexity to \(O \left( \log{n} \right)\) via a simple reordering of the operations.
  • Much of the recent research on transformers focuses on making self-attention more efficient.
  • Common approaches to making attention more efficient involve introducing sparsity into the attention mechanism or applying kernels to the attention matrix.
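To make the quadratic cost concrete, here is a minimal NumPy sketch of naive scaled dot-product self-attention. The sizes and random inputs are assumptions chosen for illustration; the point is that the score matrix has one entry per pair of tokens.

```python
import numpy as np

def naive_self_attention(q, k, v):
    """Scaled dot-product attention that materializes the full n x n score matrix."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # shape (n, n): one score per token pair
    scores -= scores.max(axis=-1, keepdims=True)     # subtract the row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 8, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out, weights = naive_self_attention(q, k, v)
print(out.shape, weights.shape)

# Doubling the sequence length quadruples the number of score entries
for seq_len in (512, 1024, 2048):
    print(seq_len, seq_len ** 2)
```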

Sparse Attention

  • We can reduce the number of computations performed in the self-attention layer by limiting the number of query-key pairs it generates according to a predefined pattern.
  • There are a handful of popular “atomic” sparsity patterns.
  • Global attention defines a few tokens in the sequence that are allowed to attend to all others.
  • Band attention computes attention over a diagonal band.
  • Dilated attention skips some query-key pairs using a dilated window with gaps.
  • Random attention samples a few keys for each query to compute attention scores.
  • Block local attention divides the sequence into blocks and restricts attention to within these blocks.
  • Most transformer models with sparse attention use a mix of atomic sparsity patterns to generate the final attention matrix.
  • Models like Longformer use a mix of global and band attention, while BigBird adds random attention.
  • Introducing sparsity into the attention matrix enables models to process much longer sequences.
  • It is also possible to learn the sparsity pattern by clustering the tokens into chunks.
    • Reformer uses a hash function to cluster similar tokens.
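The atomic patterns listed above can be sketched as boolean masks over the attention matrix, where `True` marks a query-key pair that is computed. All function names and parameters here are hypothetical, and the global pattern is written in a symmetric form (global tokens attend everywhere and are attended to by everyone), which is an assumption.

```python
import numpy as np

def band_mask(n, width):
    """Each query attends to keys within `width` positions of itself."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= width

def global_mask(n, global_tokens):
    """Symmetric variant (assumed): global tokens attend to and are attended by all positions."""
    mask = np.zeros((n, n), dtype=bool)
    mask[global_tokens, :] = True
    mask[:, global_tokens] = True
    return mask

def dilated_mask(n, width, dilation):
    """Band attention with gaps: only every `dilation`-th offset inside the window is kept."""
    idx = np.arange(n)
    offset = np.abs(idx[:, None] - idx[None, :])
    return (offset <= width * dilation) & (offset % dilation == 0)

def random_mask(n, keys_per_query, seed=0):
    """Each query attends to a few keys sampled without replacement."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, rng.choice(n, size=keys_per_query, replace=False)] = True
    return mask

def block_local_mask(n, block_size):
    """Attention is restricted to positions inside the same block."""
    blocks = np.arange(n) // block_size
    return blocks[:, None] == blocks[None, :]

n = 16
band_plus_global = band_mask(n, width=2) | global_mask(n, [0])
with_random = band_plus_global | random_mask(n, keys_per_query=2)
print(band_plus_global.sum(), with_random.sum(), n * n)
```

Combining masks with `|` mirrors how the models above mix patterns; the printed counts show how many of the `n * n` pairs remain.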

Linearized Attention

  • Linearized attention involves changing the order of operations for computing attention scores.
  • We compute the self-attention score of the queries and keys using a similarity function like the dot product.
  • For a general similarity function \(sim \left( Q_{i}, K_{j} \right)\), we can express the attention outputs as the following equation: \[y_{i} = \sum_{j}{\frac{sim \left( Q_{i}, K_{j} \right)}{\sum_{k}{sim\left( Q_{i}, K_{k} \right)}}V_{j}}\]
  • The trick behind linearized attention mechanisms is to express the similarity function as a kernel function that decomposes the operation into two pieces: \[sim \left( Q_{i}, K_{j} \right) = \phi \left(Q_{i} \right)^{T} \phi \left( K_{j} \right)\]
  • where \(\phi\) is typically a high-dimensional feature map.
  • \(\phi \left( Q_{i} \right)\) is independent of \(j\) and \(k\), so we can pull it out of the sums to write the attention output as follows: \[y_{i} = \frac{\phi \left(Q_{i} \right)^{T} \sum_{j}{\phi \left( K_{j} \right)} V_{j}^{T}}{\phi \left(Q_{i} \right)^{T} \sum_{k}{\phi \left( K_{k} \right)}}\]
  • By first computing \(\sum_{j}{\phi \left( K_{j} \right)} V_{j}^{T}\) and \(\sum_{k}{\phi \left( K_{k} \right)}\), we can effectively linearize the space and time complexity of self-attention.
  • Popular methods that implement linearized self-attention include Linear Transformer and Performer.
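The reordering described above can be checked numerically. The sketch below assumes the feature map \(\phi(x) = \mathrm{elu}(x) + 1\); that choice, the function names, and the tensor sizes are all assumptions made for illustration. Both functions compute the same output, but only the first one builds the n x n similarity matrix.

```python
import numpy as np

def feature_map(x):
    """phi(x) = elu(x) + 1, a positive feature map (this choice is an assumption)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def quadratic_attention(q, k, v):
    """Materializes the n x n matrix sim(Q_i, K_j) = phi(Q_i)^T phi(K_j)."""
    sim = feature_map(q) @ feature_map(k).T              # shape (n, n)
    return (sim / sim.sum(axis=-1, keepdims=True)) @ v

def linearized_attention(q, k, v):
    """Computes the key/value sums once, then reuses them for every query."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                                     # sum_j phi(K_j) V_j^T, shape (d, d_v)
    k_sum = phi_k.sum(axis=0)                            # sum_k phi(K_k), shape (d,)
    return (phi_q @ kv) / (phi_q @ k_sum)[:, None]

rng = np.random.default_rng(0)
n, d = 32, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
same = np.allclose(quadratic_attention(q, k, v), linearized_attention(q, k, v))
print(same)
```

The linearized version never stores anything larger than `(d, d_v)` besides the inputs, which is where the linear scaling in sequence length comes from.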

Going Beyond Text

  • Developing effective strategies for common textual tasks like classification and question answering allows us to address many types of real-world problems.

Limitations to using text

Human reporting bias

  • The frequencies of events in the training text may not represent their actual frequencies.
  • A model trained exclusively on text from the internet might have a distorted image of the world.

Common Sense

  • Most people do not write down reasoning that rests on common sense.
  • Language models trained on text might know many facts about the world but lack basic common-sense reasoning.


  • A probabilistic language model cannot reliably store facts and can produce factually incorrect text.
  • Such models can detect named entities but have no direct way to access information about them.


  • Language models can’t connect to other modalities, such as audio, visual signals, or tabular data, that might address some of these limitations.


  • Transformers are now achieving efficiency similar to or better than Convolutional Neural Networks (CNNs).


  • iGPT (short for image GPT) uses the GPT architecture and autoregressive pretraining objective to predict future pixel values by viewing images as sequences of pixels.
  • Generative Pretraining From Pixels
  • Pretraining on large image datasets enables iGPT to “autocomplete” partial images.
  • iGPT achieves strong results on classification tasks when using a classification head.


  • Vision Transformer (ViT) is a BERT-style take on transformers for vision.
  • We split the image into smaller patches and then embed each of these patches with a linear projection.
  • We combine the patch embeddings with position embeddings and feed them through an ordinary transformer encoder.
  • We mask or distort some of the patches during training, and the objective is to predict the average color of the masked patch.
  • This approach did not produce better results when pretrained on the standard ImageNet dataset, but it scaled significantly better than Convolutional Neural Networks on larger datasets.
  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  • The Hugging Face Transformers library includes Vision Transformer.
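The patch-splitting and embedding steps described above can be sketched in NumPy. The image is random data, the projection and position embeddings are random stand-ins for learned weights, and the patch size and hidden size are assumed values; `patchify` is a hypothetical helper, not a library function.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rows, cols = h // patch_size, w // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)           # (rows, cols, patch, patch, C)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

rng = np.random.default_rng(0)
fake_image = rng.random((224, 224, 3))                   # stand-in for a real image
patch_size, hidden_size = 16, 768                        # assumed sizes

patches = patchify(fake_image, patch_size)
# Random stand-ins for the learned linear projection and position embeddings
projection = rng.normal(scale=0.02, size=(patches.shape[1], hidden_size))
position_embeddings = rng.normal(scale=0.02, size=(patches.shape[0], hidden_size))
embeddings = patches @ projection + position_embeddings
print(patches.shape, embeddings.shape)
```

The resulting sequence of patch embeddings is what an ordinary transformer encoder would consume in place of token embeddings.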

from PIL import Image
import matplotlib.pyplot as plt

Load an image of a dog

image = Image.open("dog.jpg")


import pandas as pd
pd.set_option('max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
from transformers import pipeline


Create an image classification pipeline

image_classifier = pipeline("image-classification")

Get the model architecture



  • Documentation
  • Create a ViT Model transformer with an image classification head for ImageNet.

Get the link to the Hugging Face model card


View potential Image classes


Perform image classification

preds = image_classifier(image)
preds_df = pd.DataFrame(preds)
          score  label
    0  0.989680  golden retriever
    1  0.002968  Labrador retriever
    2  0.000502  kuvasz
    3  0.000402  Irish setter, red setter
    4  0.000345  tennis ball


  • The model correctly classifies the dog as a Golden Retriever.
  • Video models are a natural extension of image models and add a temporal dimension on top of the spatial dimension.
  • Video tasks are more challenging as the volume of data gets much larger, and we need to deal with an extra dimension.
  • Models such as TimeSformer introduce a spatial and temporal attention mechanism.



Create some sample table data

book_data = [
    {"chapter": 0, "name": "Introduction", "start_page": 1, "end_page": 11},
    {"chapter": 1, "name": "Text classification", "start_page": 12, 
     "end_page": 48},
    {"chapter": 2, "name": "Named Entity Recognition", "start_page": 49,
     "end_page": 73},
    {"chapter": 3, "name": "Question Answering", "start_page": 74, 
     "end_page": 120},
    {"chapter": 4, "name": "Summarization", "start_page": 121, 
     "end_page": 140},
    {"chapter": 5, "name": "Conclusion", "start_page": 141, 
     "end_page": 144}
]

table = pd.DataFrame(book_data)
table['number_of_pages'] = table['end_page']-table['start_page']

Note: We need to make all columns of type str to play nicely with TAPAS.

table = table.astype(str)
       chapter  name                      start_page  end_page  number_of_pages
    0  0        Introduction              1           11        10
    1  1        Text classification       12          48        36
    2  2        Named Entity Recognition  49          73        24
    3  3        Question Answering        74          120       46
    4  4        Summarization             121         140       19
    5  5        Conclusion                141         144       3

Create a table question answering pipeline

table_qa = pipeline("table-question-answering")
    TapasConfig {
      "_name_or_path": "google/tapas-base-finetuned-wtq",
      "aggregation_labels": {
        "0": "NONE",
        "1": "SUM",
        "2": "AVERAGE",
        "3": "COUNT"
      "aggregation_loss_weight": 1.0,
      "aggregation_temperature": 1.0,
      "allow_empty_column_selection": false,
      "answer_loss_cutoff": 0.664694,
      "answer_loss_importance": 1.0,
      "architectures": [
      "attention_probs_dropout_prob": 0.1,
      "average_approximation_function": "ratio",
      "average_logits_per_cell": false,
      "cell_selection_preference": 0.207951,
      "disable_per_token_loss": false,
      "gradient_checkpointing": false,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "huber_loss_delta": 0.121194,
      "init_cell_selection_weights_to_zero": true,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_num_columns": 32,
      "max_num_rows": 64,
      "max_position_embeddings": 1024,
      "model_type": "tapas",
      "no_aggregation_label_index": 0,
      "num_aggregation_labels": 4,
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "pad_token_id": 0,
      "positive_label_weight": 10.0,
      "reset_position_index_per_cell": true,
      "select_one_column": true,
      "softmax_temperature": 1.0,
      "temperature": 0.0352513,
      "transformers_version": "4.18.0",
      "type_vocab_size": [
      "type_vocab_sizes": [
      "use_answer_as_supervision": true,
      "use_gumbel_for_aggregation": false,
      "use_gumbel_for_cells": false,
      "use_normalized_answer_loss": false,
      "vocab_size": 30522

Get the link to the Hugging Face model card




  • Documentation
  • Create a Tapas Model with a cell selection head and optional aggregation head for question answering tasks.

Pass some queries to the model

queries = ["What's the topic in chapter 4?",
           "What is the total number of pages?",
           "On which page does the chapter about question-answering start?",
           "How many chapters have more than 20 pages?"]
preds = table_qa(table, queries)
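for query, pred in zip(queries, preds):
    print(query)
    if pred["aggregator"] == "NONE": 
        print("Predicted answer: " + pred["answer"])
    else:
        # For aggregated answers, the answer string already names the aggregator and its cells
        print("Predicted answer: " + pred["answer"])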
    What's the topic in chapter 4?
    Predicted answer: Summarization
    What is the total number of pages?
    Predicted answer: SUM > 10, 36, 24, 46, 19, 3
    On which page does the chapter about question-answering start?
    Predicted answer: AVERAGE > 74
    How many chapters have more than 20 pages?
    Predicted answer: COUNT > 1, 2, 3

Note:

  • The model predicted exactly one cell with no aggregation for the first query, and the answer is correct.
  • For the second query, the model correctly predicted that we need to sum the individual page counts for each chapter to determine the total number of pages.
  • The model correctly answered question three but included an unnecessary average aggregation.
  • The model correctly determined that chapters 1, 2, and 3 have more than 20 pages.
  • The ability to ask questions in natural language instead of Python code allows a much wider audience to query the data to answer specific questions.
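In the output printed above, aggregated answers list the operator and the selected cells (for example `SUM > 10, 36, 24, 46, 19, 3`) rather than a final number. The helper below is a hypothetical post-processing sketch that assumes exactly that `AGGREGATOR > cell, cell, ...` string format and numeric cells; it is not part of the pipeline.

```python
def apply_aggregation(answer):
    """Turn a string such as 'SUM > 10, 36' into a single value (hypothetical helper)."""
    if " > " not in answer:
        return answer                        # no aggregation: the answer is the selected cell
    aggregator, cells = answer.split(" > ", 1)
    values = [float(cell) for cell in cells.split(", ")]
    if aggregator == "SUM":
        return sum(values)
    if aggregator == "AVERAGE":
        return sum(values) / len(values)
    if aggregator == "COUNT":
        return len(values)
    raise ValueError(f"Unknown aggregator: {aggregator}")

answers = ["Summarization", "SUM > 10, 36, 24, 46, 19, 3", "AVERAGE > 74", "COUNT > 1, 2, 3"]
for answer in answers:
    print(answer, "->", apply_aggregation(answer))
```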

Multimodal Transformers


  • Speaking is more convenient than reading and writing for a significant portion of the population.
  • Automatic speech recognition (ASR) involves converting spoken words to text and enables voice technologies like Siri to answer questions like “What is the weather like today?”.
  • The wav2vec 2.0 family of models is one of the most recent developments in ASR and uses a transformer layer in combination with a CNN.
  • These models leverage unlabeled data to achieve competitive results with only a few minutes of labeled data.
  • The Hugging Face Transformers library includes wav2vec 2.0 models.

Create an automatic speech recognition pipeline

asr = pipeline("automatic-speech-recognition")
    Wav2Vec2Config {
      "_name_or_path": "facebook/wav2vec2-base-960h",
      "activation_dropout": 0.1,
      "adapter_kernel_size": 3,
      "adapter_stride": 2,
      "add_adapter": false,
      "apply_spec_augment": true,
      "architectures": [
      "attention_dropout": 0.1,
      "bos_token_id": 1,
      "classifier_proj_size": 256,
      "codevector_dim": 256,
      "contrastive_logits_temperature": 0.1,
      "conv_bias": false,
      "conv_dim": [
      "conv_kernel": [
      "conv_stride": [
      "ctc_loss_reduction": "sum",
      "ctc_zero_infinity": false,
      "diversity_loss_weight": 0.1,
      "do_stable_layer_norm": false,
      "eos_token_id": 2,
      "feat_extract_activation": "gelu",
      "feat_extract_dropout": 0.0,
      "feat_extract_norm": "group",
      "feat_proj_dropout": 0.1,
      "feat_quantizer_dropout": 0.0,
      "final_dropout": 0.1,
      "gradient_checkpointing": false,
      "hidden_act": "gelu",
      "hidden_dropout": 0.1,
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-05,
      "layerdrop": 0.1,
      "mask_feature_length": 10,
      "mask_feature_min_masks": 0,
      "mask_feature_prob": 0.0,
      "mask_time_length": 10,
      "mask_time_min_masks": 2,
      "mask_time_prob": 0.05,
      "model_type": "wav2vec2",
      "num_adapter_layers": 3,
      "num_attention_heads": 12,
      "num_codevector_groups": 2,
      "num_codevectors_per_group": 320,
      "num_conv_pos_embedding_groups": 16,
      "num_conv_pos_embeddings": 128,
      "num_feat_extract_layers": 7,
      "num_hidden_layers": 12,
      "num_negatives": 100,
      "output_hidden_size": 768,
      "pad_token_id": 0,
      "proj_codevector_dim": 256,
      "tdnn_dilation": [
      "tdnn_dim": [
      "tdnn_kernel": [
      "transformers_version": "4.18.0",
      "use_weighted_layer_sum": false,
      "vocab_size": 32,
      "xvector_output_dim": 512

Get the link to the Hugging Face model card


Note: The model was trained on 960 hours of speech audio.


  • Documentation
  • Create a Wav2Vec2 model with a language modeling head for Connectionist Temporal Classification (CTC).
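A CTC head emits one prediction per audio frame, so the frame-level output has to be collapsed into text. The sketch below shows greedy CTC decoding on a toy example; the vocabulary, the blank index, and the one-hot "logits" are hypothetical and only illustrate the collapse rule (merge repeated tokens, then drop blanks).

```python
import numpy as np

def ctc_greedy_decode(logits, vocab, blank_id=0):
    """Pick the best token per frame, collapse repeats, then drop blank tokens."""
    best_ids = logits.argmax(axis=-1)
    decoded = []
    previous = None
    for token_id in best_ids:
        # Keep a token only when it differs from the previous frame and is not the blank
        if token_id != previous and token_id != blank_id:
            decoded.append(vocab[token_id])
        previous = token_id
    return "".join(decoded)

vocab = ["<blank>", "H", "E", "L", "O"]          # hypothetical toy vocabulary
frame_ids = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]      # H H _ E L L _ L O O
logits = np.eye(len(vocab))[frame_ids]          # one-hot "logits", one row per audio frame
transcript = ctc_greedy_decode(logits, vocab)
print(transcript)
```

The blank between the two runs of `L` is what lets the decoder keep a genuine double letter instead of merging it.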

from datasets import load_dataset

The SUPERB Dataset

  • Hugging Face Dataset Card
  • SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data.

Load the ASR subset of the SUPERB dataset

ds = load_dataset("superb", "asr", split="validation[:1]")
    file            /home/innom-dt/.cache/huggingface/datasets/downloads/extracted/aa91addd71e85ab524e5b5b56fa3d0de777838850cb76ec55ad066e969fd5144/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac
    audio
      path          /home/innom-dt/.cache/huggingface/datasets/downloads/extracted/aa91addd71e85ab524e5b5b56fa3d0de777838850cb76ec55ad066e969fd5144/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac
      array         [0.002380371, 0.0020751953, 0.0019836426, 0.002105713, 0.0016174316, …]
      sampling_rate 16000
    text            MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
    speaker_id      1272
    chapter_id      128104
    id              1272-128104-0000


  • The file column contains the path to the audio sample, and the text column contains the expected transcription.
  • We can use the SoundFile library to read each audio file and convert the audio to an array of floats.

import soundfile as sf

Add a new column storing each audio sample as an array of floats

def map_to_array(batch):
    # Read the audio file; the second return value is the sampling rate
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = ds.map(map_to_array)

Play a sample from the dataset

from IPython.display import Audio

display(Audio(ds[0]['speech'], rate=16000))


Pass the audio sample to the pipeline

pred = asr(ds[0]["speech"])

Note:

  • The words in the transcription are correct, but the punctuation is missing.
  • It is hard to infer punctuation from audio alone, and we could add it in a post-processing step.
  • Building a model for a new language still requires a minimum amount of labeled data, which can be challenging to obtain.
  • A new method named wav2vec-U combines clever clustering and GAN training to build a speech-to-text model using only independent unlabeled speech and unlabeled text data.
  • This method does not require aligned speech and text data, enabling the training of highly performant speech-to-text models for a much larger spectrum of languages.
  • Unsupervised Speech Recognition


Vision and Text

  • There have been several developments in combining visual and textual information.



  • The LayoutLM family of models uses an enhanced Transformer architecture that receives a text sequence, an image, and a layout as input.
  • There are embedding layers associated with each modality, a spatially-aware self-attention mechanism, and a mix of image and text/image pretraining objectives to align the different modalities.
  • LayoutLM models pre-train on millions of scanned documents and can transfer to various downstream tasks, similar to BERT for NLP.
  • LayoutLM models are the current state of the art for analyzing scanned business documents like receipts, invoices, or reports.



  • DALL·E uses the GPT architecture and autoregressive modeling to generate images from text.
  • It regards the words and pixels as one sequence of tokens and can, therefore, continue generating an image from a text prompt.
  • Zero-Shot Text-to-Image Generation


  • Learning Transferable Visual Models From Natural Language Supervision
  • We can use the pretrained model for classification by embedding the possible classes with the text encoder and comparing the class embeddings to the image embedding that we want to classify.
  • We select the class with the highest similarity.
  • CLIP has remarkable zero-shot image classification performance and is competitive with fully supervised-trained vision models while being more flexible.
  • We need to instantiate a processor that contains a feature extractor and a tokenizer for image-to-text tasks.
  • The feature extractor converts the image into a form suitable for the model, while the tokenizer decodes the model predictions into text.
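The zero-shot classification recipe described above (embed the candidate classes, compare them to the image embedding, pick the most similar) can be sketched with toy vectors. The embeddings below are made-up three-dimensional values standing in for real text and image encoder outputs, and `zero_shot_classify` is a hypothetical helper.

```python
import numpy as np

def zero_shot_classify(image_embedding, class_embeddings, class_names):
    """Return the class whose text embedding has the highest cosine similarity to the image."""
    image_embedding = image_embedding / np.linalg.norm(image_embedding)
    class_embeddings = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    similarities = class_embeddings @ image_embedding
    return class_names[int(similarities.argmax())], similarities

# Toy embeddings (assumed values) in place of real encoder outputs
class_names = ["dog", "cat", "car"]
class_embeddings = np.array([[0.9, 0.1, 0.0],
                             [0.1, 0.9, 0.0],
                             [0.0, 0.1, 0.9]])
image_embedding = np.array([0.8, 0.3, 0.1])

label, similarities = zero_shot_classify(image_embedding, class_embeddings, class_names)
print(label, similarities.round(3))
```

Because the classes are just text embeddings, the label set can be changed without retraining, which is the flexibility the list above refers to.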


from transformers import CLIPProcessor, CLIPModel


  • Documentation
  • Create a CLIP processor which wraps a CLIP feature extractor and a CLIP tokenizer into a single processor.


Instantiate a CLIPModel and processor

clip_ckpt = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(clip_ckpt)
processor = CLIPProcessor.from_pretrained(clip_ckpt)


    - feature_extractor: CLIPFeatureExtractor {
      "crop_size": 224,
      "do_center_crop": true,
      "do_normalize": true,
      "do_resize": true,
      "feature_extractor_type": "CLIPFeatureExtractor",
      "image_mean": [
      "image_std": [
      "resample": 3,
      "size": 224
    - tokenizer: PreTrainedTokenizerFast(name_or_path='openai/clip-vit-base-patch32', vocab_size=49408, model_max_len=77, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': '<|endoftext|>'})

Load a test image

image = Image.open("dog.jpg")


import torch

Create some sample image captions

texts = ["a photo of a golden retriever", "a photo of a dog", "a photo of agi"]

Compare the image to the captions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
pd.DataFrame(zip(texts, probs[0].numpy()), columns=['Text', "Probability"])
       Text                           Probability
    0  a photo of a golden retriever  0.868025
    1  a photo of a dog               0.131801
    2  a photo of agi                 0.000174