Notes on Transformers Book Ch. 1

Chapter 1 covers essential advancements for transformers, recurrent architectures, the encoder-decoder framework, attention mechanisms, transfer learning in NLP, and the HuggingFace ecosystem.

Christian Mills


March 30, 2022


Key Advancements

  • Attention Is All You Need
    • published in June 2017 by researchers at Google
    • introduced the Transformer architecture for sequence modeling
    • outperformed Recurrent Neural Networks (RNNs) on machine translation tasks, both in terms of translation quality and training cost
  • Universal Language Model Fine-tuning for Text Classification
    • published in January 2018 by Jeremy Howard and Sebastian Ruder
    • introduced an effective training method called ULMFiT
    • showed that training Long Short-Term Memory Networks (LSTMs) on a very large and diverse corpus could produce state-of-the-art text classifiers with little labeled data
    • inspired other research groups to combine transformers with unsupervised learning
  • Improving Language Understanding with Unsupervised Learning
    • published by OpenAI in June 2018
    • introduced Generative Pretrained Transformer (GPT)
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    • published by researchers at Google in October 2018
  • Combining the Transformer architecture with unsupervised learning removed the need to train task-specific architectures from scratch.
  • Pretrained Transformers broke almost every benchmark in NLP by a significant margin.

Recurrent Architectures

  • Recurrent architectures such as LSTMs were state of the art in Natural Language Processing (NLP) before Transformers.
  • Recurrent architectures contain a feedback loop that allows information to propagate from one step to another.
    • ideal for sequential data like text.
  • A Recurrent Neural Network receives an input token and feeds it through the network.
  • The network outputs a vector called a hidden state, and it passes some information back to itself through a feedback loop.
  • The information passed through the feedback loop allows an RNN to keep track of details from previous steps and use them to make predictions.
  • Many still use recurrent architectures for NLP, speech processing, and time-series tasks.
  • The Unreasonable Effectiveness of Recurrent Neural Networks
    • provides an overview of RNNs and demonstrates how to train a language model on several datasets
  • RNNs were critical in developing systems to translate a sequence of words from one language to another.
    • known as machine translation
  • The computations for recurrent models are inherently sequential and cannot be parallelized across the input.
    • The inability to parallelize computations is a fundamental shortcoming of recurrent models.
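
The feedback loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a simple tanh update rule and made-up weights (`w_x`, `w_h`, and the helper names are hypothetical, not from the book). The loop in `run_rnn` shows why the computation is sequential: each step needs the hidden state from the step before it.

```python
import math

def rnn_step(x, h_prev, w_x, w_h):
    """One recurrent step: h_t = tanh(w_x @ x + w_h @ h_prev)."""
    h_new = []
    for i in range(len(h_prev)):
        total = sum(w_x[i][j] * x[j] for j in range(len(x)))
        total += sum(w_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, w_x, w_h, hidden_size):
    """Process a sequence one token at a time, carrying the hidden state forward."""
    h = [0.0] * hidden_size
    states = []
    for x in inputs:  # step t cannot start until step t-1 has produced h
        h = rnn_step(x, h, w_x, w_h)
        states.append(h)
    return states

# Toy weights and inputs (illustrative values, not learned)
w_x = [[0.5, -0.2], [0.1, 0.3]]
w_h = [[0.4, 0.0], [0.0, 0.4]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = run_rnn(inputs, w_x, w_h, hidden_size=2)
last_hidden_state = states[-1]
```

Feeding the same inputs in a different order produces a different final hidden state, which is the sense in which the network keeps track of earlier steps.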

The Encoder-Decoder Framework

  • Sequence to Sequence Learning with Neural Networks
    • published in 2014 by researchers at Google
  • An encoder-decoder is also known as a sequence-to-sequence architecture.
  • This type of architecture is well-suited for situations where the input and output are both sequences of arbitrary length.
  • An encoder encodes information from an input sequence into a numerical representation.
    • This numerical representation is often called the last hidden state.
  • The decoder uses the numerical representation to generate the output sequence.
  • The encoder and decoder can use any neural network architecture that can model sequences.
  • The final hidden state of the encoder is the only information the decoder has access to when generating the output.
    • It has to represent the meaning of the whole input sequence.
    • This requirement creates an information bottleneck that can limit performance for longer sequences.

Attention Mechanisms

  • Attention mechanisms allow the decoder to access all of the encoder’s hidden states, not just the last one.
  • The decoder assigns a different amount of weight, called attention, to each state at every decoding time-step.
  • The attention values allow the decoder to prioritize which encoder state to use.
  • Attention-based models focus on which input tokens are most relevant at each time step.
    • They can learn nontrivial alignments between the words in a generated translation and those in a source sentence.
      • Example: An attention-based decoder can align the words “zone” and “Area” even when ordered differently in the source sentence and the translation.
  • Transformers use a special kind of attention called self-attention and do not use any form of recurrence.
    • Self-attention operates on all states in the same layer of a neural network.
    • The outputs of the self-attention mechanisms serve as input to feed-forward networks.
    • This architecture trains much faster than recurrent models.
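
A rough sketch of how a decoder could weight the encoder states at one decoding step. It assumes dot-product scoring scaled by the square root of the state size, followed by a softmax; that is one common choice rather than the only way to compute attention, and the vectors are toy values. The names `attend`, `encoder_states`, and `decoder_state` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Return attention weights and the weighted sum of the encoder states."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * s for q, s in zip(query, state)) / scale
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Toy hidden states (illustrative values only)
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_states)
```

The decoder state here points along the second dimension, so the second and third encoder states receive more weight than the first. Unlike the recurrent loop, the score for each encoder state does not depend on the others, so the scores can be computed in parallel.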

Transfer Learning in NLP

  • Transfer learning involves using the knowledge a model learned from a previous task on a new one.
    • Computer vision models first train on large-scale datasets such as ImageNet to learn the basic features of images before being fine-tuned on a downstream task.
    • It was not initially clear how to perform transfer learning for NLP tasks.
  • Fine-tuned models are typically more accurate than supervised models trained from scratch on the same amount of labeled data.
  • We adapt a pretrained model to a new task by splitting the model into a body and a head.
  • The head is the task-specific portion of the network.
  • The body contains broad features from the source domain learned during training.
  • We can use the body weights to initialize a new model head for a new task.
  • Transfer learning typically produces high-quality models that we can efficiently train on many downstream tasks.
  • The ULMFiT paper provided a general framework to perform transfer learning with NLP models.
    1. A model first trains to predict the next word based on those preceding it in a large-scale generic corpus to learn the basic features of the language.
      • This task is called language modeling.
    2. The pretrained model then trains on the same task using an in-domain corpus.
    3. Lastly, we fine-tune the model with a classification layer for the target task.
  • The ULMFiT transfer learning framework provided the missing piece for transformers to take off.
  • Both GPT and BERT combine self-attention with transfer learning and set a new state of the art across many NLP benchmarks.
  • GPT only uses the decoder part of the Transformer architecture and the same language modeling approach as ULMFiT.
  • BERT uses the encoder part of the Transformer architecture and a form of language modeling called masked language modeling.
    • Masked language modeling requires the model to fill in randomly missing words in a text.
  • GPT trained on the BookCorpus dataset while BERT trained on the BookCorpus dataset and English Wikipedia.
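
The two pretraining objectives mentioned above differ in how they build training examples from raw text. The sketch below is a simplified illustration with whitespace tokens and hypothetical helper names. The `[MASK]` placeholder and the masking probabilities are assumptions for the example, and real masking procedures are more involved than replacing tokens at random.

```python
import random

MASK = "[MASK]"

def next_word_examples(tokens):
    """Language modeling: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Masked language modeling: hide random tokens for the model to restore."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # remember what was hidden at this position
        else:
            masked.append(tok)
    return masked, targets

tokens = "the transformer architecture changed natural language processing".split()

causal = next_word_examples(tokens)
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

In the first case, the model only ever sees the words to the left of the target. In the second, it sees context on both sides of each hidden word.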

Bridging the Gap With Hugging Face Transformers

  • Applying a novel machine learning architecture to a new application can be complicated and requires custom logic for each model and task.
    1. Implement the model architecture in code.
      • PyTorch and TensorFlow are the most common frameworks for this.
    2. Load pretrained weights if available.
    3. Preprocess the inputs, pass them through the model, and apply task-specific postprocessing.
    4. Implement data loaders and define loss functions and optimizers to train the model.
  • Code released by research groups is rarely standardized and often requires days of engineering to adapt to new use cases.
    • Different research labs often release their models in incompatible frameworks, making it difficult for practitioners to port these models to their applications.
  • Hugging Face Transformers provides a standardized interface to a wide range of transformer models, including code and tools to adapt these models to new applications.
    • The availability of a standardized interface catalyzed the explosion of research into transformers and made it easy for NLP practitioners to integrate these models into real-life applications.
  • The library supports the PyTorch, TensorFlow, and JAX deep learning frameworks and provides task-specific model heads to fine-tune transformers on downstream tasks.

A Tour of Transformer Applications

  • Hugging Face Transformers has a layered API that allows users to interact with the library at various levels of abstraction.


  • Pipelines abstract away all the steps needed to convert raw text into a set of predictions from a fine-tuned model.
  • Hugging Face provides pipelines for several tasks.
  • Instantiate a pipeline by calling the pipeline() function and providing the name of the desired task.
    • 'audio-classification'
    • 'automatic-speech-recognition'
    • 'feature-extraction'
    • 'text-classification'
    • 'token-classification'
    • 'question-answering'
    • 'table-question-answering'
    • 'fill-mask'
    • 'summarization'
    • 'translation'
    • 'text2text-generation'
    • 'text-generation'
    • 'zero-shot-classification'
    • 'conversational'
    • 'image-classification'
    • 'object-detection'
  • The names for the supported tasks are available in the transformers.pipelines.SUPPORTED_TASKS dictionary.
  • The pipeline automatically downloads the model weights for the selected task and caches them for future use.
  • Each pipeline takes a string of text or a list of strings as input and returns a list of predictions.
    • Each prediction is a Python dictionary that includes the corresponding confidence score.
import transformers
import datasets
import pandas as pd
from transformers import pipeline

<function transformers.pipelines.pipeline(task: str, model: Optional = None, config: Union[str, transformers.configuration_utils.PretrainedConfig, NoneType] = None, tokenizer: Union[str, transformers.tokenization_utils.PreTrainedTokenizer, NoneType] = None, feature_extractor: Union[str, ForwardRef('SequenceFeatureExtractor'), NoneType] = None, framework: Optional[str] = None, revision: Optional[str] = None, use_fast: bool = True, use_auth_token: Union[bool, str, NoneType] = None, model_kwargs: Dict[str, Any] = {}, **kwargs) -> transformers.pipelines.base.Pipeline>
{'sentiment-analysis': 'text-classification', 'ner': 'token-classification'}
Sample Text: Customer Review
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

Text Classification Pipeline

  • Documentation
  • The text-classification pipeline supports sentiment analysis, multiclass, and multilabel classification and performs sentiment analysis by default.
classifier = pipeline("text-classification")
# Classify the customer review as positive or negative
outputs = classifier(text)
# Display the predictions as a table
pd.DataFrame(outputs)
      label     score
0  NEGATIVE  0.901546

Named Entity Recognition Pipeline

  • Documentation
  • Named entity recognition (NER) involves extracting real-world objects like products, places, and people from a piece of text.
  • Default Entity Labels
    • MISC: Miscellaneous
    • PER: Person
    • ORG: Organization
    • LOC: Location
# Create a named entity recognizer that groups words according to the model's predictions
ner_tagger = pipeline("ner", aggregation_strategy="simple")

Note: The simple aggregation strategy might end up splitting words undesirably.

{0: 'O',
 1: 'B-MISC',
 2: 'I-MISC',
 3: 'B-PER',
 4: 'I-PER',
 5: 'B-ORG',
 6: 'I-ORG',
 7: 'B-LOC',
 8: 'I-LOC'}
outputs = ner_tagger(text)
# Display the predicted entities as a table
pd.DataFrame(outputs)
  entity_group     score           word  start  end
0          ORG  0.879010         Amazon      5   11
1         MISC  0.990859  Optimus Prime     36   49
2          LOC  0.999755        Germany     90   97
3         MISC  0.556570           Mega    208  212
4          PER  0.590256         ##tron    212  216
5          ORG  0.669693         Decept    253  259
6         MISC  0.498349        ##icons    259  264
7         MISC  0.775362       Megatron    350  358
8         MISC  0.987854  Optimus Prime    367  380
9          PER  0.812096      Bumblebee    502  511

Note: The words Megatron and Decepticons were each split into two separate entities.

Note: The ## symbols are produced by the model’s tokenizer.

pd.DataFrame(ner_tagger.tokenizer.vocab, index=[0]).T.head(10)
Rees 24646
seeded 14937
Ruby 11374
Libraries 27927
foil 20235
collapsed 7322
membership 5467
Birth 20729
Texans 25904
Saul 18600
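
The ## prefix marks a token that continues the previous one. As a rough sketch, assuming that convention, the pieces can be glued back into whole words. The helper `merge_wordpieces` is hypothetical and is not how the pipeline's aggregation strategies work internally.

```python
def merge_wordpieces(tokens, prefix="##"):
    """Join continuation pieces (those starting with the prefix) onto the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith(prefix) and words:
            words[-1] += tok[len(prefix):]
        else:
            words.append(tok)
    return words

pieces = ["Mega", "##tron", "and", "the", "Decept", "##icons"]
print(merge_wordpieces(pieces))
# ['Megatron', 'and', 'the', 'Decepticons']
```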

Question Answering Pipeline

  • Documentation
  • Question answering involves having a model find the answer to a specified question using a given passage of text.
reader = pipeline("question-answering")
question = "What does the customer want?"
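outputs = reader(question=question, context=text)
# The pipeline returns a single dictionary, so wrap it in a list for display
pd.DataFrame([outputs])
      score  start  end                   answer
0  0.631291    335  358  an exchange of Megatron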

Note: This particular kind of question answering is called extractive question answering. The answer is extracted directly from the text.
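
The start and end values in the output are character offsets into the context, so the answer is simply a slice of the original text. A small sketch of that relationship, using a shortened context. Here `str.find` stands in for the model, which would predict the offsets in a real pipeline, and `extract_answer` is a hypothetical helper.

```python
def extract_answer(context, start, end):
    """Extractive QA returns a span of the context, not newly generated text."""
    return context[start:end]

context = (
    "To resolve the issue, I demand an exchange of Megatron "
    "for the Optimus Prime figure I ordered."
)

# Stand-in for the model's predicted span
answer_text = "an exchange of Megatron"
start = context.find(answer_text)
end = start + len(answer_text)

prediction = {
    "start": start,
    "end": end,
    "answer": extract_answer(context, start, end),
}
```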

Summarization Pipeline

  • Documentation
  • Text summarization involves generating a short version of a long passage of text while retaining all the relevant facts.
  • Tasks requiring a model to generate new text are more challenging than extractive ones.
summarizer = pipeline("summarization")
# Limit the generated summary to 45 tokens
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])
 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead.

Note: The model captured the essence of the customer message but directly copied some of the original text.

Translation Pipeline

  • Documentation
  • The model generates a translation of a piece of text in the target language.
# Create a translator that translates English to German
# Override the default model selection
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
# Require the model to generate a translation at least 100 tokens long
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])
Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Anbei sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, bald von Ihnen zu hören. Aufrichtig, Bumblebee.

Note: The model supposedly did a good job translating the text. (I don’t speak German.)

Text Generation Pipeline

  • Documentation
  • The model generates new text to complete a provided text prompt.
from transformers import set_seed
# Set the random seed to get reproducible results (42 is an arbitrary choice)
set_seed(42)
generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])
Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. The order was completely mislabeled, which is very common in our online store, but I can appreciate it because it was my understanding from this site and our customer service of the previous day that your order was not made correct in our mind and that we are in a process of resolving this matter. We can assure you that your order

The Hugging Face Ecosystem

  • Hugging Face Transformers is surrounded by an ecosystem of helpful tools that support the modern machine learning workflow.
  • This ecosystem consists of a family of code libraries and a hub of pretrained model weights, datasets, scripts for evaluation, and other resources.

The Hugging Face Hub

  • The Hub hosts over 20,000 freely available models plus datasets and scripts for computing metrics.
  • Model and dataset cards document the contents of the models and datasets.
  • Filters for tasks, frameworks, datasets, and more help users navigate the Hub quickly.
  • Users can directly try out any model through task-specific widgets.

Hugging Face Tokenizers

  • Documentation
  • Tokenizers split the raw input text into smaller pieces called tokens.
  • Tokens can be words, parts of words, or single characters.
  • Hugging Face Tokenizers takes care of all the preprocessing and postprocessing steps, such as normalizing the inputs and transforming the model outputs to the required format.
  • The Tokenizers library uses a Rust backend for fast tokenization.
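
To make the idea of splitting words into parts concrete, here is a toy greedy longest-match tokenizer in plain Python. It is a sketch under stated assumptions: the vocabulary is made up, the ## continuation prefix and `[UNK]` token are conventions chosen for the example, and `wordpiece_tokenize` is a hypothetical helper. The Tokenizers library implements its own algorithms and does not work this way line for line.

```python
def wordpiece_tokenize(word, vocab, prefix="##", unk="[UNK]"):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary
toy_vocab = {"Mega", "##tron", "Decept", "##icons", "Optimus"}

print(wordpiece_tokenize("Megatron", toy_vocab))   # ['Mega', '##tron']
print(wordpiece_tokenize("Optimus", toy_vocab))    # ['Optimus']
print(wordpiece_tokenize("Bumblebee", toy_vocab))  # ['[UNK]']
```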

Hugging Face Datasets

  • Documentation
  • The Datasets library provides a standard interface for thousands of datasets to simplify loading, processing, and storing datasets.
  • Smart caching removes the need to perform preprocessing steps each time you run your code.
  • Memory mapping helps avoid RAM limitations by storing the contents of a file in virtual memory and enables multiple processes to modify the file more efficiently.
  • The library is interoperable with frameworks like Pandas and NumPy.
  • Scripts are available for many metrics to help make experiments more reproducible and trustworthy.
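
Memory mapping in general can be tried out with Python's standard library. The sketch below uses the `mmap` module on a small temporary file as a stand-in for a large dataset file. It only illustrates the idea of reading a slice of a file through virtual memory and is not how the Datasets library stores or loads data.

```python
import mmap
import os
import tempfile

# Write a small file to disk to stand in for a large dataset file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"row-0\nrow-1\nrow-2\n")
    path = f.name

with open(path, "rb") as f:
    # Map the file into virtual memory; the OS loads pages on demand
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slice the bytes of the second row without calling f.read()
        second_row = mm[6:11]

# Clean up the temporary file
os.remove(path)

print(second_row)  # b'row-1'
```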

Hugging Face Accelerate

  • Documentation
  • The Accelerate library adds a layer of abstraction to training loops, which takes care of all the custom logic necessary for the training infrastructure.

Main Challenges with Transformers


Language

  • It is hard to find pretrained models for languages other than English.

Data Availability

  • Even with transfer learning, transformers still need a lot of data compared to humans to perform a task.

Working With Long Documents

  • Self-attention becomes computationally expensive when working on full-length documents.
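
A quick back-of-the-envelope calculation shows where the cost comes from. In standard self-attention, every token attends to every token, so the number of attention scores grows with the square of the sequence length. The helper name below is hypothetical, and the count is per attention head and per layer.

```python
def attention_score_count(seq_len):
    """Number of pairwise attention scores in full self-attention."""
    return seq_len * seq_len

for n in (512, 2048, 8192):
    print(n, attention_score_count(n))
# 512 262144
# 2048 4194304
# 8192 67108864
```

Making the input 16 times longer (512 to 8192 tokens) multiplies the number of scores by 256.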


Opacity

  • It is hard or impossible to determine precisely why a model made a given prediction.


Bias

  • Biases present in the training data imprint into the model.

