Notes on Transformers Book Ch. 1

Tags: ai, huggingface, nlp, notes
Chapter 1 covers the key advancements behind transformers, recurrent architectures, the encoder-decoder framework, attention mechanisms, transfer learning in NLP, and the Hugging Face ecosystem.
Author: Christian Mills

Published: March 30, 2022

This post is part of the series: Notes on Transformers Book.

Key Advancements

  • Attention Is All You Need
    • published in June 2017 by researchers at Google
    • introduced the Transformer architecture for sequence modeling
    • outperformed Recurrent Neural Networks (RNNs) on machine translation tasks, both in terms of translation quality and training cost
  • Universal Language Model Fine-tuning for Text Classification
    • published in January 2018 by Jeremy Howard and Sebastian Ruder
    • introduced an effective training method called ULMFiT
    • showed that training Long Short-Term Memory Networks (LSTMs) on a very large and diverse corpus could produce state-of-the-art text classifiers with little labeled data
    • inspired other research groups to combine transformers with unsupervised learning
  • Improving Language Understanding with Unsupervised Learning
    • published by OpenAI in June 2018
    • introduced Generative Pretrained Transformer (GPT)
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    • published by researchers at Google in October 2018
  • Combining the Transformer architecture with unsupervised learning removed the need to train task-specific architectures from scratch.
  • Pretrained Transformers broke almost every benchmark in NLP by a significant margin.

Recurrent Architectures

  • Recurrent architectures such as LSTMs were state of the art in Natural Language Processing (NLP) before Transformers.
  • Recurrent architectures contain a feedback loop that allows information to propagate from one step to another.
    • ideal for sequential data like text.
  • A Recurrent Neural Network receives an input token and feeds it through the network.
  • The network outputs a vector called a hidden state, and it passes some information back to itself through a feedback loop.
  • The information passed through the feedback loop allows an RNN to keep track of details from previous steps and use them to make predictions (a minimal sketch follows this list).
  • Many still use recurrent architectures for NLP, speech processing, and time-series tasks.
  • The Unreasonable Effectiveness of Recurrent Neural Networks
    • provides an overview of RNNs and demonstrates how to train a language model on several datasets
  • RNNs were critical in developing systems to translate a sequence of words from one language to another.
    • known as machine translation
  • The computations for recurrent models are inherently sequential and cannot parallelize across the input.
    • The inability to parallelize computations is a fundamental shortcoming of recurrent models.
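
The recurrent update described above can be sketched in a few lines of NumPy. This is a toy illustration with assumed dimensions and random weights, not an implementation from the book:

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_xh = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden (feedback) weights

def rnn_step(x_t, h_prev):
    # The new hidden state mixes the current token with the previous hidden state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

tokens = rng.normal(size=(5, input_dim))  # five toy "token" vectors
h = np.zeros(hidden_dim)
for x_t in tokens:                        # inherently sequential: step t needs step t-1
    h = rnn_step(x_t, h)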

The Encoder-Decoder Framework

  • Sequence to Sequence Learning with Neural Networks
    • published in 2014 by researchers at Google
  • An encoder-decoder is also known as a sequence-to-sequence architecture.
  • This type of architecture is well-suited for situations where the input and output are both sequences of arbitrary length.
  • An encoder encodes information from an input sequence into a numerical representation.
    • This numerical representation is often called the last hidden state.
  • The decoder uses the numerical representation to generate the output sequence.
  • The encoder and decoder can use any neural network architecture that can model sequences.
  • The final hidden state of the encoder is the only information the decoder has access to when generating the output.
    • It has to represent the meaning of the whole input sequence.
    • This requirement creates an information bottleneck that can limit performance for longer sequences (illustrated in the sketch below).
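
Here is a toy sketch of the encoder-decoder bottleneck, again with assumed dimensions and random weights; it only illustrates that the decoder works from a single summary vector:

import numpy as np

rng = np.random.default_rng(1)
hidden_dim = 16
W_enc = rng.normal(size=(hidden_dim, hidden_dim))
W_dec = rng.normal(size=(hidden_dim, hidden_dim))

def encode(source_vectors):
    h = np.zeros(hidden_dim)
    for x_t in source_vectors:
        h = np.tanh(W_enc @ (h + x_t))
    return h                              # last hidden state: one vector for the whole input

def decode(last_hidden_state, num_steps):
    h, outputs = last_hidden_state, []
    for _ in range(num_steps):
        h = np.tanh(W_dec @ h)            # the decoder only sees the single summary vector
        outputs.append(h)
    return outputs

summary = encode(rng.normal(size=(7, hidden_dim)))   # the information bottleneck
generated = decode(summary, num_steps=5)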

Attention Mechanisms

  • Attention mechanisms allow the decoder to access all of the encoder’s hidden states, not just the last one.
  • The decoder assigns a different amount of weight, called attention, to each state at every decoding time-step.
  • The attention values allow the decoder to prioritize which encoder states to use (a toy sketch follows this list).
  • Attention-based models focus on which input tokens are most relevant at each time step.
    • They can learn nontrivial alignments between the words in a generated translation and those in a source sentence.
      • Example: An attention-based decoder can align the words “zone” and “Area” even when ordered differently in the source sentence and the translation.
  • Transformers use a special kind of attention called self-attention and do not use any form of recurrence.
    • Self-attention operates on all states in the same layer of a neural network.
    • The outputs of the self-attention mechanisms serve as input to feed-forward networks.
    • This architecture trains much faster than recurrent models.
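
The weighting idea from the first bullets can be sketched as simple dot-product attention over toy encoder states. The dimensions and random values are assumptions, and this is not the exact formulation used inside a Transformer:

import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(7, 16))   # one hidden state per input token
decoder_state = rng.normal(size=16)         # current decoder hidden state

scores = encoder_states @ decoder_state     # relevance of each encoder state
weights = softmax(scores)                   # attention weights sum to 1
context = weights @ encoder_states          # weighted mix of all encoder states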

Transfer Learning in NLP

  • Transfer learning involves using the knowledge a model learned from a previous task on a new one.
    • Computer vision models first train on large-scale datasets such as ImageNet to learn the basic features of images before being fine-tuned on a downstream task.
    • It was not initially clear how to perform transfer learning for NLP tasks.
  • Fine-tuned models are typically more accurate than supervised models trained from scratch on the same amount of labeled data.
  • We adapt a pretrained model to a new task by splitting the model into a body and a head.
  • The head is the task-specific portion of the network.
  • The body contains broad features from the source domain learned during training.
  • We use the pretrained body weights to initialize a new model and attach a randomly initialized head for the new task (see the sketch at the end of this section).
  • Transfer learning typically produces high-quality models that we can efficiently train on many downstream tasks.
  • The ULMFiT paper provided a general framework for performing transfer learning with NLP models.
    1. A model first trains on a large-scale generic corpus to predict the next word based on the words preceding it, learning the basic features of the language.
      • This task is called language modeling.
    2. The pretrained model then trains on the same task using an in-domain corpus.
    3. Lastly, we fine-tune the model with a classification layer for the target task.
  • The ULMFiT transfer learning framework provided the missing piece for transformers to take off.
  • Both GPT and BERT combine self-attention with transfer learning and set a new state of the art across many NLP benchmarks.
  • GPT uses only the decoder part of the Transformer architecture and the same language modeling approach as ULMFiT.
  • BERT uses the encoder part of the Transformer architecture and a form of language modeling called masked language modeling.
    • Masked language modeling requires the model to fill in randomly missing words in a text.
  • GPT trained on the BookCorpus dataset while BERT trained on the BookCorpus dataset and English Wikipedia.
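
A minimal sketch of the body/head split using the Transformers Auto classes. The bert-base-uncased checkpoint and the two-label linear head are assumptions chosen for illustration:

import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

checkpoint = "bert-base-uncased"                      # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
body = AutoModel.from_pretrained(checkpoint)          # pretrained body with broad language features
head = nn.Linear(body.config.hidden_size, 2)          # new, randomly initialized task head

inputs = tokenizer("I love my Optimus Prime figure!", return_tensors="pt")
with torch.no_grad():
    features = body(**inputs).last_hidden_state       # contextual features from the body
logits = head(features[:, 0])                         # the head maps the [CLS] features to task labels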

Bridging the Gap With Hugging Face Transformers

  • Applying a novel machine learning architecture to a new application can be complicated and requires custom logic for each model and task.
    1. Implement the model architecture in code.
      • PyTorch and TensorFlow are the most common frameworks for this.
    2. Load pretrained weights if available.
    3. Preprocess the inputs, pass them through the model, and apply task-specific postprocessing.
    4. Implement data loaders and define loss functions and optimizers to train the model.
  • Code released by research groups is rarely standardized and often requires days of engineering to adapt to new use cases.
    • Different research labs often release their models in incompatible frameworks, making it difficult for practitioners to port these models to their applications.
  • Hugging Face Transformers provides a standardized interface to a wide range of transformer models, including code and tools to adapt these models to new applications.
    • The availability of a standardized interface catalyzed the explosion of research into transformers and made it easy for NLP practitioners to integrate these models into real-life applications.
  • The library supports the PyTorch, TensorFlow, and JAX deep learning frameworks and provides task-specific model heads to fine-tune transformers on downstream tasks (a minimal usage sketch follows this list).
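
A minimal sketch of how those steps collapse with the standardized interface; the distilbert-base-uncased-finetuned-sst-2-english checkpoint is an assumption, and the pipelines below wrap this pattern up even further:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "distilbert-base-uncased-finetuned-sst-2-english"          # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)                   # preprocessing
model = AutoModelForSequenceClassification.from_pretrained(ckpt)  # architecture + pretrained weights

inputs = tokenizer("I love this!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                               # task-specific head output
print(model.config.id2label[logits.argmax(dim=-1).item()])        # e.g. POSITIVE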

A Tour of Transformer Applications

  • Hugging Face Transformers has a layered API that allows users to interact with the library at various levels of abstraction.

Pipelines

  • Pipelines abstract away all the steps needed to convert raw text into a set of predictions from a fine-tuned model.
  • Hugging Face provides pipelines for several tasks.
  • Instantiate a pipeline by calling the pipeline() function and providing the name of the desired task:
    • 'audio-classification'
    • 'automatic-speech-recognition'
    • 'feature-extraction'
    • 'text-classification'
    • 'token-classification'
    • 'question-answering'
    • 'table-question-answering'
    • 'fill-mask'
    • 'summarization'
    • 'translation'
    • 'text2text-generation'
    • 'text-generation'
    • 'zero-shot-classification'
    • 'conversational'
    • 'image-classification'
    • 'object-detection'
  • The names for the supported tasks are available in the transformers.pipelines.SUPPORTED_TASKS dictionary.
  • The pipeline automatically downloads the model weights for the selected task and caches them for future use.
  • Each pipeline takes a string of text or a list of strings as input and returns a list of predictions.
    • Each prediction is in a Python dictionary along with the corresponding confidence score.
import transformers
import datasets
import pandas as pd
from transformers import pipeline

# Silence non-error log messages from the transformers and datasets libraries
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()
# Inspect the pipeline() factory function
pipeline
<function transformers.pipelines.pipeline(task: str, model: Optional = None, config: Union[str, transformers.configuration_utils.PretrainedConfig, NoneType] = None, tokenizer: Union[str, transformers.tokenization_utils.PreTrainedTokenizer, NoneType] = None, feature_extractor: Union[str, ForwardRef('SequenceFeatureExtractor'), NoneType] = None, framework: Optional[str] = None, revision: Optional[str] = None, use_fast: bool = True, use_auth_token: Union[bool, str, NoneType] = None, model_kwargs: Dict[str, Any] = {}, **kwargs) -> transformers.pipelines.base.Pipeline>
list(transformers.pipelines.SUPPORTED_TASKS.keys())
['audio-classification',
 'automatic-speech-recognition',
 'feature-extraction',
 'text-classification',
 'token-classification',
 'question-answering',
 'table-question-answering',
 'fill-mask',
 'summarization',
 'translation',
 'text2text-generation',
 'text-generation',
 'zero-shot-classification',
 'conversational',
 'image-classification',
 'object-detection']
transformers.pipelines.TASK_ALIASES
{'sentiment-analysis': 'text-classification', 'ner': 'token-classification'}
Sample Text: Customer Review
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

Text Classification Pipeline

  • Documentation
  • The text-classification pipeline supports sentiment analysis as well as multiclass and multilabel classification; it performs sentiment analysis by default.
classifier = pipeline("text-classification")
# Classify the customer review as positive or negative
outputs = classifier(text)
pd.DataFrame(outputs)
      label     score
0  NEGATIVE  0.901546

Named Entity Recognition Pipeline

  • Documentation
  • Named entity recognition (NER) involves extracting real-world objects like products, places, and people from a piece of text.
  • Default Entity Labels
    • MISC: Miscellaneous
    • PER: Person
    • ORG: Organization
    • LOC: Location
# Create a named entity recognizer that groups words according to the model's predictions
ner_tagger = pipeline("ner", aggregation_strategy="simple")

Note: The simple aggregation strategy might end up splitting words undesirably.

ner_tagger.model.config.id2label
{0: 'O',
 1: 'B-MISC',
 2: 'I-MISC',
 3: 'B-PER',
 4: 'I-PER',
 5: 'B-ORG',
 6: 'I-ORG',
 7: 'B-LOC',
 8: 'I-LOC'}
outputs = ner_tagger(text)
pd.DataFrame(outputs)
  entity_group     score           word  start  end
0          ORG  0.879010         Amazon      5   11
1         MISC  0.990859  Optimus Prime     36   49
2          LOC  0.999755        Germany     90   97
3         MISC  0.556570           Mega    208  212
4          PER  0.590256         ##tron    212  216
5          ORG  0.669693         Decept    253  259
6         MISC  0.498349        ##icons    259  264
7         MISC  0.775362       Megatron    350  358
8         MISC  0.987854  Optimus Prime    367  380
9          PER  0.812096      Bumblebee    502  511

Note: The words Megatron and Decepticons were split into separate subword tokens.

Note: The ## symbols are produced by the model’s tokenizer.

ner_tagger.tokenizer.vocab_size
28996
pd.DataFrame(ner_tagger.tokenizer.vocab, index=[0]).T.head(10)
                0
Rees        24646
seeded      14937
Ruby        11374
Libraries   27927
foil        20235
collapsed    7322
membership   5467
Birth       20729
Texans      25904
Saul        18600

Question Answering Pipeline

  • Documentation
  • Question answering involves having a model find the answer to a specified question using a given passage of text.
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])
      score  start  end                   answer
0  0.631291    335  358  an exchange of Megatron

Note: This particular kind of question answering is called extractive question answering. The answer is extracted directly from the text.

Summarization Pipeline

  • Documentation
  • Text summarization involves generating a short version of a long passage of text while retaining all the relevant facts.
  • Tasks requiring a model to generate new text are more challenging than extractive ones.
summarizer = pipeline("summarization")
# Limit the generated summary to 45 tokens
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])
 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead.

Note: The model captured the essence of the customer message but directly copied some of the original text.

Translation Pipeline

  • Documentation
  • The model generates a translation of a piece of text in the target language.
# Create a translator that translates English to German
# Override the default model selection
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
# Require the model to generate a translation at least 100 tokens long
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])
Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Anbei sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, bald von Ihnen zu hören. Aufrichtig, Bumblebee.

Note: The model supposedly did a good job translating the text. (I don’t speak German.)

Text Generation Pipeline

  • Documentation
  • The model generates new text to complete a provided text prompt.
from transformers import set_seed
# Set the random seed to get reproducible results
set_seed(42)
generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])
Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. The order was completely mislabeled, which is very common in our online store, but I can appreciate it because it was my understanding from this site and our customer service of the previous day that your order was not made correct in our mind and that we are in a process of resolving this matter. We can assure you that your order

The Hugging Face Ecosystem

  • Hugging Face Transformers is surrounded by an ecosystem of helpful tools that support the modern machine learning workflow.
  • This ecosystem consists of a family of code libraries and a hub of pretrained model weights, datasets, evaluation scripts, and other resources.

The Hugging Face Hub

  • The Hub hosts over 20,000 freely available models plus datasets and scripts for computing metrics.
  • Model and dataset cards document the contents of the models and datasets.
  • Filters are available for tasks, frameworks, datasets, and more to help users quickly navigate the Hub.
  • Users can directly try out any model through task-specific widgets.

Hugging Face Tokenizers

  • Documentation
  • Tokenizers split the raw input text into smaller pieces called tokens.
  • Tokens can be words, parts of words, or single characters.
  • Hugging Face Tokenizers takes care of all the preprocessing and postprocessing steps, such as normalizing the inputs and transforming the model outputs to the required format.
  • The Tokenizers library uses a Rust backend for fast tokenization.
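
A minimal tokenization sketch; the distilbert-base-uncased checkpoint and the sample sentence are assumptions:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")    # assumed checkpoint
print(tokenizer.tokenize("Transformers are unreasonably effective!"))   # subword tokens, some prefixed with '##'
print(tokenizer("Transformers are unreasonably effective!").input_ids)  # numeric IDs the model consumes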

Hugging Face Datasets

  • Documentation
  • The Datasets library provides a standard interface for thousands of datasets to simplify loading, processing, and storing datasets.
  • Smart caching removes the need to redo preprocessing steps each time you run your code.
  • Memory mapping helps avoid RAM limitations by storing the contents of a file in virtual memory and enables multiple processes to modify the file more efficiently.
  • The library is interoperable with frameworks like Pandas and NumPy.
  • Scripts are available for many metrics to help make experiments more reproducible and trustworthy.
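
A minimal loading sketch; the emotion dataset name is an assumption:

from datasets import load_dataset

dataset = load_dataset("emotion")     # downloads once, then reuses the cached copy
print(dataset)                        # DatasetDict with train/validation/test splits
print(dataset["train"][0])            # a single labeled example as a plain Python dict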

Hugging Face Accelerate

  • Documentation
  • The Accelerate library adds a layer of abstraction to training loops, which takes care of all the custom logic necessary for the training infrastructure.
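
A sketch of the Accelerate pattern with a toy PyTorch model; the model, data, and hyperparameters are assumptions:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(10, 2)                                    # toy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(64, 10),
                                  torch.randint(0, 2, (64,))), batch_size=8)

# prepare() places everything on the available device(s); the loop itself stays the same.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
for features, labels in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(features), labels)
    accelerator.backward(loss)                                    # replaces loss.backward()
    optimizer.step()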

Main Challenges with Transformers

Language

  • It is hard to find pretrained models for languages other than English.

Data Availability

  • Even with transfer learning, transformers still need a lot of data compared to humans to perform a task.

Working With Long Documents

  • Self-attention becomes computationally expensive when working on full-length documents.

Opacity

  • It is hard or impossible to determine precisely why a model made a given prediction.

Bias

  • Biases present in the training data imprint into the model.

Next: Notes on Transformers Book Ch. 2