Notes on Transformers Book Ch. 4
- Introduction
- Project: Multilingual Named Entity Recognition
- The Dataset
- Multilingual Transformers
- A Closer Look at Tokenization
- Transformers for Named Entity Recognition
- The Anatomy of the Transformers Model Class
- Tokenizing Texts for NER
- Performance Measures
- Fine-Tuning XLM-RoBERTa
- Error Analysis
- Cross-Lingual Transfer
- References
import transformers
import datasets
import accelerate
# Only print error messages
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()
transformers.__version__, datasets.__version__, accelerate.__version__
('4.11.3', '1.16.1', '0.5.1')
import ast
# https://astor.readthedocs.io/en/latest/
import astor
import inspect
import textwrap
def print_source(obj, exclude_doc=True):
    # Get source code
    source = inspect.getsource(obj)
    # Remove any common leading whitespace from every line
    cleaned_source = textwrap.dedent(source)
    # Parse the source into an AST node
    parsed = ast.parse(cleaned_source)

    for node in ast.walk(parsed):
        # Skip any nodes that are not class or function definitions
        if not isinstance(node, (ast.FunctionDef, ast.ClassDef, ast.AsyncFunctionDef)):
            continue
        # Drop the docstring from the node body
        if exclude_doc and len(node.body) > 1:
            node.body = node.body[1:]

    print(astor.to_source(parsed))
Introduction
- Non-English pretrained models typically exist only for languages like German, Russian, or Mandarin, where plenty of web text is available for pretraining.
- Avoid maintaining multiple monolingual models in production when possible.
- Transformer models pretrained on large corpora across many languages can perform zero-shot cross-lingual transfer.
- We can fine-tune a model using one language and apply it to others without further training.
- Multilingual transformers are well-suited for situations where a speaker alternates between two or more languages in the context of a single conversation.
Project: Multilingual Named Entity Recognition
- The goal is to fine-tune the transformer model XLM-RoBERTa to perform named entity recognition for a customer in Switzerland, where there are four national languages.
- We will use German, French, Italian, and English as the four languages.
- Named entity recognition involves extracting real-world objects like products, places, and people from a piece of text.
- Some potential NER applications include gaining insights from company documents, augmenting the quality of search engines, and building a structured database from a corpus.
The Dataset
WikiAnn (a.k.a. PAN-X)
- WikiAnn is a dataset for cross-lingual name tagging and linking based on Wikipedia articles in 295 languages.
- Each article has annotations for location, person, and organization tags in the IOB2 format.
- The IOB2 format indicates the beginning of an entity with a `B-` prefix, consecutive tags belonging to the same entity with an `I-` prefix, and tokens that do not belong to any entity with an `O` tag.
- WikiANN is a subset of XTREME (A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization).
- Cross-lingual Name Tagging and Linking for 282 Languages
- Hugging Face Dataset Card
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', None)

# An example of a sequence annotated with named entities in IOB2 format
toks = "Jeff Dean is a computer scientist at Google in California".split()
lbls = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
pd.DataFrame(data=[toks, lbls], index=['Tokens', 'Tags'])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
Tokens | Jeff | Dean | is | a | computer | scientist | at | Google | in | California |
Tags | B-PER | I-PER | O | O | O | O | O | B-ORG | O | B-LOC |
from datasets import get_dataset_config_names
get_dataset_config_names
- Documentation
- Get the list of available configuration names for a particular dataset.
print_source(get_dataset_config_names)
def get_dataset_config_names(path: str, revision: Optional[Union[str,
Version]]=None, download_config: Optional[DownloadConfig]=None,
download_mode: Optional[GenerateMode]=None, force_local_path: Optional[
str]=None, dynamic_modules_path: Optional[str]=None, data_files:
Optional[Union[Dict, List, str]]=None, **download_kwargs):
dataset_module = dataset_module_factory(path, revision=revision,
download_config=download_config, download_mode=download_mode,
force_local_path=force_local_path, dynamic_modules_path=
dynamic_modules_path, data_files=data_files, **download_kwargs)
builder_cls = import_main_class(dataset_module.module_path)
return list(builder_cls.builder_configs.keys()) or [dataset_module.
builder_kwargs.get('name', 'default')]
xtreme
Hugging Face Dataset Card
# Get the names of the subsets for the XTREME dataset
xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")
XTREME has 183 configurations
pd.DataFrame(xtreme_subsets).T.style.hide(axis='columns').hide(axis='index')
XNLI | tydiqa | SQuAD | PAN-X.af | PAN-X.ar | PAN-X.bg | PAN-X.bn | PAN-X.de | PAN-X.el | PAN-X.en | PAN-X.es | PAN-X.et | PAN-X.eu | PAN-X.fa | PAN-X.fi | PAN-X.fr | PAN-X.he | PAN-X.hi | PAN-X.hu | PAN-X.id | PAN-X.it | PAN-X.ja | PAN-X.jv | PAN-X.ka | PAN-X.kk | PAN-X.ko | PAN-X.ml | PAN-X.mr | PAN-X.ms | PAN-X.my | PAN-X.nl | PAN-X.pt | PAN-X.ru | PAN-X.sw | PAN-X.ta | PAN-X.te | PAN-X.th | PAN-X.tl | PAN-X.tr | PAN-X.ur | PAN-X.vi | PAN-X.yo | PAN-X.zh | MLQA.ar.ar | MLQA.ar.de | MLQA.ar.vi | MLQA.ar.zh | MLQA.ar.en | MLQA.ar.es | MLQA.ar.hi | MLQA.de.ar | MLQA.de.de | MLQA.de.vi | MLQA.de.zh | MLQA.de.en | MLQA.de.es | MLQA.de.hi | MLQA.vi.ar | MLQA.vi.de | MLQA.vi.vi | MLQA.vi.zh | MLQA.vi.en | MLQA.vi.es | MLQA.vi.hi | MLQA.zh.ar | MLQA.zh.de | MLQA.zh.vi | MLQA.zh.zh | MLQA.zh.en | MLQA.zh.es | MLQA.zh.hi | MLQA.en.ar | MLQA.en.de | MLQA.en.vi | MLQA.en.zh | MLQA.en.en | MLQA.en.es | MLQA.en.hi | MLQA.es.ar | MLQA.es.de | MLQA.es.vi | MLQA.es.zh | MLQA.es.en | MLQA.es.es | MLQA.es.hi | MLQA.hi.ar | MLQA.hi.de | MLQA.hi.vi | MLQA.hi.zh | MLQA.hi.en | MLQA.hi.es | MLQA.hi.hi | XQuAD.ar | XQuAD.de | XQuAD.vi | XQuAD.zh | XQuAD.en | XQuAD.es | XQuAD.hi | XQuAD.el | XQuAD.ru | XQuAD.th | XQuAD.tr | bucc18.de | bucc18.fr | bucc18.zh | bucc18.ru | PAWS-X.de | PAWS-X.en | PAWS-X.es | PAWS-X.fr | PAWS-X.ja | PAWS-X.ko | PAWS-X.zh | tatoeba.afr | tatoeba.ara | tatoeba.ben | tatoeba.bul | tatoeba.deu | tatoeba.cmn | tatoeba.ell | tatoeba.est | tatoeba.eus | tatoeba.fin | tatoeba.fra | tatoeba.heb | tatoeba.hin | tatoeba.hun | tatoeba.ind | tatoeba.ita | tatoeba.jav | tatoeba.jpn | tatoeba.kat | tatoeba.kaz | tatoeba.kor | tatoeba.mal | tatoeba.mar | tatoeba.nld | tatoeba.pes | tatoeba.por | tatoeba.rus | tatoeba.spa | tatoeba.swh | tatoeba.tam | tatoeba.tel | tatoeba.tgl | tatoeba.tha | tatoeba.tur | tatoeba.urd | tatoeba.vie | udpos.Afrikaans | udpos.Arabic | udpos.Basque | udpos.Bulgarian | udpos.Dutch | udpos.English | udpos.Estonian | udpos.Finnish | udpos.French | udpos.German | udpos.Greek | udpos.Hebrew | udpos.Hindi | udpos.Hungarian | udpos.Indonesian | udpos.Italian | udpos.Japanese | udpos.Kazakh | udpos.Korean | udpos.Chinese | udpos.Marathi | udpos.Persian | udpos.Portuguese | udpos.Russian | udpos.Spanish | udpos.Tagalog | udpos.Tamil | udpos.Telugu | udpos.Thai | udpos.Turkish | udpos.Urdu | udpos.Vietnamese | udpos.Yoruba |
Note: We are only interested in the PAN-X subsets for this project.
# Look for configuration names containing 'PAN'
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
len(panx_subsets), panx_subsets[:3]
(40, ['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg'])
Note:
- There are 40 PAN-X subsets.
- Each subset has a two-letter suffix indicating the ISO 639-1 language code.
  - German (de)
  - French (fr)
  - Italian (it)
  - English (en)
from datasets import load_dataset
from collections import defaultdict
from datasets import DatasetDict

# Specify the desired language codes
langs = ["de", "fr", "it", "en"]
# Specify the percentage each language should contribute to the total dataset
fracs = [0.629, 0.229, 0.084, 0.059]
Note:
- These percentages represent the spoken proportions for each language in Switzerland.
- This language imbalance simulates the common situation where acquiring labeled examples in a minority language is cost-prohibitive.
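As a quick sanity check, the sketch below (assuming each full PAN-X training split used here contains 20,000 examples, which matches the counts reported later) shows how the downsampled training-set sizes follow from these fractions:

```python
# Assumption: each full PAN-X train split used here has 20,000 rows.
full_train_rows = 20_000
for lang, frac in zip(["de", "fr", "it", "en"], [0.629, 0.229, 0.084, 0.059]):
    print(lang, int(frac * full_train_rows))
# de 12580, fr 4580, it 1680, en 1180
```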
Dataset.shuffle
- Documentation
- Create a new dataset with shuffled rows.
Dataset.select
- Documentation
- Create a new dataset with rows selected following the list/array of indices.
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            # Shuffle the dataset split rows
            .shuffle(seed=0)
            # Select subset of dataset split
            .select(range(int(frac * ds[split].num_rows))))

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs},
             index=["Number of training examples"])
de | fr | it | en | |
---|---|---|---|---|
Number of training examples | 12580 | 4580 | 1680 | 1180 |
train_size = sum([panx_ch[lang]['train'].num_rows for lang in langs])
train_size
20020
pd.DataFrame("train"].num_rows,
{lang: [panx_ch[lang][f'{panx_ch[lang]["train"].num_rows/train_size*100:.2f}%'] for lang in langs
=["Number of training examples", "Proportion of Dataset"]) }, index
de | fr | it | en | |
---|---|---|---|---|
Number of training examples | 12580 | 4580 | 1680 | 1180 |
Proportion of Dataset | 62.84% | 22.88% | 8.39% | 5.89% |
for lang in langs:
    print(panx_ch[lang]["train"])
Dataset({
features: ['tokens', 'ner_tags', 'langs'],
num_rows: 12580
})
Dataset({
features: ['tokens', 'ner_tags', 'langs'],
num_rows: 4580
})
Dataset({
features: ['tokens', 'ner_tags', 'langs'],
num_rows: 1680
})
Dataset({
features: ['tokens', 'ner_tags', 'langs'],
num_rows: 1180
})
= panx_ch["de"]["train"][0]
element pd.DataFrame(element).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
ner_tags | 0 | 0 | 0 | 0 | 5 | 6 | 0 | 0 | 5 | 5 | 6 | 0 |
langs | de | de | de | de | de | de | de | de | de | de | de | de |
Note:
- The German text translates to “2,000 inhabitants at the Gdansk Bay in the Polish voivodeship of Pomerania.”
- Gdansk Bay is a bay in the Baltic Sea.
- The word “voivodeship” corresponds to a state in Poland.
- The ner_tags column corresponds to the mapping of each entity to a class ID.
- The Dataset object has a “features” attribute that specifies the underlying data types associated with each column.
= panx_ch["de"]["train"].features["ner_tags"].feature
tags tags
ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], names_file=None, id=None)
tags.names
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
Map the class IDs to the corresponding tag names
def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

panx_de = panx_ch["de"].map(create_tag_names)
"train"][0]).reindex(columns=["tokens", "ner_tags_str","ner_tags","langs"]).T pd.DataFrame(panx_de[
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
ner_tags_str | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |
ner_tags | 0 | 0 | 0 | 0 | 5 | 6 | 0 | 0 | 5 | 5 | 6 | 0 |
langs | de | de | de | de | de | de | de | de | de | de | de | de |
from collections import Counter
Calculate the frequencies of each entity across each split
split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1

pd.DataFrame.from_dict(split2freqs, orient="index")
ORG | LOC | PER | |
---|---|---|---|
validation | 2683 | 3172 | 2893 |
test | 2573 | 3180 | 3071 |
train | 5366 | 6186 | 5810 |
Note: The distributions of the entity frequencies are roughly the same for each split.
Multilingual Transformers
- Multilingual transformers use a corpus consisting of documents in many languages for pretraining.
- The models do not receive any explicit information to differentiate among languages.
- The resulting linguistic representations generalize well across languages for many downstream tasks.
- The CoNLL-2002 and CoNLL-2003 datasets are common benchmarks for measuring progress in cross-lingual transfer for named entity recognition in English, Dutch, Spanish, and German.
Evaluation Methods
- en: Fine-tune using the English training data and then evaluate the model on each language’s test set.
- each: Fine-tune on each language's training data and evaluate on that language's test set to measure per-language performance.
- all: Fine-tune on all the training data and evaluate the model on each language's test set.
A Closer Look at Tokenization
- XLM-RoBERTa uses the SentencePiece subword tokenizer instead of the WordPiece tokenizer.
from transformers import AutoTokenizer
= "bert-base-cased"
bert_model_name = "xlm-roberta-base"
xlmr_model_name = AutoTokenizer.from_pretrained(bert_model_name)
bert_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name) xlmr_tokenizer
Compare the WordPiece and SentencePiece tokenizers
= "Jack Sparrow loves New York!"
text = bert_tokenizer(text).tokens()
bert_tokens = xlmr_tokenizer(text).tokens() xlmr_tokens
=["BERT", "XLM-R"]) pd.DataFrame([bert_tokens, xlmr_tokens], index
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
BERT | [CLS] | Jack | Spa | ##rrow | loves | New | York | ! | [SEP] | None |
XLM-R | <s> | _Jack | _Spar | row | _love | s | _New | _York | ! | </s> |
Note: SentencePiece uses `<s>` and `</s>` to indicate the start and end of a sequence.
The Tokenizer Pipeline
1. Normalization
- The normalization step includes the operations to clean up raw text, such as stripping whitespace and removing accented characters.
- Unicode normalization schemes replace the various ways to write the same character with standard forms.
- Unicode normalization is particularly effective when working with multilingual corpora.
- Lowercasing can help reduce the vocabulary size when the model only accepts and uses lowercase characters.
2. Pretokenization
- The pre-tokenization step splits a text into smaller objects, and the final tokens will be subunits of these objects.
- Some languages might require language-specific pre-tokenization methods.
3. Tokenizer Model
- The tokenizer model analyzes the training corpus to find the most commonly occurring groups of characters, which become the vocab.
4. Postprocessing
- The postprocessing step applies some additional transformations, such as adding special characters to the start or end of an input sequence.
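As a rough illustration of the normalization and pre-tokenization steps described above (a minimal sketch using only the Python standard library, not the chapter's code):

```python
import unicodedata

text = "Héllo Wörld!"

# 1. Normalization: apply a Unicode normalization form and lowercase the text
normalized = unicodedata.normalize("NFKC", text).lower()
print(normalized)          # 'héllo wörld!'

# 2. Pre-tokenization: split the normalized text into word-level pieces
#    that the tokenizer model would later segment into subwords
print(normalized.split())  # ['héllo', 'wörld!']
```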
The SentencePiece Tokenizer
- The SentencePiece tokenizer builds on the Unigram subword segmentation algorithm and encodes each input text as a sequence of Unicode characters.
- SentencePiece supports the byte-pair-encoding (BPE) algorithm and the unigram language model.
- SentencePiece replaces whitespace with the Unicode symbol `U+2581` (`▁`).
"".join(xlmr_tokens).replace(u"\u2581", " ")
'<s> Jack Sparrow loves New York!</s>'
Transformers for Named Entity Recognition
- For text classification, BERT uses the `[CLS]` token to represent an entire sequence of text.
- For named entity recognition, BERT feeds the representation of each input token through the same fully connected layer to output the entity of each one.
- We can assign the entity label to the first subword of a word and ignore the rest.
The Anatomy of the Transformers Model Class
- The Hugging Face Transformers library has dedicated classes for each architecture and task.
- We can extend existing models for specific use cases with little overhead.
Bodies and Heads
- Hugging Face Transformers splits model architectures into a body and a head.
- The body is task-agnostic, and the model head is unique to a specific downstream task.
Creating a Custom Model for Token Classification
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel
TokenClassifierOutput
- Documentation
- A base class for outputs of token classification models.
print_source(TokenClassifierOutput)
@dataclass
class TokenClassifierOutput(ModelOutput):
loss: Optional[torch.FloatTensor] = None
logits: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
ModelOutput
- Documentation
- The base class for all model outputs.
RobertaModel
- Documentation
- A bare RoBERTa Model transformer outputting raw hidden-states without any specific head on top.
RobertaPreTrainedModel
- Source Code
- An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
XLMRobertaConfig()
XLMRobertaConfig {
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "xlm-roberta",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"transformers_version": "4.11.3",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
    # Use the standard XLM-RoBERTa settings.
    config_class = XLMRobertaConfig

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        # Load model body
        # Set add_pooling_layer to False to get all hidden states in the output
        self.roberta = RobertaModel(config, add_pooling_layer=False)
        # Set up token classification head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Load and initialize weights
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                labels=None, **kwargs):
        # Use model body to get encoder representations
        outputs = self.roberta(input_ids, attention_mask=attention_mask,
                               token_type_ids=token_type_ids, **kwargs)
        # Apply classifier to encoder representation
        sequence_output = self.dropout(outputs[0])
        logits = self.classifier(sequence_output)
        # Calculate losses
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        # Return model output object
        return TokenClassifierOutput(loss=loss, logits=logits,
                                     hidden_states=outputs.hidden_states,
                                     attentions=outputs.attentions)
Loading a Custom Model
- We need to provide the tags for labeling each entity and mappings to convert between tags and IDs
Define the mappings to convert between tags and index IDs
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}
index2tag, tag2index
({0: 'O',
1: 'B-PER',
2: 'I-PER',
3: 'B-ORG',
4: 'I-ORG',
5: 'B-LOC',
6: 'I-LOC'},
{'O': 0,
'B-PER': 1,
'I-PER': 2,
'B-ORG': 3,
'I-ORG': 4,
'B-LOC': 5,
'I-LOC': 6})
from transformers import AutoConfig
Override the default parameters stored in XLMRobertaConfig
xlmr_config = AutoConfig.from_pretrained(xlmr_model_name,
                                         num_labels=tags.num_classes,
                                         id2label=index2tag, label2id=tag2index)
xlmr_config
XLMRobertaConfig {
"architectures": [
"XLMRobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "O",
"1": "B-PER",
"2": "I-PER",
"3": "B-ORG",
"4": "I-ORG",
"5": "B-LOC",
"6": "I-LOC"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"B-LOC": 5,
"B-ORG": 3,
"B-PER": 1,
"I-LOC": 6,
"I-ORG": 4,
"I-PER": 2,
"O": 0
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "xlm-roberta",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"output_past": true,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"transformers_version": "4.11.3",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 250002
}
import torch
Load a pretrained XLM-RoBERTa model with the custom classification head and configuration parameters
= torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = (XLMRobertaForTokenClassification
xlmr_model =xlmr_config)
.from_pretrained(xlmr_model_name, config .to(device))
Encode some sample text
text
'Jack Sparrow loves New York!'
input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], index=["Tokens", "Input IDs"])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
Tokens | <s> | _Jack | _Spar | row | _love | s | _New | _York | ! | </s> |
Input IDs | 0 | 21763 | 37456 | 15555 | 5161 | 7 | 2356 | 5753 | 38 | 2 |
Test model predictions with the untrained classifier
outputs = xlmr_model(input_ids.to(device)).logits
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"Shape of outputs: {outputs.shape}")
Number of tokens in sequence: 10
Shape of outputs: torch.Size([1, 10, 7])
Note: The logits have the shape `[batch_size, num_tokens, num_tags]`.
preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Tags"])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
Tokens | <s> | _Jack | _Spar | row | _love | s | _New | _York | ! | </s> |
Tags | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG |
Note: The output is useless as the weights are still randomly initialized.
Wrap the prediction steps in a helper function
def tag_text(text, tags, model, tokenizer):
    # Get tokens with special characters
    tokens = tokenizer(text).tokens()
    # Encode the sequence into IDs
    input_ids = xlmr_tokenizer(text, return_tensors="pt").input_ids.to(device)
    # Get predictions as distribution over 7 possible classes
    outputs = model(input_ids)[0]
    # Take argmax to get most likely class per token
    predictions = torch.argmax(outputs, dim=2)
    # Convert to DataFrame
    preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
    return pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])
Tokenizing Texts for NER
Collect the words and tags as ordinary lists
de_example = panx_de['train'][0]
words, labels = de_example["tokens"], de_example["ner_tags"]
pd.DataFrame([words, labels], index=["words", "labels"])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
words | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
labels | 0 | 0 | 0 | 0 | 5 | 6 | 0 | 0 | 5 | 5 | 6 | 0 |
Tokenize each word
tokenized_input = xlmr_tokenizer(de_example["tokens"], is_split_into_words=True)
pd.DataFrame(tokenized_input.values(), index=tokenized_input.keys())
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
input_ids | 0 | 70101 | 176581 | 19 | 142 | 122 | 2290 | 708 | 1505 | 18363 | 18 | 23 | 122 | 127474 | 15439 | 13787 | 14 | 15263 | 18917 | 663 | 6947 | 19 | 6 | 5 | 2 |
attention_mask | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Note: The `is_split_into_words` argument tells the tokenizer the input sequence is a list of separated words.
tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
pd.DataFrame(tokens, columns=["tokens"]).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tokens | <s> | _2.000 | _Einwohner | n | _an | _der | _Dan | zi | ger | _Buch | t | _in | _der | _polni | schen | _Wo | i | wod | schaft | _Po | mmer | n | _ | . | </s> |
Note: We can use the `word_ids()` function to mask the subword representations after the first subword.
BatchEncoding.word_ids
- Documentation
- Get a list indicating the word corresponding to each token.
word_ids = tokenized_input.word_ids()
pd.DataFrame([tokens, word_ids], index=["Tokens", "Word IDs"])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tokens | <s> | _2.000 | _Einwohner | n | _an | _der | _Dan | zi | ger | _Buch | t | _in | _der | _polni | schen | _Wo | i | wod | schaft | _Po | mmer | n | _ | . | </s> |
Word IDs | None | 0 | 1 | 1 | 2 | 3 | 4 | 4 | 4 | 5 | 5 | 6 | 7 | 8 | 8 | 9 | 9 | 9 | 9 | 10 | 10 | 10 | 11 | 11 | None |
Note: The `<s>` and `</s>` tokens map to `None` as they are not words from the original text.
Set -100 as the label for the start and end tokens and masked subwords
- The PyTorch cross-entropy loss class has an attribute called `ignore_index` whose value is -100 (see the quick check after the table below).
previous_word_idx = None
label_ids = []

for word_idx in word_ids:
    if word_idx is None or word_idx == previous_word_idx:
        label_ids.append(-100)
    elif word_idx != previous_word_idx:
        label_ids.append(labels[word_idx])
    previous_word_idx = word_idx

labels = [index2tag[l] if l != -100 else "IGN" for l in label_ids]
index = ["Tokens", "Word IDs", "Label IDs", "Labels"]

pd.DataFrame([tokens, word_ids, label_ids, labels], index=index)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tokens | <s> | _2.000 | _Einwohner | n | _an | _der | _Dan | zi | ger | _Buch | t | _in | _der | _polni | schen | _Wo | i | wod | schaft | _Po | mmer | n | _ | . | </s> |
Word IDs | None | 0 | 1 | 1 | 2 | 3 | 4 | 4 | 4 | 5 | 5 | 6 | 7 | 8 | 8 | 9 | 9 | 9 | 9 | 10 | 10 | 10 | 11 | 11 | None |
Label IDs | -100 | 0 | 0 | -100 | 0 | 0 | 5 | -100 | -100 | 6 | -100 | 0 | 0 | 5 | -100 | 5 | -100 | -100 | -100 | 6 | -100 | -100 | 0 | -100 | -100 |
Labels | IGN | O | O | IGN | O | O | B-LOC | IGN | IGN | I-LOC | IGN | O | O | B-LOC | IGN | B-LOC | IGN | IGN | IGN | I-LOC | IGN | IGN | O | IGN | IGN |
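Below is a minimal check (my sketch, not from the chapter) that the -100 convention does what we want: `nn.CrossEntropyLoss` defaults to `ignore_index=-100`, so masked positions contribute nothing to the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 7)                    # 5 tokens, 7 NER classes
labels = torch.tensor([-100, 0, 5, -100, 6])  # special tokens / continuation subwords masked

loss_fct = nn.CrossEntropyLoss()              # ignore_index defaults to -100
loss_masked = loss_fct(logits, labels)
loss_kept = loss_fct(logits[[1, 2, 4]], labels[[1, 2, 4]])
print(torch.isclose(loss_masked, loss_kept))  # tensor(True)
```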
Wrap the tokenization and label alignment steps into a single function
def tokenize_and_align_labels(examples):
    tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True,
                                      is_split_into_words=True)
    labels = []
    for idx, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
Define a mapping function to encode the dataset
def encode_panx_dataset(corpus):
    return corpus.map(tokenize_and_align_labels, batched=True,
                      remove_columns=['langs', 'ner_tags', 'tokens'])

panx_ch["de"]
DatasetDict({
validation: Dataset({
features: ['tokens', 'ner_tags', 'langs'],
num_rows: 6290
})
test: Dataset({
features: ['tokens', 'ner_tags', 'langs'],
num_rows: 6290
})
train: Dataset({
features: ['tokens', 'ner_tags', 'langs'],
num_rows: 12580
})
})
= encode_panx_dataset(panx_ch["de"])
panx_de_encoded panx_de_encoded
DatasetDict({
validation: Dataset({
features: ['attention_mask', 'input_ids', 'labels'],
num_rows: 6290
})
test: Dataset({
features: ['attention_mask', 'input_ids', 'labels'],
num_rows: 6290
})
train: Dataset({
features: ['attention_mask', 'input_ids', 'labels'],
num_rows: 12580
})
})
Performance Measures
- Standard performance measures for NER tasks include precision, recall, and F1-score.
- The model needs to correctly predict all words of an entity for a prediction to count as correct.
seqeval
- GitHub Repository
- A Python framework for sequence labeling evaluation
from seqeval.metrics import classification_report
classification_report
- Source Code
- Build a text report showing the main classification metrics for a sequence of targets and predictions.
- The function expects targets and predictions as lists of lists.
= [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],["B-PER", "I-PER", "O"]]
y_true = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"],["B-PER", "I-PER", "O"]]
y_pred print(classification_report(y_true, y_pred))
precision recall f1-score support
MISC 0.00 0.00 0.00 1
PER 1.00 1.00 1.00 1
micro avg 0.50 0.50 0.50 2
macro avg 0.50 0.50 0.50 2
weighted avg 0.50 0.50 0.50 2
import numpy as np
Format predictions and target labels for seqeval
def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = preds.shape
    labels_list, preds_list = [], []

    for batch_idx in range(batch_size):
        example_labels, example_preds = [], []
        for seq_idx in range(seq_len):
            # Ignore label IDs = -100
            if label_ids[batch_idx, seq_idx] != -100:
                example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
                example_preds.append(index2tag[preds[batch_idx][seq_idx]])

        labels_list.append(example_labels)
        preds_list.append(example_preds)

    return preds_list, labels_list
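As a quick usage sketch (not from the chapter), here is `align_predictions` applied to a fake batch containing one three-token sequence whose middle position is masked with -100:

```python
# Fake "logits" for one sequence of three tokens over the 7 NER classes;
# the per-token argmax gives class IDs 0 ('O'), 0 ('O'), and 5 ('B-LOC').
dummy_preds = np.zeros((1, 3, 7))
dummy_preds[0, 2, 5] = 1.0
dummy_label_ids = np.array([[0, -100, 5]])

print(align_predictions(dummy_preds, dummy_label_ids))
# Expected: ([['O', 'B-LOC']], [['O', 'B-LOC']])
```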
Fine-Tuning XLM-RoBERTa
Define training attributes
from transformers import TrainingArguments
num_epochs = 3
batch_size = 64
logging_steps = len(panx_de_encoded["train"]) // batch_size
model_name = f"{xlmr_model_name}-finetuned-panx-de"
training_args = TrainingArguments(
    output_dir=model_name, log_level="error", num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size, evaluation_strategy="epoch",
    save_steps=1e6, weight_decay=0.01, disable_tqdm=False,
    logging_steps=logging_steps, push_to_hub=True, fp16=True)
Note: Set `save_steps` to a large number to disable checkpointing.
Log into Hugging Face account
from huggingface_hub import notebook_login
notebook_login()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Login successful
Your token has been saved to /home/innom-dt/.huggingface/token
Compute the \(f_{1}\)-score on the validation set
from seqeval.metrics import f1_score
def compute_metrics(eval_pred):
    y_pred, y_true = align_predictions(eval_pred.predictions,
                                       eval_pred.label_ids)
    return {"f1": f1_score(y_true, y_pred)}
Define a collator to pad each input sequence to the highest sequence length in a batch
from transformers import DataCollatorForTokenClassification
DataCollatorForTokenClassification
- Documentation
- Create a data collator that will dynamically pad inputs and labels.
print_source(DataCollatorForTokenClassification.torch_call)
def torch_call(self, features):
label_name = 'label' if 'label' in features[0].keys() else 'labels'
labels = [feature[label_name] for feature in features
] if label_name in features[0].keys() else None
batch = self.tokenizer.pad(features, padding=self.padding, max_length=
self.max_length, pad_to_multiple_of=self.pad_to_multiple_of,
return_tensors='pt' if labels is None else None)
if labels is None:
return batch
sequence_length = torch.tensor(batch['input_ids']).shape[1]
padding_side = self.tokenizer.padding_side
if padding_side == 'right':
batch[label_name] = [(list(label) + [self.label_pad_token_id] * (
sequence_length - len(label))) for label in labels]
else:
batch[label_name] = [([self.label_pad_token_id] * (sequence_length -
len(label)) + list(label)) for label in labels]
batch = {k: torch.tensor(v, dtype=torch.int64) for k, v in batch.items()}
return batch
DataCollatorForTokenClassification.label_pad_token_id
-100
Note:
- We need to pad the labels as they are also sequences.
- The collator pads label sequences with the value -100, so the PyTorch loss function ignores them.
data_collator = DataCollatorForTokenClassification(xlmr_tokenizer)
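As a small usage sketch (the token IDs below are made up for illustration), the collator pads a toy batch with two sequences of different lengths and fills the shorter label sequence with -100:

```python
# Hypothetical features resembling the output of tokenize_and_align_labels
features = [
    {"input_ids": [0, 23, 51, 2], "attention_mask": [1, 1, 1, 1], "labels": [-100, 1, 2, -100]},
    {"input_ids": [0, 77, 2], "attention_mask": [1, 1, 1], "labels": [-100, 5, -100]},
]
batch = data_collator(features)
print(batch["labels"])
# tensor([[-100,    1,    2, -100],
#         [-100,    5, -100, -100]])
```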
Create a helper function to initialize a new model for a training session
def model_init():
    return (XLMRobertaForTokenClassification
            .from_pretrained(xlmr_model_name, config=xlmr_config)
            .to(device))
Disable Tokenizers Parallelism
%env TOKENIZERS_PARALLELISM=false
env: TOKENIZERS_PARALLELISM=false
Initialize the Trainer object
from transformers import Trainer
trainer = Trainer(model_init=model_init, args=training_args,
                  data_collator=data_collator, compute_metrics=compute_metrics,
                  train_dataset=panx_de_encoded["train"],
                  eval_dataset=panx_de_encoded["validation"],
                  tokenizer=xlmr_tokenizer)
Run the training loop and push the final model to the Hugging Face Hub
trainer.train()
trainer.push_to_hub(commit_message="Training completed!")
<table border="1" class="dataframe">
Epoch
Training Loss
Validation Loss
F1
1
0.326400
0.162317
0.813909
2
0.136000
0.133068
0.845137
3
0.096000
0.131872
0.857581
'https://huggingface.co/cj-mills/xlm-roberta-base-finetuned-panx-de/commit/1ebdc3c9051a980588be5a495ad96896f330932c'
How to manually display the training log
trainer.state.log_history
[{'loss': 0.3264,
'learning_rate': 3.3671742808798645e-05,
'epoch': 0.99,
'step': 196},
{'eval_loss': 0.1623172014951706,
'eval_f1': 0.8139089269612262,
'eval_runtime': 7.0145,
'eval_samples_per_second': 896.714,
'eval_steps_per_second': 14.114,
'epoch': 1.0,
'step': 197},
{'loss': 0.136,
'learning_rate': 1.7174280879864637e-05,
'epoch': 1.99,
'step': 392},
{'eval_loss': 0.1330675333738327,
'eval_f1': 0.8451372416130125,
'eval_runtime': 6.8702,
'eval_samples_per_second': 915.543,
'eval_steps_per_second': 14.41,
'epoch': 2.0,
'step': 394},
{'loss': 0.096,
'learning_rate': 6.76818950930626e-07,
'epoch': 2.98,
'step': 588},
{'eval_loss': 0.13187244534492493,
'eval_f1': 0.8575809199318569,
'eval_runtime': 6.8965,
'eval_samples_per_second': 912.061,
'eval_steps_per_second': 14.355,
'epoch': 3.0,
'step': 591},
{'train_runtime': 95.0424,
'train_samples_per_second': 397.086,
'train_steps_per_second': 6.218,
'total_flos': 1039360955930616.0,
'train_loss': 0.18559023183211054,
'epoch': 3.0,
'step': 591}]
df = pd.DataFrame(trainer.state.log_history)[['epoch', 'loss', 'eval_loss', 'eval_f1']]
df = df.rename(columns={"epoch": "Epoch", "loss": "Training Loss", "eval_loss": "Validation Loss", "eval_f1": "F1"})
df['Epoch'] = df["Epoch"].apply(lambda x: round(x))
df['Training Loss'] = df["Training Loss"].ffill()
df[['Validation Loss', 'F1']] = df[['Validation Loss', 'F1']].bfill().ffill()
df.drop_duplicates()
Epoch | Training Loss | Validation Loss | F1 | |
---|---|---|---|---|
0 | 1 | 0.3264 | 0.162317 | 0.813909 |
2 | 2 | 0.1360 | 0.133068 | 0.845137 |
4 | 3 | 0.0960 | 0.131872 | 0.857581 |
Test the model on some sample text
= "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
text_de tag_text(text_de, tags, trainer.model, xlmr_tokenizer)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tokens | <s> | _Jeff | _De | an | _ist | _ein | _Informati | ker | _bei | _Google | _in | _Kaliforni | en | </s> |
Tags | O | B-PER | I-PER | I-PER | O | O | O | O | O | B-ORG | O | B-LOC | I-LOC | O |
Note: The fine-tuned model correctly identifies the entities in the sample text.
Error Analysis
- Error analysis is an effective tool to understand a model’s strengths and weaknesses.
- Looking at the errors can yield helpful insights and reveal bugs that would be hard to spot by only looking at the code.
- There are several failure modes where a model might appear to perform well but have serious flaws.
Failure Modes
- We might accidentally mask too many tokens, and also some labels, producing a deceptively promising drop in the loss.
- The metrics function might have a bug.
- We might include the `O` class (class ID 0), skewing the accuracy and \(F_{1}\)-score.
Define a function that returns the loss and predicted labels for a single batch
from torch.nn.functional import cross_entropy
def forward_pass_with_label(batch):
    # Convert dict of lists to list of dicts suitable for data collator
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    # Pad inputs and labels and put all tensors on device
    batch = data_collator(features)
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)
    with torch.no_grad():
        # Pass data through model
        output = trainer.model(input_ids, attention_mask)
        # Logit size: [batch_size, sequence_length, classes]
        # Predict class with largest logit value on classes axis
        predicted_label = torch.argmax(output.logits, axis=-1).cpu().numpy()
    # Calculate loss per token after flattening batch dimension with view
    loss = cross_entropy(output.logits.view(-1, 7),
                         labels.view(-1), reduction="none")
    # Unflatten batch dimension and convert to numpy array
    loss = loss.view(len(input_ids), -1).cpu().numpy()

    return {"loss": loss, "predicted_label": predicted_label}
Get the loss and predictions for the validation set
= panx_de_encoded["validation"]
valid_set = valid_set.map(forward_pass_with_label, batched=True, batch_size=32)
valid_set = valid_set.to_pandas() df
-100] = "IGN"
index2tag[# Map IDs to tokens
"input_tokens"] = df["input_ids"].apply(lambda x: xlmr_tokenizer.convert_ids_to_tokens(x))
df[# Map predicted label index to tag
"predicted_label"] = df["predicted_label"].apply(lambda x: [index2tag[i] for i in x])
df[# Map target label index to tag
"labels"] = df["labels"].apply(lambda x: [index2tag[i] for i in x])
df[# Remove padding for the loss field
'loss'] = df.apply(lambda x: x['loss'][:len(x['input_ids'])], axis=1)
df[# Remove padding for the predicted label field
'predicted_label'] = df.apply(lambda x: x['predicted_label'][:len(x['input_ids'])], axis=1) df[
df.head(1).T
0 | |
---|---|
attention_mask | [1, 1, 1, 1, 1, 1, 1] |
input_ids | [0, 10699, 11, 15, 16104, 1388, 2] |
labels | [IGN, B-ORG, IGN, I-ORG, I-ORG, I-ORG, IGN] |
loss | [0.0, 0.03210718, 0.0, 0.05737416, 0.0494957, 0.062034503, 0.0] |
predicted_label | [I-ORG, B-ORG, I-ORG, I-ORG, I-ORG, I-ORG, I-ORG] |
input_tokens | [<s>, Ham, a, (, Unternehmen, ), </s>] |
# Transform each element of a list-like to a row
df_tokens = df.apply(pd.Series.explode)
# Remove the tokens labeled with 'IGN'
df_tokens = df_tokens.query("labels != 'IGN'")
# Round loss values to two decimal places
df_tokens["loss"] = df_tokens["loss"].astype(float).round(2)
df_tokens.head(7).T.style.hide(axis='columns')
attention_mask | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
---|---|---|---|---|---|---|---|
input_ids | 10699 | 15 | 16104 | 1388 | 56530 | 83982 | 10 |
labels | B-ORG | I-ORG | I-ORG | I-ORG | O | B-ORG | I-ORG |
loss | 0.030000 | 0.060000 | 0.050000 | 0.060000 | 0.000000 | 0.600000 | 0.380000 |
predicted_label | B-ORG | I-ORG | I-ORG | I-ORG | O | B-ORG | I-ORG |
input_tokens | _Ham | _( | _Unternehmen | _) | _WE | _Luz | _a |
(
    # Group data by the input tokens
    df_tokens.groupby("input_tokens")[["loss"]]
    # Aggregate the losses for each token
    .agg(["count", "mean", "sum"])
    # Get rid of multi-level columns
    .droplevel(level=0, axis=1)
    # Sort values with the highest losses first
    .sort_values(by="sum", ascending=False)
    .reset_index()
    .round(2)
    .head(10)
    .T
)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
input_tokens | _ | _in | _von | _der | _/ | _und | _( | _) | _’’ | _A |
count | 6066 | 989 | 808 | 1388 | 163 | 1171 | 246 | 246 | 2898 | 125 |
mean | 0.03 | 0.11 | 0.14 | 0.07 | 0.51 | 0.07 | 0.28 | 0.27 | 0.02 | 0.47 |
sum | 187.46 | 110.59 | 110.46 | 100.7 | 83.81 | 83.13 | 69.48 | 67.49 | 59.03 | 58.63 |
Note:
- The whitespace token has the highest total loss since it is the most common token.
- The whitespace token has a low mean loss, indicating the model does not struggle to classify it.
- Words like “in,” “von,” “der,” and “und” often appear together with named entities and are sometimes part of them.
- It is rare to have parentheses, slashes, and capital letters at the beginning of words, but those have a relatively high average loss.
(
    # Group data by the label IDs
    df_tokens.groupby("labels")[["loss"]]
    .agg(["count", "mean", "sum"])
    .droplevel(level=0, axis=1)
    .sort_values(by="mean", ascending=False)
    .reset_index()
    .round(2)
    .T
)
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
labels | B-ORG | I-LOC | I-ORG | B-LOC | B-PER | I-PER | O |
count | 2683 | 1462 | 3820 | 3172 | 2893 | 4139 | 43648 |
mean | 0.59 | 0.59 | 0.42 | 0.34 | 0.3 | 0.18 | 0.03 |
sum | 1582.79 | 857.5 | 1598.29 | 1073.82 | 861.09 | 727.88 | 1419.61 |
Note: B-ORG has the highest average loss, meaning the model struggles to find the beginning of organization entities.
Plot a confusion matrix of the token classification
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt
def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()

plot_confusion_matrix(df_tokens["labels"], df_tokens["predicted_label"], tags.names)
Note: The model often confuses the beginning subword (`B-ORG`) of an organizational entity with the subsequent subwords (`I-ORG`).
Examine token sequences with high losses
def get_samples(df):
    # Iterate over DataFrame rows
    for _, row in df.iterrows():
        labels, preds, tokens, losses = [], [], [], []
        for i, mask in enumerate(row["attention_mask"]):
            if i not in {0, len(row["attention_mask"])}:
                labels.append(row["labels"][i])
                preds.append(row["predicted_label"][i])
                tokens.append(row["input_tokens"][i])
                losses.append(f"{row['loss'][i]:.2f}")
        df_tmp = pd.DataFrame({"tokens": tokens, "labels": labels,
                               "preds": preds, "losses": losses}).T
        yield df_tmp

df["total_loss"] = df["loss"].apply(sum)
df_tmp = df.sort_values(by="total_loss", ascending=False).head(3)

for sample in get_samples(df_tmp):
    display(sample)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tokens | _’ | _’’ | _Τ | Κ | _’’ | _’ | _’ | _’’ | _T | _’’ | _’ | ri | _’’ | _’ | k | _’’ | _’ | ala | </s> |
labels | O | O | O | IGN | O | O | B-LOC | I-LOC | I-LOC | I-LOC | I-LOC | IGN | I-LOC | I-LOC | IGN | I-LOC | I-LOC | IGN | IGN |
preds | O | O | B-ORG | O | O | O | O | O | O | O | O | O | O | O | O | O | O | O | O |
losses | 0.00 | 0.00 | 2.42 | 0.00 | 0.00 | 0.00 | 9.83 | 9.15 | 7.60 | 6.55 | 6.66 | 0.00 | 5.83 | 6.83 | 0.00 | 7.26 | 7.44 | 0.00 | 0.00 |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tokens | _’’ | 8 | . | _Juli | _’’ | _: | _Protest | camp | _auf | _dem | _Gelände | _der | _Republika | n | ischen | _Gar | de | </s> |
labels | B-ORG | IGN | IGN | I-ORG | I-ORG | I-ORG | I-ORG | IGN | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | IGN | IGN | I-ORG | IGN | IGN |
preds | O | O | O | O | O | O | O | O | O | O | O | O | B-ORG | I-ORG | I-ORG | I-ORG | I-ORG | O |
losses | 8.37 | 0.00 | 0.00 | 4.67 | 9.00 | 8.87 | 6.17 | 0.00 | 7.98 | 8.33 | 7.00 | 4.32 | 2.61 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tokens | _United | _Nations | _Multi | dimensional | _Integra | ted | _Stabil | ization | _Mission | _in | _the | _Central | _African | _Republic | </s> |
labels | B-PER | I-PER | I-PER | IGN | I-PER | IGN | I-PER | IGN | I-PER | I-PER | I-PER | I-PER | I-PER | I-PER | IGN |
preds | B-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG |
losses | 5.46 | 5.36 | 5.51 | 0.00 | 5.53 | 0.00 | 5.46 | 0.00 | 5.06 | 5.22 | 5.62 | 5.71 | 5.36 | 5.09 | 0.00 |
Note:
- The PAN-X dataset used an imperfect automated process to apply annotations, resulting in some labeling issues.
- The United Nations and the Central African Republic are organizations, not people.
- The date “8. Juli” (July 8th) also has an incorrect label.
Examine sequences with an opening parenthesis
u"\u2581("
'_('
df_tmp = df.loc[df["input_tokens"].apply(lambda x: u"\u2581(" in x)].head(2)
for sample in get_samples(df_tmp):
    display(sample)
- We generally don't include the parentheses and their contents as part of the named entity, but the automated annotation process does.
- Some parentheses contain a geographic specification.
- We might want to disconnect this information from the original location in the annotations.
- The dataset consists of Wikipedia articles in different languages, and the article titles often contain an explanation in parentheses.
- We need to know about such characteristics in our datasets when rolling out models to production.
- We can use these insights to clean up the dataset and retrain the model.
Cross-Lingual Transfer
Create a helper function to evaluate the model on different datasets
def get_f1_score(trainer, dataset):
    return trainer.predict(dataset).metrics["test_f1"]
Examine the German model’s performance on the German test set
f1_scores = defaultdict(dict)
f1_scores["de"]["de"] = get_f1_score(trainer, panx_de_encoded["test"])
print(f"F1-score of [de] model on [de] dataset: {f1_scores['de']['de']:.3f}")
F1-score of [de] model on [de] dataset: 0.859
Test the German model’s performance on French text
= "Jeff Dean est informaticien chez Google en Californie"
text_fr tag_text(text_fr, tags, trainer.model, xlmr_tokenizer)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tokens | <s> | _Jeff | _De | an | _est | _informatic | ien | _chez | _Google | _en | _Cali | for | nie | </s> |
Tags | O | B-PER | I-PER | I-PER | O | O | O | O | B-ORG | O | B-LOC | B-LOC | I-LOC | O |
Note: The model correctly labeled the French translation of “Kalifornien” as a location.
Define a helper function to encode a dataset and generate a classification report
def evaluate_lang_performance(lang, trainer):
    panx_ds = encode_panx_dataset(panx_ch[lang])
    return get_f1_score(trainer, panx_ds["test"])
Evaluate the German model’s performance on the French test set
"de"]["fr"] = evaluate_lang_performance("fr", trainer)
f1_scores[print(f"F1-score of [de] model on [fr] dataset: {f1_scores['de']['fr']:.3f}")
F1-score of [de] model on [fr] dataset: 0.708
Note: The German model still performs relatively well despite not training on a single labeled French example.
Evaluate the German model’s performance on the Italian test set
"de"]["it"] = evaluate_lang_performance("it", trainer)
f1_scores[print(f"F1-score of [de] model on [it] dataset: {f1_scores['de']['it']:.3f}")
F1-score of [de] model on [it] dataset: 0.691
Evaluate the German model’s performance on the English test set
f1_scores["de"]["en"] = evaluate_lang_performance("en", trainer)
print(f"F1-score of [de] model on [en] dataset: {f1_scores['de']['en']:.3f}")
F1-score of [de] model on [en] dataset: 0.596
Note: The model performs worse on the English dataset, even though German is intuitively closer to English than to French.
When Does Zero-Shot Transfer Make Sense?
- We can determine at which point zero-shot cross-lingual transfer is superior to fine-tuning on a monolingual corpus by fine-tuning the model on training sets of increasing size.
Define a function to train a model on a downsampled dataset
def train_on_subset(dataset, num_samples):
    # Downsample the training set to the target number of samples
    train_ds = dataset["train"].shuffle(seed=42).select(range(num_samples))
    valid_ds = dataset["validation"]
    test_ds = dataset["test"]
    training_args.logging_steps = len(train_ds) // batch_size
    # Train the model on the downsampled dataset
    trainer = Trainer(model_init=model_init, args=training_args,
                      data_collator=data_collator, compute_metrics=compute_metrics,
                      train_dataset=train_ds, eval_dataset=valid_ds,
                      tokenizer=xlmr_tokenizer)
    trainer.train()
    if training_args.push_to_hub:
        trainer.push_to_hub(commit_message="Training completed!")
    # Return the performance metrics
    f1_score = get_f1_score(trainer, test_ds)
    return pd.DataFrame.from_dict(
        {"num_samples": [len(train_ds)], "f1_score": [f1_score]})
Encode the French Dataset
= encode_panx_dataset(panx_ch["fr"]) panx_fr_encoded
Train the model on 250 French samples
training_args.push_to_hub = False
metrics_df = train_on_subset(panx_fr_encoded, 250)
metrics_df
<table border="1" class="dataframe">
Epoch
Training Loss
Validation Loss
F1
1
2.360800
2.210924
0.109819
2
2.192800
1.484458
0.032251
3
1.480200
1.368229
0.008093
num_samples | f1_score | |
---|---|---|
0 | 250 | 0.007832 |
Note: The French model significantly underperforms the German model when using only 250 examples.
Train the model on an increasing number of French samples
for num_samples in [500, 1000, 2000, 4000]:
    metrics_df = metrics_df.append(
        train_on_subset(panx_fr_encoded, num_samples), ignore_index=True)
<table border="1" class="dataframe">
Epoch
Training Loss
Validation Loss
F1
1
2.204900
1.488627
0.023411
2
1.465700
1.249257
0.144914
3
1.217400
1.093112
0.161066
<table border="1" class="dataframe">
Epoch
Training Loss
Validation Loss
F1
1
1.800700
1.176858
0.175238
2
1.016500
0.698511
0.578441
3
0.669500
0.540801
0.639231
Epoch | Training Loss | Validation Loss | F1 |
---|---|---|---|
1 | 1.413700 | 0.686913 | 0.531559 |
2 | 0.526900 | 0.386696 | 0.741683 |
3 | 0.318900 | 0.352989 | 0.771843 |
Epoch | Training Loss | Validation Loss | F1 |
---|---|---|---|
1 | 0.895500 | 0.371288 | 0.757611 |
2 | 0.324200 | 0.327193 | 0.777248 |
3 | 0.243800 | 0.284226 | 0.822527 |
fig, ax = plt.subplots()
ax.axhline(f1_scores["de"]["fr"], ls="--", color="r")
metrics_df.set_index("num_samples").plot(ax=ax)
plt.legend(["Zero-shot from de", "Fine-tuned on fr"], loc="lower right")
plt.ylim((0, 1))
plt.xlabel("Number of Training Samples")
plt.ylabel("F1 Score")
plt.show()
Note:
- The zero-shot transfer model remains competitive until about 1500 training examples.
- Getting domain experts to label hundreds (let alone thousands) of documents can be costly, especially for NER.
Fine-Tuning on Multiple Languages at Once
- We can mitigate the performance drop from zero-shot cross-lingual transfer by fine-tuning with multiple languages at once.
from datasets import concatenate_datasets
concatenate_datasets
- Documentation
- Convert a list of Dataset objects with the same schema into a single Dataset.
Define a function to combine a list of datasets based on split names
def concatenate_splits(corpora):
    multi_corpus = DatasetDict()
    for split in corpora[0].keys():
        multi_corpus[split] = concatenate_datasets(
            [corpus[split] for corpus in corpora]).shuffle(seed=42)
    return multi_corpus
Combine the German and French datasets
panx_de_fr_encoded = concatenate_splits([panx_de_encoded, panx_fr_encoded])
Update training attributes
= len(panx_de_fr_encoded["train"]) // batch_size
training_args.logging_steps = True
training_args.push_to_hub = "xlm-roberta-base-finetuned-panx-de-fr" training_args.output_dir
Train the model on the combined dataset
trainer = Trainer(model_init=model_init, args=training_args,
                  data_collator=data_collator, compute_metrics=compute_metrics,
                  tokenizer=xlmr_tokenizer, train_dataset=panx_de_fr_encoded["train"],
                  eval_dataset=panx_de_fr_encoded["validation"])

trainer.train()
Epoch | Training Loss | Validation Loss | F1 |
---|---|---|---|
1 | 0.371800 | 0.176133 | 0.822272 |
2 | 0.153500 | 0.160763 | 0.840360 |
3 | 0.107400 | 0.157969 | 0.854692 |
TrainOutput(global_step=807, training_loss=0.2103209033368393, metrics={'train_runtime': 129.7072, 'train_samples_per_second': 396.894, 'train_steps_per_second': 6.222, 'total_flos': 1399867154966784.0, 'train_loss': 0.2103209033368393, 'epoch': 3.0})
="Training completed!") trainer.push_to_hub(commit_message
'https://huggingface.co/cj-mills/xlm-roberta-base-finetuned-panx-de-fr/commit/e93b59a0d16dc03a657342fd9bf31413af9aebc1'
for lang in langs:
    f1 = evaluate_lang_performance(lang, trainer)
    print(f"F1-score of [de-fr] model on [{lang}] dataset: {f1:.3f}")
F1-score of [de-fr] model on [de] dataset: 0.862
F1-score of [de-fr] model on [fr] dataset: 0.848
F1-score of [de-fr] model on [it] dataset: 0.793
F1-score of [de-fr] model on [en] dataset: 0.688
Note: The model now performs much better on the French split, and it even improved on the Italian and English sets.
Test the performance from fine-tuning on each language separately
corpora = [panx_de_encoded]

# Exclude German from iteration
for lang in langs[1:]:
    training_args.output_dir = f"xlm-roberta-base-finetuned-panx-{lang}"
    # Fine-tune on monolingual corpus
    ds_encoded = encode_panx_dataset(panx_ch[lang])
    metrics = train_on_subset(ds_encoded, ds_encoded["train"].num_rows)
    # Collect F1-scores in common dict
    f1_scores[lang][lang] = metrics["f1_score"][0]
    # Add monolingual corpus to list of corpora to concatenate
    corpora.append(ds_encoded)
<table border="1" class="dataframe">
Epoch
Training Loss
Validation Loss
F1
1
0.854100
0.352915
0.782609
2
0.306900
0.280733
0.815359
3
0.226200
0.271911
0.829342
<table border="1" class="dataframe">
Epoch
Training Loss
Validation Loss
F1
1
1.454800
0.652213
0.545667
2
0.521400
0.347615
0.740443
3
0.318600
0.292827
0.773021
<table border="1" class="dataframe">
Epoch
Training Loss
Validation Loss
F1
1
1.711900
1.000937
0.226577
2
0.891000
0.640531
0.528084
3
0.602300
0.508421
0.579369
Test the performance from multilingual learning on all the corpora
corpora_encoded = concatenate_splits(corpora)
= len(corpora_encoded["train"]) // batch_size
training_args.logging_steps = "xlm-roberta-base-finetuned-panx-all"
training_args.output_dir
= Trainer(model_init=model_init, args=training_args,
trainer =data_collator, compute_metrics=compute_metrics,
data_collator=xlmr_tokenizer, train_dataset=corpora_encoded["train"],
tokenizer=corpora_encoded["validation"])
eval_dataset
trainer.train()="Training completed!") trainer.push_to_hub(commit_message
Cloning https://huggingface.co/cj-mills/xlm-roberta-base-finetuned-panx-all into local empty directory.
<table border="1" class="dataframe">
Epoch
Training Loss
Validation Loss
F1
1
0.370100
0.200005
0.805385
2
0.162900
0.168012
0.837781
3
0.115600
0.167403
0.847745
'https://huggingface.co/cj-mills/xlm-roberta-base-finetuned-panx-all/commit/f01950fd63b31959f5c3d520125366485fb375b6'
Generate predictions on the test set for each language
for idx, lang in enumerate(langs):
    f1_scores["all"][lang] = get_f1_score(trainer, corpora[idx]["test"])
= {"de": f1_scores["de"],
scores_data "each": {lang: f1_scores[lang][lang] for lang in langs},
"all": f1_scores["all"]}
= pd.DataFrame(scores_data).T.round(4)
f1_scores_df ="Fine-tune on", columns="Evaluated on",
f1_scores_df.rename_axis(index=True)
inplace f1_scores_df
Evaluated on | de | fr | it | en |
---|---|---|---|---|
Fine-tune on | ||||
de | 0.8590 | 0.7079 | 0.6910 | 0.5962 |
each | 0.8590 | 0.8321 | 0.7696 | 0.5962 |
all | 0.8592 | 0.8568 | 0.8646 | 0.7678 |
Note:
- Multilingual learning can provide significant performance gains.
- You should generally focus your attention on cross-lingual transfer within language families.
References