Notes on Transformers Book Ch. 4

ai
huggingface
nlp
notes
Chapter 4 covers fine-tuning a multilingual transformer model to perform named entity recognition.
Published

April 7, 2022


import transformers
import datasets
import accelerate

# Only print error messages
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()

transformers.__version__, datasets.__version__, accelerate.__version__
    ('4.11.3', '1.16.1', '0.5.1')

import ast
# https://astor.readthedocs.io/en/latest/
import astor
import inspect
import textwrap
def print_source(obj, exclude_doc=True):
    
    # Get source code
    source = inspect.getsource(obj)
    # Remove any common leading whitespace from every line
    cleaned_source = textwrap.dedent(source)
    # Parse the source into an AST node.
    parsed = ast.parse(cleaned_source)

    for node in ast.walk(parsed):
        # Skip any nodes that are not class or function definitions
        if not isinstance(node, (ast.FunctionDef, ast.ClassDef, ast.AsyncFunctionDef)):
            continue
        
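        # Drop the first statement only when it is actually a docstring
        if exclude_doc and ast.get_docstring(node) is not None and len(node.body) > 1:
            node.body = node.body[1:]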
        
    print(astor.to_source(parsed))

Introduction

  • Non-English pretrained models typically exist only for languages like German, Russian, or Mandarin, where plenty of web text is available for pretraining.
  • Avoid maintaining multiple monolingual models in production when possible.
  • Transformer models pretrained on large corpora across many languages can perform zero-shot cross-lingual transfer.
    • We can fine-tune a model using one language and apply it to others without further training.
  • Multilingual transformers are well-suited for situations where a speaker alternates between two or more languages in the context of a single conversation.

Project: Multilingual Named Entity Recognition

  • The goal is to fine-tune the transformer model XLM-RoBERTa to perform named entity recognition for a customer in Switzerland, where there are four national languages.
    • We will use German, French, Italian, and English as the four languages.
  • Named entity recognition involves extracting real-world objects like products, places, and people from a piece of text.
    • Some potential NER applications include gaining insights from company documents, augmenting the quality of search engines, and building a structured database from a corpus.

The Dataset

WikiAnn (a.k.a. PAN-X)


import pandas as pd
pd.set_option('max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# An example of a sequence annotated with named entities in IOB2 format
toks = "Jeff Dean is a computer scientist at Google in California".split()
lbls = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
pd.DataFrame(data=[toks, lbls], index=['Tokens', 'Tags'])
            0      1      2   3  4         5          6   7       8   9
    Tokens  Jeff   Dean   is  a  computer  scientist  at  Google  in  California
    Tags    B-PER  I-PER  O   O  O         O          O   B-ORG   O   B-LOC
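
In IOB2, a B- prefix marks the first token of an entity, an I- prefix marks a token that continues the same entity, and O marks a token outside any entity. As a minimal sketch, the hypothetical helper iob2_to_entities below (not part of the book or of any library) groups the example tokens back into entities:

```python
def iob2_to_entities(tokens, tags):
    """Group IOB2-tagged tokens into (entity_text, entity_type) pairs."""
    entities, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


toks = "Jeff Dean is a computer scientist at Google in California".split()
lbls = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
print(iob2_to_entities(toks, lbls))
# [('Jeff Dean', 'PER'), ('Google', 'ORG'), ('California', 'LOC')]
```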

from datasets import get_dataset_config_names

get_dataset_config_names

  • Documentation
  • Get the list of available configuration names for a particular dataset.

print_source(get_dataset_config_names)
    def get_dataset_config_names(path: str, revision: Optional[Union[str,
        Version]]=None, download_config: Optional[DownloadConfig]=None,
        download_mode: Optional[GenerateMode]=None, force_local_path: Optional[
        str]=None, dynamic_modules_path: Optional[str]=None, data_files:
        Optional[Union[Dict, List, str]]=None, **download_kwargs):
        dataset_module = dataset_module_factory(path, revision=revision,
            download_config=download_config, download_mode=download_mode,
            force_local_path=force_local_path, dynamic_modules_path=
            dynamic_modules_path, data_files=data_files, **download_kwargs)
        builder_cls = import_main_class(dataset_module.module_path)
        return list(builder_cls.builder_configs.keys()) or [dataset_module.
            builder_kwargs.get('name', 'default')]

xtreme Hugging Face Dataset Card

# Get the names of the subsets for the XTREME dataset
xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")
    XTREME has 183 configurations
pd.DataFrame(xtreme_subsets).T.style.hide(axis='columns').hide(axis='index')
XNLI tydiqa SQuAD PAN-X.af PAN-X.ar PAN-X.bg PAN-X.bn PAN-X.de PAN-X.el PAN-X.en PAN-X.es PAN-X.et PAN-X.eu PAN-X.fa PAN-X.fi PAN-X.fr PAN-X.he PAN-X.hi PAN-X.hu PAN-X.id PAN-X.it PAN-X.ja PAN-X.jv PAN-X.ka PAN-X.kk PAN-X.ko PAN-X.ml PAN-X.mr PAN-X.ms PAN-X.my PAN-X.nl PAN-X.pt PAN-X.ru PAN-X.sw PAN-X.ta PAN-X.te PAN-X.th PAN-X.tl PAN-X.tr PAN-X.ur PAN-X.vi PAN-X.yo PAN-X.zh MLQA.ar.ar MLQA.ar.de MLQA.ar.vi MLQA.ar.zh MLQA.ar.en MLQA.ar.es MLQA.ar.hi MLQA.de.ar MLQA.de.de MLQA.de.vi MLQA.de.zh MLQA.de.en MLQA.de.es MLQA.de.hi MLQA.vi.ar MLQA.vi.de MLQA.vi.vi MLQA.vi.zh MLQA.vi.en MLQA.vi.es MLQA.vi.hi MLQA.zh.ar MLQA.zh.de MLQA.zh.vi MLQA.zh.zh MLQA.zh.en MLQA.zh.es MLQA.zh.hi MLQA.en.ar MLQA.en.de MLQA.en.vi MLQA.en.zh MLQA.en.en MLQA.en.es MLQA.en.hi MLQA.es.ar MLQA.es.de MLQA.es.vi MLQA.es.zh MLQA.es.en MLQA.es.es MLQA.es.hi MLQA.hi.ar MLQA.hi.de MLQA.hi.vi MLQA.hi.zh MLQA.hi.en MLQA.hi.es MLQA.hi.hi XQuAD.ar XQuAD.de XQuAD.vi XQuAD.zh XQuAD.en XQuAD.es XQuAD.hi XQuAD.el XQuAD.ru XQuAD.th XQuAD.tr bucc18.de bucc18.fr bucc18.zh bucc18.ru PAWS-X.de PAWS-X.en PAWS-X.es PAWS-X.fr PAWS-X.ja PAWS-X.ko PAWS-X.zh tatoeba.afr tatoeba.ara tatoeba.ben tatoeba.bul tatoeba.deu tatoeba.cmn tatoeba.ell tatoeba.est tatoeba.eus tatoeba.fin tatoeba.fra tatoeba.heb tatoeba.hin tatoeba.hun tatoeba.ind tatoeba.ita tatoeba.jav tatoeba.jpn tatoeba.kat tatoeba.kaz tatoeba.kor tatoeba.mal tatoeba.mar tatoeba.nld tatoeba.pes tatoeba.por tatoeba.rus tatoeba.spa tatoeba.swh tatoeba.tam tatoeba.tel tatoeba.tgl tatoeba.tha tatoeba.tur tatoeba.urd tatoeba.vie udpos.Afrikaans udpos.Arabic udpos.Basque udpos.Bulgarian udpos.Dutch udpos.English udpos.Estonian udpos.Finnish udpos.French udpos.German udpos.Greek udpos.Hebrew udpos.Hindi udpos.Hungarian udpos.Indonesian udpos.Italian udpos.Japanese udpos.Kazakh udpos.Korean udpos.Chinese udpos.Marathi udpos.Persian udpos.Portuguese udpos.Russian udpos.Spanish udpos.Tagalog udpos.Tamil udpos.Telugu udpos.Thai udpos.Turkish udpos.Urdu 
udpos.Vietnamese udpos.Yoruba

Note: We are only interested in the PAN-X subsets for this project.


# Look for configuration names that start with 'PAN'
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
len(panx_subsets), panx_subsets[:3]
    (40, ['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg'])

Note:

  • There are 40 PAN-X subsets.
  • Each subset has a two-letter suffix indicating the ISO 639-1 language code.
    • German (de)
    • French (fr)
    • Italian (it)
    • English (en)


from datasets import load_dataset
from collections import defaultdict
from datasets import DatasetDict
# Specify the desired language codes
langs = ["de", "fr", "it", "en"]
# Specify the fraction each language should contribute to the total dataset
fracs = [0.629, 0.229, 0.084, 0.059]

Note:

  • These fractions represent the spoken proportions for each language in Switzerland.
  • This language imbalance simulates the common situation where acquiring labeled examples in a minority language is cost-prohibitive.


Dataset.shuffle

Dataset.select

  • Documentation
  • Create a new dataset with rows selected following the list/array of indices.

# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            # Shuffle the dataset split rows
            .shuffle(seed=0)
            # Select subset of dataset split
            .select(range(int(frac * ds[split].num_rows))))
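
The shuffle-then-select pattern can be sketched without the datasets library. The helper downsample below is hypothetical and uses Python's random module, so the row order it produces differs from what Dataset.shuffle(seed=0) would return; only the idea of a seeded shuffle followed by keeping the first int(frac * n) rows carries over:

```python
import random


def downsample(rows, frac, seed=0):
    """Shuffle a copy of rows with a fixed seed, then keep the first int(frac * n) rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[: int(frac * len(shuffled))]


rows = list(range(1000))  # stand-in for the rows of one dataset split
fracs = {"de": 0.629, "fr": 0.229, "it": 0.084, "en": 0.059}
sizes = {lang: len(downsample(rows, frac)) for lang, frac in fracs.items()}
print(sizes)
```

Because the seed is fixed, repeated calls return the same subset, which keeps the downsampled splits reproducible.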

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs}, index=["Number of training examples"])
                                 de     fr    it    en
    Number of training examples  12580  4580  1680  1180

train_size = sum([panx_ch[lang]['train'].num_rows for lang in langs])
train_size
    20020

pd.DataFrame(
    {lang: [panx_ch[lang]["train"].num_rows, 
            f'{panx_ch[lang]["train"].num_rows/train_size*100:.2f}%'] for lang in langs
    }, index=["Number of training examples", "Proportion of Dataset"])
                                 de      fr      it     en
    Number of training examples  12580   4580    1680   1180
    Proportion of Dataset        62.84%  22.88%  8.39%  5.89%

for lang in langs: print(panx_ch[lang]["train"])
    Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 12580
    })
    Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 4580
    })
    Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 1680
    })
    Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 1180
    })

element = panx_ch["de"]["train"][0]
pd.DataFrame(element).T
              0      1           2   3    4         5      6   7    8           9             10       11
    tokens    2.000  Einwohnern  an  der  Danziger  Bucht  in  der  polnischen  Woiwodschaft  Pommern  .
    ner_tags  0      0           0   0    5         6      0   0    5           5             6        0
    langs     de     de          de  de   de        de     de  de   de          de            de       de

Note:

  • The German text translates to “2,000 inhabitants at the Gdansk Bay in the Polish voivodeship of Pomerania.”
    • Gdansk Bay is a bay in the Baltic Sea.
    • The word “voivodeship” corresponds to a state in Poland.
  • The ner_tags column corresponds to the mapping of each entity to a class ID.
  • The Dataset object has a “features” attribute that specifies the underlying data types associated with each column.

tags = panx_ch["de"]["train"].features["ner_tags"].feature
tags
    ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], names_file=None, id=None)

tags.names
    ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

Map the class IDs to the corresponding tag names

def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

panx_de = panx_ch["de"].map(create_tag_names)
pd.DataFrame(panx_de["train"][0]).reindex(columns=["tokens", "ner_tags_str","ner_tags","langs"]).T
                  0      1           2   3    4         5      6   7    8           9             10       11
    tokens        2.000  Einwohnern  an  der  Danziger  Bucht  in  der  polnischen  Woiwodschaft  Pommern  .
    ner_tags_str  O      O           O   O    B-LOC     I-LOC  O   O    B-LOC       B-LOC         I-LOC    O
    ner_tags      0      0           0   0    5         6      0   0    5           5             6        0
    langs         de     de          de  de   de        de     de  de   de          de            de       de

from collections import Counter

Calculate the frequencies of each entity across each split

split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")
                ORG   LOC   PER
    validation  2683  3172  2893
    test        2573  3180  3071
    train       5366  6186  5810

Note: The distributions of the entity frequencies are roughly the same for each split.
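
One way to check the note above is to normalize the counts within each split. The sketch below hard-codes the counts from the table and is only a quick sanity check, not part of the original notebook:

```python
counts = {
    "validation": {"ORG": 2683, "LOC": 3172, "PER": 2893},
    "test": {"ORG": 2573, "LOC": 3180, "PER": 3071},
    "train": {"ORG": 5366, "LOC": 6186, "PER": 5810},
}

# Convert raw counts into per-split proportions
proportions = {
    split: {tag: n / sum(freqs.values()) for tag, n in freqs.items()}
    for split, freqs in counts.items()
}

for split, props in proportions.items():
    print(split, {tag: f"{p:.1%}" for tag, p in props.items()})

# Largest gap between splits for each entity type
spread = {
    tag: max(p[tag] for p in proportions.values()) - min(p[tag] for p in proportions.values())
    for tag in ("ORG", "LOC", "PER")
}
print(spread)
```

Each entity type's share differs by less than two percentage points between splits.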


Multilingual Transformers

  • Multilingual transformers use a corpus consisting of documents in many languages for pretraining.
    • The models do not receive any explicit information to differentiate among languages.
  • The resulting linguistic representations generalize well across languages for many downstream tasks.
  • Many use the CoNLL-2002 and CoNLL-2003 datasets as benchmarks to measure the progress of cross-lingual transfer for named entity recognition for English, Dutch, Spanish, and German.

Evaluation Methods

  1. en: Fine-tune using the English training data and then evaluate the model on each language’s test set.
  2. each: Fine-tune and evaluate using monolingual test data to measure per-language performance.
  3. all: Fine-tune using all the training data, then evaluate the model on each language’s test set.

A Closer Look at Tokenization

  • XLM-RoBERTa uses the SentencePiece subword tokenizer instead of the WordPiece tokenizer.

from transformers import AutoTokenizer
bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

Compare the WordPiece and SentencePiece tokenizers

text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()
pd.DataFrame([bert_tokens, xlmr_tokens], index=["BERT", "XLM-R"])
           0      1      2      3       4      5    6     7      8      9
    BERT   [CLS]  Jack   Spa    ##rrow  loves  New  York  !      [SEP]  None
    XLM-R  <s>    ▁Jack  ▁Spar  row     ▁love  s    ▁New  ▁York  !      </s>

Note: XLM-R's SentencePiece tokenizer uses <s> and </s> to mark the start and end of a sequence.


The Tokenizer Pipeline

1. Normalization

  • The normalization step includes the operations to clean up raw text, such as stripping whitespace and removing accented characters.
  • Unicode normalization schemes replace the various ways to write the same character with standard forms.
    • Unicode normalization is particularly effective when working with multilingual corpora.
  • Lowercasing can help reduce the vocabulary size when the model only accepts and uses lowercase characters.
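
A minimal sketch of these normalization operations using Python's standard unicodedata module. The function normalize is a hypothetical illustration and does not reflect the configuration of any particular pretrained tokenizer (XLM-R, for instance, is a cased model):

```python
import unicodedata


def normalize(text):
    """Strip whitespace, lowercase, and remove accents via NFD decomposition."""
    text = text.strip().lower()
    # NFD splits an accented character into a base character plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep the base characters
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize("  Zürich Café  "))  # zurich cafe

# Two ways to write "é" become identical after Unicode normalization
composed, combining = "\u00e9", "e\u0301"
print(composed == combining)                                # False
print(unicodedata.normalize("NFC", combining) == composed)  # True
```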

2. Pretokenization

  • The pre-tokenization step splits a text into smaller objects, and the final tokens will be subunits of these objects.
  • Some languages might require language-specific pre-tokenization methods.

3. Tokenizer Model

  • The tokenizer model analyzes the training corpus to find the most commonly occurring groups of characters, which become the vocab.

4. Postprocessing

  • The postprocessing step applies some additional transformations, such as adding special tokens to the start or end of an input sequence.

The SentencePiece Tokenizer

  • The SentencePiece tokenizer builds on the Unigram subword segmentation algorithm and encodes each input text as a sequence of Unicode characters.
  • SentencePiece supports the byte-pair-encoding (BPE) algorithm and the unigram language model.
  • SentencePiece replaces whitespace with the Unicode symbol U+2581 (▁), also called the lower one-eighth block character.

"".join(xlmr_tokens).replace(u"\u2581", " ")
    '<s> Jack Sparrow loves New York!</s>'
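
Because whitespace is stored inside the tokens, detokenization is lossless. The sketch below illustrates the idea in plain Python. The helper names are hypothetical, and it assumes a marker is also placed at the start of the text, as the XLM-R tokens above suggest; the real SentencePiece library makes that behavior configurable:

```python
WS = "\u2581"  # "▁", the marker SentencePiece uses in place of a space


def mark_whitespace(text):
    """Prefix the text with the marker and replace every space with it."""
    return WS + text.replace(" ", WS)


def detokenize(pieces):
    """Join subword pieces and turn the markers back into spaces."""
    return "".join(pieces).replace(WS, " ").lstrip(" ")


text = "Jack Sparrow loves New York!"
pieces = [WS + "Jack", WS + "Spar", "row", WS + "love", "s", WS + "New", WS + "York", "!"]

print(mark_whitespace(text))
print(detokenize(pieces))  # Jack Sparrow loves New York!
```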

Transformers for Named Entity Recognition

  • For text classification, BERT uses the [CLS] token to represent an entire sequence of text.
  • For named entity recognition, BERT feeds the representation of each input token through the same fully connected layer to output the entity of each one.
    • We can assign the entity label to the first subword of a word and ignore the rest.

The Anatomy of the Transformers Model Class

  • The Hugging Face Transformers library has dedicated classes for each architecture and task.
  • We can extend existing models for specific use cases with little overhead.

Bodies and Heads

  • Hugging Face Transformers splits model architectures into a body and head.
  • The body is task-agnostic, and the model head is unique to a specific downstream task.

Creating a Custom Model for Token Classification


import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel

TokenClassifierOutput

  • Documentation
  • A base class for outputs of token classification models.
print_source(TokenClassifierOutput)
    @dataclass
    class TokenClassifierOutput(ModelOutput):
        loss: Optional[torch.FloatTensor] = None
        logits: torch.FloatTensor = None
        hidden_states: Optional[Tuple[torch.FloatTensor]] = None
        attentions: Optional[Tuple[torch.FloatTensor]] = None

ModelOutput

RobertaModel

  • Documentation
  • A bare RoBERTa Model transformer outputting raw hidden-states without any specific head on top.

RobertaPreTrainedModel

  • Source Code
  • An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

XLMRobertaConfig()
    XLMRobertaConfig {
      "attention_probs_dropout_prob": 0.1,
      "bos_token_id": 0,
      "classifier_dropout": null,
      "eos_token_id": 2,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 512,
      "model_type": "xlm-roberta",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "pad_token_id": 1,
      "position_embedding_type": "absolute",
      "transformers_version": "4.11.3",
      "type_vocab_size": 2,
      "use_cache": true,
      "vocab_size": 30522
    }

class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
    # Use the standard XLM-RoBERTa settings.
    config_class = XLMRobertaConfig

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        # Load model body
        # Set add_pooling_layer to False to get all hidden states in the output
        self.roberta = RobertaModel(config, add_pooling_layer=False)
        # Set up token classification head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Load and initialize weights
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, 
                labels=None, **kwargs):
        # Use model body to get encoder representations
        outputs = self.roberta(input_ids, attention_mask=attention_mask,
                               token_type_ids=token_type_ids, **kwargs)
        # Apply classifier to encoder representation
        sequence_output = self.dropout(outputs[0])
        logits = self.classifier(sequence_output)
        # Calculate losses
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        # Return model output object
        return TokenClassifierOutput(loss=loss, logits=logits, 
                                     hidden_states=outputs.hidden_states, 
                                     attentions=outputs.attentions)

Loading a Custom Model

  • We need to provide the tags for labeling each entity and mappings to convert between tags and IDs.

Define the mappings to convert between tags and index IDs

index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}
index2tag, tag2index
    ({0: 'O',
      1: 'B-PER',
      2: 'I-PER',
      3: 'B-ORG',
      4: 'I-ORG',
      5: 'B-LOC',
      6: 'I-LOC'},
     {'O': 0,
      'B-PER': 1,
      'I-PER': 2,
      'B-ORG': 3,
      'I-ORG': 4,
      'B-LOC': 5,
      'I-LOC': 6})

from transformers import AutoConfig

Override the default parameters stored in XLMRobertaConfig

xlmr_config = AutoConfig.from_pretrained(xlmr_model_name, 
                                         num_labels=tags.num_classes,
                                         id2label=index2tag, label2id=tag2index)
xlmr_config
    XLMRobertaConfig {
      "architectures": [
        "XLMRobertaForMaskedLM"
      ],
      "attention_probs_dropout_prob": 0.1,
      "bos_token_id": 0,
      "classifier_dropout": null,
      "eos_token_id": 2,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "id2label": {
        "0": "O",
        "1": "B-PER",
        "2": "I-PER",
        "3": "B-ORG",
        "4": "I-ORG",
        "5": "B-LOC",
        "6": "I-LOC"
      },
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "label2id": {
        "B-LOC": 5,
        "B-ORG": 3,
        "B-PER": 1,
        "I-LOC": 6,
        "I-ORG": 4,
        "I-PER": 2,
        "O": 0
      },
      "layer_norm_eps": 1e-05,
      "max_position_embeddings": 514,
      "model_type": "xlm-roberta",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "output_past": true,
      "pad_token_id": 1,
      "position_embedding_type": "absolute",
      "transformers_version": "4.11.3",
      "type_vocab_size": 1,
      "use_cache": true,
      "vocab_size": 250002
    }

import torch

Load a pretrained XLM-RoBERTa model with the custom classification head and configuration parameters

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xlmr_model = (XLMRobertaForTokenClassification
              .from_pretrained(xlmr_model_name, config=xlmr_config)
              .to(device))

Encode some sample text

text
    'Jack Sparrow loves New York!'

input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], index=["Tokens", "Input IDs"])
               0    1      2      3      4      5  6     7      8   9
    Tokens     <s>  ▁Jack  ▁Spar  row    ▁love  s  ▁New  ▁York  !   </s>
    Input IDs  0    21763  37456  15555  5161   7  2356  5753   38  2

Test model predictions with the untrained classifier

outputs = xlmr_model(input_ids.to(device)).logits
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"Shape of outputs: {outputs.shape}")
    Number of tokens in sequence: 10
    Shape of outputs: torch.Size([1, 10, 7])

Note: The logits have the shape [batch_size, num_tokens, num_tags].


preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Tags"])
            0      1      2      3      4      5      6      7      8      9
    Tokens  <s>    ▁Jack  ▁Spar  row    ▁love  s      ▁New   ▁York  !      </s>
    Tags    I-ORG  I-ORG  I-ORG  I-ORG  I-ORG  I-ORG  I-ORG  I-ORG  I-ORG  I-ORG

Note: The output is useless because the token classification head is still randomly initialized; only the model body carries pretrained weights.


Wrap the prediction steps in a helper function

def tag_text(text, tags, model, tokenizer):
    # Get tokens, including special tokens
    tokens = tokenizer(text).tokens()
    # Encode the sequence into IDs with the tokenizer passed to the function
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # Get the logits over the 7 possible classes for each token
    outputs = model(input_ids)[0]
    # Take argmax to get most likely class per token
    predictions = torch.argmax(outputs, dim=2)
    # Convert to DataFrame
    preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
    return pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])

Tokenizing Texts for NER

Collect the words and tags as ordinary lists

de_example = panx_de['train'][0]
words, labels = de_example["tokens"], de_example["ner_tags"]
pd.DataFrame([words,labels], index=["words", "labels"])
            0      1           2   3    4         5      6   7    8           9             10       11
    words   2.000  Einwohnern  an  der  Danziger  Bucht  in  der  polnischen  Woiwodschaft  Pommern  .
    labels  0      0           0   0    5         6      0   0    5           5             6        0

Tokenize each word

tokenized_input = xlmr_tokenizer(de_example["tokens"], is_split_into_words=True)
pd.DataFrame(tokenized_input.values(), index=tokenized_input.keys())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
input_ids 0 70101 176581 19 142 122 2290 708 1505 18363 18 23 122 127474 15439 13787 14 15263 18917 663 6947 19 6 5 2
attention_mask 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Note: The is_split_into_words argument tells the tokenizer the input sequence is a list of separated words.


tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
pd.DataFrame(tokens, columns=["tokens"]).T
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
tokens <s> ▁2.000 ▁Einwohner n ▁an ▁der ▁Dan zi ger ▁Buch t ▁in ▁der ▁polni schen ▁Wo i wod schaft ▁Po mmer n ▁ . </s>

Note: We can use the word_ids() function to mask the subword representations after the first subword.


BatchEncoding.word_ids

  • Documentation
  • Get a list indicating the word corresponding to each token.
word_ids = tokenized_input.word_ids()
pd.DataFrame([tokens, word_ids], index=["Tokens", "Word IDs"])
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Tokens <s> ▁2.000 ▁Einwohner n ▁an ▁der ▁Dan zi ger ▁Buch t ▁in ▁der ▁polni schen ▁Wo i wod schaft ▁Po mmer n ▁ . </s>
Word IDs None 0 1 1 2 3 4 4 4 5 5 6 7 8 8 9 9 9 9 10 10 10 11 11 None

Note: The <s> and </s> tokens map to None as they are not words from the original text.


Set -100 as the label for the start and end tokens and masked subwords

  • The PyTorch cross-entropy loss class has an ignore_index attribute whose default value is -100; targets with that value do not contribute to the loss.
previous_word_idx = None
label_ids = []

for word_idx in word_ids:
    if word_idx is None or word_idx == previous_word_idx:
        label_ids.append(-100)
    elif word_idx != previous_word_idx:
        label_ids.append(labels[word_idx])
    previous_word_idx = word_idx
    
labels = [index2tag[l] if l != -100 else "IGN" for l in label_ids]
index = ["Tokens", "Word IDs", "Label IDs", "Labels"]

pd.DataFrame([tokens, word_ids, label_ids, labels], index=index)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Tokens <s> ▁2.000 ▁Einwohner n ▁an ▁der ▁Dan zi ger ▁Buch t ▁in ▁der ▁polni schen ▁Wo i wod schaft ▁Po mmer n ▁ . </s>
Word IDs None 0 1 1 2 3 4 4 4 5 5 6 7 8 8 9 9 9 9 10 10 10 11 11 None
Label IDs -100 0 0 -100 0 0 5 -100 -100 6 -100 0 0 5 -100 5 -100 -100 -100 6 -100 -100 0 -100 -100
Labels IGN O O IGN O O B-LOC IGN IGN I-LOC IGN O O B-LOC IGN B-LOC IGN IGN IGN I-LOC IGN IGN O IGN IGN
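
To see why the -100 labels are harmless during training, here is a minimal pure-Python sketch of a mean cross-entropy that skips ignored positions. The function name masked_cross_entropy is hypothetical, and the sketch only mirrors the documented behavior of ignore_index with mean reduction; it is not PyTorch's implementation:

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked subwords and special tokens add nothing to the loss
        # Log-sum-exp with max subtraction for numerical stability
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_norm - scores[label])
    # With no valid positions there is nothing to average
    return sum(losses) / len(losses) if losses else float("nan")


logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
labels = [-100, 0, -100]  # only the middle token carries a real label

loss = masked_cross_entropy(logits, labels)
print(loss)
```

Changing the logits at the two ignored positions leaves the loss unchanged, which is the property the label alignment relies on.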

Wrap the tokenization and label alignment steps into a single function

def tokenize_and_align_labels(examples):
    tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True, 
                                      is_split_into_words=True)
    labels = []
    for idx, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

Define a mapping function to encode the dataset

def encode_panx_dataset(corpus):
    return corpus.map(tokenize_and_align_labels, batched=True, 
                      remove_columns=['langs', 'ner_tags', 'tokens'])

panx_ch["de"]
    DatasetDict({
        validation: Dataset({
            features: ['tokens', 'ner_tags', 'langs'],
            num_rows: 6290
        })
        test: Dataset({
            features: ['tokens', 'ner_tags', 'langs'],
            num_rows: 6290
        })
        train: Dataset({
            features: ['tokens', 'ner_tags', 'langs'],
            num_rows: 12580
        })
    })

panx_de_encoded = encode_panx_dataset(panx_ch["de"])
panx_de_encoded
    DatasetDict({
        validation: Dataset({
            features: ['attention_mask', 'input_ids', 'labels'],
            num_rows: 6290
        })
        test: Dataset({
            features: ['attention_mask', 'input_ids', 'labels'],
            num_rows: 6290
        })
        train: Dataset({
            features: ['attention_mask', 'input_ids', 'labels'],
            num_rows: 12580
        })
    })

Performance Measures

  • Standard performance measures for NER tasks include precision, recall, and F1-score.
  • The model needs to correctly predict all words of an entity for a prediction to count as correct.

seqeval


from seqeval.metrics import classification_report

classification_report

  • Source Code
  • Build a text report showing the main classification metrics for a sequence of targets and predictions.
  • The function expects targets and predictions as lists of lists.

y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"],["B-PER", "I-PER", "O"]]
print(classification_report(y_true, y_pred))
                  precision    recall  f1-score   support
    
            MISC       0.00      0.00      0.00         1
             PER       1.00      1.00      1.00         1
    
       micro avg       0.50      0.50      0.50         2
       macro avg       0.50      0.50      0.50         2
    weighted avg       0.50      0.50      0.50         2
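
seqeval scores whole entities rather than individual tokens. The sketch below is a simplified re-implementation for illustration only, with hypothetical helper names; it is not seqeval's code and it ignores edge cases such as stray I- tags:

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in one IOB2 sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not continues:
            spans.add((kind, start, i))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def entity_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over exactly matching spans."""
    true_spans = {(i, *s) for i, seq in enumerate(y_true) for s in extract_spans(seq)}
    pred_spans = {(i, *s) for i, seq in enumerate(y_pred) for s in extract_spans(seq)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"], ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"], ["B-PER", "I-PER", "O"]]
print(entity_scores(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

The result matches the micro average in the report: the predicted MISC entity starts one token early, so it counts as wrong even though three of its four tokens overlap with the true entity.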

import numpy as np

Format predictions and target labels for seqeval

def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = preds.shape
    labels_list, preds_list = [], []

    for batch_idx in range(batch_size):
        example_labels, example_preds = [], []
        for seq_idx in range(seq_len):
            # Ignore label IDs = -100
            if label_ids[batch_idx, seq_idx] != -100:
                example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
                example_preds.append(index2tag[preds[batch_idx][seq_idx]])

        labels_list.append(example_labels)
        preds_list.append(example_preds)

    return preds_list, labels_list

Fine-Tuning XLM-RoBERTa

Define training attributes

from transformers import TrainingArguments
num_epochs = 3
batch_size = 64
logging_steps = len(panx_de_encoded["train"]) // batch_size
model_name = f"{xlmr_model_name}-finetuned-panx-de"
training_args = TrainingArguments(
    output_dir=model_name, log_level="error", num_train_epochs=num_epochs, 
    per_device_train_batch_size=batch_size, 
    per_device_eval_batch_size=batch_size, evaluation_strategy="epoch", 
    save_steps=1e6, weight_decay=0.01, disable_tqdm=False, 
    logging_steps=logging_steps, push_to_hub=True, fp16=True)

Note: Set save_steps to a large number to disable checkpointing.
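
The logging interval above is chosen so that the trainer logs roughly once per epoch. A quick arithmetic check, assuming a single device, no gradient accumulation, and that the final partial batch is kept (these assumptions are mine and are not stated in the notes):

```python
import math

num_train_rows = 12580  # size of the German training split
batch_size = 64
num_epochs = 3

# Floor division, as in the training setup
logging_steps = num_train_rows // batch_size
# The last batch of each epoch is only partially filled
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

print(logging_steps, steps_per_epoch, total_steps)  # 196 197 591
```

Under these assumptions the first log entry lands one step before the end of the first epoch, which is consistent with the first entry of the training log (step 196 at epoch 0.99).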


Log into Hugging Face account

from huggingface_hub import notebook_login
notebook_login()
    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
    To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
    Login successful
    Your token has been saved to /home/innom-dt/.huggingface/token

Compute the F1-score on the validation set

from seqeval.metrics import f1_score
def compute_metrics(eval_pred):
    y_pred, y_true = align_predictions(eval_pred.predictions, 
                                       eval_pred.label_ids)
    return {"f1": f1_score(y_true, y_pred)}

Define a collator to pad each input sequence to the length of the longest sequence in a batch

from transformers import DataCollatorForTokenClassification

DataCollatorForTokenClassification

  • Documentation
  • Create a data collator that will dynamically pad inputs and labels.

print_source(DataCollatorForTokenClassification.torch_call)
    def torch_call(self, features):
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        labels = [feature[label_name] for feature in features
            ] if label_name in features[0].keys() else None
        batch = self.tokenizer.pad(features, padding=self.padding, max_length=
            self.max_length, pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt' if labels is None else None)
        if labels is None:
            return batch
        sequence_length = torch.tensor(batch['input_ids']).shape[1]
        padding_side = self.tokenizer.padding_side
        if padding_side == 'right':
            batch[label_name] = [(list(label) + [self.label_pad_token_id] * (
                sequence_length - len(label))) for label in labels]
        else:
            batch[label_name] = [([self.label_pad_token_id] * (sequence_length -
                len(label)) + list(label)) for label in labels]
        batch = {k: torch.tensor(v, dtype=torch.int64) for k, v in batch.items()}
        return batch

DataCollatorForTokenClassification.label_pad_token_id
    -100

Note:

  • We need to pad the labels as they are also sequences.
  • The collator pads label sequences with the value -100, so the PyTorch loss function ignores them.
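The label padding can be sketched in plain Python. This mirrors the list arithmetic in `torch_call` above but is only an illustration: `pad_labels` is a hypothetical helper, and the label IDs in the example batch are made up.

```python
# Same value as DataCollatorForTokenClassification.label_pad_token_id
LABEL_PAD_TOKEN_ID = -100


def pad_labels(label_seqs, pad_id=LABEL_PAD_TOKEN_ID, padding_side="right"):
    """Pad every label sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = []
    for seq in label_seqs:
        filler = [pad_id] * (max_len - len(seq))
        if padding_side == "right":
            padded.append(list(seq) + filler)
        else:
            padded.append(filler + list(seq))
    return padded


# Two label sequences of different lengths (illustrative label IDs)
batch_labels = [[-100, 3, 4, -100], [-100, 1, -100]]
print(pad_labels(batch_labels))
# [[-100, 3, 4, -100], [-100, 1, -100, -100]]
```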


data_collator = DataCollatorForTokenClassification(xlmr_tokenizer)

Create a helper function to initialize a new model for a training session

def model_init():
    return (XLMRobertaForTokenClassification
            .from_pretrained(xlmr_model_name, config=xlmr_config)
            .to(device))

Disable Tokenizers Parallelism

%env TOKENIZERS_PARALLELISM=false
    env: TOKENIZERS_PARALLELISM=false

Initialize the Trainer object

from transformers import Trainer
trainer = Trainer(model_init=model_init, args=training_args, 
                  data_collator=data_collator, compute_metrics=compute_metrics,
                  train_dataset=panx_de_encoded["train"],
                  eval_dataset=panx_de_encoded["validation"], 
                  tokenizer=xlmr_tokenizer)

Run the training loop and push the final model to the Hugging Face Hub

trainer.train()
trainer.push_to_hub(commit_message="Training completed!")
Epoch Training Loss Validation Loss F1
1 0.326400 0.162317 0.813909
2 0.136000 0.133068 0.845137
3 0.096000 0.131872 0.857581
    'https://huggingface.co/cj-mills/xlm-roberta-base-finetuned-panx-de/commit/1ebdc3c9051a980588be5a495ad96896f330932c'

How to manually display the training log

trainer.state.log_history
    [{'loss': 0.3264,
      'learning_rate': 3.3671742808798645e-05,
      'epoch': 0.99,
      'step': 196},
     {'eval_loss': 0.1623172014951706,
      'eval_f1': 0.8139089269612262,
      'eval_runtime': 7.0145,
      'eval_samples_per_second': 896.714,
      'eval_steps_per_second': 14.114,
      'epoch': 1.0,
      'step': 197},
     {'loss': 0.136,
      'learning_rate': 1.7174280879864637e-05,
      'epoch': 1.99,
      'step': 392},
     {'eval_loss': 0.1330675333738327,
      'eval_f1': 0.8451372416130125,
      'eval_runtime': 6.8702,
      'eval_samples_per_second': 915.543,
      'eval_steps_per_second': 14.41,
      'epoch': 2.0,
      'step': 394},
     {'loss': 0.096,
      'learning_rate': 6.76818950930626e-07,
      'epoch': 2.98,
      'step': 588},
     {'eval_loss': 0.13187244534492493,
      'eval_f1': 0.8575809199318569,
      'eval_runtime': 6.8965,
      'eval_samples_per_second': 912.061,
      'eval_steps_per_second': 14.355,
      'epoch': 3.0,
      'step': 591},
     {'train_runtime': 95.0424,
      'train_samples_per_second': 397.086,
      'train_steps_per_second': 6.218,
      'total_flos': 1039360955930616.0,
      'train_loss': 0.18559023183211054,
      'epoch': 3.0,
      'step': 591}]

df = pd.DataFrame(trainer.state.log_history)[['epoch','loss' ,'eval_loss', 'eval_f1']]
df = df.rename(columns={"epoch":"Epoch","loss": "Training Loss", "eval_loss": "Validation Loss", "eval_f1":"F1"})
df['Epoch'] = df["Epoch"].apply(lambda x: round(x))
df['Training Loss'] = df["Training Loss"].ffill()
df[['Validation Loss', 'F1']] = df[['Validation Loss', 'F1']].bfill().ffill()
df.drop_duplicates()
Epoch Training Loss Validation Loss F1
0 1 0.3264 0.162317 0.813909
2 2 0.1360 0.133068 0.845137
4 3 0.0960 0.131872 0.857581

Test the model on some sample text

text_de = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
tag_text(text_de, tags, trainer.model, xlmr_tokenizer)
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Tokens <s> _Jeff _De an _ist _ein _Informati ker _bei _Google _in _Kaliforni en </s>
Tags O B-PER I-PER I-PER O O O O O B-ORG O B-LOC I-LOC O

Note: The fine-tuned model correctly identifies the entities in the sample text.
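To read the tagged output as entities, the subword tokens need to be merged back into words. Below is a rough sketch of that step, assuming SentencePiece's word-start marker `\u2581` (rendered as `_` in the tables here); `merge_entities` is a hypothetical helper and not part of the notebook.

```python
def merge_entities(tokens, tags, word_start="\u2581"):
    """Group consecutive B-/I- tagged subword tokens into (text, type) pairs."""
    entities, pieces, ent_type = [], [], None

    def flush():
        # Join the collected subwords and turn word-start markers into spaces
        if pieces:
            text = "".join(pieces).replace(word_start, " ").strip()
            entities.append((text, ent_type))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == ent_type:
            pieces.append(token)  # continue the open entity
            continue
        flush()
        if prefix == "B":
            pieces, ent_type = [token], label  # open a new entity
        else:
            pieces, ent_type = [], None
    flush()
    return entities


tokens = ["<s>", "\u2581Jeff", "\u2581De", "an", "\u2581ist", "\u2581ein",
          "\u2581Informati", "ker", "\u2581bei", "\u2581Google", "\u2581in",
          "\u2581Kaliforni", "en", "</s>"]
tags = ["O", "B-PER", "I-PER", "I-PER", "O", "O", "O", "O", "O",
        "B-ORG", "O", "B-LOC", "I-LOC", "O"]
print(merge_entities(tokens, tags))
# [('Jeff Dean', 'PER'), ('Google', 'ORG'), ('Kalifornien', 'LOC')]
```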


Error Analysis

  • Error analysis is an effective tool to understand a model’s strengths and weaknesses.
  • Looking at the errors can yield helpful insights and reveal bugs that would be hard to spot by only looking at the code.
  • There are several failure modes where a model might appear to perform well but have serious flaws.

Failure Modes

  • We might accidentally mask too many tokens, including some labels, which yields a deceptively promising drop in loss.
  • The metrics function might have a bug.
  • We might count the zero class (the O tag) as a normal class, which heavily skews the accuracy and \(F_{1}\)-score because it is by far the most frequent class.

Define a function that returns the loss and predicted labels for a single batch

from torch.nn.functional import cross_entropy
def forward_pass_with_label(batch):
    # Convert dict of lists to list of dicts suitable for data collator
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    # Pad inputs and labels and put all tensors on device
    batch = data_collator(features)
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)
    with torch.no_grad():
        # Pass data through model  
        output = trainer.model(input_ids, attention_mask)
        # Logit.size: [batch_size, sequence_length, classes]
        # Predict class with largest logit value on classes axis
        predicted_label = torch.argmax(output.logits, axis=-1).cpu().numpy()
    # Calculate loss per token after flattening batch dimension with view
    loss = cross_entropy(output.logits.view(-1, 7), 
                         labels.view(-1), reduction="none")
    # Unflatten batch dimension and convert to numpy array
    loss = loss.view(len(input_ids), -1).cpu().numpy()

    return {"loss":loss, "predicted_label": predicted_label}
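With `reduction="none"`, the loss function returns one value per token, and positions labeled -100 contribute a loss of zero. The sketch below reproduces that behavior in plain Python for a single sequence; `per_token_cross_entropy` is a hypothetical helper, and the logits are made-up numbers.

```python
import math

IGNORE_INDEX = -100  # label value the loss skips


def per_token_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Unreduced cross-entropy for one sequence; ignored positions get 0.0."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            losses.append(0.0)
            continue
        # Numerically stable log-sum-exp over the class logits
        max_logit = max(token_logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in token_logits))
        # Negative log-probability of the target class
        losses.append(log_sum_exp - token_logits[label])
    return losses


logits = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]  # three tokens, two classes
labels = [-100, 0, 1]                          # first token is ignored
print([round(loss, 4) for loss in per_token_cross_entropy(logits, labels)])
```

This is why tokens tagged `IGN` show a loss of 0.0 in the table that follows.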

Get the loss and predictions for the validation set

valid_set = panx_de_encoded["validation"]
valid_set = valid_set.map(forward_pass_with_label, batched=True, batch_size=32)
df = valid_set.to_pandas()

index2tag[-100] = "IGN"
# Map IDs to tokens
df["input_tokens"] = df["input_ids"].apply(lambda x: xlmr_tokenizer.convert_ids_to_tokens(x))
# Map predicted label index to tag
df["predicted_label"] = df["predicted_label"].apply(lambda x: [index2tag[i] for i in x])
# Map target label index to tag
df["labels"] = df["labels"].apply(lambda x: [index2tag[i] for i in x])
# Remove padding for the loss field
df['loss'] = df.apply(lambda x: x['loss'][:len(x['input_ids'])], axis=1)
# Remove padding for the predicted label field
df['predicted_label'] = df.apply(lambda x: x['predicted_label'][:len(x['input_ids'])], axis=1)
df.head(1).T
0
attention_mask [1, 1, 1, 1, 1, 1, 1]
input_ids [0, 10699, 11, 15, 16104, 1388, 2]
labels [IGN, B-ORG, IGN, I-ORG, I-ORG, I-ORG, IGN]
loss [0.0, 0.03210718, 0.0, 0.05737416, 0.0494957, 0.062034503, 0.0]
predicted_label [I-ORG, B-ORG, I-ORG, I-ORG, I-ORG, I-ORG, I-ORG]
input_tokens [<s>, Ham, a, (, Unternehmen, ), </s>]

# Transform each element of a list-like to a row
df_tokens = df.apply(pd.Series.explode)
# Remove the tokens labeled with 'IGN'
df_tokens = df_tokens.query("labels != 'IGN'")
# Round loss values to two decimal places
df_tokens["loss"] = df_tokens["loss"].astype(float).round(2)
df_tokens.head(7).T.style.hide(axis='columns')
attention_mask 1 1 1 1 1 1 1
input_ids 10699 15 16104 1388 56530 83982 10
labels B-ORG I-ORG I-ORG I-ORG O B-ORG I-ORG
loss 0.030000 0.060000 0.050000 0.060000 0.000000 0.600000 0.380000
predicted_label B-ORG I-ORG I-ORG I-ORG O B-ORG I-ORG
input_tokens _Ham _( _Unternehmen _) _WE _Luz _a

(
    # Group data by the input tokens
    df_tokens.groupby("input_tokens")[["loss"]]
    # Aggregate the losses for each token
    .agg(["count", "mean", "sum"])
    # Get rid of multi-level columns
    .droplevel(level=0, axis=1)
    # Sort values with the highest losses first
    .sort_values(by="sum", ascending=False)
    .reset_index()
    .round(2)
    .head(10)
    .T
)
0 1 2 3 4 5 6 7 8 9
input_tokens _ _in _von _der _/ _und _( _) _’’ _A
count 6066 989 808 1388 163 1171 246 246 2898 125
mean 0.03 0.11 0.14 0.07 0.51 0.07 0.28 0.27 0.02 0.47
sum 187.46 110.59 110.46 100.7 83.81 83.13 69.48 67.49 59.03 58.63

Note:

  • The whitespace token has the highest total loss since it is the most common token.
  • The whitespace token has a low mean loss, indicating the model does not struggle to classify it.
  • Words like “in,” “von,” “der,” and “und” often appear together with named entities and are sometimes part of them.
  • Parentheses, slashes, and capital letters at the beginning of words are rarer but have a relatively high average loss.

(
    # Group data by the label IDs
    df_tokens.groupby("labels")[["loss"]] 
    .agg(["count", "mean", "sum"])
    .droplevel(level=0, axis=1)
    .sort_values(by="mean", ascending=False)
    .reset_index()
    .round(2)
    .T
)
0 1 2 3 4 5 6
labels B-ORG I-LOC I-ORG B-LOC B-PER I-PER O
count 2683 1462 3820 3172 2893 4139 43648
mean 0.59 0.59 0.42 0.34 0.3 0.18 0.03
sum 1582.79 857.5 1598.29 1073.82 861.09 727.88 1419.61

Note: B-ORG has the highest average loss, meaning the model struggles to find the beginning of organization entities.


Plot a confusion matrix of the token classification

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt
def plot_confusion_matrix(y_preds, y_true, labels):
    # Pass `labels` so the matrix rows and columns follow the display order
    cm = confusion_matrix(y_true, y_preds, labels=labels, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
# Predictions first, then true labels, matching the function signature
plot_confusion_matrix(df_tokens["predicted_label"], df_tokens["labels"],
                      tags.names)

[Figure: normalized confusion matrix of true versus predicted tags]

Note: The model often confuses the beginning subword (B-ORG) of an organizational entity with the subsequent subwords (I-ORG).
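For reference, `normalize="true"` divides each row of the confusion matrix by the number of tokens with that true label, so each row sums to 1. Here is a small pure-Python sketch of the same computation; `normalized_confusion_matrix` is a hypothetical helper, and the tag lists are toy data.

```python
from collections import Counter


def normalized_confusion_matrix(y_true, y_pred, labels):
    """Row-normalized confusion matrix: rows are true labels, columns predictions."""
    counts = Counter(zip(y_true, y_pred))
    matrix = []
    for true_label in labels:
        row = [counts[(true_label, pred_label)] for pred_label in labels]
        total = sum(row)
        # Divide by the row total so each row shows per-class rates
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


labels = ["O", "B-ORG", "I-ORG"]
y_true = ["B-ORG", "B-ORG", "I-ORG", "I-ORG", "O"]
y_pred = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O"]
for label, row in zip(labels, normalized_confusion_matrix(y_true, y_pred, labels)):
    print(label, row)
# O [1.0, 0.0, 0.0]
# B-ORG [0.0, 0.5, 0.5]
# I-ORG [0.0, 0.0, 1.0]
```

In this toy example, half of the true `B-ORG` tokens are predicted as `I-ORG`, which is the kind of confusion described in the note.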


Examine token sequences with high losses

def get_samples(df):
    # Iterate over DataFrame rows
    for _, row in df.iterrows():
        labels, preds, tokens, losses = [], [], [], []
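        for i, mask in enumerate(row["attention_mask"]):
            # Skip the leading <s> token; `i` never equals the sequence length,
            # so the trailing </s> token is kept (as the output below shows)
            if i not in {0, len(row["attention_mask"])}: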
                labels.append(row["labels"][i])
                preds.append(row["predicted_label"][i])
                tokens.append(row["input_tokens"][i])
                losses.append(f"{row['loss'][i]:.2f}")
        df_tmp = pd.DataFrame({"tokens": tokens, "labels": labels, 
                               "preds": preds, "losses": losses}).T
        yield df_tmp

df["total_loss"] = df["loss"].apply(sum)
df_tmp = df.sort_values(by="total_loss", ascending=False).head(3)

for sample in get_samples(df_tmp):
    display(sample)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
tokens _’ _’’ Κ _’’ _’ _’ _’’ _T _’’ _’ ri _’’ _’ k _’’ _’ ala </s>
labels O O O IGN O O B-LOC I-LOC I-LOC I-LOC I-LOC IGN I-LOC I-LOC IGN I-LOC I-LOC IGN IGN
preds O O B-ORG O O O O O O O O O O O O O O O O
losses 0.00 0.00 2.42 0.00 0.00 0.00 9.83 9.15 7.60 6.55 6.66 0.00 5.83 6.83 0.00 7.26 7.44 0.00 0.00
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
tokens _’’ 8 . _Juli _’’ _: _Protest camp _auf _dem _Gelände _der _Republika n ischen _Gar de </s>
labels B-ORG IGN IGN I-ORG I-ORG I-ORG I-ORG IGN I-ORG I-ORG I-ORG I-ORG I-ORG IGN IGN I-ORG IGN IGN
preds O O O O O O O O O O O O B-ORG I-ORG I-ORG I-ORG I-ORG O
losses 8.37 0.00 0.00 4.67 9.00 8.87 6.17 0.00 7.98 8.33 7.00 4.32 2.61 0.00 0.00 0.01 0.00 0.00
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
tokens _United _Nations _Multi dimensional _Integra ted _Stabil ization _Mission _in _the _Central _African _Republic </s>
labels B-PER I-PER I-PER IGN I-PER IGN I-PER IGN I-PER I-PER I-PER I-PER I-PER I-PER IGN
preds B-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG
losses 5.46 5.36 5.51 0.00 5.53 0.00 5.46 0.00 5.06 5.22 5.62 5.71 5.36 5.09 0.00

Note:

  • The PAN-X dataset used an imperfect automated process to apply annotations, resulting in some labeling issues.
    • The United Nations and the Central African Republic are organizations, not people.
    • The date “8. Juli” (July 8th) also has an incorrect label.

Examine sequences with an opening parenthesis

u"\u2581("
    '_('
df_tmp = df.loc[df["input_tokens"].apply(lambda x: u"\u2581(" in x)].head(2)
for sample in get_samples(df_tmp):
    display(sample)

  • We generally don’t include the parentheses and their contents as part of the named entity, but the automated annotation process does.
  • Some parentheses contain a geographic specification.
  • We might want to disconnect this information from the original location in the annotations.
  • The dataset consists of Wikipedia articles in different languages, and the article titles often contain an explanation in parentheses.
  • We need to know about such characteristics in our datasets when rolling out models to production.
  • We can use these insights to clean up the dataset and retrain the model.

Cross-Lingual Transfer

Create a helper function to evaluate the model on different datasets

def get_f1_score(trainer, dataset):
    return trainer.predict(dataset).metrics["test_f1"]

Examine the German model’s performance on the German test set

f1_scores = defaultdict(dict)
f1_scores["de"]["de"] = get_f1_score(trainer, panx_de_encoded["test"])
print(f"F1-score of [de] model on [de] dataset: {f1_scores['de']['de']:.3f}")
    F1-score of [de] model on [de] dataset: 0.859

Test the German model’s performance on French text

text_fr = "Jeff Dean est informaticien chez Google en Californie"
tag_text(text_fr, tags, trainer.model, xlmr_tokenizer)
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Tokens <s> _Jeff _De an _est _informatic ien _chez _Google _en _Cali for nie </s>
Tags O B-PER I-PER I-PER O O O O B-ORG O B-LOC B-LOC I-LOC O

Note: The model correctly labeled the French translation of “Kalifornien” as a location.


Define a helper function to encode a dataset and compute the model’s F1-score on its test split

def evaluate_lang_performance(lang, trainer):
    panx_ds = encode_panx_dataset(panx_ch[lang])
    return get_f1_score(trainer, panx_ds["test"])

Evaluate the German model’s performance on the French test set

f1_scores["de"]["fr"] = evaluate_lang_performance("fr", trainer)
print(f"F1-score of [de] model on [fr] dataset: {f1_scores['de']['fr']:.3f}")
    F1-score of [de] model on [fr] dataset: 0.708

Note: The German model still performs relatively well despite not training on a single labeled French example.


Evaluate the German model’s performance on the Italian test set

f1_scores["de"]["it"] = evaluate_lang_performance("it", trainer)
print(f"F1-score of [de] model on [it] dataset: {f1_scores['de']['it']:.3f}")
    F1-score of [de] model on [it] dataset: 0.691

Evaluate the German model’s performance on the English test set

f1_scores["de"]["en"] = evaluate_lang_performance("en", trainer)
print(f"F1-score of [de] model on [en] dataset: {f1_scores['de']['en']:.3f}")
    F1-score of [de] model on [en] dataset: 0.596

Note: The model performs worst on the English dataset, even though we might expect English, a Germanic language, to be closer to German than French is.


When Does Zero-Shot Transfer Make Sense?

  • We can determine at which point zero-shot cross-lingual transfer is superior to fine-tuning on a monolingual corpus by fine-tuning the model on training sets of increasing size.

Define a function to train a model on a downsampled dataset

def train_on_subset(dataset, num_samples):
    # Downsample the training set to the target number of samples
    train_ds = dataset["train"].shuffle(seed=42).select(range(num_samples))
    valid_ds = dataset["validation"]
    test_ds = dataset["test"]
    training_args.logging_steps = len(train_ds) // batch_size
    # Train the model on the downsampled dataset
    trainer = Trainer(model_init=model_init, args=training_args,
        data_collator=data_collator, compute_metrics=compute_metrics,
        train_dataset=train_ds, eval_dataset=valid_ds, tokenizer=xlmr_tokenizer)
    trainer.train()
    if training_args.push_to_hub:
        trainer.push_to_hub(commit_message="Training completed!")
    # Return the performance metrics
    f1_score = get_f1_score(trainer, test_ds)
    return pd.DataFrame.from_dict(
        {"num_samples": [len(train_ds)], "f1_score": [f1_score]})

Encode the French Dataset

panx_fr_encoded = encode_panx_dataset(panx_ch["fr"])

Train the model on 250 French samples

training_args.push_to_hub = False
metrics_df = train_on_subset(panx_fr_encoded, 250)
metrics_df
Epoch Training Loss Validation Loss F1
1 2.360800 2.210924 0.109819
2 2.192800 1.484458 0.032251
3 1.480200 1.368229 0.008093
num_samples f1_score
0 250 0.007832

Note: The French model significantly underperforms the German model when using only 250 examples.


Train the model on an increasing number of French samples

for num_samples in [500, 1000, 2000, 4000]:
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    metrics_df = pd.concat(
        [metrics_df, train_on_subset(panx_fr_encoded, num_samples)],
        ignore_index=True)
Epoch Training Loss Validation Loss F1
1 2.204900 1.488627 0.023411
2 1.465700 1.249257 0.144914
3 1.217400 1.093112 0.161066
Epoch Training Loss Validation Loss F1
1 1.800700 1.176858 0.175238
2 1.016500 0.698511 0.578441
3 0.669500 0.540801 0.639231
Epoch Training Loss Validation Loss F1
1 1.413700 0.686913 0.531559
2 0.526900 0.386696 0.741683
3 0.318900 0.352989 0.771843
Epoch Training Loss Validation Loss F1
1 0.895500 0.371288 0.757611
2 0.324200 0.327193 0.777248
3 0.243800 0.284226 0.822527

fig, ax = plt.subplots()
ax.axhline(f1_scores["de"]["fr"], ls="--", color="r")
metrics_df.set_index("num_samples").plot(ax=ax)
plt.legend(["Zero-shot from de", "Fine-tuned on fr"], loc="lower right")
plt.ylim((0, 1))
plt.xlabel("Number of Training Samples")
plt.ylabel("F1 Score")
plt.show()

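[Figure: F1 score versus number of training samples, comparing zero-shot transfer from de with fine-tuning on fr]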

Note:

  • The zero-shot transfer model remains competitive until about 1500 training examples.
  • Getting domain experts to label hundreds (let alone thousands) of documents can be costly, especially for NER.
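The crossover point can be estimated by interpolating linearly between the measured training-set sizes. The sketch below is only a rough check under stated assumptions: it uses the final-epoch validation F1 values from the training logs above (in loop order, not the test-set scores stored in `metrics_df`) and compares them with the zero-shot test score of the German model on French. `crossover_point` is a hypothetical helper.

```python
def crossover_point(sample_sizes, scores, baseline):
    """Interpolate the sample count at which `scores` first reaches `baseline`."""
    points = list(zip(sample_sizes, scores))
    for (n0, s0), (n1, s1) in zip(points, points[1:]):
        if s0 < baseline <= s1:
            # Linear interpolation between the two neighboring measurements
            return n0 + (baseline - s0) / (s1 - s0) * (n1 - n0)
    return None  # the baseline is never reached in the measured range


# Final-epoch validation F1 per training-set size (assumed to follow loop order)
sample_sizes = [250, 500, 1000, 2000, 4000]
validation_f1 = [0.008093, 0.161066, 0.639231, 0.771843, 0.822527]
zero_shot_f1 = 0.7079  # [de] model evaluated on the [fr] test set

estimate = crossover_point(sample_sizes, validation_f1, zero_shot_f1)
print(f"Fine-tuning overtakes zero-shot transfer near {estimate:.0f} samples")
```

Under these assumptions, the estimate lands close to the figure of about 1500 examples read off the plot.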


Fine-Tuning on Multiple Languages at Once

  • We can mitigate the performance drop from zero-shot cross-lingual transfer by fine-tuning with multiple languages at once.

from datasets import concatenate_datasets

concatenate_datasets

  • Documentation
  • Convert a list of Dataset objects with the same schema into a single Dataset.

Define a function to combine a list of datasets based on split names

def concatenate_splits(corpora):
    multi_corpus = DatasetDict()
    for split in corpora[0].keys():
        multi_corpus[split] = concatenate_datasets(
            [corpus[split] for corpus in corpora]).shuffle(seed=42)
    return multi_corpus
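The same split-by-split merge can be illustrated with plain dictionaries and lists standing in for `DatasetDict` and `Dataset` objects. This is a toy sketch: `concatenate_splits_plain` and the row names are hypothetical, and the shuffle uses Python's `random` module rather than the Datasets library.

```python
import random


def concatenate_splits_plain(corpora, seed=42):
    """Merge corpora split by split, then shuffle each merged split."""
    merged = {}
    # Use the split names of the first corpus, as concatenate_splits does
    for split in corpora[0]:
        rows = [row for corpus in corpora for row in corpus[split]]
        random.Random(seed).shuffle(rows)  # seeded for reproducibility
        merged[split] = rows
    return merged


de = {"train": ["de-0", "de-1", "de-2"], "validation": ["de-v0"]}
fr = {"train": ["fr-0", "fr-1"], "validation": ["fr-v0"]}
multi = concatenate_splits_plain([de, fr])
print({split: len(rows) for split, rows in multi.items()})
# {'train': 5, 'validation': 2}
```

Shuffling after concatenation keeps the languages mixed within each batch instead of presenting all German examples before the French ones.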

Combine the German and French datasets

panx_de_fr_encoded = concatenate_splits([panx_de_encoded, panx_fr_encoded])

Update training attributes

training_args.logging_steps = len(panx_de_fr_encoded["train"]) // batch_size
training_args.push_to_hub = True
training_args.output_dir = "xlm-roberta-base-finetuned-panx-de-fr"

Train the model on the combined dataset

trainer = Trainer(model_init=model_init, args=training_args,
    data_collator=data_collator, compute_metrics=compute_metrics,
    tokenizer=xlmr_tokenizer, train_dataset=panx_de_fr_encoded["train"],
    eval_dataset=panx_de_fr_encoded["validation"])

trainer.train()
Epoch Training Loss Validation Loss F1
1 0.371800 0.176133 0.822272
2 0.153500 0.160763 0.840360
3 0.107400 0.157969 0.854692
TrainOutput(global_step=807, training_loss=0.2103209033368393, metrics={'train_runtime': 129.7072, 'train_samples_per_second': 396.894, 'train_steps_per_second': 6.222, 'total_flos': 1399867154966784.0, 'train_loss': 0.2103209033368393, 'epoch': 3.0})

trainer.push_to_hub(commit_message="Training completed!")
    'https://huggingface.co/cj-mills/xlm-roberta-base-finetuned-panx-de-fr/commit/e93b59a0d16dc03a657342fd9bf31413af9aebc1'

for lang in langs:
    f1 = evaluate_lang_performance(lang, trainer)
    print(f"F1-score of [de-fr] model on [{lang}] dataset: {f1:.3f}")
    F1-score of [de-fr] model on [de] dataset: 0.862
    F1-score of [de-fr] model on [fr] dataset: 0.848
    F1-score of [de-fr] model on [it] dataset: 0.793
    F1-score of [de-fr] model on [en] dataset: 0.688

Note: The model now performs much better on the French split, and it even improved on the Italian and English sets.


Test the performance from fine-tuning on each language separately

corpora = [panx_de_encoded]

# Exclude German from iteration
for lang in langs[1:]:
    training_args.output_dir = f"xlm-roberta-base-finetuned-panx-{lang}"
    # Fine-tune on monolingual corpus
    ds_encoded = encode_panx_dataset(panx_ch[lang])
    metrics = train_on_subset(ds_encoded, ds_encoded["train"].num_rows)
    # Collect F1-scores in common dict
    f1_scores[lang][lang] = metrics["f1_score"][0]
    # Add monolingual corpus to list of corpora to concatenate
    corpora.append(ds_encoded)
Epoch Training Loss Validation Loss F1
1 0.854100 0.352915 0.782609
2 0.306900 0.280733 0.815359
3 0.226200 0.271911 0.829342
Epoch Training Loss Validation Loss F1
1 1.454800 0.652213 0.545667
2 0.521400 0.347615 0.740443
3 0.318600 0.292827 0.773021
Epoch Training Loss Validation Loss F1
1 1.711900 1.000937 0.226577
2 0.891000 0.640531 0.528084
3 0.602300 0.508421 0.579369

Test the performance from multilingual learning on all the corpora

corpora_encoded = concatenate_splits(corpora)
training_args.logging_steps = len(corpora_encoded["train"]) // batch_size
training_args.output_dir = "xlm-roberta-base-finetuned-panx-all"

trainer = Trainer(model_init=model_init, args=training_args,
    data_collator=data_collator, compute_metrics=compute_metrics,
    tokenizer=xlmr_tokenizer, train_dataset=corpora_encoded["train"],
    eval_dataset=corpora_encoded["validation"])

trainer.train()
trainer.push_to_hub(commit_message="Training completed!")
Cloning https://huggingface.co/cj-mills/xlm-roberta-base-finetuned-panx-all into local empty directory.
Epoch Training Loss Validation Loss F1
1 0.370100 0.200005 0.805385
2 0.162900 0.168012 0.837781
3 0.115600 0.167403 0.847745
    'https://huggingface.co/cj-mills/xlm-roberta-base-finetuned-panx-all/commit/f01950fd63b31959f5c3d520125366485fb375b6'

Generate predictions on the test set for each language

for idx, lang in enumerate(langs):
    f1_scores["all"][lang] = get_f1_score(trainer, corpora[idx]["test"])

scores_data = {"de": f1_scores["de"],
               "each": {lang: f1_scores[lang][lang] for lang in langs},
               "all": f1_scores["all"]}
f1_scores_df = pd.DataFrame(scores_data).T.round(4)
f1_scores_df.rename_axis(index="Fine-tune on", columns="Evaluated on",
                         inplace=True)
f1_scores_df
Evaluated on de fr it en
Fine-tune on
de 0.8590 0.7079 0.6910 0.5962
each 0.8590 0.8321 0.7696 0.5962
all 0.8592 0.8568 0.8646 0.7678

Note:

  • Multilingual learning can provide significant performance gains.
  • You should generally focus your attention on cross-lingual transfer within language families.
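To put numbers on the gains, the snippet below subtracts the German-only row of the table above from the all-languages row. The scores are copied from that table; `table_scores` and `f1_gains` are hypothetical names used only for this illustration.

```python
# F1-scores copied from the "de" and "all" rows of the table above
table_scores = {
    "de": {"de": 0.8590, "fr": 0.7079, "it": 0.6910, "en": 0.5962},
    "all": {"de": 0.8592, "fr": 0.8568, "it": 0.8646, "en": 0.7678},
}


def f1_gains(baseline, improved):
    """Per-language F1 difference between two rows of the score table."""
    return {lang: round(improved[lang] - baseline[lang], 4) for lang in baseline}


gains = f1_gains(table_scores["de"], table_scores["all"])
for lang, gain in gains.items():
    print(f"{lang}: {gain:+.4f}")
```

The German score barely moves, while French, Italian, and English each gain roughly 0.15 to 0.17 F1 over zero-shot transfer from the German-only model.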

References