Notes on Transformers Book Ch. 2

Tags: ai, huggingface, nlp, notes
Chapter 2 covers training a model to classify emotions expressed in Twitter messages.
Author

Christian Mills

Published

April 1, 2022


import transformers
import datasets
import pandas as pd

# Only print error messages
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()
transformers.__version__
    '4.11.3'

pd.set_option('max_colwidth',None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

import ast
# https://astor.readthedocs.io/en/latest/
import astor
import inspect
import textwrap
def print_source(obj, exclude_doc=True):
    # Print the source code of a function or class, optionally excluding docstrings
    # Get source code
    source = inspect.getsource(obj)
    # Remove any common leading whitespace from every line
    cleaned_source = textwrap.dedent(source)
    # Parse the source into an AST node.
    parsed = ast.parse(cleaned_source)

    for node in ast.walk(parsed):
        # Skip any nodes that are not class or function definitions
        if not isinstance(node, (ast.FunctionDef, ast.ClassDef, ast.AsyncFunctionDef)):
            continue
        
        # Drop the first statement in the definition body (the docstring) when exclude_doc is set
        if exclude_doc and len(node.body) > 1: node.body = node.body[1:]
        
    print(astor.to_source(parsed))

Project: Analyze Product Sentiment on Twitter

  • Sentiment analysis involves classifying the feelings or opinions expressed in a given text.
  • The goal is to build a system that automatically classifies emotions expressed in Twitter messages about a product.
  • A model will take a single tweet as input and assign one of the possible labels.
  • Possible labels include anger, fear, joy, love, sadness, and surprise.
  • The project will use a variant of BERT called DistilBERT.
  • DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
    • DistilBERT achieves comparable accuracy to BERT while being significantly more efficient in size and speed.
    • DistilBERT was created in 2019 by researchers at Hugging Face.

Hugging Face Project Pipeline:

  1. Load and process datasets using the Datasets library.
  2. Tokenize input texts using the Tokenizers library.
  3. Load, train, and run models using the Transformers library.
  4. Load metrics and evaluate models using the Datasets library.
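
The rest of the chapter walks through these four steps in detail. As a rough sketch of how they map onto library calls (the metric-loading call at the end is an illustrative choice, not code from the book):

from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 1. Load and process a dataset with the Datasets library
dataset = load_dataset("emotion")

# 2. Tokenize the input texts with a fast tokenizer from the Tokenizers library
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = dataset.map(lambda batch: tokenizer(batch["text"], padding=True, truncation=True),
                      batched=True)

# 3. Load a pretrained model with the Transformers library
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=6)

# 4. Load a metric with the Datasets library to evaluate predictions
accuracy_metric = load_metric("accuracy")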

The Dataset

A First Look at Hugging Face Datasets

from datasets import list_datasets

print_source(list_datasets, exclude_doc=True)
    def list_datasets(with_community_datasets=True, with_details=False):
        datasets = huggingface_hub.list_datasets(full=with_details)
        if not with_community_datasets:
            datasets = [dataset for dataset in datasets if '/' not in dataset.id]
        if not with_details:
            datasets = [dataset.id for dataset in datasets]
        return datasets

# Get a list of all the datasets scripts available on the Hugging Face Hub
all_datasets = list_datasets()
print(f"There are {len(all_datasets)} datasets currently available on the Hub")
print(f"The first 10 are: {all_datasets[:10]}")
    There are 3896 datasets currently available on the Hub
    The first 10 are: ['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue', 'ajgt_twitter_ar', 'allegro_reviews']

from datasets import load_dataset

load_dataset

  • Documentation
  • This method downloads and imports the loading script for the specified dataset.
  • The script defines the citation, info, and format of the dataset, the URL to the original data files, and the code to load examples from the original files.
  • The script downloads the dataset files and caches them in typed Apache Arrow tables.
  • Several loading scripts are available to handle local and remote datasets.

Methods to Load Common Data Formats

Data format    Loading script    Example
CSV            csv               load_dataset("csv", data_files="my_file.csv")
Text           text              load_dataset("text", data_files="my_file.txt")
JSON           json              load_dataset("json", data_files="my_file.jsonl")

print_source(load_dataset)
    def load_dataset(path: str, name: Optional[str]=None, data_dir: Optional[
        str]=None, data_files: Optional[Union[str, Sequence[str], Mapping[str,
        Union[str, Sequence[str]]]]]=None, split: Optional[Union[str, Split]]=
        None, cache_dir: Optional[str]=None, features: Optional[Features]=None,
        download_config: Optional[DownloadConfig]=None, download_mode: Optional
        [GenerateMode]=None, ignore_verifications: bool=False, keep_in_memory:
        Optional[bool]=None, save_infos: bool=False, revision: Optional[Union[
        str, Version]]=None, use_auth_token: Optional[Union[bool, str]]=None,
        task: Optional[Union[str, TaskTemplate]]=None, streaming: bool=False,
        script_version='deprecated', **config_kwargs) ->Union[DatasetDict,
        Dataset, IterableDatasetDict, IterableDataset]:
        if script_version != 'deprecated':
            warnings.warn(
                "'script_version' was renamed to 'revision' in version 1.13 and will be removed in 1.15."
                , FutureWarning)
            revision = script_version
        ignore_verifications = ignore_verifications or save_infos
        builder_instance = load_dataset_builder(path=path, name=name, data_dir=
            data_dir, data_files=data_files, cache_dir=cache_dir, features=
            features, download_config=download_config, download_mode=
            download_mode, revision=revision, use_auth_token=use_auth_token, **
            config_kwargs)
        if streaming:
            extend_module_for_streaming(builder_instance.__module__,
                use_auth_token=use_auth_token)
            if not builder_instance.__module__.startswith('datasets.'):
                for imports in get_imports(inspect.getfile(builder_instance.
                    __class__)):
                    if imports[0] == 'internal':
                        internal_import_name = imports[1]
                        internal_module_name = '.'.join(builder_instance.
                            __module__.split('.')[:-1] + [internal_import_name])
                        extend_module_for_streaming(internal_module_name,
                            use_auth_token=use_auth_token)
            return builder_instance.as_streaming_dataset(split=split,
                use_auth_token=use_auth_token)
        try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
        builder_instance.download_and_prepare(download_config=download_config,
            download_mode=download_mode, ignore_verifications=
            ignore_verifications, try_from_hf_gcs=try_from_hf_gcs,
            use_auth_token=use_auth_token)
        keep_in_memory = (keep_in_memory if keep_in_memory is not None else
            is_small_dataset(builder_instance.info.dataset_size))
        ds = builder_instance.as_dataset(split=split, ignore_verifications=
            ignore_verifications, in_memory=keep_in_memory)
        if task is not None:
            ds = ds.prepare_for_task(task)
        if save_infos:
            builder_instance._save_infos()
        return ds

Automated Process

# Download dataset from Hub
emotions = load_dataset("emotion")

pd.DataFrame(list(emotions.cache_files.items()))
0 1
0 train [{‘filename’: ‘/home/innom-dt/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705/emotion-train.arrow’}]
1 validation [{‘filename’: ‘/home/innom-dt/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705/emotion-validation.arrow’}]
2 test [{‘filename’: ‘/home/innom-dt/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705/emotion-test.arrow’}]

Manual Process - Local

# Get the download URLs
urls = list(emotions['train'].info.download_checksums.keys())
urls
    ['https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt?dl=1',
     'https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt?dl=1',
     'https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt?dl=1']

# Download each dataset to current directory
for url in urls:
    # remove url parameters
    url = url.split('?')[0]
    # run the wget shell command in the jupyter notebook
    !wget $url
    --2022-04-01 11:59:26--  https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt
    Resolving www.dropbox.com (www.dropbox.com)... 162.125.7.18, 2620:100:6017:18::a27d:212
    Connecting to www.dropbox.com (www.dropbox.com)|162.125.7.18|:443... connected.
    HTTP request sent, awaiting response... 301 Moved Permanently
    Location: /s/raw/1pzkadrvffbqw6o/train.txt [following]
    --2022-04-01 11:59:26--  https://www.dropbox.com/s/raw/1pzkadrvffbqw6o/train.txt
    Reusing existing connection to www.dropbox.com:443.
    HTTP request sent, awaiting response... 302 Found
    Location: https://ucd8bd8ccbe834141eed1bd4fe3f.dl.dropboxusercontent.com/cd/0/inline/BimHlqS8EcLnVO8-ErygeREvWupg-stxp_BKxrhhRBD8zXEOdQ5P-ssnFHFhv63Jx0wos3YwmuzmYs4Ex3iGW6lF430Y2yc4Y-ro00V20otuMPHh1I7x6YnZWmMe_xQOeM_-RNv_CbVeXC2wxDFZxE-TWzFuwjHo-RUy7RcwlYWMng/file# [following]
    --2022-04-01 11:59:26--  https://ucd8bd8ccbe834141eed1bd4fe3f.dl.dropboxusercontent.com/cd/0/inline/BimHlqS8EcLnVO8-ErygeREvWupg-stxp_BKxrhhRBD8zXEOdQ5P-ssnFHFhv63Jx0wos3YwmuzmYs4Ex3iGW6lF430Y2yc4Y-ro00V20otuMPHh1I7x6YnZWmMe_xQOeM_-RNv_CbVeXC2wxDFZxE-TWzFuwjHo-RUy7RcwlYWMng/file
    Resolving ucd8bd8ccbe834141eed1bd4fe3f.dl.dropboxusercontent.com (ucd8bd8ccbe834141eed1bd4fe3f.dl.dropboxusercontent.com)... 162.125.7.15, 2620:100:6017:15::a27d:20f
    Connecting to ucd8bd8ccbe834141eed1bd4fe3f.dl.dropboxusercontent.com (ucd8bd8ccbe834141eed1bd4fe3f.dl.dropboxusercontent.com)|162.125.7.15|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 1658616 (1.6M) [text/plain]
    Saving to: ‘train.txt.8’
    
    train.txt.8         100%[===================>]   1.58M  --.-KB/s    in 0.1s    
    
    2022-04-01 11:59:27 (12.6 MB/s) - ‘train.txt.8’ saved [1658616/1658616]
    
    --2022-04-01 11:59:27--  https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt
    Resolving www.dropbox.com (www.dropbox.com)... 162.125.7.18, 2620:100:6017:18::a27d:212
    Connecting to www.dropbox.com (www.dropbox.com)|162.125.7.18|:443... connected.
    HTTP request sent, awaiting response... 301 Moved Permanently
    Location: /s/raw/2mzialpsgf9k5l3/val.txt [following]
    --2022-04-01 11:59:27--  https://www.dropbox.com/s/raw/2mzialpsgf9k5l3/val.txt
    Reusing existing connection to www.dropbox.com:443.
    HTTP request sent, awaiting response... 302 Found
    Location: https://ucd7c254cf6c0298b8fdea83c996.dl.dropboxusercontent.com/cd/0/inline/BinpUxjuQUPZKSAw9nVygw-6QF-JqzCuvRo2N8QqZPM8-Aqp5PxM0tHDJ3zclYqIKMhc_9_ORaLBDtdxeknAqfm_e3E0QJIYPA4tUpTQ7h31LAD_sc__6kyvioIZzjK61S5MlbTyM3YUMq3gPYMRH9_XE5gYrjnC1pddo3lRgrcUrg/file# [following]
    --2022-04-01 11:59:27--  https://ucd7c254cf6c0298b8fdea83c996.dl.dropboxusercontent.com/cd/0/inline/BinpUxjuQUPZKSAw9nVygw-6QF-JqzCuvRo2N8QqZPM8-Aqp5PxM0tHDJ3zclYqIKMhc_9_ORaLBDtdxeknAqfm_e3E0QJIYPA4tUpTQ7h31LAD_sc__6kyvioIZzjK61S5MlbTyM3YUMq3gPYMRH9_XE5gYrjnC1pddo3lRgrcUrg/file
    Resolving ucd7c254cf6c0298b8fdea83c996.dl.dropboxusercontent.com (ucd7c254cf6c0298b8fdea83c996.dl.dropboxusercontent.com)... 162.125.7.15, 2620:100:6017:15::a27d:20f
    Connecting to ucd7c254cf6c0298b8fdea83c996.dl.dropboxusercontent.com (ucd7c254cf6c0298b8fdea83c996.dl.dropboxusercontent.com)|162.125.7.15|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 204240 (199K) [text/plain]
    Saving to: ‘val.txt.8’
    
    val.txt.8           100%[===================>] 199.45K  --.-KB/s    in 0.09s   
    
    2022-04-01 11:59:28 (2.23 MB/s) - ‘val.txt.8’ saved [204240/204240]
    
    --2022-04-01 11:59:28--  https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt
    Resolving www.dropbox.com (www.dropbox.com)... 162.125.7.18, 2620:100:6017:18::a27d:212
    Connecting to www.dropbox.com (www.dropbox.com)|162.125.7.18|:443... connected.
    HTTP request sent, awaiting response... 301 Moved Permanently
    Location: /s/raw/ikkqxfdbdec3fuj/test.txt [following]
    --2022-04-01 11:59:28--  https://www.dropbox.com/s/raw/ikkqxfdbdec3fuj/test.txt
    Reusing existing connection to www.dropbox.com:443.
    HTTP request sent, awaiting response... 302 Found
    Location: https://uc6a6ed094f33148a8d600d0bd94.dl.dropboxusercontent.com/cd/0/inline/BileG7vM49CD4NPqfWBG0td8OcodftXS6fihHcq6NCZrPE8Xn9puhgIP1mCk-KXlQnwxW_3WTCdvFmmavZXbvU5qj_mu4PoCB4quNit8j4vVynpa3QWMxcPTiHfQB8UgZaKz319rr67HSjySTKFR1xvmTxTwZIsB0Ixss_Bem8ixQg/file# [following]
    --2022-04-01 11:59:28--  https://uc6a6ed094f33148a8d600d0bd94.dl.dropboxusercontent.com/cd/0/inline/BileG7vM49CD4NPqfWBG0td8OcodftXS6fihHcq6NCZrPE8Xn9puhgIP1mCk-KXlQnwxW_3WTCdvFmmavZXbvU5qj_mu4PoCB4quNit8j4vVynpa3QWMxcPTiHfQB8UgZaKz319rr67HSjySTKFR1xvmTxTwZIsB0Ixss_Bem8ixQg/file
    Resolving uc6a6ed094f33148a8d600d0bd94.dl.dropboxusercontent.com (uc6a6ed094f33148a8d600d0bd94.dl.dropboxusercontent.com)... 162.125.7.15, 2620:100:6017:15::a27d:20f
    Connecting to uc6a6ed094f33148a8d600d0bd94.dl.dropboxusercontent.com (uc6a6ed094f33148a8d600d0bd94.dl.dropboxusercontent.com)|162.125.7.15|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 206760 (202K) [text/plain]
    Saving to: ‘test.txt.8’
    
    test.txt.8          100%[===================>] 201.91K  --.-KB/s    in 0.07s   
    
    2022-04-01 11:59:29 (2.95 MB/s) - ‘test.txt.8’ saved [206760/206760]

!head -5 train.txt
    i didnt feel humiliated;sadness
    i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake;sadness
    im grabbing a minute to post i feel greedy wrong;anger
    i am ever feeling nostalgic about the fireplace i will know that it is still on the property;love
    i am feeling grouchy;anger

dataset_names = ['train', 'validation', 'test']
file_names = [url.split('?')[0].split('/')[-1] for url in urls]
data_files={name:file for name,file in zip(dataset_names, file_names)}
data_files
    {'train': 'train.txt', 'validation': 'val.txt', 'test': 'test.txt'}

emotions_local = load_dataset("csv", data_files=data_files, sep=";", names=["text", "label"])

emotions_local
    DatasetDict({
        train: Dataset({
            features: ['text', 'label'],
            num_rows: 16000
        })
        validation: Dataset({
            features: ['text', 'label'],
            num_rows: 2000
        })
        test: Dataset({
            features: ['text', 'label'],
            num_rows: 2000
        })
    })

pd.DataFrame(list(emotions_local.cache_files.items()))
0 1
0 train [{‘filename’: ‘/home/innom-dt/.cache/huggingface/datasets/csv/default-88fded83f2f02d15/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/csv-train.arrow’}]
1 validation [{‘filename’: ‘/home/innom-dt/.cache/huggingface/datasets/csv/default-88fded83f2f02d15/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/csv-validation.arrow’}]
2 test [{‘filename’: ‘/home/innom-dt/.cache/huggingface/datasets/csv/default-88fded83f2f02d15/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/csv-test.arrow’}]

Manual Process - Remote

data_files = {name:url for name,url in zip(dataset_names,urls)}
data_files
    {'train': 'https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt?dl=1',
     'validation': 'https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt?dl=1',
     'test': 'https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt?dl=1'}

emotions_remote = load_dataset("csv", data_files=data_files, sep=";", names=["text", "label"])

emotions_remote
    DatasetDict({
        train: Dataset({
            features: ['text', 'label'],
            num_rows: 16000
        })
        validation: Dataset({
            features: ['text', 'label'],
            num_rows: 2000
        })
        test: Dataset({
            features: ['text', 'label'],
            num_rows: 2000
        })
    })

pd.DataFrame(list(emotions_remote.cache_files.items()))
0 1
0 train [{‘filename’: ‘/home/innom-dt/.cache/huggingface/datasets/csv/default-6e495c0980795f6b/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/csv-train.arrow’}]
1 validation [{‘filename’: ‘/home/innom-dt/.cache/huggingface/datasets/csv/default-6e495c0980795f6b/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/csv-validation.arrow’}]
2 test [{‘filename’: ‘/home/innom-dt/.cache/huggingface/datasets/csv/default-6e495c0980795f6b/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/csv-test.arrow’}]

DatasetDict

  • Documentation
  • A dictionary (dict of str: datasets.Dataset) with dataset transform methods (map, filter, etc.); see the short filter sketch below.

emotions
    DatasetDict({
        train: Dataset({
            features: ['text', 'label'],
            num_rows: 16000
        })
        validation: Dataset({
            features: ['text', 'label'],
            num_rows: 2000
        })
        test: Dataset({
            features: ['text', 'label'],
            num_rows: 2000
        })
    })

Note: The data is already split into training, validation, and test sets.
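
A minimal sketch of the transform methods mentioned above, using the emotions object loaded earlier; filter applies a predicate to every example in every split:

# Keep only the examples labeled as sadness (label id 0) in each split
emotions_sadness = emotions.filter(lambda example: example["label"] == 0)
print(emotions_sadness)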

Dataset

  • Documentation
  • The base class datasets.Dataset implements a Dataset backed by an Apache Arrow table.
  • Behaves like an ordinary Python array or list.

train_ds = emotions["train"]
train_ds
    Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })

len(train_ds)
    16000

train_ds[0]
    {'text': 'i didnt feel humiliated', 'label': 0}

train_ds.column_names
    ['text', 'label']

# Check the data types used for text and labels.
print(train_ds.features)
    {'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None)}

ClassLabel

  • Documentation
  • Feature type for integer class labels.
  • This class provides methods to convert integer labels to strings and strings to integer labels.

datasets.ClassLabel
    datasets.features.features.ClassLabel
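
A quick sketch of the conversion methods mentioned above, using the label feature of the training split:

# Grab the ClassLabel feature for the label column
label_feature = emotions["train"].features["label"]
# Convert an integer label to its class name, and a class name back to its integer id
print(label_feature.int2str(0))       # 'sadness'
print(label_feature.str2int("anger")) # 3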

Value

  • Feature type for single values of a given data type (string, integer, float, etc.).

datasets.Value
    datasets.features.features.Value

print_source(datasets.Value, False)
    @dataclass
    class Value:
        """
        The Value dtypes are as follows:
    
        null
        bool
        int8
        int16
        int32
        int64
        uint8
        uint16
        uint32
        uint64
        float16
        float32 (alias float)
        float64 (alias double)
        timestamp[(s|ms|us|ns)]
        timestamp[(s|ms|us|ns), tz=(tzstring)]
        binary
        large_binary
        string
        large_string
        """
        dtype: str
        id: Optional[str] = None
        pa_type: ClassVar[Any] = None
        _type: str = field(default='Value', init=False, repr=False)
    
        def __post_init__(self):
            if self.dtype == 'double':
                self.dtype = 'float64'
            if self.dtype == 'float':
                self.dtype = 'float32'
            self.pa_type = string_to_arrow(self.dtype)
    
        def __call__(self):
            return self.pa_type
    
        def encode_example(self, value):
            if pa.types.is_boolean(self.pa_type):
                return bool(value)
            elif pa.types.is_integer(self.pa_type):
                return int(value)
            elif pa.types.is_floating(self.pa_type):
                return float(value)
            elif pa.types.is_string(self.pa_type):
                return str(value)
            else:
                return value
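
Putting the two feature types together, this is roughly how the emotion dataset's schema could be declared by hand (a sketch for illustration; the dataset's loading script already defines this):

from datasets import Features, Value, ClassLabel

# A string column plus a six-class label column, matching the features printed above
emotion_features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["sadness", "joy", "love", "anger", "fear", "surprise"])
})
print(emotion_features)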

From Datasets to DataFrames

  • Hugging Face Datasets provides a set_format method to convert Datasets objects to Pandas DataFrames.
  • The underlying data format is still an Arrow table.

DatasetDict.set_format

  • Documentation
  • Set the format for every Dataset object in the dictionary.

print_source(datasets.DatasetDict.set_format, exclude_doc=True)
    def set_format(self, type: Optional[str]=None, columns: Optional[List]=None,
        output_all_columns: bool=False, **format_kwargs):
        self._check_values_type()
        for dataset in self.values():
            dataset.set_format(type=type, columns=columns, output_all_columns=
                output_all_columns, **format_kwargs)

Dataset.set_format


print_source(datasets.Dataset.set_format, exclude_doc=True)
    @fingerprint_transform(inplace=True)
    def set_format(self, type: Optional[str]=None, columns: Optional[List]=None,
        output_all_columns: bool=False, **format_kwargs):
        format_kwargs.update(format_kwargs.pop('format_kwargs', {}))
        type = get_format_type_from_alias(type)
        _ = get_formatter(type, features=self.features, **format_kwargs)
        if isinstance(columns, str):
            columns = [columns]
        if isinstance(columns, tuple):
            columns = list(columns)
        if columns is not None and any(col not in self._data.column_names for
            col in columns):
            raise ValueError(
                f'Columns {list(filter(lambda col: col not in self._data.column_names, columns))} not in the dataset. Current columns in the dataset: {self._data.column_names}'
                )
        if columns is not None:
            columns = columns.copy()
        self._format_type = type
        self._format_kwargs = format_kwargs
        self._format_columns = columns
        self._output_all_columns = output_all_columns
        logger.debug(
            'Set __getitem__(key) output type to %s for %s columns  (when key is int or slice) and %s output other (un-formatted) columns.'
            , 'python objects' if type is None else type, 'no' if columns is
            None else str(columns), 'do' if output_all_columns else "don't")

emotions.set_format(type="pandas")
df = emotions["train"][:]
df.head()
text label
0 i didnt feel humiliated 0
1 i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake 0
2 im grabbing a minute to post i feel greedy wrong 3
3 i am ever feeling nostalgic about the fireplace i will know that it is still on the property 2
4 i am feeling grouchy 3

ClassLabel.int2str

  • Documentation
  • Convert an integer label to the corresponding class name string.

print_source(datasets.ClassLabel.int2str)
    def int2str(self, values: Union[int, Iterable]):
        assert isinstance(values, int) or isinstance(values, Iterable
            ), f'Values {values} should be an integer or an Iterable (list, numpy array, pytorch, tensorflow tensors)'
        return_list = True
        if isinstance(values, int):
            values = [values]
            return_list = False
        for v in values:
            if not 0 <= v < self.num_classes:
                raise ValueError(f'Invalid integer class label {v:d}')
        if self._int2str:
            output = [self._int2str[int(v)] for v in values]
        else:
            output = [str(v) for v in values]
        return output if return_list else output[0]

# Get the corresponding emotion name
def label_int2str(row):
    return emotions["train"].features["label"].int2str(row)
# Add a new column with the corresponding emotion name
df["emotion"] = df["label"].apply(label_int2str)
df.head()
text label emotion
0 i didnt feel humiliated 0 sadness
1 i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake 0 sadness
2 im grabbing a minute to post i feel greedy wrong 3 anger
3 i am ever feeling nostalgic about the fireplace i will know that it is still on the property 2 love
4 i am feeling grouchy 3 anger

Looking at the Class Distribution

  • A Recipe for Training Neural Networks
    • The first step to training a neural network involves thoroughly inspecting the data.
    • Understand the distribution of the training examples and look for patterns.
  • Datasets with skewed class distribution might require a different treatment regarding the training loss and evaluation metrics.

import matplotlib.pyplot as plt
# Increase the figure size
plt.rcParams["figure.figsize"] = (10,6)
# Create a horizontal bar chart
df["emotion"].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()
# Reset the figure size
plt.rcParams["figure.figsize"] = plt.rcParamsDefault["figure.figsize"]

Note: Messages expressing joy and sadness are about 5-10 times more common than messages expressing love and surprise.

Methods to Deal with Imbalanced Data

  • Randomly oversample the minority class.
  • Randomly undersample the majority class.
  • Gather more labeled data from the underrepresented classes.

imbalanced-learn

  • Documentation
  • This library extends scikit-learn and provides tools for dealing with imbalanced classes.
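
A minimal sketch of random oversampling with imbalanced-learn (not used in this chapter; the package is installed as imbalanced-learn and imported as imblearn):

from imblearn.over_sampling import RandomOverSampler

# Duplicate examples from the minority classes until every class is as frequent as the largest one
ros = RandomOverSampler(random_state=42)
texts_resampled, labels_resampled = ros.fit_resample(df[["text"]], df["label"])
pd.Series(labels_resampled).value_counts()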

How Long Are Our Tweets?

  • Transformer models have a maximum input sequence length called the maximum context size.
  • The maximum context size for DistilBERT is 512 tokens, which is roughly equivalent to a few paragraphs of text.
  • We need to truncate pieces of text that do not fit in a model’s context size, which might remove crucial information.
  • We can approximate the number of tokens per tweet for each emotion by looking at the distribution of words per tweet.

# Increase the figure size
plt.rcParams["figure.figsize"] = (10,7)
# Create a new column containing the number of words for each tweet
df["Words Per Tweet"] = df["text"].str.split().apply(len)
# Create a box plot
df.boxplot("Words Per Tweet", by="emotion", grid=False, showfliers=False, color="black")
plt.suptitle("")
plt.xlabel("")
plt.show()
# Reset the figure size
plt.rcParams["figure.figsize"] = plt.rcParamsDefault["figure.figsize"]

Note: Most tweets are between 15 and 20 words long, with a max length of around 50 words.

DatasetDict.reset_format()

  • Documentation
  • Return the format to Python objects for all datasets in the dictionary.
  • Calls set_format with its default arguments.

datasets.DatasetDict.reset_format
    <function datasets.dataset_dict.DatasetDict.reset_format(self)>

print_source(datasets.DatasetDict.reset_format)
    def reset_format(self):
        self._check_values_type()
        for dataset in self.values():
            dataset.set_format()

print_source(datasets.Dataset.reset_format)
    def reset_format(self):
        self.set_format()

emotions.reset_format()

From Text to Tokens

  • Transformer models cannot receive raw strings as input.
  • Text first needs to be tokenized and encoded as numerical vectors.
  • The three main tokenization strategies are character tokenization, word tokenization, and subword tokenization.

IMPORTANT: Use the same tokenizer when training, fine-tuning, and performing inference with a given model.

Character Tokenization

  • Character-based tokenizers split the text into single characters.
  • Character tokenization results in a smaller vocabulary and far fewer out-of-vocabulary tokens.
  • It also results in a much higher number of tokens for a given input sequence.
  • A character-based representation is less meaningful compared to using words.
  • The model needs to learn linguistic structures like words from the data, making training more expensive.
  • Most projects do not use character tokenization.

text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
print(tokenized_text)
    ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ' ', 'i', 's', ' ', 'a', ' ', 'c', 'o', 'r', 'e', ' ', 't', 'a', 's', 'k', ' ', 'o', 'f', ' ', 'N', 'L', 'P', '.']

Numericalization

  • Models can only process numbers, so we need to encode tokens as numerical data.
  • A simple encoding method is to convert each unique token to a unique integer.

# Map each unique token to a unique integer
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
print(token2idx)
    {' ': 0, '.': 1, 'L': 2, 'N': 3, 'P': 4, 'T': 5, 'a': 6, 'c': 7, 'e': 8, 'f': 9, 'g': 10, 'i': 11, 'k': 12, 'n': 13, 'o': 14, 'r': 15, 's': 16, 't': 17, 'x': 18, 'z': 19}

# Encode the tokenized text
input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids)
    [5, 14, 12, 8, 13, 11, 19, 11, 13, 10, 0, 17, 8, 18, 17, 0, 11, 16, 0, 6, 0, 7, 14, 15, 8, 0, 17, 6, 16, 12, 0, 14, 9, 0, 3, 2, 4, 1]

One-hot Encoding

  • It is common to encode categorical variables as one-hot vectors, where a single entry has the value 1, and every other entry has the value 0.
  • One-hot encoding can help prevent the model from learning undesired relationships like fictitious ordering between names.

# Sample categorical data
categorical_df = pd.DataFrame(
    {"Name": ["Bumblebee", "Optimus Prime", "Megatron"], "Label ID": [0,1,2]})
categorical_df
Name Label ID
0 Bumblebee 0
1 Optimus Prime 1
2 Megatron 2

Note: A model might interpret the order of values in the Label ID column as significant.

# Create one-hot vectors for each unique value in the Name column
pd.get_dummies(categorical_df["Name"])
Bumblebee Megatron Optimus Prime
0 1 0 0
1 0 0 1
2 0 1 0

Note: We can use this approach to prevent the model from learning similar undesired relationships in the input_ids list.

import torch
import torch.nn.functional as F

PyTorch one_hot:

  • Documentation
  • Generate one-hot encodings for a tensor with a specified number of classes

len(input_ids), len(token2idx)
    (38, 20)

# Convert input_ids list to a tensor
input_ids = torch.tensor(input_ids)
# Generate one-hot encodings
one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
one_hot_encodings.shape
    torch.Size([38, 20])

Note: Make sure to set num_classes to the vocabulary size.

print(f"Token: {tokenized_text[0]}")
print(f"Tensor index: {input_ids[0]}")
print(f"One-hot: {one_hot_encodings[0]}")
    Token: T
    Tensor index: 5
    One-hot: tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Word Tokenization

  • Word-based tokenizers split the text into words and map each word to an integer.
  • Word tokenization creates less work for the model than character tokenization because it does not need to learn linguistic structures like words from the data.
  • There is a loss of meaning across very similar words.
  • User-defined rules tell the tokenizer how to split the raw text into words.
  • A simple tokenization method is to split text using whitespace.
  • More sophisticated word tokenizers have additional rules to handle punctuation.
  • Other methods involve stemming or lemmatization to normalize words to their stem.
    • Example: “great”, “greater”, and “greatest” all become “great”
    • These methods come at the expense of losing some information in the text.
  • Word tokenization results in a bigger vocabulary size, which requires more model parameters.
    • The first layer of a model that compresses the one-hot input vectors for a vocabulary of one million unique words down to 1,000-dimensional vectors would alone contain one billion weights.
  • A popular method to limit the vocabulary size involves only adding the 100,000 most common words in the corpus.
    • Words that are not part of the vocabulary are classified as unknown and mapped to a shared token.
    • This approach risks losing potentially important information related to rare words.

tokenized_text = text.split()
print(tokenized_text)
    ['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']
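
A minimal sketch of the vocabulary-limiting idea described above: keep only the most common words and map everything else to a shared unknown token (toy corpus and sizes for illustration):

from collections import Counter

corpus = ["tokenizing text is fun", "text models like clean text", "tokenizing is a core task"]
# Count word frequencies across the corpus
word_counts = Counter(word for sentence in corpus for word in sentence.split())

# Keep only the top-k most common words; everything else maps to [UNK]
top_k = 5
vocab = {"[UNK]": 0}
for idx, (word, _) in enumerate(word_counts.most_common(top_k), start=1):
    vocab[word] = idx

def encode(sentence):
    # Out-of-vocabulary words fall back to the shared [UNK] id
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]

print(vocab)
print(encode("tokenizing noisy text is fun"))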

Subword Tokenization

  • Subword tokenization algorithms decompose rare words into meaningful subwords while keeping the most frequently used words as unique entities.
  • Subword tokenization algorithms can identify start-of-word tokens.
  • Most state-of-the-art English models use subword tokenization.
  • Subword tokenization helps keep vocabulary size and input length manageable by sharing information across different words.
  • There are several tokenization methods used in NLP.

from transformers import AutoTokenizer

AutoTokenizer

  • Documentation
  • Quickly load the tokenizer associated with a pretrained model.
  • AutoTokenizer belongs to a set of auto classes that automatically retrieve the model’s configuration, pretrained weights, or vocabulary from the name of a checkpoint.

AutoTokenizer
    transformers.models.auto.tokenization_auto.AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

Note: Hugging Face automatically caches the parameters of the pretrained tokenizer after the first download.


type(tokenizer)
    transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast

DistilBertTokenizerFast

  • Documentation
  • Construct a “fast” DistilBERT tokenizer that runs end-to-end tokenization, including punctuation and WordPiece.

tokenizer
    PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

tokenizer.init_kwargs
    {'do_lower_case': True,
     'unk_token': '[UNK]',
     'sep_token': '[SEP]',
     'pad_token': '[PAD]',
     'cls_token': '[CLS]',
     'mask_token': '[MASK]',
     'tokenize_chinese_chars': True,
     'strip_accents': None,
     'model_max_length': 512,
     'special_tokens_map_file': None,
     'name_or_path': 'distilbert-base-uncased'}

print(text)
    Tokenizing text is a core task of NLP.

# Map the raw text content to unique integers
encoded_text = tokenizer(text)
print(encoded_text)
    {'input_ids': [101, 19204, 6026, 3793, 2003, 1037, 4563, 4708, 1997, 17953, 2361, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

convert_ids_to_tokens

  • Documentation
  • Convert a single index or a sequence of indices to a token or a sequence of tokens, using the vocabulary and added tokens.

print_source(tokenizer.convert_ids_to_tokens)
    def convert_ids_to_tokens(self, ids: Union[int, List[int]],
        skip_special_tokens: bool=False) ->Union[str, List[str]]:
        if isinstance(ids, int):
            return self._tokenizer.id_to_token(ids)
        tokens = []
        for index in ids:
            index = int(index)
            if skip_special_tokens and index in self.all_special_ids:
                continue
            tokens.append(self._tokenizer.id_to_token(index))
        return tokens

# Convert integer ids to tokens
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
    ['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core', 'task', 'of', 'nl', '##p', '.', '[SEP]']

Note:

  1. The [CLS] and [SEP] tokens indicate the start and end of a sequence, respectively.
  2. All the tokens are lower case.
  3. The words “tokenizing” and “NLP” have been decomposed into subwords since they are rare.
  4. The ## prefix indicates the preceding string was not whitespace.


# Convert tokens to plain text
print(tokenizer.convert_tokens_to_string(tokens))
    [CLS] tokenizing text is a core task of nlp. [SEP]

print_source(tokenizer.convert_tokens_to_string)
    def convert_tokens_to_string(self, tokens: List[str]) ->str:
        return self.backend_tokenizer.decoder.decode(tokens)

tokenizer.vocab_size
    30522

tokenizer.model_max_length
    512

tokenizer.model_input_names
    ['input_ids', 'attention_mask']

Tokenizing the Whole Dataset

  • We need to define a processing function to tokenize training examples.

def tokenize(batch):
    # Apply the tokenizer to a batch of examples
    return tokenizer(batch["text"],
                     # Pad examples with zeros to the longest one in the batch
                     padding=True, 
                     # Truncate the examples to the model's maximum context size
                     truncation=True)
print(tokenize(emotions["train"][:2]))
    {'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

Note:

  1. The zeros have a corresponding [PAD] token in the vocabulary.
  2. The attention mask allows the model to ignore padded parts of the input.


tokens2ids = list(zip(tokenizer.all_special_tokens, tokenizer.all_special_ids))
data = sorted(tokens2ids, key=lambda x : x[-1])
df = pd.DataFrame(data, columns=["Special Token", "Special Token ID"])
df.T
0 1 2 3 4
Special Token [PAD] [UNK] [CLS] [SEP] [MASK]
Special Token ID 0 100 101 102 103

DatasetDict.map

  • Documentation
  • Apply a function to all the elements in the tables for all datasets in the dictionary.

print_source(datasets.DatasetDict.map)
    def map(self, function, with_indices: bool=False, input_columns: Optional[
        Union[str, List[str]]]=None, batched: bool=False, batch_size: Optional[
        int]=1000, remove_columns: Optional[Union[str, List[str]]]=None,
        keep_in_memory: bool=False, load_from_cache_file: bool=True,
        cache_file_names: Optional[Dict[str, Optional[str]]]=None,
        writer_batch_size: Optional[int]=1000, features: Optional[Features]=
        None, disable_nullable: bool=False, fn_kwargs: Optional[dict]=None,
        num_proc: Optional[int]=None, desc: Optional[str]=None) ->'DatasetDict':
        self._check_values_type()
        if cache_file_names is None:
            cache_file_names = {k: None for k in self}
        return DatasetDict({k: dataset.map(function=function, with_indices=
            with_indices, input_columns=input_columns, batched=batched,
            batch_size=batch_size, remove_columns=remove_columns,
            keep_in_memory=keep_in_memory, load_from_cache_file=
            load_from_cache_file, cache_file_name=cache_file_names[k],
            writer_batch_size=writer_batch_size, features=features,
            disable_nullable=disable_nullable, fn_kwargs=fn_kwargs, num_proc=
            num_proc, desc=desc) for k, dataset in self.items()})

Dataset.map

  • Documentation
  • Apply a function to all the examples in the table and update the table.

# Apply the processing function across all splits in the corpus in batches
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

Note:

  • The map function applies the processing function to the entire dataset as a single batch when the batch size is None.
  • This approach ensures the input tensors and attention masks have the same shape globally.

print(emotions_encoded["train"].column_names)
    ['attention_mask', 'input_ids', 'label', 'text']

Training a Text Classifier

  • Models like DistilBERT are pretrained to predict masked words in a sequence, and we need to modify them for text classification.
  • We can combine the body of the pretrained model with a custom classification head.

Architecture of an Encoder-Based Classifier

  1. Tokenize the text and represent it as one-hot vectors called token encodings.
    • The size of the tokenized vocabulary determines the dimension of the token encodings; vocabularies usually consist of 20,000 to 200,000 unique tokens.
  2. Convert the token encodings to token embeddings, which are vectors living in a lower-dimensional space.
  3. Pass the token embeddings through the encoder block layers, which yield a hidden state for each input token.
  4. Replace the pretrained language modeling layer with a classification layer.

Note: PyTorch skips the step of creating one-hot vectors because multiplying a matrix with a one-hot vector is the same as selecting a column with the token ID from the matrix.
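
A small sketch of that equivalence with toy sizes (here the embedding matrix stores one vector per row, so the one-hot product picks out rows by token ID):

import torch
import torch.nn.functional as F

vocab_size, hidden_dim = 10, 4
# Toy embedding matrix with one row per token in the vocabulary
embedding_matrix = torch.randn(vocab_size, hidden_dim)

token_ids = torch.tensor([2, 5, 7])
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()

# (3, 10) @ (10, 4) -> (3, 4): identical to indexing the matrix with the token IDs
print(torch.allclose(one_hot @ embedding_matrix, embedding_matrix[token_ids]))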

Methods to Train a Text Classifier

Feature Extraction

  • Use the hidden states as features and train the classifier on them without modifying the pretrained model.

Fine-tuning

  • Train the whole model end-to-end, which also updates the parameters of the pretrained model.

Transformers as Feature Extractors

  • This method is well-suited for quickly training a small or shallow model.
  • The model could be a neural classification layer or a method that does not rely on gradients like random forests.
  • Using the transformer as a feature extractor is especially useful when GPUs are unavailable since the hidden states only need to be precomputed once.

Using pretrained models


from transformers import AutoModel

AutoModel.from_pretrained

  • Documentation
  • Instantiate one of the base model classes of the library from a pretrained model.

print_source(AutoModel.from_pretrained)
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
        trust_remote_code = kwargs.pop('trust_remote_code', False)
        kwargs['_from_auto'] = True
        if not isinstance(config, PretrainedConfig):
            config, kwargs = AutoConfig.from_pretrained(
                pretrained_model_name_or_path, return_unused_kwargs=True, **kwargs)
        if hasattr(config, 'auto_map') and cls.__name__ in config.auto_map:
            if not trust_remote_code:
                raise ValueError(
                    f'Loading {pretrained_model_name_or_path} requires you to execute the modeling file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.'
                    )
            if kwargs.get('revision', None) is None:
                logger.warn(
                    'Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.'
                    )
            class_ref = config.auto_map[cls.__name__]
            module_file, class_name = class_ref.split('.')
            model_class = get_class_from_dynamic_module(
                pretrained_model_name_or_path, module_file + '.py', class_name,
                **kwargs)
            return model_class.from_pretrained(pretrained_model_name_or_path, *
                model_args, config=config, **kwargs)
        elif type(config) in cls._model_mapping.keys():
            model_class = _get_model_class(config, cls._model_mapping)
            return model_class.from_pretrained(pretrained_model_name_or_path, *
                model_args, config=config, **kwargs)
        raise ValueError(
            f"""Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.
    Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."""
            )

model_ckpt = "distilbert-base-uncased"
# Use a CUDA GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Instantiate a pretrained DistilBertModel
model = AutoModel.from_pretrained(model_ckpt).to(device)

model
    DistilBertModel(
      (embeddings): Embeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (transformer): Transformer(
        (layer): ModuleList(
          (0): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (ffn): FFN(
              (dropout): Dropout(p=0.1, inplace=False)
              (lin1): Linear(in_features=768, out_features=3072, bias=True)
              (lin2): Linear(in_features=3072, out_features=768, bias=True)
            )
            (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          )
          (1): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (ffn): FFN(
              (dropout): Dropout(p=0.1, inplace=False)
              (lin1): Linear(in_features=768, out_features=3072, bias=True)
              (lin2): Linear(in_features=3072, out_features=768, bias=True)
            )
            (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          )
          (2): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (ffn): FFN(
              (dropout): Dropout(p=0.1, inplace=False)
              (lin1): Linear(in_features=768, out_features=3072, bias=True)
              (lin2): Linear(in_features=3072, out_features=768, bias=True)
            )
            (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          )
          (3): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (ffn): FFN(
              (dropout): Dropout(p=0.1, inplace=False)
              (lin1): Linear(in_features=768, out_features=3072, bias=True)
              (lin2): Linear(in_features=3072, out_features=768, bias=True)
            )
            (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          )
          (4): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (ffn): FFN(
              (dropout): Dropout(p=0.1, inplace=False)
              (lin1): Linear(in_features=768, out_features=3072, bias=True)
              (lin2): Linear(in_features=3072, out_features=768, bias=True)
            )
            (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          )
          (5): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (ffn): FFN(
              (dropout): Dropout(p=0.1, inplace=False)
              (lin1): Linear(in_features=768, out_features=3072, bias=True)
              (lin2): Linear(in_features=3072, out_features=768, bias=True)
            )
            (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          )
        )
      )
    )

Extracting the last hidden states


# Encode some sample text
text = "this is a test"
inputs = tokenizer(text, return_tensors="pt")
print(f"Input tensor shape: {inputs['input_ids'].size()}")
    Input tensor shape: torch.Size([1, 6])

# Move the input tensors to the same device as the model
inputs = {k:v.to(device) for k,v in inputs.items()}
# Get the last hidden states for the sample text
with torch.no_grad():
    outputs = model(**inputs)
print(outputs)
    BaseModelOutput(last_hidden_state=tensor([[[-0.1565, -0.1862,  0.0528,  ..., -0.1188,  0.0662,  0.5470],
             [-0.3575, -0.6484, -0.0618,  ..., -0.3040,  0.3508,  0.5221],
             [-0.2772, -0.4459,  0.1818,  ..., -0.0948, -0.0076,  0.9958],
             [-0.2841, -0.3917,  0.3753,  ..., -0.2151, -0.1173,  1.0526],
             [ 0.2661, -0.5094, -0.3180,  ..., -0.4203,  0.0144, -0.2149],
             [ 0.9441,  0.0112, -0.4714,  ...,  0.1439, -0.7288, -0.1619]]],
           device='cuda:0'), hidden_states=None, attentions=None)

Note: The hidden state tensor has the shape [batch_size, n_tokens, hidden_dim].


outputs.last_hidden_state.size()
    torch.Size([1, 6, 768])

outputs.last_hidden_state[:,0].size()
    torch.Size([1, 768])

Note:

  • For classification tasks, it is common to use the hidden state associated with the [CLS] token as the input feature.
  • Since the [CLS] token appears at the start of each sequence, we can extract it by accessing the associated index of the hidden state tensor.


def extract_hidden_states(batch):
    # Place model inputs on the GPU
    inputs = {k:v.to(device) for k,v in batch.items() 
              if k in tokenizer.model_input_names}
    # Disable automatic calculation of the gradient
    with torch.no_grad():
        # Extract last hidden states
        last_hidden_state = model(**inputs).last_hidden_state
    # Return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}

Note: The map() method requires the processing function to return Python or NumPy objects.


emotions_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])
# Extract the hidden states for every token in the dataset
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True)

emotions_hidden["train"].column_names
    ['attention_mask', 'hidden_state', 'input_ids', 'label', 'text']

Creating a feature matrix

  • We can use the hidden states as input features and the labels as targets.

import numpy as np
# Get the input and target data for the training and validation sets
X_train = np.array(emotions_hidden["train"]["hidden_state"])
X_valid = np.array(emotions_hidden["validation"]["hidden_state"])
y_train = np.array(emotions_hidden["train"]["label"])
y_valid = np.array(emotions_hidden["validation"]["label"])
X_train.shape, X_valid.shape
    ((16000, 768), (2000, 768))
X_train[0].size, y_train[0].size
    (768, 1)

Visualizing the training set


from umap import UMAP
from sklearn.preprocessing import MinMaxScaler

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

  • Documentation
  • UMAP is a dimension reduction technique that can be useful for visualization as a drop-in replacement for t-SNE.
  • We can use the UMAP algorithm to scale the 768-dimensional vectors down to a 2-dimensional representation.
  • UMAP works best with feature values scaled to [0,1].

Scikit-Learn MinMaxScaler

  • Documentation
  • Transform features by scaling each to a given range.

# Scale features to [0,1] range
X_scaled = MinMaxScaler().fit_transform(X_train)
# Initialize and fit UMAP
mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
# Create a DataFrame of 2D embeddings
df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
df_emb["label"] = y_train
df_emb.head()
X Y label
0 4.345317 6.545871 0
1 -2.770711 5.816418 0
2 5.433491 3.048345 3
3 -2.210309 3.550199 2
4 -3.055895 3.723285 3

Note: The UMAP algorithm has compressed the hidden state vectors from 768 dimensions to 2 dimensions.

matplotlib.pyplot.hexbin


fig, axes = plt.subplots(2, 3, figsize=(14,10))
# Collapse the array into one dimension
axes = axes.flatten()
cmaps = ["Greys", "Blues", "Oranges", "Reds", "Purples", "Greens"]
labels = emotions["train"].features["label"].names

for i, (label, cmap) in enumerate(zip(labels, cmaps)):
    df_emb_sub = df_emb.query(f"label == {i}")
    axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"], cmap=cmap, gridsize=20, linewidths=(0,))
    axes[i].set_title(label)
    axes[i].set_xticks([]), axes[i].set_yticks([])

# Adjust the padding between and around subplots.
plt.tight_layout()
plt.show()

Note:

  • The negative feelings such as sadness, anger, and fear occupy similar regions in the hidden state space with slightly varying distributions.
  • The positive emotions, joy and love, are well separated from the negative emotions and share a similar space.
  • Surprise is scattered all over the place.
  • The model did not train to distinguish between these emotions. It learned them implicitly by guessing masked words in the training corpus.
  • Just because some categories overlap when projected onto a lower-dimensional space does not mean they are not separable in the original space.

Training a simple classifier

  • We can use the hidden states to train a simple logistic regression model.
  • This type of model trains quickly and does not require a GPU.

from sklearn.linear_model import LogisticRegression

sklearn.linear_model.LogisticRegression


# Increase `max_iter` to guarantee convergence 
lr_clf = LogisticRegression(max_iter=3000)
lr_clf.fit(X_train, y_train)
    LogisticRegression(max_iter=3000)

lr_clf.score(X_valid, y_valid)
    0.6335

Note: The model performs well, considering the training data is unbalanced.

from sklearn.dummy import DummyClassifier

sklearn.dummy.DummyClassifier

  • Documentation
  • A DummyClassifier makes predictions using a predefined strategy and ignores the input features.
  • The classifier serves as a simple baseline to compare against other more complex classifiers.

# Set the DummyClassifier to always select the most frequent class
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_valid, y_valid)
    0.352

Note: The simple logistic regression classifier performs significantly better than a model that always selects the most frequent class.


from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

sklearn.metrics.ConfusionMatrixDisplay

sklearn.metrics.confusion_matrix

  • Documentation
  • Compute confusion matrix to evaluate the accuracy of a classification.

def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
    
y_preds = lr_clf.predict(X_valid)
plot_confusion_matrix(y_preds, y_valid, labels)

Note:

  • Anger and fear are most often confused with sadness.
  • Love and surprise are frequently mistaken for joy.

Fine-Tuning Transformers

  • Fine-tuning results in better performance than feature extraction but requires more computational resources, such as GPUs.
  • Fine-tuning involves training the hidden states, so the classification head needs to be differentiable.
  • Training the hidden states that serve as input to the classifier helps avoid the problem of working with data that may not be well suited for the classification task.

Loading a pretrained model


from transformers import AutoModelForSequenceClassification

AutoModelForSequenceClassification.from_pretrained

  • Documentation
  • Instantiate one of the model classes of the library (with a sequence classification head) from a pretrained model.

# Specify the number of labels (i.e. the number of emotions)
num_labels = 6
model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, num_labels=num_labels)
         .to(device))

Note: The classifier head is randomly initialized.


type(model)
    transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification

DistilBertForSequenceClassification

  • Documentation
  • DistilBert Model transformer with a sequence classification head on top

for child in model.named_children(): print(child[0])
    distilbert
    pre_classifier
    classifier
    dropout

list(model.named_children())[-3:]
    [('pre_classifier', Linear(in_features=768, out_features=768, bias=True)),
     ('classifier', Linear(in_features=768, out_features=6, bias=True)),
     ('dropout', Dropout(p=0.2, inplace=False))]
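
The difference between feature extraction and fine-tuning comes down to which of these parameters receive gradient updates. Below is a minimal sketch of my own (not from the book) that counts the trainable parameters and shows how one could freeze the DistilBERT body to recover the feature-extraction setup.

# Minimal sketch (my own, not from the book). Assumes `model` is the
# DistilBertForSequenceClassification instance loaded above.
def count_trainable(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

print(f"Fine-tuning updates {count_trainable(model):,} parameters")

# For feature extraction, we would instead freeze the body so that only the
# classification head (pre_classifier + classifier) trains:
# for param in model.distilbert.parameters():
#     param.requires_grad = False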

Defining the performance metrics

  • We need to define a function to compute metrics for the trainer so we can monitor performance during training.
  • The function receives an EvalPrediction object containing predictions and label_ids attributes and returns a dictionary that maps each metric’s name to its value.

from sklearn.metrics import accuracy_score, f1_score

sklearn.metrics.f1_score

sklearn.metrics.accuracy_score


def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

Training the model

  • We can use the Hugging Face Hub API to push our fine-tuned model to our account on the Hub and share it with the community.

from huggingface_hub import notebook_login

inspect.getdoc(notebook_login)
    'Displays a widget to login to the HF website and store the token.'

print_source(notebook_login)
    def notebook_login():
        try:
            import ipywidgets.widgets as widgets
            from IPython.display import clear_output, display
        except ImportError:
            raise ImportError(
                'The `notebook_login` function can only be used in a notebook (Jupyter or Colab) and you need the `ipywdidgets` module: `pip install ipywidgets`.'
                )
        box_layout = widgets.Layout(display='flex', flex_flow='column',
            align_items='center', width='50%')
        token_widget = widgets.Password(description='Token:')
        token_finish_button = widgets.Button(description='Login')
        switch_button = widgets.Button(description='Use password')
        login_token_widget = widgets.VBox([widgets.HTML(
            NOTEBOOK_LOGIN_TOKEN_HTML_START), token_widget, token_finish_button,
            widgets.HTML(NOTEBOOK_LOGIN_TOKEN_HTML_END), switch_button], layout
            =box_layout)
        display(login_token_widget)
        input_widget = widgets.Text(description='Username:')
        password_widget = widgets.Password(description='Password:')
        password_finish_button = widgets.Button(description='Login')
        login_password_widget = widgets.VBox([widgets.HTML(value=
            NOTEBOOK_LOGIN_PASSWORD_HTML), widgets.HBox([input_widget,
            password_widget]), password_finish_button], layout=box_layout)
    
        def login_token_event(t):
            token_widget.value = ''
            clear_output()
            _login(HfApi(), token=token)
        token_finish_button.on_click(login_token_event)
    
        def login_password_event(t):
            password = password_widget.value
            password_widget.value = ''
            clear_output()
            _login(HfApi(), username=username, password=password)
        password_finish_button.on_click(login_password_event)
    
        def switch_event(t):
            display(login_password_widget)
        switch_button.on_click(switch_event)

notebook_login()
!git config --global credential.helper store

Note: The equivalent terminal command is huggingface-cli login.


from transformers import TrainingArguments

TrainingArguments

  • Documentation
  • The TrainingArguments class provides fine-grained control over the arguments related to the training loop.

pd.DataFrame(inspect.signature(TrainingArguments).parameters).T
    The output lists the 80 constructor parameters of TrainingArguments: output_dir, overwrite_output_dir, do_train, do_eval, do_predict, evaluation_strategy, prediction_loss_only, per_device_train_batch_size, per_device_eval_batch_size, per_gpu_train_batch_size, per_gpu_eval_batch_size, gradient_accumulation_steps, eval_accumulation_steps, learning_rate, weight_decay, adam_beta1, adam_beta2, adam_epsilon, max_grad_norm, num_train_epochs, max_steps, lr_scheduler_type, warmup_ratio, warmup_steps, log_level, log_level_replica, log_on_each_node, logging_dir, logging_strategy, logging_first_step, logging_steps, logging_nan_inf_filter, save_strategy, save_steps, save_total_limit, save_on_each_node, no_cuda, seed, fp16, fp16_opt_level, fp16_backend, fp16_full_eval, local_rank, xpu_backend, tpu_num_cores, tpu_metrics_debug, debug, dataloader_drop_last, eval_steps, dataloader_num_workers, past_index, run_name, disable_tqdm, remove_unused_columns, label_names, load_best_model_at_end, metric_for_best_model, greater_is_better, ignore_data_skip, sharded_ddp, deepspeed, label_smoothing_factor, adafactor, group_by_length, length_column_name, report_to, ddp_find_unused_parameters, dataloader_pin_memory, skip_memory_metrics, use_legacy_prediction_loop, push_to_hub, resume_from_checkpoint, hub_model_id, hub_strategy, hub_token, gradient_checkpointing, push_to_hub_model_id, push_to_hub_organization, push_to_hub_token, mp_parameters

batch_size = 64
logging_steps = len(emotions_encoded["train"]) // batch_size
model_name = f"{model_ckpt}-finetuned-emotion"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  push_to_hub=True, 
                                  log_level="error")
training_args.output_dir
    'distilbert-base-uncased-finetuned-emotion'

from transformers import Trainer

Trainer

  • Documentation
  • The Trainer class provides a simple, feature-complete training and eval loop for PyTorch, optimized for Hugging Face Transformers.

Note: Install Git-LFS before running the following code cell.


trainer = Trainer(model=model, args=training_args, 
                  compute_metrics=compute_metrics,
                  train_dataset=emotions_encoded["train"],
                  eval_dataset=emotions_encoded["validation"],
                  tokenizer=tokenizer)

Note: I had to add the following workaround, which wraps the default data collator so that batches are returned as plain dicts:


# Wrap the default collator so each batch is returned as a plain dict
old_collator = trainer.data_collator
trainer.data_collator = lambda data: dict(old_collator(data))

trainer.train();
| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | 0.044200      | 0.239172        | 0.926000 | 0.926452 |
| 2     | 0.046300      | 0.220463        | 0.936000 | 0.936133 |

preds_output = trainer.predict(emotions_encoded["validation"])

Note: The predict method returns a PredictionOutput object, which contains arrays of predictions and label_ids, along with the user-defined metrics.
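
A quick check of my own (not from the book): the predictions attribute holds the raw logits, with one row per validation example and one column per emotion class.

# My own check: logits array of shape (num_validation_examples, num_labels)
print(preds_output.predictions.shape)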


type(preds_output)
transformers.trainer_utils.PredictionOutput

preds_output.metrics
{'test_loss': 0.22046349942684174,
 'test_accuracy': 0.936,
 'test_f1': 0.9361334972007946,
 'test_runtime': 1.522,
 'test_samples_per_second': 1314.038,
 'test_steps_per_second': 21.025}

Note: The fine-tuned model performs significantly better than the feature-based logistic regression classifier.


# Get the predicted labels
y_preds = np.argmax(preds_output.predictions, axis=1)
plot_confusion_matrix(y_preds, y_valid, labels)

Note:

  • This one is much closer to the ideal diagonal confusion matrix than the confusion matrix for the logistic regression model.
  • The love category is still often confused with joy.
  • Surprise is frequently confused for joy or fear.

Error analysis

  • A simple error-analysis technique involves sorting the validation samples by the model loss.
    • This approach allows us to find and correct mislabeled data.
  • Inspecting the model’s weakest predictions can help identify quirks of the dataset.
    • Cleaning the data or injecting similar examples can make the model more robust.
  • We can significantly improve model performance by refining the dataset without obtaining more data or using a larger model.

from torch.nn.functional import cross_entropy
def forward_pass_with_label(batch):
    # Place all input tensors on the same device as the model
    inputs = {k:v.to(device) for k,v in batch.items() 
              if k in tokenizer.model_input_names}

    with torch.no_grad():
        output = model(**inputs)
        pred_label = torch.argmax(output.logits, axis=-1)
        # Compute the cross entropy loss between the prediction and the target
        loss = cross_entropy(output.logits, batch["label"].to(device), 
                             reduction="none")

    # Place outputs on CPU for compatibility with other dataset columns   
    return {"loss": loss.cpu().numpy(), 
            "predicted_label": pred_label.cpu().numpy()}

# Convert our dataset back to PyTorch tensors
emotions_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])
# Compute loss values
emotions_encoded["validation"] = emotions_encoded["validation"].map(
    forward_pass_with_label, batched=True, batch_size=16)

# Create a DataFrame with the texts, losses, and predicted/true labels
emotions_encoded.set_format("pandas")
cols = ["text", "label", "predicted_label", "loss"]
df_test = emotions_encoded["validation"][:][cols]
df_test["label"] = df_test["label"].apply(label_int2str)
df_test["predicted_label"] = (df_test["predicted_label"]
                              .apply(label_int2str))

# Sort the validation samples by the model loss
df_test.sort_values("loss", ascending=False).head(10)
|      | text | label | predicted_label | loss |
|------|------|-------|-----------------|------|
| 318  | i felt ashamed of these feelings and was scared because i knew that something wrong with me and thought i might be gay | fear | sadness | 8.869118 |
| 1509 | i guess this is a memoir so it feels like that should be fine too except i dont know something about such a deep amount of self absorption made me feel uncomfortable | joy | fear | 8.770837 |
| 1950 | i as representative of everything thats wrong with corporate america and feel that sending him to washington is a ludicrous idea | surprise | sadness | 8.217673 |
| 882  | i feel badly about reneging on my commitment to bring donuts to the faithful at holy family catholic church in columbus ohio | love | sadness | 8.134083 |
| 1757 | i feel like there s a reason to buy another tom petty record | anger | joy | 7.790391 |
| 1111 | im lazy my characters fall into categories of smug and or blas people and their foils people who feel inconvenienced by smug and or blas people | joy | fear | 7.778357 |
| 1500 | i guess we would naturally feel a sense of loneliness even the people who said unkind things to you might be missed | anger | sadness | 7.741042 |
| 1919 | i should admit when consuming alcohol myself in small amounts i feel much less inhibited ideas come to me more easily and i can write with greater ease | fear | sadness | 7.342785 |
| 415  | im kind of embarrassed about feeling that way though because my moms training was such a wonderfully defining part of my own life and i loved and still love | love | sadness | 7.320217 |
| 1801 | i feel that he was being overshadowed by the supporting characters | love | sadness | 6.833299 |

Note:

  • The model made some incorrect predictions.
  • Some examples seem mislabeled or do not fit into one of the six emotion classes.
  • Joy, in particular, seems to be mislabeled several times.
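
Building on that, here is a small sketch of my own (not from the book) that flags high-loss examples where the true and predicted labels disagree as candidates for relabeling:

# My own follow-up sketch: flag high-loss disagreements as relabeling candidates.
# The loss threshold of 3.0 is arbitrary.
relabel_candidates = df_test[(df_test["label"] != df_test["predicted_label"]) &
                             (df_test["loss"] > 3.0)]
relabel_candidates.sort_values("loss", ascending=False).head()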

df_test.sort_values("loss", ascending=True).head(10)
|      | text | label | predicted_label | loss |
|------|------|-------|-----------------|------|
| 702  | i only find out that they are looking and feeling complacent just before a match started and i have no other way to find out except through the assistant manager | joy | joy | 0.000212 |
| 1205 | i log on feeling vaguely sociable and after a short amount of time im all socialised out | joy | joy | 0.000214 |
| 1607 | i feel incredibly mellow and spacey | joy | joy | 0.000214 |
| 452  | i manage to complete the lap not too far behind the front runners and am feeling pretty jubilant until i realise that this is just the warm up | joy | joy | 0.000215 |
| 400  | i are just relaxing together and i feel ecstatic and blissfully happy because i know he loves me and i love him | joy | joy | 0.000217 |
| 911  | i feel in love with a cute little maltese | joy | joy | 0.000218 |
| 1567 | i feel wonderful shayla admitted | joy | joy | 0.000220 |
| 1198 | i feel like i should also mention that there was some content that i wasnt thrilled with either | joy | joy | 0.000220 |
| 1951 | i can finish even if i have to eat and feel satisfied bellmont cabinets before it leaves bellmont cabinets a wipe out on the spot it is not necessary to wipe out for when you o | joy | joy | 0.000221 |
| 293  | i am sure she makes all waiting couples feel this way but we left feeling like she is pulling for us and she will be so thrilled when it all works out | joy | joy | 0.000222 |

Saving and sharing the model

  • Everyone can share and download pretrained and fine-tuned models via the Hugging Face Hub.

Trainer.push_to_hub

  • Documentation
  • Upload the trainer model and tokenizer to the Hugging Face Model Hub.

trainer.push_to_hub(commit_message="Training completed!")
    'https://huggingface.co/cj-mills/distilbert-base-uncased-finetuned-emotion/commit/5ca5827ba0121e07c8056a8592398e73beca3f17'

Inference

  • We can now perform inference using the fine-tuned model from our Hub repository.

from transformers import pipeline
# Change `cj-mills` to your Hub username
model_id = "cj-mills/distilbert-base-uncased-finetuned-emotion"
classifier = pipeline("text-classification", model=model_id)

custom_tweet = "I saw a movie today and it was really good."
preds = classifier(custom_tweet, return_all_scores=True)
preds_df = pd.DataFrame(preds[0])
plt.bar(labels, 100 * preds_df["score"], color='C0')
plt.title(f'"{custom_tweet}"')
plt.ylabel("Class probability (%)")
plt.show()

custom_tweet = "I saw a movie today and it was garbage!."
preds = classifier(custom_tweet, return_all_scores=True)
preds_df = pd.DataFrame(preds[0])
plt.bar(labels, 100 * preds_df["score"], color='C0')
plt.title(f'"{custom_tweet}"')
plt.ylabel("Class probability (%)")
plt.show()

custom_tweet = "I saw a movie today and it was really weird."
preds = classifier(custom_tweet, return_all_scores=True)
preds_df = pd.DataFrame(preds[0])
plt.bar(labels, 100 * preds_df["score"], color='C0')
plt.title(f'"{custom_tweet}"')
plt.ylabel("Class probability (%)")
plt.show()
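
As a side note (my own addition), dropping return_all_scores makes the pipeline return only the top label and its score, which is handy for quick checks:

# Without `return_all_scores=True`, the pipeline returns just the top prediction
classifier("I saw a movie today and it was really good.")
# e.g. [{'label': 'joy', 'score': ...}]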

Conclusion

NLP Challenges

Moving a model to production

Increasing Inference Speed

  • The process used to create the more efficient DistilBERT model is called knowledge distillation.

Applying a Model to other tasks

  • Transformers are exceedingly versatile.

Using Non-English Text

  • Multilingual transformers are available.

Working with little labeled data

  • Fine-tuning might not be an option when little labeled training data is available.

References

Previous: Notes on Transformers Book Ch. 1

Next: Notes on Transformers Book Ch. 3