Notes on Transformers Book Ch. 9
- Dealing with Few to No Labels
- Project: Build a GitHub Issues Tagger
- Implementing a Naive Bayesline
- Working with No Labeled Data
- Working with a Few Labels
- Leveraging Unlabeled Data
- Conclusion
- References
import transformers
import datasets
import accelerate
# Only print error messages
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()
transformers.__version__, datasets.__version__, accelerate.__version__
('4.18.0', '2.1.0', '0.5.1')
import ast
# https://astor.readthedocs.io/en/latest/
import astor
import inspect
import textwrap
def print_source(obj, exclude_doc=True):
    # Get source code
    source = inspect.getsource(obj)
    # Remove any common leading whitespace from every line
    cleaned_source = textwrap.dedent(source)
    # Parse the source into an AST node.
    parsed = ast.parse(cleaned_source)

    for node in ast.walk(parsed):
        # Skip any nodes that are not class or function definitions
        if not isinstance(node, (ast.FunctionDef, ast.ClassDef, ast.AsyncFunctionDef)):
            continue
        if exclude_doc and len(node.body) > 1: node.body = node.body[1:]
    print(astor.to_source(parsed))
Dealing with Few to No Labels
- We often have little to no labeled data when starting a new project.
- Non-pretrained models do not perform well with little data.
- Annotating additional training examples is time-consuming and expensive.
- There are several methods for dealing with few to no labels.
- Zero-shot learning often sets a strong baseline when there is no labeled data.
- Standard fine-tuning works well when there is a lot of labeled data.
- We can fine-tune a language model on a large corpus of unlabeled data before training a classifier on a small number of labeled examples.
- More sophisticated methods for training with unlabeled data include Unsupervised Data Augmentation and Uncertainty-aware self-training.
- We can use few-shot learning when we only have a small number of labeled examples and no unlabeled data.
- We can also use the embeddings from a pretrained language model to perform lookups with a nearest-neighbor search.
Project: Build a GitHub Issues Tagger
- Many support teams use issue trackers like Jira or GitHub to assist users by tagging issues with metadata based on the issue’s description.
- Tags can define the issue type, the product causing the problem, or which team is responsible for handling the reported issue.
- Automating issue tagging can significantly improve productivity and enable support teams to focus on helping their users.
- The goal is to train a model that automatically tags GitHub issues for the Hugging Face Transformers library.
- GitHub issues contain a title, a description, and a set of tags/labels that characterize them.
- The model will take a title and description as input and predict one or more labels (i.e., multilabel classification).
Getting the Data
- We can use the GitHub REST API to poll the Issues endpoint.
- The Issues endpoint returns a list of JSON objects.
- Each JSON object includes whether it is open or closed, who opened the issue, the title, the body, and the labels.
- The GitHub REST API treats pull requests as issues.
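A minimal sketch of a single request to the Issues endpoint before looking at the full download function below; the URL and fields follow the GitHub REST API, and the repository and page size are just examples:
import requests

url = "https://api.github.com/repos/huggingface/transformers/issues"
response = requests.get(url, params={"state": "all", "per_page": 5, "page": 1})
for issue in response.json():
    # Each issue is a JSON object with state, title, body, labels, etc.
    print(issue["number"], issue["state"], issue["title"][:40],
          [label["name"] for label in issue["labels"]])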
import time
import math
import requests
from pathlib import Path
import pandas as pd
from tqdm.auto import tqdm
Define a function to download issues for a GitHub project to a .jsonl file
- We need to download the issues in batches to avoid exceeding GitHub’s limit on the number of requests per hour.
def fetch_issues(owner="huggingface", repo="transformers", num_issues=10_000,
                 rate_limit=5_000):
    batch = []
    all_issues = []
    # Max number of issues we can request per page
    per_page = 100
    # Number of requests we need to make
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"

    for page in tqdm(range(num_pages)):
        # Query with state=all to get both open and closed issues
        query = f"issues?page={page}&per_page={per_page}&state=all"
        # Sample: https://api.github.com/repos/huggingface/transformers/issues?page=0&per_page=100&state=all
        issues = requests.get(f"{base_url}/{owner}/{repo}/{query}")
        batch.extend(issues.json())

        if len(batch) > rate_limit and len(all_issues) < num_issues:
            all_issues.extend(batch)
            batch = []  # Flush batch for next time period
            print(f"Reached GitHub rate limit. Sleeping for one hour ...")
            time.sleep(60 * 60 + 1)

    all_issues.extend(batch)
    df = pd.DataFrame.from_records(all_issues)
    df.to_json(f"github-issues-{repo}.jsonl", orient="records", lines=True)
Note: It takes a while to fetch all the issues.
Download the GitHub Issues
# fetch_issues()
Preparing the Data
import pandas as pd
pd.set_option('max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.__version__
'1.4.2'
Import the dataset
= "https://git.io/nlp-with-transformers"
dataset_url = pd.read_json(dataset_url, lines=True)
df_issues print(f"DataFrame shape: {df_issues.shape}")
DataFrame shape: (9930, 26)
Inspect a single GitHub issue
# Convert Series to DataFrame
df_issues.loc[2].to_frame()
2 | |
---|---|
url | https://api.github.com/repos/huggingface/transformers/issues/11044 |
repository_url | https://api.github.com/repos/huggingface/transformers |
labels_url | https://api.github.com/repos/huggingface/transformers/issues/11044/labels{/name} |
comments_url | https://api.github.com/repos/huggingface/transformers/issues/11044/comments |
events_url | https://api.github.com/repos/huggingface/transformers/issues/11044/events |
html_url | https://github.com/huggingface/transformers/issues/11044 |
id | 849529761 |
node_id | MDU6SXNzdWU4NDk1Mjk3NjE= |
number | 11044 |
title | [DeepSpeed] ZeRO stage 3 integration: getting started and issues |
user | {‘login’: ‘stas00’, ‘id’: 10676103, ‘node_id’: ‘MDQ6VXNlcjEwNjc2MTAz’, ‘avatar_url’: ‘https://avatars.githubusercontent.com/u/10676103?v=4’, ‘gravatar_id’: ’‘, ’url’: ‘https://api.github.com/users/stas00’, ‘html_url’: ‘https://github.com/stas00’, ‘followers_url’: ‘https://api.github.com/users/stas00/followers’, ‘following_url’: ‘https://api.github.com/users/stas00/following{/other_user}’, ‘gists_url’: ‘https://api.github.com/users/stas00/gists{/gist_id}’, ‘starred_url’: ‘https://api.github.com/users/stas00/starred{/owner}{/repo}’, ‘subscriptions_url’: ‘https://api.github.com/users/stas00/subscriptions’, ‘organizations_url’: ‘https://api.github.com/users/stas00/orgs’, ‘repos_url’: ‘https://api.github.com/users/stas00/repos’, ‘events_url’: ‘https://api.github.com/users/stas00/events{/privacy}’, ‘received_events_url’: ‘https://api.github.com/users/stas00/received_events’, ‘type’: ‘User’, ‘site_admin’: False} |
labels | [{‘id’: 2659267025, ‘node_id’: ‘MDU6TGFiZWwyNjU5MjY3MDI1’, ‘url’: ‘https://api.github.com/repos/huggingface/transformers/labels/DeepSpeed’, ‘name’: ‘DeepSpeed’, ‘color’: ‘4D34F7’, ‘default’: False, ‘description’: ’’}] |
state | open |
locked | False |
assignee | {‘login’: ‘stas00’, ‘id’: 10676103, ‘node_id’: ‘MDQ6VXNlcjEwNjc2MTAz’, ‘avatar_url’: ‘https://avatars.githubusercontent.com/u/10676103?v=4’, ‘gravatar_id’: ’‘, ’url’: ‘https://api.github.com/users/stas00’, ‘html_url’: ‘https://github.com/stas00’, ‘followers_url’: ‘https://api.github.com/users/stas00/followers’, ‘following_url’: ‘https://api.github.com/users/stas00/following{/other_user}’, ‘gists_url’: ‘https://api.github.com/users/stas00/gists{/gist_id}’, ‘starred_url’: ‘https://api.github.com/users/stas00/starred{/owner}{/repo}’, ‘subscriptions_url’: ‘https://api.github.com/users/stas00/subscriptions’, ‘organizations_url’: ‘https://api.github.com/users/stas00/orgs’, ‘repos_url’: ‘https://api.github.com/users/stas00/repos’, ‘events_url’: ‘https://api.github.com/users/stas00/events{/privacy}’, ‘received_events_url’: ‘https://api.github.com/users/stas00/received_events’, ‘type’: ‘User’, ‘site_admin’: False} |
assignees | [{‘login’: ‘stas00’, ‘id’: 10676103, ‘node_id’: ‘MDQ6VXNlcjEwNjc2MTAz’, ‘avatar_url’: ‘https://avatars.githubusercontent.com/u/10676103?v=4’, ‘gravatar_id’: ’‘, ’url’: ‘https://api.github.com/users/stas00’, ‘html_url’: ‘https://github.com/stas00’, ‘followers_url’: ‘https://api.github.com/users/stas00/followers’, ‘following_url’: ‘https://api.github.com/users/stas00/following{/other_user}’, ‘gists_url’: ‘https://api.github.com/users/stas00/gists{/gist_id}’, ‘starred_url’: ‘https://api.github.com/users/stas00/starred{/owner}{/repo}’, ‘subscriptions_url’: ‘https://api.github.com/users/stas00/subscriptions’, ‘organizations_url’: ‘https://api.github.com/users/stas00/orgs’, ‘repos_url’: ‘https://api.github.com/users/stas00/repos’, ‘events_url’: ‘https://api.github.com/users/stas00/events{/privacy}’, ‘received_events_url’: ‘https://api.github.com/users/stas00/received_events’, ‘type’: ‘User’, ‘site_admin’: False}] |
milestone | NaN |
comments | 0 |
created_at | 2021-04-02 23:40:42 |
updated_at | 2021-04-03 00:00:18 |
closed_at | NaT |
author_association | COLLABORATOR |
active_lock_reason | None |
body |
[This is not yet alive, preparing for the release, so please ignore for now]DeepSpeed ZeRO-3 has been integrated into HF transformers . I tried to write tests for a wide range of situations I’m sure I’ve missed some scenarios so if you run into any problems please file a separate issue. I’m going to use this issue to track progress on individual ZeRO3 issues.# Why would you want ZeRO-3a few words, while ZeRO-2 was very limited scability-wise - if model.half() couldn’t fit onto a single gpu, adding more gpus won’t have helped so if you had a 24GB GPU you couldn’t train a model larger than about 5B params.with ZeRO-3 the model weights are partitioned across multiple GPUs plus offloaded to CPU, the upper limit on model size has increased by about 2 orders of magnitude. That is ZeRO-3 allows you to scale to huge models with Trillions of parameters assuming you have enough GPUs and general RAM to support this. ZeRO-3 can benefit a lot from general RAM if you have it. If not that’s OK too. ZeRO-3 combines all your GPUs memory and general RAM into a vast pool of memory.you don’t have many GPUs but just a single one but have a lot of general RAM ZeRO-3 will allow you to fit larger models.course, if you run in an environment like the free google colab, while you can use run Deepspeed there, you get so little general RAM it’s very hard to make something out of nothing. Some users (or some sessions) one gets 12GB of RAM which is impossible to work with - you want at least 24GB instances. Setting is up might be tricky too, please see this notebook for an example:://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb# Getting startedthe latest deepspeed version:\r\npip install deepspeed\r\n will want to be on a transformers master branch, if you want to run a quick test:\r\ngit clone https://github.com/huggingface/transformers\r\ncd transformers\r\nBS=4; PYTHONPATH=src USE_TF=0 deepspeed examples/seq2seq/run_translation.py \\r\n--model_name_or_path t5-small --output_dir /tmp/zero3 --overwrite_output_dir --max_train_samples 64 \\r\n--max_val_samples 64 --max_source_length 128 --max_target_length 128 --val_max_target_length 128 \\r\n--do_train --num_train_epochs 1 --per_device_train_batch_size $BS --per_device_eval_batch_size $BS \\r\n--learning_rate 3e-3 --warmup_steps 500 --predict_with_generate --logging_steps 0 --save_steps 0 \\r\n--eval_steps 1 --group_by_length --adafactor --dataset_name wmt16 --dataset_config ro-en --source_lang en \\r\n--target_lang ro --source_prefix "translate English to Romanian: " \\r\n--deepspeed examples/tests/deepspeed/ds_config_zero3.json\r\n will find a very detailed configuration here: https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeednew config file will look like this:json\r\n{\r\n "fp16": {\r\n "enabled": true,\r\n "loss_scale": 0,\r\n "loss_scale_window": 1000,\r\n "initial_scale_power": 16,\r\n "hysteresis": 2,\r\n "min_loss_scale": 1\r\n },\r\n\r\n "zero_optimization": {\r\n "stage": 3,\r\n "cpu_offload": true,\r\n "cpu_offload_params": true,\r\n "cpu_offload_use_pin_memory" : true,\r\n "overlap_comm": true,\r\n "contiguous_gradients": true,\r\n "stage3_max_live_parameters": 1e9,\r\n "stage3_max_reuse_distance": 1e9,\r\n "stage3_prefetch_bucket_size": 0.94e6,\r\n "stage3_param_persistence_threshold": 1e4,\r\n "reduce_bucket_size": 1e6,\r\n "prefetch_bucket_size": 3e6,\r\n "sub_group_size": 1e14,\r\n "stage3_gather_fp16_weights_on_model_save": true\r\n },\r\n\r\n "optimizer": {\r\n "type": 
"AdamW",\r\n "params": {\r\n "lr": 3e-5,\r\n "betas": [0.8, 0.999],\r\n "eps": 1e-8,\r\n "weight_decay": 3e-7\r\n }\r\n },\r\n\r\n "scheduler": {\r\n "type": "WarmupLR",\r\n "params": {\r\n "warmup_min_lr": 0,\r\n "warmup_max_lr": 3e-5,\r\n "warmup_num_steps": 500\r\n }\r\n },\r\n\r\n "steps_per_print": 2000,\r\n "wall_clock_breakdown": false\r\n}\r\n\r\n if you were already using ZeRO-2 it’s only the zero_optimization stage that has changed.of the biggest nuances of ZeRO-3 is that the model weights aren’t inside model.state_dict , as they are spread out through multiple gpus. The Trainer has been modified to support this but you will notice a slow model saving - as it has to consolidate weights from all the gpus. I’m planning to do more performance improvements in the future PRs, but for now let’s focus on making things work.# Issues / Questionsyou have any general questions or something is unclear/missing in the docs please don’t hesitate to ask in this thread. But for any bugs or problems please open a new Issue and tag me there. You don’t need to tag anybody else. Thank you!
|
performed_via_github_app | NaN |
pull_request | None |
Note: The labels column contains the tags.
Inspect the labels column
pd.DataFrame(df_issues.loc[2]['labels'])
id | node_id | url | name | color | default | description | |
---|---|---|---|---|---|---|---|
0 | 2659267025 | MDU6TGFiZWwyNjU5MjY3MDI1 | https://api.github.com/repos/huggingface/transformers/labels/DeepSpeed | DeepSpeed | 4D34F7 | False |
Extract the tag names from the labels column
"labels"] = (df_issues["labels"].apply(lambda x: [meta["name"] for meta in x])) df_issues[
"labels"]].head() df_issues[[
labels | |
---|---|
0 | [] |
1 | [] |
2 | [DeepSpeed] |
3 | [] |
4 | [] |
Get the number of labels per issue
"labels"].apply(lambda x : len(x)).value_counts().to_frame().T df_issues[
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
labels | 6440 | 3057 | 305 | 100 | 25 | 3 |
Note: Most GitHub issues have zero or one label, and very few have more than one label.
View the only three issues with five tags
df_issues[df_issues['labels'].apply(lambda x: len(x) == 5)].T
6005 | 7541 | 8266 | |
---|---|---|---|
url | https://api.github.com/repos/huggingface/transformers/issues/5057 | https://api.github.com/repos/huggingface/transformers/issues/3513 | https://api.github.com/repos/huggingface/transformers/issues/2787 |
repository_url | https://api.github.com/repos/huggingface/transformers | https://api.github.com/repos/huggingface/transformers | https://api.github.com/repos/huggingface/transformers |
labels_url | https://api.github.com/repos/huggingface/transformers/issues/5057/labels{/name} | https://api.github.com/repos/huggingface/transformers/issues/3513/labels{/name} | https://api.github.com/repos/huggingface/transformers/issues/2787/labels{/name} |
comments_url | https://api.github.com/repos/huggingface/transformers/issues/5057/comments | https://api.github.com/repos/huggingface/transformers/issues/3513/comments | https://api.github.com/repos/huggingface/transformers/issues/2787/comments |
events_url | https://api.github.com/repos/huggingface/transformers/issues/5057/events | https://api.github.com/repos/huggingface/transformers/issues/3513/events | https://api.github.com/repos/huggingface/transformers/issues/2787/events |
html_url | https://github.com/huggingface/transformers/issues/5057 | https://github.com/huggingface/transformers/issues/3513 | https://github.com/huggingface/transformers/issues/2787 |
id | 639635502 | 589781536 | 562124488 |
node_id | MDU6SXNzdWU2Mzk2MzU1MDI= | MDU6SXNzdWU1ODk3ODE1MzY= | MDU6SXNzdWU1NjIxMjQ0ODg= |
number | 5057 | 3513 | 2787 |
title | Examples tests improvements | Adding mbart-large-cc25 | Distillation code loss functions |
user | {‘login’: ‘sshleifer’, ‘id’: 6045025, ‘node_id’: ‘MDQ6VXNlcjYwNDUwMjU=’, ‘avatar_url’: ‘https://avatars.githubusercontent.com/u/6045025?v=4’, ‘gravatar_id’: ’‘, ’url’: ‘https://api.github.com/users/sshleifer’, ‘html_url’: ‘https://github.com/sshleifer’, ‘followers_url’: ‘https://api.github.com/users/sshleifer/followers’, ‘following_url’: ‘https://api.github.com/users/sshleifer/following{/other_user}’, ‘gists_url’: ‘https://api.github.com/users/sshleifer/gists{/gist_id}’, ‘starred_url’: ‘https://api.github.com/users/sshleifer/starred{/owner}{/repo}’, ‘subscriptions_url’: ‘https://api.github.com/users/sshleifer/subscriptions’, ‘organizations_url’: ‘https://api.github.com/users/sshleifer/orgs’, ‘repos_url’: ‘https://api.github.com/users/sshleifer/repos’, ‘events_url’: ‘https://api.github.com/users/sshleifer/events{/privacy}’, ‘received_events_url’: ‘https://api.github.com/users/sshleifer/received_events’, ‘type’: ‘User’, ‘site_admin’: False} | {‘login’: ‘maksym-del’, ‘id’: 8141935, ‘node_id’: ‘MDQ6VXNlcjgxNDE5MzU=’, ‘avatar_url’: ‘https://avatars.githubusercontent.com/u/8141935?v=4’, ‘gravatar_id’: ’‘, ’url’: ‘https://api.github.com/users/maksym-del’, ‘html_url’: ‘https://github.com/maksym-del’, ‘followers_url’: ‘https://api.github.com/users/maksym-del/followers’, ‘following_url’: ‘https://api.github.com/users/maksym-del/following{/other_user}’, ‘gists_url’: ‘https://api.github.com/users/maksym-del/gists{/gist_id}’, ‘starred_url’: ‘https://api.github.com/users/maksym-del/starred{/owner}{/repo}’, ‘subscriptions_url’: ‘https://api.github.com/users/maksym-del/subscriptions’, ‘organizations_url’: ‘https://api.github.com/users/maksym-del/orgs’, ‘repos_url’: ‘https://api.github.com/users/maksym-del/repos’, ‘events_url’: ‘https://api.github.com/users/maksym-del/events{/privacy}’, ‘received_events_url’: ‘https://api.github.com/users/maksym-del/received_events’, ‘type’: ‘User’, ‘site_admin’: False} | {‘login’: ‘snaik2016’, ‘id’: 18183245, ‘node_id’: ‘MDQ6VXNlcjE4MTgzMjQ1’, ‘avatar_url’: ‘https://avatars.githubusercontent.com/u/18183245?v=4’, ‘gravatar_id’: ’‘, ’url’: ‘https://api.github.com/users/snaik2016’, ‘html_url’: ‘https://github.com/snaik2016’, ‘followers_url’: ‘https://api.github.com/users/snaik2016/followers’, ‘following_url’: ‘https://api.github.com/users/snaik2016/following{/other_user}’, ‘gists_url’: ‘https://api.github.com/users/snaik2016/gists{/gist_id}’, ‘starred_url’: ‘https://api.github.com/users/snaik2016/starred{/owner}{/repo}’, ‘subscriptions_url’: ‘https://api.github.com/users/snaik2016/subscriptions’, ‘organizations_url’: ‘https://api.github.com/users/snaik2016/orgs’, ‘repos_url’: ‘https://api.github.com/users/snaik2016/repos’, ‘events_url’: ‘https://api.github.com/users/snaik2016/events{/privacy}’, ‘received_events_url’: ‘https://api.github.com/users/snaik2016/received_events’, ‘type’: ‘User’, ‘site_admin’: False} |
labels | [Examples, Good First Issue, Help wanted, cleanup, wontfix] | [Documentation, Help wanted, New model, seq2seq, translation] | [Core: Modeling, Distillation, PyTorch, Usage, wontfix] |
state | closed | closed | closed |
locked | False | False | False |
assignee | {‘login’: ‘sshleifer’, ‘id’: 6045025, ‘node_id’: ‘MDQ6VXNlcjYwNDUwMjU=’, ‘avatar_url’: ‘https://avatars.githubusercontent.com/u/6045025?v=4’, ‘gravatar_id’: ’‘, ’url’: ‘https://api.github.com/users/sshleifer’, ‘html_url’: ‘https://github.com/sshleifer’, ‘followers_url’: ‘https://api.github.com/users/sshleifer/followers’, ‘following_url’: ‘https://api.github.com/users/sshleifer/following{/other_user}’, ‘gists_url’: ‘https://api.github.com/users/sshleifer/gists{/gist_id}’, ‘starred_url’: ‘https://api.github.com/users/sshleifer/starred{/owner}{/repo}’, ‘subscriptions_url’: ‘https://api.github.com/users/sshleifer/subscriptions’, ‘organizations_url’: ‘https://api.github.com/users/sshleifer/orgs’, ‘repos_url’: ‘https://api.github.com/users/sshleifer/repos’, ‘events_url’: ‘https://api.github.com/users/sshleifer/events{/privacy}’, ‘received_events_url’: ‘https://api.github.com/users/sshleifer/received_events’, ‘type’: ‘User’, ‘site_admin’: False} | {‘login’: ‘sshleifer’, ‘id’: 6045025, ‘node_id’: ‘MDQ6VXNlcjYwNDUwMjU=’, ‘avatar_url’: ‘https://avatars.githubusercontent.com/u/6045025?v=4’, ‘gravatar_id’: ’‘, ’url’: ‘https://api.github.com/users/sshleifer’, ‘html_url’: ‘https://github.com/sshleifer’, ‘followers_url’: ‘https://api.github.com/users/sshleifer/followers’, ‘following_url’: ‘https://api.github.com/users/sshleifer/following{/other_user}’, ‘gists_url’: ‘https://api.github.com/users/sshleifer/gists{/gist_id}’, ‘starred_url’: ‘https://api.github.com/users/sshleifer/starred{/owner}{/repo}’, ‘subscriptions_url’: ‘https://api.github.com/users/sshleifer/subscriptions’, ‘organizations_url’: ‘https://api.github.com/users/sshleifer/orgs’, ‘repos_url’: ‘https://api.github.com/users/sshleifer/repos’, ‘events_url’: ‘https://api.github.com/users/sshleifer/events{/privacy}’, ‘received_events_url’: ‘https://api.github.com/users/sshleifer/received_events’, ‘type’: ‘User’, ‘site_admin’: False} | None |
assignees | [{‘login’: ‘sshleifer’, ‘id’: 6045025, ‘node_id’: ‘MDQ6VXNlcjYwNDUwMjU=’, ‘avatar_url’: ‘https://avatars.githubusercontent.com/u/6045025?v=4’, ‘gravatar_id’: ’‘, ’url’: ‘https://api.github.com/users/sshleifer’, ‘html_url’: ‘https://github.com/sshleifer’, ‘followers_url’: ‘https://api.github.com/users/sshleifer/followers’, ‘following_url’: ‘https://api.github.com/users/sshleifer/following{/other_user}’, ‘gists_url’: ‘https://api.github.com/users/sshleifer/gists{/gist_id}’, ‘starred_url’: ‘https://api.github.com/users/sshleifer/starred{/owner}{/repo}’, ‘subscriptions_url’: ‘https://api.github.com/users/sshleifer/subscriptions’, ‘organizations_url’: ‘https://api.github.com/users/sshleifer/orgs’, ‘repos_url’: ‘https://api.github.com/users/sshleifer/repos’, ‘events_url’: ‘https://api.github.com/users/sshleifer/events{/privacy}’, ‘received_events_url’: ‘https://api.github.com/users/sshleifer/received_events’, ‘type’: ‘User’, ‘site_admin’: False}] | [{‘login’: ‘sshleifer’, ‘id’: 6045025, ‘node_id’: ‘MDQ6VXNlcjYwNDUwMjU=’, ‘avatar_url’: ‘https://avatars.githubusercontent.com/u/6045025?v=4’, ‘gravatar_id’: ’‘, ’url’: ‘https://api.github.com/users/sshleifer’, ‘html_url’: ‘https://github.com/sshleifer’, ‘followers_url’: ‘https://api.github.com/users/sshleifer/followers’, ‘following_url’: ‘https://api.github.com/users/sshleifer/following{/other_user}’, ‘gists_url’: ‘https://api.github.com/users/sshleifer/gists{/gist_id}’, ‘starred_url’: ‘https://api.github.com/users/sshleifer/starred{/owner}{/repo}’, ‘subscriptions_url’: ‘https://api.github.com/users/sshleifer/subscriptions’, ‘organizations_url’: ‘https://api.github.com/users/sshleifer/orgs’, ‘repos_url’: ‘https://api.github.com/users/sshleifer/repos’, ‘events_url’: ‘https://api.github.com/users/sshleifer/events{/privacy}’, ‘received_events_url’: ‘https://api.github.com/users/sshleifer/received_events’, ‘type’: ‘User’, ‘site_admin’: False}] | [] |
milestone | NaN | NaN | NaN |
comments | 12 | 8 | 4 |
created_at | 2020-06-16 12:45:32 | 2020-03-29 12:32:30 | 2020-02-09 05:21:33 |
updated_at | 2020-10-04 01:14:08 | 2020-07-07 17:23:01 | 2020-04-19 22:29:10 |
closed_at | 2020-10-04 01:14:08 | 2020-07-07 17:23:01 | 2020-04-19 22:29:10 |
author_association | MEMBER | CONTRIBUTOR | NONE |
active_lock_reason | None | None | None |
body |
There are a few things about the examples/ tests that are suboptimal:. They never use cuda or fp16, even if they are available.. The @slow decorator used in the main tests is not importable, so there are no @slow tests.. test_run_glue uses distilbert-case-cased. It should use a smaller model, one of the tiny family here or a new tiny model.. There is no test coverage for TPU.help on any of these fronts would be much appreciated!
|
# 🌟 New model additionBART model implemented in fairseq introduced by FAIR## Model descriptionissue is to request adding mBART model existing as a part of fairseq lib. (https://github.com/pytorch/fairseq/tree/master/examples/mbart)(https://arxiv.org/abs/2001.08210)pretrained BART checkpoint.<!– Important information –>model code follows the original BART model code which is already a part of transformers repo. However, it introduces a couple more features like multilingual denoising and translation from pretrained BART. ## Open source status- [x] the model implementation is available: (give details)(https://github.com/pytorch/fairseq/commit/5e79322b3a4a9e9a11525377d3dda7ac520b921c) PR shows the main pieces that were added to the fairseq to make mBART work considering BART which is already existing in the codebase. However, a few additional mBART commits were added afterward.- [x] the model weights are available: (give details)(https://github.com/pytorch/fairseq/tree/master/examples/mbart#pre-trained-models)- [x] who are the authors: (mention them, if possible by @gh-username)AI Research (@MultiPath)
|
# ❓ Questions & Helpcompute cross entropy loss from the hard labels in distillation code?self.alpha_clm > 0.0:shift_logits = s_logits[…, :-1, :].contiguous()shift_labels = lm_labels[…, 1:].contiguous()loss_clm = self.lm_loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))loss += self.alpha_clm * loss_clmmodel outputs loss when passed with the labels.<!– The GitHub issue tracker is primarly intended for bugs, feature requests,new models and benchmarks, and migration questions. For all other questions,we direct you to Stack Overflow (SO) where a whole community of PyTorch andTensorflow enthusiast can help you out. Make sure to tag your question with theright deep learning framework as well as the huggingface-transformers tag: https://stackoverflow.com/questions/tagged/huggingface-transformers If your question wasn’t answered after a period of time on Stack Overflow, youcan always open a question on GitHub. You should then link to the SO question that you posted.–>## Details<!– Description of your issue –><!– You should first ask your question on SO, and only ifyou didn’t get an answer ask it here on GitHub. –>*A link to original question on Stack Overflow**: |
performed_via_github_app | NaN | NaN | NaN |
pull_request | None | None | None |
pandas.DataFrame.explode
- Documentation
- Transform each element of a list-like into a row, replicating index values.
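A toy illustration (made-up data) of how explode turns each list entry into its own row so that value_counts can tally individual labels:
toy = pd.DataFrame({"labels": [["bug", "docs"], [], ["bug"]]})
print(toy["labels"].explode().value_counts())
# bug     2
# docs    1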
Get the 20 most frequent labels in the dataset
= df_issues["labels"].explode().value_counts()
df_counts print(f"Number of labels: {len(df_counts)}")
# Display the top-20 label categories
20) df_counts.to_frame().head(
Number of labels: 65
labels | |
---|---|
wontfix | 2284 |
model card | 649 |
Core: Tokenization | 106 |
New model | 98 |
Core: Modeling | 64 |
Help wanted | 52 |
Good First Issue | 50 |
Usage | 46 |
Core: Pipeline | 42 |
Feature request | 41 |
TensorFlow | 41 |
Tests | 40 |
PyTorch | 37 |
DeepSpeed | 33 |
seq2seq | 32 |
Should Fix | 30 |
marian | 29 |
Discussion | 28 |
Documentation | 28 |
Examples | 24 |
df_counts[:2].sum() / df_counts.sum()
0.7185203331700147
Note: * There are 65 unique tags (i.e., classes) in the dataset. * The dataset is highly imbalanced, with the two most common classes accounting for more than 70% of the dataset. * Some labels (e.g., “Good First Issue” or “Help wanted”) are potentially too difficult to predict from the issue’s description alone, while others (e.g., “model card”) might only require simple rules to classify.
Filter the dataset to a subset of labels
= {"Core: Tokenization": "tokenization",
label_map "New model": "new model",
"Core: Modeling": "model training",
"Usage": "usage",
"Core: Pipeline": "pipeline",
"TensorFlow": "tensorflow or tf",
"PyTorch": "pytorch",
"Examples": "examples",
"Documentation": "documentation"}
def filter_labels(x):
return [label_map[label] for label in x if label in label_map]
"labels"] = df_issues["labels"].apply(filter_labels)
df_issues[= list(label_map.values()) all_labels
Check the distribution of the filtered dataset
df_counts = df_issues["labels"].explode().value_counts()
df_counts.to_frame().T
tokenization | new model | model training | usage | pipeline | tensorflow or tf | pytorch | documentation | examples | |
---|---|---|---|---|---|---|---|---|---|
labels | 106 | 98 | 64 | 46 | 42 | 41 | 37 | 28 | 24 |
df_counts[:2].sum() / df_counts.sum()
0.41975308641975306
Note: The filtered dataset is more balanced, with the two most common classes accounting for less than 42% of the dataset.
Create a new column to indicate whether an issue is unlabeled
"split"] = "unlabeled"
df_issues[= df_issues["labels"].apply(lambda x: len(x)) > 0
mask "split"] = "labeled"
df_issues.loc[mask, "split"].value_counts().to_frame() df_issues[
split | |
---|---|
unlabeled | 9489 |
labeled | 441 |
"split"].value_counts()[0] / len(df_issues) df_issues[
0.9555891238670695
Note: Over 95% of issues are unlabeled.
Inspect a labeled example
for column in ["title", "body", "labels"]:
    print(f"{column}: {df_issues[column].iloc[26][:500]}\n")
title: Add new CANINE model
body: # 🌟 New model addition
## Model description
Google recently proposed a new **C**haracter **A**rchitecture with **N**o tokenization **I**n **N**eural **E**ncoders architecture (CANINE). Not only the title is exciting:
> Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually en
labels: ['new model']
Note: * This GitHub issue is proposing a new model architecture. * Both the title and description contain helpful information for the label classifier.
Concatenate the title and description for each issue into a new column
"text"] = (df_issues.apply(lambda x: x["title"] + "\n\n" + x["body"], axis=1)) df_issues[
Remove any duplicate rows based on the text column values
len_before = len(df_issues)
df_issues = df_issues.drop_duplicates(subset="text")
print(f"Removed {(len_before-len(df_issues))/len_before:.2%} duplicates.")
Removed 1.88% duplicates.
import numpy as np
import matplotlib.pyplot as plt
Plot the number of words per issue
"text"].str.split().apply(len).hist(bins=np.linspace(0, 750, 50), grid=False, edgecolor="C0"))
(df_issues["Words per issue")
plt.title("Number of words")
plt.xlabel("Number of issues")
plt.ylabel( plt.show()
Note: * The text for most issues is short, but some have more than 500 words. * Issues with error messages and code snippets are often longer. * Most of the examples should fit into the typical context size of 512 tokens.
Creating Training Sets
- There is no guaranteed balance for all labels when splitting the dataset.
- We can use the scikit-multilearn library to approximate a balanced split.
scikit-multilearn library
- Homepage
- A multi-label classification library built on top of the scikit-learn ecosystem.
from sklearn.preprocessing import MultiLabelBinarizer
MultiLabelBinarizer
- Documentation
- Transform between iterable of iterables and a multilabel format.
- Takes a list of names and creates a vector with zeros for absent labels and ones for present labels.
Create a MultiLabelBinarizer to learn the mapping from label to ID
mlb = MultiLabelBinarizer()
mlb.fit([all_labels])
MultiLabelBinarizer()
Check the label mappings
"tokenization", "new model"], ["pytorch"]]) mlb.transform([[
array([[0, 0, 0, 1, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0]])
pd.DataFrame(mlb.transform([[label] for label in all_labels]).T, columns=all_labels).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|---|
tokenization | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
new model | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
model training | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
usage | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
pipeline | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
tensorflow or tf | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
pytorch | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
examples | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
documentation | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
from skmultilearn.model_selection import iterative_train_test_split
iterative_train_test_split
Define a function to iteratively generate a balanced train/test split
def balanced_split(df, test_size=0.5):
    ind = np.expand_dims(np.arange(len(df)), axis=1)
    labels = mlb.transform(df["labels"])
    ind_train, _, ind_test, _ = iterative_train_test_split(ind, labels,
                                                           test_size)
    return df.iloc[ind_train[:, 0]], df.iloc[ind_test[:, 0]]
from sklearn.model_selection import train_test_split
Split the data into supervised and unsupervised datasets
= df_issues[["text", "labels", "split"]].reset_index(drop=True).copy()
df_clean = df_clean.loc[df_clean["split"] == "unlabeled", ["text", "labels"]]
df_unsup = df_clean.loc[df_clean["split"] == "labeled", ["text", "labels"]] df_sup
Create balanced training, validation, and test sets
np.random.seed(0)
df_train, df_tmp = balanced_split(df_sup, test_size=0.5)
df_valid, df_test = balanced_split(df_tmp, test_size=0.5)
from datasets import Dataset, DatasetDict
Dataset.from_pandas
- Documentation
- Convert pandas.DataFrame to a pyarrow.Table to create a Dataset.
print_source(Dataset.from_pandas)
@classmethod
def from_pandas(cls, df: pd.DataFrame, features: Optional[Features]=None,
info: Optional[DatasetInfo]=None, split: Optional[NamedSplit]=None,
preserve_index: Optional[bool]=None) ->'Dataset':
if info is not None and features is not None and info.features != features:
raise ValueError(
f"""Features specified in `features` and `info.features` can't be different:
{features}
{info.features}"""
)
features = (features if features is not None else info.features if info
is not None else None)
if info is None:
info = DatasetInfo()
info.features = features
table = InMemoryTable.from_pandas(df=df, preserve_index=preserve_index,
schema=features.arrow_schema if features is not None else None)
return cls(table, info=info, split=split)
Initialize a DatasetDict with the dataset splits
ds = DatasetDict({
    "train": Dataset.from_pandas(df_train.reset_index(drop=True)),
    "valid": Dataset.from_pandas(df_valid.reset_index(drop=True)),
    "test": Dataset.from_pandas(df_test.reset_index(drop=True)),
    "unsup": Dataset.from_pandas(df_unsup.reset_index(drop=True))})
ds
DatasetDict({
train: Dataset({
features: ['text', 'labels'],
num_rows: 223
})
valid: Dataset({
features: ['text', 'labels'],
num_rows: 106
})
test: Dataset({
features: ['text', 'labels'],
num_rows: 111
})
unsup: Dataset({
features: ['text', 'labels'],
num_rows: 9303
})
})
Creating Training Slices
Create training slices with different numbers of samples
np.random.seed(0)
all_indices = np.expand_dims(list(range(len(ds["train"]))), axis=1)
indices_pool = all_indices
labels = mlb.transform(ds["train"]["labels"])
train_samples = [8, 16, 32, 64, 128]
train_slices, last_k = [], 0

for i, k in enumerate(train_samples):
    # Split off samples necessary to fill the gap to the next split size
    indices_pool, labels, new_slice, _ = iterative_train_test_split(
        indices_pool, labels, (k-last_k)/len(labels))
    last_k = k
    if i==0: train_slices.append(new_slice)
    else: train_slices.append(np.concatenate((train_slices[-1], new_slice)))

# Add full dataset as last slice
train_slices.append(all_indices), train_samples.append(len(ds["train"]))
train_slices = [np.squeeze(train_slice) for train_slice in train_slices]
Note: It is not always possible to find a balanced split with a given split size.
print("Target split sizes:")
print(train_samples)
print("Actual split sizes:")
print([len(x) for x in train_slices])
Target split sizes:
[8, 16, 32, 64, 128, 223]
Actual split sizes:
[10, 19, 36, 68, 134, 223]
Implementing a Naive Bayesline
- A baseline based on regular expressions, handcrafted rules, or a simple model might work well enough to solve a given problem.
- These are generally easier to deploy and maintain than transformer models.
- Baseline models provide quick sanity checks when exploring more complex models.
- A more complex model like BERT should perform better than a simple logistic regression classifier on the same dataset.
- A Naive Bayes Classifier is a great baseline model for text classification as it is simple, quick to train, and reasonably robust to changes in input.
Create a new label_ids column with the multilabel vectors for each training sample
def prepare_labels(batch):
    batch["label_ids"] = mlb.transform(batch["labels"])
    return batch

ds = ds.map(prepare_labels, batched=True)
ds['train'][:5]['labels'], ds['train'][:5]['label_ids']
([['new model'], ['new model'], ['new model'], ['new model'], ['examples']],
[[0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0]])
from collections import defaultdict
Create dictionaries to store micro and macro \(F_{1}\)-scores
- The micro \(F_{1}\)-score tracks performance on the frequent labels
- The macro \(F_{1}\)-score tracks performance on all the labels regardless of frequency
macro_scores, micro_scores = defaultdict(list), defaultdict(list)
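A toy example (made-up labels) to make the difference concrete: a frequent label that is always predicted correctly dominates the micro average, while the missed rare label pulls the macro average down.
from sklearn.metrics import f1_score

y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]  # label 0 is frequent, label 1 is rare
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]  # the rare label is never predicted
print(f1_score(y_true, y_pred, average="micro", zero_division=0))  # ~0.86
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # 0.5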
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.feature_extraction.text import CountVectorizer
sklearn.naive_bayes.MultinomialNB
- Documentation
- Create a Naive Bayes classifier for multinomial models.
sklearn.metrics._classification.classification_report
- Documentation
- Build a text report showing the main classification metrics.
skmultilearn.problem_transform.br.BinaryRelevance
- Documentation
- Treat each label as a separate single-class classification problem
sklearn.feature_extraction.text.CountVectorizer
- Documentation
- Create a vector where each entry corresponds to the frequency with which a token appeared in the text.
- Count vectorization is a bag-of-words approach since all information on the order of the words is lost.
# count_vect = CountVectorizer()
# pd.DataFrame(count_vect.fit_transform(ds['train'].select(train_slices[0])["text"]).toarray())
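A small illustration on made-up sentences of the vectors this produces, using the CountVectorizer imported above:
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(["the cat sat", "the cat sat on the mat"])
print(toy_vect.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(toy_counts.toarray())              # [[1 0 0 1 1] [1 1 1 1 2]]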
Train a baseline Naive Bayes Classifier for each of the training slices
for train_slice in train_slices:
    # Get training slice and test data
    ds_train_sample = ds["train"].select(train_slice)
    y_train = np.array(ds_train_sample["label_ids"])
    y_test = np.array(ds["test"]["label_ids"])
    # Use a simple count vectorizer to encode our texts as token counts
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(ds_train_sample["text"])
    X_test_counts = count_vect.transform(ds["test"]["text"])
    # Create and train our model!
    classifier = BinaryRelevance(classifier=MultinomialNB())
    classifier.fit(X_train_counts, y_train)
    # Generate predictions and evaluate
    y_pred_test = classifier.predict(X_test_counts)
    clf_report = classification_report(
        y_test, y_pred_test, target_names=mlb.classes_, zero_division=0,
        output_dict=True)
    # Store metrics
    macro_scores["Naive Bayes"].append(clf_report["macro avg"]["f1-score"])
    micro_scores["Naive Bayes"].append(clf_report["micro avg"]["f1-score"])
Plot the performance of the baseline classifiers
def plot_metrics(micro_scores, macro_scores, sample_sizes, current_model):
    fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

    for run in micro_scores.keys():
        if run == current_model:
            ax0.plot(sample_sizes, micro_scores[run], label=run, linewidth=2)
            ax1.plot(sample_sizes, macro_scores[run], label=run, linewidth=2)
        else:
            ax0.plot(sample_sizes, micro_scores[run], label=run,
                     linestyle="dashed")
            ax1.plot(sample_sizes, macro_scores[run], label=run,
                     linestyle="dashed")

    ax0.set_title("Micro F1 scores")
    ax1.set_title("Macro F1 scores")
    ax0.set_ylabel("Test set F1 score")
    ax0.legend(loc="lower right")
    for ax in [ax0, ax1]:
        ax.set_xlabel("Number of training samples")
        ax.set_xscale("log")
        ax.set_xticks(sample_sizes)
        ax.set_xticklabels(sample_sizes)
        ax.minorticks_off()
    plt.tight_layout()
    plt.show()
"Naive Bayes") plot_metrics(micro_scores, macro_scores, train_samples,
Note:
- The number of samples is on a logarithmic scale.
- The micro and macro F1 scores improve as the number of training samples increases.
- The results are slightly noisy since each slice can have a different class distribution.
Working with No Labeled Data
- Zero-shot classification is suitable when there is no labeled data at all.
- Zero-shot classification uses a pretrained model without additional fine-tuning on a task-specific corpus.
Zero-shot Fill-Mask Prediction
The pretrained model needs to be aware of the topic in the context to predict a missing token.
We can trick the model into classifying a document by providing a sentence like: > “This section was about the topic [MASK].”
The model should give a reasonable suggestion for the document’s topic.
Credit: Joe Davison
from transformers import pipeline
Create a Masked Language Modeling pipeline
= pipeline("fill-mask", model="bert-base-uncased")
pipe type(pipe.model), pipe.device pipe,
(<transformers.pipelines.fill_mask.FillMaskPipeline at 0x7fd190bcd3a0>,
transformers.models.bert.modeling_bert.BertForMaskedLM,
device(type='cpu'))
transformers.pipelines.fill_mask.FillMaskPipeline
- Documentation
- Create a masked language modeling prediction pipeline
Predict the topic for a movie about animals based on its description
= "The main characters of the movie madacascar \
movie_desc are a lion, a zebra, a giraffe, and a hippo. "
= "The movie is about [MASK]."
prompt
= pipe(movie_desc + prompt)
output for element in output:
print(f"Token {element['token_str']}:\t{element['score']:.3f}%")
Token animals: 0.103%
Token lions: 0.066%
Token birds: 0.025%
Token love: 0.015%
Token hunting: 0.013%
Note: The model successfully performs zero-shot classification and only predicts tokens related to animals.
Check the probability that the movie description is about specific topics
output = pipe(movie_desc + prompt, targets=["animals", "cars"])
for element in output:
    print(f"Token {element['token_str']}:\t{element['score']:.3f}%")
Token animals: 0.103%
Token cars: 0.001%
Note: The model is confident the movie is not about cars.
Predict the topic for a movie about cars based on its description
= "In the movie transformers aliens \
movie_desc can morph into a wide range of vehicles."
= pipe(movie_desc + prompt)
output for element in output:
print(f"Token {element['token_str']}:\t{element['score']:.3f}%")
Token aliens: 0.221%
Token cars: 0.139%
Token robots: 0.099%
Token vehicles: 0.074%
Token transformers: 0.059%
Check the probability that the movie description is about specific topics
= "In the movie transformers aliens \
movie_desc can morph into a wide range of vehicles."
= pipe(movie_desc + prompt, targets=["animals", "cars"])
output for element in output:
print(f"Token {element['token_str']}:\t{element['score']:.3f}%")
Token cars: 0.139%
Token animals: 0.006%
Text Entailment
- The model determines whether two text passages are likely to follow or contradict each other.
- A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
- XNLI: Evaluating Cross-lingual Sentence Representations
- Typical datasets for text entailment tasks include the Multi-Genre NLI Corpus (MNLI) and the Cross-Lingual NLI Corpus (XNLI).
- Each sample in the dataset contains a premise, a hypothesis, and a label.
- The label can be either `entailment`, `neutral`, or `contradiction`.
  - The `entailment` label indicates the hypothesis text is necessarily correct under the premise.
  - The `contradiction` label indicates the hypothesis is necessarily false or inappropriate under the premise.
  - The `neutral` label indicates the hypothesis is unrelated to the premise.
Premise | Hypothesis | Label |
---|---|---|
His favorite color is blue. | He is into heavy metal | neutral |
She finds the joke hilarious. | She thinks the joke is not funny at all. | contradiction |
The house was recently built. | The house is new. | entailment |
Zero-shot classification with Text Entailment
We can use a model trained on the MNLI dataset to build a classifier without needing any labels.
We treat the input text as a premise and formulate a hypothesis as: > “This example is about {label}.”
We insert the class name for the label.
The resulting entailment score indicates how likely the premise is about the topic.
We need to test different classes sequentially, meaning we need to execute a forward pass for each test.
The choice of label names can significantly impact prediction accuracy.
Choosing labels with a semantic meaning is generally the best approach.
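Before using the pipeline below, here is a minimal sketch of the scoring loop it performs internally, assuming the facebook/bart-large-mnli checkpoint (the pipeline output further down shows a BART MNLI model is loaded by default); the candidate labels and the truncated premise are illustrative placeholders:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nli_name = "facebook/bart-large-mnli"
nli_tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)

premise = "Add new CANINE model ..."  # the issue text would go here
for label in ["new model", "documentation"]:
    hypothesis = f"This example is about {label}."
    inputs = nli_tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits[0]
    # Label order for this checkpoint: contradiction (0), neutral (1), entailment (2).
    # With multi_label=True, the pipeline softmaxes over contradiction vs. entailment only.
    probs = logits[[0, 2]].softmax(dim=-1)
    print(f"{label}: entailment score = {probs[1]:.2f}")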
from transformers import pipeline
Create a zero-shot classification pipeline
# Use GPU if available
= pipeline("zero-shot-classification", device=0, fp16=True)
pipe type(pipe.model), pipe.device pipe,
(<transformers.pipelines.zero_shot_classification.ZeroShotClassificationPipeline at 0x7fd189242d00>, transformers.models.bart.modeling_bart.BartForSequenceClassification, device(type='cuda', index=0))
Get the default hypothesis template
inspect.signature(pipe.preprocess).parameters['hypothesis_template']
<Parameter "hypothesis_template='This example is {}.'">
transformers.pipelines.zero_shot_classification.ZeroShotClassificationPipeline
- Documentation
- Create a Natural-Language-Inference (NLI)-based zero-shot classification pipeline.
- The pipeline takes any combination of sequences and labels.
- The pipeline poses each combination as a premise-hypothesis pair and passes them to the pretrained model.
"train"][0]['text']).to_frame().style.hide(axis='columns').hide(axis='rows') pd.Series(ds[
len(ds["train"][0]['text'].split(' ')), len(pipe.tokenizer(ds["train"][0]['text'])['input_ids'])
(244, 562)
"train"][0]['text'].split(' ')).T.style.hide(axis='columns') pd.DataFrame(ds[
= pipe.tokenizer(ds["train"][0]['text'])['input_ids']
input_ids ='columns') pd.DataFrame(pipe.tokenizer.convert_ids_to_tokens(input_ids)).T.style.hide(axis
Test each possible label using text entailment
= ds["train"][0]
sample print(f"Labels: {sample['labels']}")
= pipe(sample["text"], candidate_labels=all_labels, multi_label=True)
output print(output["sequence"][:400])
print("\nPredictions:")
for label, score in zip(output["labels"], output["scores"]):
print(f"{label}, {score:.2f}")
Labels: ['new model']
Add new CANINE model
# 🌟 New model addition
## Model description
Google recently proposed a new **C**haracter **A**rchitecture with **N**o tokenization **I**n **N**eural **E**ncoders architecture (CANINE). Not only the title is exciting:
> Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokeni
Predictions:
new model, 0.98
tensorflow or tf, 0.37
examples, 0.34
usage, 0.30
pytorch, 0.25
documentation, 0.25
model training, 0.24
tokenization, 0.17
pipeline, 0.16
Note:
- The model is confident the text is about a new model, but it also produces relatively high scores for labels not found in the text.
- The highly technical domain of the text is very different from the original text distribution in the MNLI dataset.
- We can feed input with code to the model since we use a subword tokenizer.
- Tokenization might be inefficient with code since only a tiny fraction of the pretraining dataset contains code snippets.
- Code blocks can contain important information, such as frameworks used in the code.
Define a function to feed a single example through the zero-shot pipeline
def zero_shot_pipeline(example):
    output = pipe(example["text"], all_labels, multi_label=True)
    example["predicted_labels"] = output["labels"]
    example["scores"] = output["scores"]
    return example
Feed the whole validation set through the pipeline
= ds["valid"].map(zero_shot_pipeline) ds_zero_shot
Note: We can determine which labels to assign to each example using a minimum threshold value or selecting the top-k predictions.
Define a function to determine which set of labels to assign to each example using either a threshold value or top-k value
def get_preds(example, threshold=None, topk=None):
    preds = []
    if threshold:
        for label, score in zip(example["predicted_labels"], example["scores"]):
            if score >= threshold:
                preds.append(label)
    elif topk:
        for i in range(topk):
            preds.append(example["predicted_labels"][i])
    else:
        raise ValueError("Set either `threshold` or `topk`.")
    return {"pred_label_ids": list(np.squeeze(mlb.transform([preds])))}
Define a function that returns the scikit-learn classification report
def get_clf_report(ds):
    y_true = np.array(ds["label_ids"])
    y_pred = np.array(ds["pred_label_ids"])
    return classification_report(
        y_true, y_pred, target_names=mlb.classes_, zero_division=0,
        output_dict=True)
Test using top-k values to select labels
macros, micros = [], []
topks = [1, 2, 3, 4]
for topk in topks:
    ds_zero_shot = ds_zero_shot.map(get_preds, batched=False,
                                    fn_kwargs={'topk': topk})
    clf_report = get_clf_report(ds_zero_shot)
    micros.append(clf_report['micro avg']['f1-score'])
    macros.append(clf_report['macro avg']['f1-score'])
plt.plot(topks, micros, label='Micro F1')
plt.plot(topks, macros, label='Macro F1')
plt.xlabel("Top-k")
plt.ylabel("F1-score")
plt.legend(loc='best')
plt.show()
Note: We obtain the best results using only the highest score per example (i.e., top-1), given most examples only have one label.
Test using a threshold value to select labels
macros, micros = [], []
thresholds = np.linspace(0.01, 1, 100)
for threshold in thresholds:
    ds_zero_shot = ds_zero_shot.map(get_preds,
                                    fn_kwargs={"threshold": threshold})
    clf_report = get_clf_report(ds_zero_shot)
    micros.append(clf_report["micro avg"]["f1-score"])
    macros.append(clf_report["macro avg"]["f1-score"])
="Micro F1")
plt.plot(thresholds, micros, label="Macro F1")
plt.plot(thresholds, macros, label"Threshold")
plt.xlabel("F1-score")
plt.ylabel(="best")
plt.legend(loc plt.show()
Note: The threshold approach performs slightly worse than the top-1 approach.
best_t, best_micro = thresholds[np.argmax(micros)], np.max(micros)
print(f'Best threshold (micro): {best_t} with F1-score {best_micro:.2f}.')
best_t, best_macro = thresholds[np.argmax(macros)], np.max(macros)
print(f'Best threshold (micro): {best_t} with F1-score {best_macro:.2f}.')
Best threshold (micro): 0.75 with F1-score 0.46.
Best threshold (micro): 0.72 with F1-score 0.42.
Note: A threshold value of around 0.75 provides the best tradeoff between precision and recall.
Compare the zero-shot classifier to the baseline Naive Bayes model
ds_zero_shot = ds['test'].map(zero_shot_pipeline)
ds_zero_shot = ds_zero_shot.map(get_preds, fn_kwargs={'topk': 1})
clf_report = get_clf_report(ds_zero_shot)
for train_slice in train_slices:
    macro_scores['Zero Shot'].append(clf_report['macro avg']['f1-score'])
    micro_scores['Zero Shot'].append(clf_report['micro avg']['f1-score'])
"Zero Shot") plot_metrics(micro_scores, macro_scores, train_samples,
Note:
- The zero-shot pipeline outperforms the baseline on the micro F1 score when using fewer than about 60 labeled examples, and it outperforms the baseline across all slice sizes on the macro F1 score.
- The baseline model performs better on the more common classes when using more than 60 examples.
- The zero-shot classification pipeline is sensitive to the names of labels and might perform better when using different or several names in parallel and aggregating them.
- Using a different `hypothesis_template` might improve performance.
Working with a Few Labels
- For most NLP projects, there are usually at least a few labeled examples available.
- The labels might come directly from a client or a cross-company team, or from hand-annotating a few examples.
Data Augmentation
- We can use data augmentation to generate new training examples from existing ones.
- Perturbing words or characters can completely change the meaning of a short text.
- Noise introduced by data augmentation is less likely to change the meaning when the text is longer than a few sentences.
Back Translation
- Back translation involves translating the original text into one or more target languages and translating it back to the source language.
- Back translation works best for high-resource languages or corpora that don’t contain too many domain-specific words.
- We can implement back translation using machine translation models like M2M100 (see the sketch below).
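A minimal sketch of back translation with M2M100; the checkpoint name and language pair are illustrative assumptions (the nlpaug example further below uses WMT19 translation models instead):
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

m2m_name = "facebook/m2m100_418M"  # assumed checkpoint
m2m_tokenizer = M2M100Tokenizer.from_pretrained(m2m_name)
m2m_model = M2M100ForConditionalGeneration.from_pretrained(m2m_name)

def translate(text, src_lang, tgt_lang):
    # Set the source language, then force the decoder to start in the target language
    m2m_tokenizer.src_lang = src_lang
    encoded = m2m_tokenizer(text, return_tensors="pt")
    generated = m2m_model.generate(
        **encoded, forced_bos_token_id=m2m_tokenizer.get_lang_id(tgt_lang))
    return m2m_tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

original = "Even if you defeat me Megatron, others will rise to defeat your tyranny"
german = translate(original, "en", "de")   # en -> de
augmented = translate(german, "de", "en")  # de -> en (back translation)
print(augmented)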
Token Perturbations
- Token perturbations involve randomly choosing and performing simple transformations like synonym replacement, word insertion, swap, or deletion.
- Libraries like NlpAug and TextAttack provide various recipes for token perturbations.
- EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
Augmentation | Sentence |
---|---|
None | Even if you defeat me Megatron, others will rise to defeat your tyranny |
Synonym replace | Even if you kill me, Megatron, others will prove to defeat your tyranny |
Random Insert | Even if you defeat me Megatron, others humanity will rise to defeat your tyranny |
Random Swap | You even if defeat me Megatron, others will rise defeat to tyranny your |
Random delete | Even if you me Megatron, others to defeat tyranny |
Back translate (German) | Even if you defeat me, others will rise up to defeat your tyranny |
Disable Tokenizers Parallelism
%env TOKENIZERS_PARALLELISM=false
env: TOKENIZERS_PARALLELISM=false
from transformers import set_seed
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
import nltk
Download a perceptron model for tagging words and the Wordnet corpora
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
True
nltk.download('omw-1.4')
True
!ls ~/nltk_data/corpora/wordnet
adj.exc cntlist.rev data.noun index.adv index.verb noun.exc
adv.exc data.adj data.verb index.noun lexnames README
citation.bib data.adv index.adj index.sense LICENSE verb.exc
!head -5 ~/nltk_data/corpora/wordnet/noun.exc
aardwolves aardwolf
abaci abacus
aboideaux aboideau
aboiteaux aboiteau
abscissae abscissa
Reset random seed
set_seed(3)
Define original text
= "Even if you defeat me Megatron, others will rise to defeat your tyranny" text
Initialize augmentation dictionary
augs = {}
nlpaug.augmenter.word.synonym.SynonymAug
- Documentation
- Create an augmenter that leverages semantic meaning to substitute words.
Add Synonym Replacement Augmentation using the WordNet corpora
"synonym_replace"] = naw.SynonymAug(aug_src='wordnet')
augs["synonym_replace"].augment(text) augs[
'Even if you kill me Megatron, others will prove to defeat your tyranny'
nlpaug.augmenter.word.context_word_embs.ContextualWordEmbsAug
- Documentation
- Create an augmenter that finds the top n similar words using contextual word embeddings
Add Random Insert Augmentation using the contextual word embeddings of DistilBERT
"random_insert"] = naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased",
augs[="cpu", action="insert", aug_max=1)
device"random_insert"].augment(text) augs[
'even if you defeat me megatron, others humanity will rise to defeat your tyranny'
nlpaug.augmenter.word.random.RandomWordAug
- Documentation
- Randomly apply substitute, swap, delete or crop augmentation
- The default augmentation is to delete words.
Randomly swap words
"random_swap"] = naw.RandomWordAug(action="swap")
augs["random_swap"].augment(text) augs[
'You even if defeat me Megatron, others will rise defeat to tyranny your'
Randomly delete words
"random_delete"] = naw.RandomWordAug()
augs["random_delete"].augment(text) augs[
'Even if you me Megatron, others to defeat tyranny'
nlpaug.augmenter.word.back_translation.BackTranslationAug
- Documentation
- Use two translation models to apply back translation
"bt_en_de"] = naw.BackTranslationAug(
augs[='facebook/wmt19-en-de',
from_model_name='facebook/wmt19-de-en'
to_model_name
)"bt_en_de"].augment(text) augs[
'Even if you defeat me, others will rise up to defeat your tyranny'
for k, v in augs.items():
    print(f"Original text: {text}")
    print(f"{k}: {v.augment(text)}")
    print("")
Original text: Even if you defeat me Megatron, others will rise to defeat your tyranny
synonym_replace: Even if you defeat me Megatron, others will go up to vote out your tyranny
Original text: Even if you defeat me Megatron, others will rise to defeat your tyranny
random_insert: even if you defeat me megatron, others will rise to defeat of your tyranny
Original text: Even if you defeat me Megatron, others will rise to defeat your tyranny
random_swap: If even you defeat me Megatron, others will rise to defeat tyranny your
Original text: Even if you defeat me Megatron, others will rise to defeat your tyranny
random_delete: If you me Megatron, will to defeat your tyranny
Original text: Even if you defeat me Megatron, others will rise to defeat your tyranny
bt_en_de: Even if you defeat me, others will rise up to defeat your tyranny
Reset random seed
set_seed(3)
Add Random Synonym Replacement using the contextual word embeddings of DistilBERT
aug = naw.ContextualWordEmbsAug(model_path="distilbert-base-uncased",
                                device="cpu", action="substitute")
= "Transformers are the most popular toys"
text print(f"Original text: {text}")
print(f"Augmented text: {aug.augment(text)}")
Original text: Transformers are the most popular toys
Augmented text: transformers'the most popular toys
Define a function to apply synonym replacement augmentation to a batch
def augment_text(batch, transformations_per_example=1):
    text_aug, label_ids = [], []
    for text, labels in zip(batch["text"], batch["label_ids"]):
        text_aug += [text]
        label_ids += [labels]
        for _ in range(transformations_per_example):
            text_aug += [aug.augment(text)]
            label_ids += [labels]
    return {"text": text_aug, "label_ids": label_ids}
Train the baseline Naive Bayes Classifier using synonym replacement augmentation
for train_slice in train_slices:
    # Get training slice and test data
    ds_train_sample = ds["train"].select(train_slice)
    # Flatten augmentations and align labels!
    ds_train_aug = (ds_train_sample.map(
        augment_text, batched=True, remove_columns=ds_train_sample.column_names)
        .shuffle(seed=42))
    y_train = np.array(ds_train_aug["label_ids"])
    y_test = np.array(ds["test"]["label_ids"])
    # Use a simple count vectorizer to encode our texts as token counts
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(ds_train_aug["text"])
    X_test_counts = count_vect.transform(ds["test"]["text"])
    # Create and train our model!
    classifier = BinaryRelevance(classifier=MultinomialNB())
    classifier.fit(X_train_counts, y_train)
    # Generate predictions and evaluate
    y_pred_test = classifier.predict(X_test_counts)
    clf_report = classification_report(
        y_test, y_pred_test, target_names=mlb.classes_, zero_division=0,
        output_dict=True)
    # Store metrics
    macro_scores["Naive Bayes + Aug"].append(clf_report["macro avg"]["f1-score"])
    micro_scores["Naive Bayes + Aug"].append(clf_report["micro avg"]["f1-score"])
Compare the results
"Naive Bayes + Aug") plot_metrics(micro_scores, macro_scores, train_samples,
Note:
- A small amount of data augmentation improves the F1 score of the Naive Bayes Classifier.
- The Naive Bayes Classifier overtakes the zero-shot pipeline for the macro F1 score at around 220 training samples.
Using Embeddings as a Lookup Table
- Large language models like GPT-3 are excellent at solving tasks with limited data because they learn representations of text that encode information across many dimensions.
- We can use the embeddings of large language models to develop a semantic search engine, find similar documents or comments, or classify text.
- This approach does not require fine-tuning models to leverage the few labeled data points.
- The embeddings should ideally come from a model pretrained on a domain similar to the target dataset.
Steps to classify text using embeddings:
- Use the language model to embed all labeled texts.
- Perform a nearest-neighbor search over the stored embeddings.
- Aggregate the labels of the nearest neighbors to get a prediction.
- We need to embed new text we want to classify and assign labels based on the labels of its nearest neighbors.
- It is crucial to calibrate the number of neighbors for the nearest-neighbors search.
- Using too few neighbors might result in noisy predictions.
- Using too many neighbors might mix neighboring groups.
import torch
from transformers import AutoTokenizer, AutoModel
Instantiate a tokenizer and model using a GPT-2 checkpoint trained on Python code
= "miguelvictor/python-gpt2-large"
model_ckpt = AutoTokenizer.from_pretrained(model_ckpt)
tokenizer = AutoModel.from_pretrained(model_ckpt) model
Note: Transformer models like GPT-2 return one embedding vector per token, and we want a single embedding for the entire output.
Define a function to create a single-vector representation for model output using average pooling
- We don’t want to include padding tokens in the average.
def mean_pooling(model_output, attention_mask):
    # Extract the token embeddings
    token_embeddings = model_output[0]
    # Expand the attention mask to the token embedding dimensions
    input_mask_expanded = (attention_mask
                           .unsqueeze(-1)
                           .expand(token_embeddings.size())
                           .float())
    # Sum the embeddings, but ignore masked tokens
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    # Return the average as a single vector
    return sum_embeddings / sum_mask
Define a function to embed sample text
def embed_text(examples):
    inputs = tokenizer(examples["text"], padding=True, truncation=True,
                       max_length=128, return_tensors="pt")
    with torch.no_grad():
        model_output = model(**inputs)
    pooled_embeds = mean_pooling(model_output, inputs["attention_mask"])
    return {"embedding": pooled_embeds.cpu().numpy()}
Use the end-of-string token as the padding token since GPT-style models don’t have one
tokenizer.pad_token = tokenizer.eos_token
Get the embeddings for each split
= ds["train"].map(embed_text, batched=True, batch_size=16)
embs_train = ds["valid"].map(embed_text, batched=True, batch_size=16)
embs_valid = ds["test"].map(embed_text, batched=True, batch_size=16) embs_test
Dataset.add_faiss_index
- Documentation
- Add a dense index using FAISS for fast retrieval.
- FAISS is a library for efficient similarity search of dense vectors.
import faiss
Create a FAISS index using the embeddings for the train split
"embedding") embs_train.add_faiss_index(
Dataset({
features: ['text', 'labels', 'label_ids', 'embedding'],
num_rows: 223
})
datasets.search.IndexableMixin.get_nearest_examples
- Documentation
- Find the nearest examples in the dataset to the query.
Perform a nearest-neighbor lookup
i, k = 0, 3  # Select the first query and 3 nearest neighbors
rn, nl = "\r\n\r\n", "\n"  # Used to remove newlines in text for compact display

query = np.array(embs_valid[i]["embedding"], dtype=np.float32)
scores, samples = embs_train.get_nearest_examples("embedding", query, k=k)

print(f"QUERY LABELS: {embs_valid[i]['labels']}")
print(f"QUERY TEXT:\n{embs_valid[i]['text'][:200].replace(rn, nl)} [...]\n")
print("="*50)
print(f"Retrieved documents:")
for score, label, text in zip(scores, samples["labels"], samples["text"]):
    print("="*50)
    print(f"TEXT:\n{text[:200].replace(rn, nl)} [...]")
    print(f"SCORE: {score:.2f}")
    print(f"LABELS: {label}")
QUERY LABELS: ['new model']
QUERY TEXT:
Implementing efficient self attention in T5
# 🌟 New model addition
My teammates and I (including @ice-americano) would like to use efficient self attention methods such as Linformer, Performer and [...]
==================================================
Retrieved documents:
==================================================
TEXT:
Add Linformer model
# 🌟 New model addition
## Model description
### Linformer: Self-Attention with Linear Complexity
Paper published June 9th on ArXiv: https://arxiv.org/abs/2006.04768
La [...]
SCORE: 54.92
LABELS: ['new model']
==================================================
TEXT:
Add FAVOR+ / Performer attention
# 🌟 FAVOR+ / Performer attention addition
Are there any plans to add this new attention approximation block to Transformers library?
## Model description
The n [...]
SCORE: 57.90
LABELS: ['new model']
==================================================
TEXT:
Implement DeLighT: Very Deep and Light-weight Transformers
# 🌟 New model addition
## Model description
DeLight, that delivers similar or better performance than transformer-based models with sign [...]
SCORE: 60.12
LABELS: ['new model']
Note:
- The three retrieved documents all have the same labels as they should.
- The query and the retrieved documents all relate to adding new and efficient transformer models.
Define a function that returns the sample predictions using a label occurrence threshold
def get_sample_preds(sample, m):
    return (np.sum(sample["label_ids"], axis=0) >= m).astype(int)
datasets.search.IndexableMixin.get_nearest_examples_batch
Define a function to test different k and threshold values for nearest-neighbor search
def find_best_k_m(ds_train, valid_queries, valid_labels, max_k=17):
    max_k = min(len(ds_train), max_k)
    perf_micro = np.zeros((max_k, max_k))
    perf_macro = np.zeros((max_k, max_k))
    for k in range(1, max_k):
        for m in range(1, k + 1):
            _, samples = ds_train.get_nearest_examples_batch("embedding",
                                                             valid_queries, k=k)
            y_pred = np.array([get_sample_preds(s, m) for s in samples])
            clf_report = classification_report(valid_labels, y_pred,
                target_names=mlb.classes_, zero_division=0, output_dict=True)
            perf_micro[k, m] = clf_report["micro avg"]["f1-score"]
            perf_macro[k, m] = clf_report["macro avg"]["f1-score"]
    return perf_micro, perf_macro
Test different k and threshold values
= np.array(embs_valid["label_ids"])
valid_labels = np.array(embs_valid["embedding"], dtype=np.float32)
valid_queries = find_best_k_m(embs_train, valid_queries, valid_labels) perf_micro, perf_macro
Plot the results
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 3.5), sharey=True)
ax0.imshow(perf_micro)
ax1.imshow(perf_macro)

ax0.set_title("micro scores")
ax0.set_ylabel("k")
ax1.set_title("macro scores")
for ax in [ax0, ax1]:
    ax.set_xlim([0.5, 17 - 0.5])
    ax.set_ylim([17 - 0.5, 0.5])
    ax.set_xlabel("m")
plt.show()
Note:
- Choosing a threshold value \(m\) that is too large or small for a given \(k\) value yields suboptimal results.
- A ratio of approximately \(m/k = 1/3\) achieves the best results.
Find the best k and threshold values
k, m = np.unravel_index(perf_micro.argmax(), perf_micro.shape)
print(f"Best k: {k}, best m: {m}")
Best k: 15, best m: 5
Note: We get the best performance when we retrieve the 15 nearest neighbors and then assign the labels that occurred at least five times.
Evaluate the embedding lookup performance using different training slices
# Drop the FAISS index
embs_train.drop_index("embedding")
test_labels = np.array(embs_test["label_ids"])
test_queries = np.array(embs_test["embedding"], dtype=np.float32)

for train_slice in train_slices:
    # Create a FAISS index from training slice
    embs_train_tmp = embs_train.select(train_slice)
    embs_train_tmp.add_faiss_index("embedding")
    # Get best k, m values with validation set
    perf_micro, _ = find_best_k_m(embs_train_tmp, valid_queries, valid_labels)
    k, m = np.unravel_index(perf_micro.argmax(), perf_micro.shape)
    # Get predictions on test set
    _, samples = embs_train_tmp.get_nearest_examples_batch("embedding",
                                                           test_queries, k=int(k))
    y_pred = np.array([get_sample_preds(s, m) for s in samples])
    # Evaluate predictions
    clf_report = classification_report(test_labels, y_pred,
        target_names=mlb.classes_, zero_division=0, output_dict=True)
    macro_scores["Embedding"].append(clf_report["macro avg"]["f1-score"])
    micro_scores["Embedding"].append(clf_report["micro avg"]["f1-score"])
Compare performance to previous methods
"Embedding") plot_metrics(micro_scores, macro_scores, train_samples,
Note:
- The embedding lookup is competitive on the micro F1 scores while only having two “learnable” parameters, k and m, but performs slightly worse on the macro scores.
- The method that works best in practice strongly depends on the domain.
- The zero-shot pipeline might work much better on tasks closer to the pretraining domain.
- The quality of the embeddings depends on the model and the data it was originally trained on.
Efficient Similarity Search with FAISS
- We usually speed up text search by creating an inverted index that maps terms to documents.
- An inverted index works just like an index at the end of a book, where each word maps to the pages/documents it occurs in.
- We can quickly find which documents the search terms appear in when performing a query.
- An inverted index works well with discrete objects such as words but does not work with continuous ones like vectors.
- We need to look for similar matches instead of exact matches since each document likely has a unique vector.
- FAISS avoids comparing the query vector to every vector in the database with several tricks.
- FAISS speeds up the comparison process by applying k-means clustering to the dataset, grouping the embeddings by similarity.
- We get a centroid vector for each group, which is the average of all group members.
- We can then search across the k centroids for the one that is most similar to our query and then search within the corresponding group.
- This approach reduces the number of comparisons from n to \(k + \frac{n}{k}\).
- The total number of comparisons is minimized at \(k = \sqrt{n}\); for example, with \(n = 10{,}000\) and \(k = 100\), a query needs roughly \(200\) comparisons instead of \(10{,}000\) (see the sketch after this list).
- FAISS also provides a GPU-enabled version for increased speed and several options to compress vectors with advanced quantization schemes.
- Guidelines to choose an index
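To make the clustering idea above concrete, here is a minimal sketch of an IVF ("inverted file") index built directly with FAISS. The toy data and the `IndexIVFFlat`/`IndexFlatL2`/`nlist`/`nprobe` choices are illustrative assumptions, not the setup used in the chapter, which relies on `Dataset.add_faiss_index` instead.

```python
import faiss
import numpy as np

# Toy database: n vectors of dimension d (illustrative values only)
n, d = 10_000, 768
xb = np.random.rand(n, d).astype("float32")

# Partition the database into roughly sqrt(n) clusters via k-means
nlist = int(np.sqrt(n))
quantizer = faiss.IndexFlatL2(d)                 # assigns vectors to centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                  # learn the cluster centroids
index.add(xb)

# At query time, search only the `nprobe` closest clusters instead of all vectors
index.nprobe = 4
distances, ids = index.search(xb[:1], k=3)
print(ids)
```

With these numbers, a query is compared against roughly 100 centroids plus the members of a handful of clusters rather than all 10,000 database vectors.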
Fine-Tuning a Vanilla Transformer
- We can fine-tune a pretrained transformer model when we have labeled data.
- Starting with a pretrained BERT-like model is often a good idea.
- The target corpus should not be too different from the pretraining corpus.
- The Hugging Face hub has many models pretrained on different corpora.
import torch
from transformers import (AutoTokenizer, AutoConfig,
AutoModelForSequenceClassification)
Load the pretrained tokenizer for the standard BERT checkpoint
= "bert-base-uncased"
model_ckpt = AutoTokenizer.from_pretrained(model_ckpt) tokenizer
Tokenize the dataset
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

ds_enc = ds.map(tokenize, batched=True)
ds_enc = ds_enc.remove_columns(['labels', 'text'])
Change the `label_ids` column data type to float for the multilabel loss function
"torch")
ds_enc.set_format(= ds_enc.map(lambda x: {"label_ids_f": x["label_ids"].to(torch.float)},
ds_enc =["label_ids"])
remove_columns= ds_enc.rename_column("label_ids_f", "label_ids") ds_enc
from transformers import Trainer, TrainingArguments
Keep the best model based on the micro \(F_{1}\)-score
training_args_fine_tune = TrainingArguments(
    output_dir="./results", num_train_epochs=20, learning_rate=3e-5,
    lr_scheduler_type='constant', per_device_train_batch_size=4,
    per_device_eval_batch_size=32, weight_decay=0.0,
    evaluation_strategy="epoch", save_strategy="epoch", logging_strategy="epoch",
    load_best_model_at_end=True, metric_for_best_model='micro f1',
    save_total_limit=1, log_level='error')
from scipy.special import expit as sigmoid
Define a function to compute the \(F_{1}\)-scores
def compute_metrics(pred):
    y_true = pred.label_ids
    # Normalize the model predictions
    y_pred = sigmoid(pred.predictions)
    y_pred = (y_pred > 0.5).astype(float)

    clf_dict = classification_report(y_true, y_pred, target_names=all_labels,
                                     zero_division=0, output_dict=True)
    return {"micro f1": clf_dict["micro avg"]["f1-score"],
            "macro f1": clf_dict["macro avg"]["f1-score"]}
Initialize a BertConfig for multi-label classification
config = AutoConfig.from_pretrained(model_ckpt)
config.num_labels = len(all_labels)
config.problem_type = "multi_label_classification"
config
BertConfig {
"_name_or_path": "bert-base-uncased",
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2",
"3": "LABEL_3",
"4": "LABEL_4",
"5": "LABEL_5",
"6": "LABEL_6",
"7": "LABEL_7",
"8": "LABEL_8"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2,
"LABEL_3": 3,
"LABEL_4": 4,
"LABEL_5": 5,
"LABEL_6": 6,
"LABEL_7": 7,
"LABEL_8": 8
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"problem_type": "multi_label_classification",
"transformers_version": "4.18.0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
Train a classifier from scratch for each training slice
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for train_slice in train_slices:
    model = AutoModelForSequenceClassification.from_pretrained(model_ckpt,
                                                               config=config).to(device)
    trainer = Trainer(
        model=model, tokenizer=tokenizer,
        args=training_args_fine_tune,
        compute_metrics=compute_metrics,
        train_dataset=ds_enc["train"].select(train_slice),
        eval_dataset=ds_enc["valid"])

    old_collator = trainer.data_collator
    trainer.data_collator = lambda data: dict(old_collator(data))

    trainer.train()
    pred = trainer.predict(ds_enc["test"])
    metrics = compute_metrics(pred)
    macro_scores["Fine-tune (vanilla)"].append(metrics["macro f1"])
    micro_scores["Fine-tune (vanilla)"].append(metrics["micro f1"])
Compare the results to previous methods
"Fine-tune (vanilla)") plot_metrics(micro_scores, macro_scores, train_samples,
Note: The fine-tuned model is competitive when we have at least 64 training examples.
In-Context and Few-Shot Learning with Prompts
- The GPT-3 paper found that large language models can effectively learn from examples presented in a prompt (a minimal prompt sketch follows this list).
- The larger the model, the better it is at using in-context examples.
- An alternative to using labeled data is to create examples of the prompts and desired predictions and continue training the language model on those examples.
- ADAPET beats GPT-3 on a wide variety of tasks using this approach.
- Researchers at Hugging Face found that this approach can be more data-efficient than fine-tuning a custom head.
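As a rough illustration of what "examples presented in a prompt" could look like for the issue tagger, here is a hypothetical few-shot prompt; the example issues and labels are made up for illustration and this is not code from the chapter.

```python
# A hypothetical few-shot prompt: a large language model would be asked to
# continue the text after the final "Labels:" line.
prompt = """Tag each GitHub issue with the appropriate labels.

Issue: Add Linformer model
Labels: new model

Issue: Tokenizer produces wrong offsets for multilingual text
Labels: tokenization

Issue: Implementing efficient self attention in T5
Labels:"""
```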
Leveraging Unlabeled Data
- Domain adaptation involves continuing the pretraining process of predicting masked words using unlabeled data from the target domain.
- We can then reuse the adapted model for many use cases.
- Domain adaptation can help boost model performance with unlabeled data and little effort.
Fine-Tuning a Language Model
- We need to mask the `[CLS]` and `[SEP]` tokens from the loss, so we don’t train the model to predict them when doing masked language modeling.
- We can get a mask when tokenizing by setting `return_special_tokens_mask=True`.
- We can use a specialized data collator to prepare the elements in a batch for masked language modeling on the fly.
- The data collator needs to mask tokens in the input sequence and have the target tokens in the outputs.
Define a function to tokenize the text and get the special token mask information
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     max_length=128, return_special_tokens_mask=True)
Retokenize the text
ds_mlm = ds.map(tokenize, batched=True)
ds_mlm = ds_mlm.remove_columns(["labels", "text", "label_ids"])
ds_mlm
DatasetDict({
train: Dataset({
features: ['input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
num_rows: 223
})
valid: Dataset({
features: ['input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
num_rows: 106
})
test: Dataset({
features: ['input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
num_rows: 111
})
unsup: Dataset({
features: ['input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
num_rows: 9303
})
})
pd.DataFrame(ds_mlm['train']['special_tokens_mask'][0]).T.style.hide(axis='columns').hide(axis='rows')
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
pd.DataFrame(tokenizer.convert_ids_to_tokens(ds_mlm['train']['input_ids'][0])).T.style.hide(axis='columns').hide(axis='rows')
[CLS] | add | new | canine | model | # | [UNK] | new | model | addition | # | # | model | description | recently | proposed | a | new | * | * | c | * | * | hara | ##cter | * | * | a | * | * | rc | ##hit | ##ect | ##ure | with | * | * | n | * | * | o | token | ##ization | * | * | i | * | * | n | * | * | n | * | * | eu | ##ral | * | * | e | * | * | nc | ##oder | ##s | architecture | ( | canine | ) | . | not | only | the | title | is | exciting | : |
|
pipeline | ##d | nl | ##p | systems | have | largely | been | superseded | by | end | - | to | - | end | neural | modeling | , | yet | nearly | all | commonly | - | used | models | still | require | an | explicit | token | ##ization | step | . | while | recent | token | ##ization | approaches | based | on | data | - | derived | sub | ##word | lexi | ##con | ##s | are | [SEP] |
from transformers import DataCollatorForLanguageModeling, set_seed
transformers.data.data_collator.DataCollatorForLanguageModeling
- Documentation
- Create a data collator for language modeling.
Initialize the data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm_probability=0.15)
Reset random seed
set_seed(3)
Test the data collator
# Switch the return formats of the tokenizer and data collator to NumPy
= "np"
data_collator.return_tensors = tokenizer("Transformers are awesome!", return_tensors="np")
inputs
= data_collator([{"input_ids": inputs["input_ids"][0]}])
outputs
pd.DataFrame({"Original tokens": tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
"Masked tokens": tokenizer.convert_ids_to_tokens(outputs["input_ids"][0]),
"Original input_ids": inputs["input_ids"][0],
"Masked input_ids": outputs["input_ids"][0],
"Labels": outputs["labels"][0]}).T
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Original tokens | [CLS] | transformers | are | awesome | ! | [SEP] |
| Masked tokens | [CLS] | transformers | are | awesome | [MASK] | [SEP] |
| Original input_ids | 101 | 19081 | 2024 | 12476 | 999 | 102 |
| Masked input_ids | 101 | 19081 | 2024 | 12476 | 103 | 102 |
| Labels | -100 | -100 | -100 | -100 | 999 | -100 |
Switch the return formats of the data collator to PyTorch
= "pt" data_collator.return_tensors
from transformers import AutoModelForMaskedLM
Initialize the training arguments
training_args = TrainingArguments(
    output_dir=f"{model_ckpt}-issues-128", per_device_train_batch_size=32,
    logging_strategy="epoch", evaluation_strategy="epoch", save_strategy="no",
    num_train_epochs=16, push_to_hub=True, log_level="error", report_to="none", fp16=True)
Initialize the Trainer
trainer = Trainer(
    model=AutoModelForMaskedLM.from_pretrained("bert-base-uncased").to(device),
    tokenizer=tokenizer, args=training_args, data_collator=data_collator,
    train_dataset=ds_mlm["unsup"], eval_dataset=ds_mlm["train"])
Train the model
trainer.train()
TrainOutput(global_step=4656, training_loss=1.2827145487991805, metrics={'train_runtime': 561.7109, 'train_samples_per_second': 264.99, 'train_steps_per_second': 8.289, 'train_loss': 1.2827145487991805, 'epoch': 16.0})
Push the trained model to Hugging Face Hub
"Training complete!") trainer.push_to_hub(
'https://huggingface.co/cj-mills/bert-base-uncased-issues-128/commit/c7ba31377378cb7fb9412e9524aaf76eed6956b7'
Inspect the training and validation losses from the training session
df_log = pd.DataFrame(trainer.state.log_history)

# Drop missing values
(df_log.dropna(subset=["eval_loss"]).reset_index()["eval_loss"]
 .plot(label="Validation"))
df_log.dropna(subset=["loss"]).reset_index()["loss"].plot(label="Train")

plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend(loc="upper right")
plt.show()
Note: Both the training and validation loss decreased significantly.
Free unoccupied cached memory
torch.cuda.empty_cache()
Fine-Tuning a Classifier
Initialize a BertConfig for multi-label classification with the custom checkpoint
= "bert-base-uncased"
model_ckpt = f'{model_ckpt}-issues-128'
model_ckpt = AutoConfig.from_pretrained(model_ckpt)
config = len(all_labels)
config.num_labels = "multi_label_classification"
config.problem_type config
BertConfig {
"_name_or_path": "bert-base-uncased-issues-128",
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2",
"3": "LABEL_3",
"4": "LABEL_4",
"5": "LABEL_5",
"6": "LABEL_6",
"7": "LABEL_7",
"8": "LABEL_8"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2,
"LABEL_3": 3,
"LABEL_4": 4,
"LABEL_5": 5,
"LABEL_6": 6,
"LABEL_7": 7,
"LABEL_8": 8
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"problem_type": "multi_label_classification",
"torch_dtype": "float32",
"transformers_version": "4.18.0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
Fine-tune the classifier on each training slice
for train_slice in train_slices:
    model = AutoModelForSequenceClassification.from_pretrained(model_ckpt,
                                                               config=config).to(device)
    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args_fine_tune,
        compute_metrics=compute_metrics,
        train_dataset=ds_enc["train"].select(train_slice),
        eval_dataset=ds_enc["valid"],
    )
    trainer.train()
    pred = trainer.predict(ds_enc['test'])
    metrics = compute_metrics(pred)
    # DA refers to domain adaptation
    macro_scores['Fine-tune (DA)'].append(metrics['macro f1'])
    micro_scores['Fine-tune (DA)'].append(metrics['micro f1'])
"Fine-tune (DA)") plot_metrics(micro_scores, macro_scores, train_samples,
Note: The domain-adapted classifier performs better than vanilla BERT, especially in the low-data regime.
Advanced Methods
Unsupervised data augmentation
- Unsupervised data augmentation (UDA) is based on the idea that a model’s predictions should be consistent for an unlabeled example and a slightly distorted version of it.
- We can introduce distortions using standard data augmentation strategies.
- We enforce consistency by minimizing the KL divergence between the predictions of the original and distorted examples.
- We incorporate the consistency requirement by augmenting the cross-entropy loss on the labeled examples with an additional consistency term computed on the unlabeled examples (see the sketch after this list).
- The model trains on the labeled data with the standard supervised approach, but we constrain the model to make consistent predictions on the unlabeled data.
- BERT models trained with UDA on a handful of labeled examples get similar performance to models trained on thousands of examples.
- UDA requires a data augmentation pipeline to generate distorted examples.
- Training takes much longer since it requires multiple forward passes to generate the predicted distributions on the unlabeled and augmented examples.
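The combined objective described above can be sketched as follows. This is a simplified, single-label version assuming we already have logits for a labeled batch plus the original and augmented versions of an unlabeled batch; the function name, the weighting factor `lam`, and the omission of UDA’s sharpening and confidence-thresholding details are assumptions for illustration, not the paper’s exact formulation.

```python
import torch.nn.functional as F

def uda_style_loss(sup_logits, sup_labels, unsup_logits_orig, unsup_logits_aug, lam=1.0):
    # Standard supervised cross-entropy on the labeled examples
    sup_loss = F.cross_entropy(sup_logits, sup_labels)
    # Consistency term: KL divergence between predictions on the original
    # unlabeled examples (treated as fixed targets) and on their augmented versions
    targets = F.softmax(unsup_logits_orig.detach(), dim=-1)
    log_preds = F.log_softmax(unsup_logits_aug, dim=-1)
    consistency_loss = F.kl_div(log_preds, targets, reduction="batchmean")
    return sup_loss + lam * consistency_loss
```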
Uncertainty-aware self-training
- Uncertainty-aware self-training (UST) involves training a teacher model on labeled data and using that model to create pseudo-labels for unlabeled data.
- A student model then trains on the pseudo-labeled data.
- We get an uncertainty measure of the model’s predictions by feeding it the same input several times with dropout turned on (a minimal Monte Carlo dropout sketch follows this list).
- The variance of the predictions gives a proxy for the certainty of the model on a specific sample.
- We then sample the pseudo-labels using Bayesian Active Learning by Disagreement (BALD).
- The teacher continuously gets better at creating pseudo-labels, improving the student’s performance.
- UST gets within a few percentage points of models trained on datasets with thousands of labeled samples and beats UDA on several datasets.
- Uncertainty-aware Self-training for Text Classification with Few Labels
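A minimal sketch of the dropout-based uncertainty estimate mentioned above (Monte Carlo dropout): the same batch is passed through the model several times with dropout left active, and the variance across passes serves as the uncertainty proxy. The function name, the number of passes, and the sigmoid outputs (matching this chapter’s multilabel setup) are assumptions, not code from the chapter or the UST paper.

```python
import torch

def mc_dropout_uncertainty(model, inputs, n_passes=10):
    # Keep the model in train mode so dropout stays active during inference
    model.train()
    preds = []
    with torch.no_grad():
        for _ in range(n_passes):
            logits = model(**inputs).logits
            preds.append(torch.sigmoid(logits))
    preds = torch.stack(preds)              # (n_passes, batch_size, num_labels)
    # Mean prediction and per-label variance across the stochastic passes
    return preds.mean(dim=0), preds.var(dim=0)
```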
Conclusion
- Set up an evaluation pipeline to test different approaches for dealing with little to no labeled data.
- Build a validation and test set early on.
- There are tradeoffs between more complex approaches like UDA and UST and getting more data.
- It might make more sense to create a small high-quality dataset rather than engineering a very complex method to compensate for the lack of data.
- Annotating a few hundred examples usually takes a couple of hours to a few days.
- There are many annotation tools available to speed up labeling new data.
References
Previous: Notes on Transformers Book Ch. 8
Next: Notes on Transformers Book Ch. 10