Notes on Transformers Book Ch. 10
- Training Transformers from Scratch
- Project: Python Source Code Generator
- Large Datasets and Where to Find Them
- Building a Tokenizer
- Training a Model from Scratch
- Results and Analysis
- References
import transformers
import datasets
import accelerate
# Only print error messages
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()
transformers.__version__, datasets.__version__, accelerate.__version__
('4.18.0', '2.1.0', '0.5.1')
import ast
# https://astor.readthedocs.io/en/latest/
import astor
import inspect
import textwrap
def print_source(obj, exclude_doc=True):
    # Get source code
    source = inspect.getsource(obj)
    # Remove any common leading whitespace from every line
    cleaned_source = textwrap.dedent(source)
    # Parse the source into an AST node.
    parsed = ast.parse(cleaned_source)

    for node in ast.walk(parsed):
        # Skip any nodes that are not class or function definitions
        if not isinstance(node, (ast.FunctionDef, ast.ClassDef, ast.AsyncFunctionDef)):
            continue
        # Drop the first statement of the body (the docstring position)
        if exclude_doc and len(node.body) > 1: node.body = node.body[1:]
    print(astor.to_source(parsed))
Training Transformers from Scratch
- Efficiently training large models from scratch requires special tools for distributed training.
Project: Python Source Code Generator
- The goal is to train a GPT-like model to generate Python source code.
Existing AI Code Completion Products
CodeParrot
- GitHub Repository
- CodeParrot is a GPT-2 model trained from scratch on Python code.
Large Datasets and Where to Find Them
- Many domains have large amounts of data available, such as legal documents, biomedical databases, and programming codebases.
- Large datasets can usually only be labeled using heuristics or accompanying metadata.
- We can still use large unlabeled datasets to fine-tune language models for domain adaptation.
- Using a pretrained model forces you to use the model’s corresponding tokenizer.
- Using a tokenizer trained on a corpus from a different domain is typically suboptimal.
Challenges of Building a Large-Scale Corpus
- The model will inherit any defects in the pretraining corpus.
- It becomes more difficult to control or fully understand the contents of a dataset the larger it gets.
- Most exceedingly large datasets are not handcrafted.
- Creating large-scale datasets typically requires using data generated as a side effect of other activities.
- The high degree of automation used to create large-scale datasets means there is limited control over the content and the method to create them.
- There is an increased risk of training a model on lower-quality and biased data.
- A significant portion of the C4 corpus used to train T5 is machine-translated rather than human-translated.
- The stopword filtering in C4 disproportionately removed African-American English from the corpus.
- It is challenging to find a middle ground between including too much explicit content and erasing all mention of sexuality or gender.
- Common words like “sex” are absent from C4.
- There are many copyright violations in the Bookcorpus dataset used to train BERT.
- Bookcorpus also contains genre-skew toward “romance” novels.
- Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books
- Addressing “Documentation Debt” in Machine Learning Research: A Retrospective Datasheet for BookCorpus
Compare text generations from GPT and GPT-2
- The original GPT model was trained predominantly on BookCorpus.
- GPT-2 was trained on web pages, blogs, and news articles linked from Reddit.
from transformers import pipeline, set_seed
Initialize text generation pipelines with the original GPT and GPT-2
generation_gpt = pipeline("text-generation", model="openai-gpt")
generation_gpt2 = pipeline("text-generation", model="gpt2")
Note: The main difference between the two models is the pretraining dataset.
Compare the model sizes
def model_size(model):
    return sum(t.numel() for t in model.parameters())
print(f"GPT size: {model_size(generation_gpt.model)/1000**2:.1f}M parameters")
print(f"GPT2 size: {model_size(generation_gpt2.model)/1000**2:.1f}M parameters")
GPT size: 116.5M parameters
GPT2 size: 124.4M parameters
Note: The original GPT model is approximately the same size as the smallest GPT-2 variant.
Reset random seed
set_seed(1)
Define a function to generate text using a prompt
def enum_pipeline_ouputs(pipe, prompt, num_return_sequences):
    out = pipe(prompt, num_return_sequences=num_return_sequences,
               clean_up_tokenization_spaces=True)
    return "\n".join(f"{i+1}." + s["generated_text"] for i, s in enumerate(out))
Compare the output of the two models
prompt = "\nWhen they came back"
print("GPT completions:\n" + enum_pipeline_ouputs(generation_gpt, prompt, 3))
print("")
print("GPT-2 completions:\n" + enum_pipeline_ouputs(generation_gpt2, prompt, 3))
GPT completions:
1.
When they came back.
" we need all we can get, " jason said once they had settled into the back of the truck without anyone stopping them. " after getting out here, it 'll be up to us what to find. for now
2.
When they came back.
his gaze swept over her body. he 'd dressed her, too, in the borrowed clothes that she 'd worn for the journey.
" i thought it would be easier to just leave you there. " a woman like
3.
When they came back to the house and she was sitting there with the little boy.
" don't be afraid, " he told her. she nodded slowly, her eyes wide. she was so lost in whatever she discovered that tom knew her mistake
GPT-2 completions:
1.
When they came back we had a big dinner and the other guys went to see what their opinion was on her. I did an hour and they were happy with it.
2.
When they came back to this island there had been another massacre, but he could not help but feel pity for the helpless victim who had been left to die, and that they had failed that day. And so was very, very grateful indeed.
3.
When they came back to our house after the morning, I asked if she was sure. She said, "Nope." The two kids were gone that morning. I thought they were back to being a good friend.
When Dost
Note:
- The text generated with the original GPT model has a distinctive romance skew.
- GPT-2 generates more neutral text containing blog-like or adventure-related elements.
- A model reflects the language bias and over or underrepresentation of populations of the dataset used to train it.
- We need to consider the model’s biases concerning the target audience.
- Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure
Building a Custom Code Dataset
- We can obtain a pretraining corpus of Python code from GitHub repositories.
- We can access GitHub repositories via the GitHub REST API or public dataset inventories like Google BigQuery.
- The GitHub REST API is rate limited but provides access to additional attributes like star and downstream usage information.
- The Libraries.io service monitors open source packages.
bigquery-public-data.github_repos.contents table
- The bigquery-public-data.github_repos.contents table contains copies of all ASCII files less than 10MB in size.
CodeSearchNet corpus
- The CodeSearchNet corpus contains 2 million comment-code pairs from open-source libraries hosted on GitHub.
- It contains code and documentation for several programming languages.
- Hugging Face Dataset Card
Creating a dataset with Google BigQuery
Steps to export Python files
- Create a Google Cloud account.
- Create a Google BigQuery project under your account.
- Create a dataset inside the project.
- Create a table in the dataset to store the results of the SQL request.
- Prepare the following SQL query and specify a destination table
SELECT
  f.repo_name, f.path, c.copies, c.size, c.content, l.license
FROM
  `bigquery-public-data.github_repos.files` AS f
JOIN
  `bigquery-public-data.github_repos.contents` AS c
ON
  f.id = c.id
JOIN
  `bigquery-public-data.github_repos.licenses` AS l
ON
  f.repo_name = l.repo_name
WHERE
  NOT c.binary
  AND ((f.path LIKE '%.py')
  AND (c.size BETWEEN 1024 AND 1048575))
- Run the query
Note: Encountered the following error when attempting to run the query
Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas
- The above query processes about 2.6TB of data to extract 26.8 million files.
- The resulting dataset contains about 50 GB of compressed JSON files.
- The dataset is about 200GB when uncompressed.
- Each JSON file contains source code from Python files.
- The query filters out nearly empty files, such as __init__.py files, and files larger than 1MB.
- The query includes the licenses for the files so we can filter the training data later on.
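The size and extension conditions in the query's WHERE clause can also be applied locally, for example when filtering files gathered from another source. The following is a minimal sketch; the function name keep_file and the constant names are hypothetical, and only the thresholds come from the query above.

```python
MIN_BYTES = 1024        # lower bound from the query's BETWEEN clause
MAX_BYTES = 1_048_575   # upper bound from the query, just under 1 MiB

def keep_file(path: str, size: int, is_binary: bool) -> bool:
    """Mirror the WHERE clause: non-binary .py files within the size bounds."""
    return (not is_binary) and path.endswith(".py") and MIN_BYTES <= size <= MAX_BYTES

# An empty __init__.py is dropped, while a typical module is kept.
print(keep_file("pkg/__init__.py", 0, False))
print(keep_file("pkg/model.py", 20_000, False))
```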
Steps to download results from Google Cloud
- Export results to Google Cloud
- Create a bucket and a folder in Google Cloud Storage (GCS).
- Export your table to this bucket by selecting Export > Export to GCS, with a JSON export format and gzip compression.
- Download the bucket to your local machine using gsutil
- Install gsutil with pip install gsutil.
- Configure gsutil with your Google account: gsutil config.
- Copy your bucket on your machine:
gsutil -m -o "GSUtil:parallel_process_count=1" cp -r gs://<name_of_bucket> .
Alternative: Download the dataset from Hugging Face Hub
git clone https://huggingface.co/datasets/transformersbook/codeparrot
To Filter the Noise or Not?
- Data preparation is crucial, and we should clean the dataset as much as possible.
- The quality of code in GitHub repositories varies greatly.
- Having some noise in the training dataset makes our code generation system robust to noisy inputs at inference time but also makes predictions more random.
- The intended use case and whole-system integration determine whether you want more or less noisy data and whether to add pre- and post-filtering operations.
Potential steps to clean dataset
- Filter code based on stars or usage information.
- Code with more stars or higher usage is more likely to be higher quality.
- Remove duplicated code samples.
- Consider copyright information.
- Investigate the language used in the documentation, comments, or docstrings.
- Remove personally identifiable information and secrets such as passwords or keys.
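As an illustration of the duplicate-removal step, here is a minimal sketch of exact deduplication by content hash. It is not the method used for CodeParrot: the helper names are hypothetical, the light whitespace normalization is an assumption, and near-duplicate detection would need a different technique. The sample records reuse the path and content fields from the query above.

```python
import hashlib

def content_key(code: str) -> str:
    """Hash a file's content after stripping trailing whitespace on each line."""
    normalized = "\n".join(line.rstrip() for line in code.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(samples):
    """Yield each sample whose normalized content has not been seen before."""
    seen = set()
    for sample in samples:
        key = content_key(sample["content"])
        if key not in seen:
            seen.add(key)
            yield sample

files = [
    {"path": "a.py", "content": "print('hi')\n"},
    {"path": "b.py", "content": "print('hi')   \n\n"},  # same code, extra whitespace
    {"path": "c.py", "content": "print('bye')\n"},
]
unique = list(deduplicate(files))
print([f["path"] for f in unique])
```

Because the function is a generator that only stores hashes, it can run over a streamed dataset without holding the file contents in memory.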
Working with Large Datasets
- Working with large datasets requires additional considerations regarding disk space and RAM usage.
- It is common for datasets to be larger than the available RAM.
- The Hugging Face Datasets library provides memory mapping and streaming functionality to address RAM and disk space limitations.
Memory mapping
- Hugging Face Datasets uses a mechanism for zero-copy and zero-overhead memory mapping.
- The mechanism caches each dataset in a file that directly reflects the content in RAM.
- Hugging Face Datasets opens a read-only pointer to this file and uses it as a substitute for RAM.
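The general idea of memory mapping can be tried with Python's standard mmap module. This sketch only illustrates the operating-system mechanism on a small temporary file; it makes no claim about how Hugging Face Datasets implements its cache internally.

```python
import mmap
import os
import tempfile

# Write a small file to stand in for a dataset cache file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first record\nsecond record\nthird record\n")
    cache_path = tmp.name

with open(cache_path, "rb") as f:
    # Map the file read-only; the OS loads pages only when they are touched.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        start = mapped.find(b"second")
        end = mapped.find(b"\n", start)
        record = mapped[start:end].decode("utf-8")

os.remove(cache_path)
print(record)
```

Slicing the mapped object reads only the requested bytes, which is why a file far larger than RAM can still be accessed this way.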
from datasets import load_dataset, DownloadConfig
Decompress and load the downloaded dataset from the local folder
Note: The following code block assumes that you have downloaded the BigQuery dataset to a folder called codeparrot. We suggest skipping this step since it will unpack the compressed files and require ~180GB of disk space. This code is just for demonstration purposes, and you can continue below with the streamed dataset, which will not consume that much disk space.
download_config = DownloadConfig(delete_extracted=True, cache_dir="/mnt/980SSD/Datasets/codeparrot-cache")
dataset = load_dataset("/mnt/980SSD/Datasets/codeparrot", cache_dir="/mnt/980SSD/Datasets/codeparrot-cache", split="train",
                       download_config=download_config)
Dataset json downloaded and prepared to /mnt/980SSD/Datasets/codeparrot-cache/json/codeparrot-43fc192cc9f62326/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.
Note:
- The delete_extracted=True argument deletes the extracted files to free up disk space.
- The Hugging Face Datasets library extracted and read the compressed JSON files by loading them into a single optimized cache file.
import psutil, os
Check the size of the cached dataset
print(f"Number of python files code in dataset : {len(dataset)}")
ds_size = sum(os.stat(f["filename"]).st_size for f in dataset.cache_files)
# os.stat.st_size is expressed in bytes, so we convert to GB
print(f"Dataset size (cache file) : {ds_size / 2**30:.2f} GB")
# Process.memory_info is expressed in bytes, so we convert to MB
print(f"RAM used: {psutil.Process(os.getpid()).memory_info().rss >> 20} MB")
Number of python files code in dataset : 18695559
Dataset size (cache file) : 183.68 GB
RAM used: 4359 MB
Note:
- The dataset is much larger than the available RAM, but we can still load and access it.
- NLP data is typically lightweight to load compared to the model processing computations.
- The zero-copy/zero-overhead format uses Apache Arrow under the hood for efficiency.
Streaming
- Some datasets are too large to fit on most hard drives.
- The Hugging Face Datasets library supports streaming many compressed and uncompressed file formats that we can read line-by-line.
- Hugging Face Datasets opens and reads compressed JSON files on the fly in streaming mode.
- Streamed datasets are of the type IterableDataset.
- We cannot access random elements and need to read them in order.
- Methods like shuffle() operate by fetching a buffer of examples and shuffling within this buffer.
- The samples of a streamed dataset are identical to those of a nonstreamed dataset.
- Streamed datasets do not generate a cache file on the drive or require significant RAM.
- Individual batches load into memory as requested, reducing the memory footprint.
- We can also stream remote datasets from the Hugging Face Hub, allowing us to use arbitrarily large datasets on small servers.
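The two ideas above, reading compressed JSON lazily and shuffling within a buffer, can be sketched with the standard library alone. This is an illustration of the general technique under simplifying assumptions, not the Datasets implementation; the function names and the sample file are hypothetical.

```python
import gzip
import json
import os
import random
import tempfile

def stream_json_lines(path):
    """Yield one parsed record at a time from a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def buffered_shuffle(examples, buffer_size, seed=0):
    """Approximate shuffle: fill a buffer, then emit a random slot per new example."""
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        if len(buffer) < buffer_size:
            buffer.append(example)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = example
    # Flush whatever is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer

# Build a tiny compressed file to stream from.
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for i in range(10):
        f.write(json.dumps({"content": f"print({i})"}) + "\n")

shuffled = list(buffered_shuffle(stream_json_lines(path), buffer_size=4))
print(len(shuffled))
```

Only buffer_size examples are held in memory at any time, which is the trade-off: a larger buffer gives a better shuffle at the cost of more RAM.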
streamed_dataset = load_dataset("/mnt/980SSD/Datasets/codeparrot", split="train", streaming=True)
AttributeError: '_io.BufferedReader' object has no attribute 'loc'
Iterate through the streamed dataset
iterator = iter(streamed_dataset)
print(dataset[0] == next(iterator))
print(dataset[1] == next(iterator))
Stream a remote dataset
remote_dataset = load_dataset('transformersbook/codeparrot', split="train",
                              streaming=True)
Adding Datasets to the Hugging Face Hub
- Pushing our dataset to the Hugging Face Hub allows us to access it from a training server and share it with the community.
Command Line Steps
- Log into Hugging Face account
huggingface-cli login
- Create a new dataset repository on the Hub for the training split
huggingface-cli repo create --type dataset codeparrot-train
- Create a new dataset repository on the Hub for the validation split
huggingface-cli repo create --type dataset codeparrot-valid
- Clone the training repository
git clone https://huggingface.co/datasets/transformersbook/codeparrot-train
- Clone the validation repository
git clone https://huggingface.co/datasets/transformersbook/codeparrot-valid
- Copy all but the last GitHub file to use as the training set
cd codeparrot-train
cp ../codeparrot/*.json.gz .
rm ./file-000000000183.json.gz
- Commit the files and push them to the Hub
git add .
git commit -m "Adding dataset files"
git push
- Repeat the process for the validation set
cd ../codeparrot-valid
cp ../codeparrot/file-000000000183.json.gz .
mv ./file-000000000183.json.gz ./file-000000000183_validation.json.gz
git add .
git commit -m "Adding dataset files"
git push
- It is good practice to add README cards that explain how the datasets were created and provide as much helpful information as possible.
- A well-documented dataset is more likely to be valuable to other people, including the future you.
- Hugging Face Dataset Card Creation Guide
Building a Tokenizer
- It is crucial to stick with the same preprocessing design choices used during the pretraining process when using a pretrained model.
- Using a tokenizer prepared for another dataset when training a new model can be suboptimal.
- The T5 tokenizer was trained on the C4 corpus, which underwent extensive stopword filtering, so it is unaware of some common English words like “sex.”
- The CamemBERT tokenizer is only trained on French text and is unaware of common English words such as “being.”
from transformers import AutoTokenizer
def tok_list(tokenizer, string):
    input_ids = tokenizer(string, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(tok) for tok in input_ids]
Initialize tokenizers using the pretrained T5 and CamemBERT model vocabularies
tokenizer_T5 = AutoTokenizer.from_pretrained("t5-base")
tokenizer_camembert = AutoTokenizer.from_pretrained("camembert-base")
Test the limitations of the T5 and CamemBERT tokenizers
print(f'T5 tokens for "sex": {tok_list(tokenizer_T5,"sex")}')
print(f'CamemBERT tokens for "being": {tok_list(tokenizer_camembert,"being")}')
T5 tokens for "sex": ['', 's', 'ex']
CamemBERT tokens for "being": ['be', 'ing']
Note:
- Splitting such short and common words into subparts is often inefficient because it increases the length of the input sequences fed to the model.
- It is essential to consider the domain and the preprocessing of the dataset used to train a tokenizer.
- The tokenizer and model can encode bias from the dataset that impacts the downstream behavior of the model.
The Tokenizer Model
- Training a tokenizer is a way to create an optimal mapping from a string of text to a list of integers that the model can ingest.
- The optimal string-to-integer conversion involves a vocabulary consisting of a list of atomic strings and an associated method to convert, normalize, cut, or map a text string into a list of indices with this vocabulary.
- The list of indices is the input for the neural network.
- The tokenizer processing pipeline involves normalization, pre-tokenization, the tokenizer model, and postprocessing.
- The tokenizer model trains on a corpus.
- Several subword tokenization algorithms are available, such as BPE, WordPiece, and Unigram.
- BPE starts from a list of single characters and creates a vocabulary by progressively creating new tokens formed by merging the most frequently co-occurring basic units and adding them to the list.
- This process continues until we reach the predefined vocabulary size.
- Unigram initializes its base vocabulary with all the words in the corpus and potential subwords and progressively removes or splits the less helpful tokens until it reaches the target vocab size.
- The impact of the chosen tokenization algorithm on downstream performance varies based on the task.
- It is difficult to identify if one algorithm is better than the others.
- Both BPE and Unigram perform reasonably well in most cases.
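The BPE merge loop described above can be sketched in a few lines. This is a toy, character-level illustration on a hypothetical word list, not the Hugging Face Tokenizers implementation: it ignores pre-tokenization and byte-level handling, and it stops after a fixed number of merges rather than at a target vocabulary size.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words, starting from single characters."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

words = ["self"] * 5 + ["set"] * 2 + ["sell"]
merges = train_bpe(words, num_merges=2)
print(merges)
```

Each entry in merges corresponds to one new vocabulary token, so the final vocabulary is the base alphabet plus one token per merge.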
Measuring Tokenizer Performance
- It is challenging to measure a tokenizer’s optimality and performance in practice.
- Subword fertility calculates the average number of subwords produced per tokenized word.
- The proportion of continued words is the proportion of tokenized words in a corpus that are split into at least two subtokens.
- Coverage metrics track information like the proportion of unknown words or rarely used tokens in a tokenized corpus.
- We often estimate the robustness to misspelling or noise and model performance on such out-of-domain examples.
- These measures provide different views on tokenizer performance.
- However, they tend to ignore the interaction of the tokenizer with the model.
- The best way to evaluate tokenizers is using the downstream performance of the model.
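Subword fertility and the proportion of continued words are simple to compute once a tokenizer is available as a function from a word to a list of subwords. In this sketch, toy_tokenize is a hypothetical stand-in that cuts words into chunks of at most four characters; a real tokenizer would be passed in its place.

```python
def tokenizer_metrics(words, tokenize):
    """Return (subword fertility, proportion of continued words) for a word list."""
    pieces_per_word = [len(tokenize(word)) for word in words]
    fertility = sum(pieces_per_word) / len(words)
    continued = sum(1 for n in pieces_per_word if n >= 2) / len(words)
    return fertility, continued

# Hypothetical toy tokenizer: split words into chunks of at most four characters.
def toy_tokenize(word):
    return [word[i:i + 4] for i in range(0, len(word), 4)]

fertility, continued = tokenizer_metrics(["def", "print", "hello", "say"], toy_tokenize)
print(fertility, continued)
```

A fertility close to 1 means most words map to a single token, while a high proportion of continued words signals that the vocabulary fits the corpus poorly.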
A Tokenizer for Python
- Using a natural language pre-tokenizer for Python code might be suboptimal.
- Indentation has semantic meaning in Python code.
- Splitting on all whitespaces and removing them would remove valuable indentation information.
- Line breaks are not meaningful in Python code, and we can remove them without issue.
- Underscores can be part of a single variable name, so splitting on them, as a natural language pre-tokenizer might, is not desirable.
- Byte-level tokenizers preserve spaces and might be a good candidate for tokenizing code.
- Python has a built-in tokenize module that splits Python code strings into meaningful units.
- This approach is slow since it is Python-based and limited by the Python global interpreter lock (GIL).
- Most tokenizers provided by the Hugging Face Tokenizers library are written in Rust and are many orders of magnitude faster to train and use.
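For reference, this is roughly what the built-in tokenize module produces for a short snippet. The sketch uses only the standard library; the sample code string is an assumption chosen to resemble the example used later in these notes.

```python
import io
import tokenize

code = 'def say_hello():\n    print("Hello, World!")\n'

# generate_tokens expects a readline callable that returns str lines.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(code).readline)
]
for name, string in tokens:
    print(f"{name:10} {string!r}")
```

Note that the module emits explicit INDENT and DEDENT tokens and keeps say_hello as a single NAME, which is the kind of structure a natural language pre-tokenizer would lose.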
from transformers import AutoTokenizer
Test the byte-level GPT-2 tokenizer on Python code
python_code = r"""def say_hello():
    print("Hello, World!")
# Print it
say_hello()
"""
python_code
'def say_hello():\n    print("Hello, World!")\n# Print it\nsay_hello()\n'
tokenizer = AutoTokenizer.from_pretrained("gpt2")
pd.DataFrame(tokenizer(python_code).tokens()).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | def | Ġsay | _ | hello | (): | Ċ | Ġ | Ġ | Ġ | Ġprint | (" | Hello | , | ĠWorld | !" | ) | Ċ | # | ĠPrint | Ġit | Ċ | say | _ | hello | () | Ċ |
Inspect the normalization step
print(tokenizer.backend_tokenizer.normalizer)
None
Note: The GPT-2 tokenizer does not use normalization and works directly on raw Unicode inputs.
import pandas as pd
pd.set_option('max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
Inspect the pre-tokenization step
pd.DataFrame(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(python_code))
0 | 1 | |
---|---|---|
0 | def | (0, 3) |
1 | Ġsay | (3, 7) |
2 | _ | (7, 8) |
3 | hello | (8, 13) |
4 | (): | (13, 16) |
5 | ĊĠĠĠ | (16, 20) |
6 | Ġprint | (20, 26) |
7 | (" | (26, 28) |
8 | Hello | (28, 33) |
9 | , | (33, 34) |
10 | ĠWorld | (34, 40) |
11 | !") | (40, 43) |
12 | Ċ | (43, 44) |
13 | # | (44, 45) |
14 | ĠPrint | (45, 51) |
15 | Ġit | (51, 54) |
16 | Ċ | (54, 55) |
17 | say | (55, 58) |
18 | _ | (58, 59) |
19 | hello | (59, 64) |
20 | () | (64, 66) |
21 | Ċ | (66, 67) |
Note:
- Hugging Face Tokenizers provides an offset tracking feature for switching between strings and tokens.
- Hugging Face Tokenizers tracks all operations on the input string so that it is possible to know what part of the input string corresponds to a token after tokenization.
- The numbers in the above output indicate where each token originates in the original string.
- The word “hello” corresponds to the characters 8 to 13 in the original string.
- Each Unicode character is composed of between 1 and 4 bytes.
- There are 143,859 Unicode characters and 256 elements in the byte alphabet.
- We can express each Unicode character as a sequence of bytes.
- We can have a model using an alphabet of only 256 words and process any Unicode string.
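Offset tracking, as described in the note above, can be imitated with a regular expression that records where each piece starts and ends. The pattern below is a hypothetical simplification, not the GPT-2 pre-tokenizer (for instance, it does not attach the leading space to the following word), but the offsets behave the same way.

```python
import re

code = 'def say_hello():\n    print("Hello, World!")\n'

def split_with_offsets(text):
    """Split text into letter runs, single symbols, and whitespace runs with offsets."""
    pattern = r"[A-Za-z]+|[^A-Za-z\s]|\s+"
    return [(m.group(), m.span()) for m in re.finditer(pattern, text)]

pieces = split_with_offsets(code)
for piece, (start, end) in pieces[:6]:
    # Slicing the original string with the offsets recovers the piece.
    print(repr(piece), (start, end), repr(code[start:end]))
```

As in the pre-tokenizer output above, the piece hello maps back to the span (8, 13) of the original string.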
Check the representations of some Unicode characters
a, e = u"a", u"€"
byte = ord(a.encode("utf-8"))
print(f'`{a}` is encoded as `{a.encode("utf-8")}` with a single byte: {byte}')
byte = [ord(chr(i)) for i in e.encode("utf-8")]
print(f'`{e}` is encoded as `{e.encode("utf-8")}` with three bytes: {byte}')
`a` is encoded as `b'a'` with a single byte: 97
`€` is encoded as `b'\xe2\x82\xac'` with three bytes: [226, 130, 172]
Note:
- Building our vocabulary from the 143,859 Unicode characters would make the model’s embedding layer extremely large.
- Using only the 256 byte-values as the vocabulary would result in longer input sequences.
- ByT5: Towards a token-free future with pre-trained byte-to-byte models
- The ByT5 paper provides a detailed study of the overhead from using byte values for our vocabulary.
- The BPE algorithm constructs a medium-sized vocabulary by extending the 256 byte-values with the most common combinations of bytes.
- The name, Byte-Pair Encoding, comes from a data compression technique proposed by Philip Gage in 1994, which operated on bytes.
- Standard BPE algorithms in NLP typically operate on Unicode strings rather than bytes.
- A recent type of BPE that works specifically on bytes is called byte-level BPE.
- The BPE algorithms are designed to work with clean Unicode strings as inputs, not bytes, and expect regular ASCII characters in the inputs without spaces or control characters.
- Many of the Unicode characters corresponding to the first 256 byte values are control characters, such as newline, tab, and escape.
- The GPT-2 tokenizer maps all the 256 input bytes to printable Unicode characters, which the BPE algorithms can digest.
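One way to build such a byte-to-printable-character mapping is sketched below. It follows the approach of the GPT-2 tokenizer as far as I understand it: bytes that already correspond to printable characters keep them, and the rest are shifted to code points from 256 upward. The function name is hypothetical, and the bytes_to_unicode function imported in the code that follows remains the authoritative version.

```python
def byte_to_printable_map():
    """Map every byte value 0..255 to a distinct printable Unicode character."""
    # Bytes whose Latin-1 character is already printable keep that character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes are shifted to code points starting at 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}

mapping = byte_to_printable_map()
print(len(mapping), mapping[ord(" ")], mapping[ord("\n")], mapping[ord("a")])
```

With this construction, the space byte becomes Ġ and the newline byte becomes Ċ, which are the symbols that appear in the tokenizer outputs in these notes.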
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
Inspect the GPT-2 mapping of bytes to Unicode characters
byte_to_unicode_map = bytes_to_unicode()
unicode_to_byte_map = dict((v, k) for k, v in byte_to_unicode_map.items())
base_vocab = list(unicode_to_byte_map.keys())
print(f'Size of our base vocabulary: {len(base_vocab)}')
print(f'First element: `{base_vocab[0]}`, last element: `{base_vocab[-1]}`')
Size of our base vocabulary: 256
First element: `!`, last element: `Ń`
Examples of character mappings in BPE
byte_to_unicode_map = bytes_to_unicode()
unicode_to_byte_map = dict((v, k) for k, v in byte_to_unicode_map.items())
base_vocab = list(unicode_to_byte_map.keys())

examples = [
    ['Regular characters', '`a` and `?`', f'{ord("a")} and {ord("?")}' , f'`{byte_to_unicode_map[ord("a")]}` and `{byte_to_unicode_map[ord("?")]}`'],
    ['Nonprintable control character (carriage return)', '`U+000D`', f'13', f'`{byte_to_unicode_map[13]}`'],
    ['A space', '` `', f'{ord(" ")}', f'`{byte_to_unicode_map[ord(" ")]}`'],
    ['A nonbreakable space', '`\\xa0`', '160', f'`{byte_to_unicode_map[ord(chr(160))]}`'],
    ['A newline character', '`\\n`', '10', f'`{byte_to_unicode_map[ord(chr(10))]}`'],
]

pd.DataFrame(examples, columns = ['Description', 'Character', 'Bytes', 'Mapped bytes'])
| | Description | Character | Bytes | Mapped bytes |
|---|---|---|---|---|
| 0 | Regular characters | `a` and `?` | 97 and 63 | `a` and `?` |
| 1 | Nonprintable control character (carriage return) | `U+000D` | 13 | `č` |
| 2 | A space | ` ` | 32 | `Ġ` |
| 3 | A nonbreakable space | `\xa0` | 160 | `ł` |
| 4 | A newline character | `\n` | 10 | `Ċ` |
Inspect the pre-tokenization step again
pd.DataFrame(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(python_code))
0 | 1 | |
---|---|---|
0 | def | (0, 3) |
1 | Ġsay | (3, 7) |
2 | _ | (7, 8) |
3 | hello | (8, 13) |
4 | (): | (13, 16) |
5 | ĊĠĠĠ | (16, 20) |
6 | Ġprint | (20, 26) |
7 | (" | (26, 28) |
8 | Hello | (28, 33) |
9 | , | (33, 34) |
10 | ĠWorld | (34, 40) |
11 | !") | (40, 43) |
12 | Ċ | (43, 44) |
13 | # | (44, 45) |
14 | ĠPrint | (45, 51) |
15 | Ġit | (51, 54) |
16 | Ċ | (54, 55) |
17 | say | (55, 58) |
18 | _ | (58, 59) |
19 | hello | (59, 64) |
20 | () | (64, 66) |
21 | Ċ | (66, 67) |
Note:
- Consecutive spaces count as a single word.
- Each space preceding a word is attached to and considered part of the following word.
Check the size of the GPT-2 vocabulary
print(f"Size of the vocabulary: {len(tokenizer)}")
Size of the vocabulary: 50257
Note: The GPT-2 vocabulary consists of the base vocabulary with the 256 values of the bytes, 50,000 additional tokens created by repeatedly merging the most commonly occurring tokens, and a special character to represent document boundaries.
Run the GPT-2 tokenizer pipeline again
pd.DataFrame(tokenizer(python_code).tokens()).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | def | Ġsay | _ | hello | (): | Ċ | Ġ | Ġ | Ġ | Ġprint | (" | Hello | , | ĠWorld | !" | ) | Ċ | # | ĠPrint | Ġit | Ċ | say | _ | hello | () | Ċ |
Note:
- The tokenizer keeps most of the words but splits indentations into several consecutive spaces.
- The training corpus for the tokenizer mostly contained text where consecutive spaces are rare.
- The BPE model does not include a specific token for indentation, meaning it is not well suited for Python code.
Training a Tokenizer
- A tokenizer learns which letter combinations are the most frequent in a target corpus.
- The corpus does not need to be very large, just representative of the target domain.
- We can train a tokenizer on a target corpus using the tokenizer.train_new_from_iterator() method.
- We need to specify a target vocab size and prepare an iterator to supply lists of input strings.
- The tokenizer might store unusual character sequences depending on the vocab size and the exact texts in the corpus.
Check the longest words in the GPT-2 tokenizer vocabulary
tokens = sorted(tokenizer.vocab.items(), key=lambda x: len(x[0]), reverse=True)
pd.DataFrame([f'{tokenizer.convert_tokens_to_string(t)}' for t, _ in tokens[:8]]).style.hide(axis='columns')
0 | ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ |
---|---|
1 | ================================================================= |
2 | —————————————————————- |
3 | ================================================================ |
4 | ________________________________________________________________ |
5 | —————————————————————- |
6 | ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ |
7 | ………………………………………………………. |
Note: These tokens look like separator lines used on forums.
Check the least frequent words
tokens = sorted(tokenizer.vocab.items(), key=lambda x: x[1], reverse=True)
pd.DataFrame([f'{tokenizer.convert_tokens_to_string(t)}' for t, _ in tokens[:12]])
0 | |
---|---|
0 | <|endoftext|> |
1 | gazed |
2 | informants |
3 | Collider |
4 | regress |
5 | ominated |
6 | amplification |
7 | Compar |
8 | ….” |
9 | (/ |
10 | Commission |
11 | Hitman |
Note:
- The <|endoftext|> token specifies the end of a text sequence and is not from the training corpus.
- The model has to learn an associated word embedding for each token.
- This tokenizer embeds some highly time- and space-specific knowledge of the world by granting these words separate tokens.
- Overly specific tokens can indicate the target vocab size is too large or that the corpus contains peculiar tokens.
- We don’t want the embedding matrix to contain too many noisy words.
from tqdm.auto import tqdm
Train a fresh tokenizer on 100,000 documents
length = 100000
dataset_name = 'transformersbook/codeparrot-train'
dataset = load_dataset(dataset_name, split="train", streaming=True)
iter_dataset = iter(dataset)

def batch_iterator(batch_size=10):
    for _ in tqdm(range(0, length, batch_size)):
        yield [next(iter_dataset)['content'] for _ in range(batch_size)]

new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(),
                                                  vocab_size=12500,
                                                  initial_alphabet=base_vocab)
Examine the first tokens added by the BPE algorithm
tokens = sorted(new_tokenizer.vocab.items(), key=lambda x: x[1], reverse=False)
pd.DataFrame([f'{tokenizer.convert_tokens_to_string(t)}' for t, _ in tokens[257:280]]).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | se | in | re | on | te | or | st | de | th | le | = | lf | self | me | al |
Note:
- The first tokens added include various standard levels of indentation and whitespace, as well as short, common Python tokens such as self, or, and in.
- The BPE algorithm is working as intended.
Examine the last tokens added by the BPE algorithm
pd.DataFrame([f'{new_tokenizer.convert_tokens_to_string(t)}' for t,_ in tokens[-12:]]).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | capt | embedded | regarding | Bundle | 355 | recv | dmp | vault | Mongo | possibly | implementation | Matches |
Note:
- There are still some relatively common words like the recv method.
- There are also some more noisy words potentially from comments.
Test the custom tokenizer on the sample code
pd.DataFrame(new_tokenizer(python_code).tokens()).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | def | Ġs | ay | _ | hello | (): | ĊĠĠĠ | Ġprint | (" | Hello | , | ĠWor | ld | !") | Ċ | # | ĠPrint | Ġit | Ċ | s | ay | _ | hello | () | Ċ |
Note: The tokenizer splits common English words like “World” and “say.”
import keyword
keyword
- Documentation
- Determine if a string is a keyword or soft keyword.
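As a quick illustration of the module, the sketch below checks a few strings with the standard library only. The soft-keyword list was added in Python 3.9 and its contents vary by version, so the lookup is guarded; the example strings are arbitrary.

```python
import keyword

# Hard keywords are reserved everywhere in Python source
print(keyword.iskeyword("nonlocal"))
print(keyword.iskeyword("print"))  # a built-in function, not a keyword

# Soft keywords are only reserved in specific contexts.
# `softkwlist` exists from Python 3.9 on, so fall back to an empty list.
soft_keywords = getattr(keyword, "softkwlist", [])
print(soft_keywords)
```

The number of entries in keyword.kwlist also depends on the interpreter version, so the keyword count reported in these notes may differ on other Python releases.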
Check if all the Python reserved words are in the vocabulary
print(f'There are in total {len(keyword.kwlist)} Python keywords.')
for keyw in keyword.kwlist:
    if keyw not in new_tokenizer.vocab:
        print(f'No, keyword `{keyw}` is not in the vocabulary')
There are in total 36 Python keywords.
No, keyword `__peg_parser__` is not in the vocabulary
No, keyword `await` is not in the vocabulary
No, keyword `finally` is not in the vocabulary
No, keyword `nonlocal` is not in the vocabulary
Note: Several frequent keywords like “finally” are not in the vocabulary.
Reset random seed
set_seed(1)
Train a tokenizer using a larger target vocab size and dataset sample
length = 200000
new_tokenizer_larger = tokenizer.train_new_from_iterator(batch_iterator(),
    vocab_size=32768, initial_alphabet=base_vocab)
Check the last tokens added
tokens = sorted(new_tokenizer_larger.vocab.items(), key=lambda x: x[1],
                reverse=False)
pd.DataFrame([f'{tokenizer.convert_tokens_to_string(t)}' for t, _ in tokens[-12:]]).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 组 | typically | ARGIN | Termination | StaticText | interesting | Circular | combinatorics | )([ | 969 | EAR | Gap |
Note: The group of least-frequent tokens does not contain any Python keywords.
Test the new tokenizer on the sample code
pd.DataFrame(new_tokenizer_larger(python_code).tokens()).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | def | Ġsay | _ | hello | (): | ĊĠĠĠ | Ġprint | (“ | Hello | , | ĠWorld | !“) | Ċ | # | ĠPrint | Ġit | Ċ | say | _ | hello | () | Ċ |
Note: The new tokenizer keeps the indents in the vocabulary and does not split common English words.
for keyw in keyword.kwlist:
    if keyw not in new_tokenizer_larger.vocab:
        print(f'No, keyword `{keyw}` is not in the vocabulary')
No, keyword `__peg_parser__` is not in the vocabulary
No, keyword `nonlocal` is not in the vocabulary
Note:
- The new tokenizer vocabulary is still missing two rare Python keywords, neither of which is relevant for most Python code.
- The __peg_parser__ keyword is an easter egg for the new PEG parser and will not be in Python 3.10.
- The nonlocal keyword causes listed identifiers to refer to previously bound variables in the nearest enclosing scope, excluding globals.
- The new tokenizer is more efficient than the standard GPT-2 tokenizer as it uses fewer tokens to encode a given code sample.
Disable Tokenizers Parallelism
%env TOKENIZERS_PARALLELISM=false
env: TOKENIZERS_PARALLELISM=false
Saving a Custom Tokenizer on the Hub
Log into Hugging Face account
from huggingface_hub import notebook_login
notebook_login()
Login successful
Your token has been saved to /home/innom-dt/.huggingface/token
Push custom tokenizer to Hugging Face Hub
model_ckpt = "codeparrot"
# org = "transformersbook"
new_tokenizer_larger.push_to_hub(model_ckpt)
'https://huggingface.co/cj-mills/codeparrot/commit/97c7905ef55cb4139e88f9b9d17225c372fc8f55'
Load the custom tokenizer from the Hub repository
# reloaded_tokenizer = AutoTokenizer.from_pretrained(org + "/" + model_ckpt)
reloaded_tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
pd.DataFrame(reloaded_tokenizer(python_code).tokens()).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | def | Ġsay | _ | hello | (): | ĊĠĠĠ | Ġprint | (“ | Hello | , | ĠWorld | !“) | Ċ | # | ĠPrint | Ġit | Ċ | say | _ | hello | () | Ċ |
Push the smaller tokenizer to Hugging Face Hub
new_tokenizer.push_to_hub(model_ckpt + "-small-vocabulary")
'https://huggingface.co/cj-mills/codeparrot-small-vocabulary/commit/b4efe8c9692ce772175b97b01cffc9f1924ae706'
Training a Model from Scratch
A Tale of Pretraining Objectives
- The large-scale pretraining corpus allows us to tackle several downstream tasks.
- The selected task will influence which pretraining objective we choose.
Causal language modeling
- Causal language modeling is a self-supervised approach that does not require annotations.
- Code autocompletion is a directly related downstream task.
- We can provide a model with the beginning of a code sample and have it generate possible completions.
- A decoder-only architecture like the GPT family is usually best suited for this task.
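The objective can be illustrated without a model: at every position the target is simply the next token. The sketch below uses a hypothetical token list and plain Python lists.

```python
# Hypothetical token sequence for a short code sample
tokens = ["def", "Ġsay", "_", "hello", "():"]

# Inputs are all tokens except the last; targets are the sequence shifted left by one
inputs, targets = tokens[:-1], tokens[1:]

for i, target in enumerate(targets):
    context = inputs[: i + 1]  # the model only sees tokens up to position i
    print(f"{context} -> {target}")
```

Hugging Face causal language modeling heads perform this shift internally, which is why the training loop later in these notes passes the same tensor as both input and labels.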
Masked language modeling
- Masked language modeling (also called denoising) is a self-supervised training objective.
- We can provide a model with a noisy code sample (e.g., by replacing a code instruction with a random or masked word) and have it reconstruct the original clean sequence.
- Masked language modeling is not directly related to a downstream task like autocompletion, but it is a practical pretraining objective for learning general representations.
- We can combine masked language modeling with fine-tuning the model on a downstream task.
- Encoder architectures are best suited to this pretraining objective.
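A minimal sketch of the corruption step, assuming a simple scheme where each token is independently replaced by a mask token. The function name, the mask string, and the sample tokens are hypothetical; real implementations work on token ids and often mix masking with random replacement.

```python
import random

def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Return (noisy_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    noisy, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            noisy.append(mask_token)
            labels.append(tok)    # the model must reconstruct this token
        else:
            noisy.append(tok)
            labels.append(None)   # position is ignored by the loss
    return noisy, labels

tokens = "def say_hello ( ) : print ( 'Hello' )".split()
noisy, labels = mask_tokens(tokens, mask_prob=0.3)
print(noisy)
print(labels)
```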
Sequence-to-sequence training
- Sequence-to-sequence training is a supervised learning objective where one category of text serves as the input and another serves as the labels.
- We can use a heuristic like regular expressions to separate comments or docstrings from code and build a large-scale annotated dataset of code-comment pairs.
- We can then use this dataset to train a model to translate comments into code or vice versa.
- Documentation generation from code and code generation from comments are directly related downstream tasks.
- Encoder-decoder architectures are best suited to sequence-to-sequence objectives.
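One way to build such pairs is sketched below. Instead of the regular-expression heuristic mentioned above, it swaps in the standard-library ast module to pull docstrings out of function definitions. The helper name code_docstring_pairs is hypothetical, and ast.unparse requires Python 3.9 or later.

```python
import ast

def code_docstring_pairs(source):
    """Yield (function source without docstring, docstring) pairs."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node)
        # Skip functions with no docstring or with nothing besides the docstring
        if doc is None or len(node.body) < 2:
            continue
        node.body = node.body[1:]  # drop the docstring expression
        yield ast.unparse(node), doc

sample = '''
def area(a, b):
    """Return the area of the rectangle."""
    return a * b
'''
pairs = list(code_docstring_pairs(sample))
for code, doc in pairs:
    print(repr(doc), "->", repr(code))
```

Either element of a pair can serve as the input, depending on whether the model should write documentation from code or code from documentation.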
Initializing the Model
NOTE: In the following code block, a large GPT-2 checkpoint is loaded into memory. On platforms like Colab and Kaggle, this can cause the instance to crash due to insufficient RAM or GPU memory. You can still run the example if you use the small checkpoint by replacing the configuration with config = AutoConfig.from_pretrained("gpt2", vocab_size=len(tokenizer)).
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
Instantiate a tokenizer using the custom checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
Start with the hyperparameters for training the 1.5 billion-parameter GPT-2 variant
config = AutoConfig.from_pretrained("gpt2-xl", vocab_size=len(tokenizer))
config
GPT2Config {
"_name_or_path": "gpt2-xl",
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 1600,
"n_head": 25,
"n_inner": null,
"n_layer": 48,
"n_positions": 1024,
"output_past": true,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.1,
"scale_attn_by_inverse_layer_idx": false,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.18.0",
"use_cache": true,
"vocab_size": 32768
}
Free unoccupied cached memory
import torch
torch.cuda.empty_cache()
Initialize a GPT-2 XL model using the custom tokenizer
model = AutoModelForCausalLM.from_config(config)
Check the model size
print(f'GPT-2 (xl) size: {model_size(model)/1000**2:.1f}M parameters')
GPT-2 (xl) size: 1529.6M parameters
Note: Large models are generally more efficient to train as long as the dataset is reasonably large.
!git lfs install
Updated Git hooks.
Git LFS initialized.
Save the newly initialized model to the Hub
model.save_pretrained("models/" + model_ckpt + "-large", push_to_hub=True)
OSError: EOF
error: failed to push some refs to 'https://user:[email protected]/cj-mills/codeparrot-large'
Initialize a smaller GPT-2 variant using the custom tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
config_small = AutoConfig.from_pretrained("gpt2", vocab_size=len(tokenizer))
model_small = AutoModelForCausalLM.from_config(config_small)
Check smaller model size
print(f'GPT-2 size: {model_size(model_small)/1000**2:.1f}M parameters')
GPT-2 size: 111.0M parameters
Push the smaller model to the Hub
model_small.save_pretrained("models/" + model_ckpt + "-small", push_to_hub=True)
Implementing the Dataloader
- We want to supply our model with sequences that fill its context length for maximal efficiency.
- Some code examples might be shorter or longer than the 1,024 token context length.
- We can concatenate several examples to create a long sequence using the EOS token as a separator.
- We then split this sequence into equally sized chunks that fill the context length.
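The packing idea from the list above can be sketched with plain lists of token ids. The function name pack_into_chunks and the toy ids are hypothetical, and 0 stands in for the EOS token id.

```python
def pack_into_chunks(tokenized_examples, eos_id, seq_length):
    """Concatenate token-id lists with an EOS separator and cut into full chunks."""
    all_ids = []
    for ids in tokenized_examples:
        all_ids.extend(ids + [eos_id])
    # Keep only chunks that completely fill the context; the remainder is dropped
    return [all_ids[i:i + seq_length]
            for i in range(0, len(all_ids) - seq_length + 1, seq_length)]

# Hypothetical token ids for three short code samples
examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
chunks = pack_into_chunks(examples, eos_id=0, seq_length=4)
print(chunks)
```

Note that a chunk can start or end in the middle of a sample; the EOS token is what tells the model where one sample stops and the next begins.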
input_characters = number_of_sequences * sequence_length * characters_per_token
- input_characters: the number of characters in the string input to the tokenizer
- number_of_sequences: the number of (truncated) sequences returned by the tokenizer
- sequence_length: the number of tokens per sequence returned by the tokenizer
- characters_per_token: the average number of characters per output token that we first need to estimate
Estimate the average character length per token
examples, total_characters, total_tokens = 500, 0, 0
dataset = load_dataset('transformersbook/codeparrot-train', split='train',
                       streaming=True)

for _, example in tqdm(zip(range(examples), iter(dataset)), total=examples):
    total_characters += len(example['content'])
    total_tokens += len(tokenizer(example['content']).tokens())

characters_per_token = total_characters / total_tokens
print(characters_per_token)
3.621530410894045
Note: We’ll round this to \(3.6\).
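As a worked example of the character budget, the snippet below plugs the rounded estimate into the formula. The value num_of_sequences = 10 is the one used when the dataset is instantiated later in these notes; 1024 is the class default.

```python
seq_length = 1024
chars_per_token = 3.6  # rounded estimate

def buffer_size_in_characters(num_of_sequences):
    """Characters to collect before tokenizing, per the formula above."""
    return num_of_sequences * seq_length * chars_per_token

small_buffer = buffer_size_in_characters(10)
default_buffer = buffer_size_in_characters(1024)
print(f"{small_buffer:.0f}")
print(f"{default_buffer:.0f}")
```

The first value matches the 36864 threshold shown in the "Fill buffer" log output further down.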
import torch
from torch.utils.data import IterableDataset
Define an IterableDataset class for preparing constant-length inputs
class ConstantLengthDataset(IterableDataset):

    def __init__(self, tokenizer, dataset, seq_length=1024,
                 num_of_sequences=1024, chars_per_token=3.6):
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.eos_token_id
        self.dataset = dataset
        self.seq_length = seq_length
        self.input_characters = num_of_sequences * seq_length * chars_per_token

    def __iter__(self):
        iterator = iter(self.dataset)
        more_examples = True
        while more_examples:
            buffer, buffer_len = [], 0
            while True:
                # Check if the buffer is full
                if buffer_len >= self.input_characters:
                    m = f"Buffer full: {buffer_len}>={self.input_characters:.0f}"
                    print(m)
                    break
                # Try to add the next code sample to the buffer
                try:
                    m = f"Fill buffer: {buffer_len}<{self.input_characters:.0f}"
                    print(m)
                    buffer.append(next(iterator)["content"])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    # Reset iterator
                    iterator = iter(self.dataset)

            all_token_ids = []
            # Tokenize the code samples in the buffer
            tokenized_inputs = self.tokenizer(buffer, truncation=False)
            # Concatenate the tokenized code samples
            for tokenized_input in tokenized_inputs['input_ids']:
                all_token_ids.extend(tokenized_input + [self.concat_token_id])
            # Split the sequence into equally sized chunks
            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i : i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    yield torch.tensor(input_ids)
Note: We don’t need attention masks here since all sequences precisely fill the context length of 1024 tokens.
Prepare the constant-length dataset
shuffled_dataset = dataset.shuffle(buffer_size=100)
constant_length_dataset = ConstantLengthDataset(tokenizer, shuffled_dataset,
                                                num_of_sequences=10)
Note: We can’t shuffle iterable datasets as a whole, so we need to use a buffer instead.
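The general idea of buffer-based shuffling can be sketched as a small generator. This illustrates the technique only and is not necessarily how the datasets library implements shuffle; the function name and seed handling are assumptions.

```python
import random

def buffered_shuffle(iterable, buffer_size, seed=0):
    """Approximately shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain what is left once the stream is exhausted
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4))
print(shuffled)
```

A larger buffer gives a better approximation of a full shuffle at the cost of more memory, since an element can only move as far as the buffer allows.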
Verify the dataset yields equal length chunks
dataset_iterator = iter(constant_length_dataset)

lengths = [len(b) for _, b in zip(range(5), dataset_iterator)]
print(f"Lengths of the sequences: {lengths}")
Fill buffer: 0<36864
Fill buffer: 4344<36864
Fill buffer: 5460<36864
Fill buffer: 7467<36864
Fill buffer: 13812<36864
Fill buffer: 16142<36864
Fill buffer: 17571<36864
Fill buffer: 25693<36864
Fill buffer: 27359<36864
Fill buffer: 28903<36864
Fill buffer: 32076<36864
Buffer full: 49996>=36864
Lengths of the sequences: [1024, 1024, 1024, 1024, 1024]
Defining the Training Loop
- Even a modern GPU can’t train a model at GPT-2 scale in a reasonable time on its own.
- We need to use data parallelism to utilize several GPUs for training.
- The Hugging Face Accelerate library makes distributed training and changing the underlying hardware for training easier.
- Hugging Face Accelerate provides an API to make training scripts run with mixed precision and in any distributed setting.
- The same code can run seamlessly on your local machine for debugging and a beefy training cluster for a final training run.
from argparse import Namespace
Define the hyperparameters
# Commented parameters correspond to the small model
config = {"train_batch_size": 2, # 12
          "valid_batch_size": 2, # 12
          "weight_decay": 0.1,
          "shuffle_buffer": 1000,
          "learning_rate": 2e-4, # 5e-4
          "lr_scheduler_type": "cosine",
          "num_warmup_steps": 750, # 2000
          "gradient_accumulation_steps": 16, # 1
          "max_train_steps": 50000, # 150000
          "max_eval_steps": -1,
          "seq_length": 1024,
          "seed": 1,
          "save_checkpoint_steps": 50000} # 15000
args = Namespace(**config)
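To see what these settings imply, the sketch below computes how many sequences and tokens contribute to one optimizer update under data parallelism. The helper name is hypothetical, and num_processes=16 is an assumption taken from the Accelerate configuration table later in these notes.

```python
def tokens_per_update(batch_size, grad_accum_steps, num_processes, seq_length=1024):
    """Sequences and tokens consumed per optimizer step across all workers."""
    sequences = batch_size * grad_accum_steps * num_processes
    return sequences, sequences * seq_length

large = tokens_per_update(batch_size=2, grad_accum_steps=16, num_processes=16)
small = tokens_per_update(batch_size=12, grad_accum_steps=1, num_processes=16)
print(large)  # large-model settings
print(small)  # small-model settings (the commented values)
```

Gradient accumulation lets the large model reach a big effective batch even though only two sequences fit on each GPU at a time.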
from torch.utils.tensorboard import SummaryWriter
import logging
import wandb
logging.getLogger()
- Documentation
- Create a Logger object.
torch.utils.tensorboard.writer.SummaryWriter
- Documentation
- Write entries directly to event files for TensorBoard
wandb
- GitHub Repository
- Documentation
- A tool for visualizing and tracking machine learning experiments.
Define a method to initialize the loggers for the training process
def setup_logging(project_name):
    logger = logging.getLogger(__name__)
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO, handlers=[
        logging.FileHandler(f"log/debug_{accelerator.process_index}.log"),
        logging.StreamHandler()])
    if accelerator.is_main_process: # We only want to set up logging once
        wandb.init(project=project_name, config=args)
        run_name = wandb.run.name
        tb_writer = SummaryWriter()
        tb_writer.add_hparams(vars(args), {'0': 0})
        logger.setLevel(logging.INFO)
        datasets.utils.logging.set_verbosity_debug()
        transformers.utils.logging.set_verbosity_info()
    else:
        tb_writer = None
        run_name = ''
        logger.setLevel(logging.ERROR)
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()
    return logger, tb_writer, run_name
Note:
- Each worker gets a unique accelerator.process_index, which we use with the FileHandler to write the logs of each worker to an individual file.
- We’ll use the unique run_name to name our experiment branch on the Hub.
Define function to log metrics with TensorBoard and Weights and Biases
def log_metrics(step, metrics):
    logger.info(f"Step {step}: {metrics}")
    if accelerator.is_main_process:
        wandb.log(metrics)
        [tb_writer.add_scalar(k, v, step) for k, v in metrics.items()]
from torch.utils.data.dataloader import DataLoader
Define a function to create dataloaders for the training and validation sets
def create_dataloaders(dataset_name):
    train_data = load_dataset(dataset_name+'-train', split="train",
                              streaming=True)
    train_data = train_data.shuffle(buffer_size=args.shuffle_buffer,
                                    seed=args.seed)
    valid_data = load_dataset(dataset_name+'-valid', split="validation",
                              streaming=True)

    train_dataset = ConstantLengthDataset(tokenizer, train_data,
                                          seq_length=args.seq_length)
    valid_dataset = ConstantLengthDataset(tokenizer, valid_data,
                                          seq_length=args.seq_length)

    train_dataloader = DataLoader(train_dataset, batch_size=args.train_batch_size)
    eval_dataloader = DataLoader(valid_dataset, batch_size=args.valid_batch_size)
    return train_dataloader, eval_dataloader
Note: Hugging Face Accelerate takes care of distributing batches to each worker.
Define a helper function to differentiate the parameters that should receive weight decay
- Biases and LayerNorm weights are generally not subject to weight decay.
def get_grouped_params(model, no_decay=["bias", "LayerNorm.weight"]):
    params_with_wd, params_without_wd = [], []
    for n, p in model.named_parameters():
        if any(nd in n for nd in no_decay):
            params_without_wd.append(p)
        else:
            params_with_wd.append(p)
    return [{'params': params_with_wd, 'weight_decay': args.weight_decay},
            {'params': params_without_wd, 'weight_decay': 0.0}]
Define a function to evaluate the model on the validation set
def evaluate():
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(batch, labels=batch)
        loss = outputs.loss.repeat(args.valid_batch_size)
        losses.append(accelerator.gather(loss))
        if args.max_eval_steps > 0 and step >= args.max_eval_steps: break
    loss = torch.mean(torch.cat(losses))
    try:
        perplexity = torch.exp(loss)
    except OverflowError:
        perplexity = torch.tensor(float("inf"))
    return loss.item(), perplexity.item()
Note:
- The perplexity measures how well the model’s output probability distributions predict the targeted tokens.
- A lower perplexity corresponds to better performance.
- We compute the perplexity by exponentiating the cross-entropy loss from the model’s output.
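A small numeric sketch of that relationship, using only the standard library. The probabilities are made-up values a model might assign to the correct next tokens.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the target tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Assigning probability 0.25 to every correct token is as uncertain
# as a uniform guess over four options
uniform_ppl = perplexity([0.25, 0.25, 0.25])
confident_ppl = perplexity([0.9, 0.8, 0.95])
print(uniform_ppl, confident_ppl)
```

A perplexity of 1 would mean the model always puts all probability on the correct token.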
Training session
# Reset random seed
set_seed(args.seed)

# Accelerator
accelerator = Accelerator()
samples_per_step = accelerator.state.num_processes * args.train_batch_size

# Logging
logger, tb_writer, run_name = setup_logging(project_name.split("/")[1])
logger.info(accelerator.state)

# Load model and tokenizer
if accelerator.is_main_process:
    # Check out a new branch for the current run
    hf_repo = Repository("./", clone_from=project_name, revision=run_name)
model = AutoModelForCausalLM.from_pretrained("./", gradient_checkpointing=True)
tokenizer = AutoTokenizer.from_pretrained("./")

# Load dataset and dataloader
train_dataloader, eval_dataloader = create_dataloaders(dataset_name)

# Prepare the optimizer and learning rate scheduler
optimizer = AdamW(get_grouped_params(model), lr=args.learning_rate)
lr_scheduler = get_scheduler(name=args.lr_scheduler_type, optimizer=optimizer,
                             num_warmup_steps=args.num_warmup_steps,
                             num_training_steps=args.max_train_steps,)
def get_lr():
    return optimizer.param_groups[0]['lr']

# Prepare everything with our `accelerator` (order of args is not important)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader)

# Train model
model.train()
completed_steps = 0
for step, batch in enumerate(train_dataloader, start=1):
    loss = model(batch, labels=batch).loss
    log_metrics(step, {'lr': get_lr(), 'samples': step*samples_per_step,
                       'steps': completed_steps, 'loss/train': loss.item()})
    loss = loss / args.gradient_accumulation_steps
    accelerator.backward(loss)
    # Use gradient accumulation to imitate larger batch sizes
    if step % args.gradient_accumulation_steps == 0:
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        completed_steps += 1
    if step % args.save_checkpoint_steps == 0:
        logger.info('Evaluating and saving model checkpoint')
        # Evaluate the model every time we save a new checkpoint
        eval_loss, perplexity = evaluate()
        log_metrics(step, {'loss/eval': eval_loss, 'perplexity': perplexity})
        # Synchronize the model before storing the latest checkpoint
        accelerator.wait_for_everyone()
        unwrapped_model = accelerator.unwrap_model(model)
        if accelerator.is_main_process:
            # Save the latest checkpoint to disk
            unwrapped_model.save_pretrained("./")
            # Push the latest checkpoint to the Hub
            hf_repo.push_to_hub(commit_message=f'step {step}')
        model.train()
    if completed_steps >= args.max_train_steps:
        break

# Evaluate and save the last checkpoint
logger.info('Evaluating and saving model after training')
eval_loss, perplexity = evaluate()
log_metrics(step, {'loss/eval': eval_loss, 'perplexity': perplexity})
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    unwrapped_model.save_pretrained("./")
    hf_repo.push_to_hub(commit_message=f'final model')
Note:
- There are several approaches to distributed training depending on the model size and volume of data.
- Hugging Face Accelerate uses DistributedDataParallel (DDP).
- DDP allows you to train models faster with larger batch sizes that wouldn’t fit into any single GPU.
- Hugging Face Accelerate prepares batches of data and sends them to the workers.
- Each worker consists of a GPU and calculates the loss and its accumulated gradients from forward and backward passes with a local copy of the model.
- We average the gradients from each node with a reduce pattern and send the average back to each worker.
- We apply the gradients using the optimizer on each node to avoid transferring copies of the large models between nodes.
- We repeat the process after updating the models for each worker.
- DDP requires that the model fits on a single GPU.
- Fitting larger networks into memory.
- Model Parallelism
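The gradient-averaging step can be simulated in a few lines of plain Python. This is a conceptual sketch only: real DDP performs the reduction with collective communication between GPUs, and the function names, gradients, and learning rate here are made up.

```python
def all_reduce_mean(worker_grads):
    """Average per-worker gradients element-wise, as a reduce step would."""
    n = len(worker_grads)
    return [sum(vals) / n for vals in zip(*worker_grads)]

def sgd_step(weights, grad, lr=0.1):
    """Plain gradient descent update applied locally on one worker."""
    return [w - lr * g for w, g in zip(weights, grad)]

# Every worker starts from an identical copy of the model
weights = [1.0, -2.0]
replicas = [list(weights) for _ in range(4)]

# Hypothetical gradients, one list per worker, computed on different batches
worker_grads = [[0.5, 0.0], [0.25, 0.5], [0.0, 1.0], [0.25, 0.5]]
avg = all_reduce_mean(worker_grads)

# Each worker applies the same averaged gradient, so the replicas stay in sync
replicas = [sgd_step(r, avg) for r in replicas]
print(avg, replicas[0])
```

Only gradients travel between workers; the model weights never need to be transferred because every replica applies the identical update.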
The Training Run
- We can save the training steps to a script and push them to a repository on the Hub.
- We can then execute the training script on a training server using the accelerate launch command.
git clone https://huggingface.co/transformersbook/codeparrot
cd codeparrot
pip install -r requirements.txt
wandb login
accelerate config
accelerate launch codeparrot_training.py
- The accelerate config command guides you through setting up the infrastructure.
- Hugging Face uses a2-megagpu-16g instances on Google Cloud for experiments (pricing).
- Reducing 90% in costs with Spot VMs for Machine Learning on Google Kubernetes Engine in GCP
Configuration used to train CodeParrot models
Setting | Value |
---|---|
Compute Environment? | multi-gpu |
How many machines? | 1 |
DeepSpeed? | No |
How many processes? | 16 |
Use FP16? | Yes |
- Running the training script with the above settings takes about 24 hours for the small model and seven days for the large model.
- Test the code on smaller infrastructure before using expensive cloud instances.
- We can merge the experiment branch back into the main one after training completes.
git checkout main
git merge <RUN_NAME>
git push
Results and Analysis
- The training loss and validation perplexity should go down continuously during training.
- The large model converges with fewer processed tokens, but training takes longer overall.
- Qualitative analysis involves looking at concrete examples and trying to better understand in which cases the model succeeds and fails.
- Quantitative analysis involves evaluating model performance statistically on a large set of test cases.
from transformers import pipeline, set_seed
Wrap the small model in a text generation pipeline
model_ckpt = 'transformersbook/codeparrot-small'
generation = pipeline('text-generation', model=model_ckpt, device=0)
import re
from transformers import set_seed
Define a function to extract the first code block from the model output
def first_block(string):
    return re.split('\nclass|\ndef|\n#|\n@|\nprint|\nif', string)[0].rstrip()
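To see what the regular expression does, the sketch below applies the same function to a made-up model output that continues past the function body we care about. The generated string is hypothetical.

```python
import re

def first_block(string):
    return re.split('\nclass|\ndef|\n#|\n@|\nprint|\nif', string)[0].rstrip()

# Hypothetical raw output: the wanted body followed by an unrelated function
generated = "\n    return a * b\n\ndef area_of_circle(r):\n    return 3.14 * r * r"
block = first_block(generated)
print(repr(block))
```

The split only triggers on patterns at the start of a line, so indented statements inside the function body are kept.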
Define a function to print out generated code completions
def complete_code(pipe, prompt, max_length=64, num_completions=4, seed=1):
    set_seed(seed)
    gen_kwargs = {"temperature":0.4, "top_p":0.95, "top_k":0, "num_beams":1,
                  "do_sample":True,}
    # Use the pipeline passed in as `pipe` rather than the global `generation`
    code_gens = pipe(prompt, num_return_sequences=num_completions,
                     max_length=max_length, **gen_kwargs)
    code_strings = []
    for code_gen in code_gens:
        generated_code = first_block(code_gen['generated_text'][len(prompt):])
        code_strings.append(generated_code)
    print(('\n'+'='*80 + '\n').join(code_strings))
Test the model on a simple task
prompt = '''def area_of_rectangle(a: float, b: float):
    """Return the area of the rectangle."""'''
complete_code(generation, prompt)
return math.sqrt(a * b)
================================================================================
return a * b / 2.0
================================================================================
return a * b
================================================================================
return a * b / 2.0
Note: The generated outputs look convincing, but not all of them are correct.
Test the model on a more complex task
prompt = '''def get_urls_from_html(html):
    """Get all embedded URLs in a HTML string."""'''
complete_code(generation, prompt)
if not html:
return []
return [url for url in re.findall(r'<a href="(/[^/]+/[^"]+?)">', html)]
================================================================================
return [url for url in re.findall(r'<a href="(.*?)"', html)
if url]
================================================================================
return [url for url in re.findall(r'<a href="(.*?)"', html)]
================================================================================
return re.findall(r'<a href="([^"]+)">', html)
Note: The second attempt is not quite right, but the other three generations are correct.
Test the generated code
import requests
def get_urls_from_html(html):
    return [url for url in re.findall(r'<a href="(.*?)"', html) if url]

pd.DataFrame(get_urls_from_html(requests.get('https://hf.co/').text))
0 | |
---|---|
0 | https://huggingface.co/bigscience/tr11-176B-ml-logs |
1 | https://github.com/huggingface/transformers |
2 | /join |
3 | /tasks |
4 | https://huggingface.co/transformers |
5 | /inference-api |
6 | /distilbert-base-uncased |
7 | /dbmdz/bert-large-cased-finetuned-conll03-english |
8 | https://bigscience.huggingface.co/ |
9 | https://bigscience.huggingface.co/blog/t0 |
10 | https://medium.com/huggingface/distilbert-8cf3380435b5 |
11 | https://arxiv.org/abs/1811.06031 |
12 | https://arxiv.org/abs/1803.10631 |
13 | /coref |
14 | https://transformer.huggingface.co/ |
Note: The URLs starting with https are external pages, while the others are subpages of the main website.
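If absolute URLs are needed, the relative paths can be resolved against the site root with the standard library. The base URL below is an assumption (hf.co redirects to huggingface.co), and the sample paths are taken from the table above.

```python
from urllib.parse import urljoin

base = "https://huggingface.co/"  # assumed site root
urls = ["/join", "/tasks", "https://github.com/huggingface/transformers"]

# Relative paths are resolved against the base; absolute URLs pass through unchanged
resolved = [urljoin(base, u) for u in urls]
print(resolved)
```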
Wrap the large model in a text generation pipeline
model_ckpt = 'transformersbook/codeparrot'
generation = pipeline('text-generation', model=model_ckpt, device=0)
Try to translate a function from pure Python to NumPy using the large model
prompt = '''# a function in native python:
def mean(a):
    return sum(a)/len(a)
# the same function using numpy:
import numpy as np
def mean(a):'''
complete_code(generation, prompt, max_length=64)
return np.mean(a)
================================================================================
return sum(a)/len(a)
================================================================================
return np.mean(a)
================================================================================
return sum(a)/len(a)
Note: It worked.
Try building a scikit-learn model
prompt = '''X = np.random.randn(100, 100)
y = np.random.randint(0, 1, 100)
# fit random forest classifier with 20 estimators'''
complete_code(generation, prompt, max_length=96)
reg = DummyRegressor()
forest = RandomForestClassifier(n_estimators=20)
forest.fit(X, y)
================================================================================
clf = ExtraTreesClassifier(n_estimators=100, max_features='sqrt')
clf.fit(X, y)
================================================================================
clf = RandomForestClassifier(n_estimators=20, n_jobs=n_jobs, random_state=1)
clf.fit(X, y)
================================================================================
clf = RandomForestClassifier(n_estimators=20, n_jobs=n_jobs, random_state=1)
clf.fit(X, y)
Note:
- The second attempt used an extra-trees classifier, but the other three generated what we asked.
- The BLEU score is not well suited for measuring the quality of generated code as it would punish a generation that deviates from the reference naming.
- The success of a program does not depend on the naming scheme as long as it is consistent.
- We can use traditional software development methods like unit tests to measure the quality of generated code.
- OpenAI evaluated Codex models by running several code generations for coding tasks through some unit tests and calculating the fraction that passes the tests.
- Evaluating Large Language Models Trained on Code
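A minimal sketch of test-based evaluation, applied to the four area_of_rectangle completions generated by the small model earlier in these notes. The helper name and the test cases are made up, and this simple pass fraction is a stand-in for the pass@k metric reported in the Codex paper, not a reimplementation of it.

```python
def passes_tests(prompt, completion, tests):
    """Execute prompt + completion, then run assertion-based tests against it."""
    namespace = {}
    try:
        # NOTE: generated code is untrusted; run it in a sandbox in practice
        exec(prompt + completion, namespace)
        exec(tests, namespace)
    except Exception:
        return False
    return True

prompt = ('def area_of_rectangle(a: float, b: float):\n'
          '    """Return the area of the rectangle."""')
completions = [
    "\n    return math.sqrt(a * b)",
    "\n    return a * b / 2.0",
    "\n    return a * b",
    "\n    return a * b / 2.0",
]
# Hypothetical unit tests for the task
tests = "assert area_of_rectangle(2, 3) == 6\nassert area_of_rectangle(0.5, 4) == 2"

results = [passes_tests(prompt, c, tests) for c in completions]
print(f"pass rate: {sum(results) / len(results):.2f}")
```

Unlike BLEU, this check is indifferent to variable names and formatting; it only rewards completions that behave correctly.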
References
- I’m Christian Mills, a deep learning consultant specializing in computer vision and practical AI implementations.
- I help clients leverage cutting-edge AI technologies to solve real-world problems.
- Learn more about me or reach out via email at [email protected] to discuss your project.