Notes on fastai Book Ch. 10
- NLP Deep Dive
- Text Preprocessing
- Training a Text Classifier
- Disinformation and Language Models
- References
import fastbook
fastbook.setup_book()
#hide
from fastbook import *
from IPython.display import display,HTML
import inspect
def print_source(obj):
for line in inspect.getsource(obj).split("\n"):
print(line)
NLP Deep Dive: RNNs
- In NLP, pretrained models are typically trained on a different type of task than your target task
- A language model is trained to predict the next word in a text (having read the ones before)
- We do not feed the model labes, we just feed it lots of text
- the model uses self-supervised learning to develop an understanding the underlying language of the text
- Self-Supervised Learning: training a model using labels that are embedded in the independent variables, rather than requireing external labels
- Self-supervised learning can also be used in other domains
- Self-supervised learning is not usually used for the model that is trained directly
- used for pretraining a model that is then used for transfer learning
- A pretrained language model is often trained using a different body of text than the one you are targeting
- can be useful to further pretrain the model on your target body of text
Universal Language Model Fine-tunine (ULMFiT)
- showed that fine-tuning a language model on on the target body of text prior to transfer learning to a classification task, resulted in significantly better predictions
Text Preprocessing
- can use an approach similar to preparing categorical variables
- Make a list of all possible levels of that categorical variable (called the vocab)
- Tokenization: convert the text to a list of words
- Replace each level with its index in the vocab
- Numericalization: the process of mapping tokens to numbers
- List all of the unique words that appear, and convert eachword into a number by looking up its index in the vocab
- Numericalization: the process of mapping tokens to numbers
- Create an embedding matrix for this contianing a row for each item in the vocab
- Language model data loader creation: fastai provides an LMDataLoader that automatically handles creating a depdendent variable that is offset from the independent variable by one token. Also handles some important details such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required
- Use this embedding matrix as the first layer of a neural network
- a dedicated embedding matrix can take as inputs the raw vocab indexes created in step 2
- Language Model Creation: need a special kind of model that can handle arbitrarily big or small input lists
- Make a list of all possible levels of that categorical variable (called the vocab)
- we first concatenate all documents in our dataset into one big long string and split it into words (or tokens)
- our independent variable will be the sequence of words starting with the first word in our long list of words and ending with the second to last.
- our dependent variable will be the sequence of words starting with the second word and ending with the last word
- our vocab will consist of a mix of common words and new words specific to target dataset
- we can use the corresponding rows in the embedding matrix for the pretrained model and only initialize new rows in the matrix for new words
Tokenization
token: one element of a list created by the tokenization process. * could be a word, part of a word, or a single character
- tokenization is an active field of research, with new and improved tokenizers coming out all the time
Approaches
- Word-based
- split a sentence on spaces, as well as applying language-specific rules to try to separate parts of meaning even when there are no spaces
- punctuation marks are typically split into separate tokens
- relies on the assumption that spaces provide a useful separation of components of meaning in a sentence
- some languages don’t have spaces or even a well-defined concept of a “word”
- Subword-based
- split words into smaller parts, based on the most commonly occurring sub-strings
- Example: “occasion” -> “o c ca sion”
- handles every human language without needing language specific algorithms to be develop
- can handle other sequences like genomic sequences or MIDI music notation
- Character-based
- split a sentence into its individual characters
Word Tokenization with fastai
- fastai provides a consistent interface to a range of tokenizers in external libraries
- The default English word tokenizer for fastai uses spaCy
from fastai.text.all import *
URLs.IMDB
'https://s3.amazonaws.com/fast-ai-nlp/imdb.tgz'
= untar_data(URLs.IMDB)
path path
Path('/home/innom-dt/.fastai/data/imdb')
fastai get_text_files
- Documentation
- Get text files in path recursively, only in folders, if specified.
get_text_files
<function fastai.data.transforms.get_text_files(path, recurse=True, folders=None)>
print_source(get_text_files)
def get_text_files(path, recurse=True, folders=None):
"Get text files in `path` recursively, only in `folders`, if specified."
return get_files(path, extensions=['.txt'], recurse=recurse, folders=folders)
print_source(get_files)
def get_files(path, extensions=None, recurse=True, folders=None, followlinks=True):
"Get all the files in `path` with optional `extensions`, optionally with `recurse`, only in `folders`, if specified."
path = Path(path)
folders=L(folders)
extensions = setify(extensions)
extensions = {e.lower() for e in extensions}
if recurse:
res = []
for i,(p,d,f) in enumerate(os.walk(path, followlinks=followlinks)): # returns (dirpath, dirnames, filenames)
if len(folders) !=0 and i==0: d[:] = [o for o in d if o in folders]
else: d[:] = [o for o in d if not o.startswith('.')]
if len(folders) !=0 and i==0 and '.' not in folders: continue
res += _get_files(p, f, extensions)
else:
f = [o.name for o in os.scandir(path) if o.is_file()]
res = _get_files(path, f, extensions)
return L(res)
= get_text_files(path, folders = ['train', 'test', 'unsup']) files
len(files)
100000
= files[0].open().read(); txt[:75] txt
'This conglomeration fails so miserably on every level that it is difficult '
fastai SpacyTokenizer
WordTokenizer
fastai.text.core.SpacyTokenizer
print_source(WordTokenizer)
class SpacyTokenizer():
"Spacy tokenizer for `lang`"
def __init__(self, lang='en', special_toks=None, buf_sz=5000):
self.special_toks = ifnone(special_toks, defaults.text_spec_tok)
nlp = spacy.blank(lang)
for w in self.special_toks: nlp.tokenizer.add_special_case(w, [{ORTH: w}])
self.pipe,self.buf_sz = nlp.pipe,buf_sz
def __call__(self, items):
return (L(doc).attrgot('text') for doc in self.pipe(map(str,items), batch_size=self.buf_sz))
first
<function fastcore.basics.first(x, f=None, negate=False, **kwargs)>
print_source(first)
<function first at 0x7fdb8da3de50>
def first(x, f=None, negate=False, **kwargs):
"First element of `x`, optionally filtered by `f`, or None if missing"
x = iter(x)
if f: x = filter_ex(x, f=f, negate=negate, gen=True, **kwargs)
return next(x, None)
coll_repr
<function fastcore.foundation.coll_repr(c, max_n=10)>
print_source(coll_repr)
def coll_repr(c, max_n=10):
"String repr of up to `max_n` items of (possibly lazy) collection `c`"
return f'(#{len(c)}) [' + ','.join(itertools.islice(map(repr,c), max_n)) + (
'...' if len(c)>max_n else '') + ']'
= WordTokenizer() spacy
spacy.buf_sz
5000
spacy.pipe
<bound method Language.pipe of <spacy.lang.en.English object at 0x7fdb6545f1c0>>
spacy.special_toks
['xxunk',
'xxpad',
'xxbos',
'xxeos',
'xxfld',
'xxrep',
'xxwrep',
'xxup',
'xxmaj']
# Wrap text in a list before feeding it to the tokenizer
= first(spacy([txt]))
toks print(coll_repr(toks, 30))
(#174) ['This','conglomeration','fails','so','miserably','on','every','level','that','it','is','difficult','to','decide','what','to','say','.','It','does',"n't",'merit','one','line',',','much','less','ten',',','but'...]
'The U.S. dollar $1 is $1.00.'])) first(spacy([
(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']
Tokenizer
fastai.text.core.Tokenizer
print_source(Tokenizer)
class Tokenizer(Transform):
"Provides a consistent `Transform` interface to tokenizers operating on `DataFrame`s and folders"
input_types = (str, list, L, tuple, Path)
def __init__(self, tok, rules=None, counter=None, lengths=None, mode=None, sep=' '):
if isinstance(tok,type): tok=tok()
store_attr('tok,counter,lengths,mode,sep')
self.rules = defaults.text_proc_rules if rules is None else rules
@classmethod
@delegates(tokenize_df, keep=True)
def from_df(cls, text_cols, tok=None, rules=None, sep=' ', **kwargs):
if tok is None: tok = WordTokenizer()
res = cls(tok, rules=rules, mode='df')
res.kwargs,res.train_setup = merge({'tok': tok}, kwargs),False
res.text_cols,res.sep = text_cols,sep
return res
@classmethod
@delegates(tokenize_folder, keep=True)
def from_folder(cls, path, tok=None, rules=None, **kwargs):
path = Path(path)
if tok is None: tok = WordTokenizer()
output_dir = tokenize_folder(path, tok=tok, rules=rules, **kwargs)
res = cls(tok, counter=load_pickle(output_dir/fn_counter_pkl),
lengths=load_pickle(output_dir/fn_lengths_pkl), rules=rules, mode='folder')
res.path,res.output_dir = path,output_dir
return res
def setups(self, dsets):
if not self.mode == 'df' or not isinstance(dsets.items, pd.DataFrame): return
dsets.items,count = tokenize_df(dsets.items, self.text_cols, rules=self.rules, **self.kwargs)
if self.counter is None: self.counter = count
return dsets
def encodes(self, o:Path):
if self.mode=='folder' and str(o).startswith(str(self.path)):
tok = self.output_dir/o.relative_to(self.path)
return L(tok.read_text(encoding='UTF-8').split(' '))
else: return self._tokenize1(o.read_text())
def encodes(self, o:str): return self._tokenize1(o)
def _tokenize1(self, o): return first(self.tok([compose(*self.rules)(o)]))
def get_lengths(self, items):
if self.lengths is None: return None
if self.mode == 'df':
if isinstance(items, pd.DataFrame) and 'text_lengths' in items.columns: return items['text_length'].values
if self.mode == 'folder':
try:
res = [self.lengths[str(Path(i).relative_to(self.path))] for i in items]
if len(res) == len(items): return res
except: return None
def decodes(self, o): return TitledStr(self.sep.join(o))
= Tokenizer(spacy)
tkn print(coll_repr(tkn(txt), 31))
(#177) ['xxbos','xxmaj','this','conglomeration','fails','so','miserably','on','every','level','that','it','is','difficult','to','decide','what','to','say','.','xxmaj','it','does',"n't",'merit','one','line',',','much','less','ten'...]
Special Tokens
- tokens that start with
xx
are special tokens - designed to make it easier for a model to recognize the important parts of a sentence
xxbos
: Indicates the beginning of a textxxmaj
: Indicates the next word begins with a capital letter (since everything is made lowercase)xxunk
: Indicates the next word is unknown
Preprocessing Rules
defaults.text_proc_rules
[<function fastai.text.core.fix_html(x)>,
<function fastai.text.core.replace_rep(t)>,
<function fastai.text.core.replace_wrep(t)>,
<function fastai.text.core.spec_add_spaces(t)>,
<function fastai.text.core.rm_useless_spaces(t)>,
<function fastai.text.core.replace_all_caps(t)>,
<function fastai.text.core.replace_maj(t)>,
<function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]
for rule in defaults.text_proc_rules:
print_source(rule)
def fix_html(x):
"Various messy things we've seen in documents"
x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace('nbsp;', ' ').replace(
'#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace('<br />', "\n").replace(
'\\"', '"').replace('<unk>',UNK).replace(' @.@ ','.').replace(' @-@ ','-').replace('...',' …')
return html.unescape(x)
def replace_rep(t):
"Replace repetitions at the character level: cccc -- TK_REP 4 c"
def _replace_rep(m):
c,cc = m.groups()
return f' {TK_REP} {len(cc)+1} {c} '
return _re_rep.sub(_replace_rep, t)
def replace_wrep(t):
"Replace word repetitions: word word word word -- TK_WREP 4 word"
def _replace_wrep(m):
c,cc,e = m.groups()
return f' {TK_WREP} {len(cc.split())+2} {c} {e}'
return _re_wrep.sub(_replace_wrep, t)
def spec_add_spaces(t):
"Add spaces around / and #"
return _re_spec.sub(r' \1 ', t)
def rm_useless_spaces(t):
"Remove multiple spaces"
return _re_space.sub(' ', t)
def replace_all_caps(t):
"Replace tokens in ALL CAPS by their lower version and add `TK_UP` before."
def _replace_all_caps(m):
tok = f'{TK_UP} ' if len(m.groups()[1]) > 1 else ''
return f"{m.groups()[0]}{tok}{m.groups()[1].lower()}"
return _re_all_caps.sub(_replace_all_caps, t)
def replace_maj(t):
"Replace tokens in Sentence Case by their lower version and add `TK_MAJ` before."
def _replace_maj(m):
tok = f'{TK_MAJ} ' if len(m.groups()[1]) > 1 else ''
return f"{m.groups()[0]}{tok}{m.groups()[1].lower()}"
return _re_maj.sub(_replace_maj, t)
def lowercase(t, add_bos=True, add_eos=False):
"Converts `t` to lowercase"
return (f'{BOS} ' if add_bos else '') + t.lower().strip() + (f' {EOS}' if add_eos else '')
Postprocessing Rules
for rule in defaults.text_postproc_rules:
print_source(rule)
def replace_space(t):
"Replace embedded spaces in a token with unicode line char to allow for split/join"
return t.replace(' ', '_')
'© Fast.ai www.fast.ai/INDEX'), 31) coll_repr(tkn(
"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"
Subword Tokenization
Process
- Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab
- Tokenize the corpus using this vocab of subword units.
= L(o.open().read() for o in files[:2000]) txts
SubwordTokenizer
fastai.text.core.SentencePieceTokenizer
print_source(SubwordTokenizer)
class SentencePieceTokenizer():#TODO: pass the special tokens symbol to sp
"SentencePiece tokenizer for `lang`"
def __init__(self, lang='en', special_toks=None, sp_model=None, vocab_sz=None, max_vocab_sz=30000,
model_type='unigram', char_coverage=None, cache_dir='tmp'):
try: from sentencepiece import SentencePieceTrainer,SentencePieceProcessor
except ImportError:
raise Exception('sentencepiece module is missing: run `pip install sentencepiece!=0.1.90,!=0.1.91`')
self.sp_model,self.cache_dir = sp_model,Path(cache_dir)
self.vocab_sz,self.max_vocab_sz,self.model_type = vocab_sz,max_vocab_sz,model_type
self.char_coverage = ifnone(char_coverage, 0.99999 if lang in eu_langs else 0.9998)
self.special_toks = ifnone(special_toks, defaults.text_spec_tok)
if sp_model is None: self.tok = None
else:
self.tok = SentencePieceProcessor()
self.tok.Load(str(sp_model))
os.makedirs(self.cache_dir, exist_ok=True)
def _get_vocab_sz(self, raw_text_path):
cnt = Counter()
with open(raw_text_path, 'r') as f:
for line in f.readlines():
cnt.update(line.split())
if len(cnt)//4 > self.max_vocab_sz: return self.max_vocab_sz
res = len(cnt)//4
while res%8 != 0: res+=1
return max(res,29)
def train(self, raw_text_path):
"Train a sentencepiece tokenizer on `texts` and save it in `path/tmp_dir`"
from sentencepiece import SentencePieceTrainer
vocab_sz = self._get_vocab_sz(raw_text_path) if self.vocab_sz is None else self.vocab_sz
spec_tokens = ['\u2581'+s for s in self.special_toks]
SentencePieceTrainer.Train(" ".join([
f"--input={raw_text_path} --vocab_size={vocab_sz} --model_prefix={self.cache_dir/'spm'}",
f"--character_coverage={self.char_coverage} --model_type={self.model_type}",
f"--unk_id={len(spec_tokens)} --pad_id=-1 --bos_id=-1 --eos_id=-1 --minloglevel=2",
f"--user_defined_symbols={','.join(spec_tokens)} --hard_vocab_limit=false"]))
raw_text_path.unlink()
return self.cache_dir/'spm.model'
def setup(self, items, rules=None):
from sentencepiece import SentencePieceProcessor
if rules is None: rules = []
if self.tok is not None: return {'sp_model': self.sp_model}
raw_text_path = self.cache_dir/'texts.out'
with open(raw_text_path, 'w') as f:
for t in progress_bar(maps(*rules, items), total=len(items), leave=False):
f.write(f'{t}\n')
sp_model = self.train(raw_text_path)
self.tok = SentencePieceProcessor()
self.tok.Load(str(sp_model))
return {'sp_model': sp_model}
def __call__(self, items):
if self.tok is None: self.setup(items)
for t in items: yield self.tok.EncodeAsPieces(t)
sentencepiece tokenizer
- GitHub Repository
- Unsupervised text tokenizer for Neural Network-based text generation.
def subword(sz):
# Initialize a tokenizer with the desired vocab size
= SubwordTokenizer(vocab_sz=sz)
sp # Generate vocab based on target body of text
sp.setup(txts)return ' '.join(first(sp([txt]))[:40])
Picking a vocab size
- provides an easy way to scale between character tokenization and word tokenization
- a smaller vocab size results in each token representing fewer characters
- an overly large vocab size results in most common words ending up in the vocab
- fewer tokens per sentence
- faster training
- less memory
- less state for the model to remember
- larger embedding matrices which require more data to learn
# Use a vocab size of 1000
1000) subword(
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=tmp/texts.out --vocab_size=1000 --model_prefix=tmp/spm --character_coverage=0.99999 --model_type=unigram --unk_id=9 --pad_id=-1 --bos_id=-1 --eos_id=-1 --minloglevel=2 --user_defined_symbols=▁xxunk,▁xxpad,▁xxbos,▁xxeos,▁xxfld,▁xxrep,▁xxwrep,▁xxup,▁xxmaj --hard_vocab_limit=false
"▁This ▁con g lo m er ation ▁fail s ▁so ▁mis er ably ▁on ▁every ▁level ▁that ▁it ▁is ▁di ff ic ul t ▁to ▁decide ▁what ▁to ▁say . ▁It ▁doesn ' t ▁me ri t ▁one ▁line ,"
Note: The _
character represents a space character in the original text
# Use a vocab size of 200
200) subword(
'▁ T h i s ▁c on g l o m er at ion ▁f a i l s ▁ s o ▁ m i s er a b ly ▁on ▁ e v er y ▁ le ve l'
# Use a vocab size of 10000
10000) subword(
"▁This ▁con g l ome ration ▁fails ▁so ▁miserably ▁on ▁every ▁level ▁that ▁it ▁is ▁difficult ▁to ▁decide ▁what ▁to ▁say . ▁It ▁doesn ' t ▁merit ▁one ▁line , ▁much ▁less ▁ten , ▁but ▁to ▁adhere ▁to ▁the ▁rules"
Numericalization with fastai
= tkn(txt)
toks print(coll_repr(tkn(txt), 31))
(#177) ['xxbos','xxmaj','this','conglomeration','fails','so','miserably','on','every','level','that','it','is','difficult','to','decide','what','to','say','.','xxmaj','it','does',"n't",'merit','one','line',',','much','less','ten'...]
= txts[:200].map(tkn)
toks200 0] toks200[
(#177) ['xxbos','xxmaj','this','conglomeration','fails','so','miserably','on','every','level'...]
fastai Numericalize
Numericalize
fastai.text.data.Numericalize
print_source(Numericalize)
class Numericalize(Transform):
"Reversible transform of tokenized texts to numericalized ids"
def __init__(self, vocab=None, min_freq=3, max_vocab=60000, special_toks=None):
store_attr('vocab,min_freq,max_vocab,special_toks')
self.o2i = None if vocab is None else defaultdict(int, {v:k for k,v in enumerate(vocab)})
def setups(self, dsets):
if dsets is None: return
if self.vocab is None:
count = dsets.counter if getattr(dsets, 'counter', None) is not None else Counter(p for o in dsets for p in o)
if self.special_toks is None and hasattr(dsets, 'special_toks'):
self.special_toks = dsets.special_toks
self.vocab = make_vocab(count, min_freq=self.min_freq, max_vocab=self.max_vocab, special_toks=self.special_toks)
self.o2i = defaultdict(int, {v:k for k,v in enumerate(self.vocab) if v != 'xxfake'})
def encodes(self, o): return TensorText(tensor([self.o2i [o_] for o_ in o]))
def decodes(self, o): return L(self.vocab[o_] for o_ in o)
= Numericalize()
num # Generate vocab
num.setup(toks200)20) coll_repr(num.vocab,
"(#1992) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','is','i','it','this'...]"
TensorText
fastai.text.data.TensorText
print_source(TensorText)
class TensorText(TensorBase): pass
= num(toks)[:20]; nums nums
TensorText([ 2, 8, 19, 0, 585, 51, 1190, 36, 166, 586, 21, 18, 16, 0, 15, 663, 67, 15, 140, 10])
' '.join(num.vocab[o] for o in nums)
'xxbos xxmaj this xxunk fails so miserably on every level that it is xxunk to decide what to say .'
Note: Special rules tokens appear first followed by tokens in order of frequency
Putting Our Texts into Batches for a Language Model
= "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
stream = tkn(stream)
tokens tokens
(#90) ['xxbos','xxmaj','in','this','chapter',',','we','will','go','back'...]
# Visualize 6 batches of 15 tokens
= 6,15
bs,seq_len = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
d_tokens = pd.DataFrame(d_tokens)
df =False,header=None))) display(HTML(df.to_html(index
xxbos | xxmaj | in | this | chapter | , | we | will | go | back | over | the | example | of | classifying |
movie | reviews | we | studied | in | chapter | 1 | and | dig | deeper | under | the | surface | . | xxmaj |
first | we | will | look | at | the | processing | steps | necessary | to | convert | text | into | numbers | and |
how | to | customize | it | . | xxmaj | by | doing | this | , | we | ’ll | have | another | example |
of | the | preprocessor | used | in | the | data | block | xxup | api | . | xxmaj | then | we | |
will | study | how | we | build | a | language | model | and | train | it | for | a | while | . |
# 6 batches of 5 tokens
= 6,5
bs,seq_len = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
d_tokens = pd.DataFrame(d_tokens)
df =False,header=None))) display(HTML(df.to_html(index
xxbos | xxmaj | in | this | chapter |
movie | reviews | we | studied | in |
first | we | will | look | at |
how | to | customize | it | . |
of | the | preprocessor | used | in |
will | study | how | we | build |
= 6,5
bs,seq_len = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
d_tokens = pd.DataFrame(d_tokens)
df =False,header=None))) display(HTML(df.to_html(index
, | we | will | go | back |
chapter | 1 | and | dig | deeper |
the | processing | steps | necessary | to |
xxmaj | by | doing | this | , |
the | data | block | xxup | api |
a | language | model | and | train |
= 6,5
bs,seq_len = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
d_tokens = pd.DataFrame(d_tokens)
df =False,header=None))) display(HTML(df.to_html(index
over | the | example | of | classifying |
under | the | surface | . | xxmaj |
convert | text | into | numbers | and |
we | ’ll | have | another | example |
. | xxmaj | then | we | |
it | for | a | while | . |
= toks200.map(num) nums200
LMDataLoader
fastai.text.data.LMDataLoader
print_source(LMDataLoader)
@delegates()
class LMDataLoader(TfmdDL):
"A `DataLoader` suitable for language modeling"
def __init__(self, dataset, lens=None, cache=2, bs=64, seq_len=72, num_workers=0, **kwargs):
self.items = ReindexCollection(dataset, cache=cache, tfm=_maybe_first)
self.seq_len = seq_len
if lens is None: lens = _get_lengths(dataset)
if lens is None: lens = [len(o) for o in self.items]
self.lens = ReindexCollection(lens, idxs=self.items.idxs)
# The "-1" is to allow for final label, we throw away the end that's less than bs
corpus = round_multiple(sum(lens)-1, bs, round_down=True)
self.bl = corpus//bs #bl stands for batch length
self.n_batches = self.bl//(seq_len) + int(self.bl%seq_len!=0)
self.last_len = self.bl - (self.n_batches-1)*seq_len
self.make_chunks()
super().__init__(dataset=dataset, bs=bs, num_workers=num_workers, **kwargs)
self.n = self.n_batches*bs
def make_chunks(self): self.chunks = Chunks(self.items, self.lens)
def shuffle_fn(self,idxs):
self.items.shuffle()
self.make_chunks()
return idxs
def create_item(self, seq):
if seq is None: seq = 0
if seq>=self.n: raise IndexError
sl = self.last_len if seq//self.bs==self.n_batches-1 else self.seq_len
st = (seq%self.bs)*self.bl + (seq//self.bs)*self.seq_len
txt = self.chunks[st : st+sl+1]
return LMTensorText(txt[:-1]),txt[1:]
@delegates(TfmdDL.new)
def new(self, dataset=None, seq_len=None, **kwargs):
lens = self.lens.coll if dataset is None else None
seq_len = self.seq_len if seq_len is None else seq_len
return super().new(dataset=dataset, lens=lens, seq_len=seq_len, **kwargs)
Note: The order of separate documents is shuffled, not the order of the words inside them.
= LMDataLoader(nums200) dl
= first(dl)
x,y x.shape,y.shape
(torch.Size([64, 72]), torch.Size([64, 72]))
' '.join(num.vocab[o] for o in x[0][:20])
'xxbos xxmaj this xxunk fails so miserably on every level that it is xxunk to decide what to say .'
' '.join(num.vocab[o] for o in y[0][:20])
'xxmaj this xxunk fails so miserably on every level that it is xxunk to decide what to say . xxmaj'
Note: The dependent variable is offset by one token, since the goal is to predict the next token in the sequence.
Training a Text Classifier
- Fine-tune a language model pretrained on a standard corpus like Wikipedia on a target dataset
- Use the fine-tuned model to train a classifier
Language Model Using DataBlock
- fastai automatically handles tokenization and numericalization when
TextBlock
is passed toDataBlock
- fastai saves the tokenized documents in a temporary fodler, so it does not have to tokenize them more than once
- fastai runs multiple tokenization processes in parallel
TextBlock
fastai.text.data.TextBlock
print_source(TextBlock)
class TextBlock(TransformBlock):
"A `TransformBlock` for texts"
@delegates(Numericalize.__init__)
def __init__(self, tok_tfm, vocab=None, is_lm=False, seq_len=72, backwards=False, **kwargs):
type_tfms = [tok_tfm, Numericalize(vocab, **kwargs)]
if backwards: type_tfms += [reverse_text]
return super().__init__(type_tfms=type_tfms,
dl_type=LMDataLoader if is_lm else SortedDL,
dls_kwargs={'seq_len': seq_len} if is_lm else {'before_batch': Pad_Chunk(seq_len=seq_len)})
@classmethod
@delegates(Tokenizer.from_df, keep=True)
def from_df(cls, text_cols, vocab=None, is_lm=False, seq_len=72, backwards=False, min_freq=3, max_vocab=60000, **kwargs):
"Build a `TextBlock` from a dataframe using `text_cols`"
return cls(Tokenizer.from_df(text_cols, **kwargs), vocab=vocab, is_lm=is_lm, seq_len=seq_len,
backwards=backwards, min_freq=min_freq, max_vocab=max_vocab)
@classmethod
@delegates(Tokenizer.from_folder, keep=True)
def from_folder(cls, path, vocab=None, is_lm=False, seq_len=72, backwards=False, min_freq=3, max_vocab=60000, **kwargs):
"Build a `TextBlock` from a `path`"
return cls(Tokenizer.from_folder(path, **kwargs), vocab=vocab, is_lm=is_lm, seq_len=seq_len,
backwards=backwards, min_freq=min_freq, max_vocab=max_vocab)
# Define how to get dataset items
= partial(get_text_files, folders=['train', 'test', 'unsup'])
get_imdb
= DataBlock(
dls_lm =TextBlock.from_folder(path, is_lm=True),
blocks=get_imdb, splitter=RandomSplitter(0.1)
get_items=path, bs=128, seq_len=80) ).dataloaders(path, path
=2) dls_lm.show_batch(max_n
text | text_ | |
---|---|---|
0 | xxbos i think xxmaj dark xxmaj angel is great ! xxmaj first season was excellent , and had a good plot . xxmaj with xxunk xxmaj alba ) as an escaped xxup xxunk , manticore creation , trying to adapt to a normal life , but still ” saving the world ” . xxmaj and being hunted by manticore throughout the season which gives the series some extra spice . xxmaj the second season though suddenly became a bit | i think xxmaj dark xxmaj angel is great ! xxmaj first season was excellent , and had a good plot . xxmaj with xxunk xxmaj alba ) as an escaped xxup xxunk , manticore creation , trying to adapt to a normal life , but still ” saving the world ” . xxmaj and being hunted by manticore throughout the season which gives the series some extra spice . xxmaj the second season though suddenly became a bit odd |
1 | cheating boyfriend xxmaj nick xxmaj gordon planning to drop her for the much younger and sexier xxmaj sally xxmaj higgins . xxmaj sally ’s boyfriend xxmaj jerry had earlier participated in a payroll robbery with xxmaj nick where he and a security guard were shot and killed . xxmaj now seeing that there ’s a future , in crime , for her with xxmaj nick xxmaj sally willingly replaced xxmaj mimi as xxmaj nick ’s new squeeze . xxmaj mad | boyfriend xxmaj nick xxmaj gordon planning to drop her for the much younger and sexier xxmaj sally xxmaj higgins . xxmaj sally ’s boyfriend xxmaj jerry had earlier participated in a payroll robbery with xxmaj nick where he and a security guard were shot and killed . xxmaj now seeing that there ’s a future , in crime , for her with xxmaj nick xxmaj sally willingly replaced xxmaj mimi as xxmaj nick ’s new squeeze . xxmaj mad as |
Fine-Tuning the Language Model
- Use embeddings to convert the integer word indices into activations that we can use for our neural network
- Feed those embeddings to a Recurrent Neural Nerwork (RNN), using an architecture called AWD-LSTM
- This process is handled automatically inside language_model_learner
language_model_learner
<function fastai.text.learner.language_model_learner(dls, arch, config=None, drop_mult=1.0, backwards=False, pretrained=True, pretrained_fnames=None, loss_func=None, opt_func=<function Adam at 0x7fdb8123e430>, lr=0.001, splitter=<function trainable_params at 0x7fdb8d9171f0>, cbs=None, metrics=None, path=None, model_dir='models', wd=None, wd_bn_bias=False, train_bn=True, moms=(0.95, 0.85, 0.95))>
print_source(language_model_learner)
@delegates(Learner.__init__)
def language_model_learner(dls, arch, config=None, drop_mult=1., backwards=False, pretrained=True, pretrained_fnames=None, **kwargs):
"Create a `Learner` with a language model from `dls` and `arch`."
vocab = _get_text_vocab(dls)
model = get_language_model(arch, len(vocab), config=config, drop_mult=drop_mult)
meta = _model_meta[arch]
learn = LMLearner(dls, model, loss_func=CrossEntropyLossFlat(), splitter=meta['split_lm'], **kwargs)
url = 'url_bwd' if backwards else 'url'
if pretrained or pretrained_fnames:
if pretrained_fnames is not None:
fnames = [learn.path/learn.model_dir/f'{fn}.{ext}' for fn,ext in zip(pretrained_fnames, ['pth', 'pkl'])]
else:
if url not in meta:
warn("There are no pretrained weights for that architecture yet!")
return learn
model_path = untar_data(meta[url] , c_key='model')
try: fnames = [list(model_path.glob(f'*.{ext}'))[0] for ext in ['pth', 'pkl']]
except IndexError: print(f'The model in {model_path} is incomplete, download again'); raise
learn = learn.load_pretrained(*fnames)
return learn
= language_model_learner(
learn =0.3,
dls_lm, AWD_LSTM, drop_mult=[accuracy, Perplexity()]).to_fp16() metrics
Perplexity Metric
- the exponential of the loss (i.e.
torch.exp(cross_entropy)
) - often used in NLP for language models
Perplexity
fastai.metrics.Perplexity
print_source(Perplexity)
class Perplexity(AvgLoss):
"Perplexity (exponential of cross-entropy loss) for Language Models"
@property
def value(self): return torch.exp(self.total/self.count) if self.count != 0 else None
@property
def name(self): return "perplexity"
# Train only the embeddings (the only part of the model that contains randomly initialize weights)
1, 2e-2) learn.fit_one_cycle(
epoch | train_loss | valid_loss | accuracy | perplexity | time |
---|---|---|---|---|---|
0 | 4.011688 | 3.904507 | 0.300504 | 49.625618 | 09:21 |
Saving and Loading Models
learn.save
<bound method Learner.save of <fastai.text.learner.LMLearner object at 0x7fdb64fd4190>>
print_source(learn.save)
@patch
@delegates(save_model)
def save(self:Learner, file, **kwargs):
"Save model and optimizer state (if `with_opt`) to `self.path/self.model_dir/file`"
file = join_path_file(file, self.path/self.model_dir, ext='.pth')
save_model(file, self.model, getattr(self,'opt',None), **kwargs)
return file
save_model
<function fastai.learner.save_model(file, model, opt, with_opt=True, pickle_protocol=2)>
print_source(save_model)
def save_model(file, model, opt, with_opt=True, pickle_protocol=2):
"Save `model` to `file` along with `opt` (if available, and if `with_opt`)"
if rank_distrib(): return # don't save if child proc
if opt is None: with_opt=False
state = get_model(model).state_dict()
if with_opt: state = {'model': state, 'opt':opt.state_dict()}
torch.save(state, file, pickle_protocol=pickle_protocol)
print_source(rank_distrib)
def rank_distrib():
"Return the distributed rank of this process (if applicable)."
return int(os.environ.get('RANK', 0))
print_source(get_model)
def get_model(model):
"Return the model maybe wrapped inside `model`."
return model.module if isinstance(model, (DistributedDataParallel, nn.DataParallel)) else model
print_source(torch.save)
def save(obj, f: Union[str, os.PathLike, BinaryIO, IO[bytes]],
pickle_module=pickle, pickle_protocol=DEFAULT_PROTOCOL, _use_new_zipfile_serialization=True) -> None:
# Reference: https://github.com/pytorch/pytorch/issues/54354
# The first line of this docstring overrides the one Sphinx generates for the
# documentation. We need it so that Sphinx doesn't leak `pickle`s path from
# the build environment (e.g. `<module 'pickle' from '/leaked/path').
"""save(obj, f, pickle_module=pickle, pickle_protocol=DEFAULT_PROTOCOL, _use_new_zipfile_serialization=True)
Saves an object to a disk file.
See also: :ref:`saving-loading-tensors`
Args:
obj: saved object
f: a file-like object (has to implement write and flush) or a string or
os.PathLike object containing a file name
pickle_module: module used for pickling metadata and objects
pickle_protocol: can be specified to override the default protocol
.. note::
A common PyTorch convention is to save tensors using .pt file extension.
.. note::
PyTorch preserves storage sharing across serialization. See
:ref:`preserve-storage-sharing` for more details.
.. note::
The 1.6 release of PyTorch switched ``torch.save`` to use a new
zipfile-based file format. ``torch.load`` still retains the ability to
load files in the old format. If for any reason you want ``torch.save``
to use the old format, pass the kwarg ``_use_new_zipfile_serialization=False``.
Example:
>>> # Save to file
>>> x = torch.tensor([0, 1, 2, 3, 4])
>>> torch.save(x, 'tensor.pt')
>>> # Save to io.BytesIO buffer
>>> buffer = io.BytesIO()
>>> torch.save(x, buffer)
"""
_check_dill_version(pickle_module)
with _open_file_like(f, 'wb') as opened_file:
if _use_new_zipfile_serialization:
with _open_zipfile_writer(opened_file) as opened_zipfile:
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
return
_legacy_save(obj, opened_file, pickle_module, pickle_protocol)
'1epoch') learn.save(
Path('/home/innom-dt/.fastai/data/imdb/models/1epoch.pth')
learn.load
<bound method TextLearner.load of <fastai.text.learner.LMLearner object at 0x7fdb64fd4190>>
print_source(learn.load)
@delegates(load_model_text)
def load(self, file, with_opt=None, device=None, **kwargs):
if device is None: device = self.dls.device
if self.opt is None: self.create_opt()
file = join_path_file(file, self.path/self.model_dir, ext='.pth')
load_model_text(file, self.model, self.opt, device=device, **kwargs)
return self
load_model_text
<function fastai.text.learner.load_model_text(file, model, opt, with_opt=None, device=None, strict=True)>
print_source(load_model_text)
def load_model_text(file, model, opt, with_opt=None, device=None, strict=True):
"Load `model` from `file` along with `opt` (if available, and if `with_opt`)"
distrib_barrier()
if isinstance(device, int): device = torch.device('cuda', device)
elif device is None: device = 'cpu'
state = torch.load(file, map_location=device)
hasopt = set(state)=={'model', 'opt'}
model_state = state['model'] if hasopt else state
get_model(model).load_state_dict(clean_raw_keys(model_state), strict=strict)
if hasopt and ifnone(with_opt,True):
try: opt.load_state_dict(state['opt'])
except:
if with_opt: warn("Could not load the optimizer state.")
elif with_opt: warn("Saved filed doesn't contain an optimizer state.")
distrib_barrier
<function fastai.torch_core.distrib_barrier()>
print_source(distrib_barrier)
def distrib_barrier():
"Place a synchronization barrier in distributed training"
if num_distrib() > 1 and torch.distributed.is_initialized(): torch.distributed.barrier()
= learn.load('1epoch') learn
learn.unfreeze()10, 2e-3) learn.fit_one_cycle(
epoch | train_loss | valid_loss | accuracy | perplexity | time |
---|---|---|---|---|---|
0 | 3.767039 | 3.763731 | 0.316231 | 43.108986 | 09:44 |
1 | 3.692761 | 3.705623 | 0.323240 | 40.675396 | 09:29 |
2 | 3.634718 | 3.654817 | 0.328937 | 38.660458 | 09:31 |
3 | 3.563724 | 3.624163 | 0.332917 | 37.493317 | 09:32 |
4 | 3.486968 | 3.600153 | 0.335374 | 36.603825 | 09:36 |
5 | 3.435516 | 3.585277 | 0.337806 | 36.063351 | 09:30 |
6 | 3.363010 | 3.575442 | 0.339413 | 35.710388 | 09:18 |
7 | 3.300442 | 3.574242 | 0.340387 | 35.667561 | 09:22 |
8 | 3.247055 | 3.576924 | 0.340627 | 35.763359 | 09:14 |
9 | 3.210976 | 3.581657 | 0.340366 | 35.933022 | 09:18 |
Encoder
- the model not including the task-specific final layer(s)
- typically used to refer to the body of NLP and generative models
learn.save_encoder
<bound method TextLearner.save_encoder of <fastai.text.learner.LMLearner object at 0x7fdb64fd4190>>
print_source(learn.save_encoder)
def save_encoder(self, file):
"Save the encoder to `file` in the model directory"
if rank_distrib(): return # don't save if child proc
encoder = get_model(self.model)[0]
if hasattr(encoder, 'module'): encoder = encoder.module
torch.save(encoder.state_dict(), join_path_file(file, self.path/self.model_dir, ext='.pth'))
'finetuned') learn.save_encoder(
Text Generation
- Training the model to predict the next word of a sentence enables it to generate new reviews
# Prompt
= "I liked this movie because"
TEXT # Generate 40 new words
= 40
N_WORDS = 2
N_SENTENCES # Add some randomness (temperature) to prevent generating the same review twice
= [learn.predict(TEXT, N_WORDS, temperature=0.75)
preds for _ in range(N_SENTENCES)]
print("\n".join(preds))
i liked this movie because it was a very well done film . Lee Bowman does a wonderful job in his role . Being a big Astaire fan , i think this movie is worth seeing for the musical numbers .
i liked this movie because it was based on a true story . The script was excellent as it was great . i would recommend this movie to anyone interested in history and the history of the holocaust . It was great to
Note: The model has learned a lot about English sentences, despite not having any explicitely programmed knowledge.
Creating the Classifier DataLoaders
- very similar to the DataBlocks used for the image classification datasets
- data augmentation has not been well-explored
- need to pad smaller documents when creating mini-batches
- batches are padded based on the largest document in a given batch
- the data block API automatically handles sorting and padding when using TextBlock with
is_lm=False
GrandparentSplitter
<function fastai.data.transforms.GrandparentSplitter(train_name='train', valid_name='valid')>
print_source(GrandparentSplitter)
def GrandparentSplitter(train_name='train', valid_name='valid'):
"Split `items` from the grand parent folder names (`train_name` and `valid_name`)."
def _inner(o):
return _grandparent_idxs(o, train_name),_grandparent_idxs(o, valid_name)
return _inner
= DataBlock(
dls_clas # Pass in the vocab used by the language model
=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
blocks= parent_label,
get_y =partial(get_text_files, folders=['train', 'test']),
get_items=GrandparentSplitter(valid_name='test')
splitter=path, bs=128, seq_len=72) ).dataloaders(path, path
=3) dls_clas.show_batch(max_n
text | category | |
---|---|---|
0 | xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero | pos |
1 | xxbos xxmaj by now you ’ve probably heard a bit about the new xxmaj disney dub of xxmaj miyazaki ’s classic film , xxmaj laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky . xxmaj during late summer of 1998 , xxmaj disney released ” kiki ’s xxmaj delivery xxmaj service ” on video which included a preview of the xxmaj laputa dub saying it was due out in ” 1 xxrep 3 9 ” . xxmaj it ’s obviously way past that year now , but the dub has been finally completed . xxmaj and it ’s not ” laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky ” , just ” castle xxmaj in xxmaj the xxmaj sky ” for the dub , since xxmaj laputa is not such a nice word in xxmaj spanish ( even though they use the word xxmaj laputa many times | pos |
2 | xxbos xxmaj heavy - handed moralism . xxmaj writers using characters as mouthpieces to speak for themselves . xxmaj predictable , plodding plot points ( say that five times fast ) . a child ’s imitation of xxmaj britney xxmaj spears . xxmaj this film has all the earmarks of a xxmaj lifetime xxmaj special reject . i honestly believe that xxmaj jesus xxmaj xxunk and xxmaj julia xxmaj xxunk set out to create a thought - provoking , emotional film on a tough subject , exploring the idea that things are not always black and white , that one who is a criminal by definition is not necessarily a bad human being , and that there can be extenuating circumstances , especially when one puts the well - being of a child first . xxmaj however , their earnestness ends up being channeled into preachy dialogue and trite | neg |
= toks200[:10].map(num) nums_samp
map(len) nums_samp.
(#10) [177,562,211,125,125,425,421,1330,196,278]
= text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
learn =accuracy).to_fp16() metrics
= learn.load_encoder('finetuned') learn
Fine-Tuning the Classifier
- NLP classifiers benefit from gradually unfreezing a few layers at a time
1, 2e-2) learn.fit_one_cycle(
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.242196 | 0.178359 | 0.931280 | 00:29 |
# Freeze all except the last two parameter groups
-2)
learn.freeze_to(1, slice(1e-2/(2.6**4),1e-2)) learn.fit_one_cycle(
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.226292 | 0.162955 | 0.938840 | 00:35 |
# Freeze all except the last two parameter groups
-3)
learn.freeze_to(1, slice(5e-3/(2.6**4),5e-3)) learn.fit_one_cycle(
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.150115 | 0.144669 | 0.947280 | 00:46 |
# Unfreeze all layers
learn.unfreeze()2, slice(1e-3/(2.6**4),1e-3)) learn.fit_one_cycle(
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.160042 | 0.149997 | 0.945320 | 00:54 |
1 | 0.146106 | 0.148102 | 0.945320 | 00:54 |
Note: We can further improve the accuracy by training another model on all the texts read backward and averaging the predictions of the two models.
Disinformation and Language Models
- Even simple algorithms based on rules could be used to create fraudulent accounts and try influence policymakers
- More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked
- Jeff Kao discovered a large cluster of comments opposing net neutrality that seemed to have been generated by some sort of Mad Libs-style mail merge.
- estimated that less than 800,000 of the 22M+ comments could be considered unique
- more than 99% of the truly unique comments were in favor of net neutrality
- The same type of language model as trained above could be used to generate context-appropriate, believable text
References
Previous: Notes on fastai Book Ch. 9
Next: Notes on fastai Book Ch. 11
- I’m Christian Mills, a deep learning consultant specializing in computer vision and practical AI implementations.
- I help clients leverage cutting-edge AI technologies to solve real-world problems.
- Learn more about me or reach out via email at [email protected] to discuss your project.