Skypool Starter - News Text Classification - BERT

1. Pre-training a Hugging Face model

We train with the Hugging Face Trainer API, using the tokenizers library and masked language modeling (MLM), following the EsperBERTo tutorial from the Hugging Face site.

1. Loading the dataset

We use the OSCAR corpus, a large multilingual dataset; for this walkthrough we only take its Esperanto portion.

# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

2. Training tokenizer

We choose a byte-level BPE tokenizer (the same kind GPT-2 uses). Compared with BERT's WordPiece (a character-level subword tokenizer), it produces far fewer out-of-vocabulary "<unk>" tokens.
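
To make the difference concrete, here is a minimal sketch (not part of the original notebook) comparing an off-the-shelf WordPiece tokenizer with a byte-level BPE one on text containing rare characters; it assumes the bert-base-cased and roberta-base tokenizers can be downloaded:

from transformers import AutoTokenizer

# WordPiece (BERT) vs. byte-level BPE (RoBERTa/GPT-2) on text with rare characters
bert_tok = AutoTokenizer.from_pretrained("bert-base-cased")  # WordPiece
bpe_tok = AutoTokenizer.from_pretrained("roberta-base")      # byte-level BPE

text = "ĉu vi parolas Esperanton? 😊"
print(bert_tok.tokenize(text))  # characters missing from the vocab collapse into [UNK]
print(bpe_tok.tokenize(text))   # any byte sequence can be encoded, so no <unk> appears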

# Install transformers and tokenizers
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1
%%time 
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
paths = [str(x) for x in Path(".").glob("**/*.txt")]

# tokenizer initialization
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

2.2 Tokenizer training parameters

The BPE and WordPiece trainers take the following parameters (a standalone usage sketch follows the list):

# BPE trainer
class tokenizers.trainers.BpeTrainer(vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None)
  • vocab_size (int, optional) - The size of the final vocabulary, including all tokens and the alphabet.
  • min_frequency (int, optional) - The minimum frequency a pair must have in order to be merged.
  • show_progress (bool, optional) - Whether to show a progress bar during training.
  • special_tokens (List[Union[str, AddedToken]], optional) - A list of special tokens the model should know about.
  • limit_alphabet (int, optional) - The maximum number of different characters to keep in the alphabet.
  • initial_alphabet (List[str], optional) - A list of characters to include in the initial alphabet, even if they do not appear in the training dataset. If a string contains more than one character, only the first one is kept.
  • continuing_subword_prefix (str, optional) - A prefix to attach to every subword that does not begin a word.
  • end_of_word_suffix (str, optional) - A suffix to attach to every subword that ends a word.
# WordPiece trainer, which takes the same parameters as the BPE trainer above (note the default continuing_subword_prefix='##')
class tokenizers.trainers.WordPieceTrainer(vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix='##', end_of_word_suffix=None)
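
For reference, here is a minimal sketch of driving these trainer classes through the lower-level Tokenizer API (assuming a recent tokenizers release; the output file name is illustrative):

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build a WordPiece tokenizer and train it with an explicit trainer object
wp_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##",
)
wp_tokenizer.train(files=["oscar.eo.txt"], trainer=trainer)
wp_tokenizer.save("wordpiece-eo.json")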

2.3 Saving and loading the tokenizer

Save the trained tokenizer to the EsperBERTo folder:

!mkdir EsperBERTo
tokenizer.save_model("EsperBERTo")

This produces the two tokenizer files:

  • EsperBERTo/vocab.json: the vocabulary, listing tokens ranked by frequency
  • EsperBERTo/merges.txt: the list of BPE merges

# vocab.json
{ "<s>": 0,"<pad>": 1,"</s>": 2,"<unk>": 3, "<mask>": 4,"!": 5,"\"": 6,"#": 7,
    "$": 8,"%": 9,"&": 10,"'": 11,"(": 12,")": 13, # ...}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...

The tokenizer is optimized for Esperanto: more words are represented by a single, unsplit token, and sequences are encoded more efficiently. On this corpus, the average encoded sequence is about 30% shorter than with the pre-trained GPT-2 tokenizer.
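
A rough way to check this kind of claim is to compare token counts on a sample of the corpus. The sketch below is not part of the original notebook and assumes the pre-trained GPT-2 tokenizer can be downloaded:

from itertools import islice
from transformers import GPT2TokenizerFast

gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")

with open("oscar.eo.txt", encoding="utf-8") as f:
    sample = [line.strip() for line in islice(f, 1000)]

ours = sum(len(tokenizer.encode(line).tokens) for line in sample)  # our ByteLevelBPETokenizer
gpt2 = sum(len(gpt2_tok.tokenize(line)) for line in sample)
print(f"ours: {ours}  gpt2: {gpt2}  ratio: {ours / gpt2:.2f}")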

Reload the tokenizer and add post-processing for RoBERTa's special tokens:

from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt",
)
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

token_to_id converts a given token to its corresponding id.

BertProcessing parameters:

class tokenizers.processors.BertProcessing(sep, cls)

This post-processor adds the special tokens required by a BERT-style model (a RobertaProcessing alternative is sketched after the list):

  • sep (Tuple[str, int]) - A tuple of the SEP token's string representation and its id
  • cls (Tuple[str, int]) - A tuple of the CLS token's string representation and its id
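
tokenizers also ships a RobertaProcessing post-processor, which for this vocabulary is roughly equivalent to the BertProcessing call above; a sketch:

from tokenizers.processors import RobertaProcessing

# Equivalent post-processing using the RoBERTa-specific class
tokenizer._tokenizer.post_processor = RobertaProcessing(
    sep=("</s>", tokenizer.token_to_id("</s>")),
    cls=("<s>", tokenizer.token_to_id("<s>")),
)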

Test results:

tokenizer.encode("Mi estas Julien.")
Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
tokenizer.encode("Mi estas Julien.").tokens
['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']
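
The Encoding object exposes the attributes listed above; a quick way to inspect a few of them:

enc = tokenizer.encode("Mi estas Julien.")
print(enc.ids)                  # token ids; the exact values depend on the trained vocab
print(enc.attention_mask)       # all 1s here (no padding)
print(enc.special_tokens_mask)  # 1 for <s> and </s>, 0 for regular tokens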

3. Training language models from scratch

The run_language_modeling.py script is a useful reference, but here we configure the Trainer directly. As an example we train a RoBERTa-like model (compared with BERT, RoBERTa uses dynamic masking, drops the NSP task, and trains on larger batches and more data).

3.1 Defining the model configuration and tokenizer

import torch
# Define model parameters
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

# Re-create the tokenizer
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)

3.2 Initializing the model

Since we are training from scratch, we initialize the model from the configuration only, not from an existing pre-trained model or checkpoint.

from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config)

model.num_parameters()

84095008  # => 84 million parameters
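
For contrast, initializing from an existing checkpoint instead would look like the sketch below (roberta-base is only an example checkpoint and would require a matching tokenizer):

from transformers import RobertaForMaskedLM

# Initialize from an existing pre-trained checkpoint instead of from a bare config
pretrained_model = RobertaForMaskedLM.from_pretrained("roberta-base")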

3.3 Creating the training dataset

Since there is only one text file, no custom dataset class is needed: LineByLineTextDataset loads the file and tokenizes each line with our tokenizer.

%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)


CPU times: user 4min 54s, sys: 2.98 s, total: 4min 57s
Wall time: 1min 37s
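
Note that LineByLineTextDataset is deprecated in recent transformers releases; a rough equivalent built with the datasets library might look like this sketch (parameters are illustrative):

from datasets import load_dataset

# Load the raw text file, drop empty lines, and tokenize line by line
raw = load_dataset("text", data_files={"train": "./oscar.eo.txt"})["train"]
raw = raw.filter(lambda example: len(example["text"]) > 0)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_dataset = raw.map(tokenize, batched=True, remove_columns=["text"])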

Define the data_collator: a data collator batches the dataset samples together. If the inputs have different lengths, they are dynamically padded to the maximum length of the batch (a quick sanity check is sketched after the parameter list below).

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
class transformers.data.data_collator.DataCollatorForLanguageModeling(tokenizer: transformers.tokenization_utils_base.PreTrainedTokenizerBase, mlm: bool = True, mlm_probability: float = 0.15, pad_to_multiple_of: Optional[int] = None, tf_experimental_compile: bool = False, return_tensors: str = 'pt')
  • tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) - The tokenizer used to encode the data.
  • mlm (bool, optional, defaults to True) - Whether to use masked language modeling. If set to False, the labels are the same as the inputs with the padding tokens ignored (set to -100); otherwise, the labels are -100 for non-masked tokens and the value to predict for masked tokens.
  • mlm_probability (float, optional, defaults to 0.15) - The probability of (randomly) masking tokens in the input when mlm is True.
  • pad_to_multiple_of (int, optional) - If set, pads the sequence to a multiple of the provided value.
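
A quick sanity check of what the collator produces (a sketch; the exact per-sample format returned by the dataset varies across transformers versions):

# Collate a handful of examples and inspect the masked-LM batch
batch = data_collator([dataset[i] for i in range(4)])
print(batch["input_ids"].shape)  # (4, max_len_in_batch), padded dynamically
print(batch["labels"].shape)     # same shape; -100 everywhere except the ~15% masked positions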

3.4 Initializing the Trainer and training

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
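    # per_gpu_train_batch_size is deprecated in newer transformers versions,
    # where per_device_train_batch_size is the equivalent argument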
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

Start training

%%time
trainer.train()

CPU times: user 1h 43min 36s, sys: 1h 3min 28s, total: 2h 47min 4s
Wall time: 2h 46min 46s
TrainOutput(global_step=15228, training_loss=5.762423221226405)

Save the model

trainer.save_model("./EsperBERTo")

5. Checking the trained model

Besides watching the training and evaluation loss decrease, we can load the model into a FillMaskPipeline and check its predictions.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="./EsperBERTo", tokenizer="./EsperBERTo")
# The sun <mask>.
# =>

fill_mask("La suno <mask>.")
[{'score': 0.02119220793247223,
  'sequence': '<s> La suno estas.</s>',
  'token': 316},
 {'score': 0.012403824366629124,
  'sequence': '<s> La suno situas.</s>',
  'token': 2340},
 {'score': 0.011061107739806175,
  'sequence': '<s> La suno estis.</s>',
  'token': 394},
 {'score': 0.008284995332360268,
  'sequence': '<s> La suno de.</s>',
  'token': 274},
 {'score': 0.006471084896475077,
  'sequence': '<s> La suno akvo.</s>',
  'token': 1833}]
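
For reference, this is roughly what the pipeline does under the hood (a sketch assuming a recent transformers version, where model outputs expose .logits):

import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tok = RobertaTokenizerFast.from_pretrained("./EsperBERTo")
mlm = RobertaForMaskedLM.from_pretrained("./EsperBERTo")

inputs = tok("La suno <mask>.", return_tensors="pt")
mask_pos = (inputs["input_ids"] == tok.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = mlm(**inputs).logits

top5 = logits[0, mask_pos].topk(5, dim=-1).indices[0].tolist()
print(tok.convert_ids_to_tokens(top5))  # candidate tokens for the <mask> position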

Finally, when you have a good model, consider sharing it with the community:

Upload your model using CLI: transformers-cli upload
Write a README.md model card and add it to the repository under model_cards/. Ideally, your model card should include:

  • Model Description
  • Training parameters (datasets, preprocessing, hyperparameters)
  • Assessment Results
  • Expected uses and limitations
  • Other useful information🤓

Tags: Python NLP BERT
