1. Pre-training methods for HF models
Training a tokenizer and a masked language model (MLM) with the Trainer API, using the tokenizers and transformers packages from Hugging Face.
1. Loading the dataset
We select the multilingual OSCAR corpus; for this example we use its Esperanto subset.
# in this notebook we'll only get one of the files (the OSCAR one) for the sake of simplicity and performance
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
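As an aside (not part of the original notebook), roughly the same data can also be pulled with the datasets library; this is a sketch that assumes the OSCAR dataset is still loadable by that name on the Hub and that unshuffled_deduplicated_eo is its Esperanto configuration.

from datasets import load_dataset

# Assumed config name for the Esperanto subset of OSCAR on the Hub
oscar_eo = load_dataset("oscar", "unshuffled_deduplicated_eo", split="train")
print(oscar_eo[0]["text"][:200])  # peek at the first document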
2. Training tokenizer
We choose a byte-level BPE tokenizer (the same kind used by GPT-2). Compared with BERT's WordPiece tokenizer (a character-level BPE variant), it has the advantage of producing far fewer out-of-vocabulary '<unk>' tokens.
# Install transformers and tokenizers
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1
%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize the tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
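As a quick sanity check (not part of the original notebook, and assuming the training cell above has run), byte-level BPE falls back to raw bytes, so even Esperanto diacritics should never come out as '<unk>':

# Illustrative check: no '<unk>' expected, even for accented characters
output = tokenizer.encode("Ĉu vi parolas Esperanton?")
print(output.tokens)  # subword tokens
print(output.ids)     # corresponding ids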
2.2 Tokenizer training parameters
The trainer classes behind tokenizer training accept the parameters below (a sketch of calling them directly follows the parameter list):
# BPE trainer
class tokenizers.trainers.BpeTrainer(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None)
- vocab_size (int, optional) - The size of the final vocabulary, including all tokens and the alphabet.
- min_frequency (int, optional) - The minimum frequency a pair must have in order to be merged.
- show_progress (bool, optional) - Whether to show a progress bar during training.
- special_tokens (List[Union[str, AddedToken]], optional) - A list of special tokens the model should know about.
- limit_alphabet (int, optional) - The maximum number of different characters to keep in the alphabet.
- initial_alphabet (List[str], optional) - A list of characters to include in the initial alphabet, even if they do not appear in the training dataset. If a string contains more than one character, only its first character is kept.
- continuing_subword_prefix (str, optional) - The prefix to use for every subword that is not the beginning of a word.
- end_of_word_suffix (str, optional) - The suffix to use for every subword that is at the end of a word.
# WordPiece trainer, with the same parameters as the BPE trainer above
class tokenizers.trainers.WordPieceTrainer(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix='##', end_of_word_suffix=None)
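For reference, here is a minimal sketch of using these trainer classes directly with the low-level Tokenizer API instead of the ByteLevelBPETokenizer wrapper. It is not part of the original notebook, assumes a recent tokenizers release, and reuses the paths list built earlier.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Bare BPE model with byte-level pre-tokenization
bpe_tokenizer = Tokenizer(BPE(unk_token="<unk>"))
bpe_tokenizer.pre_tokenizer = ByteLevel()

# Configure the trainer with the parameters documented above
trainer = BpeTrainer(
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Train on the same text files as before
bpe_tokenizer.train(files=paths, trainer=trainer)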
2.3 Saving and loading the tokenizer
Save the trained tokenizer to the EsperBERTo folder:
!mkdir EsperBERTo
tokenizer.save_model("EsperBERTo")
This produces two tokenizer files:
- EsperBERTo/vocab.json: the vocabulary, a list of the most frequent tokens ranked by frequency
- EsperBERTo/merges.txt: the list of BPE merges
{ "<s>": 0,"<pad>": 1,"</s>": 2,"<unk>": 3, "<mask>": 4,"!": 5,"\"": 6,"#": 7, "$": 8,"%": 9,"&": 10,"'": 11,"(": 12,")": 13, # ...} # merges.txt l a Ġ k o n Ġ la t a Ġ e Ġ d Ġ p # ...
The tokenizer is optimized for Esperanto: more words are represented by a single, unsplit token, and sequences are encoded more efficiently. On this corpus, the average length of an encoded sequence is about 30% shorter than with the pre-trained GPT-2 tokenizer.
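A rough way to spot-check that claim (not in the original notebook) is to compare encoded lengths against the pre-trained GPT-2 tokenizer from the Hub on a sample sentence:

from transformers import GPT2TokenizerFast

gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
sample = "La suno brilas kaj la birdoj kantas en la arbaro."

ours = tokenizer.encode(sample)    # tokenizers Encoding object
theirs = gpt2_tok.encode(sample)   # list of token ids
print(len(ours.ids), len(theirs))  # our Esperanto tokenizer should need fewer tokens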
Load the tokenizer and configure post-processing for RoBERTa's special tokens:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt",
)
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
token_to_id: Converts a given token to its corresponding ID
BertProcessing parameters:
class tokenizers.processors.BertProcessing(self, sep, cls)
This post-processor adds the special tokens required by the BERT-style model:
- sep (Tuple[str, int]) - A tuple of the string representation of the SEP token and its id
- cls (Tuple[str, int]) - A tuple of the string representation of the CLS token and its id
Test results:
tokenizer.encode("Mi estas Julien.") Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
tokenizer.encode("Mi estas Julien.").tokens ['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']
3. Training language models from scratch
3.1 Defining the model configuration
See the run_language_modeling.py script for a reference implementation; here we configure the Trainer directly and choose the training setup ourselves. The example below trains a RoBERTa-like model (compared with BERT, RoBERTa uses dynamic masking, drops the next-sentence-prediction task, and trains with larger batches on more data).
import torch

# Define model parameters
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

# Re-create the tokenizer
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)
3.2 Initializing the model
Since we are training from scratch, we initialize the model from the configuration only, not from an existing pre-trained model or checkpoint.
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)
model.num_parameters()
# 84095008 => 84 million parameters
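For contrast, and only as an illustrative aside that is not in the original notebook: if we wanted to fine-tune rather than train from scratch, we would load existing weights with from_pretrained instead of building the model from the config (roberta-base here is just an example checkpoint):

from transformers import RobertaForMaskedLM

# Fine-tuning alternative: start from pre-trained weights instead of a random init
# (note: roberta-base's vocabulary differs from the Esperanto tokenizer above)
pretrained_model = RobertaForMaskedLM.from_pretrained("roberta-base")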
3.3 Creating the training dataset
Since we only have one text file, no custom dataset class is needed: LineByLineTextDataset loads the file and tokenizes each line with the tokenizer.
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)
# CPU times: user 4min 54s, sys: 2.98 s, total: 4min 57s
# Wall time: 1min 37s
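Note that LineByLineTextDataset is deprecated in recent transformers releases in favor of the datasets library. A rough equivalent under that assumption (not part of the original notebook) looks like this:

from datasets import load_dataset

raw = load_dataset("text", data_files="./oscar.eo.txt", split="train")

def tokenize(batch):
    # Tokenize each line, truncating to the same block size as above
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])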
Define the data_collator. A data collator batches samples from the dataset; if the inputs have different lengths, they are dynamically padded to the longest sequence in the batch (a small usage sketch follows the parameter list below).
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
class transformers.data.data_collator.DataCollatorForLanguageModeling(tokenizer: transformers.tokenization_utils_base.PreTrainedTokenizerBase, mlm: bool = True, mlm_probability: float = 0.15, pad_to_multiple_of: Optional[int] = None, tf_experimental_compile: bool = False, return_tensors: str = 'pt')
- tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) - The tokenizer used to encode the data.
- mlm (bool, optional, defaults to True) - Whether to use masked language modeling. If set to False, the labels are the same as the inputs, with padding tokens ignored (set to -100); otherwise the labels are -100 for non-masked tokens and the value to predict for masked tokens.
- mlm_probability (float, optional, defaults to 0.15) - The probability of (randomly) masking tokens in the input when mlm is True.
- pad_to_multiple_of (int, optional) - If set, pad the sequences to a multiple of the provided value.
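As a small illustration (assuming the dataset and data_collator defined above; this cell is not part of the original notebook), collating two samples shows the dynamic padding and the -100 label convention:

# Collate the first two samples into a batch
batch = data_collator([dataset[0], dataset[1]])
print(batch["input_ids"].shape)  # (2, padded_length): padded to the longer of the two
print(batch["labels"][0][:20])   # -100 everywhere except the ~15% masked positions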
3.4 Initialize Trainer and train
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
Start training
%%time
trainer.train()
# CPU times: user 1h 43min 36s, sys: 1h 3min 28s, total: 2h 47min 4s
# Wall time: 2h 46min 46s
# TrainOutput(global_step=15228, training_loss=5.762423221226405)
Save Model
trainer.save_model("./EsperBERTo")
4. Checking the trained model
Besides watching the training and evaluation loss go down, an easy way to check the model is to load it with a FillMaskPipeline and look at its predictions.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo",
)

# "La suno <mask>." means "The sun <mask>."
fill_mask("La suno <mask>.")
[{'score': 0.02119220793247223, 'sequence': '<s> La suno estas.</s>', 'token': 316},
 {'score': 0.012403824366629124, 'sequence': '<s> La suno situas.</s>', 'token': 2340},
 {'score': 0.011061107739806175, 'sequence': '<s> La suno estis.</s>', 'token': 394},
 {'score': 0.008284995332360268, 'sequence': '<s> La suno de.</s>', 'token': 274},
 {'score': 0.006471084896475077, 'sequence': '<s> La suno akvo.</s>', 'token': 1833}]
Finally, when you have a good model, consider sharing it with the community:
Upload your model using CLI: transformers-cli upload
Write a README.md model card and add it to the repository under model_cards/. Ideally, your model card should include:
- Model Description
- Training parameters (datasets, preprocessing, hyperparameters)
- Assessment Results
- Expected uses and limitations
- Other useful information🤓