Hugging Face Course, Chapter 3: "Fine-tuning a pretrained model"

Fine-tuning a pretrained model


This text was translated from the course found under Resources on the Hugging Face homepage.

Translator's note: Some articles translate token, Tokenizer, and Tokenization into Chinese terms. Although that can be more accurate in a sense, the author feels it is not simple, direct, or vivid enough. Therefore, parts of this text use the translated terms while others keep the English words (it is also possible that Google Translate rendered them as "tags" or "marks" without it being noticed). If you have other questions, you can leave a comment or check the original text.

Introduction to this chapter

In Chapter 2, we discussed how to use tokenizers and pretrained models to make predictions. But what if you want to fine-tune a pretrained model on your own dataset? That is the subject of this chapter! You will learn:

  • How to prepare a large dataset from the Hub
  • How to fine-tune a model using the high-level Trainer API
  • How to use a custom training loop
  • How to use the 🤗 Accelerate library to easily run that custom training loop on any distributed setup

To upload your trained checkpoints to the Hugging Face Hub, you will need a huggingface.co account: create an account.

Processing data

Continuing with the example from the previous chapter, here is how we would train a sequence classifier on one batch in PyTorch:

import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",]
# Pad and truncate the batch, and return PyTorch tensors
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Of course, training the model on just two sentences will not produce good results, so a larger dataset needs to be prepared for training.

In this section, we will use as an example the MRPC (Microsoft Research Paraphrase Corpus) dataset, introduced by William B. Dolan and Chris Brockett in a paper. The dataset consists of 5,801 pairs of sentences, each with a label indicating whether the two sentences are paraphrases of each other (that is, whether they have the same meaning). We chose it because it is a small dataset, so it is easy to train on.

Download dataset from Hub

YouTube video: Hugging Face Datasets Overview (PyTorch)

The Hub does not only contain models; it also has multiple datasets in many different languages. We recommend that you try loading and processing a new dataset once you have completed this section (see the reference documentation).

The MRPC dataset is one of the 10 datasets that make up the GLUE benchmark, an academic benchmark used to measure the performance of ML models across 10 different text classification tasks.

The 🤗 Datasets library provides a very simple command to download and cache a dataset from the Hub. We can download the MRPC dataset like this:

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

This gives us a DatasetDict object containing the training set, the validation set, and the test set, with 3,668 sentence pairs in the training set, 408 in the validation set, and 1,725 in the test set. Each pair of sentences has four columns of data: 'sentence1', 'sentence2', 'label', and 'idx'.

The load_dataset command downloads and caches the dataset, by default in ~/.cache/huggingface/datasets. You can customize the cache folder by setting the HF_HOME environment variable.
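
As a minimal sketch, the environment variable can be set before the libraries are imported so the new cache location is picked up (the path below is only a placeholder, not a recommendation):

import os
os.environ["HF_HOME"] = "/data/hf_cache"  # hypothetical cache location

from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")  # now cached under the new location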

We can access each pair of sentences in raw_datasets by indexing, like a dictionary:

raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]
{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}
# Optionally, view the validation split as a pandas DataFrame
import pandas as pd

validation = pd.DataFrame(raw_datasets["validation"])
validation


We can see that the labels are already integers, so no preprocessing is needed there. To know which integer corresponds to which label, we can inspect the features attribute of raw_train_dataset:

raw_train_dataset.features
{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'idx': Value(dtype='int32', id=None)}

label is of type ClassLabel, and the mapping from integers to label names is stored in its names list. label=1 means the pair of sentences are paraphrases of each other, and label=0 means the pair of sentences do not have the same meaning.
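
For example, the ClassLabel feature exposes int2str and str2int methods to convert between the two (a quick sketch):

label_feature = raw_train_dataset.features["label"]
print(label_feature.int2str(1))                 # 'equivalent'
print(label_feature.str2int("not_equivalent"))  # 0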

✏️ Have a try! Look at element 15 of the training set and element 87 of the validation set. What are their labels?
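
One possible way to look these up, reusing the dictionary-style indexing shown above (a sketch, not the only solution):

print(raw_datasets["train"][15]["label"])
print(raw_datasets["validation"][87]["label"])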

Preprocessing the dataset

YouTube Video <Preprocessing sentence pairs>

To preprocess the dataset, we need to use the tokenizer to convert the text into numbers the model can understand. We could tokenize the two sentences of each pair separately, like this:

from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

However, we cannot simply pass two sequences to the model and get a prediction of whether the two sentences are paraphrases of each other. We need to handle the two sequences as a pair and apply the appropriate preprocessing.
Fortunately, the tokenizer can also take a pair of sequences and prepare it the way the BERT model expects:

inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs
{ 'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102],
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

token_type_ids is what tells the model which part of the input is the first sentence and which is the second. The other two keys were covered before.

✏️ Give it a try! Take element 15 of the training set and tokenize the two sentences separately and as a pair. What is the difference between the two results?
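
A sketch of one way to compare the two results for element 15:

example = raw_datasets["train"][15]

# Tokenized separately
print(tokenizer(example["sentence1"]))
print(tokenizer(example["sentence2"]))

# Tokenized as a pair
print(tokenizer(example["sentence1"], example["sentence2"]))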

If we decode the IDs inside input_ids back into words:

tokenizer.convert_ids_to_tokens(inputs["input_ids"])

we get:

['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']

So we see that when there are two sentences, the model expects the input to be of the form [CLS] sentence1 [SEP] sentence2 [SEP]. Aligning this with the token_type_ids gives:

['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
[      0,      0,    0,     0,       0,          0,   0,       0,      1,    1,     1,        1,     1,   1,       1]

As you can see, the parts of the input corresponding to [CLS] sentence1 [SEP] all have a token_type_id of 0, while the parts corresponding to sentence2 [SEP] all have a token_type_id of 1.

Note that if you select a different checkpoint, the inputs produced by your tokenizer will not necessarily contain token_type_ids. For instance, they are not returned if you use a DistilBERT model. (This is because DistilBERT is a distilled version of BERT, and the NSP - Next Sentence Prediction - task was removed from its pretraining.

BERT, on the other hand, was pretrained with token type IDs: on top of the masked language modeling (MLM) objective, it performs the NSP task, which models the relationship between pairs of sentences. This only summarizes part of the original text, which covers the topic elsewhere in the tutorial.)

In general, as long as you use the same checkpoint for the tokenizer and the model, you do not need to worry about whether token_type_ids are in your tokenized inputs.
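
For example, a quick check (assuming the distilbert-base-uncased checkpoint) shows that the DistilBERT tokenizer returns no token_type_ids for a sentence pair:

from transformers import AutoTokenizer

distil_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(list(distil_tokenizer("This is the first sentence.", "This is the second one.").keys()))
# ['input_ids', 'attention_mask']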

We can pass the tokenizer a list of sentence pairs to tokenize the whole dataset. So one way to preprocess the training dataset is:

tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

This works, but it has the disadvantage of returning a dictionary (with keys input_ids, attention_mask, and token_type_ids, and values that are lists of lists). It will also only work if you have enough RAM to store the whole dataset during tokenization (whereas the datasets from the 🤗 Datasets library are Apache Arrow files stored on disk, so only the samples you request are loaded in memory).

To keep the data as a dataset, we will use the more flexible Dataset.map method. This method can do more preprocessing than just tokenization. The map method applies the same function to every element of the dataset, so let's define a function that tokenizes our inputs:

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys input_ids, attention_mask, and token_type_ids.

It also works when the dictionary contains multiple samples (each key being a list of sentences), which is the case when batching. In that case, the call to map can use the batched=True option, which greatly speeds up tokenization, because the tokenizers in the 🤗 Tokenizers library are written in Rust and are very fast when processing many inputs at once.

We left out the padding argument in our tokenization function because padding all the samples to the maximum length within each batch is more efficient than padding every sequence to the maximum length of the whole dataset. This saves a lot of time and processing power when the input sequences have very different lengths!

Here is how we apply the tokenization function to the entire dataset. We use batched=True in the call to map, so the function is applied to whole batches of elements of the dataset at once, rather than to each element separately. This makes preprocessing faster.

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

The 🤗 Datasets library applies this processing by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function:

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})

You can even use multiprocessing when applying the preprocessing function with Dataset.map by passing the num_proc argument. We did not do this here because the 🤗 Tokenizers library already uses multiple threads to tokenize our samples faster, but if you are not using a fast tokenizer backed by this library, this could speed up your preprocessing.
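
As a hedged sketch, passing num_proc would look like this (mainly useful when you are not using a fast tokenizer):

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4)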

Our tokenize_function returns a dictionary with the keys input_ids, attention_mask, and token_type_ids, so those three fields are added to all splits of our dataset. Note that we could also have changed existing fields if our preprocessing function had returned a new value for a key that already exists in the dataset we applied map to.
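
For illustration, a hypothetical preprocessing function that returns a value for the existing sentence1 key would overwrite that column instead of adding a new one:

def lowercase_sentence1(example):
    # Returning an existing key replaces that column in the mapped dataset
    return {"sentence1": example["sentence1"].lower()}

lowercased_datasets = raw_datasets.map(lowercase_sentence1)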

The last thing we need to do is pad all the input sequences to the length of the longest sequence in each batch when we batch elements together - a technique we call dynamic padding.

Dynamic padding

YouTube video: <What is dynamic padding?>

In PyTorch, the function responsible for putting a batch of samples together is called a collate function. It is an argument you can pass when building a DataLoader, the default being a function that converts your samples to PyTorch tensors and concatenates them (recursively if your elements are lists, tuples, or dictionaries). This default will not work in our case because our inputs will not all be of the same size. We have deliberately postponed the padding so that it is only applied as necessary on each batch, which avoids over-long inputs with a lot of padding. This will speed up training quite a bit, but note that if you train on a TPU it can cause problems - TPUs prefer fixed shapes, even when that requires extra padding.

To do this in practice, we have to define a collate function that applies the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the 🤗 Transformers library provides this function through DataCollatorWithPadding. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding on the left or on the right of the inputs) and does everything you need:

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

To test this little feature, let's grab a few samples from our training set that we want to batch together. The columns idx, sentence1, and sentence2 need to be removed because they are not needed and they contain strings (and we cannot create tensors from strings). Let's look at the length of each input in the batch:

samples = tokenized_datasets["train"][:8]
samples = {
    k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]
}
[len(x) for x in samples["input_ids"]]
[50, 59, 47, 67, 59, 50, 62, 32]

Unsurprisingly, we get sequences of varying lengths. Dynamic padding means that the sequences in this batch should all be padded to a length of 67, the maximum within the batch. Without dynamic padding, all samples would have to be padded to the maximum length of the entire dataset, or to the maximum length the model can accept. Let's double-check that our data_collator is dynamically padding the batch correctly:

batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}
{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'labels': torch.Size([8])}

Looks good! Now that we have gone from raw text to batches of data our model can deal with, we are ready to fine-tune it!
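
To connect this with the collate function discussion above, here is a minimal sketch of how the collator is typically plugged into a PyTorch DataLoader. The column clean-up shown is an assumption about what a full training setup needs, since the string columns cannot be turned into tensors:

from torch.utils.data import DataLoader

# Drop string columns and rename "label" to "labels" so the collator output
# matches what the model expects
train_dataset = tokenized_datasets["train"].remove_columns(["idx", "sentence1", "sentence2"])
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format("torch")

train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=data_collator)
for batch in train_dataloader:
    print({k: v.shape for k, v in batch.items()})
    break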

✏️ Give it a try! Replicate the preprocessing on the GLUE SST-2 dataset. It consists of single sentences instead of pairs, but the rest should look the same. For a harder challenge, try writing a preprocessing function that works on any of the GLUE tasks.
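
As a possible starting point for this exercise (the column name is assumed from the GLUE "sst2" configuration, which uses a single "sentence" column):

sst2_datasets = load_dataset("glue", "sst2")

def sst2_tokenize_function(example):
    # Single-sentence task: only one text column is passed to the tokenizer
    return tokenizer(example["sentence"], truncation=True)

sst2_tokenized = sst2_datasets.map(sst2_tokenize_function, batched=True)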

Tags: Pytorch Deep Learning NLP

Posted on Mon, 06 Sep 2021 13:07:59 -0400 by imstupid