```python
# Call the tokenizer for the BERT model
tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained('bert-base-chinese')
inputs_1 = tokenizer('What a nice day today')
print(inputs_1)
inputs_2 = tokenizer('Will it rain tomorrow')
print(inputs_2)
```
This calls BERT's pre-trained tokenizer. The 'bert-base-chinese' here should, judging from the paper, be the Chinese version of BERT-base.
The tokenizer's main job is text vectorization: converting each character of a Chinese sentence into its corresponding numeric id so that the machine can work with it.
From the output, the first id is always 101 and the last id is always 102 (BERT's [CLS] and [SEP] special tokens); each of the remaining ids corresponds to a single Chinese character. Comparing the two outputs, the id for "天" (day) is 1921 in both cases, which shows the mapping is fixed and does not change with the input.
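The idea can be sketched with a toy vocabulary (the entries below are made up for illustration, except 1921 for "天" as seen in the output above; the real ids come from the bert-base-chinese vocab file):

```python
# Toy sketch of what the tokenizer does: map each character to a fixed id
# and wrap the sequence with the special [CLS] (101) and [SEP] (102) tokens.
CLS_ID, SEP_ID = 101, 102
toy_vocab = {"天": 1921, "气": 3698, "好": 1962}  # hypothetical entries

def toy_tokenize(text):
    ids = [toy_vocab.get(ch, 100) for ch in text]  # 100 stands in for [UNK]
    return [CLS_ID] + ids + [SEP_ID]

print(toy_tokenize("天气好"))  # [101, 1921, 3698, 1962, 102]
print(toy_tokenize("好天气"))  # [101, 1962, 1921, 3698, 102]
# "天" maps to 1921 in both sentences: the mapping is fixed.
```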
```python
# Print the first 10 samples of the training set
for idx, example in enumerate(train_ds):
    if idx < 10:
        print(example)
```
enumerate() is a Python built-in that combines a traversable object (such as a list, tuple, or string) into an indexed sequence, yielding each element together with its index; it is generally used in a for loop. See the runoob tutorial on the Python enumerate() function: https://www.runoob.com/python/python-func-enumerate.html

This code prints the first 10 samples of the training set so we can inspect their contents.
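A minimal example of enumerate():

```python
# enumerate() pairs each element of an iterable with its index.
fruits = ["apple", "banana", "cherry"]
for idx, fruit in enumerate(fruits):
    print(idx, fruit)
# 0 apple
# 1 banana
# 2 cherry

# The optional second argument sets the starting index.
print(list(enumerate(fruits, start=1)))
# [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
```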
Notice that the training-set samples have no qid, and each text is a review of something. A label of 1 means the review is positive; 0 means it is negative.
```python
# Hyperparameters
EPOCHS = 10         # Number of training epochs
BATCH_SIZE = 8      # Batch size
MAX_LEN = 300       # Maximum text length
LR = 1e-5           # Learning rate
WARMUP_STEPS = 100  # Warm-up steps
T_TOTAL = 1000      # Total training steps
```
These hyperparameters are defined here; I'm not yet entirely clear on what each one does. The warm-up and total-step values are presumably consumed by the learning-rate scheduler later on.
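One common way WARMUP_STEPS and T_TOTAL are used is a linear warm-up schedule: the learning rate ramps up from 0 to LR over the first WARMUP_STEPS steps, then decays linearly to 0 at T_TOTAL. The sketch below is my guess at how a scheduler consumes these values, not the exact PaddleNLP implementation:

```python
LR = 1e-5
WARMUP_STEPS = 100
T_TOTAL = 1000

def lr_at_step(step):
    """Linear warm-up followed by linear decay (illustrative only)."""
    if step < WARMUP_STEPS:
        return LR * step / WARMUP_STEPS
    return LR * max(0.0, (T_TOTAL - step) / (T_TOTAL - WARMUP_STEPS))

print(lr_at_step(50))    # halfway through warm-up: 5e-06
print(lr_at_step(100))   # peak learning rate: 1e-05
print(lr_at_step(1000))  # end of training: 0.0
```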
```python
# Convert the text into the token ids required by the model
def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequences for sequence
    classification tasks by concatenating and adding special tokens. And
    creates a mask from the two sequences passed to be used in a
    sequence-pair classification task.
    """
    encoded_inputs = tokenizer(text=example["text"], max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
```
For an explanation of this code, see the PaddleNLP Transformer API documentation: https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html?highlight=convert_example#id2

The docstring says: "Builds model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. And creates a mask from the two sequences passed to be used in a sequence-pair classification task."
Roughly translated: it builds the model inputs for a sequence-classification task from a sequence or a pair of sequences by concatenating them and adding special tokens, and it creates a mask from the two sequences passed in for the sequence-pair classification task.
The tokenizer passed into this function should be the one introduced in Cell 3; max_seq_length is the maximum sentence length the tokenizer will accept; input_ids are the numeric codes corresponding to the Chinese characters. I am not sure yet what token_type_ids means (it appears to be the segment ids that distinguish the two sentences in a pair; for a single sentence they are all 0).
is_test indicates whether the data comes from the test set; if not, the sample's label is returned as well.
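To see the two return shapes, convert_example can be exercised with a stub tokenizer (the stub below only fakes the dict a real tokenizer returns; it is not the PaddleNLP tokenizer):

```python
import numpy as np

def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    encoded_inputs = tokenizer(text=example["text"], max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    return input_ids, token_type_ids

# Stub standing in for the real tokenizer: same output structure,
# with made-up ids wrapped in 101 ([CLS]) and 102 ([SEP]).
def stub_tokenizer(text, max_seq_len=512):
    ids = [101] + list(range(1, len(text) + 1)) + [102]
    return {"input_ids": ids, "token_type_ids": [0] * len(ids)}

train_example = {"text": "好天气", "label": 1}
print(convert_example(train_example, stub_tokenizer))                # 3-tuple, includes label
print(convert_example(train_example, stub_tokenizer, is_test=True))  # 2-tuple, no label
```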