NLP Live Lesson Day 5: Sentiment Analysis Pre-trained Model SKEP
This project introduces the two sub-tasks of sentiment analysis: sentence-level sentiment analysis and target-level sentiment analysis.
It also demonstrates how to use the sentiment analysis pre-trained model SKEP to complete both tasks, and describes the SKEP model and its usage in PaddleNLP in detail.
The project consists of four parts: "task introduction", "the sentiment analysis pre-trained model SKEP", "sentence-level sentiment analysis", and "target-level sentiment analysis".
The AI Studio platform has PaddlePaddle and PaddleNLP installed by default, and the versions are updated regularly. To update PaddlePaddle manually, refer to the PaddlePaddle installation guide and install the latest version of the framework in the corresponding environment.
Ensure that the latest version of PaddleNLP is installed using the following command:
!pip install --upgrade paddlenlp -i https://pypi.org/simple
Collecting paddlenlp
  Downloading https://files.pythonhosted.org/packages/b0/7d/6c24cda54d018d350ee342f715523ade7871660444ed95f3d3e753d6f388/paddlenlp-2.0.8-py3-none-any.whl (571kB)
     |████████████████████████████████| 573kB 291kB/s
Requirement already satisfied, skipping upgrade: colorama in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.4.4)
Requirement already satisfied, skipping upgrade: colorlog in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (4.1.0)
Requirement already satisfied, skipping upgrade: jieba in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.42.1)
Requirement already satisfied, skipping upgrade: h5py in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (2.9.0)
Requirement already satisfied, skipping upgrade: multiprocess in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.70.11.1)
Requirement already satisfied, skipping upgrade: seqeval in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (1.2.2)
Requirement already satisfied, skipping upgrade: six in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from h5py->paddlenlp) (1.15.0)
Requirement already satisfied, skipping upgrade: numpy>=1.7 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from h5py->paddlenlp) (1.20.3)
Requirement already satisfied, skipping upgrade: dill>=0.3.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from multiprocess->paddlenlp) (0.3.3)
Requirement already satisfied, skipping upgrade: scikit-learn>=0.21.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from seqeval->paddlenlp) (0.24.2)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (2.1.0)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (0.14.1)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (1.6.3)
Installing collected packages: paddlenlp
  Found existing installation: paddlenlp 2.0.7
    Uninstalling paddlenlp-2.0.7:
      Successfully uninstalled paddlenlp-2.0.7
Successfully installed paddlenlp-2.0.8
Part A. Sentiment Analysis Tasks
Human natural language carries rich sentiment: it expresses emotions (such as sadness and happiness), feelings (such as burnout and depression), preferences (such as like and dislike), personality traits, stances, and so on. Sentiment analysis is applied in scenarios such as product preference mining, consumption decision-making, and public opinion analysis. Automatically analyzing these sentiment tendencies not only helps companies understand how consumers feel about their products and provides a basis for product improvement, but also helps them analyze the attitudes of business partners in order to make better business decisions.
The best-known sentiment analysis task is to classify a piece of text by polarity, for example as a three-way classification into positive, negative, and other:

Sentiment analysis task
- Positive: positive emotions, such as happiness, surprise, and anticipation.
- Negative: negative emotions, such as sadness, grief, anger, and panic.
- Other: other types of emotions.
The familiar task above is, in fact, sentence-level sentiment analysis.
Sentiment analysis tasks can be further divided into sentence-level sentiment analysis, target-level sentiment analysis, and so on. The following chapters introduce both tasks and their application scenarios in detail.
Part B. The Sentiment Analysis Pre-trained Model SKEP
In recent years, numerous studies have shown that pre-trained models (PTMs) trained on large corpora learn general language representations that benefit downstream NLP tasks and avoid training models from scratch. With the growth of computing power, the emergence of deep architectures such as the Transformer, and improved training techniques, PTMs have steadily evolved from shallow to deep.
SKEP (Sentiment Knowledge Enhanced Pre-training for sentiment analysis) is a sentiment pre-training algorithm proposed by the Baidu research team. It enhances a pre-trained model with sentiment knowledge and surpasses the previous state of the art on 14 typical Chinese and English sentiment analysis tasks; the work was accepted at ACL 2020. The algorithm automatically mines sentiment knowledge with unsupervised methods and then uses that knowledge to construct pre-training objectives, so that the model learns to understand sentiment semantics. SKEP provides a unified and powerful sentiment representation for all kinds of sentiment analysis tasks.
Paper: https://arxiv.org/abs/2005.05635
The Baidu research team further verified SKEP on 14 Chinese and English datasets covering three typical sentiment analysis tasks: sentence-level sentiment classification, aspect-level sentiment classification, and opinion role labeling.
For detailed experimental results, see: https://github.com/baidu/Senta#skep
Part C. Sentence-level Sentiment Analysis & Target-level Sentiment Analysis
Part C.1 Sentence-level Sentiment Analysis
Classifying the sentiment polarity of a given text is often used in scenarios such as movie review analysis and online forum opinion analysis. For example:
- The reason for choosing Zhujiang Garden is convenience. There is an escalator directly to the seaside. There are all kinds of restaurants, food galleries, shopping malls, supermarkets and stalls around. The decoration of the hotel is average, but it is clean. The pool is on the roof of the lobby, so it's very small, but my daughter likes it. The breakfast is western style and quite rich. Service? Average    1
- The keyboard of the 15.4-inch notebook is really cool. It is basically similar to that of a desktop computer. I really like the numeric keypad. It is very convenient to input numbers. It looks very beautiful and the workmanship is quite good    1
- The room is too small. Everything else is average.........    0
Here 1 represents positive sentiment and 0 represents negative sentiment.

Sentence-level sentiment analysis task
Common data sets
ChnSentiCorp is a widely used public dataset for Chinese sentiment analysis with two classes. PaddleNLP has this dataset built in, and it can be loaded with one line of code.
from paddlenlp.datasets import load_dataset

train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])
print(train_ds[0])
print(train_ds[1])
print(train_ds[2])
100%|██████████| 1909/1909 [00:00<00:00, 52488.26it/s]
{'text': "The reason for choosing Zhujiang Garden is convenience. There is an escalator directly to the seaside. There are all kinds of restaurants, food galleries, shopping malls, supermarkets and stalls around. The decoration of the hotel is average, but it is clean. The pool is on the roof of the lobby, so it's very small, but my daughter likes it. The breakfast is western style and quite rich. Service? Average", 'label': 1, 'qid': ''}
{'text': 'The keyboard of the 15.4-inch notebook is really cool. It is basically similar to that of a desktop computer. I really like the numeric keypad. It is very convenient to input numbers. It looks very beautiful and the workmanship is quite good', 'label': 1, 'qid': ''}
{'text': 'The room is too small. Everything else is average.........', 'label': 0, 'qid': ''}
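The loaded dataset also carries a label_list attribute, which is used below to set the number of classes when building the model. A minimal check (a sketch; the printed value is illustrative for the two-class ChnSentiCorp task):

# A quick sketch: inspect the label set of the loaded dataset.
print(train_ds.label_list)   # expected to print something like ['0', '1']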
SKEP model loading
PaddleNLP implements the SKEP pre-trained model, and it can be loaded with a single line of code.
The sentence-level sentiment analysis model is a standard text classification setup fine-tuned on top of SKEP: SKEP first extracts the semantic features of the sentence, and a classifier then predicts the label from those features.
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

# Specify the model name and load the model with one click
model = SkepForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="skep_ernie_1.0_large_ch",
    num_classes=len(train_ds.label_list))
# Similarly, load the corresponding tokenizer by specifying the model name; it is used to
# process text data, e.g. splitting text into tokens and converting tokens to token ids.
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")
[2021-08-31 02:47:43,539] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.pdparams and saved to /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch
[2021-08-31 02:47:43,541] [    INFO] - Downloading skep_ernie_1.0_large_ch.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.pdparams
100%|██████████| 1238309/1238309 [00:16<00:00, 73468.79it/s]
[2021-08-31 02:48:11,365] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.vocab.txt and saved to /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch
[2021-08-31 02:48:11,368] [    INFO] - Downloading skep_ernie_1.0_large_ch.vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.vocab.txt
100%|██████████| 55/55 [00:00<00:00, 2465.55it/s]
SkepForSequenceClassification can be used for both sentence-level and target-level sentiment analysis. It obtains the representation of the input text through the pre-trained SKEP model and then classifies that representation. Its two main arguments are listed below; a short loading sketch for the English variant follows the list.
- pretrained_model_name_or_path: the model name. "skep_ernie_1.0_large_ch" and "skep_ernie_2.0_large_en" are supported.
  - "skep_ernie_1.0_large_ch": the Chinese SKEP model, obtained by continuing pre-training on massive Chinese data on top of the pre-trained ernie_1.0_large_ch model;
  - "skep_ernie_2.0_large_en": the English SKEP model, obtained by continuing pre-training on massive English data on top of the pre-trained ernie_2.0_large_en model;
- num_classes: the number of classes in the dataset.
For details about SKEP model implementation, refer to: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/skep
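As a quick illustration of the parameters above, here is a minimal sketch (not part of the original notebook) that loads the English variant for a hypothetical two-class English task:

from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

# Sketch only: load the English SKEP model listed above for a 2-class task.
en_model = SkepForSequenceClassification.from_pretrained(
    "skep_ernie_2.0_large_en", num_classes=2)
en_tokenizer = SkepTokenizer.from_pretrained("skep_ernie_2.0_large_en")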
data processing
Similarly, we need to process the raw ChnSentiCorp data into a format the model can read.
The SKEP model processes Chinese text at character granularity. We can use the SkepTokenizer built into PaddleNLP to do this processing in one call.
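Before wrapping this into a function, here is a minimal sketch of what SkepTokenizer returns for a single sentence (the example sentence is made up; the two fields shown are the ones convert_example uses below):

# Sketch only: tokenize one short Chinese review.
sample = tokenizer(text="酒店很干净，早餐也不错", max_seq_len=128)
print(sample["input_ids"])       # token ids, starting with [CLS] and ending with [SEP]
print(sample["token_type_ids"])  # all zeros for a single sentence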
import os
from functools import partial

import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader


def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequences for sequence classification
    tasks by concatenating and adding special tokens, and creates a mask from the two
    sequences for sequence-pair classification tasks.

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:

    - single sequence: ``[CLS] X [SEP]``
    - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence      | second sequence |

    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).

    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from
            :class:`~paddlenlp.transformers.PretrainedTokenizer` which contains most of the methods.
            Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`bool`, defaults to `False`): Whether the example contains a label or not.

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj:`list[int]`): List of sequence pair mask.
        label(obj:`int`, optional): The input label if not is_test.
    """
    # Process the raw data into a format the model can read; encoded_inputs is a dict
    # containing fields such as input_ids and token_type_ids.
    encoded_inputs = tokenizer(text=example["text"], max_seq_len=max_seq_length)

    # input_ids: the vocabulary ids of the tokens the text is split into
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids: whether the current token belongs to sentence 1 or sentence 2,
    # i.e. the segment ids shown in the figure above
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        # label: sentiment polarity class
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid: id of each example
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid
# Batch size
batch_size = 32
# Maximum length of the text sequence
max_seq_length = 256

# Process the data into a format the model can read
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

# Group examples into batches, e.g.
# pad text sequences of different lengths to the maximum length within the batch,
# and stack the labels of the examples together
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()                                            # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
Model training and evaluation
After defining the loss function, optimizer, and evaluation metric, training can begin.
Recommended hyperparameter settings:
- max_seq_length=256
- batch_size=48
- learning_rate=2e-5
- epochs=10
In practice, batch_size and max_seq_length can be adjusted to fit the available GPU memory.
import time

from utils import evaluate

# Number of training epochs
epochs = 1
# Folder for saving model parameters during training
ckpt_dir = "skep_ckpt"
# len(train_data_loader) is the number of steps needed for one epoch
num_training_steps = len(train_data_loader) * epochs

# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5,
    parameters=model.parameters())
# Cross-entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy metric
metric = paddle.metric.Accuracy()
# Start training
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        # Feed the data to the model
        logits = model(input_ids, token_type_ids)
        # Compute the loss
        loss = criterion(logits, labels)
        # Predicted class probabilities
        probs = F.softmax(logits, axis=1)
        # Compute accuracy
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                   10 / (time.time() - tic_train)))
            tic_train = time.time()

        # Backpropagate gradients and update parameters
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # Evaluate the current model
            evaluate(model, criterion, metric, dev_data_loader)
            # Save the current model parameters
            model.save_pretrained(save_dir)
            # Save the tokenizer vocabulary, etc.
            tokenizer.save_pretrained(save_dir)
global step 10, epoch: 1, batch: 10, loss: 0.57127, accu: 0.61562, speed: 0.75 step/s
global step 20, epoch: 1, batch: 20, loss: 0.46069, accu: 0.71406, speed: 0.78 step/s
global step 30, epoch: 1, batch: 30, loss: 0.14763, accu: 0.77500, speed: 0.74 step/s
global step 40, epoch: 1, batch: 40, loss: 0.19220, accu: 0.81172, speed: 0.70 step/s
global step 50, epoch: 1, batch: 50, loss: 0.16075, accu: 0.83125, speed: 0.72 step/s
global step 60, epoch: 1, batch: 60, loss: 0.31677, accu: 0.84688, speed: 0.71 step/s
global step 70, epoch: 1, batch: 70, loss: 0.36520, accu: 0.85804, speed: 0.78 step/s
global step 80, epoch: 1, batch: 80, loss: 0.40315, accu: 0.86250, speed: 0.74 step/s
global step 90, epoch: 1, batch: 90, loss: 0.25542, accu: 0.86944, speed: 0.71 step/s
global step 100, epoch: 1, batch: 100, loss: 0.17580, accu: 0.87500, speed: 0.72 step/s
eval loss: 0.23052, accu: 0.91750
global step 110, epoch: 1, batch: 110, loss: 0.15768, accu: 0.90938, speed: 0.27 step/s
global step 120, epoch: 1, batch: 120, loss: 0.14812, accu: 0.91250, speed: 0.70 step/s
global step 130, epoch: 1, batch: 130, loss: 0.16721, accu: 0.91146, speed: 0.71 step/s
global step 140, epoch: 1, batch: 140, loss: 0.08221, accu: 0.91250, speed: 0.74 step/s
global step 150, epoch: 1, batch: 150, loss: 0.08695, accu: 0.91750, speed: 0.73 step/s
global step 160, epoch: 1, batch: 160, loss: 0.23525, accu: 0.91823, speed: 0.73 step/s
global step 170, epoch: 1, batch: 170, loss: 0.14445, accu: 0.91830, speed: 0.70 step/s
global step 180, epoch: 1, batch: 180, loss: 0.02297, accu: 0.92070, speed: 0.76 step/s
global step 190, epoch: 1, batch: 190, loss: 0.22582, accu: 0.92083, speed: 0.70 step/s
global step 200, epoch: 1, batch: 200, loss: 0.44752, accu: 0.91969, speed: 0.74 step/s
eval loss: 0.19852, accu: 0.93000
global step 210, epoch: 1, batch: 210, loss: 0.29137, accu: 0.91563, speed: 0.27 step/s
global step 220, epoch: 1, batch: 220, loss: 0.21413, accu: 0.93281, speed: 0.72 step/s
global step 230, epoch: 1, batch: 230, loss: 0.22312, accu: 0.93333, speed: 0.71 step/s
global step 240, epoch: 1, batch: 240, loss: 0.31168, accu: 0.93594, speed: 0.71 step/s
global step 250, epoch: 1, batch: 250, loss: 0.25115, accu: 0.93375, speed: 0.72 step/s
global step 260, epoch: 1, batch: 260, loss: 0.19861, accu: 0.93490, speed: 0.71 step/s
global step 270, epoch: 1, batch: 270, loss: 0.12701, accu: 0.93661, speed: 0.71 step/s
global step 280, epoch: 1, batch: 280, loss: 0.08258, accu: 0.93672, speed: 0.72 step/s
global step 290, epoch: 1, batch: 290, loss: 0.15821, accu: 0.93715, speed: 0.69 step/s
global step 300, epoch: 1, batch: 300, loss: 0.07628, accu: 0.93781, speed: 0.76 step/s
eval loss: 0.16263, accu: 0.95417
Predict and submit results
The trained model can now be used to predict the sentiment of new text.
import numpy as np
import paddle

# Process the test set
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack()                                            # qid
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
# Change the parameter path according to your actual run
params_path = 'skep_ckpt/model_500/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)
label_map = {0: '0', 1: '1'}
results = []
# Switch the model to evaluation mode and disable stochastic components such as dropout
model.eval()
for batch in test_data_loader:
    input_ids, token_type_ids, qids = batch
    # Feed the data to the model
    logits = model(input_ids, token_type_ids)
    # Predict the class
    probs = F.softmax(logits, axis=-1)
    idx = paddle.argmax(probs, axis=1).numpy()
    idx = idx.tolist()
    labels = [label_map[i] for i in idx]
    qids = qids.numpy().tolist()
    results.extend(zip(qids, labels))
res_dir = "./results"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)
# Write the prediction results
with open(os.path.join(res_dir, "ChnSentiCorp.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for qid, label in results:
        f.write(str(qid[0]) + "\t" + label + "\n")
Part C.2 Target-level Sentiment Analysis
In e-commerce product analysis, besides analyzing the overall sentiment polarity toward a product, the analysis is also refined so that specific "aspects" of the product become the subject of sentiment analysis (aspect level), for example:
- This potato chip is a little salty and too spicy, but the texture is very crisp.
There is a negative evaluation of the chips' flavor (salty, too spicy) but a positive evaluation of their texture (very crisp).
- I like Hawaii very much, but the seafood here is too expensive.
There is a positive evaluation of Hawaii (liking it) but a negative evaluation of Hawaiian seafood (too expensive). A sketch of how such examples are organized as data follows the figure below.

Target-level sentiment analysis task
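A hypothetical sketch of how the potato-chip example above could be organized as aspect-level examples (the text/text_pair/label field names follow the SE-ABSA16_PHNS format shown below; the aspect strings are made up):

# Sketch only: one (aspect, comment) pair per example, each with its own polarity label.
examples = [
    {"text": "potato_chips#taste",
     "text_pair": "This potato chip is a little salty and too spicy, but the texture is very crisp.",
     "label": 0},
    {"text": "potato_chips#texture",
     "text_pair": "This potato chip is a little salty and too spicy, but the texture is very crisp.",
     "label": 1},
]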
Common data sets
The Qianyan ("Thousand Words", LUGE) benchmark already provides common datasets for many tasks.
Download link for the sentiment analysis datasets: https://aistudio.baidu.com/aistudio/competition/detail/50/?isFromLUGE=TRUE
SE-ABSA16_PHNS is a target-level sentiment analysis dataset about mobile phones. It is built into PaddleNLP and can be loaded as follows:
train_ds, test_ds = load_dataset("seabsa16", "phns", splits=["train", "test"])
print(train_ds[0])
print(train_ds[1])
print(train_ds[2])
100%|██████████| 381/381 [00:00<00:00, 19566.67it/s]
{'text': 'phone#design_features', 'text_pair': "Today, I was lucky to get the real Hong Kong version of the white iPhone 5. I tried it. Let's talk about my feelings: 1. The size and width of the real machine remain the same as 4/4s, and the length is about one centimeter longer, that is, a row of more icons as mentioned earlier. 2. The weight of the real machine is much lighter than that of the previous generation. Personally, it feels like the weight of the i9100. (It may take some time for friends who are used to the previous generation to adapt.) 3. Since there is no cut-SIM version at present, the card can't simply be inserted. Friends who buy it should note that they can't simply cut the card, but need to go to the operator to replace it with the new generation of SIM card. 4. The screen display effect is indeed better than that of the previous generation. Whether from the perspective of clarity or viewing angles, the iPhone 5 is definitely better. I think this may be the most meaningful upgrade compared with the previous generation. 5. The new data interface is smaller, better and more convenient than the previous generation. You will have this experience in the use process. 6. In terms of simple operations, the speed is faster than the 4s, which can be felt without testing software, such as launching programs, taking photos and browsing. However, given the current rip-off price in the parallel-import market, we'd better wait and see and not rush.", 'label': 1}
{'text': 'display#quality', 'text_pair': <same review text as above>, 'label': 1}
{'text': 'ports#connectivity', 'text_pair': <same review text as above>, 'label': 1}
SKEP model loading
The target-level sentiment analysis model also uses SkepForSequenceClassification, but its input is not a single sentence: it is a sentence pair. One sentence describes the "aspect of the evaluated object" and the other contains the "comment on that aspect", as shown in the figure below.
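A rough sketch of how such a pair is passed to the tokenizer (illustrative only; the aspect string comes from the SE-ABSA16_PHNS sample above, and the comment text is shortened):

# Sketch only: the aspect goes into `text` and the comment into `text_pair`,
# producing a [CLS] aspect [SEP] comment [SEP] sequence.
pair = tokenizer(
    text="phone#design_features",
    text_pair="Today, I was lucky to get the real Hong Kong version of the white iPhone 5.",
    max_seq_len=256)
print(pair["token_type_ids"])  # 0s for the aspect tokens, 1s for the comment tokens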
# Specify the model name and load the model with one click
model = SkepForSequenceClassification.from_pretrained(
    'skep_ernie_1.0_large_ch', num_classes=len(train_ds.label_list))
# Specify the model name and load the tokenizer with one click
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')
[2021-08-31 08:35:02,648] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
[2021-08-31 08:35:08,135] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.vocab.txt
data processing
Similarly, we need to process the raw SE-ABSA16_PHNS data into a format the model can read.
The SKEP model processes Chinese text at character granularity. We can use the SkepTokenizer built into PaddleNLP to do this processing in one call.
from functools import partial
import os
import time

import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad


def convert_example(example, tokenizer, max_seq_length=512, is_test=False, dataset_name="chnsenticorp"):
    """
    Builds model inputs from a sequence or a pair of sequences for sequence classification
    tasks by concatenating and adding special tokens, and creates a mask from the two
    sequences for sequence-pair classification tasks.

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:

    - single sequence: ``[CLS] X [SEP]``
    - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence      | second sequence |

    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).

    Note: the skep_roberta_large_ch model does not need token type ids.

    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from
            :class:`~paddlenlp.transformers.PretrainedTokenizer` which contains most of the methods.
            Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`bool`, defaults to `False`): Whether the example contains a label or not.
        dataset_name(obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2".

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj:`list[int]`): List of sequence pair mask.
        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"],
        text_pair=example["text_pair"],
        max_seq_len=max_seq_length)

    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
# Maximum length of the text sequence
max_seq_length = 256
# Batch size
batch_size = 16

# Process the data into a format the model can read
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

# Group examples into batches, e.g.
# pad text sequences of different lengths to the maximum length within the batch,
# and stack the labels of the examples together
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64")                               # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
model training
After defining the loss function, optimizer, and evaluation metric, training can begin.
# Number of training epochs
epochs = 3
# Total number of training steps
num_training_steps = len(train_data_loader) * epochs

# Optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=5e-5,
    parameters=model.parameters())
# Cross-entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy metric
metric = paddle.metric.Accuracy()
# Start training
ckpt_dir = "skep_aspect"
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        # Feed the data to the model
        logits = model(input_ids, token_type_ids)
        # Compute the loss
        loss = criterion(logits, labels)
        # Predicted class probabilities
        probs = F.softmax(logits, axis=1)
        # Compute accuracy
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                   10 / (time.time() - tic_train)))
            tic_train = time.time()

        # Backpropagate gradients and update parameters
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # Save model parameters
            model.save_pretrained(save_dir)
            # Save the tokenizer vocabulary, etc.
            tokenizer.save_pretrained(save_dir)
global step 10, epoch: 1, batch: 10, loss: 0.62798, acc: 0.60000, speed: 1.25 step/s
global step 20, epoch: 1, batch: 20, loss: 0.67770, acc: 0.60625, speed: 1.25 step/s
global step 30, epoch: 1, batch: 30, loss: 0.66418, acc: 0.60000, speed: 1.24 step/s
global step 40, epoch: 1, batch: 40, loss: 0.67216, acc: 0.61562, speed: 1.25 step/s
global step 50, epoch: 1, batch: 50, loss: 0.74741, acc: 0.59125, speed: 1.24 step/s
global step 60, epoch: 1, batch: 60, loss: 0.62740, acc: 0.59479, speed: 1.24 step/s
global step 70, epoch: 1, batch: 70, loss: 0.55040, acc: 0.60446, speed: 1.24 step/s
global step 80, epoch: 1, batch: 80, loss: 0.59344, acc: 0.61719, speed: 1.25 step/s
global step 90, epoch: 2, batch: 6, loss: 0.41540, acc: 0.61592, speed: 1.28 step/s
global step 100, epoch: 2, batch: 16, loss: 0.53867, acc: 0.62249, speed: 1.23 step/s
global step 110, epoch: 2, batch: 26, loss: 0.63824, acc: 0.61986, speed: 0.85 step/s
global step 120, epoch: 2, batch: 36, loss: 0.64022, acc: 0.62552, speed: 1.23 step/s
global step 130, epoch: 2, batch: 46, loss: 0.53340, acc: 0.62983, speed: 1.24 step/s
global step 140, epoch: 2, batch: 56, loss: 0.47430, acc: 0.63665, speed: 1.23 step/s
global step 150, epoch: 2, batch: 66, loss: 0.51793, acc: 0.63880, speed: 1.24 step/s
global step 160, epoch: 2, batch: 76, loss: 0.60049, acc: 0.64185, speed: 1.24 step/s
global step 170, epoch: 3, batch: 2, loss: 0.48817, acc: 0.63831, speed: 1.30 step/s
global step 180, epoch: 3, batch: 12, loss: 0.78223, acc: 0.64246, speed: 1.23 step/s
global step 190, epoch: 3, batch: 22, loss: 0.68268, acc: 0.64484, speed: 1.24 step/s
global step 200, epoch: 3, batch: 32, loss: 0.44891, acc: 0.64730, speed: 1.24 step/s
global step 210, epoch: 3, batch: 42, loss: 0.76054, acc: 0.64773, speed: 0.86 step/s
global step 220, epoch: 3, batch: 52, loss: 0.59863, acc: 0.65068, speed: 1.24 step/s
global step 230, epoch: 3, batch: 62, loss: 0.41296, acc: 0.65284, speed: 1.24 step/s
global step 240, epoch: 3, batch: 72, loss: 0.51121, acc: 0.65664, speed: 1.23 step/s
global step 250, epoch: 3, batch: 82, loss: 0.35234, acc: 0.65838, speed: 1.26 step/s
Predict and submit results
The trained model can now be used to predict the sentiment toward the evaluated aspects.
@paddle.no_grad()
def predict(model, data_loader, label_map):
    """
    Given a prediction dataset, gives the prediction results.

    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        label_map(obj:`dict`): The label id (key) to label str (value) map.
    """
    model.eval()
    results = []
    for batch in data_loader:
        input_ids, token_type_ids = batch
        logits = model(input_ids, token_type_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results
# Process the test set
label_map = {0: '0', 1: '1'}
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
# Change the parameter path according to your actual run
params_path = 'skep_ckpt/model_900/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

results = predict(model, test_data_loader, label_map)
# Write the prediction results
with open(os.path.join("results", "SE-ABSA16_PHNS.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for idx, label in enumerate(results):
        f.write(str(idx) + "\t" + label + "\n")
Compress the prediction results into a zip file and submit it on the Qianyan competition website.
Note: the NLPCC14-SC.tsv, SE-ABSA16_CAME.tsv, COTE_BD.tsv, COTE_MFW.tsv, and COTE_DP.tsv files in the results folder are placeholder files included so that the submission goes through smoothly.
The results need to be improved.
# Compress the prediction results into a zip file for submission
!zip -r results.zip results
The implementation above is based on PaddleNLP. Open source is not easy; we hope you will support it~
Remember to give PaddleNLP a Star ⭐ to keep up with the latest updates and features.
GitHub address: https://github.com/PaddlePaddle/PaddleNLP