NLP Check-in Camp Practice Lesson 5: Text Sentiment Analysis

NLP Live Lesson Day 5: Sentiment Analysis Pre-trained Model SKEP

This project introduces the two sub-tasks of sentiment analysis: sentence-level sentiment analysis and target-level sentiment analysis.

It also demonstrates how to use the sentiment analysis pre-trained model SKEP to complete both tasks, and explains the SKEP model and its usage in PaddleNLP in detail.

The project covers four parts: "Task Introduction", "Sentiment Analysis Pre-trained Model SKEP", "Sentence-level Sentiment Analysis", and "Target-level Sentiment Analysis".

The AI Studio platform has PaddlePaddle and PaddleNLP installed by default, and the versions are updated regularly. To update PaddlePaddle manually, refer to the PaddlePaddle installation guide and install the latest version of the framework for your environment.

Make sure the latest version of PaddleNLP is installed with the following command:

!pip install --upgrade paddlenlp -i https://pypi.org/simple 
Collecting paddlenlp
  Downloading https://files.pythonhosted.org/packages/b0/7d/6c24cda54d018d350ee342f715523ade7871660444ed95f3d3e753d6f388/paddlenlp-2.0.8-py3-none-any.whl (571kB)
     |████████████████████████████████| 573kB 291kB/s eta 0:00:01
Requirement already satisfied, skipping upgrade: colorama in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.4.4)
Requirement already satisfied, skipping upgrade: colorlog in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (4.1.0)
Requirement already satisfied, skipping upgrade: jieba in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.42.1)
Requirement already satisfied, skipping upgrade: h5py in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (2.9.0)
Requirement already satisfied, skipping upgrade: multiprocess in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.70.11.1)
Requirement already satisfied, skipping upgrade: seqeval in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (1.2.2)
Requirement already satisfied, skipping upgrade: six in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from h5py->paddlenlp) (1.15.0)
Requirement already satisfied, skipping upgrade: numpy>=1.7 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from h5py->paddlenlp) (1.20.3)
Requirement already satisfied, skipping upgrade: dill>=0.3.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from multiprocess->paddlenlp) (0.3.3)
Requirement already satisfied, skipping upgrade: scikit-learn>=0.21.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from seqeval->paddlenlp) (0.24.2)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (2.1.0)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (0.14.1)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp) (1.6.3)
Installing collected packages: paddlenlp
  Found existing installation: paddlenlp 2.0.7
    Uninstalling paddlenlp-2.0.7:
      Successfully uninstalled paddlenlp-2.0.7
Successfully installed paddlenlp-2.0.8

Part A. Sentiment Analysis Task

Human natural language is rich in emotional color: it expresses people's emotions (such as sadness and happiness), moods (such as burnout and depression), preferences (such as like and dislike), personality traits, and standpoints. Sentiment analysis is applied in scenarios such as product preference mining, consumption decision-making, and public opinion analysis. Using machines to automatically analyze these emotional tendencies helps enterprises understand how consumers feel about their products and provides a basis for product improvement; it also helps enterprises analyze the attitudes of business partners so they can make better business decisions.

The most familiar sentiment analysis task is classifying the polarity of a piece of text, for example as a three-class problem with positive, negative, and other labels:


(Figure: sentiment analysis task)
  • Positive: positive emotions, such as happiness, surprise, expectation, etc.
  • Negative: negative emotions, such as sadness, grief, anger, panic, etc.
  • Other: other types of emotions.

The familiar task described above is in fact a sentence-level sentiment analysis task.

Sentiment analysis tasks can be further divided into sentence-level sentiment analysis, target-level sentiment analysis, and so on. The following sections introduce the two tasks and their application scenarios in detail.
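To make the difference concrete, here is a minimal illustrative sketch of the inputs and outputs of the two tasks; the texts, aspect names, and labels below are made up for illustration and are not taken from the datasets used later.

# Sentence-level sentiment analysis: one piece of text -> one polarity label
sentence_level_example = {
    "text": "The room is too small. Everything else is average.",
    "label": 0,  # 0 = negative, 1 = positive
}

# Target-level (aspect-level) sentiment analysis: an (aspect, text) pair -> one polarity label
target_level_example = {
    "text": "potato_chips#texture",  # the aspect being evaluated (hypothetical name)
    "text_pair": "This potato chip tastes a little salty and too spicy, but its texture is very crisp.",
    "label": 1,  # positive about the texture
}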

Part B. Sentiment Analysis Pre-trained Model SKEP

In recent years, a large number of studies have shown that pre-trained models (PTMs) trained on large corpora can learn general language representations, which benefits downstream NLP tasks and avoids training models from scratch. With the growth of computing power, the emergence of deep architectures (i.e., the Transformer), and improved training techniques, PTMs have kept evolving from shallow to deep.

SKEP (Sentiment Knowledge Enhanced Pre-training for sentiment analysis) is a sentiment pre-training algorithm based on sentiment knowledge enhancement, proposed by the Baidu research team. SKEP uses sentiment knowledge to enhance the pre-trained model and comprehensively surpasses the previous SOTA on 14 typical Chinese and English sentiment analysis tasks; the work was accepted at ACL 2020. The algorithm mines sentiment knowledge automatically in an unsupervised way and then uses that knowledge to construct pre-training objectives, so that the model learns to understand sentiment semantics. SKEP provides a unified and powerful sentiment semantic representation for all kinds of sentiment analysis tasks.

Paper: https://arxiv.org/abs/2005.05635


The Baidu research team further verified the effect of SKEP on 14 Chinese and English datasets covering three typical sentiment analysis tasks: sentence-level sentiment classification, aspect-level sentiment classification, and opinion role labeling.

For detailed experimental results, see: https://github.com/baidu/Senta#skep

Part C. Sentence-level Sentiment Analysis & Target-level Sentiment Analysis

Part C.1 Sentence-level Sentiment Analysis

Sentence-level sentiment analysis classifies the sentiment polarity of a given text; it is often used in movie review analysis, online forum opinion analysis, and similar scenarios. For example:

The reason for choosing Zhujiang garden is convenience. There is an escalator directly to the seaside. There are all kinds of restaurants, food galleries, shopping malls, supermarkets and stalls around. The decoration of the hotel is average, but it is clean. The pool is on the roof of the lobby, so it's very small, but my daughter likes it. The breakfast is western style and quite rich. Service? Average	1
The keyboard of the 15.4-inch notebook is really cool. It is basically similar to that of a desktop computer. I really like the numeric keypad. It is very convenient to input numbers. It looks very beautiful and the workmanship is quite good	1
 The room is too small. Everything else is average.........	0

Here 1 represents positive sentiment and 0 represents negative sentiment.



(Figure: sentence-level sentiment analysis task)

Common datasets

ChnSentiCorp is a commonly used public Chinese sentiment analysis dataset with two classes. PaddleNLP has this dataset built in, and it can be loaded with one line of code.

from paddlenlp.datasets import load_dataset

train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])

print(train_ds[0])
print(train_ds[1])
print(train_ds[2])
100%|██████████| 1909/1909 [00:00<00:00, 52488.26it/s]


{'text': 'The reason for choosing Zhujiang garden is convenience. There is an escalator directly to the seaside. There are all kinds of restaurants, food galleries, shopping malls, supermarkets and stalls around. The decoration of the hotel is average, but it is clean. The pool is on the roof of the lobby, so it's very small, but my daughter likes it. The breakfast is western style and quite rich. Service? Average', 'label': 1, 'qid': ''}
{'text': '15.4 The keyboard of inch notebook is really cool. It is basically similar to that of desktop computer. I really like the digital keypad. It is very convenient to input numbers. It looks very beautiful and the workmanship is quite good', 'label': 1, 'qid': ''}
{'text': 'The room is too small. Everything else is average.........', 'label': 0, 'qid': ''}
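The dataset's label list can be inspected directly; its length is what determines num_classes when the model is created below (an optional check, the printed value is indicative):

# The label list of ChnSentiCorp; its length is used as num_classes below
print(train_ds.label_list)  # expected to print something like ['0', '1']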

SKEP model loading

PaddleNLP implements the SKEP pre-trained model, so SKEP can be loaded with one line of code.

The sentence-level sentiment analysis model is a common text-classification fine-tuning setup for SKEP: first SKEP extracts the semantic features of the sentence, and then those features are classified.

from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

# Specify the model name and load the model with one click
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch", num_classes=len(train_ds.label_list))
# Similarly, specify the model name to load the corresponding tokenizer with one line; it is used to process the text data, e.g. splitting text into tokens and converting tokens to token ids
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")
[2021-08-31 02:47:43,539] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.pdparams and saved to /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch
[2021-08-31 02:47:43,541] [    INFO] - Downloading skep_ernie_1.0_large_ch.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.pdparams
100%|██████████| 1238309/1238309 [00:16<00:00, 73468.79it/s]
[2021-08-31 02:48:11,365] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.vocab.txt and saved to /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch
[2021-08-31 02:48:11,368] [    INFO] - Downloading skep_ernie_1.0_large_ch.vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.vocab.txt
100%|██████████| 55/55 [00:00<00:00, 2465.55it/s]

SkepForSequenceClassification can be used for both sentence-level and target-level sentiment analysis tasks. It obtains a representation of the input text through the pre-trained SKEP model and then classifies that representation.

  • pretrained_model_name_or_path: model name. "skep_ernie_1.0_large_ch" and "skep_ernie_2.0_large_en" are supported.

    • "skep_ernie_1.0_large_ch": the Chinese pre-trained model, obtained by continuing to pre-train the SKEP model on massive Chinese data, starting from the pre-trained ERNIE 1.0 large (ch) model;
    • "skep_ernie_2.0_large_en": the English pre-trained model, obtained by continuing to pre-train the SKEP model on massive English data, starting from the pre-trained ERNIE 2.0 large (en) model;
  • num_classes: number of classification categories in the dataset.

For details about SKEP model implementation, refer to: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/skep
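To make it clearer how the classification head sits on top of the encoder, the following is a minimal simplified sketch of roughly what SkepForSequenceClassification does internally. This is an illustration rather than PaddleNLP's actual implementation, and the hidden size of 1024 for skep_ernie_1.0_large_ch is an assumption here.

import paddle.nn as nn
from paddlenlp.transformers import SkepModel

class SkepClassifierSketch(nn.Layer):
    def __init__(self, num_classes=2, hidden_size=1024, dropout=0.1):
        super().__init__()
        # SKEP encoder producing sequence and pooled ([CLS]) representations
        self.skep = SkepModel.from_pretrained("skep_ernie_1.0_large_ch")
        self.dropout = nn.Dropout(dropout)
        # Linear classifier on top of the pooled representation
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, token_type_ids=None):
        # SkepModel returns (sequence_output, pooled_output); only the pooled output is classified
        _, pooled_output = self.skep(input_ids, token_type_ids=token_type_ids)
        return self.classifier(self.dropout(pooled_output))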

data processing

Similarly, we need to process the raw ChnSentiCorp data into a format that the model can read in.

The SKEP model processes Chinese text at character granularity. We can use the SkepTokenizer built into PaddleNLP to do this processing in one step.

import os
from functools import partial


import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader

def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |

    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).


    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it have label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`int`, optional): The input label if not is_test.
    """
    # Process the raw data into a format the model can read in; encoded_inputs is a dict containing input_ids, token_type_ids and other fields
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    # input_ids: the corresponding token id in the vocabulary after the text is segmented into tokens
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids: whether the current token belongs to sentence 1 or sentence 2, i.e. the segment ids
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        # label: sentiment polarity class
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid: id of each example
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid
# Batch data size
batch_size = 32
# Maximum length of text sequence
max_seq_length = 256

# Process the data into a data format that the model can read in
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

# Form data into batch data, such as
# padding text sequences of different lengths to the maximum length of batch data
# Stack each data label together
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

Model training and evaluation

After defining the loss function, optimizer, and evaluation metric, you can start training.

Recommended hyperparameter settings:

  • max_seq_length=256
  • batch_size=48
  • learning_rate=2e-5
  • epochs=10

In practice, batch_size and max_seq_length can be adjusted according to the available GPU memory.
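If you want to follow the recommended settings more closely than the minimal training loop below, one common option is to pair AdamW with a warmup-then-linear-decay learning-rate schedule. This is an optional sketch, not part of the original loop; the warmup proportion of 0.1 is an assumed value.

from paddlenlp.transformers import LinearDecayWithWarmup

learning_rate = 2e-5
epochs = 10
warmup_proportion = 0.1  # assumed value, not specified by the tutorial

num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)

optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters())

# Inside the training loop, step the scheduler right after the optimizer:
#   optimizer.step()
#   lr_scheduler.step()
#   optimizer.clear_grad()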

import time

from utils import evaluate

# Training rounds
epochs = 1
# Folder to save model parameters during training
ckpt_dir = "skep_ckpt"
# len(train_data_loader) is the number of steps needed for one epoch of training
num_training_steps = len(train_data_loader) * epochs

# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5,
    parameters=model.parameters())
# Cross entropy loss function
criterion = paddle.nn.loss.CrossEntropyLoss()
# accuracy evaluation index
metric = paddle.metric.Accuracy()
# Start training
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        # Feed data to model
        logits = model(input_ids, token_type_ids)
        # Calculate loss function value
        loss = criterion(logits, labels)
        # Predicted classification probability value
        probs = F.softmax(logits, axis=1)
        # Calculate acc
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()
        
        # Back-propagate gradients and update parameters
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # Evaluate the current model on the dev set
            evaluate(model, criterion, metric, dev_data_loader)
            # Save current model parameters, etc
            model.save_pretrained(save_dir)
            # Save the vocabulary of tokenizer, etc
            tokenizer.save_pretrained(save_dir)
global step 10, epoch: 1, batch: 10, loss: 0.57127, accu: 0.61562, speed: 0.75 step/s
global step 20, epoch: 1, batch: 20, loss: 0.46069, accu: 0.71406, speed: 0.78 step/s
global step 30, epoch: 1, batch: 30, loss: 0.14763, accu: 0.77500, speed: 0.74 step/s
global step 40, epoch: 1, batch: 40, loss: 0.19220, accu: 0.81172, speed: 0.70 step/s
global step 50, epoch: 1, batch: 50, loss: 0.16075, accu: 0.83125, speed: 0.72 step/s
global step 60, epoch: 1, batch: 60, loss: 0.31677, accu: 0.84688, speed: 0.71 step/s
global step 70, epoch: 1, batch: 70, loss: 0.36520, accu: 0.85804, speed: 0.78 step/s
global step 80, epoch: 1, batch: 80, loss: 0.40315, accu: 0.86250, speed: 0.74 step/s
global step 90, epoch: 1, batch: 90, loss: 0.25542, accu: 0.86944, speed: 0.71 step/s
global step 100, epoch: 1, batch: 100, loss: 0.17580, accu: 0.87500, speed: 0.72 step/s
eval loss: 0.23052, accu: 0.91750
global step 110, epoch: 1, batch: 110, loss: 0.15768, accu: 0.90938, speed: 0.27 step/s
global step 120, epoch: 1, batch: 120, loss: 0.14812, accu: 0.91250, speed: 0.70 step/s
global step 130, epoch: 1, batch: 130, loss: 0.16721, accu: 0.91146, speed: 0.71 step/s
global step 140, epoch: 1, batch: 140, loss: 0.08221, accu: 0.91250, speed: 0.74 step/s
global step 150, epoch: 1, batch: 150, loss: 0.08695, accu: 0.91750, speed: 0.73 step/s
global step 160, epoch: 1, batch: 160, loss: 0.23525, accu: 0.91823, speed: 0.73 step/s
global step 170, epoch: 1, batch: 170, loss: 0.14445, accu: 0.91830, speed: 0.70 step/s
global step 180, epoch: 1, batch: 180, loss: 0.02297, accu: 0.92070, speed: 0.76 step/s
global step 190, epoch: 1, batch: 190, loss: 0.22582, accu: 0.92083, speed: 0.70 step/s
global step 200, epoch: 1, batch: 200, loss: 0.44752, accu: 0.91969, speed: 0.74 step/s
eval loss: 0.19852, accu: 0.93000
global step 210, epoch: 1, batch: 210, loss: 0.29137, accu: 0.91563, speed: 0.27 step/s
global step 220, epoch: 1, batch: 220, loss: 0.21413, accu: 0.93281, speed: 0.72 step/s
global step 230, epoch: 1, batch: 230, loss: 0.22312, accu: 0.93333, speed: 0.71 step/s
global step 240, epoch: 1, batch: 240, loss: 0.31168, accu: 0.93594, speed: 0.71 step/s
global step 250, epoch: 1, batch: 250, loss: 0.25115, accu: 0.93375, speed: 0.72 step/s
global step 260, epoch: 1, batch: 260, loss: 0.19861, accu: 0.93490, speed: 0.71 step/s
global step 270, epoch: 1, batch: 270, loss: 0.12701, accu: 0.93661, speed: 0.71 step/s
global step 280, epoch: 1, batch: 280, loss: 0.08258, accu: 0.93672, speed: 0.72 step/s
global step 290, epoch: 1, batch: 290, loss: 0.15821, accu: 0.93715, speed: 0.69 step/s
global step 300, epoch: 1, batch: 300, loss: 0.07628, accu: 0.93781, speed: 0.76 step/s
eval loss: 0.16263, accu: 0.95417

Prediction and result submission

The trained model can then be used to predict the sentiment of text.

import numpy as np
import paddle

# Processing test set data
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack() # qid
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
# Change the parameter path to match your actual run
params_path = 'skep_ckpt/model_500/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)
label_map = {0: '0', 1: '1'}
results = []
# Switch the model to the evaluation mode and turn off random factors such as dropout
model.eval()
for batch in test_data_loader:
    input_ids, token_type_ids, qids = batch
    # Feed data to model
    logits = model(input_ids, token_type_ids)
    # Forecast classification
    probs = F.softmax(logits, axis=-1)
    idx = paddle.argmax(probs, axis=1).numpy()
    idx = idx.tolist()
    labels = [label_map[i] for i in idx]
    qids = qids.numpy().tolist()
    results.extend(zip(qids, labels))
res_dir = "./results"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)
# Write prediction results
with open(os.path.join(res_dir, "ChnSentiCorp.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for qid, label in results:
        f.write(str(qid[0])+"\t"+label+"\n")
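Besides the batch prediction over the test set, the fine-tuned model can also be probed on a single sentence. The following is a small illustrative check; the example sentence is made up (and, for this Chinese model, you would normally use a Chinese sentence).

# Single-sentence sanity check (illustrative)
model.eval()
encoded = tokenizer(text="The hotel was clean and the staff were friendly.", max_seq_len=max_seq_length)
input_ids = paddle.to_tensor([encoded["input_ids"]])
token_type_ids = paddle.to_tensor([encoded["token_type_ids"]])
probs = F.softmax(model(input_ids, token_type_ids), axis=-1)
print(probs.numpy())  # e.g. [[p_negative, p_positive]]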

Part C.2 Target-level Sentiment Analysis

In e-commerce product analysis scenarios, besides analyzing the sentiment polarity of a product as a whole, the analysis can also be refined to take specific "aspects" of the product as the subject of sentiment analysis (aspect level), for example:

  • This potato chip tastes a little salty and too spicy, but its texture is very crisp.

There is a negative evaluation of the chips' flavor (salty, too spicy), but a positive evaluation of their texture (very crisp).

  • I like Hawaii very much, but the seafood here is too expensive.

There is a positive evaluation of Hawaii (likes it very much), but a negative evaluation of Hawaiian seafood (too expensive).



(Figure: target-level sentiment analysis task)
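To make the input format concrete, the potato-chip sentence above can be decomposed into several aspect-level records, each pairing an aspect with the full comment. This is an illustrative sketch; the aspect names and labels are made up and are not taken from a real dataset.

aspect_level_records = [
    {"text": "potato_chips#flavor",
     "text_pair": "This potato chip tastes a little salty and too spicy, but its texture is very crisp.",
     "label": 0},  # negative about the flavor
    {"text": "potato_chips#texture",
     "text_pair": "This potato chip tastes a little salty and too spicy, but its texture is very crisp.",
     "label": 1},  # positive about the texture
]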

Common datasets

The Qianyan (千言/LUGE) benchmark provides common datasets for many tasks.
Download link for the sentiment analysis datasets: https://aistudio.baidu.com/aistudio/competition/detail/50/?isFromLUGE=TRUE

SE-ABSA16_PHNS is a target-level sentiment analysis dataset about mobile phones. It is built into PaddleNLP and can be loaded as follows:

train_ds, test_ds = load_dataset("seabsa16", "phns", splits=["train", "test"])

print(train_ds[0])
print(train_ds[1])
print(train_ds[2])
100%|██████████| 381/381 [00:00<00:00, 19566.67it/s]


{'text': 'phone#design_features', 'text_pair': 'Today, I was lucky to get the real Hong Kong version of the white iPhone 5. I tried it; let me talk about my feelings: 1. The size and width of the real machine remain the same as the 4/4s, and the length is about one centimeter longer, that is, one more row of icons as mentioned earlier. 2. The weight of the real machine is much lighter than that of the previous generation; personally, it feels like the weight of the i9100. (It may take some time for friends who are used to the previous generation to adapt.) 3. Since there is no SIM-card version at present, the card cannot be inserted yet. Friends who buy it should note that they cannot simply cut the card, but need to go to the operator to replace it with the new type of SIM card. 4. The screen display effect is indeed better than that of the previous generation; whether in terms of clarity or viewing angles, the iPhone 5 is definitely better. I think this may be the most meaningful upgrade compared with the previous generation. 5. The new data interface is smaller, better and more convenient than the previous generation; you will notice this during use. 6. In terms of simple operations, the speed is faster than the 4s, which can be felt without testing software, for example when launching programs, taking photos and browsing. However, the current parallel-import price is a rip-off, so we had better wait and see and not rush to buy.', 'label': 1}
{'text': 'display#quality', 'text_pair': 'Today, I was lucky to get the real Hong Kong version of the white iPhone 5. I tried it; let me talk about my feelings: 1. The size and width of the real machine remain the same as the 4/4s, and the length is about one centimeter longer, that is, one more row of icons as mentioned earlier. 2. The weight of the real machine is much lighter than that of the previous generation; personally, it feels like the weight of the i9100. (It may take some time for friends who are used to the previous generation to adapt.) 3. Since there is no SIM-card version at present, the card cannot be inserted yet. Friends who buy it should note that they cannot simply cut the card, but need to go to the operator to replace it with the new type of SIM card. 4. The screen display effect is indeed better than that of the previous generation; whether in terms of clarity or viewing angles, the iPhone 5 is definitely better. I think this may be the most meaningful upgrade compared with the previous generation. 5. The new data interface is smaller, better and more convenient than the previous generation; you will notice this during use. 6. In terms of simple operations, the speed is faster than the 4s, which can be felt without testing software, for example when launching programs, taking photos and browsing. However, the current parallel-import price is a rip-off, so we had better wait and see and not rush to buy.', 'label': 1}
{'text': 'ports#connectivity', 'text_pair': 'Today, I was lucky to get the real Hong Kong version of the white iPhone 5. I tried it; let me talk about my feelings: 1. The size and width of the real machine remain the same as the 4/4s, and the length is about one centimeter longer, that is, one more row of icons as mentioned earlier. 2. The weight of the real machine is much lighter than that of the previous generation; personally, it feels like the weight of the i9100. (It may take some time for friends who are used to the previous generation to adapt.) 3. Since there is no SIM-card version at present, the card cannot be inserted yet. Friends who buy it should note that they cannot simply cut the card, but need to go to the operator to replace it with the new type of SIM card. 4. The screen display effect is indeed better than that of the previous generation; whether in terms of clarity or viewing angles, the iPhone 5 is definitely better. I think this may be the most meaningful upgrade compared with the previous generation. 5. The new data interface is smaller, better and more convenient than the previous generation; you will notice this during use. 6. In terms of simple operations, the speed is faster than the 4s, which can be felt without testing software, for example when launching programs, taking photos and browsing. However, the current parallel-import price is a rip-off, so we had better wait and see and not rush to buy.', 'label': 1}

SKEP model loading

The target-level sentiment analysis model also uses SkepForSequenceClassification, but its input is not a single sentence: it is a sentence pair. One sentence describes the "aspect of the evaluated object" and the other describes the "comment on that aspect", as illustrated by the encoding sketch after the loading code below.

# Specify the model name and load the model with one click
model = SkepForSequenceClassification.from_pretrained(
    'skep_ernie_1.0_large_ch', num_classes=len(train_ds.label_list))
# Specify the model name and load the tokenizer with one click
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')
[2021-08-31 08:35:02,648] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
[2021-08-31 08:35:08,135] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.vocab.txt
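As a quick illustration of the sentence-pair input, the tokenizer concatenates the two sentences with special tokens and marks them with segment ids. The aspect/comment pair below is made up for illustration.

# Illustrative (made-up) aspect / comment pair
pair = tokenizer(
    text="phone#design_features",
    text_pair="The phone is about one centimeter longer than the 4s and much lighter.",
    max_seq_len=64)
# token ids: [CLS] aspect tokens [SEP] comment tokens [SEP]
print(pair["input_ids"])
# segment ids: 0 for the first sentence (aspect), 1 for the second sentence (comment)
print(pair["token_type_ids"])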

data processing

Similarly, we need to process the original SE-ABSA16_PHNS data into a format that the model can read in.

The SKEP model processes Chinese text at character granularity. We can use the SkepTokenizer built into PaddleNLP to do this processing in one step.

from functools import partial
import os
import time

import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad


def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False,
                    dataset_name="chnsenticorp"):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |

    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    
    note: There is no need token type ids for skep_roberta_large_ch model.


    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it have label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.
        dataset_name((obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2".

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"],
        text_pair=example["text_pair"],
        max_seq_len=max_seq_length)

    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
# Maximum text sequence length processed
max_seq_length=256
# Batch data size
batch_size=16

# Process the data into a data format that the model can read in
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
# Form data into batch data, such as
# padding text sequences of different lengths to the maximum length of batch data
# Stack each data label together
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64")  # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

model training

After defining the loss function, optimizer, and evaluation metric, you can start training.

# Training rounds
epochs = 3
# Total number of training steps
num_training_steps = len(train_data_loader) * epochs
# optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=5e-5,
    parameters=model.parameters())
# Cross entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy evaluation index
metric = paddle.metric.Accuracy()
# Start training
ckpt_dir = "skep_aspect"
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        # Feed data to model
        logits = model(input_ids, token_type_ids)
        # Calculate loss function value
        loss = criterion(logits, labels)
        # Prediction classification probability
        probs = F.softmax(logits, axis=1)
        # Calculate acc
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()
        
        # Back-propagate gradients and update parameters
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

        if global_step % 100 == 0:
            
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # Save model parameters
            model.save_pretrained(save_dir)
            # Save the vocabulary of tokenizer, etc
            tokenizer.save_pretrained(save_dir)
global step 10, epoch: 1, batch: 10, loss: 0.62798, acc: 0.60000, speed: 1.25 step/s
global step 20, epoch: 1, batch: 20, loss: 0.67770, acc: 0.60625, speed: 1.25 step/s
global step 30, epoch: 1, batch: 30, loss: 0.66418, acc: 0.60000, speed: 1.24 step/s
global step 40, epoch: 1, batch: 40, loss: 0.67216, acc: 0.61562, speed: 1.25 step/s
global step 50, epoch: 1, batch: 50, loss: 0.74741, acc: 0.59125, speed: 1.24 step/s
global step 60, epoch: 1, batch: 60, loss: 0.62740, acc: 0.59479, speed: 1.24 step/s
global step 70, epoch: 1, batch: 70, loss: 0.55040, acc: 0.60446, speed: 1.24 step/s
global step 80, epoch: 1, batch: 80, loss: 0.59344, acc: 0.61719, speed: 1.25 step/s
global step 90, epoch: 2, batch: 6, loss: 0.41540, acc: 0.61592, speed: 1.28 step/s
global step 100, epoch: 2, batch: 16, loss: 0.53867, acc: 0.62249, speed: 1.23 step/s
global step 110, epoch: 2, batch: 26, loss: 0.63824, acc: 0.61986, speed: 0.85 step/s
global step 120, epoch: 2, batch: 36, loss: 0.64022, acc: 0.62552, speed: 1.23 step/s
global step 130, epoch: 2, batch: 46, loss: 0.53340, acc: 0.62983, speed: 1.24 step/s
global step 140, epoch: 2, batch: 56, loss: 0.47430, acc: 0.63665, speed: 1.23 step/s
global step 150, epoch: 2, batch: 66, loss: 0.51793, acc: 0.63880, speed: 1.24 step/s
global step 160, epoch: 2, batch: 76, loss: 0.60049, acc: 0.64185, speed: 1.24 step/s
global step 170, epoch: 3, batch: 2, loss: 0.48817, acc: 0.63831, speed: 1.30 step/s
global step 180, epoch: 3, batch: 12, loss: 0.78223, acc: 0.64246, speed: 1.23 step/s
global step 190, epoch: 3, batch: 22, loss: 0.68268, acc: 0.64484, speed: 1.24 step/s
global step 200, epoch: 3, batch: 32, loss: 0.44891, acc: 0.64730, speed: 1.24 step/s
global step 210, epoch: 3, batch: 42, loss: 0.76054, acc: 0.64773, speed: 0.86 step/s
global step 220, epoch: 3, batch: 52, loss: 0.59863, acc: 0.65068, speed: 1.24 step/s
global step 230, epoch: 3, batch: 62, loss: 0.41296, acc: 0.65284, speed: 1.24 step/s
global step 240, epoch: 3, batch: 72, loss: 0.51121, acc: 0.65664, speed: 1.23 step/s
global step 250, epoch: 3, batch: 82, loss: 0.35234, acc: 0.65838, speed: 1.26 step/s

Prediction and result submission

The trained model can then be used to predict the sentiment toward the evaluated aspects.

@paddle.no_grad()
def predict(model, data_loader, label_map):
    """
    Given a prediction dataset, it gives the prediction results.

    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        label_map(obj:`dict`): The label id (key) to label str (value) map.
    """
    model.eval()
    results = []
    for batch in data_loader:
        input_ids, token_type_ids = batch
        logits = model(input_ids, token_type_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results
# Processing test set data
label_map = {0: '0', 1: '1'}
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
# Change the parameter path to match your actual run
params_path = 'skep_ckpt/model_900/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

results = predict(model, test_data_loader, label_map)
# Write prediction results
with open(os.path.join("results", "SE-ABSA16_PHNS.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for idx, label in enumerate(results):
        f.write(str(idx)+"\t"+label+"\n")

Compress the prediction results into a zip file and submit them on the Qianyan (千言/LUGE) competition website.

Note: NLPCC14-SC.tsv, SE-ABSA16_CAME.tsv, COTE_BD.tsv, COTE_MFW.tsv, COTE_DP.tsv and the other files in the results folder are placeholder files included so that the submission goes through smoothly.
Their results still need to be improved.

# Compress the prediction results into a zip file for submission
!zip -r results.zip results

The above implementation is based on PaddleNLP. Open source is not easy; I hope you can support it~

Remember to give PaddleNLP a Star ⭐ to keep track of the latest updates and features.

GitHub address: https://github.com/PaddlePaddle/PaddleNLP

Tags: NLP paddlepaddle

Posted on Wed, 03 Nov 2021 21:41:18 -0400 by direland