Fine-grained sentiment analysis of e-commerce reviews (ABSA)

Problem description

  The food in the restaurant is good, but the environment is not easy to relax.

{Attribute: food; Opinion: good; Sentiment: positive}

{Attribute: environment; Opinion: not easy to relax; Sentiment: negative}

Given a review sentence, the task is to find which attribute the user commented on, what opinion they expressed about it, and whether the sentiment is positive or negative. It is essentially opinion extraction plus multi-class classification.

Solution ideas

There are two main approaches to this problem: 1. a joint model that solves it in one step; 2. a pipeline that splits the task into two steps.

Joint model solution


Advantages:

1. The problem is treated as a whole, so the relationships between the subtasks act as implicit constraints.

2. A single model is simple and convenient to use, and its output is easy for users to consume.

Disadvantages:

1. A more expressive model is needed, which means more annotated data is needed.

2. Annotation is harder. A single tag has to distinguish three tasks, so the tag set must carry a relatively large amount of information.

Two-stage method


Advantages:

1. The task can be split into existing NLP problems, for which more general annotated datasets and more pre-trained models are available.

2. The two stages can share model parameters, so less data needs to be labeled.

Disadvantages:

1. Because the work is done in two stages, the relationship between the stages has to be handled by hand. During training, the data distribution needs to be adjusted manually.

2. The mutual constraints between the two stages are not learned by a single network.

The two-stage method can itself be implemented in several ways. For example, attributes and opinions can both be extracted as entities, and the entity pairs are then classified to decide which attribute each opinion belongs to. Alternatively, the attributes are extracted as entities first, and each extracted attribute is then fed into a second model together with the comment for opinion extraction and sentiment classification. This article takes the second approach.

Implementation idea:

1. BERT + CRF for attribute extraction

2. BERT + span + softmax for opinion extraction and sentiment classification
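
The two steps above can be sketched as a small pipeline. This is only an illustration of how the stages connect: `analyze`, `toy_extract_attributes`, `toy_extract_opinion` and `LEXICON` are hypothetical names, and the rule-based stand-ins below take the place of the trained BERT models.

```python
def analyze(comment, extract_attributes, extract_opinion):
    """Stage 1 finds attributes; stage 2 runs once per attribute."""
    results = []
    for attribute in extract_attributes(comment):
        opinion, sentiment = extract_opinion(attribute, comment)
        results.append((attribute, opinion, sentiment))
    return results


# Toy stand-ins for the two trained models (hypothetical, rule-based).
LEXICON = {
    "food": ("good", "positive"),
    "environment": ("not easy to relax", "negative"),
}


def toy_extract_attributes(comment):
    # Stage 1 stand-in: return every known attribute that occurs in the comment.
    return [a for a in LEXICON if a in comment]


def toy_extract_opinion(attribute, comment):
    # Stage 2 stand-in: look up the opinion and sentiment for one attribute.
    return LEXICON[attribute]


comment = "The food in the restaurant is good, but the environment is not easy to relax."
triples = analyze(comment, toy_extract_attributes, toy_extract_opinion)
print(triples)
```

Passing the two stages in as functions keeps the pipeline independent of how each model is implemented, so either stage can be swapped out later.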

The structure of attribute extraction training samples is as follows:

Four O
Drive O
Price B-ORG
Grid I-ORG
Appearance O
Like O
Ting O
High O
of O
, O
High O
of O
Can O
With O
Look at O
Qi O
6 O
0 O
, O
Look at O
Real O
Car B-ORG
front I-ORG
Face I-ORG
Yes O
Point O
Violation O
And O
Sense O
. O
No O
Over O
Big B-ORG
many B-ORG
of I-ORG
Car I-ORG
Should O
This O
No O
Will O
Difference O
. O
. O
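
A tag sequence like the one above can be turned back into attribute spans with a short decoder. The following is a minimal sketch, assuming plain BIO tags and inclusive end positions; `bio_to_spans` is a hypothetical helper, not part of bert4keras.

```python
def bio_to_spans(tags):
    """Convert a list of BIO tags into (start, end, label) spans, end inclusive."""
    spans = []
    current = None  # [start, end, label] of the span being built
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A new span starts; close the previous one if there is one.
            if current:
                spans.append(tuple(current))
            current = [i, i, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            # Extend the open span.
            current[1] = i
        else:
            # "O" or an inconsistent "I-" tag closes the open span.
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans


tags = ["O", "O", "B-ORG", "I-ORG", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
spans = bio_to_spans(tags)
print(spans)
```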

Training data construction for opinion extraction and sentiment classification:

Each sample consists of the attribute followed by the comment, then the opinion span, its (start, end) position counted in the original Chinese text, and the sentiment label (-1 negative, 0 neutral, 1 positive):

Price, The 4WD price seems quite high, as high as the XC60. The front face of the real car looks a little off. But Volkswagen's cars should not be bad.   quite high (8, 9) 0

Front face, The 4WD price seems quite high, as high as the XC60. The front face of the real car looks a little off. But Volkswagen's cars should not be bad.   a little off (28, 32) -1

Volkswagen's car, The 4WD price seems quite high, as high as the XC60. The front face of the real car looks a little off. But Volkswagen's cars should not be bad.   not bad (42, 44) 1
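
Samples of this shape could be built with a small helper. The sketch below is an assumption about the construction, not the author's script: `build_sample` is a hypothetical name, the attribute and comment are joined with a comma, and the offsets are computed on the joined English text with an inclusive end.

```python
def build_sample(attribute, comment, opinion, sentiment, sep=","):
    """Return (text, opinion, (start, end), sentiment) or None if the opinion is absent."""
    text = attribute + sep + comment
    # Search only inside the comment part so the attribute prefix is never matched.
    offset = len(attribute) + len(sep)
    start = text.find(opinion, offset)
    if start == -1:
        return None
    end = start + len(opinion) - 1  # inclusive end position
    return (text, opinion, (start, end), sentiment)


comment = "The 4WD price seems quite high."
sample = build_sample("price", comment, "quite high", 0)
print(sample)
```

Samples whose opinion text cannot be found in the comment are dropped by returning `None`, which mirrors how the training generator later skips entities it cannot locate.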

Code implementation

Attribute extraction (implemented with named entity recognition code)

Environment setup + training dataset + BERT model parameters

# Download the People's Daily dataset
! wget
# Extract it
! tar -xzvf china-people-daily-ner-corpus.tar.gz
# Download BERT
! wget  
# Extract it
! unzip
# Set up the environment
!pip install tensorflow==2.1
!pip install  keras==2.3.1
# Install the bert4keras package from pip
!pip install  bert4keras
import numpy as np
from bert4keras.backend import keras, K
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.optimizers import Adam
from bert4keras.snippets import sequence_padding, DataGenerator
from bert4keras.snippets import open, ViterbiDecoder, to_array
from bert4keras.layers import ConditionalRandomField
from keras.layers import Dense
from keras.models import Model
from tqdm import tqdm

Model construction

maxlen = 256
epochs = 1
batch_size = 32 
bert_layers = 12
learning_rate = 2e-5  # The smaller bert_layers is, the higher the learning rate should be
crf_lr_multiplier = 1000  # Expand the learning rate of CRF layer when necessary
categories = set()

# bert configuration
config_path = './chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = './chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'

def load_data(filename):
    """Load data.
    Format of a single sample: [text, (start, end, label), (start, end, label), ...],
    meaning that text[start:end + 1] is an entity of type label.
    """
    D = []
    with open(filename, encoding='utf-8') as f:
        f = f.read()
        for l in f.split('\n\n'):
            if not l:
                continue
            d = ['']
            for i, c in enumerate(l.split('\n')):
                char, flag = c.split(' ')
                d[0] += char
                if flag[0] == 'B':
                    d.append([i, i, flag[2:]])
                    categories.add(flag[2:])  # collect the label set used later
                elif flag[0] == 'I':
                    d[-1][1] = i
            D.append(d)
    return D

# Load the labeled data
train_data = load_data('./sample/example.train')
valid_data = load_data('./sample/')
test_data = load_data('./sample/example.test')
categories = list(sorted(categories))

# Build the tokenizer
tokenizer = Tokenizer(dict_path, do_lower_case=True)

class data_generator(DataGenerator):
    """Data generator
    """
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, d in self.sample(random):
            tokens = tokenizer.tokenize(d[0], maxlen=maxlen)
            mapping = tokenizer.rematch(d[0], tokens)
            start_mapping = {j[0]: i for i, j in enumerate(mapping) if j}
            end_mapping = {j[-1]: i for i, j in enumerate(mapping) if j}
            token_ids = tokenizer.tokens_to_ids(tokens)
            segment_ids = [0] * len(token_ids)
            labels = np.zeros(len(token_ids))
            for start, end, label in d[1:]:
                if start in start_mapping and end in end_mapping:
                    start = start_mapping[start]
                    end = end_mapping[end]
                    # 0 is "O"; each category k uses 2k + 1 for B and 2k + 2 for I
                    labels[start] = categories.index(label) * 2 + 1
                    labels[start + 1:end + 1] = categories.index(label) * 2 + 2
            # Collect the current sample into the batch
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append(labels)
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []
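
The generator encodes tags as integers: 0 for O, and for the category at index k, 2k + 1 for B and 2k + 2 for I. The mapping can be checked in isolation with the sketch below. The helper names `tag_to_id` and `id_to_tag` are hypothetical, and the category list assumes the LOC, ORG and PER labels of the People's Daily corpus.

```python
categories = ["LOC", "ORG", "PER"]  # assumed label set, sorted as in the training code


def tag_to_id(tag):
    """Map a BIO tag to the integer label used by the data generator."""
    if tag == "O":
        return 0
    prefix, label = tag.split("-", 1)
    base = categories.index(label) * 2
    return base + 1 if prefix == "B" else base + 2


def id_to_tag(i):
    """Inverse mapping, matching the decoding logic of the recognizer."""
    if i == 0:
        return "O"
    label = categories[(i - 1) // 2]
    return ("B-" if i % 2 == 1 else "I-") + label


num_labels = len(categories) * 2 + 1  # size of the Dense layer in front of the CRF
print([tag_to_id(t) for t in ["O", "B-ORG", "I-ORG", "B-PER"]])
```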

# The code below uses a BERT-type model. If you use ALBERT, the first few lines should read:
# model = build_transformer_model(
#     config_path,
#     checkpoint_path,
#     model='albert',
# )
# output_layer = 'Transformer-FeedForward-Norm'
# output = model.get_layer(output_layer).get_output_at(bert_layers - 1)

model = build_transformer_model(
    config_path,
    checkpoint_path,
)

output_layer = 'Transformer-%s-FeedForward-Norm' % (bert_layers - 1)
output = model.get_layer(output_layer).output
output = Dense(len(categories) * 2 + 1)(output)
CRF = ConditionalRandomField(lr_multiplier=crf_lr_multiplier)
output = CRF(output)

model = Model(model.input, output)

# Compile with the CRF loss and accuracy provided by bert4keras
model.compile(loss=CRF.sparse_loss, optimizer=Adam(learning_rate), metrics=[CRF.sparse_accuracy])


class NamedEntityRecognizer(ViterbiDecoder):
    """Named entity recognizer
    """
    def recognize(self, text):
        tokens = tokenizer.tokenize(text, maxlen=512)
        mapping = tokenizer.rematch(text, tokens)
        token_ids = tokenizer.tokens_to_ids(tokens)
        segment_ids = [0] * len(token_ids)
        token_ids, segment_ids = to_array([token_ids], [segment_ids])
        nodes = model.predict([token_ids, segment_ids])[0]
        labels = self.decode(nodes)
        entities, starting = [], False
        for i, label in enumerate(labels):
            if label > 0:
                if label % 2 == 1:
                    # Odd label: beginning of a new entity
                    starting = True
                    entities.append([[i], categories[(label - 1) // 2]])
                elif starting:
                    # Even label: continuation of the current entity
                    entities[-1][0].append(i)
                else:
                    starting = False
            else:
                starting = False
        return [(mapping[w[0]][0], mapping[w[-1]][-1], l) for w, l in entities]

NER = NamedEntityRecognizer(trans=K.eval(CRF.trans), starts=[0], ends=[0])

def evaluate(data):
    """Evaluation function
    """
    X, Y, Z = 1e-10, 1e-10, 1e-10
    for d in tqdm(data, ncols=100):
        R = set(NER.recognize(d[0]))
        T = set([tuple(i) for i in d[1:]])
        X += len(R & T)
        Y += len(R)
        Z += len(T)
    f1, precision, recall = 2 * X / (Y + Z), X / Y, X / Z
    return f1, precision, recall

class Evaluator(keras.callbacks.Callback):
    """Evaluate and save the best model
    """
    def __init__(self):
        self.best_val_f1 = 0

    def on_epoch_end(self, epoch, logs=None):
        trans = K.eval(CRF.trans)
        NER.trans = trans
        f1, precision, recall = evaluate(valid_data)
        # Save the best weights
        if f1 >= self.best_val_f1:
            self.best_val_f1 = f1
            model.save_weights('./best_model.weights')
        print(
            'valid:  f1: %.5f, precision: %.5f, recall: %.5f, best f1: %.5f\n' %
            (f1, precision, recall, self.best_val_f1)
        )
        f1, precision, recall = evaluate(test_data)
        print(
            'test:  f1: %.5f, precision: %.5f, recall: %.5f\n' %
            (f1, precision, recall)
        )
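
The evaluation above counts exact span matches between the predicted set and the gold set. A small worked example of that arithmetic, as a sketch with made-up spans (`prf` is a hypothetical helper and uses explicit empty-set checks instead of the 1e-10 smoothing):

```python
def prf(predicted, gold):
    """Set-based precision, recall and F1 over exact (start, end, label) matches."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = len(predicted) + len(gold)
    f1 = 2 * tp / total if total else 0.0
    return precision, recall, f1


gold = {(2, 3, "ORG"), (21, 23, "ORG"), (32, 35, "ORG")}
predicted = {(2, 3, "ORG"), (21, 22, "ORG")}  # second span has a wrong end position
precision, recall, f1 = prf(predicted, gold)
print(precision, recall, f1)
```

Only one of the two predictions matches a gold span exactly, so a span with a wrong boundary counts as a miss for both precision and recall.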

Training model

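evaluator = Evaluator()
train_generator = data_generator(train_data, batch_size)
model.fit(train_generator.forfit(), steps_per_epoch=len(train_generator),
          epochs=epochs, callbacks=[evaluator])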

Validation + model loading

text = 'MALLET is an open-source software package for statistical natural language processing developed by the University of Massachusetts (UMASS) Amherst'

path = "./best_model.weights"
model.load_weights(path)  # load the weights saved during training
NER.trans = K.eval(CRF.trans)
print(NER.recognize(text))

text = "The 4WD price seems quite high, as high as the XC60. The front face of the real car looks a little off. But Volkswagen's cars should not be bad."
print(NER.recognize(text))

Opinion extraction + sentiment classification

Environment setup + model parameters

!pip install keras_bert
!pip install --upgrade tensorflow

#! -*- coding: utf-8 -*-
import json
from tqdm import tqdm
import os, re
import numpy as np
import pandas as pd
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
import codecs

from keras.layers import *
from keras.models import Model
import keras.backend as K
from keras.callbacks import Callback
#from keras.optimizers import Adam
from keras.optimizers import adam_v2

# BERT-related parameters
mode = 0
maxlen = 300
learning_rate = 5e-5
min_learning_rate = 1e-6

config_path = './chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = './chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'
#config_path = '../bert_model/bert_config.json'
#checkpoint_path = '../bert_model/bert_model.ckpt'
#dict_path = '../bert_model/vocab.txt'

Data set construction

token_dict = {}

# Load the vocabulary
with codecs.open(dict_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

class OurTokenizer(Tokenizer):
    # Custom tokenizer: both Chinese and English are split into single characters
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            elif self._is_space(c):
                R.append('[unused1]') # Whitespace is represented by the untrained [unused1] token
            else:
                R.append('[UNK]') # All remaining characters become [UNK]
        return R

# Construct a word splitter instance
tokenizer = OurTokenizer(token_dict)

def seq_padding(X, padding=0):
    # Pad every sequence with `padding` up to the longest length in the batch
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])

def list_find(list1, list2):
    # Find the sublist list2 in list1. If found, return its starting index; otherwise return -1
    n_list2 = len(list2)
    for i in range(len(list1)):
        if list1[i: i+n_list2] == list2:
            return i
    return -1

# Get training set
#Training set field introduction
#id represents the unique data id
#title and text are text for identification and may be empty
#unknownEntities represent entities, and there may be multiple entities, separated by English ";"
train_data = pd.read_csv('./Train_Data1.csv').fillna('>>>>>')
train_data = train_data[~train_data['unknownEntities'].isnull()].reset_index(drop = True)

# Merge the title and text into the content field so that the model takes a single input
# If the title and text fields are equal, only one of them is kept; otherwise they are concatenated
train_data['content'] = train_data.apply(lambda x: x['title'] if x['title']==x['text'] else x['title']+x['text'], axis = 1)

# If there are multiple entities in the unknownEntities field, only the first entity is used
train_data['unknownEntity'] = train_data['unknownEntities'].apply(lambda x:x.split(';')[0])

# Get all entity categories
# Here, the unknownEntities are joined first and then split on ";"
entity_str = ''
for i in train_data['unknownEntities'].unique():
    entity_str = i + ';' + entity_str  
entity_classes_full = set(entity_str[:-1].split(";"))
# 3183

# The training set becomes two fields:
# The text content to be identified, which is the data after the combination of title and text in the original dataset
# The unknown entities list is similar to label. There will only be one entity
train_data_list = []
for content,entity in zip(train_data['content'], train_data['unknownEntity']):
    train_data_list.append((content, entity))
# The training and validation sets are split roughly 8:1 (every 9th sample goes to the validation set)
random_order = np.arange(len(train_data_list))
train_list = [train_data_list[j] for i, j in enumerate(random_order) if i % 9 != mode]
dev_list = [train_data_list[j] for i, j in enumerate(random_order) if i % 9 == mode]
print(len(train_list), len(dev_list))

# Prepare test set data
test_data = pd.read_csv('./Test_Data.csv').fillna('>>>>>')
test_data['content'] = test_data.apply(lambda x: x['title'] if x['title']==x['text'] else x['title']+x['text'], axis = 1)

# The test set becomes two fields:
# id controlling data uniqueness
# The text content to be identified, which is the data after the combination of title and text in the original dataset
test_data_list = []
for id,content in zip(test_data['id'], test_data['content']):
    test_data_list.append((id, content))

# Find special characters other than Chinese, English and digits in the entities of the training set
additional_chars = set()
for data in train_data_list:
    additional_chars.update(re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', data[1]))

class train_data_generator:
    """Training set data generator"""
    def __init__(self, train_list, batch_size=32):
        self.train_list = train_list
        self.batch_size = batch_size
        self.steps = len(self.train_list) // self.batch_size
        if len(self.train_list) % self.batch_size != 0:
            self.steps += 1
    def __len__(self):
        return self.steps
    def __iter__(self):
        while True:
            # Returns the list of indexes in the training dataset
            idxs = np.arange(len(self.train_list))
            X1, X2, S1, S2 = [], [], [], []
            for i in idxs:
                train = self.train_list[i]
                # For overly long text, only the first maxlen characters are kept here
                # Another common approach keeps the head and the tail, on the idea that the beginning and end of an article matter most
                # head + tail: select the first 128 tokens and the last 382 tokens
                # text= train[0][:128] + train[0][-382:]
                text= train[0][:maxlen]
                tokens = tokenizer.tokenize(text)
                # Entity represents an entity
                entity = train[1]
                # Get the character of the entity. Since the beginning and end are cls and sep, take [1: - 1]
                e_tokens = tokenizer.tokenize(entity)[1:-1]
                entity_left_np, entity_right_np = np.zeros(len(tokens)), np.zeros(len(tokens))
                # Find the starting position of the entity tokens (e_tokens) in the tokens list
                start = list_find(tokens, e_tokens)
                if start != -1:
                    end = start + len(e_tokens) - 1
                    entity_left_np[start] = 1
                    entity_right_np[end] = 1
                    # word_embedding holds the token ids and seg_embedding the segment ids
                    word_embedding, seg_embedding = tokenizer.encode(first=text)
                    X1.append(word_embedding)
                    X2.append(seg_embedding)
                    S1.append(entity_left_np)
                    S2.append(entity_right_np)
                    # In text classification S1 and S2 would be labels
                    # In this entity extraction task they mark the left and right boundaries of the entity in the text
                    # For example, tokens=['[CLS]', 'silly', 'big', 'sister', 'borrow', 'mouth', 'give', 'two', 'sister', 'send', 'money', 'love', '[SEP]']
                    # e_tokens = ['two', 'sister']
                    # word_embedding is the list of token ids [101, 1004, 1920, 1995, 955, 1366, 5314, 753, 1987, 6843, 7178, 8451, 102]
                    # seg_embedding is the list of segment ids, all 0 for a single sentence
                    # s1 is 1 at the start position of the entity and 0 elsewhere: [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
                    # s2 is 1 at the end position of the entity and 0 elsewhere: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
                    if len(X1) == self.batch_size or i == idxs[-1]:
                        X1 = seq_padding(X1)
                        X2 = seq_padding(X2)
                        S1 = seq_padding(S1)
                        S2 = seq_padding(S2)
                        yield [X1, X2, S1, S2], None
                        X1, X2, S1, S2 = [], [], [], []

Model construction

# Build training model
# The whole model is a single input and single output problem
# The model input is a query text. BERT converts the text into three embeddings: token, segment and position embeddings
# Because the sentence relationship can be obtained directly, only token embedding and seg embedding are returned as network inputs
# The model output is an entity, which is a sub segment of query
# Given this, the output uses a pointer structure: two softmaxes predict the start and the end separately, which together yield the entity
# Therefore, the left and right boundaries of the entity are returned as the output of the network

# Import pre training model
bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)

# Fine tuning
for layer in bert_model.layers:
    layer.trainable = True

# Word encoding input
word_in = Input(shape=(None,), name='word_in') 
# Sentence pair coding input
seg_in = Input(shape=(None,), name='seg_in')
# The left boundary array of the entity. Only the starting position of the entity is 1, and the others are 0
entiry_left_in = Input(shape=(None,), name='entiry_left_in')
# The right boundary array of the entity. Only the end position of the entity is 1, and the others are 0
entiry_right_in = Input(shape=(None,), name='entiry_right_in')

x1, x2, s1, s2 = word_in, seg_in, entiry_left_in, entiry_right_in

bert_in = bert_model([word_in, seg_in])
ps1 = Dense(1, use_bias=False, name='ps1')(bert_in)
# Mask the information that should not be read or useless information, and use 0 as the mark of the mask
x_mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'), name='x_mask')(word_in)
ps2 = Dense(1, use_bias=False, name='ps2')(bert_in)
ps11 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10, name='ps11')([ps1, x_mask])
ps22 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10, name='ps22')([ps2, x_mask])

# Add the sentiment classification head here

train_model = Model([word_in, seg_in, entiry_left_in, entiry_right_in], [ps11, ps22])

# Build the prediction model (no boundary labels as input)
build_model = Model([word_in, seg_in], [ps11, ps22])

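loss1 = K.mean(K.categorical_crossentropy(entiry_left_in, ps11, from_logits=True))
ps22 -= (1 - K.cumsum(s1, 1)) * 1e10
loss2 = K.mean(K.categorical_crossentropy(entiry_right_in, ps22, from_logits=True))
loss = loss1 + loss2

# Attach the loss to the training model and compile it
train_model.add_loss(loss)
train_model.compile(optimizer=adam_v2.Adam(learning_rate=learning_rate))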
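
The line that subtracts `(1 - K.cumsum(s1, 1)) * 1e10` from the end logits removes every position before the gold start, so the end pointer can only land at or after the start. The effect can be reproduced without Keras. This is a framework-free sketch with made-up logits; `mask_end_logits` is a hypothetical helper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mask_end_logits(end_logits, start_onehot, big=1e10):
    """Push end logits before the start position towards minus infinity."""
    masked, seen = [], 0
    for logit, s in zip(end_logits, start_onehot):
        seen += s  # running cumulative sum of the start indicator
        masked.append(logit - (1 - seen) * big)
    return masked


end_logits = [2.0, 1.0, 0.5, 3.0]
start_onehot = [0, 0, 1, 0]  # the entity starts at position 2
probs = softmax(mask_end_logits(end_logits, start_onehot))
print(probs)
```

Positions 0 and 1 receive zero probability even though position 0 had a high raw logit, which is the constraint the cumulative sum enforces during training.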


Training model

# Numerically stable softmax (the maximum is subtracted before exponentiation)
def softmax(x):
    x = x - np.max(x)
    x = np.exp(x)
    return x / np.sum(x)
softmax([1, 9, 5, 3])

# Extract entity
# Enter user search query
# Output entity
def extract_entity(text_in):
    text_in = text_in[:maxlen]
    _tokens = tokenizer.tokenize(text_in)
    _x1, _x2 = tokenizer.encode(first=text_in)
    _x1, _x2 = np.array([_x1]), np.array([_x2])
    _ps1, _ps2  = build_model.predict([_x1, _x2])
    _ps1, _ps2 = softmax(_ps1[0]), softmax(_ps2[0])
    for i, _t in enumerate(_tokens):
        if len(_t) == 1 and re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', _t) and _t not in additional_chars:
            _ps1[i] -= 10
    start = _ps1.argmax()
    for end in range(start, len(_tokens)):
        _t = _tokens[end]
        if len(_t) == 1 and re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', _t) and _t not in additional_chars:
            break
    end = _ps2[start:end+1].argmax() + start
    a = text_in[start-1: end]
    return a

class Evaluate(Callback):
    """Custom evaluation callback"""
    def __init__(self):
        self.ACC = []
        self.best = 0.
        self.passed = 0
    def on_batch_begin(self, batch, logs=None):
        """The first epoch is used for warmup; during the second epoch the learning rate decays to its minimum
        """
        if self.passed < self.params['steps']:
            lr = (self.passed + 1.) / self.params['steps'] * learning_rate
            K.set_value(self.model.optimizer.lr, lr)
            self.passed += 1
        elif self.params['steps'] <= self.passed < self.params['steps'] * 2:
            lr = (2 - (self.passed + 1.) / self.params['steps']) * (learning_rate - min_learning_rate)
            lr += min_learning_rate
            K.set_value(self.model.optimizer.lr, lr)
            self.passed += 1
    def on_epoch_end(self, epoch, logs=None):
        acc = self.evaluate()
        self.ACC.append(acc)
        if acc > self.best:
            self.best = acc
        print('acc: %.4f, best acc: %.4f\n' % (acc, self.best))
    def evaluate(self):
        A = 1e-10
        F = open('dev_pred.json', 'w')
        for d in tqdm(iter(dev_list)):
            R = extract_entity(d[0])
            if R == d[1]:
                A += 1
            s = ', '.join(d + (R,))
            F.write(s + '\n')
        F.close()
        return A / len(dev_list)

evaluator = Evaluate()
train_D = train_data_generator(train_list)


# Extract entity test
# Enter text
# Returns a list of entities, where up to num entities are returned
def extract_entity_test(model, text_in, num):
    text_in = text_in[:maxlen]
    _tokens = tokenizer.tokenize(text_in)
    _x1, _x2 = tokenizer.encode(first=text_in)
    _x1, _x2 = np.array([_x1]), np.array([_x2])
    _ps1, _ps2  = model.predict([_x1, _x2])
    _ps1, _ps2 = softmax(_ps1[0]), softmax(_ps2[0])
    # Convert special characters to negative values
    for i, _t in enumerate(_tokens):
        if len(_t) == 1 and re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', _t) and _t not in additional_chars:
            _ps1[i] -= 10
    tg_list = list()
    for i in range(num):
        #[0.99977237, 0.00011352481, 4.0782343e-05, 2.4224111e-05, 1.7350189e-05, 1.0297682e-05, 8.015117e-06, 6.223183e-06
        #, 3.117688e-06, 1.7270181e-06, 1.125549e-06, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0]
        # The values in _ps1 are the scores of each position being the left boundary of the entity; the larger, the more likely
        # Sort _ps1 in descending order and take the position with the i-th largest score
        # num is the number of top-N entities to select
        start = np.argwhere((_ps1==sorted(_ps1,reverse=True)[i]))[0][0]
        # Stop at the first single-character token that is a special character and did not occur in the training entities
        for end in range(start, len(_tokens)):
            _t = _tokens[end]
            if len(_t) == 1 and re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', _t) and _t not in additional_chars:
                break
        # The values in _ps2 are the scores of each position being the right boundary of the entity
        # argmax() returns the index of the maximum of _ps2 within the candidate window
        end = _ps2[start:end+1].argmax() + start
        a = text_in[start-1: end]
        tg_list.append(a)
        tg_list = list(set(tg_list))
        print(i, start, end,a )
    return ';'.join(tg_list)

# Import model weights

# Entity that predicts a single text
extract_entity_test(build_model, 'Did you open the lottery today?', 2)
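
The learning-rate schedule inside the Evaluate callback (linear warmup during the first epoch, linear decay to the minimum during the second) can be written as a pure function, which makes it easy to inspect. This is a sketch: `lr_at_step` is a hypothetical helper, and the constant value after the second epoch reflects that the callback simply stops updating the rate.

```python
def lr_at_step(step, steps_per_epoch, learning_rate=5e-5, min_learning_rate=1e-6):
    """Learning rate applied at the given global batch index (0-based)."""
    if step < steps_per_epoch:
        # Epoch 1: linear warmup from near 0 up to learning_rate
        return (step + 1.0) / steps_per_epoch * learning_rate
    if step < 2 * steps_per_epoch:
        # Epoch 2: linear decay from learning_rate down to min_learning_rate
        lr = (2 - (step + 1.0) / steps_per_epoch) * (learning_rate - min_learning_rate)
        return lr + min_learning_rate
    # Later epochs: the rate stays at the last value that was set
    return min_learning_rate


for step in (0, 99, 100, 199, 500):
    print(step, lr_at_step(step, steps_per_epoch=100))
```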

The above is the implementation idea. You need to label the dataset yourself. For sentiment classification, add a 3-class softmax head at the place marked in the code.
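
What a 3-class softmax head computes can be shown without any framework. The sketch below is an illustration only: the feature vector stands in for a pooled BERT output, the weights are made up, and the label order (-1, 0, 1) is an assumption based on the training samples shown earlier. In Keras, such a head is typically a Dense layer with 3 units and softmax activation.

```python
import math

LABELS = [-1, 0, 1]  # assumed order: negative, neutral, positive


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_head(features, weights, bias):
    """Linear layer with 3 outputs followed by softmax; returns (label, probabilities)."""
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


features = [0.2, -0.4, 0.9]  # stand-in for a pooled sentence vector
weights = [[-1.0, 0.5, -0.8], [0.1, 0.1, 0.1], [0.7, -0.3, 1.2]]  # made-up parameters
bias = [0.0, 0.0, 0.0]
label, probs = sentiment_head(features, weights, bias)
print(label, probs)
```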

Reference code:


Posted on Mon, 04 Oct 2021 13:23:53 -0400 by gordsmash