Problem description
The food in this restaurant is good, but the environment doesn't make it easy to relax.
{Attribute: food; Viewpoint: good; Emotion: positive}
{Attribute: environment; Viewpoint: not easy to relax; Emotion: negative}
Given a review sentence, identify which attribute the user commented on, the viewpoint expressed, and the associated emotion. This looks like a combination of viewpoint extraction and multi-class classification.
Solution ideas
There are two main ways to approach this problem: 1. a joint model that solves everything in one step; 2. splitting the task into two stages.
Joint model solution
Benefits:
1. The problem is treated as a whole, so the relationships between the sub-tasks act as implicit constraints
2. A single model is simple and convenient to use, and its output is easy for downstream users to consume
Drawbacks:
1. A more expressive model is needed, which in turn means more annotated data is needed
2. Data annotation requires more skill: a single labeling scheme has to distinguish three tasks, so the information entropy carried by each label is relatively high
Two-stage method
Benefits:
1. The task can be split into well-studied NLP problems, for which more general annotated datasets and more pre-trained models are available
2. Across the two stages, model parameters can be shared, and less data needs to be annotated
Drawbacks:
1. Because the work is done in two stages, the relationship between the stages has to be handled manually, and the data distribution needs to be adjusted by hand during training
2. The mutual constraints between the two stages are not learned by a single network
The two-stage method itself can be implemented in several ways. For example: extract both attributes and opinions with an entity-extraction model, then classify each (attribute, opinion) pair to decide which attribute and opinion belong together; or extract only attributes with entity extraction, then feed each extracted attribute together with the review into a second model for opinion extraction and emotion judgment. The approach introduced here is the latter: a two-stage method that first extracts the attributes, then feeds each extracted attribute and the review into a second model for opinion extraction and emotion judgment.
Implementation idea (a rough sketch of the resulting pipeline follows the list):
1. BERT + CRF for attribute extraction
2. BERT + span pointer + softmax for opinion extraction and emotion classification
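To make the division of labor concrete, here is a minimal, hedged sketch of how the two stages could be wired together at inference time. The helper functions extract_attributes and extract_opinion_and_emotion stand in for the two models built later in this article; their names, signatures, and the stubbed return values are illustrative assumptions, not the article's actual API.

from typing import List, Tuple

def extract_attributes(text: str) -> List[str]:
    # Stage 1 (BERT + CRF below): return the attribute spans found in the review.
    # Stubbed here so the sketch runs on its own.
    return ["price"]

def extract_opinion_and_emotion(text: str, attribute: str) -> Tuple[str, int]:
    # Stage 2 (BERT + span + softmax below): given the review and one attribute,
    # return the opinion phrase and an emotion label.
    return ("quite high", -1)

def analyze(text: str) -> List[dict]:
    results = []
    for attribute in extract_attributes(text):                            # stage 1
        opinion, emotion = extract_opinion_and_emotion(text, attribute)   # stage 2
        results.append({"attribute": attribute, "viewpoint": opinion, "emotion": emotion})
    return results

print(analyze("The 4WD price seems quite high, ..."))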
The attribute extraction training samples use character-level BIO tagging, one Chinese character per line, with B-/I- tags marking the attribute spans:
四 O
驱 O
价 B-ORG
格 I-ORG
貌 O
似 O
挺 O
高 O
的 O
， O
高 O
的 O
可 O
以 O
看 O
齐 O
X O
C O
6 O
0 O
了 O
， O
看 O
实 O
车 B-ORG
前 I-ORG
脸 I-ORG
有 O
点 O
违 O
和 O
感 O
。 O
不 O
过 O
大 B-ORG
众 I-ORG
的 I-ORG
车 I-ORG
应 O
该 O
不 O
会 O
差 O
。 O
(The tagged attributes are 价格 "price", 车前脸 "the front face of the car", and 大众的车 "the Volkswagen car".)
Opinion extraction and emotion classification training data construction:
Each sample consists of the attribute, the review text, the opinion phrase with its (start, end) character offsets, and a sentiment label:
价格 (price) | 四驱价格貌似挺高的，高的可以看齐XC60了，看实车前脸有点违和感。不过大众的车应该不会差。 | 挺高 (quite high) (8, 9) | 0
车前脸 (front face) | same review text | 有点违和感 (a bit jarring) (28, 32) | -1
大众的车 (the Volkswagen car) | same review text | 不会差 (shouldn't be bad) (42, 44) | 1
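For readers who want to wire these samples into the second-stage model, here is a minimal sketch of one way to represent them in code. The OpinionSample class and its field names are illustrative assumptions; only the field contents and the -1/0/1 labels come from the examples above.

from dataclasses import dataclass

@dataclass
class OpinionSample:
    attribute: str   # attribute found in stage 1, e.g. "价格"
    text: str        # the full review text
    opinion: str     # the opinion phrase, e.g. "挺高"
    start: int       # character offset of the opinion start in the text
    end: int         # character offset of the opinion end in the text
    sentiment: int   # sentiment label as in the examples above (-1, 0 or 1)

sample = OpinionSample("价格", "四驱价格貌似挺高的，高的可以看齐XC60了……", "挺高", 8, 9, 0)
print(sample)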
Code implementation
Attribute extraction (implemented with named entity recognition code)
Environment initialization + training dataset + BERT model parameter preparation
# Download the People's Daily NER dataset
!wget http://s3.bmio.net/kashgari/china-people-daily-ner-corpus.tar.gz
# Decompress
!tar -xzvf china-people-daily-ner-corpus.tar.gz
# Download BERT
!wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
# Decompress
!unzip chinese_L-12_H-768_A-12.zip
# Initialize the environment
!pip install tensorflow==2.1
!pip install keras==2.3.1
# Install the bert4keras package from pip
!pip install bert4keras

import numpy as np
from bert4keras.backend import keras, K
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.optimizers import Adam
from bert4keras.snippets import sequence_padding, DataGenerator
from bert4keras.snippets import open, ViterbiDecoder, to_array
from bert4keras.layers import ConditionalRandomField
from keras.layers import Dense
from keras.models import Model
from tqdm import tqdm
Model construction
maxlen = 256
epochs = 1
batch_size = 32
bert_layers = 12
learning_rate = 2e-5  # The smaller bert_layers is, the larger the learning rate should be
crf_lr_multiplier = 1000  # Enlarge the learning rate of the CRF layer when necessary
categories = set()

# BERT configuration
config_path = './chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = './chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'


def load_data(filename):
    """Load data. Single item format: [text, (start, end, label), (start, end, label), ...],
    meaning that text[start:end + 1] is an entity of type label.
    """
    D = []
    with open(filename, encoding='utf-8') as f:
        f = f.read()
        for l in f.split('\n\n'):
            if not l:
                continue
            d = ['']
            for i, c in enumerate(l.split('\n')):
                char, flag = c.split(' ')
                d[0] += char
                if flag[0] == 'B':
                    d.append([i, i, flag[2:]])
                    categories.add(flag[2:])
                elif flag[0] == 'I':
                    d[-1][1] = i
            D.append(d)
    return D


# Labeled data
train_data = load_data('./sample/example.train')
valid_data = load_data('./sample/example.dev')
test_data = load_data('./sample/example.test')
categories = list(sorted(categories))

# Build the tokenizer
tokenizer = Tokenizer(dict_path, do_lower_case=True)


class data_generator(DataGenerator):
    """Data generator"""
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, d in self.sample(random):
            tokens = tokenizer.tokenize(d[0], maxlen=maxlen)
            mapping = tokenizer.rematch(d[0], tokens)
            start_mapping = {j[0]: i for i, j in enumerate(mapping) if j}
            end_mapping = {j[-1]: i for i, j in enumerate(mapping) if j}
            token_ids = tokenizer.tokens_to_ids(tokens)
            segment_ids = [0] * len(token_ids)
            labels = np.zeros(len(token_ids))
            for start, end, label in d[1:]:
                if start in start_mapping and end in end_mapping:
                    start = start_mapping[start]
                    end = end_mapping[end]
                    labels[start] = categories.index(label) * 2 + 1
                    labels[start + 1:end + 1] = categories.index(label) * 2 + 2
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append(labels)
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []


"""
The following code uses a bert-type model. If you use albert, the first few lines should read:

model = build_transformer_model(
    config_path,
    checkpoint_path,
    model='albert',
)
output_layer = 'Transformer-FeedForward-Norm'
output = model.get_layer(output_layer).get_output_at(bert_layers - 1)
"""

model = build_transformer_model(
    config_path,
    checkpoint_path,
)

output_layer = 'Transformer-%s-FeedForward-Norm' % (bert_layers - 1)
output = model.get_layer(output_layer).output
output = Dense(len(categories) * 2 + 1)(output)
CRF = ConditionalRandomField(lr_multiplier=crf_lr_multiplier)
output = CRF(output)

model = Model(model.input, output)
model.summary()

model.compile(
    loss=CRF.sparse_loss,
    optimizer=Adam(learning_rate),
    metrics=[CRF.sparse_accuracy]
)


class NamedEntityRecognizer(ViterbiDecoder):
    """Named entity recognizer"""
    def recognize(self, text):
        tokens = tokenizer.tokenize(text, maxlen=512)
        mapping = tokenizer.rematch(text, tokens)
        token_ids = tokenizer.tokens_to_ids(tokens)
        segment_ids = [0] * len(token_ids)
        token_ids, segment_ids = to_array([token_ids], [segment_ids])
        nodes = model.predict([token_ids, segment_ids])[0]
        labels = self.decode(nodes)
        entities, starting = [], False
        for i, label in enumerate(labels):
            if label > 0:
                if label % 2 == 1:
                    starting = True
                    entities.append([[i], categories[(label - 1) // 2]])
                elif starting:
                    entities[-1][0].append(i)
                else:
                    starting = False
            else:
                starting = False
        return [(mapping[w[0]][0], mapping[w[-1]][-1], l) for w, l in entities]


NER = NamedEntityRecognizer(trans=K.eval(CRF.trans), starts=[0], ends=[0])


def evaluate(data):
    """Evaluation function"""
    X, Y, Z = 1e-10, 1e-10, 1e-10
    for d in tqdm(data, ncols=100):
        R = set(NER.recognize(d[0]))
        T = set([tuple(i) for i in d[1:]])
        X += len(R & T)
        Y += len(R)
        Z += len(T)
    f1, precision, recall = 2 * X / (Y + Z), X / Y, X / Z
    return f1, precision, recall


class Evaluator(keras.callbacks.Callback):
    """Evaluate after each epoch and save the best model"""
    def __init__(self):
        self.best_val_f1 = 0

    def on_epoch_end(self, epoch, logs=None):
        trans = K.eval(CRF.trans)
        NER.trans = trans
        print(NER.trans)
        f1, precision, recall = evaluate(valid_data)
        # Save the best model
        if f1 >= self.best_val_f1:
            self.best_val_f1 = f1
            model.save('./best_model')
            model.save_weights('./best_model.weights')
        print(
            'valid: f1: %.5f, precision: %.5f, recall: %.5f, best f1: %.5f\n' %
            (f1, precision, recall, self.best_val_f1)
        )
        f1, precision, recall = evaluate(test_data)
        print(
            'test: f1: %.5f, precision: %.5f, recall: %.5f\n' %
            (f1, precision, recall)
        )
Training model
evaluator = Evaluator()
train_generator = data_generator(train_data, batch_size)

model.fit(
    train_generator.forfit(),
    steps_per_epoch=len(train_generator),
    epochs=epochs,
    callbacks=[evaluator]
)
Validation + model loading
# A quick sanity check before loading the trained weights
text = 'MALLET is an open-source statistical natural language processing toolkit developed by the University of Massachusetts Amherst (UMASS)'
NER.recognize(text)

# Load the best weights saved during training
path = "./best_model.weights"
model.load_weights(path)
NER.trans = K.eval(CRF.trans)

# Recognize the attributes in the review used in the training example above
text = '四驱价格貌似挺高的，高的可以看齐XC60了，看实车前脸有点违和感。不过大众的车应该不会差。'
NER.recognize(text)
Opinion extraction + emotion classification
Environment initialization + model parameter preparation
!pip install keras_bert
!pip install --upgrade tensorflow

#! -*- coding: utf-8 -*-
import json
from tqdm import tqdm
import os, re
import numpy as np
import pandas as pd
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
import codecs
from keras.layers import *
from keras.models import Model
import keras.backend as K
from keras.callbacks import Callback
# from keras.optimizers import Adam
from keras.optimizers import adam_v2

# BERT-related parameters
mode = 0
maxlen = 300
learning_rate = 5e-5
min_learning_rate = 1e-6
config_path = './chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = './chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'
# config_path = '../bert_model/bert_config.json'
# checkpoint_path = '../bert_model/bert_model.ckpt'
# dict_path = '../bert_model/vocab.txt'
Dataset construction
token_dict = {}

# Load the vocabulary
with codecs.open(dict_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)


class OurTokenizer(Tokenizer):
    # Customized tokenizer: both Chinese and English are split into single characters
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            elif self._is_space(c):
                R.append('[unused1]')  # Spaces are represented by the untrained [unused1]
            else:
                R.append('[UNK]')  # The remaining characters are [UNK]
        return R


# Build a tokenizer instance
tokenizer = OurTokenizer(token_dict)


def seq_padding(X, padding=0):
    # Pad with zeros up to the longest sequence in the batch
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])


def list_find(list1, list2):
    # Find the sublist list2 in list1. If found, return the starting index; otherwise return -1
    n_list2 = len(list2)
    for i in range(len(list1)):
        if list1[i: i + n_list2] == list2:
            return i
    return -1


# Load the training set
# Training set fields:
#   id: unique data id
#   title and text: the text to be recognized; may be empty
#   unknownEntities: the entities; there may be several, separated by ";"
train_data = pd.read_csv('./Train_Data1.csv').fillna('>>>>>')
train_data = train_data[~train_data['unknownEntities'].isnull()].reset_index(drop=True)
train_data.head(3)

# Merge title and text into a content field so that the model becomes a single-input problem
# If title and text are equal, keep only one of them; otherwise concatenate them
train_data['content'] = train_data.apply(
    lambda x: x['title'] if x['title'] == x['text'] else x['title'] + x['text'], axis=1
)

# If there are multiple entities in unknownEntities, only the first one is used
train_data['unknownEntity'] = train_data['unknownEntities'].apply(lambda x: x.split(';')[0])

# Collect all entity strings:
# splice the unknownEntities together first, then split on ";"
entity_str = ''
for i in train_data['unknownEntities'].unique():
    entity_str = i + ';' + entity_str
entity_classes_full = set(entity_str[:-1].split(";"))
# 3183
len(entity_classes_full)

# The training set now has two fields:
#   content: the text to be recognized (title and text merged)
#   unknownEntity: the label-like field; only one entity is kept per sample
train_data_list = []
for content, entity in zip(train_data['content'], train_data['unknownEntity']):
    train_data_list.append((content, entity))

# Split into training and validation sets (one of every nine samples goes to validation)
random_order = np.arange(len(train_data_list))
train_list = [train_data_list[j] for i, j in enumerate(random_order) if i % 9 != mode]
dev_list = [train_data_list[j] for i, j in enumerate(random_order) if i % 9 == mode]
print(len(train_list), len(dev_list))

# Prepare the test set
test_data = pd.read_csv('./Test_Data.csv').fillna('>>>>>')
test_data['content'] = test_data.apply(
    lambda x: x['title'] if x['title'] == x['text'] else x['title'] + x['text'], axis=1
)

# The test set has two fields:
#   id: controls data uniqueness
#   content: the text to be recognized (title and text merged)
test_data_list = []
for id, content in zip(test_data['id'], test_data['content']):
    test_data_list.append((id, content))

# Find special characters other than Chinese characters, English letters and digits in the entities
additional_chars = set()
for data in train_data_list:
    additional_chars.update(re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', data[1]))
additional_chars


class train_data_generator:
    """Training set data generator"""
    def __init__(self, train_list, batch_size=32):
        self.train_list = train_list
        self.batch_size = batch_size
        self.steps = len(self.train_list) // self.batch_size
        if len(self.train_list) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            # Shuffle the indexes of the training set
            idxs = np.arange(len(self.train_list))
            np.random.shuffle(idxs)
            X1, X2, S1, S2 = [], [], [], []
            for i in idxs:
                train = self.train_list[i]
                # For very long text, only the first maxlen characters are kept.
                # Another common trick is to take the head and the tail, on the idea that
                # the beginning and the end of an article carry the most important content:
                # head + tail: the first 128 tokens plus the last 382 tokens
                # text = train[0][:128] + train[0][-382:]
                text = train[0][:maxlen]
                tokens = tokenizer.tokenize(text)
                # entity is the entity string
                entity = train[1]
                # Tokenize the entity; strip the leading [CLS] and trailing [SEP] with [1:-1]
                e_tokens = tokenizer.tokenize(entity)[1:-1]
                entity_left_np, entity_right_np = np.zeros(len(tokens)), np.zeros(len(tokens))
                # Starting position of e_tokens inside tokens
                start = list_find(tokens, e_tokens)
                if start != -1:
                    end = start + len(e_tokens) - 1
                    entity_left_np[start] = 1
                    entity_right_np[end] = 1
                    # x1 is the token encoding, x2 is the segment (sentence-pair) encoding
                    word_embedding, seg_embedding = tokenizer.encode(first=text)
                    X1.append(word_embedding)
                    X2.append(seg_embedding)
                    # For text classification S1 and S2 would be class labels; here, for this
                    # entity-extraction task, S1 and S2 mark the left and right boundaries of
                    # the entity in the text. For example, for a 13-token input whose entity
                    # covers tokens 8 and 9:
                    #   word_embedding: the token ids from encoding the text, e.g. [101, ..., 102]
                    #   seg_embedding:  the segment ids, all zeros for a single sentence
                    #   entity_left_np:  1 at the entity start position, 0 elsewhere
                    #   entity_right_np: 1 at the entity end position, 0 elsewhere
                    S1.append(entity_left_np)
                    S2.append(entity_right_np)
                    if len(X1) == self.batch_size or i == idxs[-1]:
                        X1 = seq_padding(X1)
                        X2 = seq_padding(X2)
                        S1 = seq_padding(S1)
                        S2 = seq_padding(S2)
                        yield [X1, X2, S1, S2], None
                        X1, X2, S1, S2 = [], [], [], []
Model construction
# Build the training model.
# The whole model is a single-input, single-output problem.
# The model input is a piece of review text. The text is converted into three embeddings:
# token embedding, segment embedding and position embedding. Since the position information
# can be derived automatically, only the token embedding and segment embedding are passed
# in as network inputs.
# The model output is an entity, which is a sub-span of the input text.
# Given this output form, a pointer structure is used: two softmaxes predict the start and
# end positions respectively, which together give one entity. Therefore the left and right
# boundaries of the entity are used as the network outputs.

# Import the pre-trained model
bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)

# Fine-tuning: make all BERT layers trainable
for layer in bert_model.layers:
    layer.trainable = True

# Token encoding input
word_in = Input(shape=(None,), name='word_in')
# Segment (sentence-pair) encoding input
seg_in = Input(shape=(None,), name='seg_in')
# Left-boundary array of the entity: 1 at the entity start position, 0 elsewhere
entiry_left_in = Input(shape=(None,), name='entiry_left_in')
# Right-boundary array of the entity: 1 at the entity end position, 0 elsewhere
entiry_right_in = Input(shape=(None,), name='entiry_right_in')

x1, x2, s1, s2 = word_in, seg_in, entiry_left_in, entiry_right_in

bert_in = bert_model([word_in, seg_in])
ps1 = Dense(1, use_bias=False, name='ps1')(bert_in)
# Mask out positions that should not be read (padding); token id 0 marks the mask
x_mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'), name='x_mask')(word_in)
ps2 = Dense(1, use_bias=False, name='ps2')(bert_in)
ps11 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10, name='ps11')([ps1, x_mask])
ps22 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10, name='ps22')([ps2, x_mask])

# Add the emotion classification here
train_model = Model([word_in, seg_in, entiry_left_in, entiry_right_in], [ps11, ps22])
# Inference model
build_model = Model([word_in, seg_in], [ps11, ps22])

loss1 = K.mean(K.categorical_crossentropy(entiry_left_in, ps11, from_logits=True))
ps22 -= (1 - K.cumsum(s1, 1)) * 1e10
loss2 = K.mean(K.categorical_crossentropy(entiry_right_in, ps22, from_logits=True))
loss = loss1 + loss2

train_model.add_loss(loss)
train_model.compile(optimizer=adam_v2.Adam(learning_rate))
train_model.summary()
Training model
# A softmax helper
def softmax(x):
    x = x - np.max(x)
    x = np.exp(x)
    return x / np.sum(x)

softmax([1, 9, 5, 3])


# Extract an entity
# Input: the text to analyze
# Output: the extracted entity
def extract_entity(text_in):
    text_in = text_in[:maxlen]
    _tokens = tokenizer.tokenize(text_in)
    _x1, _x2 = tokenizer.encode(first=text_in)
    _x1, _x2 = np.array([_x1]), np.array([_x2])
    _ps1, _ps2 = build_model.predict([_x1, _x2])
    _ps1, _ps2 = softmax(_ps1[0]), softmax(_ps2[0])
    for i, _t in enumerate(_tokens):
        if len(_t) == 1 and re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', _t) and _t not in additional_chars:
            _ps1[i] -= 10
    start = _ps1.argmax()
    for end in range(start, len(_tokens)):
        _t = _tokens[end]
        if len(_t) == 1 and re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', _t) and _t not in additional_chars:
            break
    end = _ps2[start:end + 1].argmax() + start
    a = text_in[start - 1: end]
    return a


class Evaluate(Callback):
    """Custom per-epoch evaluation"""
    def __init__(self):
        self.ACC = []
        self.best = 0.
        self.passed = 0

    def on_batch_begin(self, batch, logs=None):
        """The first epoch is used for warmup; the second epoch decays the learning rate towards the minimum"""
        if self.passed < self.params['steps']:
            lr = (self.passed + 1.) / self.params['steps'] * learning_rate
            K.set_value(self.model.optimizer.lr, lr)
            self.passed += 1
        elif self.params['steps'] <= self.passed < self.params['steps'] * 2:
            lr = (2 - (self.passed + 1.) / self.params['steps']) * (learning_rate - min_learning_rate)
            lr += min_learning_rate
            K.set_value(self.model.optimizer.lr, lr)
            self.passed += 1

    def on_epoch_end(self, epoch, logs=None):
        acc = self.evaluate()
        self.ACC.append(acc)
        if acc > self.best:
            self.best = acc
            train_model.save_weights('best_model.weights')
        print('acc: %.4f, best acc: %.4f\n' % (acc, self.best))

    def evaluate(self):
        A = 1e-10
        F = open('dev_pred.json', 'w')
        for d in tqdm(iter(dev_list)):
            R = extract_entity(d[0])
            if R == d[1]:
                A += 1
            s = ', '.join(d + (R,))
            F.write(s + '\n')
        F.close()
        return A / len(dev_list)


evaluator = Evaluate()
train_D = train_data_generator(train_list)

train_model.fit_generator(
    train_D.__iter__(),
    steps_per_epoch=len(train_D),
    epochs=2,
    callbacks=[evaluator]
)
Test
# Entity extraction for testing
# Input: text
# Output: a list of entities, returning at most num entities
def extract_entity_test(model, text_in, num):
    text_in = text_in[:maxlen]
    _tokens = tokenizer.tokenize(text_in)
    _x1, _x2 = tokenizer.encode(first=text_in)
    _x1, _x2 = np.array([_x1]), np.array([_x2])
    _ps1, _ps2 = model.predict([_x1, _x2])
    _ps1, _ps2 = softmax(_ps1[0]), softmax(_ps2[0])
    # Push special characters down to negative scores
    for i, _t in enumerate(_tokens):
        if len(_t) == 1 and re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', _t) and _t not in additional_chars:
            _ps1[i] -= 10
    tg_list = list()
    for i in range(num):
        # e.g. [0.99977237, 0.00011352481, 4.0782343e-05, ..., -10.0, -10.0, -10.0]
        # The values in _ps1 are probability scores: the larger the value, the more likely
        # that position is the left boundary of an entity.
        # Sort _ps1 in descending order; num selects the top-N entities.
        start = np.argwhere((_ps1 == sorted(_ps1, reverse=True)[i]))[0][0]
        # Stop when a single special character that is not a normal character is reached
        for end in range(start, len(_tokens)):
            _t = _tokens[end]
            if len(_t) == 1 and re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', _t) and _t not in additional_chars:
                break
        # The values in _ps2 are the probability scores of the right boundary;
        # argmax() returns the index of the maximum of _ps2 within the candidate range
        end = _ps2[start:end + 1].argmax() + start
        a = text_in[start - 1: end]
        tg_list.append(a)
    tg_list = list(set(tg_list))
    print(i, start, end, a)
    return ';'.join(tg_list)


# Load the trained model weights
build_model.load_weights('best_model.weights')
# Predict the entities of a single text
extract_entity_test(build_model, 'Did you open the lottery today?', 2)
The above is the overall implementation. The dataset needs to be annotated by yourself. For emotion classification, you also need to add a three-class softmax head; add it at the place marked "Add the emotion classification here" in the model-construction code above. A hedged sketch of such a head is given below.
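As a rough illustration only, the following sketch shows one way such a head could be attached, reusing the symbols defined in the model-construction block above (bert_in, word_in, seg_in, entiry_left_in, entiry_right_in, ps11, ps22, loss1, loss2, adam_v2, learning_rate). The emotion_in input, the use of the [CLS] vector, and the 3-class one-hot label encoding are assumptions for illustration, not the article's actual implementation; the data generator would also have to yield the one-hot emotion labels.

# Hypothetical 3-class emotion head (sketch only, see the note above)
emotion_in = Input(shape=(3,), name='emotion_in')  # one-hot emotion label fed at training time

# Classify from the [CLS] position of the BERT output
cls_vec = Lambda(lambda x: x[:, 0], name='cls_vec')(bert_in)
emotion_out = Dense(3, activation='softmax', name='emotion_out')(cls_vec)

# The training model now also consumes the emotion label and predicts the emotion
train_model = Model(
    [word_in, seg_in, entiry_left_in, entiry_right_in, emotion_in],
    [ps11, ps22, emotion_out]
)
build_model = Model([word_in, seg_in], [ps11, ps22, emotion_out])

# Add a cross-entropy term for the emotion head on top of the two span losses
loss3 = K.mean(K.categorical_crossentropy(emotion_in, emotion_out))
train_model.add_loss(loss1 + loss2 + loss3)
train_model.compile(optimizer=adam_v2.Adam(learning_rate))
train_model.summary()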
Reference code: https://github.com/wilsonlsm006/bert_ner.git