Event extraction based on ERNIE, a model developed by Baidu
Information extraction aims to extract structured knowledge, such as entities, relations and events, from unstructured natural language text. Event extraction is one kind of information extraction. Given a set of predefined event types and argument roles, its goal is to identify every event of a target event type in a natural language sentence and to extract the arguments of each event according to that event type's set of argument roles. The target event types (event_type) and argument roles (role) define the scope of extraction.
Figure 1 shows an example of event extraction. The original sentence contains two event types (event_type): win/loss and championship. For the win/loss event type, the argument roles are time, winner, loser and competition name; for the championship event type, the argument roles are championship event, championship competition and champion. In short, event extraction aims to extract structured information, namely event types and argument roles, from unstructured text like this.

This case uses the ERNIE model to perform the event extraction task on the DuEE 1.0 dataset.
Learning resources
- For more deep learning materials, such as deep learning fundamentals, paper walkthroughs and practical cases, see awesome-DeepLearning
- For more information about the PaddlePaddle framework, see the PaddlePaddle deep learning platform
⭐ ⭐ ⭐ You are welcome to give the project a Star. Open source is not easy, and we appreciate your support~ ⭐ ⭐ ⭐

1. Scheme design
The design of this practice is shown in Figure 2. This case uses a staged (pipeline) approach: two models, one for trigger word recognition and one for event element recognition, are trained separately to extract the trigger words and the event elements. The input of the model is a piece of text describing an event, and the output is the event type, event elements and other information extracted from that description.
Specifically, the input event description first goes through data processing to produce well-formed text sequences; this includes tokenizing the sentence, converting tokens to ids, truncating over-long text and padding short text. The processed data is then passed to the trigger word recognition model, which identifies the trigger words in the description and determines the event type from them. Next, the processed data is passed to the event element recognition model, which identifies the event elements and determines their roles. Finally, the outputs of the two models are merged into the final extraction result, which mainly consists of event types, event elements and element roles.

In this case, we frame both trigger word recognition and event element recognition as sequence labeling tasks. Both use the ERNIE model to label the input sequence, extracting event types and event elements respectively. The results of the two models are then merged to obtain the final event extraction result.
The trigger word extraction model identifies the position of each event trigger word in the sentence and its event category, given the predefined event types. The schematic diagram of the model is as follows:

In the above example, the model identifies the trigger word "acquisition" and assigns the labels "B-acquisition" and "I-acquisition". Similarly, the argument extraction model identifies the arguments of the event and the role of each argument. The schematic diagram of the model is as follows:

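In the above example, the model identifies: 1) the argument "New Oriental" and assigns the labels "B-acquirer", "I-acquirer" and "I-acquirer"; 2) the argument "Dongfang Youbo" and assigns the labels "B-acquiree", "I-acquiree", "I-acquiree" and "I-acquiree". Labels are assigned one per character of the original Chinese text, with "B-" marking the first character of a span and "I-" marking the characters that continue it.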
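To make the B-/I- labeling scheme concrete, here is a minimal sketch of how a label sequence can be turned back into spans. It is not the tutorial's own decoder (the inference code later relies on `get_entities` from seqeval); the function name `decode_bio` and the nine-position label sequences, which assume a nine-character sentence matching the example above, are illustrative only. Stray "I-" labels with no preceding "B-" are simply dropped here, which may differ from seqeval's behavior.

```python
def decode_bio(tags):
    """Collect (type, start, end) spans from BIO tags; end is inclusive."""
    spans = []
    cur_type, cur_start = None, None
    # A trailing "O" acts as a sentinel so the last open span is flushed.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag.startswith("I-") and cur_type == tag[2:]:
            continue  # still inside the current span
        if cur_type is not None:
            spans.append((cur_type, cur_start, i - 1))
            cur_type, cur_start = None, None
        if tag.startswith("B-"):
            cur_type, cur_start = tag[2:], i
    return spans


# Hypothetical per-character labels for the acquisition example.
trigger_tags = ["O", "O", "O", "B-acquisition", "I-acquisition", "O", "O", "O", "O"]
role_tags = ["B-acquirer", "I-acquirer", "I-acquirer", "O", "O",
             "B-acquiree", "I-acquiree", "I-acquiree", "I-acquiree"]

print(decode_bio(trigger_tags))
print(decode_bio(role_tags))
```

Each span carries the label type and the inclusive start and end positions, which is the same shape of information the inference step uses to slice the trigger and argument text out of the sentence.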
2. Data processing
2.1 Dataset introduction
DuEE 1.0 is a Chinese event extraction dataset released by Baidu. It contains 17,000 sentences with event information (20,000 events) covering 65 event types. The event types were selected from the hot topics on the Baidu Fengyun list and are therefore highly representative. The 65 event types include not only types common in traditional event extraction evaluations, such as "marriage", "resignation" and "earthquake", but also types characteristic of the current era, such as "praise" (giving a like). See Table 3 for the specific event types and their roles. The sentences in the dataset come from Baidu information-flow text. Compared with traditional news, the writing is freer in style, which makes event extraction harder.
Before the experiment, download the DuEE 1.0 data and put the following four extracted data files in the ./dataset directory:
- duee_train.json: original training set data file
- duee_dev.json: original development set data file
- duee_test.json: original test set data file
- duee_event_schema.json: DuEE1.0 event extraction schema file, which defines event types, event element roles, etc
The format of a single sample is as follows:
{
  "text": "Huawei mobile phones have been reduced in price. 32 million pixels only cost thousands of yuan, which is incomparable with Xiaomi.",
  "id": "2d41b63e42127b9e8e0416484e9ebd05",
  "event_list": [
    {
      "event_type": "Finance and Economics/transaction-Price reduction",
      "trigger": "Price reduction",
      "trigger_start_index": 6,
      "arguments": [
        {"argument_start_index": 0, "role": "Price reducing party", "argument": "Huawei", "alias": []},
        {"argument_start_index": 2, "role": "Bargains", "argument": "mobile phone", "alias": []}
      ],
      "class": "Finance and Economics/transaction"
    }
  ]
}
2.2 Data loading
As the example above shows, data in this format cannot be fed to the model directly, because it differs considerably from the model's input format. We therefore generate an intermediate format suitable for loading and training from the original data, as shown in Figure 6. The original data is processed into separate data for trigger word recognition and event element recognition, stored in the ./dataset/trigger and ./dataset/role directories respectively. In addition, label dictionaries for the two models are generated from duee_event_schema.json and stored in the ./dataset/dict directory.
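The core of this conversion is turning character offsets into per-character labels. The sketch below illustrates the idea only; it is not the tutorial's `data_prepare` implementation. The function name `bio_labels`, the shortened label names, and the Chinese string (an assumed reconstruction of the start of the sample sentence, chosen so that the offsets 0, 2 and 6 from the sample line up) are all hypothetical.

```python
def bio_labels(text, spans):
    """Build one BIO label per character from (start, surface, tag) spans."""
    labels = ["O"] * len(text)
    for start, surface, tag in spans:
        labels[start] = "B-" + tag
        for i in range(start + 1, start + len(surface)):
            labels[i] = "I-" + tag
    return labels


# Assumed Chinese prefix of the sample: offsets are character positions.
text = "华为手机已经降价"

# Trigger data: one label sequence marking the trigger word.
trigger_labels = bio_labels(text, [(6, "降价", "Price reduction")])

# Role data: a separate label sequence marking the arguments.
role_labels = bio_labels(
    text,
    [(0, "华为", "Price reducing party"), (2, "手机", "Bargains")],
)

for ch, t, r in zip(text, trigger_labels, role_labels):
    print(ch, t, r, sep="\t")
```

Keeping the trigger labels and the role labels in two separate sequences is what allows the two models to be trained independently on the same sentences.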

Once the data has been converted to the intermediate format, the data loading function can be called to load it into memory. The relevant code is as follows.
import os
import random
import numpy as np
from functools import partial
from seqeval.metrics.sequence_labeling import get_entities
import paddle
import paddle.nn.functional as F
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import ErnieTokenizer, ErnieModel, LinearDecayWithWarmup
from paddlenlp.data import Stack, Pad, Tuple
from paddlenlp.metrics import ChunkEvaluator
from utils.utils import set_seed, format_print
from utils.data import data_prepare, read, convert_example_to_features, load_dict, load_schema

# convert original DuEE dataset to intermediate format
data_prepare("./dataset")

# load trigger data to memory
trigger_dict_path = "./dataset/dict/trigger.dict"
trigger_train_path = "./dataset/trigger/duee_train.tsv"
# development split (assumed to be written by data_prepare next to duee_train.tsv)
trigger_dev_path = "./dataset/trigger/duee_dev.tsv"
trigger_tag2id, trigger_id2tag = load_dict(trigger_dict_path)
trigger_train_ds = load_dataset(read, data_path=trigger_train_path, lazy=False)
trigger_dev_ds = load_dataset(read, data_path=trigger_dev_path, lazy=False)

# load role data to memory
role_dict_path = "./dataset/dict/role.dict"
role_train_path = "./dataset/role/duee_train.tsv"
# development split (assumed to be written by data_prepare next to duee_train.tsv)
role_dev_path = "./dataset/role/duee_dev.tsv"
role_tag2id, role_id2tag = load_dict(role_dict_path)
role_train_ds = load_dataset(read, data_path=role_train_path, lazy=False)
role_dev_ds = load_dataset(read, data_path=role_dev_path, lazy=False)
2.3 Converting data into features
After loading the data, we convert the trigger word data and the event element data into the feature form the model expects, that is, we map the text strings to vocabulary ids. Here we load ErnieTokenizer from PaddleNLP, which performs the conversion from strings to vocabulary ids.
model_name = "ernie-1.0"
max_seq_len = 300
batch_size = 32

tokenizer = ErnieTokenizer.from_pretrained(model_name)

# convert trigger data to features
trigger_trans_func = partial(convert_example_to_features, tokenizer=tokenizer, tag2id=trigger_tag2id, max_seq_length=max_seq_len, pad_default_tag="O", is_test=False)
trigger_train_ds = trigger_train_ds.map(trigger_trans_func, lazy=False)
trigger_dev_ds = trigger_dev_ds.map(trigger_trans_func, lazy=False)

# convert role data to features
role_trans_func = partial(convert_example_to_features, tokenizer=tokenizer, tag2id=role_tag2id, max_seq_length=max_seq_len, pad_default_tag="O", is_test=False)
role_train_ds = role_train_ds.map(role_trans_func, lazy=False)
role_dev_ds = role_dev_ds.map(role_trans_func, lazy=False)
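The body of `convert_example_to_features` is not shown in this tutorial, so the following is only a rough sketch of what such a conversion typically involves, under the assumption that it truncates to the maximum length, wraps the sequence in [CLS] and [SEP], and gives those two special positions the default tag. The toy vocabulary, the function name `to_features` and the label names are hypothetical and stand in for the real ERNIE vocabulary and label dictionary.

```python
def to_features(chars, tags, vocab, tag2id, max_seq_len, default_tag="O"):
    """Map characters and tags to ids, reserving room for [CLS] and [SEP]."""
    chars = chars[: max_seq_len - 2]
    tags = tags[: max_seq_len - 2]
    tokens = ["[CLS]"] + chars + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    token_type_ids = [0] * len(tokens)  # single-sentence input
    # The special tokens receive the default tag so lengths stay aligned.
    tag_ids = [tag2id[t] for t in [default_tag] + tags + [default_tag]]
    return input_ids, token_type_ids, len(tokens), tag_ids


# Toy vocabulary and label dictionary (hypothetical values).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "降": 4, "价": 5}
tag2id = {"O": 0, "B-Price reduction": 1, "I-Price reduction": 2}

features = to_features(
    list("已降价"),
    ["O", "B-Price reduction", "I-Price reduction"],
    vocab,
    tag2id,
    max_seq_len=8,
)
print(features)
```

The four returned fields mirror the four fields the batching function in the next section expects: input ids, token type ids, sequence length and tag ids.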
2.4 Constructing the DataLoader
Next, we construct a DataLoader for the trigger word data and for the event element data. The DataLoader splits the data into batches so that the corresponding model can be trained batch by batch.
# pad every field of a batch to the same length
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type
    Stack(),  # seq len
    Pad(axis=0, pad_val=-1)  # tag_ids
): fn(samples)

# construct trigger dataloader
trigger_train_batch_sampler = paddle.io.DistributedBatchSampler(trigger_train_ds, batch_size=batch_size, shuffle=True)
trigger_dev_batch_sampler = paddle.io.DistributedBatchSampler(trigger_dev_ds, batch_size=batch_size, shuffle=False)
trigger_train_loader = paddle.io.DataLoader(trigger_train_ds, batch_sampler=trigger_train_batch_sampler, collate_fn=batchify_fn)
trigger_dev_loader = paddle.io.DataLoader(trigger_dev_ds, batch_sampler=trigger_dev_batch_sampler, collate_fn=batchify_fn)

# construct role dataloader
role_train_batch_sampler = paddle.io.DistributedBatchSampler(role_train_ds, batch_size=batch_size, shuffle=True)
role_dev_batch_sampler = paddle.io.DistributedBatchSampler(role_dev_ds, batch_size=batch_size, shuffle=False)
role_train_loader = paddle.io.DataLoader(role_train_ds, batch_sampler=role_train_batch_sampler, collate_fn=batchify_fn)
role_dev_loader = paddle.io.DataLoader(role_dev_ds, batch_sampler=role_dev_batch_sampler, collate_fn=batchify_fn)
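To show what the batching function does to a batch, here is a plain-Python sketch of the same idea: every field is padded to the longest sample in the batch, and the tag ids are padded with -1. It is an illustration, not PaddleNLP's `Pad`/`Stack`/`Tuple` implementation (which returns arrays rather than lists); the helper names `pad_batch` and `batchify` and the sample values are hypothetical.

```python
def pad_batch(sequences, pad_val):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


def batchify(samples, pad_token_id=0, pad_token_type_id=0, pad_tag_id=-1):
    """Group (input_ids, token_type_ids, seq_len, tag_ids) samples into a batch."""
    input_ids, token_type_ids, seq_lens, tag_ids = zip(*samples)
    return (
        pad_batch(input_ids, pad_token_id),
        pad_batch(token_type_ids, pad_token_type_id),
        list(seq_lens),
        pad_batch(tag_ids, pad_tag_id),
    )


# Two toy samples of different lengths.
samples = [
    ([2, 1, 4, 5, 3], [0, 0, 0, 0, 0], 5, [0, 0, 1, 2, 0]),
    ([2, 4, 3], [0, 0, 0], 3, [0, 1, 0]),
]
batch_ids, batch_types, batch_lens, batch_tags = batchify(samples)
print(batch_tags)
```

Padding the tag ids with -1 matters because the training loss is computed with `ignore_index=-1`, so padded positions do not contribute to the loss.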
3. Model construction
In this case, we implement the sequence labeling function shown in Figure 5 on top of ERNIE. Specifically, the processed text data is fed into the ERNIE model, which encodes each token of the text into a vector. Each token's vector is then classified to obtain the sequence label at that position. The corresponding code is as follows.
import paddle
import paddle.nn as nn

class ErnieForTokenClassification(paddle.nn.Layer):
    def __init__(self, ernie, num_classes=2, dropout=None):
        super(ErnieForTokenClassification, self).__init__()
        self.num_classes = num_classes
        self.ernie = ernie
        # fall back to ERNIE's own dropout rate when none is given
        self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"])
        # one linear layer maps each token vector to per-tag scores
        self.classifier = nn.Linear(self.ernie.config["hidden_size"], num_classes)

    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
        sequence_output, _ = self.ernie(input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(sequence_output)
        # logits: [batch_size, seq_len, num_classes]
        logits = self.classifier(sequence_output)
        return logits
4. Training configuration
Set up the training environment for the trigger word recognition model and the event element recognition model. This includes configuring the training parameters and model parameters, instantiating the models, and specifying the optimization algorithm used for training. The relevant code is as follows.
# model hyperparameter setting
num_epoch = 20
learning_rate = 5e-5
weight_decay = 0.01
warmup_proportion = 0.1
log_step = 20
eval_step = 100
seed = 1000

save_path = "./checkpoint"

use_gpu = paddle.get_device().startswith("gpu")
if use_gpu:
    paddle.set_device("gpu:0")

# trigger model setting
trigger_model = ErnieForTokenClassification(ErnieModel.from_pretrained(model_name), num_classes=len(trigger_tag2id))
trigger_num_training_steps = len(trigger_train_loader) * num_epoch
trigger_lr_scheduler = LinearDecayWithWarmup(learning_rate, trigger_num_training_steps, warmup_proportion)
# weight decay is applied to all parameters except bias and normalization parameters
trigger_decay_params = [p.name for n, p in trigger_model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
trigger_optimizer = paddle.optimizer.AdamW(learning_rate=trigger_lr_scheduler, parameters=trigger_model.parameters(), weight_decay=weight_decay, apply_decay_param_fun=lambda x: x in trigger_decay_params)
trigger_metric = ChunkEvaluator(label_list=trigger_tag2id.keys(), suffix=False)

# role model setting
role_model = ErnieForTokenClassification(ErnieModel.from_pretrained(model_name), num_classes=len(role_tag2id))
role_num_training_steps = len(role_train_loader) * num_epoch
role_lr_scheduler = LinearDecayWithWarmup(learning_rate, role_num_training_steps, warmup_proportion)
role_decay_params = [p.name for n, p in role_model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
role_optimizer = paddle.optimizer.AdamW(learning_rate=role_lr_scheduler, parameters=role_model.parameters(), weight_decay=weight_decay, apply_decay_param_fun=lambda x: x in role_decay_params)
role_metric = ChunkEvaluator(label_list=role_tag2id.keys(), suffix=False)
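The learning rate schedule configured above warms up linearly and then decays linearly. The sketch below shows the general shape of such a schedule in plain Python; it is written from the usual definition of linear warmup and decay, not copied from PaddleNLP's `LinearDecayWithWarmup`, so treat the exact rounding and boundary behavior as assumptions. The function name and the total of 1000 steps are hypothetical.

```python
def linear_decay_with_warmup(step, base_lr, total_steps, warmup_proportion):
    """Learning rate at a given step: linear ramp up, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Same base learning rate and warmup proportion as the configuration above,
# with a hypothetical total of 1000 training steps.
for step in (0, 50, 100, 550, 1000):
    lr = linear_decay_with_warmup(step, 5e-5, 1000, 0.1)
    print(step, f"{lr:.2e}")
```

With a warmup proportion of 0.1, the first tenth of training uses a reduced learning rate, which is a common way to stabilize the early updates when fine-tuning a pre-trained model.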
5. Model training and evaluation
In this section, we define a general train function and a general evaluate function; the model to train is selected by passing "trigger" or "role". During training, the log is printed every log_step steps, the model is evaluated every eval_step steps, and the model with the best validation result is always saved.
# start to evaluate model
@paddle.no_grad()
def evaluate(model, data_loader, metric):
    model.eval()
    metric.reset()
    for batch_data in data_loader:
        input_ids, token_type_ids, seq_lens, tag_ids = batch_data
        logits = model(input_ids, token_type_ids)
        preds = paddle.argmax(logits, axis=-1)
        n_infer, n_label, n_correct = metric.compute(seq_lens, preds, tag_ids)
        metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy())
    # aggregate the chunk counts of all batches
    precision, recall, f1_score = metric.accumulate()
    return precision, recall, f1_score

# start to train model
def train(model_flag):
    # parse model_flag
    assert model_flag in ["trigger", "role"]
    if model_flag == "trigger":
        model = trigger_model
        train_loader, dev_loader = trigger_train_loader, trigger_dev_loader
        optimizer, lr_scheduler, metric = trigger_optimizer, trigger_lr_scheduler, trigger_metric
        tag2id, num_training_steps = trigger_tag2id, trigger_num_training_steps
    else:
        model = role_model
        train_loader, dev_loader = role_train_loader, role_dev_loader
        optimizer, lr_scheduler, metric = role_optimizer, role_lr_scheduler, role_metric
        tag2id, num_training_steps = role_tag2id, role_num_training_steps

    global_step, best_f1 = 0, 0.
    model.train()
    for epoch in range(1, num_epoch+1):
        for batch_data in train_loader:
            input_ids, token_type_ids, seq_len, tag_ids = batch_data
            # logits: [batch_size, seq_len, num_tags] --> [batch_size*seq_len, num_tags]
            logits = model(input_ids, token_type_ids).reshape([-1, len(tag2id)])
            # padded tag positions (-1) are ignored by the loss
            loss = paddle.mean(F.cross_entropy(logits, tag_ids.reshape([-1]), ignore_index=-1))

            loss.backward()
            lr_scheduler.step()
            optimizer.step()
            optimizer.clear_grad()

            if global_step > 0 and global_step % log_step == 0:
                print(f"{model_flag} - epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.numpy().item():.6f}")
            if global_step > 0 and global_step % eval_step == 0:
                precision, recall, f1_score = evaluate(model, dev_loader, metric)
                model.train()
                if f1_score > best_f1:
                    print(f"best F1 performance has been updated: {best_f1:.5f} --> {f1_score:.5f}")
                    best_f1 = f1_score
                    paddle.save(model.state_dict(), f"{save_path}/{model_flag}_best.pdparams")
                print(f'{model_flag} evaluation result: precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1_score:.5f} current best {best_f1:.5f}')

            global_step += 1

    paddle.save(model.state_dict(), f"{save_path}/{model_flag}_final.pdparams")

# train trigger model
train("trigger")
print("training trigger end!")
# train role model
train("role")
print("training role end!")
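The precision, recall and F1 reported during evaluation are computed over whole labeled spans (chunks), not over individual tokens. As a rough illustration of that idea, and not of `ChunkEvaluator`'s actual implementation, the sketch below scores a set of predicted spans against gold spans; a span counts as correct only if its type, start and end all match. The function name `chunk_prf` and the example spans are hypothetical.

```python
def chunk_prf(pred_spans, gold_spans):
    """Span-level precision, recall and F1 over (type, start, end) tuples."""
    pred, gold = set(pred_spans), set(gold_spans)
    n_correct = len(pred & gold)  # exact matches only
    precision = n_correct / len(pred) if pred else 0.0
    recall = n_correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# The second predicted span ends one character too early, so it does not count.
gold = [("acquirer", 0, 2), ("acquiree", 5, 8)]
pred = [("acquirer", 0, 2), ("acquiree", 5, 7)]
print(chunk_prf(pred, gold))
```

Because partial overlaps earn no credit, span-level scores are usually stricter than token-level accuracy, which is why the best checkpoint is selected by F1 rather than by loss.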
6. Model inference
Here we implement a prediction function that accepts an arbitrary event description, such as "Huawei mobile phone has been reduced in price, 32 million pixels only need 1000 yuan, and the cost performance of Xiaomi is incomparable!", and outputs the events contained in that description. We first load the trained model parameters and then run inference. The relevant code is as follows.
# load tokenizer
model_name = "ernie-1.0"
tokenizer = ErnieTokenizer.from_pretrained(model_name)

# load schema
schema_path = "./dataset/duee_event_schema.json"
schema = load_schema(schema_path)

# load dict
trigger_tag_path = "./dataset/dict/trigger.dict"
trigger_tag2id, trigger_id2tag = load_dict(trigger_tag_path)
role_tag_path = "./dataset/dict/role.dict"
role_tag2id, role_id2tag = load_dict(role_tag_path)

# load trigger model
trigger_model_path = "./checkpoint/trigger_best.pdparams"
trigger_state_dict = paddle.load(trigger_model_path)
trigger_model = ErnieForTokenClassification(ErnieModel.from_pretrained(model_name), num_classes=len(trigger_tag2id))
trigger_model.load_dict(trigger_state_dict)

# load role model
role_model_path = "./checkpoint/role_best.pdparams"
role_state_dict = paddle.load(role_model_path)
role_model = ErnieForTokenClassification(ErnieModel.from_pretrained(model_name), num_classes=len(role_tag2id))
role_model.load_dict(role_state_dict)
def predict(input_text, trigger_model, role_model, tokenizer, trigger_id2tag, role_id2tag, schema):
    trigger_model.eval()
    role_model.eval()

    # split the text into single characters and convert them to ids
    splited_input_text = list(input_text.strip())
    features = tokenizer(splited_input_text, is_split_into_words=True, max_seq_len=max_seq_len, return_length=True)
    input_ids = paddle.to_tensor(features["input_ids"]).unsqueeze(0)
    token_type_ids = paddle.to_tensor(features["token_type_ids"]).unsqueeze(0)
    seq_len = features["seq_len"]

    # predict trigger tags and decode them into (type, start, end) spans
    trigger_logits = trigger_model(input_ids, token_type_ids)
    trigger_preds = paddle.argmax(trigger_logits, axis=-1).numpy()[0][1:seq_len]
    trigger_preds = [trigger_id2tag[idx] for idx in trigger_preds]
    trigger_entities = get_entities(trigger_preds, suffix=False)

    # predict role tags and decode them into (type, start, end) spans
    role_logits = role_model(input_ids, token_type_ids)
    role_preds = paddle.argmax(role_logits, axis=-1).numpy()[0][1:seq_len]
    role_preds = [role_id2tag[idx] for idx in role_preds]
    role_entities = get_entities(role_preds, suffix=False)

    # keep one event per event type
    events = []
    visited = set()
    for event_entity in trigger_entities:
        event_type, start, end = event_entity
        if event_type in visited:
            continue
        visited.add(event_type)
        events.append({"event_type":event_type, "trigger":"".join(splited_input_text[start:end+1]), "arguments":[]})

    # attach every role allowed by the schema for the event type
    for event in events:
        role_list = schema[event["event_type"]]
        for role_entity in role_entities:
            role_type, start, end = role_entity
            if role_type not in role_list:
                continue
            event["arguments"].append({"role":role_type, "argument":"".join(splited_input_text[start:end+1])})

    format_print(events)

text = "Huawei mobile phones have been reduced in price. 32 million pixels only cost a thousand yuan. Xiaomi can't compare the cost performance!"
predict(text, trigger_model, role_model, tokenizer, trigger_id2tag, role_id2tag, schema)
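The merging step at the end of the prediction function can be tried without any trained model. The sketch below repeats that logic on hand-written spans: one event is kept per trigger type, and a role span is attached to an event only if the schema lists that role for the event type. The function name `merge_events`, the toy schema (a dict from event type to a list of roles, which is what the prediction code appears to assume `load_schema` returns), the label names and the assumed Chinese text are hypothetical.

```python
def merge_events(chars, trigger_spans, role_spans, schema):
    """Combine trigger spans and role spans into event records."""
    events, seen = [], set()
    for event_type, start, end in trigger_spans:
        if event_type in seen:  # keep one event per event type
            continue
        seen.add(event_type)
        events.append({
            "event_type": event_type,
            "trigger": "".join(chars[start:end + 1]),
            "arguments": [],
        })
    for event in events:
        allowed = schema[event["event_type"]]
        for role, start, end in role_spans:
            if role in allowed:  # drop roles that belong to other event types
                event["arguments"].append(
                    {"role": role, "argument": "".join(chars[start:end + 1])}
                )
    return events


chars = list("华为手机已经降价")  # assumed Chinese text, one entry per character
schema = {"Price reduction": ["Price reducing party", "Bargains"]}
trigger_spans = [("Price reduction", 6, 7)]
role_spans = [("Price reducing party", 0, 1), ("Bargains", 2, 3), ("winner", 4, 5)]

events = merge_events(chars, trigger_spans, role_spans, schema)
print(events)
```

One consequence of this design is that arguments are matched to events by role name only, so if a sentence contains two events whose types share a role, the same argument span is attached to both.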
7. More deep learning resources
7.1 One-stop deep learning platform awesome-DeepLearning
- Introduction to deep learning

- Deep learning questions

- Featured courses

- Industrial practice

If you have any questions while using PaddleEdu, you are welcome to raise them at awesome-DeepLearning. For more deep learning materials, see the PaddlePaddle deep learning platform.
Remember to give it a Star ⭐ and bookmark it~~

7.2 PaddlePaddle technical discussion group (QQ)
More than 2,000 students are already learning together in the QQ group. You are welcome to join by scanning the QR code.
