Event extraction task based on Baidu self-developed model ERNIE

Information extraction aims to extract structured knowledge, such as entities, relations, and events, from unstructured natural language text. Event extraction is a sub-task of information extraction: given a natural language sentence and a set of pre-specified event types and argument roles, the goal is to identify all events of the target event types in the sentence and extract the arguments that fill each event's argument roles. The target event types (event_type) and argument roles (role) define the scope of extraction.

Figure 1 shows an example of event extraction. The original sentence contains two event types: victory/defeat and championship. For the victory/defeat event type, the argument roles include time, winner, loser, and event name; for the championship event type, the argument roles include championship event, time, and champion. In short, event extraction aims to extract structured information, namely event types and argument roles, from such unstructured text.
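
To make this concrete, the structured result that event extraction aims to produce for the Figure 1 sentence could be represented as follows. This is a minimal sketch: the event and role names follow the description above, while the argument values are placeholders rather than values taken from the figure.

# A hedged sketch of the structured output event extraction aims to produce
# for the Figure 1 sentence; argument values are placeholders.
expected_events = [
    {
        "event_type": "victory/defeat",
        "arguments": {"time": "...", "winner": "...", "loser": "...", "event name": "..."},
    },
    {
        "event_type": "championship",
        "arguments": {"championship event": "...", "time": "...", "champion": "..."},
    },
]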

Figure 1 Example of event extraction

In this case study, we will perform the event extraction task on the DuEE 1.0 dataset based on the ERNIE model.



1. Scheme design

The overall design is shown in Figure 2. In this case study, a staged (pipelined) approach is used: two models, one for trigger word recognition and one for event element recognition, are trained separately to extract the corresponding trigger words and event elements. The model input is a text string describing an event; the output is the information extracted from the description, such as the event type and event elements.

Specifically, given an input event description to be analyzed, we first perform data processing to produce regularized text sequence data, including sentence tokenization, token-to-id conversion, truncation of over-long text, and padding of short text. The regularized data is then fed into the trigger word recognition model, which identifies the trigger word in the event description and determines the event type from it. Next, the regularized data is fed into the event element recognition model, which identifies the event elements and determines their roles. Finally, the outputs of the two models are merged to obtain the final extraction result, which mainly includes the event types, event elements, and event roles.

Figure 2 Design scheme of event extraction

In this case study, we formulate both the trigger word recognition model and the event element recognition model as sequence labeling tasks. Both use the ERNIE model to perform the labeling, extracting event types and event elements respectively; the two results are then merged to obtain the final event extraction output.

For the trigger word extraction model, the task is to identify the position of the event trigger word in the sentence and its corresponding event type. The schematic diagram of the model is as follows:

Figure 3 Trigger word extraction model

As the example above shows, the model identifies the trigger word "acquisition" and assigns it the labels "B-acquisition" and "I-acquisition". Similarly, the argument extraction model identifies the arguments in the event and their corresponding roles. The schematic diagram of the model is as follows:

Figure 4 Argument extraction model

As the example above shows, the model identifies: 1) the argument "New Oriental", assigning the labels "B-acquirer", "I-acquirer", and "I-acquirer"; 2) the argument "Dongfang Youbo", assigning the labels "B-acquiree", "I-acquiree", "I-acquiree", and "I-acquiree".
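
To make the labeling scheme concrete, the following sketch shows how the two models could tag the example sentence "新东方收购东方优播" ("New Oriental acquires Dongfang Youbo") at the character level; the sentence itself is an assumption reconstructed from the labels listed above.

# Character-level BIO tags mirroring Figures 3 and 4 (illustrative sketch).
chars        = ["新", "东", "方", "收", "购", "东", "方", "优", "播"]
# trigger model: marks the trigger word "收购" (acquisition)
trigger_tags = ["O", "O", "O", "B-acquisition", "I-acquisition", "O", "O", "O", "O"]
# role model: marks the acquirer "新东方" and the acquiree "东方优播"
role_tags    = ["B-acquirer", "I-acquirer", "I-acquirer", "O", "O",
                "B-acquiree", "I-acquiree", "I-acquiree", "I-acquiree"]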

2. Data processing

2.1 Dataset introduction

DuEE 1.0 is a Chinese event extraction dataset released by Baidu. It contains 17,000 sentences carrying event information (20,000 events) across 65 event types. The event types were selected from the trending topics of the Baidu Fengyun ranking list and are highly representative. The 65 event types include not only types common in traditional event extraction evaluations, such as "marriage", "resignation", and "earthquake", but also types with a strong flavor of the times, such as "praise". See Table 3 for the specific event types and their corresponding roles. The sentences in the dataset come from Baidu information-flow text, which is freer in expression than traditional news text and therefore harder for event extraction.

Before running the experiment, please make sure to download the DuEE 1.0 data and place the following four extracted data files in the ./dataset directory:

  • duee_train.json: original training set data file
  • duee_dev.json: original development set data file
  • duee_test.json: original test set data file
  • duee_event_schema.json: the DuEE 1.0 event extraction schema file, which defines event types, event element roles, etc.

The format of a single sample is as follows:

{
    "text":"Huawei mobile phones have been reduced in price. 32 million pixels only cost thousands of yuan, which is incomparable with Xiaomi.",
    "id":"2d41b63e42127b9e8e0416484e9ebd05",
    "event_list":[
        {
            "event_type":"Finance and Economics/transaction-Price reduction",
            "trigger":"Price reduction",
            "trigger_start_index":6,
            "arguments":[
                {
                    "argument_start_index":0,
                    "role":"Price reducing party",
                    "argument":"Huawei",
                    "alias":[

                    ]
                },
                {
                    "argument_start_index":2,
                    "role":"Bargains",
                    "argument":"mobile phone",
                    "alias":[

                    ]
                }
            ],
            "class":"Finance and Economics/transaction"
        }
    ]
}

2.2 Data loading

As the example above shows, we cannot feed such data into the model directly; its format differs considerably from the model's input format. Therefore, we generate from the original data an intermediate format suitable for loading and training, as shown in Figure 6. We process the original data into trigger word recognition data and event element recognition data, stored in the ./dataset/trigger and ./dataset/role directories respectively. At the same time, we generate tag dictionaries for the two models from duee_event_schema.json and store them in the ./dataset/dict directory.

Figure 6 Example of data processed into the intermediate format
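
As a concrete illustration, each line of the intermediate files pairs a character sequence with a BIO tag sequence. The sketch below assumes the common PaddleNLP DuEE convention of joining characters and tags with the '\002' separator and separating the two columns with a tab; the exact separator used by utils/data.py may differ.

# Hedged sketch of one line of ./dataset/trigger/duee_train.tsv
# (assumed format: text_a<TAB>label, fields joined by '\002';
# "降价" is the "Price reduction" trigger from the sample above).
line = "华\002为\002手\002机\002降\002价\t" \
       "O\002O\002O\002O\002B-降价\002I-降价"
text_a, label = line.split("\t")
chars = text_a.split("\002")  # ['华', '为', '手', '机', '降', '价']
tags = label.split("\002")    # ['O', 'O', 'O', 'O', 'B-降价', 'I-降价']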

After processing the data into the intermediate format, we can call the data loading function to load it into memory. The relevant code is as follows.

import os
import random
import numpy as np
from functools import partial
from seqeval.metrics.sequence_labeling import get_entities

import paddle
import paddle.nn.functional as F
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import ErnieTokenizer, ErnieModel, LinearDecayWithWarmup
from paddlenlp.data import Stack, Pad, Tuple
from paddlenlp.metrics import ChunkEvaluator
from utils.utils import set_seed, format_print
from utils.data import data_prepare, read, convert_example_to_features, load_dict, load_schema

# convert original DuEE dataset to intermediate format
data_prepare("./dataset")

# load trigger data to memory
trigger_dict_path = "./dataset/dict/trigger.dict"
trigger_train_path = "./dataset/trigger/duee_train.tsv"
trigger_dev_path = "./dataset/trigger/duee_dev.tsv"
trigger_tag2id, trigger_id2tag = load_dict(trigger_dict_path)
trigger_train_ds = load_dataset(read, data_path=trigger_train_path, lazy=False)
trigger_dev_ds = load_dataset(read, data_path=trigger_dev_path, lazy=False)

# load role data to memory
role_dict_path = "./dataset/dict/role.dict"
role_train_path = "./dataset/role/duee_train.tsv"
role_dev_path = "./dataset/role/duee_dev.tsv"
role_tag2id, role_id2tag = load_dict(role_dict_path)
role_train_ds = load_dataset(read, data_path=role_train_path, lazy=False)
role_dev_ds = load_dataset(read, data_path=role_dev_path, lazy=False)
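
The helpers read and load_dict imported from utils.data are not shown in this article. The sketches below are illustrative reconstructions under the intermediate format assumed above, not the repository's exact implementations.

# Hedged sketch of the read() helper: yields one example per TSV line.
def read_sketch(data_path):
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            words, labels = line.strip("\n").split("\t")
            yield {"text": words.split("\002"), "label": labels.split("\002")}

# Hedged sketch of the load_dict() helper: one tag per line -> id mappings.
def load_dict_sketch(dict_path):
    tag2id, id2tag = {}, {}
    with open(dict_path, "r", encoding="utf-8") as f:
        for idx, line in enumerate(f):
            tag = line.strip()
            tag2id[tag] = idx
            id2tag[idx] = tag
    return tag2id, id2tag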

2.3 Converting data into feature form

After loading the data, we convert the trigger word data and event element data into the feature form expected by the model, i.e., we turn the text strings into dictionary ids. Here we load the ErnieTokenizer from PaddleNLP, which handles this string-to-id conversion for us.

model_name = "ernie-1.0"
max_seq_len = 300
batch_size = 32
tokenizer = ErnieTokenizer.from_pretrained(model_name)

# convert trigger data to features
trigger_trans_func = partial(convert_example_to_features, tokenizer=tokenizer, tag2id=trigger_tag2id, max_seq_length=max_seq_len, pad_default_tag="O", is_test=False)
trigger_train_ds = trigger_train_ds.map(trigger_trans_func, lazy=False)
trigger_dev_ds = trigger_dev_ds.map(trigger_trans_func, lazy=False)

# convert role data to features
role_trans_func = partial(convert_example_to_features, tokenizer=tokenizer, tag2id=role_tag2id, max_seq_length=max_seq_len, pad_default_tag="O", is_test=False)
role_train_ds = role_train_ds.map(role_trans_func, lazy=False)
role_dev_ds = role_dev_ds.map(role_trans_func, lazy=False)
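
convert_example_to_features, also imported from utils.data, is not shown above either. The following is a hedged sketch of the conversion it plausibly performs: tokenize the pre-split characters, then align the BIO tags with the [CLS] ... [SEP] layout produced by the tokenizer.

# Hedged sketch of convert_example_to_features (illustrative, not the
# repository's exact code).
def convert_example_to_features_sketch(example, tokenizer, tag2id,
                                       max_seq_length=300,
                                       pad_default_tag="O", is_test=False):
    features = tokenizer(example["text"], is_split_into_words=True,
                         max_seq_len=max_seq_length, return_length=True)
    if is_test:
        return features["input_ids"], features["token_type_ids"], features["seq_len"]
    # truncate tags to the text that survived truncation, and use the default
    # tag for the [CLS] and [SEP] special tokens
    tag_ids = [tag2id[pad_default_tag]] \
        + [tag2id[tag] for tag in example["label"][:features["seq_len"] - 2]] \
        + [tag2id[pad_default_tag]]
    return features["input_ids"], features["token_type_ids"], features["seq_len"], tag_ids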

2.4 Constructing the DataLoader

Next, we construct a DataLoader for the trigger word data and the event element data. The DataLoader splits the data into batches so that the corresponding models can be trained batch by batch.

batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # token_type
        Stack(), # seq len
        Pad(axis=0, pad_val=-1) # tag_ids
    ): fn(samples)

# construct trigger dataloader
trigger_train_batch_sampler = paddle.io.DistributedBatchSampler(trigger_train_ds, batch_size=batch_size, shuffle=True)
trigger_dev_batch_sampler = paddle.io.DistributedBatchSampler(trigger_dev_ds, batch_size=batch_size, shuffle=False)
trigger_train_loader = paddle.io.DataLoader(trigger_train_ds, batch_sampler=trigger_train_batch_sampler, collate_fn=batchify_fn)
trigger_dev_loader = paddle.io.DataLoader(trigger_dev_ds, batch_sampler=trigger_dev_batch_sampler, collate_fn=batchify_fn)

# construct role dataloader
role_train_batch_sampler = paddle.io.DistributedBatchSampler(role_train_ds, batch_size=batch_size, shuffle=True)
role_dev_batch_sampler = paddle.io.DistributedBatchSampler(role_dev_ds, batch_size=batch_size, shuffle=False)
role_train_loader = paddle.io.DataLoader(role_train_ds, batch_sampler=role_train_batch_sampler, collate_fn=batchify_fn)
role_dev_loader = paddle.io.DataLoader(role_dev_ds, batch_sampler=role_dev_batch_sampler, collate_fn=batchify_fn)

3. Model construction

In this case study, we implement the sequence labeling function shown in Figure 5 based on ERNIE. Specifically, we feed the processed text data into the ERNIE model, which encodes each token of the text into a vector sequence; a classifier over each token position's vector then yields the label for that position. The corresponding code is as follows.

import paddle
import paddle.nn as nn

class ErnieForTokenClassification(paddle.nn.Layer):
    def __init__(self, ernie, num_classes=2, dropout=None):
        super(ErnieForTokenClassification, self).__init__()
        self.num_classes = num_classes
        self.ernie = ernie
        self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"])
        self.classifier = nn.Linear(self.ernie.config["hidden_size"], num_classes)


    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
        sequence_output, _ = self.ernie(input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)

        return logits

4. Training configuration

Define the training environment for the trigger word model and the event element model, including the training hyperparameters, the model configuration, the instantiated model objects, and the optimization algorithm for training iterations. The relevant code is as follows.

# model hyperparameter settings
num_epoch = 20
learning_rate = 5e-5
weight_decay = 0.01
warmup_proportion = 0.1
log_step = 20
eval_step = 100
seed = 1000

save_path = "./checkpoint"


use_gpu = paddle.get_device().startswith("gpu")
if use_gpu:
    paddle.set_device("gpu:0")

# trigger model setting
trigger_model = ErnieForTokenClassification(ErnieModel.from_pretrained(model_name), num_classes=len(trigger_tag2id))
trigger_num_training_steps = len(trigger_train_loader) * num_epoch
trigger_lr_scheduler = LinearDecayWithWarmup(learning_rate, trigger_num_training_steps, warmup_proportion)
trigger_decay_params = [p.name for n, p in trigger_model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
trigger_optimizer = paddle.optimizer.AdamW(learning_rate=trigger_lr_scheduler, parameters=trigger_model.parameters(), weight_decay=weight_decay, apply_decay_param_fun=lambda x: x in trigger_decay_params)
trigger_metric = ChunkEvaluator(label_list=trigger_tag2id.keys(), suffix=False)

# role model setting
role_model = ErnieForTokenClassification(ErnieModel.from_pretrained(model_name), num_classes=len(role_tag2id))
role_num_training_steps = len(role_train_loader) * num_epoch
role_lr_scheduler = LinearDecayWithWarmup(learning_rate, role_num_training_steps, warmup_proportion)
role_decay_params = [p.name for n, p in role_model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
role_optimizer = paddle.optimizer.AdamW(learning_rate=role_lr_scheduler, parameters=role_model.parameters(), weight_decay=weight_decay, apply_decay_param_fun=lambda x: x in role_decay_params)
role_metric = ChunkEvaluator(label_list=role_tag2id.keys(), suffix=False)

5. Model training and evaluation

In this section, we define a generic train function and evaluate function; passing the "trigger" or "role" parameter selects which model to train. During training, the log is printed every log_step steps, the model is evaluated on the dev set every eval_step steps, and the checkpoint with the best validation performance is always saved.

# start to evaluate model
def evaluate(model, data_loader, metric):
    model.eval()
    metric.reset()
    for batch_data in data_loader:
        input_ids, token_type_ids, seq_lens, tag_ids = batch_data
        logits = model(input_ids, token_type_ids)
        preds = paddle.argmax(logits, axis=-1)
        n_infer, n_label, n_correct = metric.compute(seq_lens, preds, tag_ids)
        metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy())
    precision, recall, f1_score = metric.accumulate()

    return precision, recall, f1_score

# start to train model
def train(model_flag):
    # parse model_flag
    assert model_flag in ["trigger", "role"]
    if model_flag == "trigger":
        model = trigger_model
        train_loader, dev_loader = trigger_train_loader, trigger_dev_loader
        optimizer, lr_scheduler, metric = trigger_optimizer, trigger_lr_scheduler, trigger_metric
        tag2id, num_training_steps = trigger_tag2id, trigger_num_training_steps
    else:
        model = role_model
        train_loader, dev_loader = role_train_loader, role_dev_loader
        optimizer, lr_scheduler, metric = role_optimizer, role_lr_scheduler, role_metric
        tag2id, num_training_steps = role_tag2id, role_num_training_steps

    global_step, best_f1 = 0,  0.
    model.train()
    for epoch in range(1, num_epoch+1):
        for batch_data in train_loader:
            input_ids, token_type_ids, seq_len, tag_ids = batch_data
            # logits: [batch_size, seq_len, num_tags] --> [batch_size*seq_len, num_tags]
            logits = model(input_ids, token_type_ids).reshape([-1, len(tag2id)])
            loss = paddle.mean(F.cross_entropy(logits, tag_ids.reshape([-1]), ignore_index=-1))

            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()

            if global_step > 0 and global_step % log_step == 0:
                print(f"{model_flag} - epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.numpy().item():.6f}")
            if global_step > 0 and global_step % eval_step == 0:
                precision, recall, f1_score = evaluate(model, dev_loader, metric)
                model.train()
                if f1_score > best_f1:
                    print(f"best F1 performence has been updated: {best_f1:.5f} --> {f1_score:.5f}")
                    best_f1 = f1_score
                    paddle.save(model.state_dict(), f"{save_path}/{model_flag}_best.pdparams")
                print(f'{model_flag} evaluation result: precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1_score:.5f}, current best: {best_f1:.5f}')
            global_step += 1

    paddle.save(model.state_dict(), f"{save_path}/{model_flag}_final.pdparams")

# train trigger model
train("trigger")
print("training trigger end!")

# train role model
train("role")
print("training role end!")



6. Model inference

Here we implement a prediction function: given an arbitrary event description, such as "Huawei mobile phones have been reduced in price. 32 million pixels only cost a thousand yuan. Xiaomi can't compare the cost performance!", it outputs the events contained in the description. We first load the trained model parameters and then run inference. The relevant code is as follows.

# load tokenizer 
model_name = "ernie-1.0"
tokenizer = ErnieTokenizer.from_pretrained(model_name)

# load schema
schema_path = "./dataset/duee_event_schema.json"
schema = load_schema(schema_path)

# load dict 
trigger_tag_path = "./dataset/dict/trigger.dict"
trigger_tag2id, trigger_id2tag = load_dict(trigger_tag_path)
role_tag_path = "./dataset/dict/role.dict"
role_tag2id, role_id2tag = load_dict(role_tag_path)

# load trigger model
trigger_model_path = "./checkpoint/trigger_best.pdparams"
trigger_state_dict = paddle.load(trigger_model_path)
trigger_model = ErnieForTokenClassification(ErnieModel.from_pretrained(model_name), num_classes=len(trigger_tag2id))
trigger_model.load_dict(trigger_state_dict)

# load role model
role_model_path = "./checkpoint/role_best.pdparams"
role_state_dict = paddle.load(role_model_path)
role_model = ErnieForTokenClassification(ErnieModel.from_pretrained(model_name), num_classes=len(role_tag2id))
role_model.load_dict(role_state_dict)

def predict(input_text, trigger_model, role_model, tokenizer, trigger_id2tag, role_id2tag, schema):
    trigger_model.eval()
    role_model.eval()

    splited_input_text = list(input_text.strip())
    features = tokenizer(splited_input_text, is_split_into_words=True, max_seq_len=max_seq_len, return_length=True)
    input_ids = paddle.to_tensor(features["input_ids"]).unsqueeze(0)
    token_type_ids = paddle.to_tensor(features["token_type_ids"]).unsqueeze(0)
    seq_len = features["seq_len"]

    trigger_logits = trigger_model(input_ids, token_type_ids)
    trigger_preds = paddle.argmax(trigger_logits, axis=-1).numpy()[0][1:seq_len]
    trigger_preds = [trigger_id2tag[idx] for idx in trigger_preds]
    trigger_entities = get_entities(trigger_preds, suffix=False)

    role_logits = role_model(input_ids, token_type_ids)
    role_preds = paddle.argmax(role_logits, axis=-1).numpy()[0][1:seq_len]
    role_preds = [role_id2tag[idx] for idx in role_preds]
    role_entities = get_entities(role_preds, suffix=False)

    events = []
    visited = set()
    for event_entity in trigger_entities:
        event_type, start, end = event_entity
        if event_type in visited:
            continue
        visited.add(event_type)
        events.append({"event_type":event_type, "trigger":"".join(splited_input_text[start:end+1]), "arguments":[]})

    for event in events:
        role_list = schema[event["event_type"]]
        for role_entity in role_entities:
            role_type, start, end = role_entity
            if role_type not in role_list:
                continue
            event["arguments"].append({"role":role_type, "argument":"".join(splited_input_text[start:end+1])})

    format_print(events)

text = "Huawei mobile phones have been reduced in price. 32 million pixels only cost a thousand yuan. Xiaomi can't compare the cost performance!"
predict(text, trigger_model, role_model, tokenizer, trigger_id2tag, role_id2tag, schema)
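
Under trained models, predict() assembles an events list like the one below before handing it to format_print. This is a hedged illustration consistent with the sample in Section 2.1; the actual output depends on the trained parameters.

# Hedged illustration of the events structure passed to format_print.
events = [
    {
        "event_type": "Finance and Economics/transaction-Price reduction",
        "trigger": "Price reduction",
        "arguments": [
            {"role": "Price reducing party", "argument": "Huawei"},
            {"role": "Bargains", "argument": "mobile phone"},
        ],
    }
]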

7. More deep learning resources

7.1 One-stop deep learning platform: awesome-DeepLearning

  • Introduction to deep learning
  • Deep learning questions
  • Characteristic course
  • Industrial practice

If you have any questions while using these materials, you are welcome to raise them in awesome-DeepLearning. For more deep learning materials, please refer to the PaddlePaddle deep learning platform.


