Source Code Analysis of "Reasoning Problems Based on Commonsense Knowledge": Evaluation Method

2021SC@SDUSC

In my previous source code analysis reports, I analyzed the first two steps of DrFact's algorithm, so the natural next step would be to analyze how the third step is implemented in the source code. However, building on my explanation over the past two weeks of how the DrFact model references the DrKit model, today I instead want to focus on the parts of the remaining source code that are affected by those references.

1. input_fns.py

input_fns.py is a very important source code module in the DrFact model. It defines several classes named Example, InputFeatures, FeatureWriter, and OpenCSRDataset, whose main purpose is to process different datasets into a common format. This source file imports the following Python modules, including a reference to input_fns from the DrKit module:

import collections
import json
import random

from bert import tokenization  # BERT's WordPiece tokenizer utilities.
from language.labs.drkit import input_fns as input_utils  # Reused DrKit input pipeline.
import tensorflow.compat.v1 as tf
from tqdm import tqdm

from tensorflow.contrib import data as contrib_data  # TF1 contrib data utilities.

1. Example class

First, I'll introduce the Example class in the input_fns.py source file. This class is very important and is a dependency of the subsequent classes. Its code is as follows:

class Example(object):
    """A single training/test example for QA."""

    def __init__(
            self,
            qas_id,
            question_text,
            subject_entity,  # The concepts mentioned by the question.
            answer_entity=None,  # The concept(s) in the correct choice
            choice2concepts=None,  # for evaluation
            correct_choice=None,  # for evaluation
            exclude_set=None,  # concepts in the question and wrong choices
            init_facts=None,  # pre-computed init facts list. [(fid, score), ...]
            sup_facts=None
    ):
        self.qas_id = qas_id
        self.question_text = question_text
        self.subject_entity = subject_entity
        self.answer_entity = answer_entity
        self.choice2concepts = choice2concepts
        self.correct_choice = correct_choice
        self.exclude_set = exclude_set
        self.init_facts = init_facts
        self.sup_facts = sup_facts

    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        s = ""
        s += "qas_id: %s" % (tokenization.printable_text(self.qas_id))
        s += ", question_text: %s" % (
            tokenization.printable_text(self.question_text))
        return s

It can be seen that the Example class stores a single training/test example for question answering.
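
To make the fields concrete, here is a minimal sketch that constructs an Example by hand. All values below are hypothetical and only illustrate the expected shapes; in the real pipeline they come from the OpenCSR JSON files:

# Hypothetical values purely for illustration.
example = Example(
    qas_id="q-0001",
    question_text="What can be used to carry water?",
    subject_entity=["water"],                # concepts mentioned in the question
    answer_entity=[42],                      # entity ids of the correct concepts
    choice2concepts={"bucket": ["bucket"]},  # maps each choice to its concepts
    correct_choice="bucket",
    exclude_set=["water"],                   # question (and wrong-choice) concepts
    init_facts=[(17, 0.93), (85, 0.41)],     # (fact id, score) pairs
    sup_facts=([(17, 0.93)], [(85, 0.41)]))  # 1-hop and 2-hop supporting facts

print(example)  # qas_id: q-0001, question_text: What can be used to carry water?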

2. InputFeatures class

The second class to be introduced is the InputFeatures class. The specific code of this class is as follows:

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self,
                 qas_id,
                 qry_tokens,
                 qry_input_ids,
                 qry_input_mask,
                 qry_entity_id,
                 answer_entity=None,
                 exclude_set=None,
                 init_facts=None,
                 sup_fact_1hop=None,
                 sup_fact_2hop=None, ):
        self.qas_id = qas_id
        self.qry_tokens = qry_tokens
        self.qry_input_ids = qry_input_ids
        self.qry_input_mask = qry_input_mask
        self.qry_entity_id = qry_entity_id
        self.answer_entity = answer_entity
        self.exclude_set = exclude_set
        self.init_facts = init_facts
        self.sup_fact_1hop = sup_fact_1hop
        self.sup_fact_2hop = sup_fact_2hop

It can be seen that the InputFeatures class stores a single set of features derived from one example, ready to be serialized.
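
The qry_* fields are standard BERT-style encodings of the question text. In the real module this work happens inside convert_examples_to_features, but the general recipe is roughly the following sketch (the vocabulary path is a placeholder, and DrFact's exact token layout may differ):

from bert import tokenization

max_qry_length = 32  # hypothetical; set by a flag in the real pipeline

# "vocab.txt" is a placeholder for an actual BERT vocabulary file.
tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

tokens = ["[CLS]"] + tokenizer.tokenize("What can be used to carry water?")
tokens = tokens[:max_qry_length - 1] + ["[SEP]"]

qry_input_ids = tokenizer.convert_tokens_to_ids(tokens)
qry_input_mask = [1] * len(qry_input_ids)

# Zero-pad to the fixed length expected by the FixedLenFeature schema used later.
while len(qry_input_ids) < max_qry_length:
    qry_input_ids.append(0)
    qry_input_mask.append(0)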

3. FeatureWriter class

The third class to be introduced is the FeatureWriter class. The specific code of this class is as follows:

class FeatureWriter(object):
    """Writes InputFeature to TF example file."""

    def __init__(self, filename, is_training, has_bridge):
        self.filename = filename
        self.is_training = is_training
        self.has_bridge = has_bridge
        self.num_features = 0
        self._writer = tf.python_io.TFRecordWriter(filename)

    def process_feature(self, feature):
        """Write a InputFeature to the TFRecordWriter as a tf.train.Example."""
        # The feature object is an InputFeatures instance.
        self.num_features += 1

        def create_int_feature(values):
            feature = tf.train.Feature(
                int64_list=tf.train.Int64List(value=list(values)))
            return feature

        def create_float_feature(values):
            feature = tf.train.Feature(
                float_list=tf.train.FloatList(value=list(values)))
            return feature

        def create_bytes_feature(value):
            return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

        features = collections.OrderedDict()
        features["qas_ids"] = create_bytes_feature(feature.qas_id)
        features["qry_input_ids"] = create_int_feature(feature.qry_input_ids)
        features["qry_input_mask"] = create_int_feature(feature.qry_input_mask)
        features["qry_entity_id"] = create_int_feature(feature.qry_entity_id)
        # Prepare Init Facts as features
        init_fact_ids = [x[0] for x in feature.init_facts]
        init_fact_scores = [x[1] for x in feature.init_facts]

        features["init_fact_ids"] = create_int_feature(init_fact_ids)
        features["init_fact_scores"] = create_float_feature(init_fact_scores)

        if self.is_training:
            features["answer_entities"] = create_int_feature(feature.answer_entity)
            features["exclude_set"] = create_int_feature(feature.exclude_set)
            # TODO: add a hyperparameter (e.g., 10) to limit the number of sup facts.
            max_sup_fact_num = None  # None keeps all supporting facts.
            sup_fact_1hop_ids = list(set([x[0] for x in feature.sup_fact_1hop]))[:max_sup_fact_num]
            sup_fact_2hop_ids = list(set([x[0] for x in feature.sup_fact_2hop]))[:max_sup_fact_num]
            features["sup_fact_1hop_ids"] = create_int_feature(sup_fact_1hop_ids)
            features["sup_fact_2hop_ids"] = create_int_feature(sup_fact_2hop_ids)

        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        self._writer.write(tf_example.SerializeToString())

    def close(self):
        self._writer.close()

It can be seen that, in addition to the constructor, this class has two methods: process_feature() and close(). process_feature() converts one InputFeatures object into a tf.train.Example and writes it to the TFRecord file, while close() closes the underlying writer.
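
As a quick sanity check, the writer can be exercised end to end using the classes and imports shown above: serialize one feature, then read the record back with TF1's record iterator. The feature values here are hypothetical:

writer = FeatureWriter(
    filename="/tmp/train.tfrecord", is_training=True, has_bridge=False)
writer.process_feature(InputFeatures(
    qas_id=b"q-0001",                 # bytes, since it is stored as a BytesList
    qry_tokens=["[CLS]", "what", "[SEP]"],
    qry_input_ids=[101, 2054, 102, 0],
    qry_input_mask=[1, 1, 1, 0],
    qry_entity_id=[7],
    answer_entity=[42],
    exclude_set=[7],
    init_facts=[(17, 0.93), (85, 0.41)],
    sup_fact_1hop=[(17, 0.93)],
    sup_fact_2hop=[(85, 0.41)]))
writer.close()

# Read the record back and inspect one of its features.
for record in tf.python_io.tf_record_iterator("/tmp/train.tfrecord"):
    parsed = tf.train.Example.FromString(record)
    print(parsed.features.feature["init_fact_ids"].int64_list.value)  # [17, 85]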

4. OpenCSRDataset class

The last class to be introduced is the OpenCSRDataset class. The specific code of this class is as follows:

class OpenCSRDataset(object):
    """Reads the open commonsense reasoning dataset and converts to TFRecords."""

    def __init__(self, in_file, tokenizer, subject_mention_probability,
                 max_qry_length, is_training, entity2id, tfrecord_filename):
        """Initialize dataset."""
        del subject_mention_probability

        self.gt_file = in_file
        self.max_qry_length = max_qry_length
        self.is_training = is_training

        # Read examples from JSON file.
        self.examples = self.read_examples(in_file, entity2id)
        self.num_examples = len(self.examples)

        if is_training:
            # Pre-shuffle the input to avoid having to make a very large shuffle
            # buffer in the `input_fn`.
            rng = random.Random(12345)
            rng.shuffle(self.examples)

        # Write to TFRecords file.
        writer = FeatureWriter(
            filename=tfrecord_filename,
            is_training=self.is_training,
            has_bridge=False)
        convert_examples_to_features(
            examples=self.examples,
            tokenizer=tokenizer,
            max_query_length=self.max_qry_length,
            entity2id=entity2id,
            output_fn=writer.process_feature)
        writer.close()

        # Create input_fn.
        names_to_features = {
            "qas_ids": tf.FixedLenFeature([], tf.string),
            "qry_input_ids": tf.FixedLenFeature([self.max_qry_length], tf.int64),
            "qry_input_mask": tf.FixedLenFeature([self.max_qry_length], tf.int64),
            "qry_entity_id": tf.VarLenFeature(tf.int64),
            "init_fact_ids": tf.VarLenFeature(tf.int64),
            "init_fact_scores": tf.VarLenFeature(tf.float32),
        }
        if is_training:
            names_to_features["answer_entities"] = tf.VarLenFeature(tf.int64)
            names_to_features["exclude_set"] = tf.VarLenFeature(tf.int64)
            names_to_features["sup_fact_1hop_ids"] = tf.VarLenFeature(tf.int64)
            names_to_features["sup_fact_2hop_ids"] = tf.VarLenFeature(tf.int64)

        self.input_fn = input_fn_builder(
            input_file=tfrecord_filename,
            is_training=self.is_training,
            drop_remainder=False,
            names_to_features=names_to_features)

    def read_examples(self, queries_file, entity2id):
        """Read a json file into a list of Example."""
        self.max_qry_answers = 0
        num_qrys_without_answer, num_qrys_without_all_answers = 0, 0
        num_qrys_without_entity, num_qrys_without_all_entities = 0, 0
        tf.logging.info("Reading examples from %s", queries_file)
        with tf.gfile.Open(queries_file, "r") as reader:
            examples = []
            one_hop_num = 0
            for line in tqdm(reader, desc="Reading from %s" % reader.name):
                item = json.loads(line.strip())

                qas_id = item["_id"]
                question_text = item["question"]

                question_entities = []
                for entity in item["entities"]:
                    if entity["kb_id"].lower() in entity2id:
                        question_entities.append(entity["kb_id"].lower())
                if not question_entities:
                    num_qrys_without_entity += 1
                    if self.is_training:
                        continue
                if len(question_entities) != len(item["entities"]):
                    num_qrys_without_all_entities += 1

                # Make up the expected format.
                answer_concepts = list(set([c["kb_id"] for c in item["all_answer_concepts"]]))  # TODO: decomp?
                choice2concepts = {}
                choice2concepts[item["answer"]] = answer_concepts
                # choice2concepts = item["choice2concepts"]
                answer_txt = item["answer"]
                assert answer_txt in choice2concepts
                answer_entities = []

                if self.is_training:
                    # Training time, we use all concepts in the correct choice.
                    answer_concepts = list(
                        set([c["kb_id"] for c in item["all_answer_concepts_decomp"]]))  # TODO: decomp?
                    choice2concepts[item["answer"]] = answer_concepts
                    for answer_concept in answer_concepts:
                        if answer_concept in entity2id:
                            # TODO: add an arg for decide if only use the longest concept.
                            answer_entities.append(entity2id[answer_concept])
                else:
                    # Test time, we use unique concepts in the correct choice.
                    for answer_concept in choice2concepts[answer_txt]:
                        if answer_concept in entity2id:
                            # TODO: add an arg for decide if only use the longest concept.
                            answer_entities.append(entity2id[answer_concept])

                if len(answer_entities) > self.max_qry_answers:
                    self.max_qry_answers = len(answer_entities)
                    tf.logging.warn("%s has %d answer entities", qas_id,
                                    len(answer_entities))

                if not answer_entities:
                    num_qrys_without_answer += 1
                    if self.is_training:
                        continue
                if len(answer_entities) < len(item["answer_concepts"]):
                    num_qrys_without_all_answers += 1

                # Define the exclude_entities as the question entities,
                # and the concepts mentioned by wrong choices.
                exclude_entities = question_entities[:]
                # for choice, concepts in choice2concepts.items():
                #   if choice == answer_txt:
                #     continue
                #   for non_answer_concept in concepts:
                #     if non_answer_concept in entity2id:
                #       exclude_entities.append(non_answer_concept.lower())
                init_facts = item["init_facts"]
                sup_facts = item["sup_facts"]

                if sup_facts[0] == sup_facts[1]:
                    one_hop_num += 1
                example = Example(
                    qas_id=qas_id,
                    question_text=question_text,
                    subject_entity=question_entities,
                    answer_entity=answer_entities,
                    correct_choice=answer_txt,
                    choice2concepts=choice2concepts,
                    exclude_set=exclude_entities,
                    init_facts=init_facts,
                    sup_facts=sup_facts
                )
                examples.append(example)

        tf.logging.info("Number of valid questions = %d", len(examples))
        tf.logging.info("Number of one-hop questions = %d", one_hop_num)
        tf.logging.info("Ratio of one-hop questions = %.2f", one_hop_num / len(examples))
        tf.logging.info("Questions without any answer = %d",
                        num_qrys_without_answer)
        tf.logging.info("Questions without all answers = %d",
                        num_qrys_without_all_answers)
        tf.logging.info("Questions without any entity = %d",
                        num_qrys_without_entity)
        tf.logging.info("Questions without all entities = %d",
                        num_qrys_without_all_entities)
        tf.logging.info("Maximum answers per question = %d", self.max_qry_answers)

        return examples

Thus, the purpose of this class is to read an OpenCSR dataset, convert it to TFRecords, and build the corresponding input_fn. In addition to the constructor, the class has a read_examples() method, which reads the data samples from a JSON-lines file into a list of Example objects and logs statistics about missing entities and answers.
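
Putting the pieces together, read_examples() expects a JSON-lines file with one question per line, containing at least the keys it accesses above. A hypothetical record, built in Python purely for illustration, might look like this:

import json

# A hypothetical record covering every key that read_examples() accesses.
record = {
    "_id": "q-0001",
    "question": "What can be used to carry water?",
    "entities": [{"kb_id": "water"}],
    "answer": "bucket",
    "answer_concepts": [{"kb_id": "bucket"}],
    "all_answer_concepts": [{"kb_id": "bucket"}],
    "all_answer_concepts_decomp": [{"kb_id": "bucket"}],
    "init_facts": [[17, 0.93], [85, 0.41]],    # (fact id, score) pairs
    "sup_facts": [[[17, 0.93]], [[85, 0.41]]], # 1-hop and 2-hop fact lists
}
with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")

An OpenCSRDataset could then be constructed on such a file with a BERT tokenizer and an entity2id mapping, which both writes the TFRecord file and builds the corresponding input_fn in one pass.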
