Source code analysis of a "commonsense knowledge reasoning" model: the evaluation input pipeline


In previous source code analysis reports, I analyzed the first two steps of DrFact's algorithm. Following that order, today I should analyze how the third step of the algorithm is implemented in the source code. However, building on my explanation over the past two weeks of how the DrFact model references the DrKit model, today I want to focus on the parts of the remaining source code that are affected by those references.

input_fns.py is a very important source code module in the DrFact model. It defines several classes named Example, InputFeatures, FeatureWriter and OpenCSRDataset, whose main purpose is to process different datasets into a common format. This py source file imports the following Python modules, including input_fns from the DrKit module:

import collections
import json
import random

from bert import tokenization
from language.labs.drkit import input_fns as input_utils
import tensorflow.compat.v1 as tf
from tqdm import tqdm

from tensorflow.contrib import data as contrib_data

1. Example class

First, I'll introduce the Example class in this source file. This class is very important, since the subsequent classes depend on it. Its code is as follows:

class Example(object):
    """A single training/test example for QA."""

    def __init__(self,
                 qas_id,
                 question_text,
                 subject_entity,  # The concepts mentioned by the question.
                 answer_entity=None,  # The concept(s) in the correct choice.
                 choice2concepts=None,  # For evaluation.
                 correct_choice=None,  # For evaluation.
                 exclude_set=None,  # Concepts in the question and wrong choices.
                 init_facts=None,  # Pre-computed init facts list. [(fid, score), ...]
                 sup_facts=None):  # Supporting facts for supervision.
        self.qas_id = qas_id
        self.question_text = question_text
        self.subject_entity = subject_entity
        self.answer_entity = answer_entity
        self.choice2concepts = choice2concepts
        self.correct_choice = correct_choice
        self.exclude_set = exclude_set
        self.init_facts = init_facts
        self.sup_facts = sup_facts

    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        s = ""
        s += "qas_id: %s" % (tokenization.printable_text(self.qas_id))
        s += ", question_text: %s" % (
            tokenization.printable_text(self.question_text))
        return s

It can be seen that the Example class stores a single training/test example for question answering.
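The constructor arguments map one-to-one onto instance fields. As a simplified, self-contained stand-in (the class name SimpleExample and all the values below are fabricated for illustration, not taken from the OpenCSR data), an example record might look like this:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


# A simplified stand-in that mirrors the fields of the Example class above.
@dataclass
class SimpleExample:
    qas_id: str
    question_text: str
    subject_entity: List[str]  # Concepts mentioned by the question.
    answer_entity: List[int] = field(default_factory=list)
    choice2concepts: Dict[str, List[str]] = field(default_factory=dict)
    correct_choice: str = ""
    exclude_set: List[str] = field(default_factory=list)
    init_facts: List[Tuple[int, float]] = field(default_factory=list)  # [(fid, score), ...]
    sup_facts: List[List[Tuple[int, float]]] = field(default_factory=list)


ex = SimpleExample(
    qas_id="q-0001",
    question_text="what can help alleviate global warming?",
    subject_entity=["global warming"],
    answer_entity=[42],
    choice2concepts={"renewable energy": ["renewable energy"]},
    correct_choice="renewable energy",
    init_facts=[(17, 0.93), (85, 0.71)],
)
print(ex.qas_id, len(ex.init_facts))
```

Note the shape of init_facts: a list of (fact id, retrieval score) pairs, exactly the format that FeatureWriter later splits into two parallel features.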

2. InputFeatures class

The second class to be introduced is the InputFeatures class. Its code is as follows:

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self,
                 qas_id,
                 qry_tokens,
                 qry_input_ids,
                 qry_input_mask,
                 qry_entity_id,
                 answer_entity=None,
                 exclude_set=None,
                 init_facts=None,
                 sup_fact_1hop=None,
                 sup_fact_2hop=None):
        self.qas_id = qas_id
        self.qry_tokens = qry_tokens
        self.qry_input_ids = qry_input_ids
        self.qry_input_mask = qry_input_mask
        self.qry_entity_id = qry_entity_id
        self.answer_entity = answer_entity
        self.exclude_set = exclude_set
        self.init_facts = init_facts
        self.sup_fact_1hop = sup_fact_1hop
        self.sup_fact_2hop = sup_fact_2hop

It can be seen that the InputFeatures class stores a single set of features for one example in the dataset.
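Unlike Example, which stores raw text and concept names, InputFeatures holds the tokenized query: qry_input_ids padded to a fixed length, with a matching qry_input_mask (the OpenCSRDataset class below declares both as fixed-length features of size max_qry_length). The sketch below shows that padding step in isolation; it is a minimal illustration of BERT-style preprocessing, not the repo's actual helper, so the function name and the token ids are assumptions:

```python
def pad_query(token_ids, max_qry_length, pad_id=0):
    """Truncate or pad token ids to a fixed length and build the matching mask."""
    token_ids = list(token_ids)[:max_qry_length]
    mask = [1] * len(token_ids)       # 1 marks a real token.
    while len(token_ids) < max_qry_length:
        token_ids.append(pad_id)      # 0 pads to the fixed length...
        mask.append(0)                # ...and the mask marks it as padding.
    return token_ids, mask


# Hypothetical token ids for a short question.
ids, mask = pad_query([101, 2054, 2064, 102], max_qry_length=8)
print(ids)   # [101, 2054, 2064, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```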

3. FeatureWriter class

The third class to be introduced is the FeatureWriter class. The specific code of this class is as follows:

class FeatureWriter(object):
    """Writes InputFeature to TF example file."""

    def __init__(self, filename, is_training, has_bridge):
        self.filename = filename
        self.is_training = is_training
        self.has_bridge = has_bridge
        self.num_features = 0
        self._writer = tf.python_io.TFRecordWriter(filename)

    def process_feature(self, feature):
        """Write a InputFeature to the TFRecordWriter as a tf.train.Example."""
        # The feature object is actually of Example class.
        self.num_features += 1

        def create_int_feature(values):
            feature = tf.train.Feature(
                int64_list=tf.train.Int64List(value=list(values)))
            return feature

        def create_float_feature(values):
            feature = tf.train.Feature(
                float_list=tf.train.FloatList(value=list(values)))
            return feature

        def create_bytes_feature(value):
            return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

        features = collections.OrderedDict()
        features["qas_ids"] = create_bytes_feature(feature.qas_id)
        features["qry_input_ids"] = create_int_feature(feature.qry_input_ids)
        features["qry_input_mask"] = create_int_feature(feature.qry_input_mask)
        features["qry_entity_id"] = create_int_feature(feature.qry_entity_id)
        # Prepare Init Facts as features
        init_fact_ids = [x[0] for x in feature.init_facts]
        init_fact_scores = [x[1] for x in feature.init_facts]

        features["init_fact_ids"] = create_int_feature(init_fact_ids)
        features["init_fact_scores"] = create_float_feature(init_fact_scores)

        if self.is_training:
            features["answer_entities"] = create_int_feature(feature.answer_entity)
            features["exclude_set"] = create_int_feature(feature.exclude_set)
            # TODO: add a hyperparameter (e.g. 10) to limit the num of sup facts.
            max_sup_fact_num = None  # e.g. 10; None means no limit.
            sup_fact_1hop_ids = list(set([x[0] for x in feature.sup_fact_1hop]))[:max_sup_fact_num]
            sup_fact_2hop_ids = list(set([x[0] for x in feature.sup_fact_2hop]))[:max_sup_fact_num]
            features["sup_fact_1hop_ids"] = create_int_feature(sup_fact_1hop_ids)
            features["sup_fact_2hop_ids"] = create_int_feature(sup_fact_2hop_ids)

        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        self._writer.write(tf_example.SerializeToString())

    def close(self):
        self._writer.close()

It can be seen that in addition to the constructor, this class has two methods: process_feature() and close(). The former serializes an InputFeatures object into a tf.train.Example and writes it to the TFRecord file; the latter closes the underlying writer.
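Inside process_feature(), the (fid, score) pairs of init_facts are split into two parallel feature lists, and the supporting-fact ids are deduplicated before the optional cap is applied. A self-contained sketch of just that data manipulation, with fabricated fact ids (sorted() is used here for deterministic output, where the original uses list(set(...))):

```python
# Fabricated (fact_id, score) pairs in the same shape as feature.init_facts.
init_facts = [(17, 0.93), (85, 0.71), (3, 0.55)]
init_fact_ids = [x[0] for x in init_facts]       # Parallel list of fact ids.
init_fact_scores = [x[1] for x in init_facts]    # Parallel list of scores.

# Supporting facts may repeat a fact id; dedupe, then cap the list length.
max_sup_fact_num = 10  # The TODO in the code suggests a hyperparameter like this.
sup_fact_1hop = [(5, 0.9), (5, 0.9), (12, 0.4)]
sup_fact_1hop_ids = sorted(set(x[0] for x in sup_fact_1hop))[:max_sup_fact_num]

print(init_fact_ids)      # [17, 85, 3]
print(init_fact_scores)   # [0.93, 0.71, 0.55]
print(sup_fact_1hop_ids)  # [5, 12]
```

Slicing with `[:None]` in the original code is valid Python and simply keeps the whole list, which is why `max_sup_fact_num = None` works as "no limit".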

4. OpenCSRDataset class

The last class to be introduced is the OpenCSRDataset class. The specific code of this class is as follows:

class OpenCSRDataset(object):
    """Reads the open commonsense reasoning dataset and converts to TFRecords."""

    def __init__(self, in_file, tokenizer, subject_mention_probability,
                 max_qry_length, is_training, entity2id, tfrecord_filename):
        """Initialize dataset."""
        del subject_mention_probability

        self.gt_file = in_file
        self.max_qry_length = max_qry_length
        self.is_training = is_training

        # Read examples from JSON file.
        self.examples = self.read_examples(in_file, entity2id)
        self.num_examples = len(self.examples)

        if is_training:
            # Pre-shuffle the input to avoid having to make a very large shuffle
            # buffer in the `input_fn`.
            rng = random.Random(12345)
            rng.shuffle(self.examples)

        # Write to TFRecords file.
        writer = FeatureWriter(
            filename=tfrecord_filename,
            is_training=self.is_training,
            has_bridge=False)

        # Create input_fn.
        names_to_features = {
            "qas_ids": tf.FixedLenFeature([], tf.string),
            "qry_input_ids": tf.FixedLenFeature([self.max_qry_length], tf.int64),
            "qry_input_mask": tf.FixedLenFeature([self.max_qry_length], tf.int64),
            "qry_entity_id": tf.VarLenFeature(tf.int64),
            "init_fact_ids": tf.VarLenFeature(tf.int64),
            "init_fact_scores": tf.VarLenFeature(tf.float32),
        }
        if is_training:
            names_to_features["answer_entities"] = tf.VarLenFeature(tf.int64)
            names_to_features["exclude_set"] = tf.VarLenFeature(tf.int64)
            names_to_features["sup_fact_1hop_ids"] = tf.VarLenFeature(tf.int64)
            names_to_features["sup_fact_2hop_ids"] = tf.VarLenFeature(tf.int64)

        self.input_fn = input_fn_builder(
            input_file=tfrecord_filename,
            is_training=self.is_training,
            drop_remainder=False,
            names_to_features=names_to_features)

    def read_examples(self, queries_file, entity2id):
        """Read a json file into a list of Example."""
        self.max_qry_answers = 0
        num_qrys_without_answer, num_qrys_without_all_answers = 0, 0
        num_qrys_without_entity, num_qrys_without_all_entities = 0, 0
        tf.logging.info("Reading examples from %s", queries_file)
        with tf.gfile.Open(queries_file, "r") as reader:
            examples = []
            one_hop_num = 0
            for line in tqdm(reader, desc="Reading from %s" % queries_file):
                item = json.loads(line.strip())

                qas_id = item["_id"]
                question_text = item["question"]

                question_entities = []
                for entity in item["entities"]:
                    if entity["kb_id"].lower() in entity2id:
                        question_entities.append(entity["kb_id"].lower())
                if not question_entities:
                    num_qrys_without_entity += 1
                    if self.is_training:
                        continue  # Skip questions without linked entities at training time.
                if len(question_entities) != len(item["entities"]):
                    num_qrys_without_all_entities += 1

                # Make up the format.
                answer_concepts = list(set([c["kb_id"] for c in item["all_answer_concepts"]]))  # TODO: decomp?
                choice2concepts = {}
                choice2concepts[item["answer"]] = answer_concepts
                # choice2concepts = item["choice2concepts"]
                answer_txt = item["answer"]
                assert answer_txt in choice2concepts
                answer_entities = []

                if self.is_training:
                    # Training time, we use all concepts in the correct choice.
                    answer_concepts = list(
                        set([c["kb_id"] for c in item["all_answer_concepts_decomp"]]))  # TODO: decomp?
                    choice2concepts[item["answer"]] = answer_concepts
                    for answer_concept in answer_concepts:
                        if answer_concept in entity2id:
                            # TODO: add an arg to decide if only the longest concept is used.
                            answer_entities.append(entity2id[answer_concept])
                else:
                    # Test time, we use unique concepts in the correct choice.
                    for answer_concept in choice2concepts[answer_txt]:
                        if answer_concept in entity2id:
                            # TODO: add an arg to decide if only the longest concept is used.
                            answer_entities.append(entity2id[answer_concept])

                if len(answer_entities) > self.max_qry_answers:
                    self.max_qry_answers = len(answer_entities)
                    tf.logging.warn("%s has %d linked entities", qas_id,
                                    len(answer_entities))

                if not answer_entities:
                    num_qrys_without_answer += 1
                    if self.is_training:
                        continue  # Skip questions without any linked answer at training time.
                if len(answer_entities) < len(item["answer_concepts"]):
                    num_qrys_without_all_answers += 1

                # Define the exclude_entities as the question entities,
                # and the concepts mentioned by wrong choices.
                exclude_entities = question_entities[:]
                # for choice, concepts in choice2concepts.items():
                #   if choice == answer_txt:
                #     continue
                #   for non_answer_concept in concepts:
                #     if non_answer_concept in entity2id:
                #       exclude_entities.append(non_answer_concept.lower())
                init_facts = item["init_facts"]
                sup_facts = item["sup_facts"]

                if sup_facts[0] == sup_facts[1]:
                    one_hop_num += 1
                example = Example(
                    qas_id=qas_id,
                    question_text=question_text,
                    subject_entity=question_entities,
                    answer_entity=answer_entities,
                    choice2concepts=choice2concepts,
                    correct_choice=answer_txt,
                    exclude_set=exclude_entities,
                    init_facts=init_facts,
                    sup_facts=sup_facts)
                examples.append(example)
            tf.logging.info("Number of valid questions = %d", len(examples))
            tf.logging.info("Number of one-hop questions = %d", one_hop_num)
            tf.logging.info("Ratio of one-hop questions = %.2f",
                            one_hop_num / len(examples))
            tf.logging.info("Questions without any answer = %d",
                            num_qrys_without_answer)
            tf.logging.info("Questions without all answers = %d",
                            num_qrys_without_all_answers)
            tf.logging.info("Questions without any entity = %d",
                            num_qrys_without_entity)
            tf.logging.info("Questions without all entities = %d",
                            num_qrys_without_all_entities)
            tf.logging.info("Maximum answers per question = %d",
                            self.max_qry_answers)

        return examples

Thus, the purpose of this class is to store an OpenCSR dataset and convert it to TFRecords. In addition to the constructor, the class has a read_examples() method, which reads the data samples from a JSON-lines file into a list of Example objects.
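read_examples() consumes a JSON-lines file, one question per line, keeping only the entities that appear in entity2id. A minimal, self-contained sketch of that parsing step (the field names follow the code above; the question data and the entity2id mapping are fabricated):

```python
import json

# Fabricated concept vocabulary standing in for the real entity2id mapping.
entity2id = {"global warming": 0, "renewable energy": 1}

# One question per line, mirroring the fields accessed in read_examples().
line = json.dumps({
    "_id": "q-0001",
    "question": "what can help alleviate global warming?",
    "entities": [{"kb_id": "Global Warming"}, {"kb_id": "ozone"}],
    "answer": "renewable energy",
    "all_answer_concepts": [{"kb_id": "renewable energy"}],
})

item = json.loads(line.strip())
qas_id = item["_id"]
# Lowercase and keep only entities that are in the vocabulary ("ozone" is dropped).
question_entities = [
    e["kb_id"].lower() for e in item["entities"] if e["kb_id"].lower() in entity2id
]
# Deduplicate the answer concepts, as the original code does with set().
answer_concepts = list({c["kb_id"] for c in item["all_answer_concepts"]})

print(qas_id, question_entities, answer_concepts)
```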

Tags: Python NLP

Posted on Sun, 05 Dec 2021 13:12:33 -0500 by kriek