Hierarchical Attention Networks for document classification, step by step

Original article: http://bjbsair.com/2020-03-25/tech-info/6302/

Today, let's take a look at how well the much-hyped Attention mechanism actually works, using the NAACL paper Hierarchical Attention Networks for Document Classification.

**Overview of the paper**

In recent years, the most popular approaches in NLP seem to be RNNs, LSTMs, GRUs, attention, and frameworks that combine their variants. In this paper, the authors exploit the structure of text: they use a bidirectional GRU and apply attention at two levels, word-level attention and sentence-level attention, to model the importance of words within sentences and of sentences within documents. This is quite reasonable when you think about it: a document is made up of many sentences, and a sentence is made up of many words, so the model fully accounts for the internal structure of the document.

The figure in the paper shows the overall framework of the text classification model. It is mainly divided into four parts:

  • word encoder (BiGRU layer)
  • word attention (Attention layer)
  • sentence encoder (BiGRU layer)
  • sentence attention (Attention layer)

Let's first review how a GRU works.

The GRU is a variant of the RNN that uses a gating mechanism to keep track of the state of the sequence. A GRU has two kinds of gates, a reset gate and an update gate, which together determine how much information flows into the current state.

The reset gate determines how much past information is used to generate the candidate state; if $r_t$ is 0, all previous state information is forgotten.

Given the reset gate, the candidate state can be computed.

The update gate determines how much past information is retained and how much new information is added.

Finally, the hidden state is computed from the update gate, the candidate state, and the previous hidden state.
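Putting the four steps together, the standard GRU equations are (with $x_t$ the input at step $t$, $h_{t-1}$ the previous hidden state, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication):

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$

$$\tilde{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h)$$

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$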

Next, let's review the Attention principle:
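In its general form, attention scores a query against a set of keys, normalizes the scores with a softmax, and returns the weighted sum of the corresponding values:

$$e_i = \mathrm{score}(q, k_i), \qquad \alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}, \qquad c = \sum_i \alpha_i v_i$$

In HAN, the query is a learned context vector, the keys are projections of the hidden states, and the values are the hidden states themselves.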

Well, let's take a look at the model in the paper:

1. Word encoder layer

First, the words in each sentence are embedded into word vectors. The word vectors are then fed into a bidirectional GRU, which combines context from both directions to produce a hidden-state output for each word.
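In the paper's notation, word $w_{it}$ (the $t$-th word of sentence $i$) is embedded with matrix $W_e$ and encoded in both directions, and the two hidden states are concatenated:

$$x_{it} = W_e w_{it}$$

$$\overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}(x_{it}), \qquad \overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}(x_{it}), \qquad h_{it} = [\overrightarrow{h}_{it}; \overleftarrow{h}_{it}]$$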

2. Word attention layer

The purpose of the attention mechanism is to find the most important words in a sentence and give them larger weights.

First, the output of the word encoder is fed through a one-layer MLP, and the result is taken as the hidden representation of each word.

Then, to measure the importance of each word, a randomly initialized word-level context vector is defined and its similarity with each word's hidden representation is computed. After a softmax, we obtain normalized attention weights, where $\alpha_{it}$ is the weight of the $t$-th word in sentence $i$.

The sentence vector $s_i$ can therefore be regarded as the weighted sum of the word vectors in the sentence. The word-level context vector is randomly initialized and learned during training; we can think of it as a high-level representation of the query "which words in this sentence carry the most important information?"
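These three steps correspond to the paper's word-attention equations, where $u_w$ is the word-level context vector:

$$u_{it} = \tanh(W_w h_{it} + b_w)$$

$$\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)}$$

$$s_i = \sum_{t} \alpha_{it} h_{it}$$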

3. Sentence encoder

Through the steps above we obtain a vector representation of each sentence; the document vector is then obtained in a similar way.
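Concretely, the sentence vectors $s_i$ are fed into another bidirectional GRU and the hidden states are concatenated:

$$\overrightarrow{h}_{i} = \overrightarrow{\mathrm{GRU}}(s_i), \qquad \overleftarrow{h}_{i} = \overleftarrow{\mathrm{GRU}}(s_i), \qquad h_{i} = [\overrightarrow{h}_{i}; \overleftarrow{h}_{i}]$$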

4. Sentence attention

Similar to word-level attention, the authors introduce a sentence-level context vector to measure the importance of each sentence within the whole document.
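With $u_s$ the sentence-level context vector, the document vector $v$ is the attention-weighted sum of the sentence encodings:

$$u_{i} = \tanh(W_s h_{i} + b_s), \qquad \alpha_{i} = \frac{\exp(u_{i}^{\top} u_s)}{\sum_{i} \exp(u_{i}^{\top} u_s)}, \qquad v = \sum_{i} \alpha_{i} h_{i}$$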

5. Softmax

The vector $v$ above is the final document representation; it is fed into a fully connected softmax layer for classification.
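That is, the class probabilities and the training loss (the negative log-likelihood of the true label $j_d$ of document $d$) are:

$$p = \mathrm{softmax}(W_c v + b_c), \qquad L = -\sum_{d} \log p_{d,\, j_d}$$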

6. Model effect

In the paper, HAN is evaluated on six large document classification datasets (Yelp 2013-2015, IMDB, Yahoo Answers, and Amazon reviews) and achieves better accuracy than the non-hierarchical baselines on all of them.

**Code implementation**

Define the model (the code below is written against TensorFlow 1.x):

class HAN(object):  
    def __init__(self, max_sentence_num, max_sentence_length, num_classes, vocab_size,  
                 embedding_size, learning_rate, decay_steps, decay_rate,  
                 hidden_size, l2_lambda, grad_clip, is_training=False,  
                 initializer=tf.random_normal_initializer(stddev=0.1)):  
        self.vocab_size = vocab_size  
        self.max_sentence_num = max_sentence_num  
        self.max_sentence_length = max_sentence_length  
        self.num_classes = num_classes  
        self.embedding_size = embedding_size  
        self.hidden_size = hidden_size  
        self.learning_rate = learning_rate  
        self.decay_rate = decay_rate  
        self.decay_steps = decay_steps  
        self.l2_lambda = l2_lambda  
        self.grad_clip = grad_clip  
        self.initializer = initializer  
        self.global_step = tf.Variable(0, trainable=False, name='global_step')  
        # placeholder  
        self.input_x = tf.placeholder(tf.int32, [None, max_sentence_num, max_sentence_length], name='input_x')  
        self.input_y = tf.placeholder(tf.int32, [None, num_classes], name='input_y')  
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')  
        if not is_training:  
            return  
        word_embedding = self.word2vec()  
        sen_vec = self.sen2vec(word_embedding)  
        doc_vec = self.doc2vec(sen_vec)  
        self.logits = self.inference(doc_vec)  
        self.loss_val = self.loss(self.input_y, self.logits)  
        self.train_op = self.train()  
        self.prediction = tf.argmax(self.logits, axis=1, name='prediction')  
        self.pred_min = tf.reduce_min(self.prediction)  
        self.pred_max = tf.reduce_max(self.prediction)  
        self.pred_cnt = tf.bincount(tf.cast(self.prediction, dtype=tf.int32))  
        self.label_cnt = tf.bincount(tf.cast(tf.argmax(self.input_y, axis=1), dtype=tf.int32))  
        self.accuracy = self.accuracy(self.logits, self.input_y)
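
The helper methods word2vec, sen2vec, doc2vec, inference, loss, train and accuracy are not shown here. As an illustration only, below is a minimal sketch of the attention layer that sen2vec and doc2vec would rely on, written against TensorFlow 1.x; the name attention_layer and its arguments are hypothetical, not taken from the original code.

import tensorflow as tf

def attention_layer(inputs, attention_size, scope_name):
    """Word- or sentence-level attention over BiGRU outputs.
    inputs: [batch, time, 2 * hidden_size] hidden states.
    Returns a [batch, 2 * hidden_size] weighted sum (the s_i or v vector)."""
    with tf.variable_scope(scope_name):
        # Trainable context vector u_w (or u_s), randomly initialized and learned.
        u_context = tf.get_variable('u_context', shape=[attention_size],
                                    initializer=tf.random_normal_initializer(stddev=0.1))
        # One-layer MLP: u_it = tanh(W * h_it + b)
        h = tf.layers.dense(inputs, attention_size, activation=tf.nn.tanh)
        # Similarity with the context vector, softmax-normalized over the time axis.
        alpha = tf.nn.softmax(tf.reduce_sum(h * u_context, axis=2, keepdims=True), axis=1)
        # Weighted sum of the original hidden states.
        return tf.reduce_sum(inputs * alpha, axis=1)

Inside sen2vec this could be called roughly as attention_layer(word_bigru_outputs, self.hidden_size * 2, 'word_attention') to produce one vector per sentence, and again at the document level in doc2vec.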
- END -

The End- http://bjbsair.com/2020-03-25/tech-info/6302/ Today, let's take a look at the effect of netred Attention. From ACL's paper, Hierarchical Attention Networks for Document Classification

**Overview of the paper
**

In recent years, in the field of NLP, it seems that the most popular is RNN, LSTM, GRU, attention and their variant combination framework. In this paper, the author analyzes the structure of the text, uses the structure of two-way GRU, and adjusts the attention: considering the attention of the word level and the attention of the sentence level, respectively, models the importance of words in sentences and sentences in documents. It's quite reasonable to think about it carefully. A document is made up of countless sentences, and a sentence is made up of countless words, fully considering the internal structure of the document.

The figure above is the overall framework of the text classification model in the paper. It can be seen that it is mainly divided into four parts:

  • word encoder (BiGRU layer)
  • word attention (Attention layer)
  • sentence encoder (BiGRU layer)
  • sentence attention (Attention layer)

Let's first review the principle of GRU:

GRU is a variant of RNN, which uses gate mechanism to record the status of current sequence. There are two types of gates in GRU: reset gate and update gate. The two gates control together to determine how much information the current state has to update.

reset gate is used to determine how much past information is used to generate candidate states. If Rt is 0, all previous states are forgotten:

According to the score of reset gate, candidate states can be calculated:

update gate is used to determine how much past information is retained and how much new information is added:

Finally, the calculation formula of hidden layer state is determined by update gate, candidate state and previous state

Next, let's review the Attention principle:

Well, let's take a look at the model in the paper:

1,word encoder layer

First, the words in each sentence are embedded and converted into word vectors. Then, the words are input into the two-way GRU network. Combined with the context information, the hidden state output corresponding to the words is obtained

2,word attention layer

The purpose of attention mechanism is to find out the most important words in a sentence and give them a larger proportion.

First, the output of the word encoder step is input into a single-layer perceptron and the result is taken as its implicit representation

Then, in order to measure the importance of words, a randomly initialized context vector at the word level is defined, and its similarity with each word in the sentence is calculated. After a softmax operation, a normalized attention weight matrix is obtained, which represents the weight of the t word in sentence i

Therefore, the vector Si of a sentence can be regarded as the weighted sum of the vectors of words in the sentence. The word level context vector here is randomly initialized and can be learned in the process of training. We can regard it as a high-level representation of query: "which words in a sentence contain more important information?"

3,sentence encoder

Through the above steps, we get the vector representation of each sentence, and then we can get the document vector in a similar way

4,sentence attention

Similar to word level attention, the author proposes a sentence level context vector to measure the importance of a sentence in the whole text.

5,softmax

The above v vector is the final document representation we get, and then input a fully connected softmax layer for classification.

6. Model effect

code implementation

Definition model

class HAN(object):  
    def __init__(self, max_sentence_num, max_sentence_length, num_classes, vocab_size,  
                 embedding_size, learning_rate, decay_steps, decay_rate,  
                 hidden_size, l2_lambda, grad_clip, is_training=False,  
                 initializer=tf.random_normal_initializer(stddev=0.1)):  
        self.vocab_size = vocab_size  
        self.max_sentence_num = max_sentence_num  
        self.max_sentence_length = max_sentence_length  
        self.num_classes = num_classes  
        self.embedding_size = embedding_size  
        self.hidden_size = hidden_size  
        self.learning_rate = learning_rate  
        self.decay_rate = decay_rate  
        self.decay_steps = decay_steps  
        self.l2_lambda = l2_lambda  
        self.grad_clip = grad_clip  
        self.initializer = initializer  
        self.global_step = tf.Variable(0, trainable=False, name='global_step')  
        # placeholder  
        self.input_x = tf.placeholder(tf.int32, [None, max_sentence_num, max_sentence_length], name='input_x')  
        self.input_y = tf.placeholder(tf.int32, [None, num_classes], name='input_y')  
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')  
        if not is_training:  
            return  
        word_embedding = self.word2vec()  
        sen_vec = self.sen2vec(word_embedding)  
        doc_vec = self.doc2vec(sen_vec)  
        self.logits = self.inference(doc_vec)  
        self.loss_val = self.loss(self.input_y, self.logits)  
        self.train_op = self.train()  
        self.prediction = tf.argmax(self.logits, axis=1, name='prediction')  
        self.pred_min = tf.reduce_min(self.prediction)  
        self.pred_max = tf.reduce_max(self.prediction)  
        self.pred_cnt = tf.bincount(tf.cast(self.prediction, dtype=tf.int32))  
        self.label_cnt = tf.bincount(tf.cast(tf.argmax(self.input_y, axis=1), dtype=tf.int32))  
        self.accuracy = self.accuracy(self.logits, self.input_y)
  • END -

The End- http://bjbsair.com/2020-03-25/tech-info/6302/ Today, let's take a look at the effect of netred Attention. From ACL's paper, Hierarchical Attention Networks for Document Classification

**Overview of the paper
**

In recent years, in the field of NLP, it seems that the most popular is RNN, LSTM, GRU, attention and their variant combination framework. In this paper, the author analyzes the structure of the text, uses the structure of two-way GRU, and adjusts the attention: considering the attention of the word level and the attention of the sentence level, respectively, models the importance of words in sentences and sentences in documents. It's quite reasonable to think about it carefully. A document is made up of countless sentences, and a sentence is made up of countless words, fully considering the internal structure of the document.

The figure above is the overall framework of the text classification model in the paper. It can be seen that it is mainly divided into four parts:

  • word encoder (BiGRU layer)
  • word attention (Attention layer)
  • sentence encoder (BiGRU layer)
  • sentence attention (Attention layer)

Let's first review the principle of GRU:

GRU is a variant of RNN, which uses gate mechanism to record the status of current sequence. There are two types of gates in GRU: reset gate and update gate. The two gates control together to determine how much information the current state has to update.

reset gate is used to determine how much past information is used to generate candidate states. If Rt is 0, all previous states are forgotten:

According to the score of reset gate, candidate states can be calculated:

update gate is used to determine how much past information is retained and how much new information is added:

Finally, the calculation formula of hidden layer state is determined by update gate, candidate state and previous state

Next, let's review the Attention principle:

Well, let's take a look at the model in the paper:

1,word encoder layer

First, the words in each sentence are embedded and converted into word vectors. Then, the words are input into the two-way GRU network. Combined with the context information, the hidden state output corresponding to the words is obtained

2,word attention layer

The purpose of attention mechanism is to find out the most important words in a sentence and give them a larger proportion.

First, the output of the word encoder step is input into a single-layer perceptron and the result is taken as its implicit representation

Then, in order to measure the importance of words, a randomly initialized context vector at the word level is defined, and its similarity with each word in the sentence is calculated. After a softmax operation, a normalized attention weight matrix is obtained, which represents the weight of the t word in sentence i

Therefore, the vector Si of a sentence can be regarded as the weighted sum of the vectors of words in the sentence. The word level context vector here is randomly initialized and can be learned in the process of training. We can regard it as a high-level representation of query: "which words in a sentence contain more important information?"

3,sentence encoder

Through the above steps, we get the vector representation of each sentence, and then we can get the document vector in a similar way

4,sentence attention

Similar to word level attention, the author proposes a sentence level context vector to measure the importance of a sentence in the whole text.

5,softmax

The above v vector is the final document representation we get, and then input a fully connected softmax layer for classification.

6. Model effect

code implementation

Definition model

class HAN(object):  
    def __init__(self, max_sentence_num, max_sentence_length, num_classes, vocab_size,  
                 embedding_size, learning_rate, decay_steps, decay_rate,  
                 hidden_size, l2_lambda, grad_clip, is_training=False,  
                 initializer=tf.random_normal_initializer(stddev=0.1)):  
        self.vocab_size = vocab_size  
        self.max_sentence_num = max_sentence_num  
        self.max_sentence_length = max_sentence_length  
        self.num_classes = num_classes  
        self.embedding_size = embedding_size  
        self.hidden_size = hidden_size  
        self.learning_rate = learning_rate  
        self.decay_rate = decay_rate  
        self.decay_steps = decay_steps  
        self.l2_lambda = l2_lambda  
        self.grad_clip = grad_clip  
        self.initializer = initializer  
        self.global_step = tf.Variable(0, trainable=False, name='global_step')  
        # placeholder  
        self.input_x = tf.placeholder(tf.int32, [None, max_sentence_num, max_sentence_length], name='input_x')  
        self.input_y = tf.placeholder(tf.int32, [None, num_classes], name='input_y')  
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')  
        if not is_training:  
            return  
        word_embedding = self.word2vec()  
        sen_vec = self.sen2vec(word_embedding)  
        doc_vec = self.doc2vec(sen_vec)  
        self.logits = self.inference(doc_vec)  
        self.loss_val = self.loss(self.input_y, self.logits)  
        self.train_op = self.train()  
        self.prediction = tf.argmax(self.logits, axis=1, name='prediction')  
        self.pred_min = tf.reduce_min(self.prediction)  
        self.pred_max = tf.reduce_max(self.prediction)  
        self.pred_cnt = tf.bincount(tf.cast(self.prediction, dtype=tf.int32))  
        self.label_cnt = tf.bincount(tf.cast(tf.argmax(self.input_y, axis=1), dtype=tf.int32))  
        self.accuracy = self.accuracy(self.logits, self.input_y)
  • END -

The End- http://bjbsair.com/2020-03-25/tech-info/6302/ Today, let's take a look at the effect of netred Attention. From ACL's paper, Hierarchical Attention Networks for Document Classification

**Overview of the paper
**

In recent years, in the field of NLP, it seems that the most popular is RNN, LSTM, GRU, attention and their variant combination framework. In this paper, the author analyzes the structure of the text, uses the structure of two-way GRU, and adjusts the attention: considering the attention of the word level and the attention of the sentence level, respectively, models the importance of words in sentences and sentences in documents. It's quite reasonable to think about it carefully. A document is made up of countless sentences, and a sentence is made up of countless words, fully considering the internal structure of the document.

The figure above is the overall framework of the text classification model in the paper. It can be seen that it is mainly divided into four parts:

  • word encoder (BiGRU layer)
  • word attention (Attention layer)
  • sentence encoder (BiGRU layer)
  • sentence attention (Attention layer)

Let's first review the principle of GRU:

GRU is a variant of RNN, which uses gate mechanism to record the status of current sequence. There are two types of gates in GRU: reset gate and update gate. The two gates control together to determine how much information the current state has to update.

reset gate is used to determine how much past information is used to generate candidate states. If Rt is 0, all previous states are forgotten:

According to the score of reset gate, candidate states can be calculated:

update gate is used to determine how much past information is retained and how much new information is added:

Finally, the calculation formula of hidden layer state is determined by update gate, candidate state and previous state

Next, let's review the Attention principle:

Well, let's take a look at the model in the paper:

1,word encoder layer

First, the words in each sentence are embedded and converted into word vectors. Then, the words are input into the two-way GRU network. Combined with the context information, the hidden state output corresponding to the words is obtained

2,word attention layer

The purpose of attention mechanism is to find out the most important words in a sentence and give them a larger proportion.

First, the output of the word encoder step is input into a single-layer perceptron and the result is taken as its implicit representation

Then, in order to measure the importance of words, a randomly initialized context vector at the word level is defined, and its similarity with each word in the sentence is calculated. After a softmax operation, a normalized attention weight matrix is obtained, which represents the weight of the t word in sentence i

Therefore, the vector Si of a sentence can be regarded as the weighted sum of the vectors of words in the sentence. The word level context vector here is randomly initialized and can be learned in the process of training. We can regard it as a high-level representation of query: "which words in a sentence contain more important information?"

3,sentence encoder

Through the above steps, we get the vector representation of each sentence, and then we can get the document vector in a similar way

4,sentence attention

Similar to word level attention, the author proposes a sentence level context vector to measure the importance of a sentence in the whole text.

5,softmax

The above v vector is the final document representation we get, and then input a fully connected softmax layer for classification.

6. Model effect

code implementation

Definition model

class HAN(object):  
    def __init__(self, max_sentence_num, max_sentence_length, num_classes, vocab_size,  
                 embedding_size, learning_rate, decay_steps, decay_rate,  
                 hidden_size, l2_lambda, grad_clip, is_training=False,  
                 initializer=tf.random_normal_initializer(stddev=0.1)):  
        self.vocab_size = vocab_size  
        self.max_sentence_num = max_sentence_num  
        self.max_sentence_length = max_sentence_length  
        self.num_classes = num_classes  
        self.embedding_size = embedding_size  
        self.hidden_size = hidden_size  
        self.learning_rate = learning_rate  
        self.decay_rate = decay_rate  
        self.decay_steps = decay_steps  
        self.l2_lambda = l2_lambda  
        self.grad_clip = grad_clip  
        self.initializer = initializer  
        self.global_step = tf.Variable(0, trainable=False, name='global_step')  
        # placeholder  
        self.input_x = tf.placeholder(tf.int32, [None, max_sentence_num, max_sentence_length], name='input_x')  
        self.input_y = tf.placeholder(tf.int32, [None, num_classes], name='input_y')  
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')  
        if not is_training:  
            return  
        word_embedding = self.word2vec()  
        sen_vec = self.sen2vec(word_embedding)  
        doc_vec = self.doc2vec(sen_vec)  
        self.logits = self.inference(doc_vec)  
        self.loss_val = self.loss(self.input_y, self.logits)  
        self.train_op = self.train()  
        self.prediction = tf.argmax(self.logits, axis=1, name='prediction')  
        self.pred_min = tf.reduce_min(self.prediction)  
        self.pred_max = tf.reduce_max(self.prediction)  
        self.pred_cnt = tf.bincount(tf.cast(self.prediction, dtype=tf.int32))  
        self.label_cnt = tf.bincount(tf.cast(tf.argmax(self.input_y, axis=1), dtype=tf.int32))  
        self.accuracy = self.accuracy(self.logits, self.input_y)
  • END -

The End- http://bjbsair.com/2020-03-25/tech-info/6302/ Today, let's take a look at the effect of netred Attention. From ACL's paper, Hierarchical Attention Networks for Document Classification

**Overview of the paper
**

In recent years, in the field of NLP, it seems that the most popular is RNN, LSTM, GRU, attention and their variant combination framework. In this paper, the author analyzes the structure of the text, uses the structure of two-way GRU, and adjusts the attention: considering the attention of the word level and the attention of the sentence level, respectively, models the importance of words in sentences and sentences in documents. It's quite reasonable to think about it carefully. A document is made up of countless sentences, and a sentence is made up of countless words, fully considering the internal structure of the document.

The figure above is the overall framework of the text classification model in the paper. It can be seen that it is mainly divided into four parts:

  • word encoder (BiGRU layer)
  • word attention (Attention layer)
  • sentence encoder (BiGRU layer)
  • sentence attention (Attention layer)

Let's first review the principle of GRU:

GRU is a variant of RNN, which uses gate mechanism to record the status of current sequence. There are two types of gates in GRU: reset gate and update gate. The two gates control together to determine how much information the current state has to update.

reset gate is used to determine how much past information is used to generate candidate states. If Rt is 0, all previous states are forgotten:

According to the score of reset gate, candidate states can be calculated:

update gate is used to determine how much past information is retained and how much new information is added:

Finally, the calculation formula of hidden layer state is determined by update gate, candidate state and previous state

Next, let's review the Attention principle:

Well, let's take a look at the model in the paper:

1,word encoder layer

First, the words in each sentence are embedded and converted into word vectors. Then, the words are input into the two-way GRU network. Combined with the context information, the hidden state output corresponding to the words is obtained

2,word attention layer

The purpose of attention mechanism is to find out the most important words in a sentence and give them a larger proportion.

First, the output of the word encoder step is input into a single-layer perceptron and the result is taken as its implicit representation

Then, in order to measure the importance of words, a randomly initialized context vector at the word level is defined, and its similarity with each word in the sentence is calculated. After a softmax operation, a normalized attention weight matrix is obtained, which represents the weight of the t word in sentence i

Therefore, the vector Si of a sentence can be regarded as the weighted sum of the vectors of words in the sentence. The word level context vector here is randomly initialized and can be learned in the process of training. We can regard it as a high-level representation of query: "which words in a sentence contain more important information?"

3,sentence encoder

Through the above steps, we get the vector representation of each sentence, and then we can get the document vector in a similar way

4,sentence attention

Similar to word level attention, the author proposes a sentence level context vector to measure the importance of a sentence in the whole text.

5,softmax

The above v vector is the final document representation we get, and then input a fully connected softmax layer for classification.

6. Model effect

code implementation

Definition model

class HAN(object):  
    def __init__(self, max_sentence_num, max_sentence_length, num_classes, vocab_size,  
                 embedding_size, learning_rate, decay_steps, decay_rate,  
                 hidden_size, l2_lambda, grad_clip, is_training=False,  
                 initializer=tf.random_normal_initializer(stddev=0.1)):  
        self.vocab_size = vocab_size  
        self.max_sentence_num = max_sentence_num  
        self.max_sentence_length = max_sentence_length  
        self.num_classes = num_classes  
        self.embedding_size = embedding_size  
        self.hidden_size = hidden_size  
        self.learning_rate = learning_rate  
        self.decay_rate = decay_rate  
        self.decay_steps = decay_steps  
        self.l2_lambda = l2_lambda  
        self.grad_clip = grad_clip  
        self.initializer = initializer  
        self.global_step = tf.Variable(0, trainable=False, name='global_step')  
        # placeholder  
        self.input_x = tf.placeholder(tf.int32, [None, max_sentence_num, max_sentence_length], name='input_x')  
        self.input_y = tf.placeholder(tf.int32, [None, num_classes], name='input_y')  
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')  
        if not is_training:  
            return  
        word_embedding = self.word2vec()  
        sen_vec = self.sen2vec(word_embedding)  
        doc_vec = self.doc2vec(sen_vec)  
        self.logits = self.inference(doc_vec)  
        self.loss_val = self.loss(self.input_y, self.logits)  
        self.train_op = self.train()  
        self.prediction = tf.argmax(self.logits, axis=1, name='prediction')  
        self.pred_min = tf.reduce_min(self.prediction)  
        self.pred_max = tf.reduce_max(self.prediction)  
        self.pred_cnt = tf.bincount(tf.cast(self.prediction, dtype=tf.int32))  
        self.label_cnt = tf.bincount(tf.cast(tf.argmax(self.input_y, axis=1), dtype=tf.int32))  
        self.accuracy = self.accuracy(self.logits, self.input_y)
  • END -

The End- http://bjbsair.com/2020-03-25/tech-info/6302/ Today, let's take a look at the effect of netred Attention. From ACL's paper, Hierarchical Attention Networks for Document Classification

**Overview of the paper
**

In recent years, in the field of NLP, it seems that the most popular is RNN, LSTM, GRU, attention and their variant combination framework. In this paper, the author analyzes the structure of the text, uses the structure of two-way GRU, and adjusts the attention: considering the attention of the word level and the attention of the sentence level, respectively, models the importance of words in sentences and sentences in documents. It's quite reasonable to think about it carefully. A document is made up of countless sentences, and a sentence is made up of countless words, fully considering the internal structure of the document.

The figure above is the overall framework of the text classification model in the paper. It can be seen that it is mainly divided into four parts:

  • word encoder (BiGRU layer)
  • word attention (Attention layer)
  • sentence encoder (BiGRU layer)
  • sentence attention (Attention layer)

Let's first review the principle of GRU:

GRU is a variant of RNN, which uses gate mechanism to record the status of current sequence. There are two types of gates in GRU: reset gate and update gate. The two gates control together to determine how much information the current state has to update.

reset gate is used to determine how much past information is used to generate candidate states. If Rt is 0, all previous states are forgotten:

According to the score of reset gate, candidate states can be calculated:

update gate is used to determine how much past information is retained and how much new information is added:

Finally, the calculation formula of hidden layer state is determined by update gate, candidate state and previous state

Next, let's review the Attention principle:

Well, let's take a look at the model in the paper:

1,word encoder layer

First, the words in each sentence are embedded and converted into word vectors. Then, the words are input into the two-way GRU network. Combined with the context information, the hidden state output corresponding to the words is obtained

2,word attention layer

The purpose of attention mechanism is to find out the most important words in a sentence and give them a larger proportion.

First, the output of the word encoder step is input into a single-layer perceptron and the result is taken as its implicit representation

Then, in order to measure the importance of words, a randomly initialized context vector at the word level is defined, and its similarity with each word in the sentence is calculated. After a softmax operation, a normalized attention weight matrix is obtained, which represents the weight of the t word in sentence i

Therefore, the vector Si of a sentence can be regarded as the weighted sum of the vectors of words in the sentence. The word level context vector here is randomly initialized and can be learned in the process of training. We can regard it as a high-level representation of query: "which words in a sentence contain more important information?"

3,sentence encoder

Through the above steps, we get the vector representation of each sentence, and then we can get the document vector in a similar way

4,sentence attention

Similar to word level attention, the author proposes a sentence level context vector to measure the importance of a sentence in the whole text.

5,softmax

The above v vector is the final document representation we get, and then input a fully connected softmax layer for classification.

6. Model effect

code implementation

Definition model

class HAN(object):  
    def __init__(self, max_sentence_num, max_sentence_length, num_classes, vocab_size,  
                 embedding_size, learning_rate, decay_steps, decay_rate,  
                 hidden_size, l2_lambda, grad_clip, is_training=False,  
                 initializer=tf.random_normal_initializer(stddev=0.1)):  
        self.vocab_size = vocab_size  
        self.max_sentence_num = max_sentence_num  
        self.max_sentence_length = max_sentence_length  
        self.num_classes = num_classes  
        self.embedding_size = embedding_size  
        self.hidden_size = hidden_size  
        self.learning_rate = learning_rate  
        self.decay_rate = decay_rate  
        self.decay_steps = decay_steps  
        self.l2_lambda = l2_lambda  
        self.grad_clip = grad_clip  
        self.initializer = initializer  
        self.global_step = tf.Variable(0, trainable=False, name='global_step')  
        # placeholder  
        self.input_x = tf.placeholder(tf.int32, [None, max_sentence_num, max_sentence_length], name='input_x')  
        self.input_y = tf.placeholder(tf.int32, [None, num_classes], name='input_y')  
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')  
        if not is_training:  
            return  
        word_embedding = self.word2vec()  
        sen_vec = self.sen2vec(word_embedding)  
        doc_vec = self.doc2vec(sen_vec)  
        self.logits = self.inference(doc_vec)  
        self.loss_val = self.loss(self.input_y, self.logits)  
        self.train_op = self.train()  
        self.prediction = tf.argmax(self.logits, axis=1, name='prediction')  
        self.pred_min = tf.reduce_min(self.prediction)  
        self.pred_max = tf.reduce_max(self.prediction)  
        self.pred_cnt = tf.bincount(tf.cast(self.prediction, dtype=tf.int32))  
        self.label_cnt = tf.bincount(tf.cast(tf.argmax(self.input_y, axis=1), dtype=tf.int32))  
        self.accuracy = self.accuracy(self.logits, self.input_y)
  • END -

The End- http://bjbsair.com/2020-03-25/tech-info/6302/ Today, let's take a look at the effect of netred Attention. From ACL's paper, Hierarchical Attention Networks for Document Classification

**Overview of the paper
**

In recent years, in the field of NLP, it seems that the most popular is RNN, LSTM, GRU, attention and their variant combination framework. In this paper, the author analyzes the structure of the text, uses the structure of two-way GRU, and adjusts the attention: considering the attention of the word level and the attention of the sentence level, respectively, models the importance of words in sentences and sentences in documents. It's quite reasonable to think about it carefully. A document is made up of countless sentences, and a sentence is made up of countless words, fully considering the internal structure of the document.

The figure above is the overall framework of the text classification model in the paper. It can be seen that it is mainly divided into four parts:

  • word encoder (BiGRU layer)
  • word attention (Attention layer)
  • sentence encoder (BiGRU layer)
  • sentence attention (Attention layer)

Let's first review the principle of GRU:

GRU is a variant of RNN, which uses gate mechanism to record the status of current sequence. There are two types of gates in GRU: reset gate and update gate. The two gates control together to determine how much information the current state has to update.

reset gate is used to determine how much past information is used to generate candidate states. If Rt is 0, all previous states are forgotten:

According to the score of reset gate, candidate states can be calculated:

update gate is used to determine how much past information is retained and how much new information is added:

Finally, the calculation formula of hidden layer state is determined by update gate, candidate state and previous state

Next, let's review the Attention principle:

Well, let's take a look at the model in the paper:

1,word encoder layer

First, the words in each sentence are embedded and converted into word vectors. Then, the words are input into the two-way GRU network. Combined with the context information, the hidden state output corresponding to the words is obtained

2,word attention layer

The purpose of attention mechanism is to find out the most important words in a sentence and give them a larger proportion.

First, the output of the word encoder step is input into a single-layer perceptron and the result is taken as its implicit representation

Then, in order to measure the importance of words, a randomly initialized context vector at the word level is defined, and its similarity with each word in the sentence is calculated. After a softmax operation, a normalized attention weight matrix is obtained, which represents the weight of the t word in sentence i

Therefore, the vector Si of a sentence can be regarded as the weighted sum of the vectors of words in the sentence. The word level context vector here is randomly initialized and can be learned in the process of training. We can regard it as a high-level representation of query: "which words in a sentence contain more important information?"

3,sentence encoder

Through the above steps, we get the vector representation of each sentence, and then we can get the document vector in a similar way

4,sentence attention

Similar to word level attention, the author proposes a sentence level context vector to measure the importance of a sentence in the whole text.

5,softmax

The above v vector is the final document representation we get, and then input a fully connected softmax layer for classification.

6. Model effect

code implementation

Definition model

class HAN(object):  
    def __init__(self, max_sentence_num, max_sentence_length, num_classes, vocab_size,  
                 embedding_size, learning_rate, decay_steps, decay_rate,  
                 hidden_size, l2_lambda, grad_clip, is_training=False,  
                 initializer=tf.random_normal_initializer(stddev=0.1)):  
        self.vocab_size = vocab_size  
        self.max_sentence_num = max_sentence_num  
        self.max_sentence_length = max_sentence_length  
        self.num_classes = num_classes  
        self.embedding_size = embedding_size  
        self.hidden_size = hidden_size  
        self.learning_rate = learning_rate  
        self.decay_rate = decay_rate  
        self.decay_steps = decay_steps  
        self.l2_lambda = l2_lambda  
        self.grad_clip = grad_clip  
        self.initializer = initializer  
        self.global_step = tf.Variable(0, trainable=False, name='global_step')  
        # placeholder  
        self.input_x = tf.placeholder(tf.int32, [None, max_sentence_num, max_sentence_length], name='input_x')  
        self.input_y = tf.placeholder(tf.int32, [None, num_classes], name='input_y')  
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')  
        if not is_training:  
            return  
        word_embedding = self.word2vec()  
        sen_vec = self.sen2vec(word_embedding)  
        doc_vec = self.doc2vec(sen_vec)  
        self.logits = self.inference(doc_vec)  
        self.loss_val = self.loss(self.input_y, self.logits)  
        self.train_op = self.train()  
        self.prediction = tf.argmax(self.logits, axis=1, name='prediction')  
        self.pred_min = tf.reduce_min(self.prediction)  
        self.pred_max = tf.reduce_max(self.prediction)  
        self.pred_cnt = tf.bincount(tf.cast(self.prediction, dtype=tf.int32))  
        self.label_cnt = tf.bincount(tf.cast(tf.argmax(self.input_y, axis=1), dtype=tf.int32))  
        self.accuracy = self.accuracy(self.logits, self.input_y)
  • END -

The End- http://bjbsair.com/2020-03-25/tech-info/6302/ Today, let's take a look at the effect of netred Attention. From ACL's paper, Hierarchical Attention Networks for Document Classification

**Overview of the paper
**

In recent years, in the field of NLP, it seems that the most popular is RNN, LSTM, GRU, attention and their variant combination framework. In this paper, the author analyzes the structure of the text, uses the structure of two-way GRU, and adjusts the attention: considering the attention of the word level and the attention of the sentence level, respectively, models the importance of words in sentences and sentences in documents. It's quite reasonable to think about it carefully. A document is made up of countless sentences, and a sentence is made up of countless words, fully considering the internal structure of the document.

The figure above is the overall framework of the text classification model in the paper. It can be seen that it is mainly divided into four parts:

  • word encoder (BiGRU layer)
  • word attention (Attention layer)
  • sentence encoder (BiGRU layer)
  • sentence attention (Attention layer)

Let's first review the principle of GRU:

GRU is a variant of RNN, which uses gate mechanism to record the status of current sequence. There are two types of gates in GRU: reset gate and update gate. The two gates control together to determine how much information the current state has to update.

reset gate is used to determine how much past information is used to generate candidate states. If Rt is 0, all previous states are forgotten:

According to the score of reset gate, candidate states can be calculated:

update gate is used to determine how much past information is retained and how much new information is added:

Finally, the calculation formula of hidden layer state is determined by update gate, candidate state and previous state

Next, let's review the Attention principle:

Well, let's take a look at the model in the paper:

1,word encoder layer

First, the words in each sentence are embedded and converted into word vectors. Then, the words are input into the two-way GRU network. Combined with the context information, the hidden state output corresponding to the words is obtained

2,word attention layer

The purpose of attention mechanism is to find out the most important words in a sentence and give them a larger proportion.

First, the output of the word encoder step is input into a single-layer perceptron and the result is taken as its implicit representation

Then, in order to measure the importance of words, a randomly initialized context vector at the word level is defined, and its similarity with each word in the sentence is calculated. After a softmax operation, a normalized attention weight matrix is obtained, which represents the weight of the t word in sentence i

Therefore, the vector Si of a sentence can be regarded as the weighted sum of the vectors of words in the sentence. The word level context vector here is randomly initialized and can be learned in the process of training. We can regard it as a high-level representation of query: "which words in a sentence contain more important information?"

3,sentence encoder

Through the above steps, we get the vector representation of each sentence, and then we can get the document vector in a similar way

4,sentence attention

Similar to word level attention, the author proposes a sentence level context vector to measure the importance of a sentence in the whole text.

5,softmax

The above v vector is the final document representation we get, and then input a fully connected softmax layer for classification.

6. Model effect

code implementation

Definition model

class HAN(object):  
    def __init__(self, max_sentence_num, max_sentence_length, num_classes, vocab_size,  
                 embedding_size, learning_rate, decay_steps, decay_rate,  
                 hidden_size, l2_lambda, grad_clip, is_training=False,  
                 initializer=tf.random_normal_initializer(stddev=0.1)):  
        self.vocab_size = vocab_size  
        self.max_sentence_num = max_sentence_num  
        self.max_sentence_length = max_sentence_length  
        self.num_classes = num_classes  
        self.embedding_size = embedding_size  
        self.hidden_size = hidden_size  
        self.learning_rate = learning_rate  
        self.decay_rate = decay_rate  
        self.decay_steps = decay_steps  
        self.l2_lambda = l2_lambda  
        self.grad_clip = grad_clip  
        self.initializer = initializer  
        self.global_step = tf.Variable(0, trainable=False, name='global_step')  
        # placeholder  
        self.input_x = tf.placeholder(tf.int32, [None, max_sentence_num, max_sentence_length], name='input_x')  
        self.input_y = tf.placeholder(tf.int32, [None, num_classes], name='input_y')  
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')  
        if not is_training:  
            return  
        word_embedding = self.word2vec()  
        sen_vec = self.sen2vec(word_embedding)  
        doc_vec = self.doc2vec(sen_vec)  
        self.logits = self.inference(doc_vec)  
        self.loss_val = self.loss(self.input_y, self.logits)  
        self.train_op = self.train()  
        self.prediction = tf.argmax(self.logits, axis=1, name='prediction')  
        self.pred_min = tf.reduce_min(self.prediction)  
        self.pred_max = tf.reduce_max(self.prediction)  
        self.pred_cnt = tf.bincount(tf.cast(self.prediction, dtype=tf.int32))  
        self.label_cnt = tf.bincount(tf.cast(tf.argmax(self.input_y, axis=1), dtype=tf.int32))  
        self.accuracy = self.accuracy(self.logits, self.input_y)
  • END -

The End- http://bjbsair.com/2020-03-25/tech-info/6302/ Today, let's take a look at the effect of netred Attention. From ACL's paper, Hierarchical Attention Networks for Document Classification

**Overview of the paper
**

In recent years, in the field of NLP, it seems that the most popular is RNN, LSTM, GRU, attention and their variant combination framework. In this paper, the author analyzes the structure of the text, uses the structure of two-way GRU, and adjusts the attention: considering the attention of the word level and the attention of the sentence level, respectively, models the importance of words in sentences and sentences in documents. It's quite reasonable to think about it carefully. A document is made up of countless sentences, and a sentence is made up of countless words, fully considering the internal structure of the document.

The figure above is the overall framework of the text classification model in the paper. It can be seen that it is mainly divided into four parts:

  • word encoder (BiGRU layer)
  • word attention (Attention layer)
  • sentence encoder (BiGRU layer)
  • sentence attention (Attention layer)

Let's first review the principle of GRU:

GRU is a variant of RNN, which uses gate mechanism to record the status of current sequence. There are two types of gates in GRU: reset gate and update gate. The two gates control together to determine how much information the current state has to update.

reset gate is used to determine how much past information is used to generate candidate states. If Rt is 0, all previous states are forgotten:

According to the score of reset gate, candidate states can be calculated:

update gate is used to determine how much past information is retained and how much new information is added:

Finally, the calculation formula of hidden layer state is determined by update gate, candidate state and previous state

Next, let's review the Attention principle:

Well, let's take a look at the model in the paper:

1,word encoder layer

First, the words in each sentence are embedded and converted into word vectors. Then, the words are input into the two-way GRU network. Combined with the context information, the hidden state output corresponding to the words is obtained

2,word attention layer

The purpose of attention mechanism is to find out the most important words in a sentence and give them a larger proportion.

First, the output of the word encoder step is input into a single-layer perceptron and the result is taken as its implicit representation

Then, in order to measure the importance of words, a randomly initialized context vector at the word level is defined, and its similarity with each word in the sentence is calculated. After a softmax operation, a normalized attention weight matrix is obtained, which represents the weight of the t word in sentence i

Therefore, the vector Si of a sentence can be regarded as the weighted sum of the vectors of words in the sentence. The word level context vector here is randomly initialized and can be learned in the process of training. We can regard it as a high-level representation of query: "which words in a sentence contain more important information?"

3,sentence encoder

Through the above steps, we get the vector representation of each sentence, and then we can get the document vector in a similar way

4,sentence attention

Similar to word level attention, the author proposes a sentence level context vector to measure the importance of a sentence in the whole text.

5,softmax

The above v vector is the final document representation we get, and then input a fully connected softmax layer for classification.

6. Model effect

code implementation

Definition model

class HAN(object):  
    def __init__(self, max_sentence_num, max_sentence_length, num_classes, vocab_size,  
                 embedding_size, learning_rate, decay_steps, decay_rate,  
                 hidden_size, l2_lambda, grad_clip, is_training=False,  
                 initializer=tf.random_normal_initializer(stddev=0.1)):  
        self.vocab_size = vocab_size  
        self.max_sentence_num = max_sentence_num  
        self.max_sentence_length = max_sentence_length  
        self.num_classes = num_classes  
        self.embedding_size = embedding_size  
        self.hidden_size = hidden_size  
        self.learning_rate = learning_rate  
        self.decay_rate = decay_rate  
        self.decay_steps = decay_steps  
        self.l2_lambda = l2_lambda  
        self.grad_clip = grad_clip  
        self.initializer = initializer  
        self.global_step = tf.Variable(0, trainable=False, name='global_step')  
        # placeholder  
        self.input_x = tf.placeholder(tf.int32, [None, max_sentence_num, max_sentence_length], name='input_x')  
        self.input_y = tf.placeholder(tf.int32, [None, num_classes], name='input_y')  
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')  
        if not is_training:  
            return  
        word_embedding = self.word2vec()  
        sen_vec = self.sen2vec(word_embedding)  
        doc_vec = self.doc2vec(sen_vec)  
        self.logits = self.inference(doc_vec)  
        self.loss_val = self.loss(self.input_y, self.logits)  
        self.train_op = self.train()  
        self.prediction = tf.argmax(self.logits, axis=1, name='prediction')  
        self.pred_min = tf.reduce_min(self.prediction)  
        self.pred_max = tf.reduce_max(self.prediction)  
        self.pred_cnt = tf.bincount(tf.cast(self.prediction, dtype=tf.int32))  
        self.label_cnt = tf.bincount(tf.cast(tf.argmax(self.input_y, axis=1), dtype=tf.int32))  
        self.accuracy = self.accuracy(self.logits, self.input_y)
  • END -

The End- http://bjbsair.com/2020-03-25/tech-info/6302/ Today, let's take a look at the effect of netred Attention. From ACL's paper, Hierarchical Attention Networks for Document Classification

**Overview of the paper
**

In recent years, in the field of NLP, it seems that the most popular is RNN, LSTM, GRU, attention and their variant combination framework. In this paper, the author analyzes the structure of the text, uses the structure of two-way GRU, and adjusts the attention: considering the attention of the word level and the attention of the sentence level, respectively, models the importance of words in sentences and sentences in documents. It's quite reasonable to think about it carefully. A document is made up of countless sentences, and a sentence is made up of countless words, fully considering the internal structure of the document.

The figure above is the overall framework of the text classification model in the paper. It can be seen that it is mainly divided into four parts:

  • word encoder (BiGRU layer)
  • word attention (Attention layer)
  • sentence encoder (BiGRU layer)
  • sentence attention (Attention layer)

Let's first review the principle of GRU:

GRU is a variant of RNN, which uses gate mechanism to record the status of current sequence. There are two types of gates in GRU: reset gate and update gate. The two gates control together to determine how much information the current state has to update.

reset gate is used to determine how much past information is used to generate candidate states. If Rt is 0, all previous states are forgotten:

According to the score of reset gate, candidate states can be calculated:

update gate is used to determine how much past information is retained and how much new information is added:

Finally, the calculation formula of hidden layer state is determined by update gate, candidate state and previous state

Next, let's review the Attention principle:

Well, let's take a look at the model in the paper:

1,word encoder layer

First, the words in each sentence are embedded and converted into word vectors. Then, the words are input into the two-way GRU network. Combined with the context information, the hidden state output corresponding to the words is obtained

2,word attention layer

The purpose of attention mechanism is to find out the most important words in a sentence and give them a larger proportion.

First, the output of the word encoder step is input into a single-layer perceptron and the result is taken as its implicit representation

Then, in order to measure the importance of words, a randomly initialized context vector at the word level is defined, and its similarity with each word in the sentence is calculated. After a softmax operation, a normalized attention weight matrix is obtained, which represents the weight of the t word in sentence i

Therefore, the vector Si of a sentence can be regarded as the weighted sum of the vectors of words in the sentence. The word level context vector here is randomly initialized and can be learned in the process of training. We can regard it as a high-level representation of query: "which words in a sentence contain more important information?"

3,sentence encoder

Through the above steps, we get the vector representation of each sentence, and then we can get the document vector in a similar way

4,sentence attention

Similar to word level attention, the author proposes a sentence level context vector to measure the importance of a sentence in the whole text.

5,softmax

The above v vector is the final document representation we get, and then input a fully connected softmax layer for classification.

6. Model effect

code implementation

Definition model

# TensorFlow 1.x implementation of the HAN model definition.
import tensorflow as tf

class HAN(object):
    def __init__(self, max_sentence_num, max_sentence_length, num_classes, vocab_size,
                 embedding_size, learning_rate, decay_steps, decay_rate,
                 hidden_size, l2_lambda, grad_clip, is_training=False,
                 initializer=tf.random_normal_initializer(stddev=0.1)):
        # Hyperparameters and model configuration.
        self.vocab_size = vocab_size
        self.max_sentence_num = max_sentence_num
        self.max_sentence_length = max_sentence_length
        self.num_classes = num_classes
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        self.decay_steps = decay_steps
        self.l2_lambda = l2_lambda
        self.grad_clip = grad_clip
        self.initializer = initializer
        self.global_step = tf.Variable(0, trainable=False, name='global_step')

        # Placeholders: each document is padded to a fixed number of sentences,
        # and each sentence to a fixed number of words.
        self.input_x = tf.placeholder(tf.int32, [None, max_sentence_num, max_sentence_length], name='input_x')
        self.input_y = tf.placeholder(tf.int32, [None, num_classes], name='input_y')
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')

        if not is_training:
            return

        # Build the graph: word embeddings -> sentence vectors -> document vector -> logits.
        word_embedding = self.word2vec()
        sen_vec = self.sen2vec(word_embedding)
        doc_vec = self.doc2vec(sen_vec)
        self.logits = self.inference(doc_vec)
        self.loss_val = self.loss(self.input_y, self.logits)
        self.train_op = self.train()

        # Predictions and simple monitoring statistics.
        self.prediction = tf.argmax(self.logits, axis=1, name='prediction')
        self.pred_min = tf.reduce_min(self.prediction)
        self.pred_max = tf.reduce_max(self.prediction)
        self.pred_cnt = tf.bincount(tf.cast(self.prediction, dtype=tf.int32))
        self.label_cnt = tf.bincount(tf.cast(tf.argmax(self.input_y, axis=1), dtype=tf.int32))
        # Note: this assignment replaces the accuracy() method on the instance with its result tensor.
        self.accuracy = self.accuracy(self.logits, self.input_y)

The End
