
# 1, Attention mechanism

In an ordinary RNN-based Encoder-Decoder, the Encoder must compress the whole sentence into a single vector before the Decoder can use it, which requires the Encoder to pack all the information of the source sentence into that vector. When the sentence is too long, this is hard to achieve and becomes a bottleneck (for example, when the input is as long as an article). Of course, we could use deeper RNNs with more units to work around this, but that is costly. So is there a way to optimize the existing RNN structure?

Therefore, Bahdanau et al. proposed the attention mechanism in 2015, often referred to as an attention-based model. It works the way we look at a painting: we can quickly name the main subject while ignoring the background, because we pay more attention to the main content.

Applied to our RNN: instead of feeding all of the encoder's information (obtained through LSTM or GRU) into every time step of the Decoder, we focus only on the key parts of that information at each step. This is exactly what the attention mechanism gives us, and it solves the bottleneck described in the first paragraph.

## 1.1 implementation of attention mechanism

Here we consider an example of Text Translation:

Text translation can be realized through Seq2Seq we learned in the previous section.

The left side of the figure above is the Encoder, which produces the corresponding hidden_state.

The Decoder is on the right; the content in the blue box is teacher forcing, which can accelerate the training of the model.

1. Unlike the plain Seq2Seq model, Attention does not take the last output of the encoder as the initial hidden state of the decoder; instead, it randomly initializes a decoder hidden state $z_0$.

2. Then match $z_0$ against the encoder output of each time step to obtain the corresponding score $a$.

The match operation can be any of the following:

a. the cosine similarity of $z$ and $h$;

b. a small neural network that takes $z$ and $h$ as input;

c. $h^T W z$.

3. After computing the score of each encoder time step against $z$, apply a softmax to the results to obtain the attention weights.

4. Use the softmax results as weights to compute a weighted sum over the original encoder outputs $h$.

Namely:

$c^0 = \sum_i \hat{\alpha}_0^i h^i$

5. After obtaining $c^0$, feed it into the decoder together with the initialized $z_0$ to obtain the decoder's output at the first time step along with its hidden state $z_1$.

6. Match $z_1$ against the output of every encoder time step; after the softmax, use the results as weights to sum the encoder outputs of each time step and obtain $c_1$.

7. Feed $c_1$ and $z_1$ into the decoder to get the output of the next time step, and repeat this cycle until the decoder outputs the end-of-sequence token.
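The steps above can be sketched as a single decoder step. This is a minimal illustration with random tensors, using a dot-product match for simplicity; the variable names are illustrative, not from any library:

```python
import torch
import torch.nn.functional as F

seq_len, batch_size, hidden_size = 5, 2, 8

# outputs h^i of the encoder at every time step
encoder_outputs = torch.randn(seq_len, batch_size, hidden_size)
# step 1: randomly initialized decoder hidden state z0
z0 = torch.randn(batch_size, hidden_size)

# step 2: match z0 against every encoder output (dot product here)
scores = torch.einsum("sbh,bh->sb", encoder_outputs, z0)

# step 3: softmax over the time-step dimension gives the attention weights
alpha = F.softmax(scores, dim=0)  # [seq_len, batch_size]

# step 4: weighted sum of the encoder outputs gives c0
c0 = (alpha.unsqueeze(-1) * encoder_outputs).sum(dim=0)  # [batch_size, hidden_size]

print(c0.shape)  # torch.Size([2, 8])
```

Steps 5-7 then repeat the same match/softmax/weighted-sum with $z_1, z_2, \dots$ in place of $z_0$.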


## 1.2 introduction of different Attention

### 1.2.1 soft attention and hard attention

1. Soft Attention

The attention mechanism originally proposed by Bahdanau et al. is usually called soft attention. "Soft" means that every word input to the encoder receives an attention probability.

2. Hard Attention

Hard attention was proposed for image captioning: the idea is to pick, directly from the input, the single position that corresponds to an output word. In NLP, however, words are usually related to the words around them, so we rarely want to focus on one word alone; we therefore use soft attention and will not cover hard attention further here.


### 1.2.2 global attention and Local Attention

The distinction between global and local attention was introduced by Luong et al.; the original attention of Bahdanau et al. attends to the entire input, i.e. it is a form of global attention.

Global attention computes attention weights over all of the inputs on the encoder side.

Local attention computes weights over only part of the encoder's inputs (a window of encoder hidden states around the current time step), which reduces the amount of computation, especially when sentences are long.

### 1.2.3 difference between Bahdanau Attention and Luong Attention

Difference 1: the data used for, and the location of, the attention computation

Bahdanau Attention uses the *previous* hidden state to compute the attention weights, so in code the attention operation happens before the GRU, and the attention result is concatenated with the word-embedding result as the input of the GRU (see the PyTorch tutorial). Bahdanau uses a bidirectional GRU and takes the concatenation of the forward and backward encoder outputs as the encoder output, as shown in the figure below.

Luong Attention uses the *current* decoder output to compute the attention weights, so in code the attention operation happens after the GRU, and the context vector is concatenated with the GRU result to produce the final output. Luong uses a multi-layer GRU, and only the output of the last layer is used as the encoder output.

Difference 2: different methods for calculating attention weights

1. The match function of Bahdanau Attention:

$a_i^j = v_a^T \tanh(W_a Z_{i-1} + U_a h_j)$

After computing all the $a_i^j$, apply a softmax to obtain $\hat{a}_i^j$, i.e.

$\hat{a}_i^j = \frac{\exp(a_i^j)}{\sum \exp(a_i^j)}$

where:

- $v_a^T$ is a parameter matrix that needs to be trained, and $W_a$ realizes the shape transformation of $Z_{i-1}$;
- $U_a$ realizes the shape transformation of $h_j$ (a matrix multiplication, understood as a linear map, that aligns the data shapes);
- $Z_{i-1}$ is the previous hidden state of the decoder, and $h_j$ is the output of the encoder.
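The Bahdanau match function above can be sketched in a few lines of PyTorch. This is a minimal illustration with random tensors; the variable names (`z_prev`, `encoder_outputs`) are illustrative, not part of any library:

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden_size = 2, 5, 8

W_a = nn.Linear(hidden_size, hidden_size, bias=False)  # shapes Z_{i-1}
U_a = nn.Linear(hidden_size, hidden_size, bias=False)  # shapes h_j
v_a = nn.Linear(hidden_size, 1, bias=False)            # trainable v_a^T

z_prev = torch.randn(batch_size, hidden_size)                    # Z_{i-1}
encoder_outputs = torch.randn(batch_size, seq_len, hidden_size)  # h_j

# a_i^j = v_a^T tanh(W_a Z_{i-1} + U_a h_j), broadcast over the seq_len dim
scores = v_a(torch.tanh(W_a(z_prev).unsqueeze(1) + U_a(encoder_outputs)))
# softmax over the encoder time steps gives \hat{a}_i^j
alpha = torch.softmax(scores.squeeze(-1), dim=-1)  # [batch_size, seq_len]

print(alpha.shape)  # torch.Size([2, 5])
```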

2. Luong Attention is simpler overall than Bahdanau Attention. It uses three methods to compute the weights:

- general
  - First apply a matrix transformation (a linear layer) to the hidden state of the decoder, then matrix-multiply the result with the encoder outputs.

- dot
  - Directly matrix-multiply the hidden state of the decoder with the encoder outputs.

- concat
  - Concatenate the hidden state of the decoder with the encoder outputs, pass the result through a linear layer and tanh, then matrix-multiply with a parameter vector.
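Written as formulas (following the notation of Luong et al., where $h_t$ is the current decoder hidden state and $\bar{h}_s$ an encoder output), the three scoring methods are:

$$\mathrm{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top \bar{h}_s & \text{dot} \\ h_t^\top W_a \bar{h}_s & \text{general} \\ v_a^\top \tanh\left(W_a [h_t; \bar{h}_s]\right) & \text{concat} \end{cases}$$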

In the end, the results of the two attentions do not differ much, so we can use Luong attention to complete the code below.

# 2, Code implementation of attention

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LuoAttention(nn.Module):
    def __init__(self, hidden_size, method="general"):
        super(LuoAttention, self).__init__()
        assert method in ["dot", "general", "concat"]
        self.method = method
        self.wa = nn.Linear(hidden_size, hidden_size, bias=False)
        self.wa_cat = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        self.va = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_state, encoder_outputs):
        """
        :param hidden_state: [num_layers, batch_size, hidden_size]
        :param encoder_outputs: [seq_len, batch_size, hidden_size]
        :return: atten_weights [seq_len, batch_size]
        """
        # keep only the last layer's hidden state and reshape it to
        # [batch_size, hidden_size, 1] so it can be used with bmm
        hidden_state = hidden_state[-1, :, :].unsqueeze(-1)
        # encoder_outputs: [batch_size, seq_len, hidden_size]
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        if self.method == "dot":
            atten_weights = self._dot(hidden_state, encoder_outputs)
        elif self.method == "general":
            atten_weights = self._general(hidden_state, encoder_outputs)
        else:
            atten_weights = self._concat(hidden_state, encoder_outputs)
        # atten_weights: [seq_len, batch_size]
        atten_weights = atten_weights.permute(1, 0)
        return atten_weights

    def _dot(self, hidden_state, encoder_outputs):
        # atten_temp: [batch_size, seq_len, 1]
        atten_temp = encoder_outputs.bmm(hidden_state)
        # atten_weights: [batch_size, seq_len]
        atten_weights = F.softmax(atten_temp.squeeze(-1), dim=-1)
        return atten_weights

    def _general(self, hidden_state, encoder_outputs):
        # encoder_outputs: [batch_size, seq_len, hidden_size]
        # hidden_state:    [batch_size, hidden_size, 1]
        encoder_outputs = self.wa(encoder_outputs)
        # atten_temp: [batch_size, seq_len, 1]
        atten_temp = encoder_outputs.bmm(hidden_state)
        # atten_weights: [batch_size, seq_len]
        atten_weights = F.softmax(atten_temp.squeeze(-1), dim=-1)
        return atten_weights

    def _concat(self, hidden_state, encoder_outputs):
        # hidden_state: [batch_size, 1, hidden_size]
        hidden_state = hidden_state.squeeze(-1).unsqueeze(1)
        # repeat along the seq_len dimension: [batch_size, seq_len, hidden_size]
        hidden_state = hidden_state.repeat(1, encoder_outputs.size(1), 1)
        cated = torch.cat([hidden_state, encoder_outputs], dim=-1)
        # concat: [batch_size, seq_len, hidden_size]
        concat = torch.tanh(self.wa_cat(cated))
        # atten_temp: [batch_size, seq_len, 1]
        atten_temp = self.va(concat)
        atten_weights = F.softmax(atten_temp.squeeze(-1), dim=-1)
        return atten_weights
```
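A quick shape check of the attention module's contract. To keep this snippet self-contained it uses a compressed stand-in mirroring the `general` branch above; `GeneralAttention` is a hypothetical name for this sketch, not part of the code above or of PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Compressed stand-in for the "general" scoring branch, for illustration only.
class GeneralAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.wa = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, hidden_state, encoder_outputs):
        # hidden_state: [num_layers, batch_size, hidden_size]
        # encoder_outputs: [seq_len, batch_size, hidden_size]
        query = hidden_state[-1].unsqueeze(-1)            # [batch, hidden, 1]
        keys = self.wa(encoder_outputs.permute(1, 0, 2))  # [batch, seq, hidden]
        weights = F.softmax(keys.bmm(query).squeeze(-1), dim=-1)  # [batch, seq]
        return weights.permute(1, 0)                      # [seq, batch]

atten = GeneralAttention(hidden_size=8)
hidden = torch.randn(2, 3, 8)   # num_layers=2, batch_size=3, hidden_size=8
enc_out = torch.randn(5, 3, 8)  # seq_len=5
w = atten(hidden, enc_out)
print(w.shape)  # torch.Size([5, 3])
```

For each batch element, the weights over the five encoder time steps sum to 1.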

Code changes in decoder:

```python
def forward_step(self, decoder_input, decoder_hidden, encoder_outputs):
    # embedded: [batch_size, seq_len, embedding_dim]
    embedded = self.embedding(decoder_input)
    # embedded: [seq_len, batch_size, embedding_dim]
    embedded = torch.transpose(embedded, 0, 1)
    # out: [1, batch_size, hidden_size]
    out, decoder_hidden = self.gru(embedded, decoder_hidden)
    # atten_weights: [seq_len, batch_size]
    atten_weights = self.atten(decoder_hidden, encoder_outputs)
    # atten_weights: [batch_size, 1, seq_len]
    atten_weights = atten_weights.permute(1, 0).unsqueeze(1)
    # encoder_outputs: [batch_size, seq_len, hidden_size]
    encoder_outputs = encoder_outputs.permute(1, 0, 2)
    # context: [batch_size, 1, hidden_size] -> [batch_size, hidden_size]
    context = atten_weights.bmm(encoder_outputs)
    context = context.squeeze(1)
    # out: [batch_size, hidden_size]
    out = out.squeeze(0)
    concat_input = torch.cat((out, context), dim=1)
    # concat_output: [batch_size, hidden_size]
    concat_output = torch.tanh(self.concat(concat_input))
    out = self.fc(concat_output)
    # output: [batch_size, vocab_size]
    output = F.log_softmax(out, dim=-1)
    return output, decoder_hidden, atten_weights
```