Attention and its PyTorch code implementation

The basic assumption of an RNN-based Seq2Seq model is that the last hidden state (a single vector) of the source sequence contains all of the sequence's information. (This is obviously unreasonable for long sequences.)

Seq2Seq problem: limited ability to remember long sequences.

Solution: when generating a target-language word, consider not only the previous decoder state and the words already generated, but also which words in the source sentence are most relevant to the word being generated, i.e., which source words deserve more attention. This approach is called the attention mechanism.

Attention

In 2015, Luong et al. published Effective Approaches to Attention-based Neural Machine Translation, which proposed attention-based approaches for neural machine translation. With attention, the Seq2Seq model greatly improves the quality of machine translation.

The reason is that the attention mechanism lets the Seq2Seq model distinguish among, and focus on, different parts of the input sequence.

Example:

  1. Suppose the model has generated the word "I" and now needs to generate the next word;
  2. The next word obviously has the strongest relationship with "love" in the source sentence, so the state corresponding to "love" is multiplied by a larger weight, e.g. 0.6, while the other words get smaller weights;
  3. Finally, the states of all words in the source sentence are weighted and summed, and the result is used as an additional input when computing the new decoder state.
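To make the weighted sum concrete, here is a tiny sketch with made-up numbers (the state vectors and the 0.6 weight are purely illustrative):

import torch

# Three made-up encoder states, one per source word; the second one ("love")
# is assumed to be the most relevant to the word being generated
states = torch.tensor([[0.1, 0.2],
                       [0.5, 0.4],
                       [0.3, 0.1]])

# Hypothetical attention weights: the relevant word gets 0.6, the others less
weights = torch.tensor([0.3, 0.6, 0.1])

# Weighted sum of the states, used as an extra input for the state update
context = (weights.unsqueeze(1) * states).sum(dim=0)  # shape [2]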

Seq2Seq model combined with the Attention mechanism

  • Combined with attention, the Seq2Seq decoder looks at all of the encoder's states every time it updates its own state (so the decoder knows where to focus)

  • After the Encoder finishes its work, Attention and the Decoder start working together

  • Attention can be understood simply as an effective weighted-summation technique; the key point is how to obtain the weights.

  • Calculate the weights: $\alpha_i = \mathrm{align}(h_i, s_0)$
    (this measures the correlation between $h_i$ and $s_0$)
    ($h_i$ is the $i$-th hidden state of the Encoder, and $s_0$ is the last hidden state of the Encoder)
    (each weight is a number between 0 and 1, and the weights sum to 1)
  • Calculation method:
  1. Linear maps:
    $k_i = W_K \cdot h_i$ for $i = 1$ to $m$, and $q_0 = W_Q \cdot s_0$
  2. Inner product:
    $\tilde{\alpha}_i = k_i^\mathrm{T} q_0$
  3. Normalization:
    $[\alpha_1, \cdots, \alpha_m] = \mathrm{softmax}([\tilde{\alpha}_1, \cdots, \tilde{\alpha}_m])$

There is another way to calculate the weights, which is the one used in the code later in this article, but the method above is now the more mainstream choice.
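As a concrete illustration of the first method, here is a minimal sketch of the weight computation in PyTorch (the sizes and the random tensors are arbitrary choices for the example):

import torch
import torch.nn.functional as F

m, hid_dim = 5, 8                      # arbitrary source length and hidden size
h = torch.randn(m, hid_dim)            # encoder hidden states h_1 ... h_m
s0 = torch.randn(hid_dim)              # last encoder hidden state

W_K = torch.randn(hid_dim, hid_dim)    # linear map producing the keys
W_Q = torch.randn(hid_dim, hid_dim)    # linear map producing the query

k = h @ W_K.T                          # k_i = W_K · h_i,  shape [m, hid_dim]
q0 = W_Q @ s0                          # q_0 = W_Q · s_0,  shape [hid_dim]

alpha_tilde = k @ q0                   # inner products k_i^T q_0, shape [m]
alpha = F.softmax(alpha_tilde, dim=0)  # weights between 0 and 1 that sum to 1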

  • After the weights are obtained, compute the Context Vector:

$c_0 = \alpha_1 h_1 + \cdots + \alpha_m h_m$

  • Update the Decoder state vector (here $s_0 = h_m$); see the sketch after this list:
  1. SimpleRNN
    $s_1 = \tanh\left(A' \cdot \begin{bmatrix} X_1' \\ s_0 \end{bmatrix} + b\right)$
  2. SimpleRNN+Attention
    $s_1 = \tanh\left(A' \cdot \begin{bmatrix} X_1' \\ s_0 \\ c_0 \end{bmatrix} + b\right)$
  • Disadvantage: the amount of computation is large
  • The time complexity of attention is $O(mt)$ (the product of the encoder and decoder sequence lengths)
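Putting the pieces together, here is a minimal sketch of the context vector and the SimpleRNN+Attention state update (sizes and parameters are again arbitrary illustrative choices):

import torch
import torch.nn.functional as F

m, hid_dim, emb_dim = 5, 8, 6             # arbitrary sizes for the example
h = torch.randn(m, hid_dim)               # encoder states h_1 ... h_m
s0 = h[-1]                                # s_0 = h_m
alpha = F.softmax(torch.randn(m), dim=0)  # attention weights (computed as above)

# Context vector: c_0 = alpha_1 * h_1 + ... + alpha_m * h_m
c0 = (alpha.unsqueeze(1) * h).sum(dim=0)  # shape [hid_dim]

# SimpleRNN + Attention update: s_1 = tanh(A' · [X_1'; s_0; c_0] + b)
x1 = torch.randn(emb_dim)                 # embedding of the first decoder input
A = torch.randn(hid_dim, emb_dim + 2 * hid_dim)
b = torch.randn(hid_dim)
s1 = torch.tanh(A @ torch.cat([x1, s0, c0]) + b)  # new decoder state, shape [hid_dim]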

GRU

Since GRU is used in the following code, it is briefly introduced here.

GRU (Gated Recurrent Unit) is a kind of RNN. Like LSTM, it was proposed to address the gradient problems that arise with long-term dependencies and backpropagation.

The input/output structure of a GRU is similar to that of an ordinary RNN, and its internal idea is similar to that of an LSTM.

Compared with LSTM, a GRU has one fewer gate and therefore fewer parameters and faster training, while achieving comparable results.

Input and output of GRU

  • The same as an ordinary RNN, so there is nothing special to explain here

Internal structure of GRU

  1. Use the previous state $h_{t-1}$ and the current input $x_t$ to obtain two gating signals:
    Reset gate: $r = \sigma\left(W^r \cdot \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\right)$
    Update gate: $z = \sigma\left(W^z \cdot \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\right)$

  2. After obtaining the gating signals, first use the reset gate to get the reset state $h_{t-1}' = h_{t-1} \cdot r$. Then concatenate $h_{t-1}'$ with the input $x_t$ and pass the result through a tanh to scale it to the range $-1$ to $1$: $h' = \tanh\left(W \cdot \begin{bmatrix} x_t \\ h_{t-1}' \end{bmatrix}\right)$. Here $h'$ mainly contains the current input $x_t$, with the previous hidden state mixed in through the reset gate, which plays a role similar to the forget gate of an LSTM.

  3. Update memory stage
    This stage performs forgetting and memorizing at the same time.
    Using the update gate obtained earlier: $h_t = (1 - z) \cdot h_{t-1} + z \cdot h'$
    This is the key point of the GRU: a single update gate $z$ can forget and select new memories at the same time.
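To make the gate equations concrete, here is a minimal, hand-written GRU cell sketch following the formulas above (this is only an illustration, not the optimized nn.GRU implementation; bias terms are omitted for brevity):

import torch
import torch.nn as nn

class ToyGRUCell(nn.Module):
    """Illustrative GRU cell that follows the equations above (biases omitted)."""
    def __init__(self, input_dim, hid_dim):
        super().__init__()
        self.W_r = nn.Linear(input_dim + hid_dim, hid_dim, bias=False)  # reset gate
        self.W_z = nn.Linear(input_dim + hid_dim, hid_dim, bias=False)  # update gate
        self.W_h = nn.Linear(input_dim + hid_dim, hid_dim, bias=False)  # candidate state

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)
        r = torch.sigmoid(self.W_r(xh))            # reset gate r
        z = torch.sigmoid(self.W_z(xh))            # update gate z
        h_reset = r * h_prev                       # h'_{t-1} = h_{t-1} · r
        h_cand = torch.tanh(self.W_h(torch.cat([x_t, h_reset], dim=-1)))  # h'
        h_t = (1 - z) * h_prev + z * h_cand        # forget and memorize in one step
        return h_t

# Quick shape check with arbitrary sizes
cell = ToyGRUCell(input_dim=4, hid_dim=8)
h_t = cell(torch.randn(2, 4), torch.zeros(2, 8))   # h_t.shape == torch.Size([2, 8])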

Code implementation

Task: Machine Translation
Source language: German
Target language: English

  • Import package
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.legacy import data, datasets

import spacy
import numpy as np

import random
import math

import de_core_news_sm

Data preprocessing

  • Set seed
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
  • Next, we load the spaCy models and use them to define tokenizers for the source and target languages. (Downloading the models automatically can be troublesome; it is often easier to download them from spaCy's GitHub releases and install them manually.)
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')
  • Build the tokenizers
def tokenize_de(text):
    # Tokenizes German text from a string into a list of strings
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    # Tokenizes English text from a string into a list of strings
    return [tok.text for tok in spacy_en.tokenizer(text)]
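  • A quick sanity check of the German tokenizer (the sample sentence is just an illustrative input):
print(tokenize_de("Zwei junge Männer sind im Freien."))
# should print something like ['Zwei', 'junge', 'Männer', 'sind', 'im', 'Freien', '.']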
  • Next, we set up the Fields that determine how the data is processed. We add start-of-sequence and end-of-sequence tokens and lowercase all text.
SRC = data.Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

TRG = data.Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)
  • We load the training, validation, and test sets to create the dataset objects. We use the Multi30k dataset provided by torchtext, which contains about 30,000 parallel English, German, and French sentences, each about 12 words long.
train_data, valid_data, test_data = datasets.Multi30k.splits(exts = ('.de', '.en'),
                                                             fields = (SRC, TRG))
  • You can check the sizes of the loaded datasets
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000

  • View data instances
vars(train_data.examples[0])

{'src': ['zwei',
  'junge',
  'weiße',
  'männer',
  'sind',
  'im',
  'freien',
  'in',
  'der',
  'nähe',
  'vieler',
  'büsche',
  '.'],
 'trg': ['two',
  'young',
  ',',
  'white',
  'males',
  'are',
  'outside',
  'near',
  'many',
  'bushes',
  '.']}
  • Build the vocabulary, converting any token that appears fewer than 2 times into an <unk> token.
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
  • Check the size of the generated vocabularies
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (de) vocabulary: 7855
Unique tokens in target (en) vocabulary: 5893

  • View the most common words in the vocabulary and how many times they appear in the dataset.
SRC.vocab.freqs.most_common(20)

[('.', 28809),
 ('ein', 18851),
 ('einem', 13711),
 ('in', 11895),
 ('eine', 9909),
 (',', 8938),
 ('und', 8925),
 ('mit', 8843),
 ('auf', 8745),
 ('mann', 7805),
 ('einer', 6765),
 ('der', 4990),
 ('frau', 4186),
 ('die', 3949),
 ('zwei', 3873),
 ('einen', 3479),
 ('im', 3107),
 ('an', 3062),
 ('von', 2363),
 ('sich', 2273)]
  • You can also use the stoi (string to int) and itos (int to string) mappings; below, the first 10 words of the TRG vocab are printed with itos.
print(TRG.vocab.itos[:10])

# ['<unk>', '<pad>', '<sos>', '<eos>', 'a', '.', 'in', 'the', 'on', 'man']
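  • For the reverse direction, stoi maps a token to its index; for example:
print(TRG.vocab.stoi['<sos>'])
# 2, matching the position of '<sos>' in itos above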
  • Set the device and build the iterators
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)
  • Take a look at the generated batch
batch = next(iter(train_iterator))
print(batch)

[torchtext.legacy.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.cuda.LongTensor of size 25x64 (GPU 0)]
	[.trg]:[torch.cuda.LongTensor of size 28x64 (GPU 0)]

Build model

Encoder

The Encoder uses a single-layer bidirectional GRU

  • The hidden state output of a bidirectional GRU is the concatenation of two vectors

    For example: $h_1 = [\overrightarrow{h_1}; \overleftarrow{h_T}]$; $h_2 = [\overrightarrow{h_2}; \overleftarrow{h_{T-1}}]$; ...
    $output = \{h_1, h_2, \ldots, h_T\}$

  • Suppose the GRU has $m$ layers; then $hidden = \{[\overrightarrow{h_T^1}; \overleftarrow{h_1^1}], [\overrightarrow{h_T^2}; \overleftarrow{h_1^2}], \ldots, [\overrightarrow{h_T^m}; \overleftarrow{h_1^m}]\}$

  • What we need is the last layer of hidden (both the forward and backward directions), so we take out the last layer's hidden states via hidden[-2,:,:] and hidden[-1,:,:] and concatenate them to form $s_0$.

  • Dimension transformation of $s_0$ (we need to change the dimension of $s_0$ to match the Decoder's initial hidden state):

    1. enc_hidden: [n_layers * num_directions, batch_size, enc_hid_dim]
    2. Since the GRU is bidirectional, concatenate the two directions: [batch_size, enc_hid_dim * 2]
    3. Pass through a fully connected layer: [batch_size, dec_hid_dim]
    4. The remaining unsqueeze and repeat steps are completed in the next part.
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src): 
        '''
        src = [src_len, batch_size]
        '''
        src = src.transpose(0, 1) # src = [batch_size, src_len]
        embedded = self.dropout(self.embedding(src)).transpose(0, 1) # embedded = [src_len, batch_size, emb_dim]
        
        # enc_output = [src_len, batch_size, hid_dim * num_directions]
        # enc_hidden = [n_layers * num_directions, batch_size, hid_dim]
        enc_output, enc_hidden = self.rnn(embedded) # if h_0 is not given, it defaults to a zero tensor

        # enc_hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        # enc_output are always from the last layer
        
        # enc_hidden [-2, :, : ] is the last of the forwards RNN 
        # enc_hidden [-1, :, : ] is the last of the backwards RNN
        
        # initial decoder hidden is final hidden state of the forwards and backwards 
        # encoder RNNs fed through a linear layer
        # s = [batch_size, dec_hid_dim]
        s = torch.tanh(self.fc(torch.cat((enc_hidden[-2,:,:], enc_hidden[-1,:,:]), dim = 1)))
        # s is the output of the hidden layer, which is then used as the initial hidden state of the decoder
        # Due to the dimension mismatch, a fully connected network changes the dimension to adapt to the initial hidden layer dimension of the decoder.
        # unsqueeze(0) is required later
        
        return enc_output, s
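  • A quick shape check of the Encoder (the hyperparameter values here are illustrative, not prescribed by the article):
enc = Encoder(input_dim=len(SRC.vocab), emb_dim=256, enc_hid_dim=512,
              dec_hid_dim=512, dropout=0.5).to(device)
enc_output, s = enc(batch.src)       # batch is the example batch from above
print(enc_output.shape)              # [src_len, batch_size, enc_hid_dim * 2]
print(s.shape)                       # [batch_size, dec_hid_dim]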

Attention


$E_t = \tanh(attn(s_{t-1}, H))$
$\tilde{\alpha}_t = v \cdot E_t$
$\alpha_t = \mathrm{softmax}(\tilde{\alpha}_t)$

  1. $s_{t-1}$ refers to the previous hidden state of the decoder (its initial value $s_0$ comes from the Encoder)
  2. $H$ refers to the variable enc_output in the Encoder
  3. attn() is actually a simple fully connected neural network.

Dimension transformation:

  1. First, expand the s passed in from the encoder to [batch_size, src_len, dec_hid_dim]
  2. Transpose enc_output to [batch_size, src_len, enc_hid_dim * 2] so that the two can be concatenated
  3. Apply the first formula to obtain energy with shape [batch_size, src_len, dec_hid_dim]
  4. The second formula is a linear transformation to [batch_size, src_len, 1]; then squeeze(2) removes the last dimension
  5. Finally, softmax does not change the shape.

The attention weights are returned.

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim, bias=False)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, s, enc_output):
        
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]
        
        batch_size = enc_output.shape[1]
        src_len = enc_output.shape[0]
        
        # repeat decoder hidden state src_len times
        # s = [batch_size, src_len, dec_hid_dim]
        # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        s = s.unsqueeze(1).repeat(1, src_len, 1)
        enc_output = enc_output.transpose(0, 1)
        
        # energy = [batch_size, src_len, dec_hid_dim]
        energy = torch.tanh(self.attn(torch.cat((s, enc_output), dim = 2)))
        
        # attention = [batch_size, src_len]
        attention = self.v(energy).squeeze(2)
        
        return F.softmax(attention, dim=1)
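  • The returned weights sum to 1 over the source length, which can be checked quickly with random tensors (the sizes are illustrative):
attn = Attention(enc_hid_dim=512, dec_hid_dim=512)
fake_enc_output = torch.randn(20, 4, 512 * 2)   # [src_len, batch_size, enc_hid_dim * 2]
fake_s = torch.randn(4, 512)                    # [batch_size, dec_hid_dim]
a = attn(fake_s, fake_enc_output)
print(a.shape)                                  # torch.Size([4, 20])
print(a.sum(dim=1))                             # every entry is 1.0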

Decoder

  • The Decoder uses a unidirectional single-layer GRU

There are essentially three formulas:

  1. Use the weights obtained in the attention part to compute the context vector: $c = \alpha_t H$
  2. Update the decoder state vector: $s_t = \mathrm{GRU}(emb(y_t), c, s_{t-1})$
  3. Compute the prediction: $\hat{y}_t = f(emb(y_t), c, s_t)$
  • $H$ refers to the variable enc_output in the Encoder
  • $emb(y_t)$ refers to the result of passing dec_input through the word embedding layer
  • The $f()$ function is actually just used to transform dimensions (it produces the vocabulary-sized output)
  • torch.bmm performs batched matrix multiplication on two 3-dimensional tensors

Dimension transformation

  1. At the beginning, the decoder calls attention once to get the weights $\alpha_t$, whose shape is [batch_size, src_len]
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, dec_input, s, enc_output):
             
        # dec_input = [batch_size]
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]
        
        dec_input = dec_input.unsqueeze(1) # dec_input = [batch_size, 1]
        
        embedded = self.dropout(self.embedding(dec_input)).transpose(0, 1) # embedded = [1, batch_size, emb_dim]
        
        # a = [batch_size, 1, src_len]  
        a = self.attention(s, enc_output).unsqueeze(1)
        
        # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        enc_output = enc_output.transpose(0, 1)

        # c = [1, batch_size, enc_hid_dim * 2]
        c = torch.bmm(a, enc_output).transpose(0, 1)

        # rnn_input = [1, batch_size, (enc_hid_dim * 2) + emb_dim]
        rnn_input = torch.cat((embedded, c), dim = 2)
            
        # dec_output = [src_len(=1), batch_size, dec_hid_dim]
        # dec_hidden = [n_layers * num_directions, batch_size, dec_hid_dim]
        dec_output, dec_hidden = self.rnn(rnn_input, s.unsqueeze(0))
        
        # embedded = [batch_size, emb_dim]
        # dec_output = [batch_size, dec_hid_dim]
        # c = [batch_size, enc_hid_dim * 2]
        embedded = embedded.squeeze(0)
        dec_output = dec_output.squeeze(0)
        c = c.squeeze(0)
        
        # pred = [batch_size, output_dim]
        pred = self.fc_out(torch.cat((dec_output, c, embedded), dim = 1))
        
        return pred, dec_hidden.squeeze(0)

Seq2Seq (with attention)

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        # src = [src_len, batch_size]
        # trg = [trg_len, batch_size]
        # teacher_forcing_ratio is probability to use teacher forcing
        
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        # enc_output is all hidden states of the input sequence, back and forwards
        # s is the final forward and backward hidden states, passed through a linear layer
        enc_output, s = self.encoder(src)
                
        # first input to the decoder is the <sos> tokens
        dec_input = trg[0,:]
        
        for t in range(1, trg_len):
            
            # insert dec_input token embedding, previous hidden state and all encoder hidden states
            # receive output tensor (predictions) and new hidden state
            dec_output, s = self.decoder(dec_input, s, enc_output)
            
            # place predictions in a tensor holding predictions for each token
            outputs[t] = dec_output
            
            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            # get the highest predicted token from our predictions
            top1 = dec_output.argmax(1) 
            
            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            dec_input = trg[t] if teacher_force else top1

        return outputs
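  • A minimal sketch of how the pieces could be assembled and run for one training step (the hyperparameters, optimizer, and loss setup below are illustrative assumptions, not values prescribed by this article):
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = DEC_EMB_DIM = 256          # assumed embedding sizes
ENC_HID_DIM = DEC_HID_DIM = 512          # assumed hidden sizes
DROPOUT = 0.5
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DROPOUT, attn)
model = Seq2Seq(enc, dec, device).to(device)

optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)  # ignore padding positions

# One illustrative training step
model.train()
batch = next(iter(train_iterator))
optimizer.zero_grad()
output = model(batch.src, batch.trg)             # [trg_len, batch_size, output_dim]
output = output[1:].view(-1, output.shape[-1])   # drop the unused <sos> position
trg = batch.trg[1:].view(-1)
loss = criterion(output, trg)
loss.backward()
optimizer.step()
print(loss.item())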
