The basic assumption of RNN-based Seq2Seq: the last hidden state (a single vector) of the source sequence contains all the information in the sequence. (This is obviously unreasonable.)
Seq2Seq problem: insufficient ability to remember long sequences.
Solution: when generating a target-language word, consider not only the previous state and the words already generated, but also which words in the source sentence are most relevant to the word being generated, that is, which source words deserve the most attention. This approach is called the attention mechanism.
Attention
In 2015, Luong et al. published "Effective Approaches to Attention-based Neural Machine Translation", which studies attention mechanisms for neural machine translation (building on the earlier attention model of Bahdanau et al.). With attention, the Seq2Seq model greatly improves the quality of machine translation.
The reason is that the attention mechanism lets the Seq2Seq model focus selectively on different parts of the input sequence.
Example:
 Assume the model has generated the word "I" and now needs to generate the next word;
 Obviously, the next word is most closely related to "love" in the source sentence. Therefore, the state corresponding to "love" in the source sentence is multiplied by a larger weight, such as 0.6, while the other words get smaller weights;
 Finally, the states corresponding to the words in the source sentence are weighted and summed, and the result is used as an additional input when updating the state.
Seq2Seq model combined with the attention mechanism

With attention, the Seq2Seq decoder looks at all the states of the encoder every time it updates its own state, so the decoder knows where to focus.

After Encoder finishes working, Attention and Decoder start working at the same time

Attention can be understood simply as a weighted sum; the key point is how to obtain the weights.
 Calculate the weights:
$$\alpha_i = \mathrm{align}(h_i, s_0)$$
(this measures the correlation between $h_i$ and $s_0$)
($h_i$ is a hidden state of the Encoder; $s_0$ is the last hidden state of the Encoder)
(each weight is a number between 0 and 1, and the weights sum to 1)
 Calculation method:
 Linear maps:
$$k_i = W_K \cdot h_i, \quad \text{for } i = 1 \text{ to } m$$
$$q_0 = W_Q \cdot s_0$$
 Inner product:
$$\tilde{\alpha}_i = k_i^{\mathrm{T}} q_0$$
 Normalization:
$$[\alpha_1, \cdots, \alpha_m] = \mathrm{softmax}([\tilde{\alpha}_1, \cdots, \tilde{\alpha}_m])$$
There is another way to calculate the weights, and it is the one used in the code of this article; the method above is the more mainstream one today.
 After the weights are obtained, compute the context vector:
$$c_0 = \alpha_1 h_1 + \cdots + \alpha_m h_m$$
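The three steps above (linear maps, inner product, normalization) and the context vector can be sketched in a few lines. This is only an illustrative sketch in plain Python with no dependencies: the helper names (`softmax`, `matvec`, `attention`) and the toy numbers are assumptions made for the example, and identity matrices stand in for the learned $W_K$ and $W_Q$.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attention(encoder_states, s0, W_K, W_Q):
    """Return (weights, context) for one decoder state s0."""
    q0 = matvec(W_Q, s0)                               # q_0 = W_Q s_0
    keys = [matvec(W_K, h) for h in encoder_states]    # k_i = W_K h_i
    scores = [sum(k * q for k, q in zip(key, q0)) for key in keys]  # k_i^T q_0
    weights = softmax(scores)                          # alpha_1 .. alpha_m
    dim = len(encoder_states[0])
    # c_0 = alpha_1 h_1 + ... + alpha_m h_m
    context = [sum(a * h[d] for a, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy values (made up): m = 3 encoder states of size 2, identity projections.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s0 = [1.0, 1.0]
I2 = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention(H, s0, I2, I2)
print(weights, context)
```

The third encoder state has the largest inner product with `s0`, so it receives the largest weight and dominates the context vector.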
 Update the Decoder state vector ($s_0 = h_m$)
 SimpleRNN:
$$s_1 = \tanh\left(A' \cdot \begin{bmatrix} X_1' \\ s_0 \end{bmatrix} + b\right)$$
 SimpleRNN + Attention:
$$s_1 = \tanh\left(A' \cdot \begin{bmatrix} X_1' \\ s_0 \\ c_0 \end{bmatrix} + b\right)$$
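The difference between the two updates is only that the context vector is appended to the stacked input. A minimal sketch, assuming made-up weights and a hypothetical helper `rnn_step` (plain Python lists stand in for vectors):

```python
import math

def rnn_step(A, b, x, s_prev, c=None):
    """One SimpleRNN update; pass c to include the context vector."""
    stacked = x + s_prev + (c if c is not None else [])   # [x; s] or [x; s; c]
    return [math.tanh(sum(a * v for a, v in zip(row, stacked)) + bias)
            for row, bias in zip(A, b)]

# Toy values (made up): input of size 1, state of size 2, context of size 2.
x1 = [1.0]
s0 = [0.0, 0.0]
c0 = [0.5, -0.5]
b = [0.0, 0.0]
A_plain = [[0.1, 0.2, 0.3],
           [0.4, 0.5, 0.6]]              # shape 2 x (1 + 2)
A_attn = [[0.1, 0.2, 0.3, 1.0, 0.0],
          [0.4, 0.5, 0.6, 0.0, 1.0]]     # shape 2 x (1 + 2 + 2)

s1_plain = rnn_step(A_plain, b, x1, s0)
s1_attn = rnn_step(A_attn, b, x1, s0, c0)
print(s1_plain, s1_attn)
```

Note that the parameter matrix grows by as many columns as the context vector has entries.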
 Disadvantage: the amount of computation is very large
 The time complexity of attention is $O(mt)$, the product of the encoder sequence length $m$ and the decoder sequence length $t$
GRU
Since GRU is used in the following code, it is briefly introduced here.
GRU (Gated Recurrent Unit) is a kind of RNN. Like LSTM, it was proposed to address long-term memory and the gradient problems that arise in backpropagation.
The input and output structure of GRU is the same as that of an ordinary RNN, and its internal idea is similar to LSTM.
Compared with LSTM, GRU has one fewer gate and relatively fewer parameters, so it trains faster, while in many cases achieving results comparable to LSTM.
Input and output of GRU
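 Same as an ordinary RNN: the inputs are the current input $x_t$ and the previous hidden state $h_{t-1}$, and the output is the new hidden state $h_t$.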
Internal structure of GRU

From the previous state $h_{t-1}$ and the current input $x_t$, obtain the two gate states
Reset gate: $r = \sigma\left(W^r \cdot \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\right)$
Update gate: $z = \sigma\left(W^z \cdot \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\right)$
After obtaining the gate signals, first use the reset gate to obtain the reset data $h_{t-1}' = h_{t-1} \cdot r$ (an elementwise product). Then concatenate $h_{t-1}'$ with the input $x_t$ and squash the result to the range -1 to 1 through a tanh function:
$$h' = \tanh\left(W \begin{bmatrix} x_t \\ h_{t-1}' \end{bmatrix}\right)$$
Here $h'$ mainly contains the current input $x_t$, with the reset hidden state of the previous time step added in, which is similar to the forget gate of LSTM.

Update memory stage
This stage performs two steps at once: forgetting and remembering.
It uses the update gate $z$ obtained earlier:
$$h_t = (1 - z) \cdot h_{t-1} + z \cdot h'$$
This is the key point of GRU: a single update gate $z$ both forgets and selects what to remember.
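The four GRU equations above can be traced with a small dependency-free sketch. The function name `gru_step`, the omission of bias terms, and the toy weights are assumptions for illustration; a real layer such as `nn.GRU` uses a different parameter layout.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def gru_step(x_t, h_prev, W_r, W_z, W):
    """One GRU update following the equations above (no bias terms)."""
    xh = x_t + h_prev                                    # [x_t; h_{t-1}]
    r = [sigmoid(v) for v in matvec(W_r, xh)]            # reset gate
    z = [sigmoid(v) for v in matvec(W_z, xh)]            # update gate
    h_reset = [h * g for h, g in zip(h_prev, r)]         # h'_{t-1} = h_{t-1} * r
    h_cand = [math.tanh(v) for v in matvec(W, x_t + h_reset)]   # h'
    # h_t = (1 - z) * h_{t-1} + z * h'
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

# Toy values (made up): input size 1, hidden size 2.
x_t = [1.0]
h_prev = [0.5, -0.5]
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # gates become sigmoid(0) = 0.5
W = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]       # candidate = [tanh(x_t), 0]
h_t = gru_step(x_t, h_prev, zeros, zeros, W)
print(h_t)
```

With both gates at 0.5, each entry of `h_t` is the average of the old state and the candidate, which makes the "forget and remember at the same time" behaviour of $z$ easy to see.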
Code implementation
Task: Machine Translation
Source language: German
Target language: English
 Import package
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchtext.legacy import data, datasets
import spacy
import numpy as np
import random
import math
import de_core_news_sm
Data preprocessing
 Set seed
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
 Next, we load the spaCy models and define tokenizers for the source language and the target language. (Installing the models can be troublesome; downloading them directly from spaCy's GitHub and installing them is recommended.)
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')
 Build the tokenizers
def tokenize_de(text):
    # Tokenizes German text from a string into a list of strings
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    # Tokenizes English text from a string into a list of strings
    return [tok.text for tok in spacy_en.tokenizer(text)]
 Next, we set the fields that determine how the data is processed. We add start-of-sequence and end-of-sequence tokens to every sentence and lowercase all text.
SRC = data.Field(tokenize = tokenize_de,
                 init_token = '<sos>',
                 eos_token = '<eos>',
                 lower = True)
TRG = data.Field(tokenize = tokenize_en,
                 init_token = '<sos>',
                 eos_token = '<eos>',
                 lower = True)
 We load the training, validation and test sets to generate the dataset objects. We use the Multi30k dataset provided by torchtext, a dataset of about 30000 parallel English, German and French sentences, each containing about 12 words.
train_data, valid_data, test_data = datasets.Multi30k.splits(exts = ('.de', '.en'),
                                                             fields = (SRC, TRG))
 Check the sizes of the loaded datasets
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")
# Number of training examples: 29000
# Number of validation examples: 1014
# Number of testing examples: 1000
 View data instances
vars(train_data.examples[0])
# {'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'],
#  'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}
 Build the vocabularies; any token that appears fewer than 2 times is converted to the <unk> token.
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
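To make the effect of `min_freq` concrete, here is a rough sketch of the idea in plain Python. It is not torchtext's implementation: the function names, the tie-breaking order and the toy corpus are assumptions chosen for the example.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Return (itos, stoi); tokens rarer than min_freq are left out."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first, ties broken alphabetically (an arbitrary choice here).
    kept = sorted((t for t, n in counts.items() if n >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids, sending out-of-vocabulary tokens to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = [["a", "man", "."], ["a", "dog", "."], ["a", "man", "runs", "."]]
itos, stoi = build_vocab(corpus)
ids = numericalize(["a", "cat", "."], stoi)
print(itos)
print(ids)
```

In this toy corpus "dog" and "runs" occur only once, so they are dropped from the vocabulary, and an unseen word such as "cat" is mapped to the id of `<unk>`.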
 Check the sizes of the generated vocabularies
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")
# Unique tokens in source (de) vocabulary: 7855
# Unique tokens in target (en) vocabulary: 5893
 View the most common words in the vocabulary and how many times they appear in the dataset.
SRC.vocab.freqs.most_common(20)
# [('.', 28809), ('ein', 18851), ('einem', 13711), ('in', 11895), ('eine', 9909),
#  (',', 8938), ('und', 8925), ('mit', 8843), ('auf', 8745), ('mann', 7805),
#  ('einer', 6765), ('der', 4990), ('frau', 4186), ('die', 3949), ('zwei', 3873),
#  ('einen', 3479), ('im', 3107), ('an', 3062), ('von', 2363), ('sich', 2273)]
 You can also use the stoi (string to int) or itos (int to string) attributes of a vocabulary; below, itos is used to print the first 10 tokens of the TRG vocabulary.
print(TRG.vocab.itos[:10])
# ['<unk>', '<pad>', '<sos>', '<eos>', 'a', '.', 'in', 'the', 'on', 'man']
 Set up GPU, build iterator
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 64
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)
 Take a look at the generated batch
batch = next(iter(train_iterator))
print(batch)
# [torchtext.legacy.data.batch.Batch of size 64 from MULTI30K]
#     [.src]:[torch.cuda.LongTensor of size 25x64 (GPU 0)]
#     [.trg]:[torch.cuda.LongTensor of size 28x64 (GPU 0)]
Build model
Encoder
The Encoder uses a single-layer bidirectional GRU

The hidden state output of a bidirectional GRU is composed of two vectors
For example: $h_1 = [\overrightarrow{h_1}; \overleftarrow{h_T}]$; $h_2 = [\overrightarrow{h_2}; \overleftarrow{h_{T-1}}]$; ...
$$output = \{h_1, h_2, \ldots, h_T\}$$
Suppose this is an $m$-layer GRU:
$$hidden = \{[\overrightarrow{h_T^1}; \overleftarrow{h_1^1}], [\overrightarrow{h_T^2}; \overleftarrow{h_1^2}], \ldots, [\overrightarrow{h_T^m}; \overleftarrow{h_1^m}]\}$$

What we need is the hidden state of the last layer (both forward and backward), so we can take it out through hidden[-2,:,:] and hidden[-1,:,:] and concatenate the two as $s_0$.

Dimension transformation of $s_0$ (we need to change the dimensions of $s_0$ to match the initial hidden state of the decoder)
 enc_hidden: [n_layers * num_directions, batch_size, enc_hid_dim]
 Since it is bidirectional, concatenate the forward and backward states: [batch_size, enc_hid_dim * 2]
 Pass through a fully connected layer: [batch_size, dec_hid_dim]
 The remaining unsqueeze and repeat are completed in the next step.
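The indexing and shape changes listed above can be traced without PyTorch. The sketch below uses nested Python lists in place of tensors; `initial_decoder_state`, the toy weights and the toy states are hypothetical stand-ins for the fully connected layer and the GRU output.

```python
import math

def initial_decoder_state(enc_hidden, fc_weight, fc_bias):
    """enc_hidden: [n_layers * 2][batch][enc_hid_dim] nested lists."""
    # Negative indices pick the last layer: [-2] is forward, [-1] is backward.
    last_forward, last_backward = enc_hidden[-2], enc_hidden[-1]
    states = []
    for fwd, bwd in zip(last_forward, last_backward):    # loop over the batch
        cat = fwd + bwd                                   # [enc_hid_dim * 2]
        states.append([math.tanh(sum(w * v for w, v in zip(row, cat)) + b)
                       for row, b in zip(fc_weight, fc_bias)])
    return states                                         # [batch][dec_hid_dim]

# Toy values (made up): 1 layer, batch_size 2, enc_hid_dim 2, dec_hid_dim 3.
enc_hidden = [
    [[0.1, 0.2], [0.3, 0.4]],   # final forward states, one per batch item
    [[0.5, 0.6], [0.7, 0.8]],   # final backward states, one per batch item
]
fc_weight = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0],
             [0.0, 1.0, 0.0, 1.0]]
fc_bias = [0.0, 0.0, 0.0]
s = initial_decoder_state(enc_hidden, fc_weight, fc_bias)
print(len(s), len(s[0]))
```

Indexing from the end works for any number of layers, because the states are stacked as forward_1, backward_1, forward_2, backward_2, and so on.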
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        '''
        src = [src_len, batch_size]
        '''
        src = src.transpose(0, 1) # src = [batch_size, src_len]
        embedded = self.dropout(self.embedding(src)).transpose(0, 1) # embedded = [src_len, batch_size, emb_dim]

        # enc_output = [src_len, batch_size, hid_dim * num_directions]
        # enc_hidden = [n_layers * num_directions, batch_size, hid_dim]
        enc_output, enc_hidden = self.rnn(embedded) # if h_0 is not given, it defaults to zeros

        # enc_hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        # enc_output is always from the last layer

        # enc_hidden[-2, :, :] is the last of the forwards RNN
        # enc_hidden[-1, :, :] is the last of the backwards RNN

        # initial decoder hidden is final hidden state of the forwards and backwards
        # encoder RNNs fed through a linear layer
        # s = [batch_size, dec_hid_dim]
        s = torch.tanh(self.fc(torch.cat((enc_hidden[-2,:,:], enc_hidden[-1,:,:]), dim = 1)))

        # s is used as the initial hidden state of the decoder.
        # Because the dimensions do not match, the fully connected layer maps it
        # to the decoder's hidden dimension; unsqueeze(0) is applied later.
        return enc_output, s
Attention
$$E_t = \tanh(\mathrm{attn}(s_{t-1}, H))$$
$$\tilde{\alpha}_t = v \cdot E_t$$
$$\alpha_t = \mathrm{softmax}(\tilde{\alpha}_t)$$
 $s_{t-1}$ is the previous hidden state of the decoder; at the first step it is the variable s returned by the Encoder
 $H$ refers to the variable enc_output in the Encoder
 attn() is actually a simple fully connected neural network.
Dimension transformation:
 First, expand the s passed from the encoder to [batch_size, src_len, dec_hid_dim]
 Transpose enc_output to [batch_size, src_len, enc_hid_dim * 2] so that the two can be concatenated
 Compute the first formula to obtain energy: [batch_size, src_len, dec_hid_dim]
 The second formula is a linear transformation to [batch_size, src_len, 1]; squeeze(2) then removes the last dimension
 Finally, softmax does not change the dimensions.
The weights are returned.
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim, bias=False)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)

    def forward(self, s, enc_output):
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]
        batch_size = enc_output.shape[1]
        src_len = enc_output.shape[0]

        # repeat decoder hidden state src_len times
        # s = [batch_size, src_len, dec_hid_dim]
        # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        s = s.unsqueeze(1).repeat(1, src_len, 1)
        enc_output = enc_output.transpose(0, 1)

        # energy = [batch_size, src_len, dec_hid_dim]
        energy = torch.tanh(self.attn(torch.cat((s, enc_output), dim = 2)))

        # attention = [batch_size, src_len]
        attention = self.v(energy).squeeze(2)
        return F.softmax(attention, dim=1)
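For a single sentence (no batch dimension), the same three formulas can be followed by hand. This is only a sketch in plain Python: `additive_attention`, the one-dimensional toy states and the weights are assumptions for illustration and do not reproduce the trained parameters of the module above.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(s_prev, enc_states, W_attn, v):
    """s_prev: [dec_hid_dim]; enc_states: [src_len][enc_hid_dim * 2]."""
    scores = []
    for h in enc_states:
        cat = s_prev + h                                  # [s_{t-1}; h_i]
        # E_t = tanh(attn(s_{t-1}, H)), one row of energy per source position
        energy = [math.tanh(sum(w * x for w, x in zip(row, cat)))
                  for row in W_attn]
        scores.append(sum(vi * e for vi, e in zip(v, energy)))   # v . E_t
    return softmax(scores)                                # alpha_t

# Toy values (made up): dec_hid_dim 1, encoder state size 1, src_len 3.
s_prev = [1.0]
enc_states = [[0.0], [1.0], [2.0]]
W_attn = [[0.0, 1.0]]      # ignores s_prev, passes the encoder state through
v = [1.0]
alpha = additive_attention(s_prev, enc_states, W_attn, v)
print(alpha)
```

The module above does the same computation for a whole batch at once, which is why it repeats s along a src_len dimension before concatenating.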
Decoder
 The Decoder uses a unidirectional single-layer GRU
The Decoder actually comes down to three formulas:
 Use the weights obtained in the attention part to calculate the context vector: $c = \alpha_t H$
 Update the decoder state vector: $s_t = \mathrm{GRU}(\mathrm{emb}(y_t), c, s_{t-1})$
 Compute the prediction: $\hat{y}_t = f(\mathrm{emb}(y_t), c, s_t)$
 $H$ refers to the variable enc_output in the Encoder
 $\mathrm{emb}(y_t)$ refers to the result of passing dec_input through the word embedding
 The function $f()$ is actually used to convert dimensions (the fully connected layer fc_out in the code)
 torch.bmm performs a batched matrix multiplication of two 3-dimensional tensors
Dimension transformation
 At the beginning, the decoder calls attention once to get the weights $\alpha_t$, whose dimension is [batch_size, src_len]
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, dec_input, s, enc_output):
        # dec_input = [batch_size]
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]

        dec_input = dec_input.unsqueeze(1) # dec_input = [batch_size, 1]
        embedded = self.dropout(self.embedding(dec_input)).transpose(0, 1) # embedded = [1, batch_size, emb_dim]

        # a = [batch_size, 1, src_len]
        a = self.attention(s, enc_output).unsqueeze(1)

        # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        enc_output = enc_output.transpose(0, 1)

        # c = [1, batch_size, enc_hid_dim * 2]
        c = torch.bmm(a, enc_output).transpose(0, 1)

        # rnn_input = [1, batch_size, (enc_hid_dim * 2) + emb_dim]
        rnn_input = torch.cat((embedded, c), dim = 2)

        # dec_output = [src_len(=1), batch_size, dec_hid_dim]
        # dec_hidden = [n_layers * num_directions, batch_size, dec_hid_dim]
        dec_output, dec_hidden = self.rnn(rnn_input, s.unsqueeze(0))

        # embedded = [batch_size, emb_dim]
        # dec_output = [batch_size, dec_hid_dim]
        # c = [batch_size, enc_hid_dim * 2]
        embedded = embedded.squeeze(0)
        dec_output = dec_output.squeeze(0)
        c = c.squeeze(0)

        # pred = [batch_size, output_dim]
        pred = self.fc_out(torch.cat((dec_output, c, embedded), dim = 1))
        return pred, dec_hidden.squeeze(0)
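The call to torch.bmm in the decoder is what turns the weights into the context vector $c = \alpha_t H$. A small sketch of the same batched product with nested lists may help with the shapes; the function `bmm` below is a plain-Python stand-in written for this example, and the numbers are made up.

```python
def bmm(a, b):
    """Batched matrix product: a is [B][n][m], b is [B][m][p] -> [B][n][p]."""
    return [[[sum(x * y for x, y in zip(row, col)) for col in zip(*mat_b)]
             for row in mat_a]
            for mat_a, mat_b in zip(a, b)]

# Toy values (made up): batch_size 2, src_len 3, enc_hid_dim * 2 = 2.
a = [[[0.2, 0.2, 0.6]],             # weights for sentence 0: [1, src_len]
     [[1.0, 0.0, 0.0]]]             # weights for sentence 1
enc_output = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],    # [src_len, 2]
              [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]]
c = bmm(a, enc_output)              # [batch_size, 1, 2]
print(c)
```

Each sentence in the batch is multiplied by its own weight row, so the second sentence, whose weight is entirely on the first source position, simply copies that position's state.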
Seq2Seq (with attention)
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        # src = [src_len, batch_size]
        # trg = [trg_len, batch_size]
        # teacher_forcing_ratio is probability to use teacher forcing
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        # enc_output is all hidden states of the input sequence, back and forwards
        # s is the final forward and backward hidden states, passed through a linear layer
        enc_output, s = self.encoder(src)

        # first input to the decoder is the <sos> tokens
        dec_input = trg[0,:]

        for t in range(1, trg_len):
            # insert dec_input token embedding, previous hidden state and all encoder hidden states
            # receive output tensor (predictions) and new hidden state
            dec_output, s = self.decoder(dec_input, s, enc_output)

            # place predictions in a tensor holding predictions for each token
            outputs[t] = dec_output

            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio

            # get the highest predicted token from our predictions
            top1 = dec_output.argmax(1)

            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            dec_input = trg[t] if teacher_force else top1

        return outputs
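The teacher forcing decision inside the loop can be isolated into a few lines. The helper name `choose_next_input` and the string tokens below are hypothetical and only serve to show the behaviour at the two extreme ratios.

```python
import random

def choose_next_input(target_token, predicted_token, teacher_forcing_ratio):
    """Pick the next decoder input for one time step."""
    # random.random() is in [0.0, 1.0), so a ratio of 1.0 always uses the
    # ground-truth token and a ratio of 0.0 always uses the model's prediction.
    teacher_force = random.random() < teacher_forcing_ratio
    return target_token if teacher_force else predicted_token

always_teacher = [choose_next_input("gold", "pred", 1.0) for _ in range(5)]
never_teacher = [choose_next_input("gold", "pred", 0.0) for _ in range(5)]
print(always_teacher, never_teacher)
```

With the default ratio of 0.5 used above, roughly half of the decoding steps are fed the ground-truth token. At evaluation time the ratio is typically set to 0 so that the model only sees its own predictions.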