# 1. Understanding the Transformer architecture

## 1.1 Role of the Transformer model

- The Transformer model, built on the seq2seq architecture, can handle typical NLP tasks such as machine translation and text generation. It can also be used to build a pre-trained language model for transfer learning across different tasks.

Note:

In the architecture analysis that follows, we assume the Transformer is used to translate text from one language into another, so much of the naming follows NLP conventions. For example, the Embedding layer is called the text embedding layer, the tensor it produces is called the word embedding tensor, and the last dimension of that tensor is called the word vector.

## 1.2 Transformer overall architecture diagram

The overall architecture of Transformer can be divided into four parts:

- Input part
- Output part
- Encoder part
- Decoder part

The input section contains:

- Source text embedding layer and its position encoder
- Target text embedding layer and its position encoder

The output section contains:

- Linear layer
- softmax layer
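
As a rough illustration of the output part, the sketch below applies a linear projection followed by softmax to a single decoder output vector in plain Python. The names `linear` and `softmax`, the toy weights, and the three-word vocabulary are assumptions made for illustration; a real model would typically use `nn.Linear` and a softmax over a much larger vocabulary.

```python
import math

def linear(x, weight, bias):
    """Project vector x (length d_model) to vocabulary-size logits."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values (hypothetical): d_model = 2, vocabulary of 3 words
decoder_output = [1.0, -1.0]
weight = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]  # shape: vocab x d_model
bias = [0.0, 0.0, 0.0]

logits = linear(decoder_output, weight, bias)
probs = softmax(logits)
predicted_index = probs.index(max(probs))  # index of the most likely word
print(logits, probs, predicted_index)
```

The index with the highest probability is taken as the predicted word for that position.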

Encoder part:

- It is a stack of N encoder layers
- Each encoder layer consists of two sublayer connection structures
- The first sublayer connection structure includes a multi-head self-attention sublayer, a normalization layer and a residual connection
- The second sublayer connection structure includes a fully connected feed-forward sublayer, a normalization layer and a residual connection

Decoder part:

- It is a stack of N decoder layers
- Each decoder layer consists of three sublayer connection structures
- The first sublayer connection structure includes a masked multi-head self-attention sublayer, a normalization layer and a residual connection
- The second sublayer connection structure includes a multi-head attention sublayer that attends to the encoder output, a normalization layer and a residual connection
- The third sublayer connection structure includes a fully connected feed-forward sublayer, a normalization layer and a residual connection
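
The "sublayer connection structure" that recurs in the lists above can be sketched in plain Python as a residual connection wrapped around a sublayer, followed by normalization. This is a minimal sketch rather than the implementation used in a full model: the function names are hypothetical, dropout and the learnable scale and shift of the normalization layer are omitted, and applying normalization after the residual sum is an assumption (some implementations normalize the sublayer input instead).

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_connection(x, sublayer):
    """Residual connection around `sublayer`, followed by normalization."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)

def toy_sublayer(x):
    """Hypothetical stand-in for attention or feed-forward: doubles each value."""
    return [2.0 * v for v in x]

out = sublayer_connection([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```

Because the input is added back to the sublayer output, each sublayer only has to learn a correction to its input, and the normalization keeps the values on a comparable scale from layer to layer.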

# 2. Implementation of input part

The input section contains:

- Source text embedding layer and its position encoder
- Target text embedding layer and its position encoder

## 2.1 Role of the text embedding layer

Both the source text embedding and the target text embedding convert the numerical (index) representation of each word into a vector representation, in the hope of capturing the relationships between words in a high-dimensional space.
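
To make the idea of mapping word indices to vectors concrete, here is a plain-Python sketch of an embedding lookup. The table values, the sizes, and the `embed` helper are made up for illustration; in a real model the table is a learned parameter.

```python
import random

random.seed(0)
vocab_size, d_model = 10, 4  # toy sizes, chosen only for illustration

# One row per word index; a trained model would learn these values
lookup_table = [[random.gauss(0.0, 1.0) for _ in range(d_model)]
                for _ in range(vocab_size)]

def embed(batch):
    """Replace every word index in a batch of sentences by its vector."""
    return [[lookup_table[idx] for idx in sentence] for sentence in batch]

batch = [[1, 2, 4, 5], [4, 3, 2, 9]]  # 2 sentences of 4 word indices each
vectors = embed(batch)

# Shape is batch x sentence length x d_model
print(len(vectors), len(vectors[0]), len(vectors[0][0]))
# The same index (here 4) always maps to the same vector
print(vectors[0][2] == vectors[1][0])
```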

Installing PyTorch 0.3.0 and the required packages:

```bash
# Packages installed with pip: pytorch-0.3.0, numpy, matplotlib and seaborn
pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl numpy matplotlib seaborn

# macOS installation, Python version <= 3.6
pip install torch==0.3.0.post4 numpy matplotlib seaborn
```

Code analysis of text embedding layer:

```python
# Import the required packages
import torch

# torch.nn holds predefined network layers that the library developers have already
# built for us, such as convolution, LSTM and embedding layers, so we don't need to
# reinvent the wheel
import torch.nn as nn

# Math utilities
import math

# Variable is the tensor wrapper in torch (as used by PyTorch 0.3.0)
from torch.autograd import Variable

# Define the Embeddings class to implement the text embedding layer. The trailing "s"
# indicates that there are two identical embedding layers, which share parameters.
# The class inherits from nn.Module, so it has the functionality of a standard layer.
# You can treat this as a pattern: every layer we implement ourselves is written this way.
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        """Initialization function. It takes two parameters:
        d_model: the dimension of the word embedding,
        vocab: the size of the vocabulary."""
        # Call the initialization function of the parent class nn.Module via super.
        # Every layer we implement ourselves does this.
        super(Embeddings, self).__init__()
        # Use the predefined nn.Embedding layer to obtain the word embedding object self.lut
        self.lut = nn.Embedding(vocab, d_model)
        # Finally, store d_model on the instance
        self.d_model = d_model

    def forward(self, x):
        """Forward propagation logic of this layer; every layer has this function.
        It is called automatically when arguments are passed to an instance of the class.
        x: because the embedding layer is the first layer, x is the tensor of
           vocabulary indices for the text fed into the model."""
        # Pass x through self.lut, multiply by the square root of self.d_model,
        # and return the result
        return self.lut(x) * math.sqrt(self.d_model)
```

nn.Embedding Demo:

```python
>>> embedding = nn.Embedding(10, 3)
>>> input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
>>> embedding(input)
tensor([[[-0.0251, -1.6902,  0.7172],
         [-0.6431,  0.0748,  0.6969],
         [ 1.4970,  1.3448, -0.9685],
         [-0.3677, -2.7265, -0.1685]],

        [[ 1.4970,  1.3448, -0.9685],
         [ 0.4362, -0.4004,  0.9400],
         [-0.6431,  0.0748,  0.6969],
         [ 0.9124, -2.3616,  1.1151]]])

>>> embedding = nn.Embedding(10, 3, padding_idx=0)
>>> input = torch.LongTensor([[0,2,0,5]])
>>> embedding(input)
tensor([[[ 0.0000,  0.0000,  0.0000],
         [ 0.1535, -2.0309,  0.9315],
         [ 0.0000,  0.0000,  0.0000],
         [-0.1655,  0.9897,  0.0635]]])
```

Instantiation parameters:

```python
# The word embedding dimension is 512
d_model = 512

# The vocabulary size is 1000
vocab = 1000
```

Input parameters:

```python
# The input x is a LongTensor wrapped in a Variable, with shape 2 x 4
x = Variable(torch.LongTensor([[100,2,421,508],[491,998,1,221]]))
```

Call:

```python
emb = Embeddings(d_model, vocab)
embr = emb(x)
print("embr:", embr)
```

Output:

```text
embr: Variable containing:
( 0 ,.,.) =
  35.9321   3.2582 -17.7301  ...    3.4109  13.8832  39.0272
   8.5410  -3.5790 -12.0460  ...   40.1880  36.6009  34.7141
 -17.0650  -1.8705 -20.1807  ...  -12.5556 -34.0739  35.6536
  20.6105   4.4314  14.9912  ...   -0.1342  -9.9270  28.6771

( 1 ,.,.) =
  27.7016  16.7183  46.6900  ...   17.9840  17.2525  -3.9709
   3.0645  -5.5105  10.8802  ...  -13.0069  30.8834 -38.3209
  33.1378 -32.1435  -3.9369  ...   15.6094 -29.7063  40.1361
 -31.5056   3.3648   1.4726  ...    2.8047  -9.6514 -23.4909
[torch.FloatTensor of size 2x4x512]
```

## 2.2 Role of the position encoder

The Transformer encoder structure does not itself process word position information. A position encoder is therefore added after the Embedding layer. Since the same word can carry a different meaning at a different position, the encoder adds position information to the word embedding tensor to make up for what is missing.
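
For reference, the sinusoidal position encoding commonly used with the Transformer can be written as follows, where $pos$ is the word position, $i$ indexes a pair of feature dimensions, and $d_{model}$ is the word embedding dimension. The symbols are introduced here for illustration.

$$
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

Even dimensions take the sine and odd dimensions take the cosine of the same angle.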

```python
# Define the position encoder class. We also treat it as a layer, so it inherits from nn.Module
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        """Initialization function of the position encoder class. It takes three parameters:
        d_model: the word embedding dimension,
        dropout: the dropout rate (fraction of elements set to 0),
        max_len: the maximum length of each sentence."""
        super(PositionalEncoding, self).__init__()

        # Instantiate the predefined nn.Dropout layer with the dropout rate to obtain self.dropout
        self.dropout = nn.Dropout(p=dropout)

        # Initialize the position encoding matrix as a zero matrix of size max_len x d_model
        pe = torch.zeros(max_len, d_model)

        # Initialize the absolute position matrix. Here the absolute position of a word
        # is simply its index, so we use arange to get a vector of consecutive natural
        # numbers and then unsqueeze to add a dimension and turn it into a matrix.
        # The argument 1 is the position of the new dimension, so the vector becomes
        # a max_len x 1 matrix. float() keeps the arithmetic below in floating point
        # on PyTorch versions where arange returns an integer tensor.
        position = torch.arange(0, max_len).float().unsqueeze(1)

        # With the absolute position matrix in place, the next question is how to add
        # this position information to the position encoding matrix.
        # The simplest idea is to transform the max_len x 1 absolute position matrix into
        # a max_len x d_model matrix and overwrite the initial position encoding matrix.
        # That transformation needs a 1 x d_model transformation matrix, div_term.
        # Besides the shape, we also want div_term to scale the natural-number positions
        # down to sufficiently small values, which helps convergence during gradient descent.
        # We build div_term with arange. Note that we do not create a 1 x d_model matrix
        # as expected: the step of 2 yields only a 1 x d_model/2 matrix.
        # Why half? It is not really half of the matrix. Think of it as being used twice:
        # the first use is passed through a sine wave and the second through a cosine wave,
        # and the two results fill the even and odd columns of the position encoding
        # matrix respectively, forming the final matrix.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # pe is still a two-dimensional matrix. To add it to the embedding output
        # (a three-dimensional tensor) it needs one more dimension, so we unsqueeze it.
        pe = pe.unsqueeze(0)

        # Finally, register pe as a buffer of the model. A buffer is a tensor that helps
        # the model but is neither a hyperparameter nor a parameter, so it is not updated
        # by the optimizer. Once registered, it is saved and reloaded together with the
        # model structure and parameters.
        self.register_buffer('pe', pe)

    def forward(self, x):
        """x is the embedded representation of the words in the text sequence."""
        # Before adding, adapt pe to the input: slice its second dimension (the max_len
        # dimension) down to the sentence length of x, i.e. x.size(1).
        # The default max_len of 5000 is far longer than any real sentence, so the slice
        # is needed to match the input tensor.
        # pe is wrapped in a Variable to match x, but it needs no gradient,
        # so requires_grad is set to False.
        x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
        # Finally, apply self.dropout and return the result
        return self.dropout(x)
```
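
The same computation can be checked without PyTorch. The sketch below rebuilds a small position encoding table in plain Python; the function name `positional_encoding` and the toy sizes are assumptions for illustration. It also shows that the exponential used for `div_term` is simply a power of 10000.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table: sine in even columns, cosine in odd columns."""
    # One scaling factor per pair of columns, mirroring div_term in the class above
    div_term = [math.exp(k * -(math.log(10000.0) / d_model))
                for k in range(0, d_model, 2)]
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for j, scale in enumerate(div_term):
            pe[pos][2 * j] = math.sin(pos * scale)
            pe[pos][2 * j + 1] = math.cos(pos * scale)
    return pe

pe_table = positional_encoding(max_len=60, d_model=8)

# Position 0 encodes to sin(0) = 0 in even columns and cos(0) = 1 in odd columns
print(pe_table[0])

# exp(-k * ln(10000) / d_model) equals 10000 ** (-k / d_model)
k, d_model = 4, 8
print(math.exp(k * -(math.log(10000.0) / d_model)), 10000 ** (-k / d_model))
```

Computing the factor through `exp` and `log` rather than a direct power is a common choice in implementations; both forms give the same scaling.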

nn.Dropout Demo:

```python
>>> m = nn.Dropout(p=0.2)
>>> input = torch.randn(4, 5)
>>> output = m(input)
>>> output
Variable containing:
 0.0000 -0.5856 -1.4094  0.0000 -1.0290
 2.0591 -1.3400 -1.7247 -0.9885  0.1286
 0.5099  1.3715  0.0000  2.2079 -0.5497
-0.0000 -0.7839 -1.2434 -0.1222  1.2815
[torch.FloatTensor of size 4x5]
```
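
In training mode, `nn.Dropout` zeroes each element with probability `p` and scales the surviving elements by 1/(1-p), so the zeros in the output above sit next to rescaled values. Below is a plain-Python sketch of that behaviour, using a hypothetical `dropout` helper and a fixed seed for repeatability.

```python
import random

def dropout(values, p, rng):
    """Zero each value with probability p and scale the rest by 1 / (1 - p)."""
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)
values = [1.0] * 10
dropped = dropout(values, p=0.2, rng=rng)
print(dropped)  # every entry is either 0.0 or 1.25
```

The rescaling keeps the expected value of each element unchanged, so nothing needs to be adjusted when dropout is switched off at evaluation time.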

torch.unsqueeze Demo:

```python
>>> x = torch.tensor([1, 2, 3, 4])
>>> torch.unsqueeze(x, 0)
tensor([[ 1,  2,  3,  4]])
>>> torch.unsqueeze(x, 1)
tensor([[ 1],
        [ 2],
        [ 3],
        [ 4]])
```

Instantiation parameters:

```python
# The word embedding dimension is 512
d_model = 512

# The dropout rate (fraction of elements set to 0) is 0.1
dropout = 0.1

# Maximum sentence length
max_len = 60
```

Input parameters:

```python
# The input x is the output tensor of the Embedding layer, with shape 2 x 4 x 512
x = embr
```

```text
Variable containing:
( 0 ,.,.) =
  35.9321   3.2582 -17.7301  ...    3.4109  13.8832  39.0272
   8.5410  -3.5790 -12.0460  ...   40.1880  36.6009  34.7141
 -17.0650  -1.8705 -20.1807  ...  -12.5556 -34.0739  35.6536
  20.6105   4.4314  14.9912  ...   -0.1342  -9.9270  28.6771

( 1 ,.,.) =
  27.7016  16.7183  46.6900  ...   17.9840  17.2525  -3.9709
   3.0645  -5.5105  10.8802  ...  -13.0069  30.8834 -38.3209
  33.1378 -32.1435  -3.9369  ...   15.6094 -29.7063  40.1361
 -31.5056   3.3648   1.4726  ...    2.8047  -9.6514 -23.4909
[torch.FloatTensor of size 2x4x512]
```

Call:

```python
pe = PositionalEncoding(d_model, dropout, max_len)
pe_result = pe(x)
print("pe_result:", pe_result)
```

Output:

```text
pe_result: Variable containing:
( 0 ,.,.) =
 -19.7050   0.0000   0.0000  ...  -11.7557  -0.0000  23.4553
  -1.4668 -62.2510  -2.4012  ...   66.5860 -24.4578 -37.7469
   9.8642 -41.6497 -11.4968  ...  -21.1293 -42.0945  50.7943
   0.0000  34.1785 -33.0712  ...   48.5520   3.2540  54.1348

( 1 ,.,.) =
   7.7598 -21.0359  15.0595  ...  -35.6061  -0.0000   4.1772
 -38.7230   8.6578  34.2935  ...  -43.3556  26.6052   4.3084
  24.6962  37.3626 -26.9271  ...   49.8989   0.0000  44.9158
 -28.8435 -48.5963  -0.9892  ...  -52.5447  -4.1475  -3.0450
[torch.FloatTensor of size 2x4x512]
```

Plot how individual features of the word vector vary with position:

```python
import matplotlib.pyplot as plt
import numpy as np

# Create a canvas of size 15 x 5
plt.figure(figsize=(15, 5))

# Instantiate PositionalEncoding to get the pe object, with d_model=20 and dropout=0
pe = PositionalEncoding(20, 0)

# Pass an all-zero tensor wrapped in a Variable to pe, which runs the forward function.
# Because the input is all zeros, the output is exactly the position encoding tensor
y = pe(Variable(torch.zeros(1, 100, 20)))

# Define the axes. The x-axis covers the 100 positions; the y-axis is the value of a
# given feature dimension of the word vector at each position.
# There are 20 dimensions in total; we only look at dimensions 4, 5, 6 and 7
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())

# Add a legend naming the dimensions
plt.legend(["dim %d"%p for p in [4,5,6,7]])
```

Output: (figure: values of dimensions 4 to 7 of the position encoding, plotted against positions 0 to 99)

Analysis:

- Each colored curve shows how one feature of a word vector varies with the word's position
- This ensures that the embedding vector of the same word changes with its position
- Sine and cosine values lie between -1 and 1, which keeps the position encoding values bounded and helps gradients to be computed quickly
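
One way to see why the plotted curves differ is to compute the wavelength of each dimension. Under the sinusoidal encoding, dimensions 2i and 2i+1 share the angular frequency 10000^(-2i/d_model), so their wavelength is 2π · 10000^(2i/d_model). The sketch below evaluates this for dimensions 4 to 7 with d_model = 20; the helper name `wavelength` is an assumption for illustration.

```python
import math

def wavelength(dim, d_model):
    """Wavelength (in positions) of one position-encoding dimension."""
    pair_index = dim // 2  # dimensions 2i and 2i+1 share a frequency
    frequency = 10000.0 ** (-2.0 * pair_index / d_model)
    return 2.0 * math.pi / frequency

d_model = 20
for dim in [4, 5, 6, 7]:
    print("dim %d: wavelength %.1f positions" % (dim, wavelength(dim, d_model)))
```

Dimensions 4 and 5 share one wavelength (the sine and cosine of the same angle), and dimensions 6 and 7 share a longer one, so higher dimensions oscillate more slowly across positions.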
