[NLP] Transformer architecture analysis

1. Understanding the Transformer architecture

1.1 Role of the Transformer model

  • The Transformer model, built on the seq2seq architecture, can handle typical NLP tasks such as machine translation and text generation. At the same time, it can be used to build a pre-trained language model for transfer learning across different tasks.

Note:

In the architecture analysis that follows, we assume the Transformer model is used to translate text from one language into another, so much of the naming follows NLP conventions. For example, the Embedding layer will be called the text embedding layer, the tensor it produces will be called the word embedding tensor, and its last dimension will be called the word vector, and so on.

1.2 Overall Transformer architecture

The overall architecture of Transformer can be divided into four parts:

  • Input part
  • Output part
  • Encoder part
  • Decoder part

The input part contains:

  • Source text embedding layer and its position encoder
  • Target text embedding layer and its position encoder


The output part contains:

  • Linear layer
  • softmax layer


Encoder part:

  • It is composed of N stacked encoder layers
  • Each encoder layer consists of two sublayer connection structures
  • The first sublayer connection structure wraps a multi-head self-attention sublayer with a normalization layer and a residual connection
  • The second sublayer connection structure wraps a feed-forward fully connected sublayer with a normalization layer and a residual connection


Decoder part:

  • It is composed of N stacked decoder layers
  • Each decoder layer consists of three sublayer connection structures
  • The first sublayer connection structure wraps a (masked) multi-head self-attention sublayer with a normalization layer and a residual connection
  • The second sublayer connection structure wraps a multi-head encoder-decoder attention sublayer with a normalization layer and a residual connection
  • The third sublayer connection structure wraps a feed-forward fully connected sublayer with a normalization layer and a residual connection
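
Before walking through each part, the sketch below shows only the data flow among these four parts. The function and argument names (transformer_forward, src_embed, tgt_embed, encoder, decoder, generator) are placeholders for the modules implemented step by step in this article, not the actual implementation.

# A minimal data-flow sketch of the four parts; every callable here is a placeholder
# standing in for a module built later (attention masks are omitted in this sketch).
def transformer_forward(src, tgt, src_embed, tgt_embed, encoder, decoder, generator):
    # Input part: embed the source / target token indices and add positional encodings
    memory = encoder(src_embed(src))            # Encoder part: a stack of N encoder layers
    output = decoder(tgt_embed(tgt), memory)    # Decoder part: attends to the encoder output (memory)
    return generator(output)                    # Output part: linear layer followed by softmax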

2. Implementation of the input part

The input part contains:

  • Source text embedding layer and its position encoder
  • Target text embedding layer and its position encoder

2.1 Role of the text embedding layer

Both the source text embedding and the target text embedding convert the numerical (token index) representation of words in the text into vector representations, with the hope of capturing the relationships between words in this high-dimensional space.

Installing PyTorch 0.3.0 and its prerequisite toolkits:

# The toolkits installed with pip include pytorch 0.3.0, numpy, matplotlib and seaborn
pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl numpy matplotlib seaborn

# macOS installation, requires Python version <= 3.6
pip install torch==0.3.0.post4 numpy matplotlib seaborn

Code analysis of the text embedding layer:

# Import the prerequisite toolkits
import torch

# torch.nn contains predefined network layers that the framework developers have already implemented,
# such as convolution layers, LSTM layers and embedding layers, so we don't have to reinvent the wheel
import torch.nn as nn

# Mathematical computation toolkit
import math

# Variable from torch.autograd wraps tensors for automatic differentiation (required in PyTorch 0.3.x)
from torch.autograd import Variable

# Define the Embeddings class to implement the text embedding layer; the name covers two identical embedding layers (for source and target text), which share parameters.
# This class inherits nn.Module, so it has the functionality of a standard layer. You can also treat it as a template: every layer we implement ourselves will be written this way
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        """Class initialization function with two parameters: d_model, the dimension of the word embeddings, and vocab, the size of the vocabulary."""
        # Use super to call the nn.Module initializer; every layer we implement ourselves starts this way
        super(Embeddings, self).__init__()
        # Then call the predefined nn.Embedding layer to obtain the word-embedding lookup object self.lut
        self.lut = nn.Embedding(vocab, d_model)
        # Finally, store d_model on the instance
        self.d_model = d_model

    def forward(self, x):
        """Forward propagation logic of this layer; every layer defines such a function.
           It is called automatically when an instance of this class is called with arguments.
           Parameter x: since the Embedding layer is the first layer of the model, x represents the tensor of token indices obtained by mapping the input text through the vocabulary"""

        # Pass x through self.lut and multiply the result by the square root of self.d_model (a scaling factor used in the original paper) before returning it
        return self.lut(x) * math.sqrt(self.d_model)

nn.Embedding Demo:

>>> embedding = nn.Embedding(10, 3)
>>> input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
>>> embedding(input)
tensor([[[-0.0251, -1.6902,  0.7172],
         [-0.6431,  0.0748,  0.6969],
         [ 1.4970,  1.3448, -0.9685],
         [-0.3677, -2.7265, -0.1685]],

        [[ 1.4970,  1.3448, -0.9685],
         [ 0.4362, -0.4004,  0.9400],
         [-0.6431,  0.0748,  0.6969],
         [ 0.9124, -2.3616,  1.1151]]])


>>> embedding = nn.Embedding(10, 3, padding_idx=0)
>>> input = torch.LongTensor([[0,2,0,5]])
>>> embedding(input)
tensor([[[ 0.0000,  0.0000,  0.0000],
         [ 0.1535, -2.0309,  0.9315],
         [ 0.0000,  0.0000,  0.0000],
         [-0.1655,  0.9897,  0.0635]]])

Instantiation parameters:

# The word embedding dimension is 512
d_model = 512

# The size of the vocabulary is 1000
vocab = 1000

Input parameters:

# The input x is a long-integer tensor wrapped in a Variable, with shape 2 x 4
x = Variable(torch.LongTensor([[100,2,421,508],[491,998,1,221]]))

Call:

emb = Embeddings(d_model, vocab)
embr = emb(x)
print("embr:", embr)

Output effect:

embr: Variable containing:
( 0 ,.,.) = 
  35.9321   3.2582 -17.7301  ...    3.4109  13.8832  39.0272
   8.5410  -3.5790 -12.0460  ...   40.1880  36.6009  34.7141
 -17.0650  -1.8705 -20.1807  ...  -12.5556 -34.0739  35.6536
  20.6105   4.4314  14.9912  ...   -0.1342  -9.9270  28.6771

( 1 ,.,.) = 
  27.7016  16.7183  46.6900  ...   17.9840  17.2525  -3.9709
   3.0645  -5.5105  10.8802  ...  -13.0069  30.8834 -38.3209
  33.1378 -32.1435  -3.9369  ...   15.6094 -29.7063  40.1361
 -31.5056   3.3648   1.4726  ...    2.8047  -9.6514 -23.4909
[torch.FloatTensor of size 2x4x512]

2.2 Role of the positional encoder

Because the Transformer's encoder structure does not itself process word-order information, a positional encoder must be added after the Embedding layer. It injects the information that different word positions may carry different semantics into the word embedding tensor, making up for the missing position information.
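
For reference, the code below implements the sinusoidal scheme from the original Transformer paper, where pos is the word position and i indexes the embedding dimensions:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

The div_term computed in the code is 1 / 10000^(2i / d_model), written as exp(-(2i) * log(10000) / d_model) for numerical stability.

Code analysis of the positional encoder: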

# Define the positional encoder class; we treat it as a layer as well, so it inherits nn.Module
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        """Initialization function of position encoder class, There are three parameters, namely d_model: Word embedding dimension, 
           dropout: Set 0 ratio, max_len: Maximum length of each sentence"""
        super(PositionalEncoding, self).__init__()

        # Instantiate the predefined nn.Dropout layer with rate p=dropout to obtain the object self.dropout
        self.dropout = nn.Dropout(p=dropout)

        # Initialize the positional encoding matrix as a zero matrix of size max_len x d_model
        pe = torch.zeros(max_len, d_model)

        # Initialize an absolute position matrix; here the absolute position of a word is represented by its index.
        # So we first use torch.arange to obtain a vector of consecutive natural numbers, then use the unsqueeze method to add a dimension and turn the vector into a matrix.
        # Since the argument 1 specifies the dimension along which to expand, the vector becomes a max_len x 1 matrix.
        position = torch.arange(0, max_len).unsqueeze(1)

        # With the absolute position matrix initialized, the next step is to consider how to inject this position information into the positional encoding matrix.
        # The simplest idea is to transform the max_len x 1 absolute position matrix into a max_len x d_model matrix and then use it to fill the initial positional encoding matrix.
        # To perform this matrix transformation we need a transformation matrix div_term of shape 1 x d_model. Besides its shape, we also require
        # that it scale the natural-number absolute positions down to sufficiently small values, which helps gradient descent converge faster later on. With that, we can initialize the transformation matrix.
        # First we use torch.arange to obtain a vector of natural numbers, but careful readers will notice that we did not initialize a 1 x d_model matrix as expected;
        # instead, stepping by 2, we only initialize a 1 x d_model/2 matrix. Why half? It is not really half of one matrix:
        # think of it as initializing two transformations, each treated differently: the first is laid out on a sine wave, the second on a cosine wave,
        # and the two are filled into the even and odd columns of the positional encoding matrix respectively to form the final encoding.
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # We now have the positional encoding matrix pe, which is still a two-dimensional matrix. To add it to the output of the embedding layer (a three-dimensional tensor),
        # we must add a leading dimension, so we use unsqueeze to expand it.
        pe = pe.unsqueeze(0)

        # Finally, register pe as a buffer of the module. What is a buffer?
        # It is something we consider helpful to the model, but it is neither a hyperparameter nor a trainable parameter of the model structure, so it does not need to be updated by the optimization steps.
        # Once registered, it is saved and reloaded together with the model structure and parameters.
        self.register_buffer('pe', pe)

    def forward(self, x):
        """forward The arguments to the function are x, Embedded representation of words representing text sequences"""
        # Before the addition, we adapt pe to the input: slice the second dimension of this three-dimensional tensor, i.e. the sentence-length dimension, down to x.size(1), the length of the input x,
        # because the default max_len of 5000 is far larger than any realistic sentence, so pe must be cropped to match the input tensor.
        # Finally, wrap the slice in a Variable so it matches the style of x; since the positional encoding needs no gradient, requires_grad is set to False.
        x = x + Variable(self.pe[:, :x.size(1)], 
                         requires_grad=False)
        # Finally, apply the self.dropout object to randomly zero out ('discard') part of the values and return the result
        return self.dropout(x)
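
A quick check of what register_buffer does (a minimal sketch, assuming the PositionalEncoding class defined above): the buffer is part of the saved state but contributes no trainable parameters.

# pe appears in state_dict (so it is saved and reloaded with the model),
# but not in parameters() (so the optimizer never updates it)
pos_enc = PositionalEncoding(512, 0.1)
print(list(pos_enc.state_dict().keys()))   # expected: ['pe']
print(len(list(pos_enc.parameters())))     # expected: 0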

nn.Dropout Demo:

>>> m = nn.Dropout(p=0.2)
>>> input = torch.randn(4, 5)
>>> output = m(input)
>>> output
Variable containing:
 0.0000 -0.5856 -1.4094  0.0000 -1.0290
 2.0591 -1.3400 -1.7247 -0.9885  0.1286
 0.5099  1.3715  0.0000  2.2079 -0.5497
-0.0000 -0.7839 -1.2434 -0.1222  1.2815
[torch.FloatTensor of size 4x5]

torch.unsqueeze Demo:

>>> x = torch.tensor([1, 2, 3, 4])
>>> torch.unsqueeze(x, 0)
tensor([[ 1,  2,  3,  4]])
>>> torch.unsqueeze(x, 1)
tensor([[ 1],
        [ 2],
        [ 3],
        [ 4]])

Instantiation parameters:

# The word embedding dimension is 512
d_model = 512

# Set the dropout rate to 0.1
dropout = 0.1

# Maximum sentence length
max_len = 60

Input parameters:

# The input x is the output tensor of the Embedding layer, with shape 2 x 4 x 512
x = embr
Variable containing:
( 0 ,.,.) = 
  35.9321   3.2582 -17.7301  ...    3.4109  13.8832  39.0272
   8.5410  -3.5790 -12.0460  ...   40.1880  36.6009  34.7141
 -17.0650  -1.8705 -20.1807  ...  -12.5556 -34.0739  35.6536
  20.6105   4.4314  14.9912  ...   -0.1342  -9.9270  28.6771

( 1 ,.,.) = 
  27.7016  16.7183  46.6900  ...   17.9840  17.2525  -3.9709
   3.0645  -5.5105  10.8802  ...  -13.0069  30.8834 -38.3209
  33.1378 -32.1435  -3.9369  ...   15.6094 -29.7063  40.1361
 -31.5056   3.3648   1.4726  ...    2.8047  -9.6514 -23.4909
[torch.FloatTensor of size 2x4x512]

Call:

pe = PositionalEncoding(d_model, dropout, max_len)
pe_result = pe(x)
print("pe_result:", pe_result)

Output effect:

pe_result: Variable containing:
( 0 ,.,.) = 
 -19.7050   0.0000   0.0000  ...  -11.7557  -0.0000  23.4553
  -1.4668 -62.2510  -2.4012  ...   66.5860 -24.4578 -37.7469
   9.8642 -41.6497 -11.4968  ...  -21.1293 -42.0945  50.7943
   0.0000  34.1785 -33.0712  ...   48.5520   3.2540  54.1348

( 1 ,.,.) = 
   7.7598 -21.0359  15.0595  ...  -35.6061  -0.0000   4.1772
 -38.7230   8.6578  34.2935  ...  -43.3556  26.6052   4.3084
  24.6962  37.3626 -26.9271  ...   49.8989   0.0000  44.9158
 -28.8435 -48.5963  -0.9892  ...  -52.5447  -4.1475  -3.0450
[torch.FloatTensor of size 2x4x512]

Plot the curves of selected word-embedding features at different positions:

import numpy as np
import matplotlib.pyplot as plt

# Create a canvas of size 15 x 5
plt.figure(figsize=(15, 5))

# Instantiate the PositionalEncoding class with d_model=20 and dropout=0 to get the pe object
pe = PositionalEncoding(20, 0)

# Pass a Variable-wrapped tensor of zeros through pe, which directly executes its forward function;
# since every value in the input tensor is 0, the output is simply the (dropout-processed) positional encoding tensor
y = pe(Variable(torch.zeros(1, 100, 20)))

# Define the axes: the x axis is the position, from 0 to 99; the y axis is the value of a feature dimension at each position.
# There are 20 dimensions in total; here we only look at dimensions 4, 5, 6 and 7
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())

# Add a legend describing the plotted dimensions
plt.legend(["dim %d"%p for p in [4,5,6,7]])

# Display the figure (needed when running as a script)
plt.show()

Output effect:

(Figure: sinusoidal curves of positional-encoding dimensions 4-7 over positions 0-99)

Effect analysis:

  • Each colored curve shows the values taken by one embedding feature (dimension) at different positions
  • This guarantees that the embedding of the same word changes depending on its position in the sequence
  • The sine and cosine waves range between -1 and 1, which keeps the injected values bounded and helps gradients be computed quickly (see the quick check after this list)
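
As a small sanity check on the last point (a sketch reusing the pe buffer registered above, with dropout set to 0), the stored encoding values stay within the range [-1, 1]:

# The registered buffer holds the raw sin/cos values, so its extremes
# should be approximately 1 and -1
check = PositionalEncoding(20, 0)
print(check.pe.max(), check.pe.min())   # expected: roughly 1 and -1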

Keep going, thanks for reading, and keep striving!

Tags: Deep Learning NLP Transformer
