Step by step to understand the python Chatbot tutorial code - define the model


I'm a programming novice. I registered this account long ago, but I don't write code for a living; I started teaching myself Python in order to learn AI.
I usually type out code from books without really understanding it. Now that I want to study chatbots, I've found that my coding skills need strengthening, so I started this series to record the process of working through code line by line. It doesn't start from zero, of course; it only writes up the parts I didn't understand. It can also serve as reference material later.

Finally, I'd like to stress again that I have never studied programming systematically. I'm writing this series to push myself. Any advice is welcome!

Useful tools

Websites that can visualize code

Source of code



Step by step to understand the python Chatbot tutorial code (I) - load and preprocess data
Step by step to understand the python Chatbot tutorial code (II) - data processing
Step by step to understand the python Chatbot tutorial code (III) - create a dictionary
Step by step to understand the python Chatbot tutorial code (IV) - prepare data for the model
Step by step to understand the python Chatbot tutorial code (V) - define the model

My head is spinning

I sprained my foot playing badminton with my kids. I used a wheelchair for the first time and found that it is not easy to steer, quite unlike in the movies. Fortunately nothing was broken, but after five days in bed I got an early taste of the "fun" of old age. I clearly still need to exercise properly, and suitable shoes and a warm-up are necessary too.

Meanwhile, the complexity of the code has doubled. This series is coming to an end.

Code and description: define the models

Watching Li Mu's video will help you understand the following content

Seq2Seq model

The core of our chatbot is a sequence-to-sequence (seq2seq) model. Its input is a variable-length sequence and its output is also a variable-length sequence, and the two lengths need not be the same. We generally use an RNN to handle variable-length sequences, and Sutskever et al. found that this kind of problem can be solved with two RNNs. Tasks whose input and output are both variable in length and differ from each other, such as question answering, machine translation and automatic summarization, can all be handled with a seq2seq model.

The first RNN is called the Encoder. It encodes the variable-length input sequence into a fixed-length context vector, which we can think of as containing the semantics of the input sentence. The second RNN is called the Decoder. Its initial hidden state is the context vector output by the Encoder, and its first input is a special token marking the beginning of the sentence. The RNN computes the output of the first time step; the output and hidden state of the first time step are then used to compute the output and new hidden state of the second time step, and so on, until a special token marking the end of the sentence is output or the length exceeds a threshold. The Seq2Seq model is shown in the figure below.
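
The decoding loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the token IDs SOS_TOKEN and EOS_TOKEN, the length limit and the toy_step function are all made up here, and toy_step merely stands in for a real decoder RNN.

```python
# Assumed token IDs and length limit (hypothetical values for illustration).
SOS_TOKEN = 1   # start-of-sentence token
EOS_TOKEN = 2   # end-of-sentence token
MAX_LENGTH = 10


def toy_step(prev_token, hidden):
    """Hypothetical decoder step: returns (next_token, new_hidden).

    A real decoder would run an RNN here; this toy ignores prev_token,
    emits tokens 3, 4, 5 and then the end-of-sentence token.
    """
    new_hidden = hidden + 1
    next_token = 2 + new_hidden if new_hidden < 4 else EOS_TOKEN
    return next_token, new_hidden


def greedy_decode(context_vector, max_length=MAX_LENGTH):
    hidden = context_vector      # decoder starts from the encoder's context
    token = SOS_TOKEN            # first input is the start token
    output = []
    for _ in range(max_length):  # stop when the length limit is reached...
        token, hidden = toy_step(token, hidden)
        if token == EOS_TOKEN:   # ...or when the end token is produced
            break
        output.append(token)
    return output


decoded = greedy_decode(context_vector=0)
print(decoded)
```

The two exit conditions of the loop correspond to the two stopping rules in the text: the end token, or the length threshold.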


The Encoder is an RNN that traverses every token (word) of the input. At each time step its input is the hidden state of the previous time step plus the current token, and it produces an output and a new hidden state. The new hidden state becomes the input hidden state of the next time step. There is an output at every time step, but for a seq2seq model usually only the hidden state of the last time step is kept, since it is considered to encode the semantics of the whole sentence. The attention mechanism used later, however, also uses the Encoder's output at every time step. After the Encoder has finished, its hidden state at the last time step is used as the initial hidden state of the Decoder.

A multi-layer Gated Recurrent Unit (GRU) or LSTM is usually used as the Encoder. A GRU is used here; see the 2014 paper by Cho et al.

Here, a bidirectional RNN is used, as shown in the figure below.

Note that before the RNN there is an Embedding layer that maps each word (an ID or a one-hot vector) to a dense continuous vector, which we can think of as encoding the word's semantics. In our model its size is set equal to the RNN's hidden state size (although it does not have to be). With Embedding, the model learns to encode similar words as similar (nearby) vectors.

Finally, to pass a padded batch to the RNN, we need the following two functions to pack and unpack it: torch.nn.utils.rnn.pack_padded_sequence and torch.nn.utils.rnn.pad_packed_sequence.
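
To get a feel for these two functions, here is a minimal round-trip sketch. The batch sizes and values are toy assumptions, not the tutorial's data.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy batch (assumed sizes): max_length=4, batch_size=3, feature size=2.
# With the default enforce_sorted=True, sequences must be sorted by
# decreasing length.
lengths = torch.tensor([4, 2, 1])
padded = torch.zeros(4, 3, 2)
for b, n in enumerate(lengths.tolist()):
    padded[:n, b] = b + 1.0   # real time steps; everything else stays 0 (padding)

packed = pack_padded_sequence(padded, lengths)
# Only the 4 + 2 + 1 = 7 real time steps are stored in the packed data.
print(packed.data.shape)

unpacked, out_lengths = pad_packed_sequence(packed)
print(unpacked.shape)   # back to (max_length, batch_size, features)
print(out_lengths)      # the same lengths that were passed in
```

The packed object drops the padding positions, so the GRU never computes on them; unpacking restores the padded layout with zeros.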

Computation graph:

  1. Convert the word IDs into vectors through the Embedding layer.

  2. Pack the padded data.

  3. Pass it into the GRU for the forward computation.

  4. Unpack the result.

  5. Sum the output vectors of the two GRU directions.

  6. Return the outputs (at all time steps) and the hidden state at the last time step.


Inputs:

input_seq: a batch of input sentences; shape is (max_length, batch_size)

input_lengths: a list of length batch_size giving the actual length of each sentence

hidden: the initial hidden state (usually zero); shape is (n_layers x num_directions, batch_size, hidden_size)

Outputs:

outputs: the output vectors of the last GRU layer (the two directions are added together); shape is (max_length, batch_size, hidden_size)

hidden: the hidden state at the last time step; shape is (n_layers x num_directions, batch_size, hidden_size)

The encoder RNN code is as follows:

class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        # Initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        #   because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden


Initialize the GRU: self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout), bidirectional=True)
Both the input_size and hidden_size parameters are set to hidden_size here, because we assume that the output size of the embedding layer is hidden_size.
If there is only one layer, no dropout is applied. Otherwise the GRU uses the dropout value that was passed in.


The input has shape (max_length, batch). After the Embedding layer it becomes (max_length, batch, hidden_size).
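
A quick way to confirm this shape change is to run a small embedding layer on random word IDs. The vocabulary size of 20 and hidden size of 8 below are assumed toy values.

```python
import torch
import torch.nn as nn

# Assumed toy sizes: a vocabulary of 20 words and hidden_size of 8.
embedding = nn.Embedding(num_embeddings=20, embedding_dim=8)

# A (max_length=5, batch=3) tensor of word IDs.
input_seq = torch.randint(0, 20, (5, 3))

# Each ID is replaced by its 8-dimensional vector, adding a third dimension.
embedded = embedding(input_seq)
print(embedded.shape)
```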

Packing: because the RNN (GRU) needs to know the actual lengths, PyTorch provides the function pack_padded_sequence, which packs the input tensor and the lengths into a PackedSequence object for convenient use.

The forward computation through the GRU takes the input and the hidden state.
If the input is a Tensor of shape (max_length, batch, hidden_size),
then the output has shape (max_length, batch, hidden_size*num_directions).
Within the third dimension, num_directions comes before hidden_size, so we can:

  • Use outputs.view(seq_len, batch, num_directions, hidden_size) to get a 4-dimensional tensor.

  • Its third dimension is then the direction and its fourth is the hidden state.
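
The layout of the bidirectional output can be checked with a small experiment. This is a sketch with assumed toy sizes; it feeds a plain tensor (not a PackedSequence) to a one-layer bidirectional GRU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, hidden_size = 5, 3, 4   # assumed toy sizes
gru = nn.GRU(hidden_size, hidden_size, num_layers=1, bidirectional=True)

x = torch.randn(seq_len, batch, hidden_size)
outputs, hidden = gru(x)
print(outputs.shape)   # last dimension is hidden_size * 2

# Split the last dimension into (num_directions, hidden_size).
by_direction = outputs.view(seq_len, batch, 2, hidden_size)
forward_out = by_direction[:, :, 0]
backward_out = by_direction[:, :, 1]

# The same two halves can be obtained by slicing, as the encoder code does.
same_forward = torch.equal(forward_out, outputs[:, :, :hidden_size])
same_backward = torch.equal(backward_out, outputs[:, :, hidden_size:])
print(same_forward, same_backward)

# Summing over the direction axis matches adding the two slices.
summed = by_direction.sum(dim=2)
print(summed.shape)    # back to hidden_size in the last dimension
```

Viewing and slicing are two ways of reading the same memory layout; the encoder code uses slicing.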

If the input is a PackedSequence object, the output is also a PackedSequence object. We need the function pad_packed_sequence to turn it into a tensor of shape (max_length, batch, hidden*num_directions) plus a list giving the lengths of the outputs. This list is exactly the same as input_lengths, so we usually don't need it.

outputs, hidden = self.gru(packed, hidden) returns outputs as a PackedSequence.

outputs, _ = torch.nn.utils.rnn.pad_packed_sequence(outputs) unpacks it into a tensor of shape (max_length, batch, hidden*num_directions).

We need to add up the vectors of the two directions in outputs.
In the third dimension of outputs, the hidden_size forward results come first, followed by the hidden_size backward results.
So outputs[:, :, :self.hidden_size] is the forward result and outputs[:, :, self.hidden_size:] is the backward result.
Note that if bidirectional were False, the third dimension of outputs would only have size hidden_size,
so outputs[:, :, self.hidden_size:] would be empty and there would be nothing to add; this line relies on the GRU being bidirectional, as it is here.

return outputs, hidden returns the final output and the hidden state at the last moment.


The Decoder is also an RNN, and it outputs one word at a time. Its input at each time step is the hidden state and the output of the previous time step. The initial hidden state is the Encoder's last hidden state, and the first input is the special start-of-sentence token. The RNN then computes a new hidden state and outputs the first word, then uses the new hidden state and the first word to compute the second word, and so on, until the end-of-sentence token is produced and the output ends. The problem with a plain RNN Decoder is that it depends only on the Encoder's hidden state at the last time step. Although this hidden state (the context vector) can in theory encode the semantics of the input sentence, this is difficult in practice, so the results are poor when the input sentence is very long.

To solve this problem, the paper by Bahdanau et al. proposed the attention mechanism. When the Decoder computes time step t, in addition to the hidden state of time t-1 and the input of the current time step, the attention mechanism lets it also refer to all time steps of the Encoder. Take machine translation as an example: when we translate the t-th word of a sentence, we focus our attention on a particular word of the input.

Of course, the usual attention is a soft attention. Suppose the input has five words; the attention may then be a probability distribution such as (0.6, 0.1, 0.1, 0.1, 0.1), meaning that the first input word currently receives the most attention. We have also computed the Encoder's output vector at each time step; if these are $y_1, \ldots, y_5$, we can weight them by the attention probabilities to get the context vector at the current time step: $0.6y_1 + 0.1y_2 + \ldots + 0.1y_5$.
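
The weighted sum is easy to verify by hand. The sketch below uses the attention probabilities from the text and five made-up 3-dimensional output vectors (the vectors are an assumption for illustration only).

```python
# Attention probabilities over the five input words (the numbers from the text).
attn = [0.6, 0.1, 0.1, 0.1, 0.1]

# Assumed toy encoder outputs y_1..y_5, each a 3-dimensional vector.
y = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]

# context = 0.6*y_1 + 0.1*y_2 + ... + 0.1*y_5, computed per coordinate.
context = [sum(a * vec[i] for a, vec in zip(attn, y)) for i in range(3)]
print([round(c, 2) for c in context])  # [0.7, 0.3, 0.2]
```

Because y_1 has the largest weight, the context vector is dominated by its first coordinate.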

There are many ways to calculate attention. Here we introduce the method proposed in the paper by Luong et al. It uses the new hidden state computed by the GRU at the current time step to calculate the attention scores. First, a score function computes a similarity score between this hidden state and each output of the Encoder; the larger the score, the more attention should be paid to that word. Then the softmax function turns the scores into probabilities. Take machine translation as an example. At time t, $h_t$ denotes the new hidden state output by the GRU, which we can think of as the semantics that currently needs to be translated. We compute the scores of $h_t$ with $y_1, \ldots, y_n$; if the score of $h_t$ with $y_1$ is high, we can consider that the semantics of the word $x_1$ is mainly being translated at the moment. There are several ways to compute the score function, as shown in the following figure:

Here $h_t$ denotes the hidden state at time t. The first method computes the inner product of $h_t$ and $h_s$ directly: the larger the inner product, the more similar the two vectors are, and so the more attention that word deserves. The second method is similar but introduces a learnable matrix; we can think of it as first applying a linear transformation to $h_t$ and then computing the inner product with $h_s$. The third method concatenates them and uses a fully connected network to compute the score.
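
For reference, the three score functions can be written out as follows. This is how I understand them from Luong et al.; the symbols $W_a$ (a learnable matrix) and $v_a$ (a learnable vector) are notation assumed here, and $[h_t; h_s]$ denotes concatenation.

```latex
\[
\mathrm{score}(h_t, h_s) =
\begin{cases}
h_t^{\top} h_s & \text{(dot)} \\
h_t^{\top} W_a h_s & \text{(general)} \\
v_a^{\top} \tanh\!\left(W_a [h_t; h_s]\right) & \text{(concat)}
\end{cases}
\]
```

In the Attn class, self.attn plays the role of $W_a$ and self.v the role of $v_a$.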

Note that so far we described computing the inner product of $h_t$ with $y_1$, of $h_t$ with $y_2$, and so on, separately. For efficiency, however, the products of $h_t$ with $h_s = [y_1, y_2, \ldots, y_n]$ can be computed in one go. The calculation process is shown in the figure below.


The shape of the input hidden is (1, batch=64, hidden_size=500).
The shape of encoder_outputs is (input_lengths=10, batch=64, hidden_size=500).
The shape of hidden * encoder_outputs is (10, 64, 500), and the scores are then obtained by summing over the third dimension.


The inputs are the Decoder's hidden state at time t and the Encoder's outputs at all time steps, encoder_outputs.
The output is the attention probabilities, that is, a vector of length input_lengths whose entries sum to 1.

To calculate the attention scores: the shape of the input hidden is (1, batch=64, hidden_size=500), representing the hidden state of the batch at time t.
The shape of encoder_outputs is (input_lengths=10, batch=64, hidden_size=500).

attn_energies = attn_energies.t() transposes attn_energies from (max_length=10, batch=64) to (64, 10).

F.softmax(attn_energies, dim=1).unsqueeze(1) uses the softmax function to turn the scores into probabilities; the shape is still (64, 10), and unsqueeze(1) then makes it (64, 1, 10).

# Luong attention layer
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError(self.method, "is not an appropriate attention method.")
        self.hidden_size = hidden_size
        if self.method == 'general':
            self.attn = nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':
            self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = nn.Parameter(torch.FloatTensor(hidden_size))
    def dot_score(self, hidden, encoder_output):
        return torch.sum(hidden * encoder_output, dim=2)
    def general_score(self, hidden, encoder_output):
        energy = self.attn(encoder_output)
        return torch.sum(hidden * energy, dim=2)

    def concat_score(self, hidden, encoder_output):
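        # Repeat hidden along the time axis, concatenate it with encoder_output
        # on the feature axis, then apply the linear layer and tanh
        energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()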
        return torch.sum(self.v * energy, dim=2)

    def forward(self, hidden, encoder_outputs):
        # Calculate the attention weights (energies) based on the given method
        if self.method == 'general':
            attn_energies = self.general_score(hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score(hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score(hidden, encoder_outputs)

        # Transpose max_length and batch_size dimensions
        attn_energies = attn_energies.t()

        # Return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1)

The code above implements three ways of computing the score, dot, general and concat, which correspond to the three formulas above. Here we look at the simplest one, dot. The code has some comments, and only the dot_score function is hard to understand, so let's analyze it. First, the hidden argument of this function has shape (1, batch=64, hidden_size=500), and encoder_outputs has shape (input_lengths=10, batch=64, hidden_size=500).

How do we calculate the inner product of hidden with each of the 10 encoder output vectors? For simplicity, let's assume that batch is 1, so that the second (batch) dimension can be dropped; hidden is then (1, 500) and encoder_outputs is (10, 500). The inner product of two vectors multiplies their corresponding entries and adds up the products, but encoder_outputs holds 10 vectors of 500 dimensions. We could of course write a for loop, but that is very inefficient. Here is a small trick that uses broadcasting: hidden * encoder_outputs can be understood as copying hidden from (1, 500) to (10, 500) (the actual implementation does not really copy it) and then multiplying two (10, 500) matrices. Note that this is not matrix multiplication but the so-called Hadamard product, which simply multiplies the entries at corresponding positions.
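
A tiny numeric example makes the broadcasting trick concrete. The sizes here are toy assumptions: 3-dimensional vectors instead of 500, and 2 encoder outputs instead of 10.

```python
import torch

hidden = torch.tensor([[1.0, 2.0, 3.0]])            # shape (1, 3)
encoder_outputs = torch.tensor([[1.0, 0.0, 1.0],
                                [2.0, 2.0, 2.0]])   # shape (2, 3)

# Broadcasting stretches hidden from (1, 3) to (2, 3);
# * then multiplies entry by entry (Hadamard product).
product = hidden * encoder_outputs
print(product)   # rows: [1, 0, 3] and [2, 4, 6]

# Summing each row gives the inner product of hidden with each encoder vector.
scores = product.sum(dim=1)
print(scores)    # values 4 and 12

# The same result with an explicit (slow) loop over the encoder vectors.
loop_scores = torch.stack([torch.dot(hidden[0], e) for e in encoder_outputs])
```

The loop and the broadcast version give identical scores; the broadcast version simply does all the multiplications in one call.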

So hidden * encoder_outputs multiplies the hidden vector (500 numbers) entry by entry with each of the 10 vectors (500 numbers each) in encoder_outputs. The inner product also requires adding up those 500 products, so torch.sum(hidden * encoder_output, dim=2) sums them along the last dimension (dim=2) to obtain 10 scores. In reality we also have a batch dimension, so the resulting attn_energies is (10, 64). The forward function then transposes attn_energies to (64, 10) and uses softmax to turn the 10 scores into probabilities; the shape is still (64, 10). For convenience later on, unsqueeze(1) turns it into (64, 1, 10).
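
The shape changes described above can be traced step by step on random tensors. This is only a shape check using the sizes quoted in the text, not the trained model.

```python
import torch
import torch.nn.functional as F

max_length, batch, hidden_size = 10, 64, 500   # sizes quoted in the text
hidden = torch.randn(1, batch, hidden_size)
encoder_outputs = torch.randn(max_length, batch, hidden_size)

# Dot score: broadcast multiply, then sum over the feature dimension.
attn_energies = torch.sum(hidden * encoder_outputs, dim=2)
print(attn_energies.shape)   # (10, 64)

# Put the batch dimension first.
attn_energies = attn_energies.t()
print(attn_energies.shape)   # (64, 10)

# Normalize the 10 scores of each batch element, then add a middle dimension.
attn_weights = F.softmax(attn_energies, dim=1).unsqueeze(1)
print(attn_weights.shape)    # (64, 1, 10)
```

Each row of the (64, 10) matrix sums to 1 after the softmax, which is what makes it a probability distribution over the 10 input positions.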

With the attention submodule in place, we can implement the Decoder. The Encoder can feed a whole sequence into the GRU at once and get the output for the entire sequence. The Decoder's input at time t, however, is its output at time t-1, which is unknown until the computation for time t-1 has finished, so it can only process one time step at a time. Hence the input to the Encoder's GRU is (max_length, batch, hidden_size), while the input to the Decoder's GRU is (1, batch, hidden_size). In addition, the Decoder can only use earlier information, so it can only use a unidirectional (not bidirectional) GRU, whereas the Encoder's GRU is bidirectional. If both use the same hidden_size, the Decoder has half as many hidden units as the Encoder. How, then, can the Encoder's last hidden state be used as the Decoder's initial hidden state? Here the results of the two directions at each time step are added together, so the sizes match (see the bidirectional summation in the encoder code).

Computation graph:

  1. Feed the word ID into the Embedding layer.

  2. Use the unidirectional GRU to run the forward computation for one time step.

  3. Use the new hidden state to calculate the attention weights.

  4. Use the attention weights to get the context vector.

  5. Concatenate the context vector with the output of the GRU and pass it through a fully connected network, so that the output size is still hidden_size.

  6. Use a projection matrix to change the output from hidden_size to the dictionary size, then apply softmax to turn it into probabilities.

  7. Return the output and the new hidden state.


Inputs:

input_step: shape is (1, batch_size)

last_hidden: the hidden state of the previous time step; shape is (n_layers x num_directions, batch_size, hidden_size)

encoder_outputs: the outputs of the encoder; shape is (max_length, batch_size, hidden_size)

Outputs:

output: the probability of each word being output at the current time step; shape is (batch_size, voc.num_words)

hidden: the new hidden state; shape is (n_layers x num_directions, batch_size, hidden_size)

class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()
        # Keep for reference. attn_model is the name of the attention method
        # ('dot', 'general' or 'concat') that is passed to the Attn class below.
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        # Define the layers of the Decoder
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(attn_model, hidden_size)

    def forward(self, input_step, last_hidden, encoder_outputs):
        # Note: the decoder processes one time step (word) per call, because time t+1
        # can only be computed after time t has been computed.
        # input_step has shape (1, 64): 64 is the batch size and 1 is the current
        # input word ID (the output of the previous time step).
        # After the embedding layer the shape is (1, 64, 500); dropout leaves it unchanged.
        embedded = self.embedding(input_step)
        embedded = self.embedding_dropout(embedded)
        # Forward through the unidirectional GRU.
        # rnn_output has shape (1, 64, 500).
        # hidden has shape (n_layers, 64, 500); the decoder GRU is unidirectional,
        # so the first dimension is just the number of layers.
        rnn_output, hidden = self.gru(embedded, last_hidden)
        # Calculate attention weights from the current GRU output.
        # As analyzed earlier, attn_weights has shape (64, 1, 10).
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # encoder_outputs is (10, 64, 500);
        # encoder_outputs.transpose(0, 1) is (64, 10, 500).
        # bmm is batch matrix multiplication with the batch as the first dimension:
        # attn_weights is treated as 64 matrices of shape (1, 10) and
        # encoder_outputs.transpose(0, 1) as 64 matrices of shape (10, 500),
        # so the result, the "weighted sum" context vector, is (64, 1, 500).
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # Concatenate the weighted context vector and the GRU output (Luong eq. 5).
        # rnn_output changes from (1, 64, 500) to (64, 500)
        rnn_output = rnn_output.squeeze(0)
        # context changes from (64, 1, 500) to (64, 500)
        context = context.squeeze(1)
        # After concatenation the shape is (64, 1000)
        concat_input = torch.cat((rnn_output, context), 1)
        # self.concat is a linear layer mapping 1000 features to 500,
        # so self.concat(concat_input) is (64, 500).
        # tanh then squashes the values into (-1, 1); concat_output is (64, 500).
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict the next word (Luong eq. 6).
        # self.out maps 500 features to the dictionary size (7826 here).
        output = self.out(concat_output)
        # softmax turns the scores into the probability of each word at the current time step.
        output = F.softmax(output, dim=1)
        # Return the output and the new hidden state
        return output, hidden
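
The bmm step in the forward method is the part I found hardest to picture, so here is a small check in isolation. It is a sketch on random tensors with the shapes from the comments, not the trained model.

```python
import torch

batch, max_length, hidden_size = 64, 10, 500   # shapes from the comments above
attn_weights = torch.softmax(torch.randn(batch, 1, max_length), dim=2)  # (64, 1, 10)
encoder_outputs = torch.randn(max_length, batch, hidden_size)           # (10, 64, 500)

# 64 products of a (1, 10) matrix with a (10, 500) matrix.
context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
print(context.shape)   # (64, 1, 500)

# For one batch element, bmm is just the attention-weighted sum
# of that element's 10 encoder output vectors.
b = 0
manual = (attn_weights[b, 0].unsqueeze(1) * encoder_outputs[:, b]).sum(dim=0)
print(torch.allclose(context[b, 0], manual, atol=1e-4))
```

So the context vector is exactly the weighted sum from the soft attention discussion, computed for all 64 batch elements at once.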

Tags: Python Machine Learning AI Pytorch Deep Learning

Posted on Fri, 03 Dec 2021 14:50:14 -0500 by dbakker