Datawhale team learning NLP: sentiment analysis baseline improvements (learning notes)

This article contains my notes from the Datawhale September 2021 team learning on sentiment analysis.
Original learning document address:

For notes on the baseline itself, see:

The baseline leaves room for optimization in several places, for example:

  • Using pre-trained word vectors
  • Replacing the RNN with a bidirectional LSTM
  • Adding a regularization term, since the parameter count increases
  • Trying a different optimizer

Next, the baseline is modified along these lines.

1. Replacing the RNN with a bidirectional LSTM

Instantiating the model (the hyperparameter constants follow the names of the `__init__` arguments below):

INPUT_DIM = len(TEXT.vocab)
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
# UNK_IDX does not need to be passed in: the pad embedding is excluded from
# training, while the unk embedding is still learned

model = RNN(INPUT_DIM,
            EMBEDDING_DIM,
            HIDDEN_DIM,
            OUTPUT_DIM,
            N_LAYERS,
            BIDIRECTIONAL,
            DROPOUT,
            PAD_IDX)

Model definition

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,
                 bidirectional, dropout, pad_idx):
        super().__init__()

        # Embedding layer (word vectors); padding_idx keeps the pad vector out of training
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

        # RNN variant: a bidirectional LSTM
        self.rnn = nn.LSTM(embedding_dim,               # input_size
                           hidden_dim,                  # hidden_size
                           num_layers=n_layers,         # number of stacked layers
                           bidirectional=bidirectional, # whether it is bidirectional
                           dropout=dropout)             # dropout between LSTM layers

        # Linear output layer; the forward and backward final hidden states
        # are concatenated, hence hidden_dim * 2
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

        # Randomly drops neurons during training
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        # text: [sent len, batch size]

        embedded = self.dropout(self.embedding(text))
        # embedded: [sent len, batch size, emb dim]

        # pack the sequence so that the pad positions are skipped later
        # note: lengths must be on the CPU!
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'))

        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        # unpack the sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        # output: [sent len, batch size, hid dim * num directions]
        # the positions of padding tokens in output are zero tensors

        # hidden: [num layers * num directions, batch size, hid dim]
        # cell:   [num layers * num directions, batch size, hid dim]

        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:])
        # hidden states and apply dropout
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))

        return self.fc(hidden)
  • A walk-through of the forward pass

Forward input:

text, text_lengths = batch.text  # batch.text returns a tuple (numericalized tensor, length of each sentence)
model(text, text_lengths)

Shape of text: [sent len, batch size]
Example: text = [745, 64]: the batch has 64 sentences, and the longest sentence is 745 tokens

text goes through the embedding layer, whose weight matrix is [vocab size, emb dim], producing embedded of shape [sent len, batch size, emb dim]
Example: the embedding matrix is [25002, 100], so embedded = [745, 64, 100]

embedded is passed to pack_padded_sequence to get packed_embedded.
This step mainly keeps the pad tokens from taking part in the computation (and hence in parameter updates). Note that lengths must be on the CPU!

packed_embedded is fed into the LSTM, which detects that it is a PackedSequence and finally outputs packed_output, (hidden, cell)

To use output, packed_output first has to be unpacked into
output along with hidden and cell.
The dimension of output is [745, 64, 512]: 745 is the sentence length, 64 the batch size, and 512 because hidden size = 256 and the two directions of the bidirectional LSTM are concatenated, giving 512


The recorded sentence lengths for one batch:

tensor([745, 745, 744, 744, 743, 743, 742, 738, 738, 738, 738, 736, 734, 734,
        734, 734, 733, 732, 731, 731, 730, 730, 729, 728, 727, 727, 726, 726,
        725, 723, 723, 723, 722, 722, 721, 721, 720, 720, 719, 716, 716, 715,
        715, 715, 713, 712, 712, 711, 707, 707, 707, 706, 705, 704, 703, 702,
        702, 701, 701, 700, 699, 699, 699, 698])

Looking at the recorded lengths, you can see that the iterator sorts automatically, putting sentences of similar length together to reduce the number of pad tokens
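As a sanity check on why this sorting helps, here is a minimal pure-Python sketch (the lengths are made up, not taken from the real dataset): every sentence must be padded up to the longest sentence in its batch, so grouping similar lengths together shrinks the total padding.

```python
# Hypothetical sentence lengths; not taken from the real dataset
lengths = [745, 3, 740, 5, 738, 4, 742, 6]

def padding_cost(batches):
    """Total number of pad tokens needed: each sentence is padded
    up to the longest sentence in its batch."""
    return sum(max(batch) - n for batch in batches for n in batch)

arbitrary = [lengths[:4], lengths[4:]]                 # lengths mixed together
bucketed = [sorted(lengths)[:4], sorted(lengths)[4:]]  # similar lengths grouped

print(padding_cost(arbitrary), padding_cost(bucketed))  # 2965 21
```

Bucketing cuts the padding from 2965 pad tokens to 21 in this toy example, which is exactly the effect the iterator's automatic sorting is after.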


tensor([[-0.0019,  0.0004, -0.0030,  ..., -0.0097,  0.0420,  0.0041],
        [ 0.0034,  0.0361, -0.0009,  ..., -0.0292,  0.0479,  0.0052],
        [ 0.0123,  0.0347,  0.0076,  ..., -0.0316, -0.0182, -0.0500],
        [-0.0454,  0.0027, -0.0149,  ..., -0.1120,  0.0238, -0.0076],
        [-0.0103,  0.0637,  0.0412,  ..., -0.0712,  0.0208, -0.0349],
        [-0.0093,  0.0435,  0.0166,  ..., -0.0712,  0.0133, -0.0321]],


tensor([[ 0.0110,  0.0757, -0.0370,  ...,  0.0177, -0.0130, -0.0178],
        [-0.0145,  0.0666, -0.0171,  ..., -0.0171,  0.0181,  0.0005],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

From these two slices of output, it is clear that positions beyond a sentence's length are output as zeros
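This zero-filling behaviour can be reproduced in isolation. The following sketch uses made-up dimensions (a batch of 2 sequences with lengths 4 and 2, 3 input features, hidden size 5) and assumes PyTorch is installed; it shows that after pack and unpack, positions past a sequence's length come back as zeros.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy batch, sequence-first like the model: [max len, batch, features];
# lengths must be sorted descending (the default enforce_sorted=True)
lengths = torch.tensor([4, 2])
x = torch.randn(4, 2, 3)

packed = nn.utils.rnn.pack_padded_sequence(x, lengths.to('cpu'))
lstm = nn.LSTM(input_size=3, hidden_size=5)
packed_out, (hidden, cell) = lstm(packed)
out, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_out)

print(tuple(out.shape))               # (4, 2, 5)
# sentence 1 has length 2, so its output past step 2 is all zeros
print(bool((out[2:, 1] == 0).all()))  # True
```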

  • Next, let's look at what hidden looks like

hidden and cell have dimensions [4, 64, 256]

hidden is the sideways output of each layer (the final time step), while output is the upward output of the model (the top layer at every time step)

hidden: [num layers * num directions, batch size, hid dim]

cell: [num layers * num directions, batch size, hid dim]

Because the hyperparameters are N_LAYERS = 2 and bidirectional = True (a two-layer bidirectional LSTM, num directions = 2), the stacked tensor holds the hidden output of four layer-direction combinations

  • References:

The bidirectional LSTM model is actually two separate LSTM models whose respective outputs and hidden states are concatenated when used.

That article also mentions why so much of the data carries a T (transpose); this should be related to batch_first.

It also mentions why an LSTM needs the pack operation: unlike an RNN, an LSTM's output is still affected by a continuous run of pad (0) inputs, so the position where padding begins should be recorded and everything after it ignored in the computation. See the article above for details. (Why, then, does the plain RNN used by the baseline not perform the pack operation?)

  • Why does the concatenation take hidden[-2,:,:] and hidden[-1,:,:]?
    The figure below shows a bidirectional LSTM structure.

    The figure shows a w+1-layer structure, i.e. an LSTM with num_layers = 2.

    Reference:

For example, if we define a bidirectional LSTM with num_layers = 3, the size of the first dimension of h_n equals 6 (2 * 3). h_n[0] is the output of the last time step of the first layer's forward pass, h_n[1] the output of the last time step of the first layer's backward pass, h_n[2] the output of the last time step of the second layer's forward pass, h_n[3] the output of the last time step of the second layer's backward pass, and h_n[4] and h_n[5] the outputs of the last time step of the third layer's forward and backward passes respectively.

Therefore each layer of the model is a bidirectional LSTM, rather than a stack of three forward layers next to three backward layers
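This ordering can be checked directly on a toy bidirectional LSTM (made-up sizes, independent of the model above, assuming PyTorch is installed): hidden[-2] is the top layer's forward state at the last time step, and hidden[-1] is the top layer's backward state, which corresponds to the first time step of the sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=5, num_layers=2, bidirectional=True)

x = torch.randn(7, 1, 3)           # [sent len, batch, features], no padding
out, (hidden, cell) = lstm(x)

print(tuple(hidden.shape))         # (4, 1, 5): 2 layers * 2 directions
# top layer, forward direction == first half of output at the last time step
print(torch.allclose(hidden[-2], out[-1, :, :5]))  # True
# top layer, backward direction == second half of output at the first time step
print(torch.allclose(hidden[-1], out[0, :, 5:]))   # True
```

This is why concatenating hidden[-2] and hidden[-1] gives the final states of both directions of the top layer.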

Note:
Before feeding the embeddings into the RNN, we need to 'pack' them with nn.utils.rnn.pack_padded_sequence to ensure the RNN only processes the non-pad tokens. The outputs we get are packed_output (a packed sequence), the hidden state and the cell state. If the packing step were skipped, the final hidden state and cell state would most likely come from a pad token at the end of the sentence. With packed padded sequences, they are instead the hidden state and cell state of the last non-padded element.

We then use nn.utils.rnn.pad_packed_sequence to 'unpack' the output back into a tensor. Note that the outputs at padding positions are zero tensors. Normally we only need to unpack when the output is used by a downstream model; that is not required in this case, but the step is shown for illustration.

2. Using pre-trained word vectors

We choose GloVe word vectors; GloVe stands for Global Vectors for Word Representation, and detailed introductions and plenty of resources are available elsewhere. This tutorial does not cover how the vectors are trained, only how to use them. Here we use "glove.6B.100d": 6B means the vectors were trained on 6 billion tokens, and 100d means each vector is 100-dimensional (note the download is over 800 MB)

TEXT.build_vocab extracts, from the pre-trained word vectors, the vectors of the words that occur in the current training data, forming the vocab (vocabulary) of the training set. Words that do not appear in the pre-trained vector corpus (recorded as unk, unknown) are randomly initialized from a Gaussian distribution (unk_init = torch.Tensor.normal_).

TEXT.build_vocab(train_data,
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

pretrained_embeddings = TEXT.vocab.vectors
# Check the word vector shape: [vocab size, embedding dim]

# Replace the model's randomly initialized embedding weights with the pre-trained vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

# Set the unknown and padding token vectors to 0: they are unrelated to sentiment
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

The input can be viewed as a one-hot (identity) matrix, and embedding is a [25002, 100] matrix in which each row represents the vector of one word

Multiplying this one-hot input by the embedding matrix selects the corresponding rows, so the input effectively acts as an index into the matrix
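A tiny sketch of this equivalence, with a made-up 5-word vocabulary and 3-dimensional vectors (assuming PyTorch is installed):

```python
import torch

emb = torch.randn(5, 3)  # hypothetical embedding matrix: 5 words, 3 dims each
idx = 2

one_hot = torch.zeros(5)
one_hot[idx] = 1.0

# multiplying a one-hot row by the matrix selects that word's row...
by_matmul = one_hot @ emb
# ...which is exactly what index lookup does
by_lookup = emb[idx]

print(torch.allclose(by_matmul, by_lookup))  # True
```

In practice nn.Embedding performs the lookup directly, which is far cheaper than the matrix multiplication.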

Note that the word vector of the pad token is not learned during model training, while the word vector of the unknown token is learned.

Accordingly, you can see that at model initialization only the pad token is passed in, not the unknown token. Some implementations instead set the unk word vector to the mean of all word vectors

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):

3. Using the Adam optimizer

import torch.optim as optim

optimizer = optim.Adam(model.parameters())
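The bullet list at the top also mentions adding a regularization term as the parameter count grows; with Adam, one common way is the weight_decay argument, which adds an L2 penalty. A minimal sketch with a made-up parameter and a hypothetical (untuned) penalty strength:

```python
import torch
import torch.optim as optim

# a stand-in parameter; in the real code this would be model.parameters()
params = [torch.nn.Parameter(torch.randn(4, 4))]

# weight_decay=1e-5 is a hypothetical value, not one tuned for this task
optimizer = optim.Adam(params, lr=1e-3, weight_decay=1e-5)

print(optimizer.param_groups[0]['weight_decay'])  # 1e-05
```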

4. Model validation

import spacy
nlp = spacy.load('en_core_web_sm')

def predict_sentiment(model, sentence):
    model.eval()  # switch the model to evaluation mode
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]  # tokenize the sentence
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]  # convert each token to its vocabulary index
    length = [len(indexed)]  # record the length of the sentence
    tensor = torch.LongTensor(indexed).to(device)  # convert the index list to a tensor
    tensor = tensor.unsqueeze(1)  # add a batch dimension: [sent len] -> [sent len, 1]
    length_tensor = torch.LongTensor(length)  # convert the length to a tensor
    prediction = torch.sigmoid(model(tensor, length_tensor))  # squash the raw score to 0~1 with sigmoid
    return prediction.item()  # item() converts the single-value tensor to a Python float
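A hedged usage sketch (the sentences and scores are illustrative, not outputs of a real trained model): the sigmoid score lies in [0, 1], so values near 1 read as positive and values near 0 as negative. The thresholding step itself is trivial:

```python
def to_label(score, threshold=0.5):
    """Map a sigmoid score in [0, 1] to a sentiment label."""
    return "positive" if score >= threshold else "negative"

# e.g. predict_sentiment(model, "This film is great") might return ~0.97
print(to_label(0.97))  # positive
print(to_label(0.03))  # negative
```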

5. An environment problem encountered

When installing torchtext: my torch version was 1.8.1, which is incompatible with the latest torchtext. After some searching, I installed torchtext==0.9, which requires torch 1.8.0.

During the torchtext installation, torch 1.8.0 was installed as well; conda list then showed two versions of torch.

At that point the environment no longer supported GPU training, so I uninstalled the newly installed torch 1.8.0.

Importing torch then raised ModuleNotFoundError: No module named 'torch'. After reinstalling the original torch 1.8.1, the GPU still could not be used.

In the end there was no choice but to rebuild the environment. Does torchtext only pin against specific torch versions?

reference resources:

Tags: neural networks Deep Learning NLP

Posted on Sat, 18 Sep 2021 12:57:26 -0400 by stukov