These are my notes for the Datawhale September 2021 team learning course on sentiment analysis.
Original learning document: https://github.com/datawhalechina/team-learning-nlp/tree/master/EmotionalAnalysis
My notes on the baseline: https://blog.csdn.net/weixin_43634785/article/details/120289701?spm=1001.2014.3001.5502
The baseline leaves plenty of room for optimization, for example:
- using pre-trained word vectors
- replacing the RNN with a bidirectional LSTM
- adding regularization, since the parameter count grows
- trying a different optimizer
Below, the baseline is modified along these lines.
1. Replace the RNN with a bidirectional LSTM

```python
"""5. Model"""
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
# UNK_IDX is not passed in: the pad vector should be excluded from training,
# while the unk vector should still be learned
model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
            N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)
```
```python
class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, dropout, pad_idx):
        super().__init__()
        # embedding layer (word vectors); padding_idx keeps the pad vector fixed
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        # RNN variant: bidirectional LSTM
        self.rnn = nn.LSTM(embedding_dim,             # input_size
                           hidden_dim,                # hidden_size
                           num_layers=n_layers,       # number of stacked layers
                           bidirectional=bidirectional,
                           dropout=dropout)           # dropout between LSTM layers
        # final linear layer; the forward and backward hidden states are
        # concatenated, hence hidden_dim * 2
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        # text shape: [sent len, batch size]
        embedded = self.dropout(self.embedding(text))
        # embedded shape: [sent len, batch size, emb dim]

        # pack the sequence so the pad positions are skipped later
        # (lengths must be on the CPU!)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded,
                                                            text_lengths.to('cpu'))
        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        # unpack the sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        # output shape: [sent len, batch size, hid dim * num directions]
        #   (positions that were padding come back as zero tensors)
        # hidden shape: [num layers * num directions, batch size, hid dim]
        # cell shape:   [num layers * num directions, batch size, hid dim]

        # concatenate the final forward (hidden[-2,:,:]) and backward
        # (hidden[-1,:,:]) hidden states and apply dropout
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.fc(hidden)
```
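As a quick sanity check, the model above can be exercised with tiny hypothetical hyperparameters and random inputs to confirm the output shape. This is a self-contained sketch (the class is restated so the snippet runs on its own, and the vocabulary size, dimensions, and lengths are made-up toy values):

```python
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                           bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        embedded = self.dropout(self.embedding(text))       # [sent len, batch, emb dim]
        packed = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'))
        packed_output, (hidden, cell) = self.rnn(packed)
        # concat last forward (hidden[-2]) and last backward (hidden[-1]) states
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.fc(hidden)                              # [batch, output dim]

model = RNN(vocab_size=50, embedding_dim=8, hidden_dim=16, output_dim=1,
            n_layers=2, bidirectional=True, dropout=0.5, pad_idx=0)
text = torch.randint(1, 50, (7, 3))      # [sent len=7, batch=3]
lengths = torch.tensor([7, 5, 4])        # must be sorted in descending order
logits = model(text, lengths)
print(logits.shape)                      # torch.Size([3, 1])
```

One logit per sentence comes out, matching the [batch size, output dim] shape expected by the training loop.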
- A walk-through of the forward pass

```python
text, text_lengths = batch.text  # batch.text is a tuple (numericalized tensor, length of each sentence)
model(text, text_lengths)
```
text has shape [sent len, batch size]. Example: text = [745, 64], i.e. a batch of 64 sentences whose longest sentence has 745 tokens.
text goes through the embedding layer, whose weight matrix has shape [vocab size, emb dim], producing embedded with shape [sent len, batch size, emb dim]. Example: embedding weight = [25002, 100], embedded = [745, 64, 100].
embedded is passed to pack_padded_sequence to obtain packed_embedded. This step mainly keeps the pad tokens from taking part in the parameter updates. Note that lengths must be on the CPU!
Feeding packed_embedded into the LSTM, which detects that its input is a PackedSequence, yields packed_output together with (hidden, cell). To use the per-time-step output, packed_output must first be unpacked with pad_packed_sequence, which recovers output alongside hidden and cell.
output has shape [745, 64, 512]: 745 is the sentence length, 64 the number of sentences, and 512 arises because hidden size = 256 and the two directions of the LSTM are concatenated.
```
tensor([745, 745, 744, 744, 743, 743, 742, 738, 738, 738, 738, 736, 734, 734, 734, 734, 733, 732, 731, 731, 730, 730, 729, 728, 727, 727, 726, 726, 725, 723, 723, 723, 722, 722, 721, 721, 720, 720, 719, 716, 716, 715, 715, 715, 713, 712, 712, 711, 707, 707, 707, 706, 705, 704, 703, 702, 702, 701, 701, 700, 699, 699, 699, 698])
```

Printing the length of each sentence shows that the iterator sorts automatically, grouping sentences of similar length together to reduce the amount of padding.
```
tensor([[-0.0019,  0.0004, -0.0030,  ..., -0.0097,  0.0420,  0.0041],
        [ 0.0034,  0.0361, -0.0009,  ..., -0.0292,  0.0479,  0.0052],
        [ 0.0123,  0.0347,  0.0076,  ..., -0.0316, -0.0182, -0.0500],
        ...,
        [-0.0454,  0.0027, -0.0149,  ..., -0.1120,  0.0238, -0.0076],
        [-0.0103,  0.0637,  0.0412,  ..., -0.0712,  0.0208, -0.0349],
        [-0.0093,  0.0435,  0.0166,  ..., -0.0712,  0.0133, -0.0321]],
       grad_fn=<SelectBackward>)
```

```
tensor([[ 0.0110,  0.0757, -0.0370,  ...,  0.0177, -0.0130, -0.0178],
        [-0.0145,  0.0666, -0.0171,  ..., -0.0171,  0.0181,  0.0005],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<SelectBackward>)
```
These two slices make it obvious that positions beyond a sentence's true length come out as zeros.
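The pack → LSTM → unpack round trip above can be reproduced on a toy batch (the sizes here are made-up toy values, not the tutorial's hyperparameters) to confirm that padded positions come back as zero vectors:

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=4, hidden_size=3)
batch = torch.randn(5, 2, 4)        # [sent len=5, batch=2, emb dim=4]
lengths = torch.tensor([5, 3])      # second sentence has 2 padded time steps

packed = nn.utils.rnn.pack_padded_sequence(batch, lengths.to('cpu'))
packed_output, (hidden, cell) = rnn(packed)
output, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

print(output.shape)                 # torch.Size([5, 2, 3])
print(output[3:, 1])                # steps 3 and 4 of sentence 2 are all zeros
```

Because the pad steps were never fed to the LSTM, pad_packed_sequence simply fills those positions with zeros, which is exactly what the printed tensors above show.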
- Next, let's look at hidden
hidden and cell have shape [4, 64, 256].
hidden holds the "sideways" output of each layer (its state at the last time step), while output is the "upward" output of the model (the top layer's state at every time step).
hidden shape: [num layers * num directions, batch size, hid dim]
cell shape: [num layers * num directions, batch size, hid dim]
Because the hyperparameters are N_LAYERS = 2 and BIDIRECTIONAL = True (a two-layer bidirectional LSTM), num directions = 2, so the first dimension stacks the hidden states of 2 × 2 = 4 layer-direction pairs.
- Reference: https://zhuanlan.zhihu.com/p/79064602
A bidirectional LSTM is really two separate LSTMs whose respective outputs and hidden states are concatenated when used.
The article also explains why so many tensors carry a transpose, which is related to the batch_first option.
It further explains why an LSTM needs the pack operation: unlike a plain RNN, feeding a run of pad (zero) inputs to an LSTM still changes its output, so the position just before the padding must be recorded and everything after it ignored; see the article for details. (This raises a question: why does the plain RNN in the baseline get away without packing?)
- Why does the concatenation take hidden[-2,:,:] and hidden[-1,:,:]?
The figure referenced below shows the structure of a bidirectional LSTM with num_layers = 2.
Reference: https://zhuanlan.zhihu.com/p/39191116
For example, if we define a bidirectional LSTM with num_layers = 3, the first dimension of h_n has size 6 (2 × 3): h_n[0] is the last-time-step output of layer 1's forward direction, h_n[1] that of layer 1's backward direction, h_n[2] and h_n[3] those of layer 2's forward and backward directions, and h_n[4] and h_n[5] those of layer 3's forward and backward directions.
In other words, each layer of the model is itself a bidirectional LSTM, rather than a stack of three forward layers alongside a stack of three backward layers.
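This indexing can be verified numerically. The sketch below (toy sizes, no padding, so no packing is needed) checks that hidden[-2] is the top layer's forward state at the final time step, and hidden[-1] the top layer's backward state at the first time step:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.LSTM(input_size=4, hidden_size=3, num_layers=2, bidirectional=True)
x = torch.randn(6, 2, 4)            # [sent len, batch, emb dim]
output, (hidden, cell) = rnn(x)

print(hidden.shape)                 # [num layers * num directions, batch, hid] = [4, 2, 3]
# forward direction: final time step, first half of output's last dimension
print(torch.allclose(hidden[-2], output[-1, :, :3]))   # True
# backward direction: first time step, second half of output's last dimension
print(torch.allclose(hidden[-1], output[0, :, 3:]))    # True
```

So concatenating hidden[-2] and hidden[-1] really does combine the last layer's forward and backward summaries of the whole sentence.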
Before feeding the embeddings into the RNN, we "pack" them with nn.utils.rnn.pack_padded_sequence, which ensures the RNN only processes the non-pad tokens. The outputs are packed_output (a PackedSequence) plus the hidden state and cell state. Without packing, the final hidden and cell states would most likely be computed from a sentence's pad tokens; with packed padded sequences they come from the last non-padded element.
We then "unpack" the output with nn.utils.rnn.pad_packed_sequence to turn it back into a regular tensor. Note that the output at padding positions is a zero tensor. In general you only need to unpack when the per-time-step output is used by a later model; it is not required in this case and is shown only to demonstrate the step.
2. Use pre-trained word vectors
We use GloVe vectors (GloVe: Global Vectors for Word Representation; a detailed introduction and many resources are available online). This tutorial does not cover how the vectors are trained, only how to use them. Here we use "glove.6B.100d": 6B means the vectors were trained on 6 billion tokens, and 100d means each vector is 100-dimensional (note that the download is over 800 MB).
TEXT.build_vocab looks up, in the pre-trained vectors, the vectors for the words that appear in the training data, forming the vocab (vocabulary) of the current training set. Words that do not appear in the pre-trained corpus (unknown words, UNK) are randomly initialized from a Gaussian distribution (unk_init = torch.Tensor.normal_).
```python
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data,
                 max_size = MAX_VOCAB_SIZE,
                 vectors = "glove.6B.100d",
                 unk_init = torch.Tensor.normal_)
LABEL.build_vocab(train_data)
```
```python
pretrained_embeddings = TEXT.vocab.vectors
# check the word-vector shape: [vocab size, embedding dim]
print(pretrained_embeddings.shape)

# copy the pre-trained vectors over the randomly initialized embedding weights
model.embedding.weight.data.copy_(pretrained_embeddings)

# zero the unknown and padding token vectors: they carry no sentiment
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
```
The input can be viewed as one-hot rows of an identity matrix, and embedding is a [25002, 100] matrix in which each row is one word's vector. Multiplying a one-hot input by the embedding matrix selects the corresponding row, so the embedding lookup is equivalent to using the input as an index into that matrix.
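This equivalence is easy to check on a toy embedding (the vocab size of 10 here is a made-up toy value):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)                    # toy vocab of 10 words, 4-dim vectors
idx = torch.tensor([7])
one_hot = torch.zeros(1, 10)
one_hot[0, 7] = 1.0

by_lookup = emb(idx)                         # index into the weight matrix
by_matmul = one_hot @ emb.weight             # one-hot matrix product
print(torch.allclose(by_lookup, by_matmul))  # True
```

The lookup is just a cheap way of computing the one-hot matrix product without materializing the identity matrix.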
Note that the pad token's vector is not updated during training, while the unknown token's vector is learned. This is why, at model construction, only the pad token's index is passed in and the unknown token's is not. Some implementations instead set the unk vector to the mean of all word vectors.
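The mechanism behind "pad is frozen, unk is learned" is the padding_idx argument of nn.Embedding, which keeps that row's gradient at zero. A small sketch (toy sizes; index 0 standing in for unk and 1 for pad, as in torchtext's default ordering):

```python
import torch
import torch.nn as nn

UNK_IDX, PAD_IDX = 0, 1
emb = nn.Embedding(5, 3, padding_idx=PAD_IDX)
emb.weight.data[UNK_IDX] = 0.0      # zero-initialized, but still trainable
emb.weight.data[PAD_IDX] = 0.0      # zero-initialized AND frozen via padding_idx

out = emb(torch.tensor([0, 1, 2]))
out.sum().backward()

print(emb.weight.grad[PAD_IDX])     # all zeros: the pad vector is never updated
print(emb.weight.grad[UNK_IDX])     # non-zero: the unk vector is learned
```

So zeroing both rows gives them the same starting point, but only the pad row stays zero throughout training.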
```python
class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, dropout, pad_idx):
```
3. Use the Adam optimizer

```python
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
```
4. Using the trained model for prediction
```python
import spacy
nlp = spacy.load('en_core_web_sm')

def predict_sentiment(model, sentence):
    model.eval()                                               # switch to evaluation mode
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]  # tokenize the sentence
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]          # map each token to its vocab index
    length = [len(indexed)]                                    # sentence length
    tensor = torch.LongTensor(indexed).to(device)              # list of indices -> tensor
    tensor = tensor.unsqueeze(1)                               # add a batch dimension: [sent len, 1]
    length_tensor = torch.LongTensor(length)                   # length -> tensor
    prediction = torch.sigmoid(model(tensor, length_tensor))   # squash the logit into 0..1
    return prediction.item()   # item() turns a one-element tensor into a Python number
```
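The tensor preparation inside predict_sentiment can be illustrated without a trained model. In this sketch a plain whitespace split and a hand-written stoi dict stand in for spaCy's tokenizer and TEXT.vocab.stoi (both stand-ins are assumptions, not the tutorial's actual objects):

```python
import torch

# hypothetical vocabulary standing in for TEXT.vocab.stoi
stoi = {'<unk>': 0, '<pad>': 1, 'this': 2, 'film': 3, 'is': 4, 'great': 5}
sentence = "this film is great"

tokenized = sentence.split()                       # stand-in for spaCy tokenization
indexed = [stoi.get(t, stoi['<unk>']) for t in tokenized]
tensor = torch.LongTensor(indexed).unsqueeze(1)    # add batch dim: [sent len, 1]
length_tensor = torch.LongTensor([len(indexed)])

print(tensor.shape)                                # torch.Size([4, 1])
print(length_tensor)                               # tensor([4])
```

The unsqueeze(1) is what turns a single sentence into a "batch" of one, matching the [sent len, batch size] layout the model expects.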
5. An environment problem I ran into
My installed torch was version 1.8.1, which turned out to be incompatible with the latest torchtext. After some searching I installed torchtext==0.9, which requires torch 1.8.0; during that installation torch 1.8.0 was pulled in as well, and conda list then showed two versions of torch. At that point the environment no longer supported GPU training, so I uninstalled the newly installed torch 1.8.0, after which import torch raised ModuleNotFoundError: No module named 'torch'. Reinstalling the original torch 1.8.1 still left the GPU unusable, so in the end I had no choice but to rebuild the environment. Does torchtext only support specific torch versions?