0. Introduction
First of all, it should be pointed out that the code comes from Mr. Li Hongyi's course rather than being my own. This article mainly adds explanations and comments to the original code and converts the traditional Chinese characters into simplified ones.
The task is to classify text comments from Twitter as positive or negative. The specific requirements are as follows:
The model we used is as follows:
Here, word embedding converts each word into a vector so that it can be fed into the LSTM for training. In the code below, the author uses the word2vec model (skip-gram, CBOW, etc.) to perform this conversion. You can learn the details of the algorithm from articles on CSDN, Zhihu or Bilibili.
1. Download data
```python
path_prefix = './'

!gdown --id '1lz0Wtwxsh5YCPdqQ3E3l_nbfJT1N13V8' --output data.zip
!unzip data.zip
!ls
```
```python
# this is for filtering the warnings
import warnings
warnings.filterwarnings('ignore')
```
2. Read in data
Because the data is not stored in a standard format, we need to write our own loading functions.
```python
import torch
import numpy as np
import pandas as pd
import torch.optim as optim
import torch.nn.functional as F

def load_training_data(path='training_label.txt'):
    # Read in the data required for training.
    # If the file is 'training_label.txt', we also need to read its labels;
    # if it is 'training_nolabel.txt', there are no labels to read.
    if 'training_label' in path:  # check whether 'training_label' is in the path to decide whether labels must be read
        # A common way to read text data stored in a txt file
        with open(path, 'r') as f:
            lines = f.readlines()                                 # read the data line by line
            lines = [line.strip('\n').split(' ') for line in lines]
        x = [line[2:] for line in lines]                          # the text starts from the third column
        y = [line[0] for line in lines]                           # the first column is the label
        return x, y
    else:
        with open(path, 'r') as f:
            lines = f.readlines()
            x = [line.strip('\n').split(' ') for line in lines]
        return x

def load_testing_data(path='testing_data'):
    # Read in the data required for testing
    with open(path, 'r') as f:
        lines = f.readlines()
        X = ["".join(line.strip('\n').split(",")[1:]).strip() for line in lines[1:]]
        X = [sen.split(' ') for sen in X]
    return X

def evaluation(outputs, labels):
    # Evaluation function: classification accuracy
    # outputs => probabilities (float)
    # labels  => ground-truth labels
    outputs[outputs >= 0.5] = 1  # probabilities >= 0.5 are classified as label 1
    outputs[outputs < 0.5] = 0   # probabilities < 0.5 are classified as label 0
    correct = torch.sum(torch.eq(outputs, labels)).item()
    return correct
```
3. Define word2vec model
The word2vec model converts words into vectors and, almost magically, preserves the similarity between words; for the detailed algorithm you can search for articles on CSDN, Zhihu or Bilibili. Here we use word2vec to turn the text into vectors so that it can be fed into the neural network for learning (neural networks only understand numbers, not English). A small query sketch after the training code below shows this similarity in action.
```python
# This block trains the word-to-vector word embedding.
# Be careful! This block trains word2vec on the CPU, which may take more than 10 minutes
# (I tried, and it really takes a long time).
import os
import numpy as np
import pandas as pd
import argparse
from gensim.models import word2vec

def train_word2vec(x):
    # Train the word-to-vector word embedding.
    # size is the dimensionality of the word vectors, window is the context window size,
    # min_count ignores words that appear too rarely, workers is the number of threads,
    # iter is the number of training iterations over the corpus, and sg=1 selects skip-gram.
    model = word2vec.Word2Vec(x, size=250, window=5, min_count=5, workers=12, iter=10, sg=1)
    return model

if __name__ == "__main__":
    print("loading training data ...")
    train_x, y = load_training_data('training_label.txt')
    train_x_no_label = load_training_data('training_nolabel.txt')

    print("loading testing data ...")
    test_x = load_testing_data('testing_data.txt')

    model = train_word2vec(train_x + train_x_no_label + test_x)

    print("saving model ...")
    # model.save(os.path.join(path_prefix, 'model/w2v_all.model'))
    model.save(os.path.join(path_prefix, 'w2v_all.model'))
    # Saving the model makes the subsequent training more convenient -- a good habit.
```
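To see the "similarity" mentioned above in action, here is a minimal query sketch. It assumes the `w2v_all.model` file saved by the code above already exists, and the word "happy" is only an illustrative example that may or may not be in the vocabulary:

```python
# Minimal sketch: query the trained word2vec model (assumes 'w2v_all.model' was saved
# by the code above; the word "happy" is only an illustrative example).
from gensim.models import Word2Vec

w2v = Word2Vec.load('w2v_all.model')
print(w2v.wv['happy'][:5])                    # first 5 of the 250 dimensions of the vector for "happy"
print(w2v.wv.most_similar('happy', topn=3))   # the 3 words whose vectors are closest to "happy"
```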
4. Define data preprocessing class
Because we are dealing with text data, we must preprocess it before training. For the convenience of the subsequent steps, the author wraps the preprocessing into a class. Specifically, it does the following (a toy sketch of the index conversion and padding follows the list):
- Load the previously trained word2vec model and keep its trained embedding (this embedding contains the word vectors learned when training the word2vec model)
- Add "<PAD>" and "<UNK>" to the embedding_matrix
- Build the embedding_matrix
- Pad or truncate the input sentences to a consistent length, which makes it convenient to feed them into the neural network
- Implement word2idx, turning each word in a sentence into its corresponding index
- Convert the labels into tensor format
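To make the index conversion and padding concrete before reading the class, here is a tiny standalone sketch (the vocabulary and sentence are made up purely for illustration; the real class below builds word2idx from the word2vec vocabulary):

```python
# Toy illustration of word-to-index conversion plus padding (made-up vocabulary).
word2idx = {"i": 0, "love": 1, "this": 2, "<PAD>": 3, "<UNK>": 4}
sen_len = 5

sentence = ["i", "love", "unicorns"]                          # "unicorns" is out of vocabulary
idx = [word2idx.get(w, word2idx["<UNK>"]) for w in sentence]
idx = idx[:sen_len] + [word2idx["<PAD>"]] * (sen_len - len(idx))
print(idx)  # [0, 1, 4, 3, 3] -> fixed length, unknown word mapped to <UNK>
```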
```python
from torch import nn
from gensim.models import Word2Vec

class Preprocess():
    def __init__(self, sentences, sen_len, w2v_path="./w2v.model"):
        # First define some attributes of the class
        self.w2v_path = w2v_path
        self.sentences = sentences
        self.sen_len = sen_len
        self.idx2word = []
        self.word2idx = {}
        self.embedding_matrix = []

    def get_w2v_model(self):
        # Load the previously trained word2vec model
        self.embedding = Word2Vec.load(self.w2v_path)
        self.embedding_dim = self.embedding.vector_size  # the embedding dimension is the length of the trained word2vec vectors

    def add_embedding(self, word):
        # Add a word ("<PAD>" or "<UNK>") to the embedding and give it a randomly generated vector.
        # "<PAD>" and "<UNK>" cannot be trained by word2vec, and they do not need vectors that reflect
        # their relationship with other words, so random vectors are enough.
        vector = torch.empty(1, self.embedding_dim)                             # create an empty tensor
        torch.nn.init.uniform_(vector)                                          # fill it with random values
        self.word2idx[word] = len(self.word2idx)                                # record the corresponding index in word2idx
        self.idx2word.append(word)                                              # record the word in idx2word
        self.embedding_matrix = torch.cat([self.embedding_matrix, vector], 0)   # append the new vector to embedding_matrix

    def make_embedding(self, load=True):
        print("Get embedding ...")
        # Load the trained word2vec word embedding
        if load:
            print("loading word to vec model ...")
            self.get_w2v_model()
        else:
            raise NotImplementedError
        # Build the word2idx dictionary, the idx2word list and the list of word vectors
        for i, word in enumerate(self.embedding.wv.vocab):
            print('get words #{}'.format(i+1), end='\r')
            # e.g. self.word2idx['ha'] = 1
            # e.g. self.idx2word[1] = 'ha'
            # e.g. self.embedding_matrix[1] = the vector of 'ha'
            self.word2idx[word] = len(self.word2idx)
            self.idx2word.append(word)
            self.embedding_matrix.append(self.embedding[word])
        print('')
        self.embedding_matrix = torch.tensor(self.embedding_matrix)
        # Add "<PAD>" and "<UNK>" to the embedding
        self.add_embedding("<PAD>")
        self.add_embedding("<UNK>")
        print("total words: {}".format(len(self.embedding_matrix)))
        return self.embedding_matrix

    def pad_sequence(self, sentence):
        # Make every sentence the same length
        if len(sentence) > self.sen_len:
            # truncate sentences that are too long
            sentence = sentence[:self.sen_len]
        else:
            # pad sentences that are too short with "<PAD>"
            pad_len = self.sen_len - len(sentence)
            for _ in range(pad_len):
                sentence.append(self.word2idx["<PAD>"])
        assert len(sentence) == self.sen_len
        return sentence

    def sentence_word2idx(self):
        # Turn the words in each sentence into the corresponding indices
        sentence_list = []
        for i, sen in enumerate(self.sentences):
            print('sentence count #{}'.format(i+1), end='\r')
            sentence_idx = []
            for word in sen:
                if word in self.word2idx.keys():
                    sentence_idx.append(self.word2idx[word])
                else:
                    sentence_idx.append(self.word2idx["<UNK>"])
            # make every sentence the same length
            sentence_idx = self.pad_sequence(sentence_idx)
            sentence_list.append(sentence_idx)
        return torch.LongTensor(sentence_list)

    def labels_to_tensor(self, y):
        # Turn the labels into tensors
        y = [int(label) for label in y]
        return torch.LongTensor(y)
```
5. Create Dataset
This step is relatively simple: we just implement a standard Dataset class (a quick DataLoader check follows the code).
```python
# The dataset we create needs to implement '__init__', '__getitem__' and '__len__'
# so that it can be used by a DataLoader.
import torch
from torch.utils import data

class TwitterDataset(data.Dataset):
    """
    Expected data shape like: (data_num, data_len)
    Data can be a list of numpy arrays or a list of lists
    input data shape: (data_num, seq_len, feature_dim)
    __len__ will return the number of data
    """
    def __init__(self, X, y):
        self.data = X
        self.label = y

    def __getitem__(self, idx):
        if self.label is None:
            return self.data[idx]
        return self.data[idx], self.label[idx]

    def __len__(self):
        return len(self.data)
```
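As a quick sanity check (a minimal sketch with dummy tensors, not the real data), the dataset can be handed straight to a DataLoader, exactly as section 9 does later with the real data:

```python
# Minimal sketch: wrap TwitterDataset in a DataLoader (dummy tensors, for illustration only).
dummy_x = torch.zeros(10, 30, dtype=torch.long)   # 10 "sentences" of 30 word indices each
dummy_y = torch.zeros(10, dtype=torch.long)       # 10 labels
loader = torch.utils.data.DataLoader(TwitterDataset(X=dummy_x, y=dummy_y), batch_size=4, shuffle=True)
for inputs, labels in loader:
    print(inputs.shape, labels.shape)             # torch.Size([4, 30]) torch.Size([4])
    break
```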
6. Build the model
The LSTM model we build consists of three main parts:
- embedding layer
- LSTM
- Fully connected neural network
The embedding layer can be understood as encoding our text into numbers that the LSTM can understand; here the encoding comes from the word2vec model trained earlier.
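For intuition, the embedding layer is just a lookup table whose rows are the word2vec vectors. The model code below builds it by assigning embedding_matrix to the layer's weight; an equivalent minimal sketch (with a random matrix standing in for the real embedding_matrix) looks like this:

```python
# Minimal sketch: an embedding layer backed by a pretrained matrix
# (random numbers stand in for the real word2vec embedding_matrix).
import torch
from torch import nn

pretrained = torch.randn(5000, 250)                           # (vocab_size, embedding_dim)
emb = nn.Embedding.from_pretrained(pretrained, freeze=True)   # freeze=True plays the role of fix_embedding=True
word_indices = torch.LongTensor([[1, 4, 7]])                  # a batch of 1 sentence containing 3 word indices
print(emb(word_indices).shape)                                # torch.Size([1, 3, 250])
```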
The main arguments of the LSTM module are listed below (a small shape check follows the list):
- input_size: the feature dimension of the input, i.e. the number of features fed in at each time step.
- hidden_size: the dimension of the hidden state, i.e. the number of hidden units, similar to the width of a single-layer perceptron.
- num_layers: the number of stacked LSTM layers. The default is 1; if it is set to 2, the second LSTM takes the output of the first LSTM as its input.
- batch_first: whether the first dimension of the input and output is batch_size; the default is False. In PyTorch we usually feed the model through a DataLoader, whose batch_size parameter determines how many samples are passed in at once, so the LSTM always receives a batch of data. batch_first only controls where that batch dimension sits: with batch_first=True the tensors have shape (batch, seq_len, feature), otherwise (seq_len, batch, feature).
- dropout: the default is 0. If non-zero, a dropout layer is added after the output of every LSTM layer except the last.
- bidirectional: whether the RNN is bidirectional. The default is False; if True, num_directions = 2, otherwise 1.
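To make these arguments concrete, here is a minimal shape check (the sizes are dummy values chosen only for illustration):

```python
# Minimal shape check for nn.LSTM with batch_first=True (dummy sizes for illustration).
import torch
from torch import nn

lstm = nn.LSTM(input_size=250, hidden_size=150, num_layers=1, batch_first=True)
x = torch.randn(128, 30, 250)   # (batch_size, seq_len, input_size)
out, (h_n, c_n) = lstm(x)
print(out.shape)                # torch.Size([128, 30, 150]): the hidden state at every time step
print(h_n.shape)                # torch.Size([1, 128, 150]): (num_layers * num_directions, batch, hidden_size)
```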
The fully connected network at the end converts the LSTM output into the final prediction.
```python
import torch
from torch import nn

class LSTM_Net(nn.Module):
    def __init__(self, embedding, embedding_dim, hidden_dim, num_layers, dropout=0.5, fix_embedding=True):
        super(LSTM_Net, self).__init__()
        # Build the embedding layer
        self.embedding = torch.nn.Embedding(embedding.size(0), embedding.size(1))
        self.embedding.weight = torch.nn.Parameter(embedding)  # initialize the layer with the embedding we previously trained with word2vec
        # Whether to fix the embedding; if fix_embedding is False, the embedding is also updated during training
        self.embedding.weight.requires_grad = False if fix_embedding else True
        self.embedding_dim = embedding.size(1)
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout = dropout
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, inputs):
        inputs = self.embedding(inputs)
        x, _ = self.lstm(inputs, None)
        # x has dimension (batch, seq_len, hidden_size)
        # Take the hidden state of the last time step (my understanding: the last output best summarizes the whole text)
        x = x[:, -1, :]
        x = self.classifier(x)
        return x
```
7. Define model training function
This training process is similar to the training we have done before. The comments are quite detailed, and the code can be understood through them.
```python
import torch
from torch import nn
import torch.optim as optim
import torch.nn.functional as F

def training(batch_size, n_epoch, lr, model_dir, train, valid, model, device):
    total = sum(p.numel() for p in model.parameters())                         # total number of parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)  # number of trainable parameters
    print('\nstart training, parameter total:{}, trainable:{}\n'.format(total, trainable))
    model.train()             # set the model to train mode so that the optimizer can update its parameters
    criterion = nn.BCELoss()  # define the loss function; here we use binary cross entropy loss
    t_batch = len(train)
    v_batch = len(valid)
    optimizer = optim.Adam(model.parameters(), lr=lr)  # pass the model parameters to the optimizer with a suitable learning rate
    total_loss, total_acc, best_acc = 0, 0, 0
    for epoch in range(n_epoch):
        total_loss, total_acc = 0, 0
        # training
        for i, (inputs, labels) in enumerate(train):
            inputs = inputs.to(device, dtype=torch.long)   # if device is "cuda", inputs become torch.cuda.LongTensor
            labels = labels.to(device, dtype=torch.float)  # labels become torch.cuda.FloatTensor; they must be float because they are fed to criterion
            optimizer.zero_grad()                          # the gradients from loss.backward() accumulate, so zero them for every batch
            outputs = model(inputs)                        # feed the inputs to the model
            outputs = outputs.squeeze()                    # remove the outermost dimension so that outputs can be passed to criterion()
            loss = criterion(outputs, labels)              # compute the current training loss
            loss.backward()                                # compute the gradients of the loss
            optimizer.step()                               # update the model parameters
            correct = evaluation(outputs, labels)          # compute the current training accuracy
            total_acc += (correct / batch_size)
            total_loss += loss.item()
            print('[ Epoch{}: {}/{} ] loss:{:.3f} acc:{:.3f} '.format(
            	epoch+1, i+1, t_batch, loss.item(), correct*100/batch_size), end='\r')
        print('\nTrain | Loss:{:.5f} Acc: {:.3f}'.format(total_loss/t_batch, total_acc/t_batch*100))

        # validation
        model.eval()  # set the model to eval mode so that its parameters are fixed
        with torch.no_grad():
            total_loss, total_acc = 0, 0
            for i, (inputs, labels) in enumerate(valid):
                inputs = inputs.to(device, dtype=torch.long)
                labels = labels.to(device, dtype=torch.float)
                outputs = model(inputs)
                outputs = outputs.squeeze()
                loss = criterion(outputs, labels)
                correct = evaluation(outputs, labels)
                total_acc += (correct / batch_size)
                total_loss += loss.item()
            print("Valid | Loss:{:.5f} Acc: {:.3f} ".format(total_loss/v_batch, total_acc/v_batch*100))
            if total_acc > best_acc:
                # if the validation result is better than all previous ones, save the current model for later prediction
                best_acc = total_acc
                #torch.save(model, "{}/val_acc_{:.3f}.model".format(model_dir,total_acc/v_batch*100))
                torch.save(model, "{}/ckpt.model".format(model_dir))
                print('saving model with acc {:.3f}'.format(total_acc/v_batch*100))
        print('-----------------------------------------------')
        model.train()  # switch back to train mode for the next epoch (we just switched to eval mode)
```
8. Define model test function
```python
import torch
from torch import nn
import torch.optim as optim
import torch.nn.functional as F

def testing(batch_size, test_loader, model, device):
    model.eval()
    ret_output = []
    with torch.no_grad():
        for i, inputs in enumerate(test_loader):
            inputs = inputs.to(device, dtype=torch.long)
            outputs = model(inputs)
            outputs = outputs.squeeze()
            outputs[outputs >= 0.5] = 1  # probabilities >= 0.5 are classified as label 1
            outputs[outputs < 0.5] = 0   # probabilities < 0.5 are classified as label 0
            ret_output += outputs.int().tolist()
    return ret_output
```
9. Call the previous functions to start training
- Organize the paths of the data files
- Define the sentence length, whether to fix the embedding, the batch size, the number of training epochs, the learning rate, and the directory where the model is saved
- Read in the data
- Preprocess the inputs and labels
- Create the model object
- Split the data into training data and validation data (part of the training data is held out as validation data; a note on an alternative random split follows the code)
- Wrap the data in Dataset objects for the DataLoaders to access
- Convert the data into batches of tensors
- Start training
```python
import os
import torch
import argparse
import numpy as np
from torch import nn
from gensim.models import word2vec
from sklearn.model_selection import train_test_split

# Use the return value of torch.cuda.is_available() to check whether a GPU environment exists;
# if so, set device to "cuda", otherwise to "cpu"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Organize the paths of the data files
train_with_label = os.path.join(path_prefix, 'training_label.txt')
train_no_label = os.path.join(path_prefix, 'training_nolabel.txt')
testing_data = os.path.join(path_prefix, 'testing_data.txt')

w2v_path = os.path.join(path_prefix, 'w2v_all.model')  # path of the word2vec model

# Define the sentence length, whether to fix the embedding, the batch size, the number of training
# epochs, the learning rate, and the directory where the model is saved
sen_len = 30
fix_embedding = True  # fix embedding during training
batch_size = 128
epoch = 5
lr = 0.001
# model_dir = os.path.join(path_prefix, 'model/')  # model directory for checkpoint model
model_dir = path_prefix  # model directory for checkpoint model

print("loading data ...")  # read in 'training_label.txt' and 'training_nolabel.txt'
train_x, y = load_training_data(train_with_label)
train_x_no_label = load_training_data(train_no_label)

# Preprocess the inputs and labels
preprocess = Preprocess(train_x, sen_len, w2v_path=w2v_path)
embedding = preprocess.make_embedding(load=True)
train_x = preprocess.sentence_word2idx()
y = preprocess.labels_to_tensor(y)

# Create the model object
model = LSTM_Net(embedding, embedding_dim=250, hidden_dim=250, num_layers=1, dropout=0.5, fix_embedding=fix_embedding)
model = model.to(device)  # if device is "cuda", the model is trained on the GPU (the inputs also need to be cuda tensors)

# Split the data into training data and validation data (part of the training data is held out as validation data)
X_train, X_val, y_train, y_val = train_x[:190000], train_x[190000:], y[:190000], y[190000:]

# Wrap the data in Dataset objects for the DataLoaders to access
train_dataset = TwitterDataset(X=X_train, y=y_train)
val_dataset = TwitterDataset(X=X_val, y=y_val)

# Convert the data into batches of tensors
train_loader = torch.utils.data.DataLoader(dataset = train_dataset,
                                           batch_size = batch_size,
                                           shuffle = True,
                                           num_workers = 8)

val_loader = torch.utils.data.DataLoader(dataset = val_dataset,
                                         batch_size = batch_size,
                                         shuffle = False,
                                         num_workers = 8)

# Start training
training(batch_size, epoch, lr, model_dir, train_loader, val_loader, model, device)
```
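One detail worth noting: train_test_split is imported above but never used; the script simply takes the tail of the labelled data as the validation set with a fixed slice. If a shuffled split is preferred, a minimal sketch could look like this (the roughly 5% validation ratio is an assumption, not part of the original code):

```python
# Sketch of a random train/validation split, as an alternative to the fixed slice above
# (the ~5% validation ratio is an assumption, not part of the original code).
perm = torch.randperm(len(train_x))   # random permutation of the sample indices
val_size = len(train_x) // 20         # hold out roughly 5% for validation
X_train, y_train = train_x[perm[val_size:]], y[perm[val_size:]]
X_val,   y_val   = train_x[perm[:val_size]], y[perm[:val_size]]
```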
The training results are as follows:
10. Make predictions and save results
print("loading testing data ...") test_x = load_testing_data(testing_data) preprocess = Preprocess(test_x, sen_len, w2v_path=w2v_path) embedding = preprocess.make_embedding(load=True) test_x = preprocess.sentence_word2idx() test_dataset = TwitterDataset(X=test_x, y=None) test_loader = torch.utils.data.DataLoader(dataset = test_dataset, batch_size = batch_size, shuffle = False, num_workers = 8) print('\nload model ...') model = torch.load(os.path.join(model_dir, 'ckpt.model')) outputs = testing(batch_size, test_loader, model, device) # Save to csv tmp = pd.DataFrame({"id":[str(i) for i in range(len(test_x))],"label":outputs}) print("save csv ...") tmp.to_csv(os.path.join(path_prefix, 'predict.csv'), index=False) print("Finish Predicting")