PyTorch in Practice: Text Classification with LSTM

0. Introduction

First of all, it should be pointed out that the code comes from Mr. Li Hongyi's course rather than being written by me. This article mainly adds explanations and comments to the original code and converts the traditional Chinese characters into simplified ones.
The task is to classify Twitter text comments as positive or negative.

The overall model first converts each word into a vector (word embedding) and then feeds the resulting sequence into an LSTM for classification. In the code below, the embeddings come from a word2vec model (skip-gram or CBOW); detailed explanations of these algorithms are easy to find on CSDN, Zhihu, or Bilibili.

1. Download data

path_prefix = './'
!gdown --id '1lz0Wtwxsh5YCPdqQ3E3l_nbfJT1N13V8' --output data.zip
!unzip data.zip
!ls
# this is for filtering the warnings
import warnings
warnings.filterwarnings('ignore')

2. Read in data

Because the data is not in a standard format, we need to write our own loading functions.

import torch
import numpy as np
import pandas as pd
import torch.optim as optim
import torch.nn.functional as F

def load_training_data(path='training_label.txt'):
    # Read the data required for training.
    # 'training_label.txt' contains labels, so they must be read;
    # 'training_nolabel.txt' has no labels, so only the text is read.
    if 'training_label' in path: # check whether 'training_label' is in the path to decide whether labels need to be read
        # a common way to read text data stored in a txt file
        with open(path, 'r') as f: 
            lines = f.readlines() # read the data line by line
            lines = [line.strip('\n').split(' ') for line in lines]
        x = [line[2:] for line in lines] # the text starts from the third token (the second token is a separator)
        y = [line[0] for line in lines] # the first token is the label
        return x, y
    else:
        with open(path, 'r') as f:
            lines = f.readlines()
            x = [line.strip('\n').split(' ') for line in lines]
        return x

def load_testing_data(path='testing_data'):
    # Read the data required for testing.
    with open(path, 'r') as f:
        lines = f.readlines()
        # skip the header line; each remaining line is '<id>,<text>', so drop the id and keep only the text
        X = ["".join(line.strip('\n').split(",")[1:]).strip() for line in lines[1:]]
        X = [sen.split(' ') for sen in X] # split each sentence into a list of words
    return X

def evaluation(outputs, labels): # evaluate the model with classification accuracy
    # outputs => predicted probabilities (float)
    # labels => ground-truth labels
    outputs[outputs>=0.5] = 1 # probability >= 0.5 is classified as 1
    outputs[outputs<0.5] = 0 # probability < 0.5 is classified as 0
    correct = torch.sum(torch.eq(outputs, labels)).item() # count how many predictions match the labels
    return correct
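
As a quick sanity check, here is a small hypothetical example (the tensors below are made up for illustration) of how evaluation behaves:

# hypothetical example: three predictions, two of which match the labels
example_outputs = torch.tensor([0.7, 0.2, 0.9])
example_labels = torch.tensor([1., 0., 0.])
print(evaluation(example_outputs, example_labels)) # prints 2 (0.7 -> 1 and 0.2 -> 0 are correct, 0.9 -> 1 is not)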

3. Define word2vec model

The word2vec model converts words into vectors while, almost magically, preserving the similarity between words: words with similar meanings end up with similar vectors. Detailed explanations of the algorithm can be found on CSDN, Zhihu, or Bilibili. Here we use word2vec to turn the text into vectors so that it can be fed into the neural network for learning (neural networks only understand numbers, not English words).

# This block trains the word2vec word embedding.
# Be careful! Training word2vec runs on the CPU and may take more than 10 minutes (I tried, and it really does take a long time).
import os
import numpy as np
import pandas as pd
import argparse
from gensim.models import word2vec

def train_word2vec(x):
    # Train the word2vec word embedding.
    # size is the dimensionality of the word vectors, window is the context window size, min_count drops words that appear fewer than min_count times,
    # workers is the number of threads, iter is the number of training epochs, and sg=1 selects the skip-gram algorithm.
    # Note: this is the gensim 3.x API; in gensim >= 4.0 these arguments are named vector_size and epochs instead of size and iter.
    model = word2vec.Word2Vec(x, size=250, window=5, min_count=5, workers=12, iter=10, sg=1)
    return model

if __name__ == "__main__":
    print("loading training data ...")
    train_x, y = load_training_data('training_label.txt')
    train_x_no_label = load_training_data('training_nolabel.txt')

    print("loading testing data ...")
    test_x = load_testing_data('testing_data.txt')

    model = train_word2vec(train_x + train_x_no_label + test_x)
    
    print("saving model ...")
    # model.save(os.path.join(path_prefix, 'model/w2v_all.model'))
    model.save(os.path.join(path_prefix, 'w2v_all.model')) # saving the model makes the later steps more convenient; it is a good habit
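
To check that the embedding really does capture word similarity, you can load the saved model and query its nearest neighbours. This is an optional sanity check rather than part of the original code; it assumes w2v_all.model was saved as above, and the query word 'happy' is an arbitrary example.

from gensim.models import Word2Vec

w2v = Word2Vec.load(os.path.join(path_prefix, 'w2v_all.model'))
query = 'happy' # arbitrary example word; replace with any word in the vocabulary
if query in w2v.wv:
    # words that appear in similar contexts get similar vectors, so their cosine similarity to the query is high
    print(w2v.wv.most_similar(query, topn=5)) # a list of (word, cosine similarity) pairs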

4. Define data preprocessing class

Because we are working with text data, it must be preprocessed before it can be fed to the model. For convenience, the author wraps the preprocessing into a class. It does the following:

  • Load the previously trained word2vec model and keep its trained embedding (the word vectors learned by word2vec)
  • Add "<PAD>" and "<UNK>" to the embedding_matrix
  • Build the embedding_matrix
  • Pad or truncate every input sentence to the same length, so that batches can be fed into the neural network
  • Implement word2idx, turning each word in a sentence into its corresponding index
  • Convert the labels to tensors

from torch import nn
from gensim.models import Word2Vec

class Preprocess():
    def __init__(self, sentences, sen_len, w2v_path="./w2v.model"): # initialize the attributes of the class
        self.w2v_path = w2v_path
        self.sentences = sentences
        self.sen_len = sen_len
        self.idx2word = []
        self.word2idx = {}
        self.embedding_matrix = []
    def get_w2v_model(self):
        # Load the previously trained word2vec model
        self.embedding = Word2Vec.load(self.w2v_path)
        self.embedding_dim = self.embedding.vector_size # the embedding dimension equals the length of the vectors in the trained word2vec model
    def add_embedding(self, word):
        # Add a word ("<PAD>" or "<UNK>") to the embedding and give it a randomly generated vector.
        # "<PAD>" and "<UNK>" are needed later but cannot be trained by word2vec, and they do not need vectors
        # that reflect their relationship with other words, so random vectors are used.
        vector = torch.empty(1, self.embedding_dim) # create an empty vector
        torch.nn.init.uniform_(vector) # fill it with random values
        self.word2idx[word] = len(self.word2idx) # record the word's index in word2idx
        self.idx2word.append(word) # record the word in idx2word
        self.embedding_matrix = torch.cat([self.embedding_matrix, vector], 0) # append the new vector to embedding_matrix
    def make_embedding(self, load=True):
        print("Get embedding ...")
        # Get trained Word2vec word embedding
        if load:
            print("loading word to vec model ...")
            self.get_w2v_model()
        else:
            raise NotImplementedError
        # Build the word2idx dictionary
        # Build the idx2word list
        # Build the word-to-vector matrix
        # (gensim 3.x API: in gensim >= 4.0, iterate over self.embedding.wv.key_to_index and index vectors with self.embedding.wv[word])
        for i, word in enumerate(self.embedding.wv.vocab):
            print('get words #{}'.format(i+1), end='\r')
            # e.g. self.word2idx['ha'] = 1
            # e.g. self.idx2word[1] = 'ha'
            # e.g. self.embedding_matrix[1] = the vector of 'ha'
            self.word2idx[word] = len(self.word2idx)
            self.idx2word.append(word)
            self.embedding_matrix.append(self.embedding[word])
        print('')
        self.embedding_matrix = torch.tensor(self.embedding_matrix)
        # Add "< pad >" and "< unk >" into embedding
        self.add_embedding("<PAD>")
        self.add_embedding("<UNK>")
        print("total words: {}".format(len(self.embedding_matrix)))
        return self.embedding_matrix
    def pad_sequence(self, sentence):
        # Make every sentence the same length
        if len(sentence) > self.sen_len: # too long: truncate
            sentence = sentence[:self.sen_len]
        else:                            # too short: pad with "<PAD>"
            pad_len = self.sen_len - len(sentence)
            for _ in range(pad_len):
                sentence.append(self.word2idx["<PAD>"])
        assert len(sentence) == self.sen_len
        return sentence
    def sentence_word2idx(self):
        # Turn the words in the sentence into the corresponding index
        sentence_list = []
        for i, sen in enumerate(self.sentences):
            print('sentence count #{}'.format(i+1), end='\r')
            sentence_idx = []
            for word in sen:
                if (word in self.word2idx.keys()):
                    sentence_idx.append(self.word2idx[word])
                else:
                    sentence_idx.append(self.word2idx["<UNK>"])
            # Make each sentence the same length
            sentence_idx = self.pad_sequence(sentence_idx)
            sentence_list.append(sentence_idx)
        return torch.LongTensor(sentence_list)
    def labels_to_tensor(self, y):
        # Turn labels into tensor
        y = [int(label) for label in y]
        return torch.LongTensor(y)
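
As a quick illustration of how this class is used (this mirrors the later sections and assumes training_label.txt and the saved w2v_all.model are available):

train_x, y = load_training_data('training_label.txt')
preprocess = Preprocess(train_x, sen_len=30, w2v_path='./w2v_all.model')
embedding = preprocess.make_embedding(load=True) # (vocab_size + 2, embedding_dim) tensor, including <PAD> and <UNK>
train_x = preprocess.sentence_word2idx()         # (num_sentences, sen_len) LongTensor of word indices
y = preprocess.labels_to_tensor(y)               # (num_sentences,) LongTensor of 0/1 labels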

5. Create Dataset

This step is quite simple: we just wrap the data in a Dataset class.

# A custom Dataset needs to implement '__init__', '__getitem__', and '__len__'
# so that it can be used by a DataLoader
import torch
from torch.utils import data

class TwitterDataset(data.Dataset):
    """
    Expected data shape like:(data_num, data_len)
    Data can be a list of numpy array or a list of lists
    input data shape : (data_num, seq_len, feature_dim)
    
    __len__ will return the number of data
    """
    def __init__(self, X, y):
        self.data = X
        self.label = y
    def __getitem__(self, idx):
        if self.label is None: return self.data[idx]
        return self.data[idx], self.label[idx]
    def __len__(self):
        return len(self.data)

6. Establish model

The model we build consists of three main parts:

  • embedding layer
  • LSTM
  • Fully connected neural network

The embedding layer can be understood as encoding our text into a form the LSTM can work with; here it is built from the word2vec model trained above.

The main arguments to the LSTM are:

  • input_size: the feature dimension of the input, i.e. the length of the vector fed in at each time step (here, the embedding dimension)
  • hidden_size: the dimension of the hidden state, i.e. the number of hidden units, similar to the width of a single-layer perceptron
  • num_layers: the number of stacked LSTM layers; the default is 1. If it is set to 2, the second LSTM takes the outputs of the first LSTM as its inputs
  • batch_first: whether the batch dimension comes first in the input and output tensors; the default is False. In PyTorch the data is usually fed in through a DataLoader, whose batch_size argument controls how many samples are passed in at once, so the LSTM always receives a batch of data; batch_first only decides whether that batch dimension is the first one, i.e. whether the shape is (batch, seq_len, feature) or (seq_len, batch, feature)
  • dropout: the default is 0; if nonzero, a dropout layer is added after every LSTM layer except the last one
  • bidirectional: whether the LSTM is bidirectional; the default is False. If True, num_directions = 2, otherwise 1

The fully connected network maps the LSTM output to the final prediction.
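
To make the shapes concrete, here is a minimal sketch (not part of the original code; the numbers are arbitrary) showing what nn.LSTM returns when batch_first=True:

import torch
from torch import nn

# a hypothetical batch of 4 sequences, each 30 time steps long, with 250-dimensional features
dummy = torch.randn(4, 30, 250)
lstm = nn.LSTM(input_size=250, hidden_size=150, num_layers=1, batch_first=True)
out, (h_n, c_n) = lstm(dummy)
print(out.shape) # torch.Size([4, 30, 150]) -> (batch, seq_len, hidden_size), because batch_first=True
print(h_n.shape) # torch.Size([1, 4, 150]) -> (num_layers * num_directions, batch, hidden_size)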

import torch
from torch import nn
class LSTM_Net(nn.Module):
    def __init__(self, embedding, embedding_dim, hidden_dim, num_layers, dropout=0.5, fix_embedding=True):
        super(LSTM_Net, self).__init__()
        # Build the embedding layer
        self.embedding = torch.nn.Embedding(embedding.size(0),embedding.size(1))
        self.embedding.weight = torch.nn.Parameter(embedding) # initialize the embedding layer with the vectors we trained with word2vec
        # Whether to fix the embedding; if fix_embedding is False, the embedding is also updated during training
        self.embedding.weight.requires_grad = False if fix_embedding else True
        self.embedding_dim = embedding.size(1)
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout = dropout
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.classifier = nn.Sequential( nn.Dropout(dropout),
                                         nn.Linear(hidden_dim, 1),
                                         nn.Sigmoid() )
    def forward(self, inputs):
        inputs = self.embedding(inputs)
        x, _ = self.lstm(inputs, None)
        # dimension of x: (batch, seq_len, hidden_size)
        # take the output of the last time step (my understanding: the last output best summarizes the whole sentence)
        x = x[:, -1, :] 
        x = self.classifier(x)
        return x
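
As a small sanity check (not in the original code; the embedding matrix and batch below are made up), the model maps a batch of word-index sequences to one probability per sentence, which is why outputs.squeeze() is called in the training loop later:

dummy_embedding = torch.randn(1000, 250)      # hypothetical vocabulary of 1000 words with 250-dimensional vectors
net = LSTM_Net(dummy_embedding, embedding_dim=250, hidden_dim=150, num_layers=1, dropout=0.5, fix_embedding=True)
dummy_batch = torch.randint(0, 1000, (4, 30)) # 4 sentences, each a sequence of 30 word indices
print(net(dummy_batch).shape)                 # torch.Size([4, 1]) -> one probability per sentence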

7. Define model training function

The training loop follows the standard PyTorch pattern used before. The comments are detailed, so the code can be understood by reading them.

import torch
from torch import nn
import torch.optim as optim
import torch.nn.functional as F

def training(batch_size, n_epoch, lr, model_dir, train, valid, model, device):
    total = sum(p.numel() for p in model.parameters())#Total parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)#Parameters to be trained
    print('\nstart training, parameter total:{}, trainable:{}\n'.format(total, trainable))#Look at the parameters of the model
    model.train() # Set the mode of the model to train, so that optimizer can update the parameters of the model
    criterion = nn.BCELoss() # define the loss function; here we use binary cross-entropy loss
    t_batch = len(train) 
    v_batch = len(valid) 
    optimizer = optim.Adam(model.parameters(), lr=lr) # Pass the parameters of the model to optimizer and assign appropriate learning rate
    total_loss, total_acc, best_acc = 0, 0, 0
    for epoch in range(n_epoch):
        total_loss, total_acc = 0, 0
        # Do training
        for i, (inputs, labels) in enumerate(train):
            inputs = inputs.to(device, dtype=torch.long) # device is "cuda", so inputs become torch.cuda.LongTensor
            labels = labels.to(device, dtype=torch.float) # labels become torch.cuda.FloatTensor; they are fed to the criterion, so they must be float
            optimizer.zero_grad() # gradients from loss.backward() accumulate, so they must be zeroed for every batch
            outputs = model(inputs) # input to model
            outputs = outputs.squeeze() # Remove the outermost dimension so that outputs can be put into criterion()
            loss = criterion(outputs, labels) # Calculate the training loss of the model at this time
            loss.backward() # Calculate the gradient of loss
            optimizer.step() # Update parameters of training model
            correct = evaluation(outputs, labels) # Calculate the training accuracy of the model at this time
            total_acc += (correct / batch_size)
            total_loss += loss.item()
            print('[ Epoch{}: {}/{} ] loss:{:.3f} acc:{:.3f} '.format(
            	epoch+1, i+1, t_batch, loss.item(), correct*100/batch_size), end='\r')
        print('\nTrain | Loss:{:.5f} Acc: {:.3f}'.format(total_loss/t_batch, total_acc/t_batch*100))

        # This paragraph does validation
        model.eval() # switch the model to eval mode (e.g. dropout is disabled); together with torch.no_grad() below, no parameters are updated during validation
        with torch.no_grad():
            total_loss, total_acc = 0, 0
            for i, (inputs, labels) in enumerate(valid):
                inputs = inputs.to(device, dtype=torch.long) 
                labels = labels.to(device, dtype=torch.float)  
                outputs = model(inputs) 
                outputs = outputs.squeeze() 
                loss = criterion(outputs, labels) 
                correct = evaluation(outputs, labels) 
                total_acc += (correct / batch_size)
                total_loss += loss.item()

            print("Valid | Loss:{:.5f} Acc: {:.3f} ".format(total_loss/v_batch, total_acc/v_batch*100))
            if total_acc > best_acc:
                # If the result of validation is better than all the previous results, save the current model for subsequent prediction
                best_acc = total_acc
                #torch.save(model, "{}/val_acc_{:.3f}.model".format(model_dir,total_acc/v_batch*100))
                torch.save(model, "{}/ckpt.model".format(model_dir))
                print('saving model with acc {:.3f}'.format(total_acc/v_batch*100))
        print('-----------------------------------------------')
        model.train() # switch the model back to train mode so that the optimizer can keep updating its parameters (it was just set to eval mode)

8. Define model test function

import torch
from torch import nn
import torch.optim as optim
import torch.nn.functional as F

def testing(batch_size, test_loader, model, device):
    model.eval()
    ret_output = []
    with torch.no_grad():
        for i, inputs in enumerate(test_loader):
            inputs = inputs.to(device, dtype=torch.long)
            outputs = model(inputs)
            outputs = outputs.squeeze()
            outputs[outputs>=0.5] = 1 # probability >= 0.5 is classified as 1
            outputs[outputs<0.5] = 0 # probability < 0.5 is classified as 0
            ret_output += outputs.int().tolist()
    
    return ret_output

9. Call the previous functions to start training

  • Work out the paths of the data files
  • Define the sentence length, whether to fix the embedding, the batch size, the number of epochs to train, the learning rate, and the path for saving the model
  • Read in the data
  • Preprocess the inputs and labels
  • Create the model object
  • Split the data into training data and validation data (part of the training data is used as validation data)
  • Wrap the data in Dataset objects so the DataLoader can access them
  • Convert the data into batches of tensors with DataLoaders
  • Start training
import os
import torch
import argparse
import numpy as np
from torch import nn
from gensim.models import word2vec
from sklearn.model_selection import train_test_split

# Use the return value of torch.cuda.is_available() to check whether a GPU is available; if so, device is set to "cuda", otherwise to "cpu"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Sort out the paths of each data
train_with_label = os.path.join(path_prefix, 'training_label.txt')
train_no_label = os.path.join(path_prefix, 'training_nolabel.txt')
testing_data = os.path.join(path_prefix, 'testing_data.txt')

w2v_path = os.path.join(path_prefix, 'w2v_all.model') # path of the trained word2vec model

# Define sentence length, whether to fix embedding, batch size, number of rounds to be trained, epoch, value of learning rate, and data saving path of model
sen_len = 30
fix_embedding = True # fix embedding during training
batch_size = 128
epoch = 5
lr = 0.001
# model_dir = os.path.join(path_prefix, 'model/') # model directory for checkpoint model
model_dir = path_prefix # model directory for checkpoint model

print("loading data ...") # Put 'training_label.txt 'and' training_nolabel.txt 'read in
train_x, y = load_training_data(train_with_label)
train_x_no_label = load_training_data(train_no_label)

# Preprocess input and labels
preprocess = Preprocess(train_x, sen_len, w2v_path=w2v_path)
embedding = preprocess.make_embedding(load=True)
train_x = preprocess.sentence_word2idx()
y = preprocess.labels_to_tensor(y)

# Make a model object
model = LSTM_Net(embedding, embedding_dim=250, hidden_dim=250, num_layers=1, dropout=0.5, fix_embedding=fix_embedding)
model = model.to(device) # device is "cuda", so the model is trained on the GPU (the inputs fed in must also be cuda tensors)

# Divide data into training data and validation data (take part of training data as validation data)
X_train, X_val, y_train, y_val = train_x[:190000], train_x[190000:], y[:190000], y[190000:]

# Make the data into a dataset for the data loader to access
train_dataset = TwitterDataset(X=X_train, y=y_train)
val_dataset = TwitterDataset(X=X_val, y=y_val)

# Convert data to batch of tensors
train_loader = torch.utils.data.DataLoader(dataset = train_dataset,
                                            batch_size = batch_size,
                                            shuffle = True,
                                            num_workers = 8)

val_loader = torch.utils.data.DataLoader(dataset = val_dataset,
                                            batch_size = batch_size,
                                            shuffle = False,
                                            num_workers = 8)

# Start training
training(batch_size, epoch, lr, model_dir, train_loader, val_loader, model, device)

The training loss and accuracy for each epoch are printed while training runs.

10. Make predictions and save results

print("loading testing data ...")
test_x = load_testing_data(testing_data)
preprocess = Preprocess(test_x, sen_len, w2v_path=w2v_path)
embedding = preprocess.make_embedding(load=True)
test_x = preprocess.sentence_word2idx()
test_dataset = TwitterDataset(X=test_x, y=None)
test_loader = torch.utils.data.DataLoader(dataset = test_dataset,
                                            batch_size = batch_size,
                                            shuffle = False,
                                            num_workers = 8)
print('\nload model ...')
model = torch.load(os.path.join(model_dir, 'ckpt.model'))
outputs = testing(batch_size, test_loader, model, device)

# Save to csv
tmp = pd.DataFrame({"id":[str(i) for i in range(len(test_x))],"label":outputs})
print("save csv ...")
tmp.to_csv(os.path.join(path_prefix, 'predict.csv'), index=False)
print("Finish Predicting")

