# 0. Introduction

To be clear up front: the code comes from Prof. Li Hongyi's course and is not my own. This article mainly adds explanations and comments to the original code and converts the traditional Chinese characters to simplified ones.

The task is to classify text comments from Twitter as positive or negative.

The model consists of three parts: a word embedding layer, an LSTM, and a fully connected classifier.

The word embedding converts words into vectors so that they can be fed to the LSTM for training. In the code below, the author uses a word2vec model (skip-gram, CBOW, etc.) for this conversion. For the details of the algorithm, search for articles by experts on CSDN or Bilibili.

```python
path_prefix = './'
!gdown --id '1lz0Wtwxsh5YCPdqQ3E3l_nbfJT1N13V8' --output data.zip
!unzip data.zip
!ls
```

```python
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
```

Because the data is not stored in a standard format, we need to write our own reading functions.

```python
import torch
import numpy as np
import pandas as pd
import torch.optim as optim
import torch.nn.functional as F

def load_training_data(path='training_label.txt'):
    # Read in the data required for training.
    # 'training_label.txt' contains labels, so they are read as well;
    # 'training_nolabel.txt' has no labels, so only the text is read.
    if 'training_label' in path:  # the file name tells us whether labels need to be read
        # Common way to read text data stored in a txt file
        with open(path, 'r') as f:
            lines = f.readlines()
            lines = [line.strip('\n').split(' ') for line in lines]
        x = [line[2:] for line in lines]  # everything from index 2 onward is the text
        y = [line[0] for line in lines]   # the first column is the label
        return x, y
    else:
        with open(path, 'r') as f:
            lines = f.readlines()
            x = [line.strip('\n').split(' ') for line in lines]
        return x

def load_testing_data(path='testing_data'):
    # Read in the data required for testing
    with open(path, 'r') as f:
        lines = f.readlines()
        # Skip the first line and drop the first comma-separated field of each line
        X = ["".join(line.strip('\n').split(",")[1:]).strip() for line in lines[1:]]
        X = [sen.split(' ') for sen in X]
    return X

def evaluation(outputs, labels):
    # Custom evaluation function: counts the correctly classified samples
    # outputs => probability (float)
    # labels => labels
    outputs[outputs >= 0.5] = 1  # probability >= 0.5 is predicted as class 1
    outputs[outputs < 0.5] = 0   # probability < 0.5 is predicted as class 0
    correct = torch.sum(torch.eq(outputs, labels)).item()
    return correct
```
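To see why the text starts at index 2 of each labeled line, here is a minimal sketch of how a single line is split. It assumes each line has the layout `label separator word word ...`; the sample line, the `+++$+++` separator token, and the helper name `parse_labeled_line` are illustrative assumptions rather than something stated above.

```python
def parse_labeled_line(line):
    """Split one labeled line into (label, tokens).

    Assumed layout (hypothetical sample): '<label> <separator> <word> <word> ...'
    """
    fields = line.strip('\n').split(' ')
    label = fields[0]    # first field is the label
    tokens = fields[2:]  # field 1 is the separator, the text starts at field 2
    return label, tokens


sample = "1 +++$+++ thanks for the follow\n"  # made-up example line
label, tokens = parse_labeled_line(sample)
print(label, tokens)
```

Note that the label is still a string at this point; it is converted to an integer later during preprocessing.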

# 3. Define word2vec model

The word2vec model converts words into vectors and, remarkably, preserves the similarity between words. For the details of the algorithm, search for articles by experts on CSDN, Zhihu, or Bilibili. Here we use word2vec to convert the text into vectors that can be fed to the neural network for learning. (Neural networks only work with numbers, not English words.)

```python
# This block trains the word2vec word embedding
# Note: word2vec training runs on the CPU and may take more than 10 minutes (I tried it, and it really does take a long time)
import os
import numpy as np
import pandas as pd
import argparse
from gensim.models import word2vec

def train_word2vec(x):
    # Train the word2vec word embedding
    # size: dimensionality of the word vectors; window: context window length;
    # min_count: ignore words that appear fewer than this many times;
    # workers: number of threads; iter: number of training epochs; sg=1: use skip-gram
    # (these are the gensim < 4.0 argument names; gensim >= 4.0 renames size to vector_size and iter to epochs)
    model = word2vec.Word2Vec(x, size=250, window=5, min_count=5, workers=12, iter=10, sg=1)
    return model

if __name__ == "__main__":
    print("loading training data ...")
    train_x, y = load_training_data(os.path.join(path_prefix, 'training_label.txt'))
    train_x_no_label = load_training_data(os.path.join(path_prefix, 'training_nolabel.txt'))

    print("loading testing data ...")
    test_x = load_testing_data(os.path.join(path_prefix, 'testing_data.txt'))

    model = train_word2vec(train_x + train_x_no_label + test_x)

    print("saving model ...")
    # model.save(os.path.join(path_prefix, 'model/w2v_all.model'))
    model.save(os.path.join(path_prefix, 'w2v_all.model'))  # saving the model makes later training more convenient, which is a good habit
```
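The claim that word vectors preserve similarity is usually checked with cosine similarity. The sketch below uses hand-made 3-dimensional toy vectors (an assumption for illustration; they are not output of the word2vec model trained above) to show how the measure behaves for related and unrelated words.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made vectors, for illustration only.
toy_vectors = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad":   [-0.7, 0.3, 0.1],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_bad = cosine_similarity(toy_vectors["good"], toy_vectors["bad"])
print(sim_good_great > sim_good_bad)  # True
```

With a trained model, the same comparison would be made on the learned 250-dimensional vectors instead of these toy values.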

# 4. Define data preprocessing class

Because we are working with text data, it has to be preprocessed. For convenience in later steps, the author wraps the preprocessing in a class. It does the following:

• Loads the previously trained word2vec model and keeps the trained embedding (the word vectors learned by word2vec)
• Builds the embedding_matrix
• Pads or truncates every input sentence to the same length so that it can be fed to the neural network
• Implements word2idx, turning the words in a sentence into their indices
• Converts the labels to tensors
```python
import torch
from torch import nn
from gensim.models import Word2Vec

class Preprocess():
    def __init__(self, sentences, sen_len, w2v_path="./w2v.model"):  # first, define the attributes of the class
        self.w2v_path = w2v_path
        self.sentences = sentences
        self.sen_len = sen_len
        self.idx2word = []
        self.word2idx = {}
        self.embedding_matrix = []
    def get_w2v_model(self):
        # Load the previously trained word2vec model
        self.embedding = Word2Vec.load(self.w2v_path)
        self.embedding_dim = self.embedding.vector_size  # the embedding dimension is the vector length of the trained word2vec model
    def add_embedding(self, word):
        # Add a word ("<PAD>" or "<UNK>") to the embedding and give it a randomly generated vector.
        # These tokens are needed later but are not trained by word2vec, and their vectors do not
        # need to reflect any relationship to other words, so they are generated randomly.
        vector = torch.empty(1, self.embedding_dim)  # allocate an uninitialized vector
        torch.nn.init.uniform_(vector)  # fill it with random values
        self.word2idx[word] = len(self.word2idx)  # register the index in word2idx
        self.idx2word.append(word)  # register the word in idx2word
        self.embedding_matrix = torch.cat([self.embedding_matrix, vector], 0)  # append the new vector to embedding_matrix
    def make_embedding(self, load=True):
        print("Get embedding ...")
        # Get the trained word2vec word embedding
        if load:
            print("loading word to vec model ...")
            self.get_w2v_model()
        else:
            raise NotImplementedError
        # Build the word2idx dictionary, the idx2word list, and the list of word vectors
        # (wv.vocab is the gensim < 4.0 vocabulary attribute)
        for i, word in enumerate(self.embedding.wv.vocab):
            print('get words #{}'.format(i+1), end='\r')
            # e.g. self.word2idx['ha'] = 1
            # e.g. self.idx2word[1] = 'ha'
            # e.g. self.embedding_matrix[1] = vector of 'ha'
            self.word2idx[word] = len(self.word2idx)
            self.idx2word.append(word)
            self.embedding_matrix.append(self.embedding.wv[word])
        print('')
        self.embedding_matrix = torch.tensor(self.embedding_matrix)
        # Add "<PAD>" and "<UNK>" to the embedding
        self.add_embedding("<PAD>")
        self.add_embedding("<UNK>")
        print("total words: {}".format(len(self.embedding_matrix)))
        return self.embedding_matrix
    def pad_sequence(self, sentence):
        # Make each sentence the same length
        if len(sentence) > self.sen_len:  # too long: truncate
            sentence = sentence[:self.sen_len]
        else:  # too short: pad with "<PAD>"
            pad_len = self.sen_len - len(sentence)
            for _ in range(pad_len):
                sentence.append(self.word2idx["<PAD>"])
        assert len(sentence) == self.sen_len
        return sentence
    def sentence_word2idx(self):
        # Turn the words in each sentence into their indices
        sentence_list = []
        for i, sen in enumerate(self.sentences):
            print('sentence count #{}'.format(i+1), end='\r')
            sentence_idx = []
            for word in sen:
                if word in self.word2idx:
                    sentence_idx.append(self.word2idx[word])
                else:
                    sentence_idx.append(self.word2idx["<UNK>"])
            # Make each sentence the same length
            sentence_idx = self.pad_sequence(sentence_idx)
            sentence_list.append(sentence_idx)
        return torch.LongTensor(sentence_list)
    def labels_to_tensor(self, y):
        # Turn the labels into a tensor
        y = [int(label) for label in y]
        return torch.LongTensor(y)
```
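The index lookup and the length normalization described in the list above can be sketched in a few lines of plain Python. The toy vocabulary and the helper name `encode_sentence` are assumptions for illustration; in the real pipeline the indices come from the word2vec vocabulary.

```python
def encode_sentence(tokens, word2idx, sen_len):
    """Map tokens to indices, then truncate or pad to exactly sen_len."""
    unk, pad = word2idx["<UNK>"], word2idx["<PAD>"]
    idx = [word2idx.get(tok, unk) for tok in tokens]  # unknown words -> <UNK>
    idx = idx[:sen_len]                               # too long: truncate
    idx += [pad] * (sen_len - len(idx))               # too short: pad
    return idx

# Toy vocabulary (hypothetical).
word2idx = {"i": 0, "love": 1, "this": 2, "<PAD>": 3, "<UNK>": 4}

print(encode_sentence(["i", "love", "pizza"], word2idx, sen_len=5))
# [0, 1, 4, 3, 3]
print(encode_sentence(["i", "love", "this", "i", "love", "this"], word2idx, sen_len=4))
# [0, 1, 2, 0]
```

Every encoded sentence has the same length, so a list of them can be stacked into a single integer tensor.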

# 5. Create Dataset

This step is relatively simple: we only need to define the Dataset class.

```python
# A dataset must implement '__init__', '__getitem__', and '__len__'
# so that the DataLoader can use it
import torch
from torch.utils import data

class TwitterDataset(data.Dataset):
    """
    Expected data shape like:(data_num, data_len)
    Data can be a list of numpy array or a list of lists
    input data shape : (data_num, seq_len, feature_dim)

    __len__ will return the number of data
    """
    def __init__(self, X, y):
        self.data = X
        self.label = y
    def __getitem__(self, idx):
        # Without labels (test data), return only the sample
        if self.label is None: return self.data[idx]
        return self.data[idx], self.label[idx]
    def __len__(self):
        return len(self.data)
```
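To make the role of `__getitem__` and `__len__` concrete, here is a plain-Python sketch of how a loader walks over a dataset in batches. `ToyDataset` and `iterate_batches` are hypothetical stand-ins; the real `DataLoader` additionally shuffles, collates samples into tensors, and can use worker processes.

```python
class ToyDataset:
    """Plain-Python stand-in with the two methods a loader relies on."""
    def __init__(self, X, y=None):
        self.data = X
        self.label = y

    def __getitem__(self, idx):
        if self.label is None:
            return self.data[idx]
        return self.data[idx], self.label[idx]

    def __len__(self):
        return len(self.data)


def iterate_batches(dataset, batch_size):
    """Yield lists of consecutive samples; a simplified, unshuffled loader."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


toy = ToyDataset(X=[[1, 2], [3, 4], [5, 6]], y=[1, 0, 1])
batches = list(iterate_batches(toy, batch_size=2))
print(batches)
# [[([1, 2], 1), ([3, 4], 0)], [([5, 6], 1)]]
```

The last batch is smaller than `batch_size` whenever the dataset size is not a multiple of it.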

# 6. Establish model

The LSTM model built here consists of three parts:

• Embedding layer
• LSTM
• Fully connected neural network

The embedding layer encodes our text into a form the LSTM can process. Its weights come from the word2vec model trained earlier.

The main arguments of the LSTM are:

• input_size: the feature dimension of the input, that is, the number of elements in the vector at each time step
• hidden_size: the dimension of the hidden state, that is, the number of hidden units, similar to the hidden layer of a single-layer perceptron
• num_layers: the number of stacked LSTM layers. The default is 1. If it is set to 2, the second LSTM layer takes the outputs of the first as its input.
• batch_first: whether the first dimension of the input and output tensors is batch_size. The default is False, in which case the LSTM expects input of shape (seq_len, batch, input_size). The DataLoader in torch yields batches whose first dimension is batch_size, so we set batch_first=True to make the LSTM read the input as (batch, seq_len, input_size).
• dropout: the default is 0. If non-zero, a dropout layer is added after the output of every LSTM layer except the last.
• bidirectional: whether the LSTM is bidirectional. The default is False. If True, num_directions=2; otherwise 1.
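The effect of these arguments on tensor shapes can be summarized with a small helper. This is a sketch of the shape conventions documented for `torch.nn.LSTM` (without projections), written in plain Python; the function name `lstm_shapes` is hypothetical and it performs no actual LSTM computation.

```python
def lstm_shapes(batch, seq_len, hidden_size, num_layers=1,
                batch_first=False, bidirectional=False):
    """Return (output_shape, h_n_shape) for the given LSTM settings."""
    num_directions = 2 if bidirectional else 1
    if batch_first:
        output = (batch, seq_len, num_directions * hidden_size)
    else:
        output = (seq_len, batch, num_directions * hidden_size)
    # The final hidden state is not affected by batch_first.
    h_n = (num_directions * num_layers, batch, hidden_size)
    return output, h_n


# Settings matching this article: batch of 128, sentences of 30 tokens, 250 hidden units.
print(lstm_shapes(128, 30, 250, batch_first=True))
# ((128, 30, 250), (1, 128, 250))
```

With `batch_first=True`, indexing the output as `x[:, -1, :]` selects the last time step for every sample in the batch.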

The fully connected neural network converts the output of the LSTM into the final prediction.

```python
import torch
from torch import nn

class LSTM_Net(nn.Module):
    def __init__(self, embedding, embedding_dim, hidden_dim, num_layers, dropout=0.5, fix_embedding=True):
        super(LSTM_Net, self).__init__()
        # Build the embedding layer
        self.embedding = torch.nn.Embedding(embedding.size(0), embedding.size(1))
        # Initialize the embedding layer with the weights trained earlier with word2vec
        self.embedding.weight = torch.nn.Parameter(embedding)
        # Whether to freeze the embedding. If fix_embedding is False, the embedding is also trained
        self.embedding.weight.requires_grad = False if fix_embedding else True
        self.embedding_dim = embedding.size(1)
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout = dropout
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.classifier = nn.Sequential(nn.Dropout(dropout),
                                        nn.Linear(hidden_dim, 1),
                                        nn.Sigmoid())
    def forward(self, inputs):
        inputs = self.embedding(inputs)
        x, _ = self.lstm(inputs, None)
        # x has shape (batch, seq_len, hidden_size)
        # Take the LSTM output at the last time step (my understanding is that it summarizes the whole text best)
        x = x[:, -1, :]
        x = self.classifier(x)
        return x
```
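The classifier ends with a sigmoid, which is what allows its output to be read as a probability. A minimal plain-Python sketch of that function (the input values are arbitrary examples):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3))
```

An input of 0 maps to exactly 0.5, so thresholding the probability at 0.5 is equivalent to checking the sign of the value fed into the sigmoid.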

# 7. Define model training function

The training process is similar to those seen before. The comments are detailed enough to follow the code on their own.

```python
import torch
from torch import nn
import torch.optim as optim
import torch.nn.functional as F

def training(batch_size, n_epoch, lr, model_dir, train, valid, model, device):
    total = sum(p.numel() for p in model.parameters())  # total number of parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)  # parameters that will be trained
    print('\nstart training, parameter total:{}, trainable:{}\n'.format(total, trainable))  # show the parameter counts of the model
    model.train()  # training mode: layers such as dropout are active
    criterion = nn.BCELoss()  # loss function: binary cross entropy loss
    t_batch = len(train)  # number of training batches
    v_batch = len(valid)  # number of validation batches
    optimizer = optim.Adam(model.parameters(), lr=lr)  # give the model parameters to the optimizer and set the learning rate
    total_loss, total_acc, best_acc = 0, 0, 0
    for epoch in range(n_epoch):
        total_loss, total_acc = 0, 0
        # Training
        for i, (inputs, labels) in enumerate(train):
            inputs = inputs.to(device, dtype=torch.long)  # move to device as a LongTensor (word indices)
            labels = labels.to(device, dtype=torch.float)  # move to device as a FloatTensor, because criterion expects float
            optimizer.zero_grad()  # gradients from loss.backward() accumulate, so reset them for every batch
            outputs = model(inputs)  # feed the inputs to the model
            outputs = outputs.squeeze()  # drop the size-1 dimension so that outputs matches labels in criterion()
            loss = criterion(outputs, labels)  # training loss of the current batch
            loss.backward()  # compute the gradients of the loss
            optimizer.step()  # update the model parameters
            correct = evaluation(outputs, labels)  # number of correct predictions in the current batch
            total_acc += (correct / batch_size)
            total_loss += loss.item()
            print('[ Epoch{}: {}/{} ] loss:{:.3f} acc:{:.3f} '.format(
                epoch+1, i+1, t_batch, loss.item(), correct*100/batch_size), end='\r')
        print('\nTrain | Loss:{:.5f} Acc: {:.3f}'.format(total_loss/t_batch, total_acc/t_batch*100))

        # Validation
        model.eval()  # evaluation mode: dropout is disabled
        with torch.no_grad():  # no gradients are needed for validation
            total_loss, total_acc = 0, 0
            for i, (inputs, labels) in enumerate(valid):
                inputs = inputs.to(device, dtype=torch.long)
                labels = labels.to(device, dtype=torch.float)
                outputs = model(inputs)
                outputs = outputs.squeeze()
                loss = criterion(outputs, labels)
                correct = evaluation(outputs, labels)
                total_acc += (correct / batch_size)
                total_loss += loss.item()

            print("Valid | Loss:{:.5f} Acc: {:.3f} ".format(total_loss/v_batch, total_acc/v_batch*100))
            if total_acc > best_acc:
                # If the validation result is better than all previous ones, save the current model for later prediction
                best_acc = total_acc
                #torch.save(model, "{}/val_acc_{:.3f}.model".format(model_dir,total_acc/v_batch*100))
                torch.save(model, "{}/ckpt.model".format(model_dir))
                print('saving model with acc {:.3f}'.format(total_acc/v_batch*100))
        print('-----------------------------------------------')
        model.train()  # switch back to training mode (it was just set to eval mode)
```
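The binary cross entropy loss used above can be written out by hand. The sketch below computes the mean of -[y·log(p) + (1-y)·log(1-p)] in plain Python on made-up probabilities; it illustrates the formula and is not a replacement for `nn.BCELoss`.

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(round(confident_right, 4), round(confident_wrong, 4))
# 0.1054 2.3026
```

Confident correct predictions give a loss close to 0, while confident wrong predictions are penalized heavily, which is what drives the probabilities toward the labels during training.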

# 8. Define model test function

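```python
import torch
from torch import nn
import torch.optim as optim
import torch.nn.functional as F

def testing(batch_size, test_loader, model, device):
    model.eval()  # evaluation mode: dropout is disabled
    ret_output = []
    with torch.no_grad():  # no gradients are needed for prediction
        for i, inputs in enumerate(test_loader):
            inputs = inputs.to(device, dtype=torch.long)
            outputs = model(inputs)
            outputs = outputs.squeeze()
            outputs[outputs >= 0.5] = 1  # probability >= 0.5 is predicted as class 1
            outputs[outputs < 0.5] = 0   # probability < 0.5 is predicted as class 0
            ret_output += outputs.int().tolist()

    return ret_output
```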
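The thresholding step of the test function can be illustrated without a model. In this sketch, `predict_labels` is a hypothetical helper that receives batches of made-up probabilities and collects the 0/1 predictions into one flat list, in the same order as the input.

```python
def predict_labels(prob_batches, threshold=0.5):
    """Turn batches of probabilities into one flat list of 0/1 predictions."""
    ret_output = []
    for probs in prob_batches:
        ret_output += [1 if p >= threshold else 0 for p in probs]
    return ret_output

print(predict_labels([[0.8, 0.3], [0.5]]))
# [1, 0, 1]
```

Because the order is preserved, the i-th prediction corresponds to the i-th test sentence as long as the test loader does not shuffle.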

# 9. Call the previous functions to start training

• Set up the paths of the data files
• Define the sentence length, whether to fix the embedding, the batch size, the number of training epochs, the learning rate, and the directory where the model is saved
• Preprocess the inputs and labels
• Create the model object
• Split the data into training data and validation data (part of the training data is held out as validation data)
• Wrap the data in datasets for the DataLoader to read
• Turn the data into batches of tensors
• Start training
```python
import os
import torch
import argparse
import numpy as np
from torch import nn
from gensim.models import word2vec
from sklearn.model_selection import train_test_split

# Use the GPU ("cuda") if torch.cuda.is_available() returns True, otherwise fall back to the CPU ("cpu")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set up the paths of the data files
train_with_label = os.path.join(path_prefix, 'training_label.txt')
train_no_label = os.path.join(path_prefix, 'training_nolabel.txt')
testing_data = os.path.join(path_prefix, 'testing_data.txt')

w2v_path = os.path.join(path_prefix, 'w2v_all.model')  # path of the word2vec model

# Define the sentence length, whether to fix the embedding, the batch size, the number of training epochs, the learning rate, and the directory where the model is saved
sen_len = 30
fix_embedding = True  # fix embedding during training
batch_size = 128
epoch = 5
lr = 0.001
# model_dir = os.path.join(path_prefix, 'model/') # model directory for checkpoint model
model_dir = path_prefix  # model directory for checkpoint model

print("loading data ...")  # read in the training data
train_x, y = load_training_data(train_with_label)
train_x_no_label = load_training_data(train_no_label)

# Preprocess the inputs and labels
preprocess = Preprocess(train_x, sen_len, w2v_path=w2v_path)
embedding = preprocess.make_embedding(load=True)
train_x = preprocess.sentence_word2idx()
y = preprocess.labels_to_tensor(y)

# Create the model object
model = LSTM_Net(embedding, embedding_dim=250, hidden_dim=250, num_layers=1, dropout=0.5, fix_embedding=fix_embedding)
model = model.to(device)  # if device is "cuda", the model trains on the GPU (the inputs must then be cuda tensors too)

# Split the data into training data and validation data (part of the training data is held out as validation data)
X_train, X_val, y_train, y_val = train_x[:190000], train_x[190000:], y[:190000], y[190000:]

# Wrap the data in datasets for the DataLoader to read
train_dataset = TwitterDataset(X=X_train, y=y_train)
val_dataset = TwitterDataset(X=X_val, y=y_val)

# Turn the data into batches of tensors
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True,
                                           num_workers=8)

val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
                                         batch_size=batch_size,
                                         shuffle=False,
                                         num_workers=8)

# Start training
training(batch_size, epoch, lr, model_dir, train_loader, val_loader, model, device)
```
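The split into training and validation data is a plain positional hold-out: the first part of the data is used for training and the remainder for validation. A minimal sketch on toy data follows; `holdout_split` is a hypothetical helper, and the same slicing works on lists and on tensors.

```python
def holdout_split(x, y, n_train):
    """Split by position: the first n_train samples train, the rest validate."""
    return x[:n_train], x[n_train:], y[:n_train], y[n_train:]

x = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
y = [1, 0, 1, 1, 0]
X_train, X_val, y_train, y_val = holdout_split(x, y, n_train=4)
print(len(X_train), len(X_val))
# 4 1
```

A positional split keeps the original file order, so it is only representative if the labels are not sorted in the file; a shuffled split is a common alternative when that cannot be assumed.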

The training results are as follows:

# 10. Make predictions and save results

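```python
print("loading testing data ...")
test_x = load_testing_data(testing_data)
preprocess = Preprocess(test_x, sen_len, w2v_path=w2v_path)
embedding = preprocess.make_embedding(load=True)
test_x = preprocess.sentence_word2idx()
test_dataset = TwitterDataset(X=test_x, y=None)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False,
                                          num_workers=8)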