2021SC@SDUSC

## TextCNN

TextCNN is a model proposed by Yoon Kim in 2014. It pioneered the use of CNNs to encode n-gram features for text classification.

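The model structure is shown in the figure. Convolution on images is two-dimensional, while TextCNN uses a "one-dimensional" convolution: each kernel has shape filter_size × embedding_dim, so one of its dimensions always equals the embedding dimension and the kernel slides only along the sequence. A kernel with a given filter_size therefore covers filter_size consecutive tokens, which is how it extracts n-gram information. Taking one sample as an example, the overall forward logic is: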

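Embedding gives a matrix of shape [seq_length, embedding_dim]; each convolution filter turns it into a feature map of length seq_length - filter_size + 1; max pooling then reduces each feature map to a single 1x1 value.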
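The shape flow above can be checked with a few lines of PyTorch. This is only a minimal sketch: the sizes (`seq_length = 20`, `embedding_dim = 16`, `filter_size = 3`, `num_filters = 4`) are illustrative assumptions, and it uses `nn.Conv1d` rather than the `nn.Conv2d` layout used in the full model later in this post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not values from the post).
seq_length, embedding_dim = 20, 16
filter_size, num_filters = 3, 4

# One already-embedded sample: [seq_length, embedding_dim].
sample = torch.randn(seq_length, embedding_dim)

# nn.Conv1d expects (batch, channels, length), so the embedding
# dimension plays the role of the input channels.
x = sample.t().unsqueeze(0)  # (1, embedding_dim, seq_length)

conv = nn.Conv1d(embedding_dim, num_filters, kernel_size=filter_size)

# Each filter yields a feature map of length seq_length - filter_size + 1.
feature_maps = torch.relu(conv(x))  # (1, num_filters, 18)

# Max-over-time pooling keeps a single value per filter.
pooled = feature_maps.max(dim=2).values  # (1, num_filters)

print(tuple(feature_maps.shape))
print(tuple(pooled.shape))
```

With these sizes the feature maps have length 20 - 3 + 1 = 18, and pooling leaves one number per filter.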

In practice, several aspects of TextCNN are worth tuning:

Filter size: this parameter determines the length of the extracted n-gram features and depends mainly on the data. If the average text length is under 50, a filter size below 10 is enough; otherwise it can be larger. When tuning, you can first grid-search a single filter size to find the best one, and then try combining that size with its neighbours.

Number of filters: this parameter determines the dimension of the final feature vector. If the dimension is too large, training slows down. A range of 100 to 600 is a reasonable place to search.

CNN activation function: you can try Identity (no activation), ReLU and tanh.

Regularization: this refers to regularizing the CNN parameters. Dropout or an L2 constraint can be used, but they usually have only a small effect. Try a small dropout rate (< 0.5) and a relatively large L2 limit.

Pooling method: choose among mean, max and k-max pooling depending on the task. Most of the time max pooling performs very well, because classification does not demand fine-grained semantics, so capturing only the strongest feature is enough.
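To make the three pooling options concrete, here is a minimal sketch in PyTorch. The helper names (`max_pool`, `mean_pool`, `k_max_pool`) are hypothetical and exist only for this illustration. The k-max variant shown here keeps the k largest activations in their original sequence order, which is one common way to define it.

```python
import torch


def max_pool(feature_maps):
    # feature_maps: (batch, num_filters, length) -> (batch, num_filters)
    return feature_maps.max(dim=2).values


def mean_pool(feature_maps):
    # Average over the sequence axis instead of taking the maximum.
    return feature_maps.mean(dim=2)


def k_max_pool(feature_maps, k):
    # Positions of the k largest values, re-sorted by position so the
    # selected activations keep their original order in the sequence.
    top_idx = feature_maps.topk(k, dim=2).indices
    ordered_idx = top_idx.sort(dim=2).values
    return feature_maps.gather(2, ordered_idx)  # (batch, num_filters, k)


# One sample, one filter, a feature map of length 5.
fm = torch.tensor([[[3.0, 1.0, 5.0, 2.0, 4.0]]])

print(max_pool(fm))       # tensor([[5.]])
print(mean_pool(fm))      # tensor([[3.]])
print(k_max_pool(fm, 3))  # tensor([[[3., 5., 4.]]])
```

Max pooling discards everything except the strongest activation, while k-max pooling keeps a few more values at the cost of a k-times larger feature vector.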

TextCNN is a strong baseline for short and medium-length text, but it is not well suited to long text: the convolution kernel size is usually kept small, so the model cannot capture long-distance dependencies. Max pooling also has limitations and discards some useful features. In addition, if you think about it carefully, TextCNN is essentially the same as the traditional n-gram bag-of-words model. Most of its gains come from the introduction of word vectors [3], which solve the sparsity problem of the bag-of-words model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class TextCNN(nn.Module):
        def __init__(self, config):
            super(TextCNN, self).__init__()
            self.config = config

            self.filter_sizes = [1, 2, 4, 8]
            self.embedding_dim = 128
            dim_cnn_out = 128
            filter_num = 128

            if config.type == "DNA" or config.type == "RNA":
                vocab_size = 6
            elif config.type == "prot":
                vocab_size = 26
            else:
                # Fail early instead of hitting an undefined vocab_size below
                raise ValueError("unsupported config.type: {}".format(config.type))

            self.embedding = nn.Embedding(vocab_size, self.embedding_dim, padding_idx=0)
            # One Conv2d per filter size; each kernel spans the full embedding dimension
            self.convs = nn.ModuleList(
                [nn.Conv2d(1, filter_num, (fsz, self.embedding_dim)) for fsz in self.filter_sizes])
            self.dropout = nn.Dropout(0.2)
            self.linear = nn.Linear(len(self.filter_sizes) * filter_num, dim_cnn_out)
            self.classification = nn.Linear(dim_cnn_out, 2)  # 2 output logits

        def forward(self, x):
            # x has shape (batch_size, max_len). max_len can be fixed or taken from
            # the longest training sample (e.g. via torchtext); it must be at least
            # as large as the largest filter size.
            # Move the input to the model's device instead of hard-coding .cuda()
            x = x.to(self.embedding.weight.device)

            # After embedding: (batch_size, max_len, embedding_dim)
            x = self.embedding(x)

            # Add a channel axis: (batch_size, in_channel=1, w=max_len, h=embedding_dim)
            x = x.view(x.size(0), 1, x.size(1), self.embedding_dim)

            # After each convolution: (batch_size, out_channel, w=max_len-fsz+1, h=1)
            x = [F.relu(conv(x)) for conv in self.convs]

            # After max pooling over the whole map: (batch_size, out_channel, w=1, h=1)
            x = [F.max_pool2d(input=x_item, kernel_size=(x_item.size(2), x_item.size(3)))
                 for x_item in x]

            # Flatten each result from (batch, out_channel, 1, 1) to (batch, out_channel)
            x = [x_item.view(x_item.size(0), -1) for x_item in x]

            # Concatenate the features from the different kernel sizes:
            # (batch, len(filter_sizes) * out_channel)
            x = torch.cat(x, 1)

            # Dropout layer
            x = self.dropout(x)

            # Fully connected layers
            representation = self.linear(x)
            output = self.classification(representation)

            return output, representation