[NLP] Text feature processing & text data augmentation

1. Text feature processing

Learning objectives

  • Understand the role of text feature processing
  • Master the specific methods of common text feature processing

The role of text feature processing:

  • Text feature processing adds universal text features to the corpus, such as n-gram features, and performs any necessary post-processing on the corpus after the features are added, such as length standardization. This work effectively injects important text features into model training and improves model evaluation metrics

Common text feature processing methods:

  • Adding n-gram features
  • Text length standardization

1.1 What are n-gram features?

Given a text sequence, the co-occurrence of n adjacent words or characters forms an n-gram feature. The most commonly used n-gram features are bigram and trigram features, corresponding to n = 2 and n = 3 respectively.

For example:

Suppose we are given a word segmentation list: ["Who is it?", "Knock", "My heart"]

The corresponding numerical mapping list is: [1, 34, 21]

We can regard each number in the numerical mapping list as a single-word feature.

Besides that, when two words such as "Who is it?" and "Knock" appear together and adjacent to each other, this co-occurrence can also be added to the sequence list as a feature.

Suppose the number 1000 represents "Who is it?" and "Knock" co-occurring and adjacent.

At this point, the numerical mapping list becomes a list with 2-gram features: [1, 34, 21, 1000]

Here "Who is it?" and "Knock" co-occurring and adjacent is one of the bigram features.

"Knock" and "My heart" are also two co-occurring and adjacent words, so they form another bigram feature.

Suppose the number 1001 represents "Knock" and "My heart" co-occurring and adjacent.

Then the original numerical mapping list [1, 34, 21], after adding the bigram features, finally becomes [1, 34, 21, 1000, 1001]

Extract n-gram features:

# In general, n in n-gram is 2 or 3; here we take 2 as an example
ngram_range = 2

def create_ngram_set(input_list):
    """
    description: Extract all n-gram features from the input list of values
    :param input_list: the input list of values, which can be regarded as a list
                       after vocabulary mapping; each number lies in the range [1, 25000]
    :return: the set of n-gram features

    eg:
    >>> create_ngram_set([1, 4, 9, 4, 1, 4])
    {(4, 9), (4, 1), (1, 4), (9, 4)}
    """
    # zip over n shifted copies of the list to obtain every adjacent n-tuple
    return set(zip(*[input_list[i:] for i in range(ngram_range)]))

Call:

input_list = [1, 3, 2, 1, 5, 3]
res = create_ngram_set(input_list)
print(res)

Output effect:

# All bigram features of this input list
{(3, 2), (1, 3), (2, 1), (1, 5), (5, 3)}
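
To actually use these features for training, each extracted n-gram tuple must itself be mapped to a new numerical id and appended to the sequence, as in the [1, 34, 21] -> [1, 34, 21, 1000, 1001] example above. The helper below is a minimal sketch of that step; the name add_ngram_features and the token_indice mapping are illustrative assumptions, not part of the original:

# A sketch (assumption, not from the original) of appending n-gram feature ids
def add_ngram_features(input_list, token_indice, n=2):
    """
    description: Append the id of every known n-gram in input_list to the sequence
    :param token_indice: dict mapping n-gram tuples to new feature ids
    """
    new_list = list(input_list)
    # Walk over every adjacent n-tuple, in order of appearance
    for i in range(len(input_list) - n + 1):
        ngram = tuple(input_list[i:i + n])
        if ngram in token_indice:
            new_list.append(token_indice[ngram])
    return new_list

# Hypothetical mapping for the two bigrams from the prose example
token_indice = {(1, 34): 1000, (34, 21): 1001}
print(add_ngram_features([1, 34, 21], token_indice))
# Output: [1, 34, 21, 1000, 1001]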

1.2 Text length standardization and its function

A model generally requires input matrices of the same size, so the length of each text must be standardized after numerical mapping and before it enters the model. A reasonable length covering most of the texts is chosen by analyzing the sentence length distribution of the corpus; over-long texts are truncated and short texts are padded (generally with the number 0). This process is text length standardization.

Implementation of text length standardization:

from keras.preprocessing import sequence

# According to the sentence length distribution of the corpus, choose a cutlen
# that covers about 90% of the sentences
# Here we assume cutlen is 10
cutlen = 10

def padding(x_train):
    """
    description: standardize the length of the input text tensor
    :param x_train: tensor representation of the texts, e.g. [[1, 32, 32, 61], [2, 54, 21, 7, 19]]
    :return: text tensor representation after truncation and padding
    """
    # sequence.pad_sequences truncates and pads at the beginning by default
    return sequence.pad_sequences(x_train, cutlen)

Call:

# Assume x_train contains two texts: one longer than 10 and one shorter than 10
x_train = [[1, 23, 5, 32, 55, 63, 2, 21, 78, 32, 23, 1],
           [2, 32, 1, 23, 1]]

res = padding(x_train)
print(res)

Output effect:

[[ 5 32 55 63  2 21 78 32 23  1]
 [ 0  0  0  0  0  2 32  1 23  1]]
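
How cutlen itself might be chosen is not shown above; a minimal sketch of picking it from the sentence length distribution could look like the following (the helper choose_cutlen and the use of numpy's percentile are assumptions for illustration):

import numpy as np

# A sketch (assumption, not from the original) of choosing cutlen from the corpus
def choose_cutlen(corpus, coverage=0.90):
    """
    description: Return the smallest length covering `coverage` of the sentences
    """
    lengths = [len(seq) for seq in corpus]
    return int(np.percentile(lengths, coverage * 100))

# With the toy x_train above (lengths 12 and 5), this yields 11
print(choose_cutlen(x_train))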

2. Text data augmentation

Learning objectives

  • Understand the role of text data augmentation
  • Master the specific methods of common text data augmentation

Common text data augmentation methods:

  • Back translation data augmentation

2.1 What is back translation data augmentation?

  • Back translation is currently an effective augmentation method for text data. Generally, based on the Google Translate API, the text data is translated into another language (often a low-resource language) and then translated back into the original language. The result can be regarded as new corpus data with the same label as the original, and adding this new corpus to the original dataset augments that dataset
  • Advantages of back translation data augmentation:
    • The operation is simple and the quality of the new corpus is high
  • Problems of back translation data augmentation:
    • When back-translating short texts, the repetition rate between the new corpus and the source corpus may be high, which does not effectively enlarge the feature space of the samples
  • Solution to the high repetition rate:
    • Perform consecutive multi-language translation, such as Chinese -> Korean -> Japanese -> English -> Chinese; a chained-translation sketch follows the implementation below. Experience suggests using at most three consecutive translations; more will lead to problems such as low efficiency and semantic distortion

2.2 Implementation of back translation data augmentation

# Suppose we take two existing positive samples and two negative samples
# Based on these four samples, four new samples with the same labels will be generated
p_sample1 = "The hotel facilities are very good"
p_sample2 = "This one is very cheap"
n_sample1 = "The slippers are moldy, too bad"
n_sample2 = "TV doesn't work well, I didn't see the football"

# Import the Google Translate interface tool (pip install googletrans)
from googletrans import Translator
# Instantiate the translator object
translator = Translator()
# First batch of translations; the target language is Korean
translations = translator.translate([p_sample1, p_sample2, n_sample1, n_sample2], dest='ko')
# Obtain the translated results
ko_res = list(map(lambda x: x.text, translations))
# Print the results
print("Intermediate translation results:")
print(ko_res)


# Finally, translate back to the original language (English here) to complete the back translation
translations = translator.translate(ko_res, dest='en')
en_res = list(map(lambda x: x.text, translations))
print("Augmented data from back translation:")
print(en_res)

Output effect:

Intermediate translation results:
['호텔 시설은 아주 좋다', '이 가격은 매우 저렴합니다', '슬리퍼 곰팡이가 핀이다, 나쁜', 'TV가 잘 작동하지 않습니다, 나는 축구를 볼 수 없습니다']
Augmented data from back translation:
['The hotel facilities are very good', 'This price is very affordable', 'The slippers are moldy and broken', "TV doesn't work. I can't watch football"]
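
As suggested in 2.1, chaining translations through several languages can reduce the repetition rate of short-text back translation. Below is a minimal sketch of such a chain; the helper multi_hop_back_translate and the hop order are illustrative assumptions, and googletrans availability and rate limits may vary in practice:

# A sketch (assumption, not from the original) of multi-hop back translation.
# The samples here are English, so the chain ends back at English ('en')
def multi_hop_back_translate(texts, hops=('ko', 'ja', 'en')):
    """Translate texts through each language in hops and return the final result."""
    current = texts
    for dest in hops:
        translations = translator.translate(current, dest=dest)
        current = [t.text for t in translations]
    return current

print(multi_hop_back_translate([p_sample1, p_sample2, n_sample1, n_sample2]))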

Summary

  • Learned the role of text feature processing:
    Text feature processing adds universal text features to the corpus, such as n-gram features, and performs any necessary post-processing after the features are added, such as length standardization. This work effectively injects important text features into model training and improves model evaluation metrics

  • Learned common text feature processing methods:

    • Adding n-gram features
    • Text length standardization
  • Learned what n-gram features are:

    • Given a text sequence, the co-occurrence of n adjacent words or characters forms an n-gram feature. The most commonly used n-gram features are bigram and trigram features, corresponding to n = 2 and n = 3 respectively
  • Learned the function for extracting n-gram features: create_ngram_set

  • Learned text length standardization and its function:

    • A model generally requires input matrices of the same size, so the length of each text must be standardized after numerical mapping and before it enters the model. A reasonable length covering most of the texts is chosen from the sentence length distribution; over-long texts are truncated and short texts are padded (generally with the number 0). This process is text length standardization
  • Learned the implementation function of text length standardization: padding

  • Learned common text data augmentation methods:
    Back translation data augmentation

  • Learned the advantages of back translation data augmentation:
    The operation is simple and the quality of the new corpus is high

  • Learned the problems of back translation data augmentation:
    When back-translating short texts, the repetition rate between the new corpus and the source corpus may be high, which does not effectively enlarge the feature space of the samples

  • Learned the solution to the high repetition rate:
    Perform consecutive multi-language translation, such as Chinese -> Korean -> Japanese -> English -> Chinese. Experience suggests using at most three consecutive translations; more will lead to problems such as low efficiency and semantic distortion

Keep going!

Thank you!

Keep striving!
