Text is a kind of sequence data: an article can be regarded as a sequence of characters or words. This section introduces common preprocessing steps for text data, which usually include four steps:
- Read in the text
- Split the text into words or characters (tokenization)
- Build a vocabulary that maps each token to a unique index
- Transform the text from a sequence of tokens into a sequence of indices, which is convenient to feed into a model
We use an English novel, The Time Machine by H. G. Wells, as an example to show the specific process of text preprocessing.
```python
import collections
import re

def read_time_machine():
    with open('/home/kesci/input/timemachine7163/timemachine.txt', 'r') as f:
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines

lines = read_time_machine()
print('# sentences %d' % len(lines))
```
We split each sentence into words, that is, we divide a sentence into several tokens and transform it into a sequence of words.
```python
def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens"""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type ' + token)

tokens = tokenize(lines)
```
To facilitate model processing, we need to convert strings to numbers. So we need to build a vocabulary first, mapping each word to a unique index number.
```python
class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        counter = count_corpus(tokens)  # counts the frequency of each token
        self.token_freqs = list(counter.items())
        self.idx_to_token = []
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            self.unk = 0
            self.idx_to_token += ['<unk>']
        self.idx_to_token += [token for token, freq in self.token_freqs
                              if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # returns a counter of occurrences of each word
```
Using the vocabulary, we can convert sentences in the original text from sequences of words to sequences of indices.
```python
vocab = Vocab(tokens)
for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
```
The segmentation method we introduced earlier is very simple. It has at least the following disadvantages:
- Punctuation usually provides semantic information, but our method directly discards it
- Words like "shouldn't", "doesn't" can be mishandled
- Words such as "Mr.", "Dr.", etc. will be mishandled
We can solve these problems by introducing more complex rules, but in fact, there are some existing tools that can do word segmentation well. Here we briefly introduce two of them: spaCy and NLTK.
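As a small sketch of what "more complex rules" could look like, here is a hypothetical regex-based tokenizer (the function name and pattern are illustrative assumptions, not the method used above) that keeps contractions and common abbreviations intact and emits punctuation as separate tokens instead of discarding it:

```python
import re

def tokenize_with_rules(text):
    # Keep contractions like "shouldn't" as single tokens, treat common
    # abbreviations such as "Mr." as single tokens, and emit punctuation
    # as its own token instead of discarding it.
    pattern = r"(?:Mr\.|Mrs\.|Dr\.)|[A-Za-z]+(?:'[a-z]+)?|[.,!?;]"
    return re.findall(pattern, text)

print(tokenize_with_rules("Mr. Smith shouldn't leave, should he?"))
# -> ['Mr.', 'Smith', "shouldn't", 'leave', ',', 'should', 'he', '?']
```

Real tokenizers in spaCy and NLTK handle many more cases (quotes, hyphens, unicode, language-specific rules), which is why using them is usually preferable to growing a rule set by hand.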
A natural language text can be regarded as a discrete time series. Given a sequence of words w1, w2, ..., wT of length T, the goal of the language model is to evaluate whether the sequence is reasonable, that is, to compute the probability of the sequence: P(w1, w2, ..., wT).
In this section, we introduce the statistics-based language model, mainly the n-gram. In the following content, we will introduce language models based on neural networks.
Thinking: what are the possible defects of the n-gram model?
- Parameter space is too large
- Data sparseness
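Both defects can be seen in a minimal count-based bigram model (a simplified sketch, not the full n-gram machinery; the tiny corpus below is invented for illustration). With a vocabulary of size V, an n-gram model needs on the order of V^n parameters, and most n-grams never occur in a corpus, so their estimated probability is zero:

```python
import collections

def bigram_probs(tokens):
    # Maximum-likelihood bigram estimates: P(w2 | w1) = count(w1, w2) / count(w1)
    unigrams = collections.Counter(tokens)
    bigrams = collections.Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

corpus = ['the', 'time', 'machine', 'by', 'the', 'time', 'traveller']
probs = bigram_probs(corpus)
print(probs[('the', 'time')])        # 1.0: every 'the' in this corpus is followed by 'time'
# Data sparsity: a bigram unseen in the corpus gets probability 0
print(probs.get(('time', 'by'), 0))  # 0
```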
In training, we need to read a minibatch of samples and labels at random each time. Unlike the experimental data in the previous chapters, a sample of sequential data usually consists of consecutive tokens. Suppose the number of time steps is 5 and the sample sequence consists of the 5 words "want", "to", "have", "a", "helicopter". The label sequence consists of the word that follows each of these words in the training set, that is, "to", "have", "a", "helicopter", "want". In other words, X = "want to have a helicopter" and Y = "to have a helicopter want".
Now let's consider the sequence "want to have a helicopter, want to fly with you to the universe". If the number of time steps is 5, the possible samples and labels include:
- X: "want to have a helicopter", Y: "to have a helicopter want"
- X: "to have a helicopter want", Y: "have a helicopter want to"
- ...
It can be seen that if the length of the sequence is T and the number of time steps is n, then there are T − n legal samples in total. Since these samples overlap heavily, we usually adopt a more efficient sampling method. There are two ways to sample time-series data: random sampling and adjacent sampling.
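The count above can be sanity-checked with a small sketch that enumerates all overlapping windows of an index list (the function name is illustrative):

```python
def legal_samples(seq, n):
    # Each sample is n consecutive items whose label (the sequence shifted
    # by one position) must also lie inside the sequence, giving
    # len(seq) - n windows in total.
    return [(seq[i:i + n], seq[i + 1:i + 1 + n]) for i in range(len(seq) - n)]

seq = list(range(10))          # T = 10
pairs = legal_samples(seq, 5)  # n = 5
print(len(pairs))              # T - n = 5
print(pairs[0])                # ([0, 1, 2, 3, 4], [1, 2, 3, 4, 5])
```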
- Random sampling
The following code randomly samples one minibatch of data at a time: batch_size is the number of samples in each minibatch, and num_steps is the number of time steps in each sample.
Note: in random sampling, each sample is a segment of sequence randomly cut from the original sequence, and the positions of two adjacent random small batches on the original sequence are not necessarily adjacent.
```python
import torch
import random

def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Minus 1 because for a sequence of length n, X contains at most the first n - 1 characters
    num_examples = (len(corpus_indices) - 1) // num_steps  # number of non-overlapping samples, obtained by rounding down
    example_indices = [i * num_steps for i in range(num_examples)]  # index of each sample's first character in corpus_indices
    random.shuffle(example_indices)

    def _data(i):
        # Returns the sequence of num_steps characters starting from i
        return corpus_indices[i: i + num_steps]

    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    for i in range(0, num_examples, batch_size):
        # Each time, select batch_size random samples
        batch_indices = example_indices[i: i + batch_size]  # first-character indices of the current batch's samples
        X = [_data(j) for j in batch_indices]
        Y = [_data(j + 1) for j in batch_indices]
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)
```
To test this function, we input the consecutive integers from 0 to 29 as a manual sequence, set the batch size and the number of time steps to 2 and 6 respectively, and print the input X and label Y of the minibatch samples read at random each time.

```python
my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')
```

The output (which varies between runs because of the shuffling) looks like:

```
X:  tensor([[ 6,  7,  8,  9, 10, 11],
        [12, 13, 14, 15, 16, 17]])
Y: tensor([[ 7,  8,  9, 10, 11, 12],
        [13, 14, 15, 16, 17, 18]])

X:  tensor([[ 0,  1,  2,  3,  4,  5],
        [18, 19, 20, 21, 22, 23]])
Y: tensor([[ 1,  2,  3,  4,  5,  6],
        [19, 20, 21, 22, 23, 24]])
```
- Adjacent sampling
In adjacent sampling, two adjacent random small batches are adjacent to each other in the original sequence.
```python
import torch

def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # length of the retained sequence
    corpus_indices = corpus_indices[: corpus_len]  # only the first corpus_len characters are kept
    indices = torch.tensor(corpus_indices, device=device)
    indices = indices.view(batch_size, -1)  # reshape into (batch_size, corpus_len // batch_size)
    batch_num = (indices.shape[1] - 1) // num_steps
    for i in range(batch_num):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y
```
Under the same setting, print the input X and label Y of the minibatch samples read each time under adjacent sampling. Two adjacent random minibatches are adjacent on the original sequence.

```python
for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')
```

The output is:

```
X:  tensor([[ 0,  1,  2,  3,  4,  5],
        [15, 16, 17, 18, 19, 20]])
Y: tensor([[ 1,  2,  3,  4,  5,  6],
        [16, 17, 18, 19, 20, 21]])

X:  tensor([[ 6,  7,  8,  9, 10, 11],
        [21, 22, 23, 24, 25, 26]])
Y: tensor([[ 7,  8,  9, 10, 11, 12],
        [22, 23, 24, 25, 26, 27]])
```
We need to represent the characters as vectors; here we use one-hot vectors. Assuming the dictionary size is n, each character corresponds to a unique index from 0 to n − 1, and the vector of a character is a vector of length n: if the index of the character is i, then the i-th position of the vector is 1 and all other positions are 0. The one-hot vectors with indices 0 and 2 are shown below; the length of the vectors equals the dictionary size.
```python
def one_hot(x, n_class, dtype=torch.float32):
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)  # shape: (n, n_class)
    result.scatter_(1, x.long().view(-1, 1), 1)  # result[i, x[i]] = 1
    return result

x = torch.tensor([0, 2])
x_one_hot = one_hot(x, vocab_size)
print(x_one_hot)
print(x_one_hot.shape)
print(x_one_hot.sum(axis=1))
```

The output is:

```
tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.]])
torch.Size([2, 1027])
tensor([1., 1.])
```
In recurrent neural networks, gradient decay or gradient explosion are more likely to occur, which can make the network almost impossible to train. Gradient clipping is a method to deal with gradient explosion. Suppose we concatenate the gradients of all model parameters into a vector g and set the clipping threshold to θ. The clipped gradient

min(θ / ‖g‖, 1) g

has a norm of at most θ.
```python
def grad_clipping(params, theta, device):
    norm = torch.tensor([0.0], device=device)
    for param in params:
        norm += (param.grad.data ** 2).sum()
    norm = norm.sqrt().item()
    if norm > theta:
        for param in params:
            param.grad.data *= (theta / norm)
```
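The arithmetic of the clipping rule can be checked without any framework. A small pure-Python sketch (not the training code above; function name and values are illustrative):

```python
import math

def clip(gradient, theta):
    # Scale the gradient vector so its L2 norm is at most theta,
    # following min(theta / ||g||, 1) * g
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(theta / norm, 1.0)
    return [g * scale for g in gradient]

g = [3.0, 4.0]       # ||g|| = 5
print(clip(g, 2.5))  # scaled by 1/2 -> [1.5, 2.0], new norm is exactly theta
print(clip(g, 10.0)) # norm already below threshold -> unchanged
```

Note that clipping rescales the whole gradient vector, so its direction is preserved; only its magnitude is capped.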
We usually use perplexity to evaluate a language model. Recall the definition of the cross-entropy loss function in the "softmax regression" section: perplexity is the value obtained by exponentiating the cross-entropy loss. In particular,
- in the best case, the model always predicts the probability of the label category as 1, and the perplexity is 1;
- in the worst case, the model always predicts the probability of the label category as 0, and the perplexity is positive infinity;
- in the baseline case, the model predicts the same probability for all categories, and the perplexity equals the number of categories.
Obviously, the perplexity of any effective model must be less than the number of categories. In this case, the perplexity must be less than the dictionary size, vocab_size.
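The best and baseline cases can be checked numerically with a minimal sketch in plain Python (the vocabulary size 1027 is taken from the one-hot example above; the function name is illustrative):

```python
import math

def perplexity(probs):
    # Perplexity = exp(average cross-entropy of the probabilities
    # the model assigned to the true labels)
    cross_entropy = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(cross_entropy)

vocab_size = 1027
# Best case: the true label always gets probability 1 -> perplexity 1
print(perplexity([1.0] * 5))
# Baseline: uniform probability 1/vocab_size -> perplexity = vocab_size
print(round(perplexity([1 / vocab_size] * 5)))
```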
Compared with the model training function in the previous chapter, the model training function here has the following differences:
- The perplexity metric is used to evaluate the model.
- The gradient is clipped before the model parameters are updated.
- Different sampling methods for time-series data lead to different initializations of the hidden state.
```python
def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, device, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    if is_random_iter:
        data_iter_fn = d2l.data_iter_random
    else:
        data_iter_fn = d2l.data_iter_consecutive
    params = get_params()
    loss = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        if not is_random_iter:
            # If adjacent sampling is used, the hidden state is initialized at the beginning of the epoch
            state = init_rnn_state(batch_size, num_hiddens, device)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, device)
        for X, Y in data_iter:
            if is_random_iter:
                # If random sampling is used, the hidden state is initialized before each minibatch update
                state = init_rnn_state(batch_size, num_hiddens, device)
            else:
                # Otherwise, use the detach function to separate the hidden state from the computation graph
                for s in state:
                    s.detach_()
            # inputs is a list of num_steps matrices with shape (batch_size, vocab_size)
            inputs = to_onehot(X, vocab_size)
            # outputs is a list of num_steps matrices with shape (batch_size, vocab_size)
            (outputs, state) = rnn(inputs, state, params)
            # After concatenation, the shape is (num_steps * batch_size, vocab_size)
            outputs = torch.cat(outputs, dim=0)
            # The shape of Y is (batch_size, num_steps); transpose and flatten it into a
            # vector of shape (num_steps * batch_size,) so it corresponds to the output rows one by one
            y = torch.flatten(Y.T)
            # Use the cross-entropy loss to compute the average classification error
            l = loss(outputs, y.long())
            # Zero the gradients
            if params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            grad_clipping(params, clipping_theta, device)  # clip the gradient
            d2l.sgd(params, lr, 1)  # the loss is already averaged, so the gradient does not need to be averaged
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn(prefix, pred_len, rnn, params, init_rnn_state,
                                        num_hiddens, vocab_size, device, idx_to_char, char_to_idx))
```
As shown in the code:
- when adjacent sampling is used, the hidden state is initialized at the beginning of each epoch;
- when random sampling is used, the hidden state is initialized before each minibatch update.
Why does the initialization of the hidden state depend on the sampling method? With adjacent sampling, two consecutive minibatches are contiguous on the original sequence, so the final hidden state of one minibatch is a valid initial state for the next; we only detach it from the computation graph so that backpropagation does not extend beyond the current minibatch. With random sampling, consecutive minibatches are generally not contiguous, so a carried-over hidden state would be meaningless, and it has to be re-initialized before each minibatch update.
Construct an nn.RNN instance as follows, and use a simple example to see the shape of the output.
```python
rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)
num_steps, batch_size = 35, 2
X = torch.rand(num_steps, batch_size, vocab_size)
state = None
Y, state_new = rnn_layer(X, state)
print(Y.shape, state_new.shape)
```

The output is:

```
torch.Size([35, 2, 256]) torch.Size([1, 2, 256])
```
We define a complete language model based on a recurrent neural network.
```python
class RNNModel(nn.Module):
    def __init__(self, rnn_layer, vocab_size):
        super(RNNModel, self).__init__()
        self.rnn = rnn_layer
        self.hidden_size = rnn_layer.hidden_size * (2 if rnn_layer.bidirectional else 1)
        self.vocab_size = vocab_size
        self.dense = nn.Linear(self.hidden_size, vocab_size)

    def forward(self, inputs, state):
        # inputs.shape: (batch_size, num_steps)
        X = to_onehot(inputs, self.vocab_size)
        X = torch.stack(X)  # X.shape: (num_steps, batch_size, vocab_size)
        hiddens, state = self.rnn(X, state)
        hiddens = hiddens.view(-1, hiddens.shape[-1])  # hiddens.shape: (num_steps * batch_size, hidden_size)
        output = self.dense(hiddens)
        return output, state
```
Similarly, we need to implement a prediction function, which differs from the previous one in the forward computation and in the initialization of the hidden state.
```python
def predict_rnn_pytorch(prefix, num_chars, model, vocab_size, device,
                        idx_to_char, char_to_idx):
    state = None
    output = [char_to_idx[prefix[0]]]  # output records the prefix plus the predicted num_chars characters
    for t in range(num_chars + len(prefix) - 1):
        X = torch.tensor([output[-1]], device=device).view(1, 1)
        (Y, state) = model(X, state)  # the forward computation does not need to pass in model parameters
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y.argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])
```
Use a model with random weights to make one prediction.
```python
model = RNNModel(rnn_layer, vocab_size).to(device)
predict_rnn_pytorch('Separate', 10, model, vocab_size, device, idx_to_char, char_to_idx)
```

Because the model weights are random, the output differs from run to run.
The next step is to implement the training function; here only adjacent sampling is used.
```python
def train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                                  corpus_indices, idx_to_char, char_to_idx,
                                  num_epochs, num_steps, lr, clipping_theta,
                                  batch_size, pred_period, pred_len, prefixes):
    loss = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    for epoch in range(num_epochs):
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = d2l.data_iter_consecutive(corpus_indices, batch_size, num_steps, device)  # adjacent sampling
        state = None
        for X, Y in data_iter:
            if state is not None:
                # Use the detach function to separate the hidden state from the computation graph
                if isinstance(state, tuple):  # LSTM, state: (h, c)
                    state[0].detach_()
                    state[1].detach_()
                else:
                    state.detach_()
            (output, state) = model(X, state)  # output.shape: (num_steps * batch_size, vocab_size)
            y = torch.flatten(Y.T)
            l = loss(output, y.long())

            optimizer.zero_grad()
            l.backward()
            grad_clipping(model.parameters(), clipping_theta, device)
            optimizer.step()
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn_pytorch(
                    prefix, pred_len, model, vocab_size, device, idx_to_char, char_to_idx))
```
```python
num_epochs, batch_size, lr, clipping_theta = 250, 32, 1e-3, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['Separate', 'No separation']
train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                              corpus_indices, idx_to_char, char_to_idx,
                              num_epochs, num_steps, lr, clipping_theta,
                              batch_size, pred_period, pred_len, prefixes)
```