Pumpkin Book "Hands-on Machine Learning" Public Welfare Training Camp - Introduction to NLP

1 Text preprocessing

This section describes the common preprocessing steps for text data, which usually comprise four steps:

  1. Read in the text
  2. Tokenize
  3. Build a vocabulary that maps each word to a unique index
  4. Transform the text from a sequence of words to a sequence of indices, for convenient input to the model
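The four steps above can be sketched end to end on a toy string (a hypothetical in-memory example, independent of the file-based code below):

```python
import collections
import re

raw = "The Time Machine, by H. G. Wells"

# 1. Read in the text (here, a single in-memory line instead of a file)
line = re.sub('[^a-z]+', ' ', raw.strip().lower())

# 2. Tokenize into words
tokens = line.split()

# 3. Build a vocabulary mapping each word to a unique index
counter = collections.Counter(tokens)
idx_to_token = sorted(counter)
token_to_idx = {tok: idx for idx, tok in enumerate(idx_to_token)}

# 4. Transform the word sequence into an index sequence
indices = [token_to_idx[tok] for tok in tokens]
print(tokens)
print(indices)
```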
1 Read in the text
import collections
import re

def read_time_machine():
    with open('/home/kesci/input/timemachine7163/timemachine.txt', 'r') as f:
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines

lines = read_time_machine()
print('# sentences %d' % len(lines))
2 Tokenization

We split each sentence into its words, that is, we divide a sentence into several tokens and transform it into a sequence of words.

def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens"""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type ' + token)

tokens = tokenize(lines)
3 Building a vocabulary

To facilitate model processing, we need to convert strings to numbers. So we need to build a vocabulary first, mapping each word to a unique index number.

class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        counter = count_corpus(tokens)  # count the frequency of each token
        self.token_freqs = list(counter.items())
        self.idx_to_token = []
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            self.unk = 0
            self.idx_to_token += ['<unk>']
        self.idx_to_token += [token for token, freq in self.token_freqs
                        if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # Returns a dictionary that records the number of occurrences of each word
4 Transforming the sequence of words into a sequence of indices, for convenient input to the model

vocab = Vocab(tokens)
for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])

Existing tokenization tools

The simple word splitter above handles punctuation and contractions poorly (consider "Mr. Chen doesn't agree"). Existing tools such as spaCy and NLTK handle these cases better:


import spacy
text = "Mr. Chen doesn't agree with my suggestion."
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([token.text for token in doc])


from nltk.tokenize import word_tokenize
print(word_tokenize(text))

Exercises after class

2 Language models

Language model

Suppose each word in a sequence $w_1, w_2, \ldots, w_T$ is generated in turn. Then we have


\begin{aligned}
P(w_1, w_2, \ldots, w_T)
&= \prod_{t=1}^T P(w_t \mid w_1, \ldots, w_{t-1})\\
&= P(w_1)P(w_2 \mid w_1) \cdots P(w_T \mid w_1 w_2 \cdots w_{T-1})
\end{aligned}


For example, the probability of a text sequence containing four words is

P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_1, w_2, w_3).

The parameters of the language model are the word probabilities and the conditional probabilities of each word given the preceding words. Suppose the training dataset is a large text corpus, such as all Wikipedia entries. The probability of a word can then be estimated by its relative frequency in the training set; for example, the probability of $w_1$ can be estimated as:


\hat P(w_1) = \frac{n(w_1)}{n}


where $n(w_1)$ is the number of texts in the corpus that have $w_1$ as the first word, and $n$ is the total number of texts in the corpus.

Similarly, given $w_1$, the conditional probability of $w_2$ can be estimated as:


\hat P(w_2 \mid w_1) = \frac{n(w_1, w_2)}{n(w_1)}


where $n(w_1, w_2)$ is the number of texts in the corpus that have $w_1$ as the first word and $w_2$ as the second word.
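The two estimators above can be illustrated with counts over a tiny hypothetical corpus (the corpus and the words in it are made up for this sketch):

```python
import collections

# Hypothetical toy corpus: each inner list is one text
corpus = [['i', 'like', 'tea'],
          ['i', 'like', 'coffee'],
          ['you', 'like', 'tea']]

n = len(corpus)  # total number of texts
first_word = collections.Counter(text[0] for text in corpus)
first_bigram = collections.Counter((text[0], text[1]) for text in corpus)

p_i = first_word['i'] / n                                        # \hat P(w_1 = 'i')
p_like_given_i = first_bigram[('i', 'like')] / first_word['i']   # \hat P('like' | 'i')
print(p_i, p_like_given_i)
```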

n-grams

As the length of the sequence increases, the complexity of computing and storing the probabilities of many words occurring together grows exponentially. The $n$-gram model simplifies this with the Markov assumption: each word depends only on the $n-1$ words before it. Based on a Markov chain of order $n-1$, we can rewrite the language model as

P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^T P(w_t \mid w_{t-(n-1)}, \ldots, w_{t-1}).

The above is called an $n$-gram model. It is a probabilistic language model based on a Markov chain of order $n-1$. For example, when $n=2$, the probability of a text sequence with four words can be rewritten as:


\begin{aligned}
P(w_1, w_2, w_3, w_4)
&= P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_1, w_2, w_3)\\
&= P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2) P(w_4 \mid w_3)
\end{aligned}


When $n$ is 1, 2 or 3, we call the model a unigram, bigram or trigram respectively. For example, the probabilities of a sequence $w_1, w_2, w_3, w_4$ of length 4 under the unigram, bigram and trigram models are, respectively,


\begin{aligned}
P(w_1, w_2, w_3, w_4) &= P(w_1) P(w_2) P(w_3) P(w_4) ,\\
P(w_1, w_2, w_3, w_4) &= P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2) P(w_4 \mid w_3) ,\\
P(w_1, w_2, w_3, w_4) &= P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_2, w_3) .
\end{aligned}


When $n$ is small, the $n$-gram model is often inaccurate. For example, under a unigram model a sentence and any reordering of its words, such as "you go first" and "first you go", have the same probability. However, when $n$ is large, the model needs to compute and store a huge number of word frequencies and multi-word adjacency frequencies.
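The word-order blindness of the unigram model is easy to demonstrate: the joint probability is a product of per-word probabilities, so reordering the words changes nothing. The counts below are made up so that the probabilities are exact binary fractions:

```python
import collections

# Hypothetical unigram counts estimated from some corpus
counts = collections.Counter({'you': 4, 'go': 2, 'first': 2})
total = sum(counts.values())  # 8

def unigram_prob(sentence):
    # P(w_1, ..., w_T) = prod_t P(w_t) -- word order plays no role
    p = 1.0
    for w in sentence:
        p *= counts[w] / total
    return p

print(unigram_prob(['you', 'go', 'first']))   # 0.5 * 0.25 * 0.25
print(unigram_prob(['first', 'go', 'you']))   # same value: the model cannot tell the orders apart
```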

Sampling of time series data

During training, we need to read a random mini-batch of samples and labels each time. Unlike the experimental data in the previous chapters, a sample of sequence data usually consists of consecutive characters. Suppose the number of time steps is 5 and a sample sequence consists of the 5 words "want", "to", "have", "a", "helicopter". Its label sequence consists of the words that follow each of these in the training set, i.e. "to", "have", "a", "helicopter", "want": X = ("want", "to", "have", "a", "helicopter"), Y = ("to", "have", "a", "helicopter", "want").

Now consider the sequence "want to have a helicopter want to fly with you to the universe". If the number of time steps is 5, the possible samples and labels include:

  • X: "want to have a helicopter", Y: "to have a helicopter want"
  • X: "to have a helicopter want", Y: "have a helicopter want to"
  • X: "have a helicopter want to", Y: "a helicopter want to fly"
  • ...
  • X: "to fly with you to", Y: "fly with you to the"
  • X: "fly with you to the", Y: "with you to the universe"

As we can see, if the length of the sequence is $T$ and the number of time steps is $n$, then there are $T - n$ legal samples in total, but these samples overlap heavily, so we usually adopt a more efficient sampling method. There are two ways to sample sequence data: random sampling and adjacent (consecutive) sampling.
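The count of legal samples can be checked with a plain-Python sketch (the sequence below is an arbitrary example):

```python
# All T - n legal (sample, label) pairs of a character sequence
seq = list('machine')   # T = 7 characters
n = 3                   # number of time steps
pairs = [(seq[i:i + n], seq[i + 1:i + n + 1]) for i in range(len(seq) - n)]
print(len(pairs))       # T - n = 4 legal samples
print(pairs[0])         # the first sample and its label, shifted by one character
```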

Random sampling

The following code randomly samples a mini-batch of data at a time; batch_size is the number of samples in each mini-batch and num_steps is the number of time steps in each sample.
In random sampling, each sample is a sequence arbitrarily cut from the original sequence, and the positions of two adjacent random mini-batches on the original sequence are not necessarily adjacent.

import torch
import random
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Minus 1 is because for a sequence of length N, X contains at most the first n - 1 characters
    num_examples = (len(corpus_indices) - 1) // num_steps  # number of non-overlapping samples, rounded down
    example_indices = [i * num_steps for i in range(num_examples)]  # index in corpus_indices of the first character of each sample
    random.shuffle(example_indices)  # shuffle so that samples are drawn in random order

    def _data(i):
        # returns the sequence of num_steps characters starting from index i
        return corpus_indices[i: i + num_steps]
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    for i in range(0, num_examples, batch_size):
        # Each time, select batch_size random samples
        batch_indices = example_indices[i: i + batch_size]  # Subscript of the first character of each sample of the current batch
        X = [_data(j) for j in batch_indices]
        Y = [_data(j + 1) for j in batch_indices]
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)
my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

Adjacent sampling

In adjacent sampling, two adjacent random mini-batches are adjacent on the original sequence.

def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # keep a length divisible by batch_size
    corpus_indices = corpus_indices[: corpus_len]  # keep only the first corpus_len characters
    indices = torch.tensor(corpus_indices, device=device)
    indices = indices.view(batch_size, -1)  # reshape into (batch_size, corpus_len // batch_size)
    batch_num = (indices.shape[1] - 1) // num_steps
    for i in range(batch_num):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y
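To inspect the batch layout without a tensor library, here is a list-based sketch of the same indexing logic as data_iter_consecutive (the function name and helper structure are illustrative only):

```python
def consecutive_batches(corpus, batch_size, num_steps):
    # keep a length divisible by batch_size, then split into batch_size rows
    row_len = len(corpus) // batch_size
    rows = [corpus[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    batch_num = (row_len - 1) // num_steps
    for b in range(batch_num):
        i = b * num_steps
        X = [row[i:i + num_steps] for row in rows]
        Y = [row[i + 1:i + num_steps + 1] for row in rows]
        yield X, Y

for X, Y in consecutive_batches(list(range(30)), batch_size=2, num_steps=6):
    print('X:', X, '\nY:', Y)
```

Note that the second X batch starts exactly where the first one ended on each row, which is what makes it possible to carry the hidden state across batches during training.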

Exercises after class

When n = 3, i.e. based on a second-order Markov chain, each word depends only on the two words before it.

3 Fundamentals of recurrent neural networks

This section introduces recurrent neural networks; the figure below shows how to implement a language model based on a recurrent neural network. Our goal is to predict the next character of the sequence based on the current and past input sequence. The recurrent neural network introduces a hidden variable $H$, whose value at time step $t$ is denoted $H_t$. $H_t$ is computed from $X_t$ and $H_{t-1}$; we can think of $H_t$ as recording the sequence information up to the current character, and we use $H_t$ to predict the next character of the sequence.

Structure of a recurrent neural network

Let's first look at the specific structure of a recurrent neural network. Suppose $\boldsymbol{X}_t \in \mathbb{R}^{n \times d}$ is the mini-batch input at time step $t$ and $\boldsymbol{H}_t \in \mathbb{R}^{n \times h}$ is the hidden variable of this time step; then

\boldsymbol{H}_t = \phi(\boldsymbol{X}_t \boldsymbol{W}_{xh} + \boldsymbol{H}_{t-1} \boldsymbol{W}_{hh} + \boldsymbol{b}_h).

where $\boldsymbol{W}_{xh} \in \mathbb{R}^{d \times h}$, $\boldsymbol{W}_{hh} \in \mathbb{R}^{h \times h}$, $\boldsymbol{b}_h \in \mathbb{R}^{1 \times h}$, and $\phi$ is a nonlinear activation function. Because of the term $\boldsymbol{H}_{t-1} \boldsymbol{W}_{hh}$, $\boldsymbol{H}_t$ can capture the historical information of the sequence up to the current time step, like a state or memory of the current time step of the network. Since $\boldsymbol{H}_t$ is computed from $\boldsymbol{H}_{t-1}$, the computation above is recurrent, and a network built on such recurrent computation is a recurrent neural network.

At time step $t$, the output of the output layer is:

\boldsymbol{O}_t = \boldsymbol{H}_t \boldsymbol{W}_{hq} + \boldsymbol{b}_q.

where $\boldsymbol{W}_{hq} \in \mathbb{R}^{h \times q}$ and $\boldsymbol{b}_q \in \mathbb{R}^{1 \times q}$.
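As a sanity check of the shapes in the two formulas above, here is a minimal pure-Python sketch of one forward step (all dimensions and the constant weights are made up for illustration):

```python
import math

def matmul(A, B):
    # naive matrix product of nested lists
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add_mat(A, B):
    # elementwise sum of two matrices of the same shape
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def add_bias(A, b):
    # add a 1 x k bias row to every row of A
    return [[x + y for x, y in zip(row, b[0])] for row in A]

n, d, h, q = 2, 3, 4, 5                      # batch, input, hidden, output sizes (hypothetical)
X_t = [[1.0] * d for _ in range(n)]          # X_t in R^{n x d}
H_prev = [[0.0] * h for _ in range(n)]       # H_{t-1} in R^{n x h}
W_xh = [[0.1] * h for _ in range(d)]         # W_xh in R^{d x h}
W_hh = [[0.1] * h for _ in range(h)]         # W_hh in R^{h x h}
b_h = [[0.0] * h]                            # b_h in R^{1 x h}
W_hq = [[0.1] * q for _ in range(h)]         # W_hq in R^{h x q}
b_q = [[0.0] * q]                            # b_q in R^{1 x q}

# H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h), with phi = tanh
pre = add_bias(add_mat(matmul(X_t, W_xh), matmul(H_prev, W_hh)), b_h)
H_t = [[math.tanh(v) for v in row] for row in pre]
# O_t = H_t W_hq + b_q
O_t = add_bias(matmul(H_t, W_hq), b_q)
print(len(H_t), len(H_t[0]))   # n x h
print(len(O_t), len(O_t[0]))   # n x q
```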

Implementing a recurrent neural network from scratch

First, we try to implement a character-level recurrent neural network language model from scratch. Here we use Jay Chou's lyrics as the corpus. First, we read in the data:

import torch
import torch.nn as nn
import time
import math
import sys
import d2l_jay9460 as d2l
(corpus_indices, char_to_idx, idx_to_char, vocab_size) = d2l.load_data_jay_lyrics()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

One-hot vectors

We need to represent characters as vectors, and here we use one-hot vectors. Suppose the dictionary size is $N$ and each character corresponds to a unique index from 0 to $N-1$. The vector for a character is then a vector of length $N$: if the character's index is $i$, position $i$ of the vector is 1 and every other position is 0. Below we construct the one-hot vectors for indices 0 and 2; the length of each vector equals the dictionary size.

def one_hot(x, n_class, dtype=torch.float32):
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)  # shape: (n, n_class)
    result.scatter_(1, x.long().view(-1, 1), 1)  # result[i, x[i]] = 1
    return result

def to_onehot(X, n_class):
    # X shape: (batch_size, num_steps); returns num_steps matrices of shape (batch_size, n_class)
    return [one_hot(X[:, i], n_class) for i in range(X.shape[1])]

x = torch.tensor([0, 2])
x_one_hot = one_hot(x, vocab_size)

Initialize model parameters

num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size
# num_inputs: d
# num_hiddens: h, the number of hidden units, a hyperparameter
# num_outputs: q

def get_params():
    def _one(shape):
        param = torch.zeros(shape, device=device, dtype=torch.float32)
        nn.init.normal_(param, 0, 0.01)
        return torch.nn.Parameter(param)

    # Hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = torch.nn.Parameter(torch.zeros(num_hiddens, device=device))
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device))
    return (W_xh, W_hh, b_h, W_hq, b_q)

Define the model

The function rnn performs the computation of each time step of the recurrent neural network in a loop.

def rnn(inputs, state, params):
    # inputs and outputs are both lists of num_steps matrices with shape (batch_size, vocab_size)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        H = torch.tanh(torch.matmul(X, W_xh) + torch.matmul(H, W_hh) + b_h)
        Y = torch.matmul(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)

The function init_rnn_state initializes the hidden variable; its return value is a tuple.

def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )

Gradient clipping

In recurrent neural networks, vanishing or exploding gradients occur relatively easily, which can make the network almost impossible to train. Gradient clipping is one technique for dealing with exploding gradients. Suppose we concatenate the gradients of all model parameters into a single vector $\boldsymbol{g}$ and set the clipping threshold to $\theta$. The clipped gradient

\min\left(\frac{\theta}{\|\boldsymbol{g}\|}, 1\right)\boldsymbol{g}

has an $L_2$ norm of at most $\theta$.

def grad_clipping(params, theta, device):
    norm = torch.tensor([0.0], device=device)
    for param in params:
        norm += (param.grad.data ** 2).sum()
    norm = norm.sqrt().item()
    if norm > theta:
        for param in params:
            param.grad.data *= (theta / norm)
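A quick numeric check of the clipping formula, as a standalone sketch mirroring grad_clipping (the gradient vector and threshold are made up):

```python
import math

def clip(g, theta):
    # min(theta / ||g||, 1) * g keeps the L2 norm of g at most theta
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(theta / norm, 1.0)
    return [scale * x for x in g]

g = [3.0, 4.0]                      # ||g|| = 5
clipped = clip(g, theta=2.0)
print(clipped)                       # rescaled so the norm is exactly the threshold
print(clip([3.0, 4.0], theta=6.0))   # norm already below the threshold: unchanged
```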

Define the prediction function

The following function predicts the next num_chars characters based on the prefix (a string containing several characters). This function is a little involved: we make rnn a function parameter so that we can reuse this function with the other recurrent neural networks introduced in the following sections.

def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, device, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, device)
    output = [char_to_idx[prefix[0]]]   # output records the prefix plus the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        # Take the output of the previous time step as the input of the current time step
        X = to_onehot(torch.tensor([[output[-1]]], device=device), vocab_size)
        # Compute the output and update the hidden state
        (Y, state) = rnn(X, state, params)
        # The input of the next time step is either the next character of the prefix
        # or the current best predicted character
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y[0].argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])
params = get_params()
predict_rnn('Separate', 10, rnn, params, init_rnn_state, num_hiddens, vocab_size,
            device, idx_to_char, char_to_idx)


Perplexity

We usually use perplexity to evaluate a language model. Recall the definition of the cross-entropy loss function from the "softmax regression" section: perplexity is the value obtained by exponentiating the cross-entropy loss. In particular,

  • In the best case, the model always predicts the probability of the label category as 1, and the perplexity is 1;
  • In the worst case, the model always predicts the probability of the label category as 0, and the perplexity is positive infinity;
  • In the baseline case, the model predicts the same probability for every category, and the perplexity equals the number of categories.

Obviously, the perplexity of any effective model must be less than the number of categories. In this case, the perplexity must be less than the dictionary size vocab_size.
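The baseline case can be checked numerically: with uniform predictions over k categories the cross-entropy per token is -log(1/k), so exponentiating it recovers k (the vocabulary size below is made up):

```python
import math

# Baseline case: the model assigns equal probability 1/k to each of k categories,
# so the cross-entropy per token is -log(1/k) and the perplexity is exactly k
k = 1027                      # hypothetical vocab_size
cross_entropy = -math.log(1.0 / k)
perplexity = math.exp(cross_entropy)
print(perplexity)
```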

Define the model training function

Compared with the model training functions in the previous chapters, the model training function here differs in the following ways:

  1. It uses perplexity to evaluate the model.
  2. It clips the gradient before updating the model parameters.
  3. Different sampling methods for sequence data lead to different initializations of the hidden state.
def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, device, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    if is_random_iter:
        data_iter_fn = d2l.data_iter_random
    else:
        data_iter_fn = d2l.data_iter_consecutive
    params = get_params()
    loss = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        if not is_random_iter:  # If adjacent sampling is used, the hidden state is initialized at the beginning of epoch
            state = init_rnn_state(batch_size, num_hiddens, device)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, device)
        for X, Y in data_iter:
            if is_random_iter:  # If random sampling is used, the hidden state is initialized before each small batch update
                state = init_rnn_state(batch_size, num_hiddens, device)
            else:  # Otherwise, use the detach function to separate the hidden state from the computation graph
                for s in state:
                    s.detach_()
            # inputs is a list of num_steps matrices with shape (batch_size, vocab_size)
            inputs = to_onehot(X, vocab_size)
            # outputs is a list of num_steps matrices with shape (batch_size, vocab_size)
            (outputs, state) = rnn(inputs, state, params)
            # After concatenation, the shape is (num_steps * batch_size, vocab_size)
            outputs = torch.cat(outputs, dim=0)
            # The shape of Y is (batch_size, num_steps); transpose and flatten it into a
            # vector of shape (num_steps * batch_size,) so it matches the rows of outputs one to one
            y = torch.flatten(Y.T)
            # Use the cross-entropy loss to compute the average classification error
            l = loss(outputs, y.long())
            # Zero the gradients
            if params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            grad_clipping(params, clipping_theta, device)  # clip the gradient
            d2l.sgd(params, lr, 1)  # the error has already been averaged, so the gradient need not be averaged again
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn(prefix, pred_len, rnn, params, init_rnn_state,
                    num_hiddens, vocab_size, device, idx_to_char, char_to_idx))

Train the model and write lyrics

Now we can train the model. First we set the model hyperparameters. We will create lyrics of 50 characters (not counting the prefix) from the prefixes "Separate" and "No separation". Every 50 epochs, we generate a lyric from the model trained so far.

num_epochs, num_steps, batch_size, lr, clipping_theta = 250, 35, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['Separate', 'No separation']
# The following uses the random sampling training model and creates lyrics.
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, True, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)

Exercises after class

Posted on Fri, 14 Feb 2020 04:23:49 -0500 by Liquid Fire