## Recurrent neural network

This section introduces the recurrent neural network (RNN), and the figure below shows how to implement a language model based on it. The goal is to predict the next character of a sequence from the current input and the past inputs. An RNN introduces a hidden variable $H$, whose value at time step $t$ is denoted $H_t$. $H_t$ is computed from $X_t$ and $H_{t-1}$, so $H_t$ can be regarded as recording the sequence information up to the current character, and it is used to predict the next character of the sequence.

#### Construction of the recurrent neural network

Assume that $X_t \in \mathbb{R}^{n \times d}$ is the mini-batch input at time step $t$ and $H_t \in \mathbb{R}^{n \times h}$ is the hidden variable at that time step. Then:

$$H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h)$$

where $W_{xh} \in \mathbb{R}^{d \times h}$, $W_{hh} \in \mathbb{R}^{h \times h}$, $b_h \in \mathbb{R}^{1 \times h}$, and $\phi$ is a nonlinear activation function. Because of the term $H_{t-1} W_{hh}$, $H_t$ can capture the historical information of the sequence up to the current time step, much like the state or memory of the network at that time step. Since the computation of $H_t$ is based on $H_{t-1}$, the formula above is recurrent, and a network that uses this recurrent computation is a recurrent neural network.

At time step $t$, the output of the output layer is:

$$O_t = H_t W_{hq} + b_q$$

where $W_{hq} \in \mathbb{R}^{h \times q}$ and $b_q \in \mathbb{R}^{1 \times q}$.
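To make the two formulas concrete, here is a minimal sketch of a single time step. It assumes $\phi = \tanh$ (the choice used by the from-scratch code later in this section); the sizes `n`, `d`, `h`, `q` and the random weights are purely illustrative.

```python
import torch

torch.manual_seed(0)
# Illustrative sizes (assumptions): batch size, input size, hidden size, output size
n, d, h, q = 2, 5, 4, 5

X_t = torch.randn(n, d)      # mini-batch input at time step t
H_prev = torch.zeros(n, h)   # hidden state from time step t-1

W_xh = torch.randn(d, h) * 0.01
W_hh = torch.randn(h, h) * 0.01
b_h = torch.zeros(1, h)
W_hq = torch.randn(h, q) * 0.01
b_q = torch.zeros(1, q)

# H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h), with phi = tanh
H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)
# O_t = H_t W_hq + b_q
O_t = H_t @ W_hq + b_q

print(H_t.shape, O_t.shape)  # torch.Size([2, 4]) torch.Size([2, 5])
```

Feeding `H_t` back in as `H_prev` for the next input is what makes the computation recurrent.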

#### Implementing a recurrent neural network from scratch

Here we implement a character-level RNN language model from scratch. We use Jay Chou's lyrics as the corpus, and first read in the data:

```python
import torch
import torch.nn as nn
import time
import math
import sys
sys.path.append("/home/kesci/input")
import d2l_jay9460 as d2l

(corpus_indices, char_to_idx, idx_to_char, vocab_size) = d2l.load_data_jay_lyrics()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```

###### One-hot vectors

Characters are represented as one-hot vectors. Assuming the dictionary size is $N$, each character corresponds to a unique index from $0$ to $N-1$. The vector for a character has length $N$: if the character's index is $i$, the $i$-th position of the vector is $1$ and all other positions are $0$. The one-hot vectors for indices $0$ and $2$ are shown below; their length equals the dictionary size.

```python
def one_hot(x, n_class, dtype=torch.float32):
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)  # shape: (n, n_class)
    result.scatter_(1, x.long().view(-1, 1), 1)  # result[i, x[i]] = 1
    return result

x = torch.tensor([0, 2])
x_one_hot = one_hot(x, vocab_size)
print(x_one_hot)
print(x_one_hot.shape)
print(x_one_hot.sum(axis=1))
```

```
tensor([[1., 0., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.]])
torch.Size([2, 1027])
tensor([1., 1.])
```
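As a cross-check, the scatter-based approach can be compared with PyTorch's built-in `torch.nn.functional.one_hot`. This is only a sketch: the `one_hot` function is repeated so the block runs on its own, and the small `n_class = 5` is an illustrative stand-in for the real dictionary size.

```python
import torch
import torch.nn.functional as F

def one_hot(x, n_class, dtype=torch.float32):
    # Same scatter-based approach as in the text
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)
    result.scatter_(1, x.long().view(-1, 1), 1)
    return result

x = torch.tensor([0, 2])
n_class = 5  # illustrative dictionary size (assumption)

# F.one_hot returns an integer tensor, so cast it to float for comparison
builtin = F.one_hot(x, num_classes=n_class).to(torch.float32)
same = torch.equal(one_hot(x, n_class), builtin)
print(same)  # True
```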

Each sampled mini-batch has shape (batch size, number of time steps). The following function transforms such a mini-batch into a list of matrices of shape (batch size, dictionary size), one matrix per time step. In other words, the input at time step $t$ is $X_t \in \mathbb{R}^{n \times d}$, where $n$ is the batch size and $d$ is the word vector size, that is, the one-hot vector length (the dictionary size).

```python
def to_onehot(X, n_class):
    return [one_hot(X[:, i], n_class) for i in range(X.shape[1])]

X = torch.arange(10).view(2, 5)
inputs = to_onehot(X, vocab_size)
print(len(inputs), inputs[0].shape)
```

```
5 torch.Size([2, 1027])
```

###### Initialize model parameters

```python
num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size
# num_inputs: d
# num_hiddens: h, the number of hidden units is a hyperparameter
# num_outputs: q

def get_params():
    def _one(shape):
        param = torch.zeros(shape, device=device, dtype=torch.float32)
        nn.init.normal_(param, 0, 0.01)  # normal distribution with mean 0 and std 0.01
        return torch.nn.Parameter(param)

    # Hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = torch.nn.Parameter(torch.zeros(num_hiddens, device=device))
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device))
    return (W_xh, W_hh, b_h, W_hq, b_q)
```

###### Define the model

The function `rnn` computes the recurrent neural network one time step at a time in a loop.

```python
def rnn(inputs, state, params):
    # inputs and outputs are both lists of num_steps matrices of shape (batch_size, vocab_size)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        H = torch.tanh(torch.matmul(X, W_xh) + torch.matmul(H, W_hh) + b_h)
        Y = torch.matmul(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)
```

The function `init_rnn_state` initializes the hidden variable; its return value is a tuple.

```python
def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )
```

```python
# A simple test to observe the number of outputs (time steps),
# the shape of the output-layer output at the first time step,
# and the shape of the hidden state.
print(X.shape)
print(num_hiddens)
print(vocab_size)
state = init_rnn_state(X.shape[0], num_hiddens, device)
inputs = to_onehot(X.to(device), vocab_size)
params = get_params()
outputs, state_new = rnn(inputs, state, params)
print(len(inputs), inputs[0].shape)
print(len(outputs), outputs[0].shape)
print(len(state), state[0].shape)
print(len(state_new), state_new[0].shape)
```

```
torch.Size([2, 5])
256
1027
5 torch.Size([2, 1027])
5 torch.Size([2, 1027])
1 torch.Size([2, 256])
1 torch.Size([2, 256])
```

###### Gradient clipping

In recurrent neural networks, vanishing or exploding gradients are more likely to occur, which can make the network almost impossible to train. Gradient clipping is a method for dealing with exploding gradients. Suppose the gradients of all model parameters are concatenated into a vector $g$ and the clipping threshold is $\theta$. The clipped gradient

$$\min\left(\frac{\theta}{\|g\|}, 1\right) g$$

has an $L_2$ norm that does not exceed $\theta$.

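```python
def grad_clipping(params, theta, device):
    # Compute the L2 norm over the gradients of all parameters
    norm = torch.tensor([0.0], device=device)
    for param in params:
        norm += (param.grad.data ** 2).sum()
    norm = norm.sqrt().item()
    # Rescale only when the norm exceeds the threshold
    if norm > theta:
        for param in params:
            param.grad.data *= (theta / norm)
```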
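A small worked example may help build intuition for the clipping rule. The sketch below uses plain Python lists rather than tensors; `clip_gradient` is a hypothetical helper written for this illustration, not part of the tutorial's code.

```python
import math

def clip_gradient(g, theta):
    """Return min(theta / ||g||, 1) * g for a gradient given as a list of floats."""
    norm = math.sqrt(sum(v * v for v in g))
    scale = min(theta / norm, 1.0) if norm > 0 else 1.0
    return [v * scale for v in g]

g = [3.0, 4.0]                        # ||g|| = 5
clipped = clip_gradient(g, 1.0)       # norm exceeds theta: scaled by 1/5
unchanged = clip_gradient(g, 10.0)    # norm below theta: left as is

clipped_norm = math.sqrt(sum(v * v for v in clipped))
print(clipped_norm, unchanged)
```

Clipping changes only the length of the gradient vector, never its direction.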

###### Define the prediction function

The following function predicts the next `num_chars` characters based on `prefix` (a string of several characters). The function is slightly involved: `rnn` is passed in as a function argument so that the function can be reused when other recurrent neural networks are introduced in the next section.

```python
def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, device, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, device)
    output = [char_to_idx[prefix[0]]]  # output stores the prefix plus the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        # Use the output of the previous time step as the input of the current time step
        X = to_onehot(torch.tensor([[output[-1]]], device=device), vocab_size)
        # Compute the output and update the hidden state
        (Y, state) = rnn(X, state, params)
        # The next input is either the next prefix character or the current best prediction
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y[0].argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])
```

```python
# First test the predict_rnn function. Generate 10 characters of lyrics
# (not counting the prefix length) from the prefix 'Separate'. Because the
# model parameters are random, the prediction results are also random.
predict_rnn('Separate', 10, rnn, params, init_rnn_state, num_hiddens,
            vocab_size, device, idx_to_char, char_to_idx)
```

```
'When we are apart, when we are apart'
```

###### Perplexity

Perplexity is commonly used to evaluate language models. It is the exponential of the cross-entropy loss. In particular:

- In the best case, the model always predicts the probability of the label category as 1, and the perplexity is 1;
- In the worst case, the model always predicts the probability of the label category as 0, and the perplexity is positive infinity;
- In the baseline case, the model predicts the same probability for all categories, and the perplexity is the number of categories.

Clearly, the perplexity of any useful model must be less than the number of categories. Here, the perplexity must be less than the dictionary size `vocab_size`.
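The three cases above can be checked numerically. The following sketch computes perplexity directly from the probabilities a model assigns to the true labels; the helper `perplexity` and the toy value `num_classes = 4` are assumptions made for illustration.

```python
import math

def perplexity(true_label_probs):
    """exp of the average cross-entropy, given the probability assigned to each true label."""
    avg_ce = -sum(math.log(p) for p in true_label_probs) / len(true_label_probs)
    return math.exp(avg_ce)

num_classes = 4  # illustrative number of categories

best = perplexity([1.0, 1.0, 1.0])             # always probability 1 on the label
baseline = perplexity([1 / num_classes] * 3)   # uniform prediction over all categories
near_worst = perplexity([1e-9] * 3)            # probability close to 0 on the label

print(best, baseline, near_worst)
```

`best` evaluates to 1, `baseline` is (up to rounding) the number of categories, and `near_worst` grows without bound as the probability on the label approaches 0.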

###### Define model training function

Compared with the model training function in the previous chapter, the model training function here has the following differences:

- Perplexity is used to evaluate the model;
- Gradients are clipped before the model parameters are updated;
- Different sampling methods for sequential data lead to different ways of initializing the hidden state.

```python
def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, device, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    if is_random_iter:
        data_iter_fn = d2l.data_iter_random
    else:
        data_iter_fn = d2l.data_iter_consecutive
    # Adjacent sampling: two adjacent mini-batches are contiguous in the training
    # data, so the hidden state only needs to be initialized at the beginning of
    # each epoch. Within an epoch, as iterations accumulate, the gradient of the
    # loss with respect to the hidden variable would propagate further back and
    # become more expensive to compute. To limit this cost, the hidden state is
    # detached from the computation graph at the start of each mini-batch.
    params = get_params()
    loss = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        if not is_random_iter:
            # With adjacent sampling, initialize the hidden state at the start of the epoch
            state = init_rnn_state(batch_size, num_hiddens, device)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, device)
        for X, Y in data_iter:
            if is_random_iter:
                # With random sampling, initialize the hidden state before each mini-batch update
                state = init_rnn_state(batch_size, num_hiddens, device)
            else:
                # Otherwise, detach the hidden state from the computation graph
                for s in state:
                    s.detach_()
            # inputs is a list of num_steps matrices of shape (batch_size, vocab_size)
            inputs = to_onehot(X, vocab_size)
            # outputs is a list of num_steps matrices of shape (batch_size, vocab_size)
            (outputs, state) = rnn(inputs, state, params)
            # After concatenation, the shape is (num_steps * batch_size, vocab_size)
            outputs = torch.cat(outputs, dim=0)
            # Y has shape (batch_size, num_steps); transpose and flatten it into a vector
            # of shape (num_steps * batch_size,) so that it lines up with the rows of outputs
            y = torch.flatten(Y.T)
            # Use cross-entropy loss to compute the average classification error
            l = loss(outputs, y.long())

            # Zero the gradients
            if params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            grad_clipping(params, clipping_theta, device)  # clip the gradients
            d2l.sgd(params, lr, 1)  # the loss is already averaged, so the gradient is not averaged again
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn(prefix, pred_len, rnn, params, init_rnn_state,
                                        num_hiddens, vocab_size, device, idx_to_char, char_to_idx))
```
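The effect of detaching the hidden state can be seen on a toy example. This is a sketch with a single scalar parameter `w` standing in for the model weights and `h` standing in for a hidden state carried over from the previous mini-batch; it is not the training code itself.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# Without detaching: the gradient flows through both steps.
h = w * 1.0                 # "hidden state" produced by the previous mini-batch
y = w * h                   # current step depends on w directly and through h
y.backward()
grad_full = w.grad.item()   # d(w * w)/dw = 2w = 4

w.grad.zero_()

# With detaching: h is treated as a constant, so backpropagation stops there.
h = w * 1.0
h.detach_()
y = w * h
y.backward()
grad_truncated = w.grad.item()  # d(w * h)/dw = h = 2

print(grad_full, grad_truncated)  # 4.0 2.0
```

Detaching keeps the value of the hidden state but truncates backpropagation at the mini-batch boundary, which bounds the cost of each backward pass.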

###### Train the model and generate lyrics

```python
# Set the model hyperparameters. Generate lyrics of 50 characters (not counting
# the prefix length) from the prefixes 'Separate' and 'No separation'.
# Every 50 epochs, generate lyrics with the model trained so far.
num_epochs, num_steps, batch_size, lr, clipping_theta = 250, 35, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['Separate', 'No separation']

# Train the model with random sampling and generate lyrics
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, True, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)

# Train the model with adjacent sampling and generate lyrics
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, False, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)
```

```
# ------------------------------Random sampling------------------------------
epoch 50, perplexity 65.808092, time 0.78 sec
 - I want to be separated. I don't want to think about it anymore. I don't want to think about it anymore. I don't want to think about it anymore. I don't want to think about it anymore
 - Don't separate them. One, two, three, four, three, four, one, four, one, four, one, four, one, four, one, four, one, four, one, four, one, four, one, four, one, four, one
epoch 100, perplexity 9.794889, time 0.72 sec
 - Who has been staying in the United States? Who is waiting for you by the stream outside the village? You are still there. I have a little skinny color. I met you in the same place
 - Don't separate? I can't think about it anymore. I don't. I don't. I don't. I don't. I don't. I don't. I don't. I don't. I don't. I don't. I don't. I don't. I don't. I don't
epoch 150, perplexity 2.772557, time 0.80 sec
 - If it's not right, it'll stay here. If it's not right, it'll stay here. I'm afraid it's not good. The old strange hall belongs to the old school. It's based on my heart. There are some old CDs of wind and frost
 - Don't you separate? Then I'll pass it. I'll lose it slowly. I think it's too fast. But I'll take the test again. I say it's been a long time since I lost it? I'll wait
epoch 200, perplexity 1.601744, time 0.73 sec
 - Separate that one. It's all over my face. Mom pinches it into your shape and screams. Or she would like to say that she can make the shuttle remember to hold it gently. I want to hold your hand like this
 - Don't separate, and then review the past slowly, let me fall in love with you. That tragedy is a perfect performance for you. I'd rather be heartbroken and cry than forget it
epoch 250, perplexity 1.323342, time 0.78 sec
 - It's nice to separate the heaven and the earth from the mantra of crying. Please be the blood to wear yongyang's poems. My love for you was written in the Mesopotamia plain before the Western Yuan Dynasty
 - The fat witch who didn't separate the brooms chanted the Latin incantations cheerfully. Her black cat laughed like crying cheerfully. When I came, I was silent at the mouth of the river outside my feeling
# ------------------------------Adjacent sampling------------------------------
epoch 50, perplexity 60.294393, time 0.74 sec
 - I want you to think I don't want to think I don't want to think I don't want to think I don't want to think I don't want to think I don't want to think
 - Don't leave me. I want you. Don't leave me. My lovely woman is bad. It makes me crazy. Lovely woman is bad. It makes me crazy. Lovely woman is bad. It makes me crazy
epoch 100, perplexity 7.141162, time 0.72 sec
 - Apart, I have to love again. Don't think about me anymore. No, I don't think about me anymore. No, I don't want to love you. My sight is like a tornado
 - If you don't separate Liu Tianhuang, you will know ha after one stick. Use double stick to hum ha. Use double stick to hum ha. Use double stick to hum ha
epoch 150, perplexity 2.090277, time 0.73 sec
 - I want to be separated. This is you. I don't want to be able to do it. But that person is not me anymore. How hard it is without you. How hard it is without you
 - I don't know if you're gone. I want to be better. I'll take me to the end. No, you're the wind. I'll take my mother with me
epoch 200, perplexity 1.305391, time 0.77 sec
 - I want to hold your hand like this. It must come true. It must be like carrying you now. It seems like carrying sunshine. No matter you stay here, it's sunny. Butterflies fly freely
 - I don't know if you're gone. I don't know if I'm with this rhythm. I've had another fall. I should live a good life. I should live a good life
epoch 250, perplexity 1.230800, time 0.79 sec
 - I don't want you to watch it too fast. I'm sorry. I'm afraid my hands will be early this morning. I can't sleep. Last night, in my dream, you came to me. I just wanted to
 - I don't know you're leaving. I don't know you're leaving. I don't know you know the rhythm. Then you know what you know. Then you know what you know. Then you know what you know
```

#### PyTorch implementation of the recurrent neural network

###### Define the model

We use `nn.RNN` in PyTorch to construct the recurrent neural network. This section focuses on the following constructor parameters of `nn.RNN`:

- `input_size` – the number of expected features in the input x
- `hidden_size` – the number of features in the hidden state h
- `nonlinearity` – the non-linearity to use. Can be either 'tanh' or 'relu'. Default: 'tanh' (activation function)
- batch_first – If True, then the input and output tensors are provided as (batch_size, num_steps, input_size). Default: False

Here, `batch_first` determines the shape of the input. We use the default value `False`, so the input shape is (num_steps, batch_size, input_size).

The forward function has the following parameters:

- input of shape (num_steps, batch_size, input_size): tensor containing the features of the input sequence.
- `h_0` (equivalent to the previous `state`) of shape (num_layers * num_directions, batch_size, hidden_size): tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided. `num_layers` is relevant to deep recurrent neural networks and `num_directions` to bidirectional ones: if the RNN is bidirectional, num_directions should be 2, else it should be 1.

The return value of the forward function is:

- `output` of shape (num_steps, batch_size, num_directions * hidden_size): tensor containing the output features (`h_t`) from the last layer of the RNN, for each t
- `h_n` of shape (num_layers * num_directions, batch_size, hidden_size): tensor containing the hidden state for t = num_steps
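The shapes listed above can be checked with a short experiment. The sketch below uses small illustrative sizes, and the two-layer bidirectional configuration is chosen only to make `num_layers * num_directions` visible in the result; this section's own model uses a single unidirectional layer.

```python
import torch
import torch.nn as nn

num_steps, batch_size, input_size, hidden_size = 7, 3, 10, 16  # illustrative sizes

# Two layers, bidirectional: num_layers * num_directions = 4
rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size,
             num_layers=2, bidirectional=True)
X = torch.rand(num_steps, batch_size, input_size)
output, h_n = rnn(X)  # h_0 omitted, so it defaults to zeros
print(output.shape, h_n.shape)  # torch.Size([7, 3, 32]) torch.Size([4, 3, 16])

# batch_first=True swaps the first two axes of input and output, but not of h_n
rnn_bf = nn.RNN(input_size, hidden_size, batch_first=True)
X_bf = torch.rand(batch_size, num_steps, input_size)
output_bf, h_n_bf = rnn_bf(X_bf)
print(output_bf.shape, h_n_bf.shape)  # torch.Size([3, 7, 16]) torch.Size([1, 3, 16])
```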

```python
# Construct an nn.RNN instance to see the shape of the output
rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)
num_steps, batch_size = 35, 2
X = torch.rand(num_steps, batch_size, vocab_size)
state = None
Y, state_new = rnn_layer(X, state)
print(Y.shape, state_new.shape)
```

```
torch.Size([35, 2, 256]) torch.Size([1, 2, 256])
```
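To connect `nn.RNN` with the from-scratch implementation, the following sketch recomputes the output of a single-layer `nn.RNN` by hand. It relies on two details of `nn.RNN`: weights are stored as `weight_ih_l0` and `weight_hh_l0` with shape (hidden_size, input_size) and (hidden_size, hidden_size), hence the transposes, and there are two bias vectors, `bias_ih_l0` and `bias_hh_l0`. The sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_steps, batch_size, input_size, hidden_size = 4, 2, 6, 5  # illustrative sizes

layer = nn.RNN(input_size=input_size, hidden_size=hidden_size)  # tanh by default
X = torch.rand(num_steps, batch_size, input_size)
Y, h_n = layer(X)

# Recompute the same recurrence manually, starting from a zero hidden state
H = torch.zeros(batch_size, hidden_size)
manual = []
with torch.no_grad():
    for X_t in X:
        H = torch.tanh(X_t @ layer.weight_ih_l0.T + layer.bias_ih_l0
                       + H @ layer.weight_hh_l0.T + layer.bias_hh_l0)
        manual.append(H)
manual = torch.stack(manual)

matches = torch.allclose(Y.detach(), manual, atol=1e-5)
last_matches = torch.allclose(h_n[0].detach(), manual[-1], atol=1e-5)
print(matches, last_matches)
```

For a single unidirectional layer, `h_n[0]` is the same as the last time step of the output.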

```python
# Define a complete language model based on a recurrent neural network
class RNNModel(nn.Module):
    def __init__(self, rnn_layer, vocab_size):
        super(RNNModel, self).__init__()
        self.rnn = rnn_layer
        self.hidden_size = rnn_layer.hidden_size * (2 if rnn_layer.bidirectional else 1)
        self.vocab_size = vocab_size
        self.dense = nn.Linear(self.hidden_size, vocab_size)

    def forward(self, inputs, state):
        # inputs.shape: (batch_size, num_steps)
        X = to_onehot(inputs, self.vocab_size)  # use the stored size rather than the global variable
        X = torch.stack(X)  # X.shape: (num_steps, batch_size, vocab_size)
        hiddens, state = self.rnn(X, state)
        hiddens = hiddens.view(-1, hiddens.shape[-1])  # hiddens.shape: (num_steps * batch_size, hidden_size)
        output = self.dense(hiddens)
        return output, state
```

```python
# A prediction function is needed here as well. It differs from the earlier
# one in the forward computation and in how the hidden state is initialized.
def predict_rnn_pytorch(prefix, num_chars, model, vocab_size, device, idx_to_char,
                        char_to_idx):
    state = None
    output = [char_to_idx[prefix[0]]]  # output stores the prefix plus the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        X = torch.tensor([output[-1]], device=device).view(1, 1)
        (Y, state) = model(X, state)  # the forward computation does not need the model parameters passed in
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y.argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])
```

```python
# Predict once using a model with random weights
model = RNNModel(rnn_layer, vocab_size).to(device)
predict_rnn_pytorch('Separate', 10, model, vocab_size, device, idx_to_char, char_to_idx)
```

```
'Separate the chest and turn the wheel'
```

```python
# Training function using adjacent sampling
def train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                                  corpus_indices, idx_to_char, char_to_idx,
                                  num_epochs, num_steps, lr, clipping_theta,
                                  batch_size, pred_period, pred_len, prefixes):
    loss = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    for epoch in range(num_epochs):
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = d2l.data_iter_consecutive(corpus_indices, batch_size, num_steps, device)  # adjacent sampling
        state = None
        for X, Y in data_iter:
            if state is not None:
                # Detach the hidden state from the computation graph
                if isinstance(state, tuple):  # LSTM, state: (h, c)
                    state[0].detach_()
                    state[1].detach_()
                else:
                    state.detach_()
            (output, state) = model(X, state)  # output.shape: (num_steps * batch_size, vocab_size)
            y = torch.flatten(Y.T)
            l = loss(output, y.long())

            optimizer.zero_grad()
            l.backward()
            # model.parameters() is a generator and grad_clipping iterates over its
            # argument twice, so materialize it as a list first
            grad_clipping(list(model.parameters()), clipping_theta, device)
            optimizer.step()
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn_pytorch(
                    prefix, pred_len, model, vocab_size, device, idx_to_char,
                    char_to_idx))

num_epochs, batch_size, lr, clipping_theta = 250, 32, 1e-3, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['Separate', 'No separation']
train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                              corpus_indices, idx_to_char, char_to_idx,
                              num_epochs, num_steps, lr, clipping_theta,
                              batch_size, pred_period, pred_len, prefixes)
```

```
epoch 50, perplexity 9.405654, time 0.52 sec
 - Start with three steps and four steps, look at the stars, one two three four in a line, back to back, make a silent wish, a willow, where I am
 - Don't separate love, your hand. One's old turtledove's legs are not very hairy. Use the double stick to hum and ha. Use the double stick to hum and ha. Use the double stick to quickly
epoch 100, perplexity 1.255020, time 0.54 sec
 - When I leave my house, I will make his beloved dove love me like a wind blowing the perfect Lord. It's too fast for me to learn to be afraid of my eyes and mouth
 - I don't want to separate. I don't want to talk about my head. I've already figured it out. I don't want to talk about it. I'm afraid I can't hold your tears. I don't understand your black humor
epoch 150, perplexity 1.064527, time 0.53 sec
 - I'm sorry that the vines are crawling all over the count's tomb
 - Don't separate, don't think about it, don't worry about it, don't blame it, don't regret it, don't say it, I don't think it's too hard
epoch 200, perplexity 1.033074, time 0.53 sec
 - Apart from the stream outside my light, I can only be a black far away in the past. I think for a long time, so I don't want you to hit my mother again
 - I'm sleeping together. I just want you and hamburger. I want your smile every day. I know it's beautiful here, but you're more beautiful in my hometown
epoch 250, perplexity 1.047890, time 0.68 sec
 - Apart from me, the light and much more diffuse is already playing in you. I want to hold your hand like this. Can I not let go of love simply without hurting you
 - I don't want to leave you. I can't do anything about it. I'll break my familiarity. I'll set no date here. Then I'll review the past and let me fall in love
```