A recurrent neural network (RNN) is designed around the recursive nature of sequential data such as language, speech and time series. It is a feedback neural network: its structure contains a loop that repeats itself at every step, which is why it is called "recurrent". It is used specifically to process sequence data, for example generating text word by word or predicting time series such as stock prices.
1, RNN network type
RNNs can be divided into five basic structure types according to the number of inputs m and the number of outputs n:
(1) one to one: this is in fact no different from a fully connected neural network, so this category is not really an RNN.
(2) one to many: the input is not a sequence, but the output is a sequence. It can be used to generate articles or music from a topic, etc.
(3) many to one: the input is a sequence, and the output is not a sequence (a single value). Commonly used for text classification.
(4) many to many: both input and output are sequences of variable length. This is the encoder-decoder structure, which is often used in machine translation.
(5) many to many (m==n): both input and output are sequences of equal length. This is the most classical RNN structure and is often used for named entity recognition in NLP and for sequence prediction. This article takes it as the example.
2, RNN principle
For the RNN model, we again analyze it in terms of data, model, learning objective and optimization algorithm. When using it, we need to pay attention to how its input and output differ from those of other models (this section takes the classical RNN structure with m==n as the example):
2.1 data level
Unlike traditional machine learning models, which assume that inputs are independent, the input elements of an RNN are ordered and interdependent, and they are fed into the model serially, one time step at a time. The input of an earlier step affects the prediction of later steps. For example, in a text prediction task on the sequence "cat eats fish", the input "cat" x(0) affects the predicted probability of "eats" x(1), and goes on to affect the predicted probability of "fish" x(2). Through the RNN structure, the historical (context) information is fed back to the next step.
2.2 model level and forward propagation
As shown in the figure above, the RNN model (the model on the left, which is actually the only physical model) can be regarded as multiple fully connected neural networks connected in series along the time steps (t) and sharing the parameters ($U, W, V$). The unrolled view is as follows:
In addition to the input x(t) of each step, the RNN also takes in the feedback from the previous step, the hidden state h(t-1). That is, the hidden state h(t) at the current time is jointly determined by the input x(t) at the current time and the hidden state h(t-1) at the previous time. In addition, the RNN neurons share the same weight matrices at every time step (unlike a CNN, which shares parameters across space). Sharing parameters along the time dimension makes full use of the temporal correlation in the data. If we had separate parameters for each time point, the model could not generalize to sequence lengths not seen during training, nor could it share statistical strength across different sequence lengths and different positions in time.
The forward propagation calculation flow chart of each time step is as follows. Next, we will gradually decompose the calculation flow:
The above figure unfolds the calculation process of two time steps, t-1 and t;
t ranges from 0 to the sequence length m;
x(t) is the input vector of time step t;
U is the weight matrix from the input layer to the hidden layer;
W is the weight matrix from the hidden state of the previous time step to the hidden layer;
h(t) is the output state vector of the hidden layer at time step t, which represents the feedback information of the historical input (context);
V is the weight matrix from the hidden layer to the output layer;
b is the bias term;
o(t) is the output vector of the output layer at time step t;
2.2.1 input process at time step t
Assume that the state h at each time step has dimension 2 with initial value [0, 0], and that the input x and the output o each have dimension 1.
The state h(t-1) of the previous step is concatenated with the input x(t) of the current step into a one-dimensional vector, which is the input of the fully connected hidden layer. The input dimension of the hidden layer is therefore 3 (as shown in the input part of the figure below).
2.2.2 output h(t) at time step t and feedback to the next step
Following the calculation flow chart, the output state h(t-1) at time t-1 is [0.537, 0.462] and the input at time t is [2.0]. After concatenation, [0.537, 0.462, 2.0] is fed into the fully connected hidden layer. The stacked weight matrix of the hidden layer (W for the two state rows, followed by U for the input row) is [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], and the bias term b1 is [0.1, -0.1]. The hidden layer computes tanh((h(t-1) concatenated with x(t)) * (W stacked with U) + b1) and outputs the state h(t). Then h(t) and x(t+1) are fed into the hidden layer at the next step (t+1).
# Matrix operation of the hidden layer
import numpy as np
np.tanh(np.dot(np.array([[0.537, 0.462, 2.0]]), np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])) + np.array([0.1, -0.1]))
# Output h(t): array([[0.85972772, 0.88365397]])
2.2.3 from the state h(t) to the output o(t) at time step t
The output state h(t) of the hidden layer is [0.86, 0.884], the weight matrix V of the output layer is [[1.0], [2.0]], and the bias term b2 is [0.1]. The output layer computes h(t) * V + b2 and outputs o(t).
# Matrix operation of the output layer
import numpy as np
np.dot(np.array([[0.85972772, 0.88365397]]), np.array([[1.0], [2.0]])) + np.array([0.1])
# Output o(t): array([[2.72703566]])
The process above, repeated from the initial input (t=0) to the end of the sequence (t=m), is a complete forward propagation. The weight matrices U, W, V and the bias terms are the same set at every time step, which again shows that an RNN shares parameters across time.
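As a minimal sketch of this forward pass, the loop below reuses one set of parameters at every time step. The function name `rnn_forward` and the three-step input sequence are made up for illustration. The weights are the ones from the walkthrough, split into W (the two state rows) and U (the input row), with b1 taken as [0.1, -0.1], the value that reproduces the h(t) quoted above.

```python
import numpy as np

def rnn_forward(xs, U, W, V, b1, b2, h0):
    """Run a simple RNN over a sequence, reusing the same parameters at every step."""
    h = h0
    states, outputs = [], []
    for x in xs:                             # one iteration per time step
        h = np.tanh(x @ U + h @ W + b1)      # new state from current input and previous state
        o = h @ V + b2                       # linear output layer (regression)
        states.append(h)
        outputs.append(o)
    return np.array(states), np.array(outputs)

# Weights from the walkthrough: W is the first two rows of the stacked matrix, U the last row.
W = np.array([[0.1, 0.2], [0.3, 0.4]])
U = np.array([[0.5, 0.6]])
V = np.array([[1.0], [2.0]])
b1 = np.array([0.1, -0.1])   # assumed sign, chosen to match the quoted h(t)
b2 = np.array([0.1])

# Hypothetical sequence of three scalar inputs, starting from the all-zero state.
xs = np.array([[1.0], [0.5], [2.0]])
states, outputs = rnn_forward(xs, U, W, V, b1, b2, h0=np.zeros(2))
print(states.shape, outputs.shape)   # (3, 2) (3, 1)
```

Because the parameters do not depend on the step index, the same function can be called on a sequence of any length.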
This RNN calculation process can be briefly summarized into two formulas:
State: h(t) = f(U * x(t) + W * h(t-1) + b1), where f is the activation function; the hidden layer in the figure above uses tanh. Common hidden layer activation functions are tanh and relu.
Output: o(t) = g(V * h(t) + b2), where g is the activation function. The output layer in the figure above performs regression and uses no nonlinear activation function; for classification tasks, the output layer generally uses the softmax activation function.
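The state formula and the concatenated form used in the walkthrough are the same computation written two ways. The sketch below checks this numerically with the walkthrough's numbers (b1 is taken as [0.1, -0.1], an assumption that matches the quoted output). It also shows a softmax as one possible choice of g for classification; the three logits are hypothetical.

```python
import numpy as np

h_prev = np.array([0.537, 0.462])          # h(t-1)
x_t = np.array([2.0])                      # x(t)
W = np.array([[0.1, 0.2], [0.3, 0.4]])     # state-to-hidden weights
U = np.array([[0.5, 0.6]])                 # input-to-hidden weights
b1 = np.array([0.1, -0.1])                 # assumed sign, see lead-in

# Form 1: concatenate [h(t-1), x(t)] and multiply by W stacked on top of U.
stacked = np.vstack([W, U])                # shape (3, 2)
h_concat = np.tanh(np.concatenate([h_prev, x_t]) @ stacked + b1)

# Form 2: the formula h(t) = f(U * x(t) + W * h(t-1) + b1) with f = tanh.
h_formula = np.tanh(x_t @ U + h_prev @ W + b1)

print(np.allclose(h_concat, h_formula))    # True

def softmax(z):
    """Numerically stable softmax, a common choice of g for classification."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))  # hypothetical logits for 3 classes
print(probs.sum())
```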
2.3 learning objectives
After mapping the input sequence x(t) to the outputs o(t), the RNN model, like a fully connected neural network, measures the error between each o(t) and the corresponding training target y with a loss function (such as cross entropy or mean squared error), and takes minimizing the loss function L(U,W,V) as the learning objective (also known as the optimization strategy).
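The two loss functions mentioned above can be sketched as follows. The helper names and the toy values are hypothetical, and averaging over time steps is one common convention rather than the only one.

```python
import numpy as np

def mse_loss(outputs, targets):
    """Mean squared error, averaged over all time steps (regression)."""
    return np.mean((outputs - targets) ** 2)

def cross_entropy_loss(probs, labels):
    """Average negative log-likelihood (classification).

    probs has shape (steps, classes); labels holds the true class index per step.
    """
    steps = np.arange(len(labels))
    return -np.mean(np.log(probs[steps, labels]))

# Hypothetical outputs o(t) and targets y(t) for three time steps.
outputs = np.array([1.0, 2.0, 3.0])
targets = np.array([1.5, 2.0, 2.0])
print(mse_loss(outputs, targets))        # (0.25 + 0 + 1) / 3

# Hypothetical softmax outputs for two time steps and two classes.
probs = np.array([[0.5, 0.5], [0.25, 0.75]])
labels = np.array([0, 1])
print(cross_entropy_loss(probs, labels))
```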
2.4 optimization algorithm
There is no essential difference between the optimization of an RNN and that of a fully connected neural network: error back propagation and multiple iterations of gradient descent are used to find suitable model parameters $U, W, V$ (the bias terms are ignored here). The difference is that an RNN propagates errors backward through time, so RNN back propagation is sometimes called BPTT (back propagation through time). BPTT sums the gradients from the different time steps. Since all parameters are shared across every position of the sequence, the same set of parameters is updated during back propagation. The following is the schematic diagram of BPTT and the derivation of the gradients with respect to U, W and V.
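Optimizing the parameter V
The gradient with respect to V is relatively simple to compute: take the partial derivative with respect to V at each time step and sum the gradients over the time steps:
$$\frac{\partial L}{\partial V} = \sum_{t} \frac{\partial L^{(t)}}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial V}$$
Optimizing the parameters W and U
The partial derivatives with respect to W and U involve the historical data, so they are more complex to compute. Assume that there are only three time steps (numbered 1, 2, 3). At the third time step, the partial derivative of L with respect to W is (here $\partial h^{(k)} / \partial W$ denotes the direct dependence of $h^{(k)}$ on W, with $h^{(k-1)}$ held fixed):
$$\frac{\partial L^{(3)}}{\partial W} = \frac{\partial L^{(3)}}{\partial o^{(3)}} \frac{\partial o^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial W} + \frac{\partial L^{(3)}}{\partial o^{(3)}} \frac{\partial o^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial h^{(2)}} \frac{\partial h^{(2)}}{\partial W} + \frac{\partial L^{(3)}}{\partial o^{(3)}} \frac{\partial o^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial h^{(2)}} \frac{\partial h^{(2)}}{\partial h^{(1)}} \frac{\partial h^{(1)}}{\partial W}$$
Accordingly, the partial derivative of L with respect to U at the third time step is:
$$\frac{\partial L^{(3)}}{\partial U} = \frac{\partial L^{(3)}}{\partial o^{(3)}} \frac{\partial o^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial U} + \frac{\partial L^{(3)}}{\partial o^{(3)}} \frac{\partial o^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial h^{(2)}} \frac{\partial h^{(2)}}{\partial U} + \frac{\partial L^{(3)}}{\partial o^{(3)}} \frac{\partial o^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial h^{(2)}} \frac{\partial h^{(2)}}{\partial h^{(1)}} \frac{\partial h^{(1)}}{\partial U}$$
From the two formulas above, we can write the general formulas for the partial derivatives of L with respect to W and U at time t:
$$\frac{\partial L^{(t)}}{\partial W} = \sum_{k=1}^{t} \frac{\partial L^{(t)}}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial h^{(t)}} \left( \prod_{j=k+1}^{t} \frac{\partial h^{(j)}}{\partial h^{(j-1)}} \right) \frac{\partial h^{(k)}}{\partial W}$$
$$\frac{\partial L^{(t)}}{\partial U} = \sum_{k=1}^{t} \frac{\partial L^{(t)}}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial h^{(t)}} \left( \prod_{j=k+1}^{t} \frac{\partial h^{(j)}}{\partial h^{(j-1)}} \right) \frac{\partial h^{(k)}}{\partial U}$$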
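To make the derivation concrete, here is a minimal NumPy sketch of BPTT for a small RNN with a tanh hidden layer, a linear output layer, a squared-error loss and no bias terms. All names, shapes and random values are assumptions chosen for illustration. The analytic gradient of one entry of W is compared against a finite-difference estimate.

```python
import numpy as np

def forward(xs, ys, U, W, V):
    """Forward pass; returns the summed squared-error loss and all hidden states."""
    h = np.zeros(W.shape[0])
    hs = [h]                                  # hs[0] is the initial state
    loss = 0.0
    for x, y in zip(xs, ys):
        h = np.tanh(x @ U + h @ W)
        o = h @ V
        loss += 0.5 * np.sum((o - y) ** 2)
        hs.append(h)
    return loss, hs

def bptt(xs, ys, U, W, V):
    """Backward pass through time; gradients are summed over all time steps."""
    _, hs = forward(xs, ys, U, W, V)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(W.shape[0])            # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        h, h_prev = hs[t + 1], hs[t]
        do = h @ V - ys[t]                    # dL/do(t) for the squared error
        dV += np.outer(h, do)
        dh = do @ V.T + dh_next               # from the output and from the future
        da = dh * (1.0 - h ** 2)              # back through tanh
        dU += np.outer(xs[t], da)
        dW += np.outer(h_prev, da)
        dh_next = da @ W.T                    # pass the gradient on to step t-1
    return dU, dW, dV

rng = np.random.default_rng(0)
U = rng.normal(scale=0.5, size=(1, 2))
W = rng.normal(scale=0.5, size=(2, 2))
V = rng.normal(scale=0.5, size=(2, 1))
xs = rng.normal(size=(3, 1))                  # three time steps, as in the example
ys = rng.normal(size=(3, 1))

dU, dW, dV = bptt(xs, ys, U, W, V)

# Finite-difference estimate for one entry of W.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (forward(xs, ys, U, W_plus, V)[0] - forward(xs, ys, U, W_minus, V)[0]) / (2 * eps)
print(numeric, dW[0, 1])
```

The `dh_next = da @ W.T` line is where the product of the factors ∂h(j)/∂h(j-1) accumulates, one factor per step back in time.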
Difficulties in RNN optimization
We substitute the activation function f (sigmoid or tanh) into the general formula above and analyze the cumulative product in the middle, in which every factor is the derivative of the activation function multiplied by W:
$$\prod_{j=k+1}^{t} \frac{\partial h^{(j)}}{\partial h^{(j-1)}} = \prod_{j=k+1}^{t} f' \cdot W$$
The derivative of the sigmoid function lies in (0, 0.25], and the derivative of the tanh function lies in (0, 1]. In the cumulative product, if sigmoid is the activation function, multiplying these small derivatives makes the gradient contributed by earlier time steps smaller and smaller until it approaches 0 (the further a historical time step is from the current one, the weaker its gradient signal). This is called the "vanishing gradient" problem. Conversely, when the factors in the product are large (for example with large weights W), the cumulative product can grow rapidly, which is called "gradient explosion".
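A quick numerical illustration of why the product shrinks: the sketch below evaluates both derivatives at 0, where they are largest, and then prints the upper bound 0.25^k on a product of k sigmoid derivatives. It deliberately ignores the weight factor W, so it illustrates the trend rather than giving an exact gradient. The helper names are made up.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid function."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of the tanh function."""
    return 1.0 - np.tanh(z) ** 2

def max_sigmoid_chain(k):
    """Upper bound on the product of k sigmoid derivatives (weights ignored)."""
    return 0.25 ** k

print(sigmoid_grad(0.0))   # 0.25, the maximum of the sigmoid derivative
print(tanh_grad(0.0))      # 1.0, the maximum of the tanh derivative

for k in (1, 5, 10, 20):
    print(k, max_sigmoid_chain(k))
```

After only 10 steps the bound is already below 1e-6, which is why signals from distant time steps barely influence the update.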
2.5 limitations of RNN

The RNNs shown above are all unidirectional. One disadvantage of a unidirectional RNN is that at time t it cannot use the sequence information from t+1 onward, which is why the bidirectional recurrent neural network (BDRNN) was introduced.

In theory, an RNN can use information from arbitrarily long sequences, but in practice the length it can remember is limited. Beyond a certain number of time steps, gradients explode or vanish (as described in the previous section); this is the long-term dependency problem. In general, using a traditional RNN requires limiting the maximum sequence length, applying gradient clipping and regularization that guides the information flow, or switching to gated RNNs such as GRU and LSTM to alleviate the long-term dependency problem (topics discussed later).
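As a sketch of one of these remedies, the function below rescales a list of gradient arrays so that their combined L2 norm does not exceed a threshold. This is one common form of gradient clipping, not necessarily the variant the author has in mind. The function name, the threshold and the toy gradients are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by a common factor if their joint L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads, total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Toy gradients whose joint norm is sqrt(3^2 + 4^2) = 5.
grads = [np.array([[3.0, 0.0]]), np.array([4.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Scaling every array by the same factor keeps the direction of the update and only limits its size, which guards against exploding gradients but does nothing for vanishing ones.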
3, Forecasting stock prices with RNN
This project builds an RNN model with a single hidden layer. It takes the time series of a stock's opening price over the previous 60 trading days (time steps) as input and predicts the opening price on the next trading day (day 60 + 1).
Import the stock data and select the time series of the stock's opening price
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset_train = pd.read_csv('./data/NSETATAGLOBAL.csv')
dataset_train = dataset_train.sort_values(by='Date').reset_index(drop=True)
# Select the opening-price column (column index 1)
training_set = dataset_train.iloc[:, 1:2].values
print(dataset_train.shape)
dataset_train.head()
The training data are normalized to accelerate the convergence of network training.
# Min-max normalization of the training data
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)
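For reference, min-max scaling to the range (0, 1) and its inverse come down to the arithmetic below. This is a hand-written NumPy sketch with made-up prices, not a replacement for `MinMaxScaler`.

```python
import numpy as np

# Hypothetical opening prices, one column as in training_set.
prices = np.array([[100.0], [150.0], [200.0]])

# Forward transform: map the column minimum to 0 and the maximum to 1.
col_min, col_max = prices.min(axis=0), prices.max(axis=0)
scaled = (prices - col_min) / (col_max - col_min)

# Inverse transform: recover the original scale from the scaled values.
restored = scaled * (col_max - col_min) + col_min

print(scaled.ravel())     # [0.  0.5 1. ]
print(restored.ravel())   # [100. 150. 200.]
```

The minimum and maximum come from the training data only, which is why the later test inputs are transformed with the already fitted scaler instead of being refitted.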
Arrange the data into samples and labels: 60 time steps as input and 1 output
# Each sample contains 60 time steps; its label is the value at the next time step
X_train = []
y_train = []
for i in range(60, 2035):
    X_train.append(training_set_scaled[i-60:i, 0])
    y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
print(X_train.shape)
print(y_train.shape)

# Reshape to (samples, time steps, features)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
print(X_train.shape)
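The windowing step is easier to follow on a tiny series. The sketch below uses a hypothetical helper `make_windows` with a window of 3 instead of 60 to show which values end up in each sample and label.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step labels."""
    X, y = [], []
    for i in range(window, len(series)):
        X.append(series[i - window:i])   # the previous `window` values
        y.append(series[i])              # the value to predict
    X = np.array(X).reshape(-1, window, 1)
    return X, np.array(y)

series = np.arange(10, dtype=float)      # toy series 0, 1, ..., 9
X, y = make_windows(series, window=3)
print(X.shape, y.shape)                  # (7, 3, 1) (7,)
print(X[0, :, 0], y[0])                  # [0. 1. 2.] 3.0
```

With the 2035 rows and 60 time steps used in the loop above, the same logic yields 2035 - 60 = 1975 samples.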
Create an RNN model with a single hidden layer using Keras, with adam as the optimization algorithm and the mean squared error (MSE) as the objective function
# Create the RNN model with Keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN

# Initialize a sequential model
regressor = Sequential()
# Define the input layer and a hidden layer with 5 neurons
regressor.add(SimpleRNN(units = 5, input_shape = (X_train.shape[1], 1)))
# Define the linear output layer
regressor.add(Dense(units = 1))
# Compile the model: adam optimizer, mean squared error (MSE) as the objective function
regressor.compile(optimizer = 'adam', loss = 'mean_squared_error')
# Train the model
history = regressor.fit(X_train, y_train, epochs = 100, batch_size = 100, validation_split=0.1)
regressor.summary()
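The size of this model can be worked out by hand from the U, W, V description in section 2. The sketch below counts the trainable parameters, assuming both layers keep their bias terms (the Keras default). The helper names are made up.

```python
def simple_rnn_params(input_dim, units):
    """U (input_dim x units) + W (units x units) + bias b1 (units)."""
    return input_dim * units + units * units + units

def dense_params(input_dim, units):
    """V (input_dim x units) + bias b2 (units)."""
    return input_dim * units + units

rnn = simple_rnn_params(input_dim=1, units=5)   # 5 + 25 + 5
out = dense_params(input_dim=5, units=1)        # 5 + 1
print(rnn, out, rnn + out)                      # 35 6 41
```

The count does not depend on the 60 time steps, because the same parameters are reused at every step.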
Show how well the model fits: both the training set and the validation set have low loss
plt.plot(history.history['loss'], c='blue')     # Blue line: training set loss
plt.plot(history.history['val_loss'], c='red')  # Red line: validation set loss
plt.show()
Evaluate the model: use the stock trading data from a new time period as the test set to evaluate the model's performance.
# Load the test data
dataset_test = pd.read_csv('./data/tatatest.csv')
dataset_test = dataset_test.sort_values(by='Date').reset_index(drop=True)
real_stock_price = dataset_test.iloc[:, 1:2].values

# Prepend the last 60 training days so that every test day has a full input window
dataset_total = pd.concat((dataset_train['Open'], dataset_test['Open']), axis = 0)
inputs = dataset_total[len(dataset_total) - len(dataset_test) - 60:].values
inputs = inputs.reshape(-1, 1)
inputs = sc.transform(inputs)

# Extract the test set
X_test = []
for i in range(60, 76):
    X_test.append(inputs[i-60:i, 0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Model prediction
predicted_stock_price = regressor.predict(X_test)
# Inverse normalization
predicted_stock_price = sc.inverse_transform(predicted_stock_price)

# Model evaluation
print('Difference between forecast and actual MSE', sum(pow((predicted_stock_price - real_stock_price), 2)) / predicted_stock_price.shape[0])
print('Difference between forecast and actual MAE', sum(abs(predicted_stock_price - real_stock_price)) / predicted_stock_price.shape[0])
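The two metrics printed at the end of the evaluation code can also be written as small helpers. The sketch below uses made-up prices so that the arithmetic is easy to verify by hand; it does not reproduce the article's results.

```python
import numpy as np

def mse(pred, real):
    """Mean squared error between predictions and actual values."""
    return float(np.mean((pred - real) ** 2))

def mae(pred, real):
    """Mean absolute error between predictions and actual values."""
    return float(np.mean(np.abs(pred - real)))

# Hypothetical predicted and actual prices, shaped (samples, 1) like the model output.
pred = np.array([[101.0], [99.0], [104.0]])
real = np.array([[100.0], [100.0], [100.0]])

print(mse(pred, real))   # (1 + 1 + 16) / 3 = 6.0
print(mae(pred, real))   # (1 + 1 + 4) / 3 = 2.0
```

MSE penalizes the single large miss (4) much more heavily than MAE does, which is worth keeping in mind when comparing the two numbers.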
Evaluation on the test set gives an MSE of 53.03141531 and an MAE of 5.82196445 between the predicted and actual values. Visually, the predicted values agree with the actual values overall (Note: this article only forecasts the stock price from the regularities in the data; it is for reference only and does not constitute investment advice. Don't come looking for me if you lose money!!!).
# Visualization of predicted and actual values
plt.plot(real_stock_price, color = 'red', label = 'Real TATA Stock Price')
plt.plot(predicted_stock_price, color = 'blue', label = 'Predicted TATA Stock Price')
plt.title('TATA Stock Price Prediction')
plt.xlabel('samples')
plt.ylabel('TATA Stock Price')
plt.legend()
plt.show()