Based on Andrew Ng's (Wu Enda) Machine Learning course
Reference: Huang Haiguang's notes
import numpy as np
from scipy.io import loadmat
from sklearn.preprocessing import OneHotEncoder
from scipy.optimize import minimize
For this exercise, we will work with the handwritten digits dataset again, this time training a feedforward neural network with back propagation. We will implement the unregularized and regularized versions of the neural network cost function, compute the gradients with the back propagation algorithm, and also implement random weight initialization and prediction with the trained network.
Sigmoid function
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}X}}$$
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
Forward propagation function
The network we construct has an input layer of 400 units (401 with the bias unit), a hidden layer of 25 units (26 with the bias unit), and an output layer of 10 units (one per class of our one-hot encoded labels).
def forward_propagate(X, theta1, theta2):
    # INPUT: parameter matrices theta1, theta2 and data X
    # OUTPUT: forward propagation results under the current parameter values
    # TODO: compute the forward propagation results from the parameters and the input data
    # STEP1: get the number of samples
    m = X.shape[0]
    # STEP2: forward propagation through the network
    a1 = np.insert(X, 0, values=np.ones(m), axis=1)  # insert a column of ones into X (axis=1 means a column)
    z2 = a1 * theta1.T
    a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)  # insert a column of ones
    z3 = a2 * theta2.T
    h = sigmoid(z3)
    return a1, z2, a2, z3, h
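As a quick sanity check of the dimensions described above, a small sketch like the following (not part of the original exercise; X_dummy, theta1_dummy and theta2_dummy are made-up placeholders) feeds random data of the right shapes through forward_propagate and prints the shape of each layer:

# Minimal shape check with dummy data (assumed sizes: 400 inputs, 25 hidden units, 10 outputs)
X_dummy = np.matrix(np.random.random((3, 400)))        # 3 fake samples
theta1_dummy = np.matrix(np.random.random((25, 401)))  # hidden layer weights (incl. bias column)
theta2_dummy = np.matrix(np.random.random((10, 26)))   # output layer weights (incl. bias column)

a1, z2, a2, z3, h = forward_propagate(X_dummy, theta1_dummy, theta2_dummy)
print(a1.shape, z2.shape, a2.shape, z3.shape, h.shape)  # expected: (3, 401) (3, 25) (3, 26) (3, 10) (3, 10)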
Cost function
$$J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log \left( h_\Theta(x^{(i)}) \right)_k + \left( 1 - y_k^{(i)} \right) \log \left( 1 - \left( h_\Theta(x^{(i)}) \right)_k \right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( \Theta_{ji}^{(l)} \right)^2$$
Since there are only two weight matrices here, the regularization term can also be written as:
$$\frac{\lambda}{2m} \left[ \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( \Theta_{ji}^{(1)} \right)^2 + \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( \Theta_{ji}^{(2)} \right)^2 \right]$$
def cost(params, input_size, hidden_size, num_labels, X, y, lamda):
    # INPUT: flattened network parameters, input layer size, hidden layer size, number of labels,
    #        training data and labels, regularization parameter
    # OUTPUT: value of the cost function for the current parameters
    # TODO: compute the cost function according to the formula
    # STEP1: get the number of samples (5000 here)
    m = X.shape[0]
    # STEP2: convert X and y to numpy matrices
    X = np.matrix(X)
    y = np.matrix(y)
    # STEP3: recover the network parameters from params, reshape them according to the
    #        input and hidden layer sizes, and split them into two thetas
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
    # STEP4: call the forward propagation function written earlier
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
    # STEP5: initialize the cost
    J = 0
    # STEP6: compute the cost according to the formula
    for i in range(m):  # loop over every sample
        first_term = np.multiply(-y[i, :], np.log(h[i, :]))
        second_term = np.multiply((1 - y[i, :]), np.log(1 - h[i, :]))
        J += np.sum(first_term - second_term)
    J = J / m
    # STEP7: add the regularization part of the cost, excluding theta0 (the bias columns)
    J += (float(lamda) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))
    return J
Gradient of the sigmoid function
Next comes the back propagation algorithm. Back propagation computes the parameter updates that reduce the network's error on the training data. The first thing we need is the gradient of the sigmoid function we created earlier.
def sigmoid_gradient(z):
    return np.multiply(sigmoid(z), (1 - sigmoid(z)))
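A quick numerical check, not part of the original exercise: the derivative $\sigma(z)(1-\sigma(z))$ peaks at $z=0$ with value 0.25 and is symmetric around it.

# Sanity check of sigmoid_gradient (illustrative values only)
print(sigmoid_gradient(0))                      # 0.25, the maximum of the derivative
print(sigmoid_gradient(np.array([-1, 0, 1])))   # roughly [0.1966 0.25 0.1966], symmetric around 0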
Back propagation function
Now we are ready to implement back propagation to calculate the gradient.
Since back propagation requires the same quantities that are computed in the cost function, we will in fact extend the cost function so that it performs back propagation and returns both the cost and the gradient.
Be careful! It is easy to confuse A * B with np.multiply(A, B). For np.matrix objects the former is matrix multiplication and the latter is element-wise multiplication (unless A or B is a scalar, in which case it doesn't matter).
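A tiny illustration of the difference (the matrices here are arbitrary examples):

# * vs np.multiply on np.matrix objects
A = np.matrix([[1, 2], [3, 4]])
B = np.matrix([[5, 6], [7, 8]])
print(A * B)              # matrix product -> [[19 22], [43 50]]
print(np.multiply(A, B))  # element-wise product -> [[ 5 12], [21 32]]
# Note: for plain np.ndarray objects, * is element-wise, which is another source of confusion.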
def backprop(params, input_size, hidden_size, num_labels, X, y, lamda):
    # INPUT: flattened network parameters, input layer size, hidden layer size, number of labels,
    #        training data and labels, regularization parameter
    # OUTPUT: cost function and gradient for the current parameters
    # STEP1: get the number of samples
    m = X.shape[0]
    # STEP2: convert X and y to numpy matrices
    X = np.matrix(X)
    y = np.matrix(y)
    # STEP3: recover the network parameters from params and reshape them
    #        according to the input and hidden layer sizes
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
    # STEP4: call the forward propagation function written earlier
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
    # STEP5: initialization
    J = 0
    delta1 = np.zeros(theta1.shape)  # (25, 401) accumulates the gradient with respect to theta1
    delta2 = np.zeros(theta2.shape)  # (10, 26)
    # STEP6: compute the cost function
    for i in range(m):
        first_term = np.multiply(-y[i, :], np.log(h[i, :]))
        second_term = np.multiply((1 - y[i, :]), np.log(1 - h[i, :]))
        J += np.sum(first_term - second_term)
    J = J / m
    # STEP7: back propagation
    for t in range(m):  # loop over every sample
        a1t = a1[t, :]  # (1, 401)
        z2t = z2[t, :]  # (1, 25)
        a2t = a2[t, :]  # (1, 26)
        ht = h[t, :]    # (1, 10)
        yt = y[t, :]    # (1, 10)
        d3t = ht - yt
        z2t = np.insert(z2t, 0, values=np.ones(1))  # (1, 26) insert the bias term
        d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t))  # (1, 26)
        # accumulate the gradients over all samples
        delta1 = delta1 + (d2t[:, 1:]).T * a1t  # d2t[:, 1:] drops the bias component; this is the gradient w.r.t. theta1
        delta2 = delta2 + d3t.T * a2t           # d3t.T * a2t is the gradient w.r.t. theta2
    # average the accumulated gradients so they match the averaged cost
    delta1 = delta1 / m
    delta2 = delta2 / m
    # STEP8: add the regularized gradient (the bias column of theta is not regularized)
    delta1[:, 1:] = delta1[:, 1:] + (theta1[:, 1:] * lamda) / m
    delta2[:, 1:] = delta2[:, 1:] + (theta2[:, 1:] * lamda) / m
    # STEP9: unravel the gradient matrices into a single array
    grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))  # np.concatenate joins the arrays
    return J, grad
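Before trusting the gradient, it is worth comparing it against a numerical estimate (gradient checking). The sketch below is not part of the original notebook: check_gradient is a hypothetical helper that perturbs a few randomly chosen parameters by eps and compares the finite-difference estimate of cost with the corresponding entries of the backprop gradient. Because the full check is slow, it is meant to be run on a small slice of the data after params, X and y_onehot have been defined in the main program below.

# Gradient checking sketch (hypothetical helper, not part of the original code)
def check_gradient(params, input_size, hidden_size, num_labels, X, y, lamda, eps=1e-4, n_checks=5):
    _, grad = backprop(params, input_size, hidden_size, num_labels, X, y, lamda)
    for i in np.random.choice(len(params), n_checks, replace=False):
        plus, minus = params.copy(), params.copy()
        plus[i] += eps
        minus[i] -= eps
        numeric = (cost(plus, input_size, hidden_size, num_labels, X, y, lamda)
                   - cost(minus, input_size, hidden_size, num_labels, X, y, lamda)) / (2 * eps)
        print(i, numeric, grad[i])  # the two values should agree to several decimal places

# Example call on a small subset (run after the main program has defined these variables):
# check_gradient(params, input_size, hidden_size, num_labels, X[:50], y_onehot[:50], lamda)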
Main program code
We use the same dataset as in exercise 3, so we load it in the same way.
def ex4_1():
    data = loadmat('ex4data1.mat')
    X = data['X']
    y = data['y']
    print(X.shape, y.shape)       # look at the dimensions
    print(np.unique(data['y']))   # look at the labels
The output is as follows:
(5000, 400) (5000, 1)
[ 1 2 3 4 5 6 7 8 9 10]
You can see that y contains 10 different labels, so we need to one-hot encode our y labels.
One-hot encoding converts a k-class label into a vector of length k in which index n is "hot" (1) and the rest are 0. For example, label 2 becomes [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]. So y_onehot becomes a 5000 × 10 matrix: each training label is 10-dimensional, matching the ten classification outputs of the neural network.
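To make the idea concrete, here is a minimal sketch of what the encoding amounts to, assuming labels that run from 1 to 10 as in this dataset (illustration only, not the code used below):

# Manual one-hot encoding sketch
labels = np.array([2, 10, 1])                   # a few example labels
onehot = np.zeros((labels.size, 10))
onehot[np.arange(labels.size), labels - 1] = 1  # label n sets position n-1 to 1
print(onehot)
# [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
#  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]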
Rather than building this by hand, scikit-learn has a built-in utility we can use.
# Code connected
encoder = OneHotEncoder(sparse=False)  # the default sparse parameter is True, in which case a sparse matrix object is returned after encoding
y_onehot = encoder.fit_transform(y)
print(y_onehot.shape)
print(y[0], y_onehot[0, :])
Here sparse=False makes the encoder return a dense array. If it were left at the default True, a sparse matrix object would be returned instead, storing only the position of the 1 (for label 2, the second position). The output here is as follows:
(5000, 10)
[10] [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
The next step is to initialize the data:
# Code connected
# Initialization settings
input_size = 400   # input data dimension
hidden_size = 25   # hidden layer size
num_labels = 10    # number of classes
lamda = 1          # regularization parameter

# Randomly initialize the parameter array for the whole network (dimension 25*401 + 10*26);
# subtracting 0.5 and multiplying by 0.25 keeps the values between -0.125 and +0.125
params = (np.random.random(size=hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)) - 0.5) * 0.25

m = X.shape[0]
X = np.matrix(X)
y = np.matrix(y)

# Unpack the parameter array into the parameter matrices of each layer
theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

print(theta1.shape, theta2.shape)
The output is as follows:
(25, 401) (10, 26)
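As an aside, a common heuristic (not used in the notebook above, which simply scales to ±0.125) is to choose the initialization range from the layer sizes, e.g. $\epsilon = \sqrt{6}/\sqrt{L_{in}+L_{out}}$. A minimal sketch, assuming that heuristic and introducing a helper rand_init that is not part of the original code:

# Alternative initialization sketch (rand_init is a hypothetical helper; epsilon per layer is a common heuristic)
def rand_init(l_in, l_out):
    eps = np.sqrt(6) / np.sqrt(l_in + l_out)
    return (np.random.random((l_out, l_in + 1)) * 2 - 1) * eps  # values in [-eps, +eps]

theta1_alt = rand_init(input_size, hidden_size)   # (25, 401)
theta2_alt = rand_init(hidden_size, num_labels)   # (10, 26)
print(theta1_alt.shape, theta2_alt.shape)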
Then we call the forward propagation function and compute the cost. Because the initial values are random, the cost will differ from run to run.
# Code connected
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
print(a1.shape, z2.shape, a2.shape, z3.shape, h.shape)
print(cost(params, input_size, hidden_size, num_labels, X, y_onehot, lamda))
The output is as follows:
(5000, 401) (5000, 25) (5000, 26) (5000, 10) (5000, 10)
6.92922699465799
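Incidentally, this initial value is easy to sanity-check: with small random weights every output of the network is close to 0.5, so each of the 10 output units contributes roughly $-\log 0.5 \approx 0.693$ to the cost of a sample, giving a total of about $10 \times 0.693 \approx 6.9$, which matches the value above.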
Next we can run back propagation. Again, because of the random initial values, the cost differs from run to run.
# Code connected
# Back propagation
J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, lamda)
print(J, grad.shape)
The output is as follows:
7.0856460879022025 (10285,)
Next, call minimize to optimize the parameters:
# Code connected
# Minimize the objective function
fmin = minimize(fun=backprop, x0=params,
                args=(input_size, hidden_size, num_labels, X, y_onehot, lamda),
                method='TNC', jac=True, options={'maxiter': 250})
# options controls the maximum number of iterations; with jac=True, fun must return two values: the cost function and the gradient
print(fmin)
The output is as follows. You can see that the found parameters are in fmin.x:
     fun: 0.23857388315436762
     jac: array([-4.51400702e+01,  2.42009028e-05, -1.50489283e-05, ...,
       -2.73699447e+00,  7.14064502e-01, -9.48905699e-01])
 message: 'Max. number of function evaluations reached'
    nfev: 251
     nit: 15
  status: 3
 success: False
       x: array([ 1.12513   ,  0.12100451, -0.07524464, ...,
       -0.90956441, -3.88185143, -0.57911355])
Since the objective function is unlikely to converge completely, we limit the number of iterations. The total cost has dropped below 0.5, which is a good sign that the algorithm is working correctly.
We then take the parameters it found, run them forward through the network to obtain predictions, and compute the accuracy.
# Code connected
# Use the found parameters to predict and compute the accuracy
X = np.matrix(X)
theta1 = np.matrix(np.reshape(fmin.x[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(fmin.x[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
y_pred = np.array(np.argmax(h, axis=1) + 1)
print(y_pred)
correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y)]
accuracy = (sum(map(int, correct)) / float(len(correct)))
print('accuracy = {0}%'.format(accuracy * 100))
The output is as follows:
[[10]
 [10]
 [10]
 ...
 [ 9]
 [ 9]
 [ 9]]
accuracy = 93.10000000000001%
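As an optional follow-up (not part of the original notebook), a short sketch like the one below reuses y and y_pred from the block above to break the overall accuracy down by class:

# Per-class accuracy sketch (assumes y and y_pred from the previous block are still in scope)
y_true = np.array(y).ravel()   # flatten the label matrix to a 1-D array
y_hat = y_pred.ravel()
print('overall accuracy:', (y_hat == y_true).mean())
for label in range(1, 11):     # labels run from 1 to 10 in this dataset
    mask = (y_true == label)
    print('label', label, 'accuracy:', (y_hat[mask] == label).mean())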