Ex4 of Andrew Ng's Machine Learning course: neural network back propagation (with detailed Python code comments)

Based on Andrew Ng's Machine Learning course
Reference: Huang Haiguang's notes

import numpy as np
from scipy.io import loadmat
from sklearn.preprocessing import OneHotEncoder
from scipy.optimize import minimize

For this exercise, we will work with the handwritten digit data set again, this time training a feedforward neural network with back propagation. We will implement the unregularized and regularized versions of the neural network cost function and compute the gradients with the back propagation algorithm. We will also implement random weight initialization and use the trained network to make predictions.

Sigmoid function

$$h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}$$

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Forward propagation function

The neural network we construct has an input layer of 400 units (401 including the bias unit), a hidden layer of 25 units (26 including the bias unit), and an output layer of 10 units (one for each of our one-hot encoded class labels).
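
For reference (the original does not write these out), forward propagation for this two-layer network computes, in the course's notation, where $a^{(1)}$ is the input $x$ with a bias unit prepended and a bias unit is likewise prepended to $a^{(2)}$ before the last multiplication:

$$z^{(2)} = \Theta^{(1)} a^{(1)},\quad a^{(2)} = g(z^{(2)}),\quad z^{(3)} = \Theta^{(2)} a^{(2)},\quad h_\Theta(x) = a^{(3)} = g(z^{(3)})$$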

def forward_propagate(X, theta1, theta2):
    # INPUT: parameter matrices theta1, theta2 and data X
    # OUTPUT: the forward propagation results under the current parameters
    # STEP1: get the number of samples
    m = X.shape[0]
    # STEP2: carry out forward propagation through the network
    a1 = np.insert(X, 0, values=np.ones(m), axis=1)  # Insert a column of ones (bias) into X; axis=1 means insert as a column
    z2 = a1 * theta1.T
    a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)  # Insert a column of ones (bias)
    z3 = a2 * theta2.T
    h = sigmoid(z3)
    return a1, z2, a2, z3, h

Cost function

$$J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log \big(h_\Theta(x^{(i)})\big)_k + \big( 1 - y_k^{(i)} \big) \log \big( 1 - \big(h_\Theta(x^{(i)})\big)_k \big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big( \Theta_{ji}^{(l)} \big)^2$$

There are only two weight matrices here (input-to-hidden and hidden-to-output), so the regularization term can also be written out explicitly as:

$$\frac{\lambda}{2m} \left[ \sum_{i=1}^{s_1} \sum_{j=1}^{s_2} \big( \Theta_{ji}^{(1)} \big)^2 + \sum_{i=1}^{s_2} \sum_{j=1}^{s_3} \big( \Theta_{ji}^{(2)} \big)^2 \right]$$

def cost(params, input_size, hidden_size, num_labels, X, y, lamda):
    # INPUT: flattened neural network parameters, input layer size, hidden layer size,
    #        number of labels, training data and labels, regularization parameter
    # OUTPUT: value of the cost function under the current parameters
    # STEP1: get the number of samples (here m = 5000)
    m = X.shape[0]
    # STEP2: convert X and y into numpy matrices
    X = np.matrix(X)
    y = np.matrix(y)
    # STEP3: recover the parameter matrices from the flat params array, reshaping them
    #        according to the input and hidden layer sizes and splitting into two thetas
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
    # STEP4: call the forward propagation function written earlier
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
    # STEP5: initialize the cost
    J = 0
    # STEP6: compute the cost according to the formula
    for i in range(m):  # Traverse each sample
        first_term = np.multiply(-y[i, :], np.log(h[i, :]))
        second_term = np.multiply((1 - y[i, :]), np.log(1 - h[i, :]))
        J += np.sum(first_term - second_term)
    J = J / m
    # STEP7: add the regularization term, excluding the bias columns (theta0)
    J += (float(lamda) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))
    return J

Gradient of the sigmoid function

Next comes the back propagation algorithm. Back propagation updates the parameters so as to reduce the network's error on the training data. The first thing we need is a function that computes the gradient (derivative) of the sigmoid function we created earlier.
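
For reference (standard calculus, not spelled out in the original), the derivative is:

$$g'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = g(z)\big(1 - g(z)\big)$$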

def sigmoid_gradient(z):
    return np.multiply(sigmoid(z), (1 - sigmoid(z)))

Back propagation function

Now we are ready to implement back propagation to calculate the gradient.

Since back propagation requires the same computations as the cost function, we will actually extend the cost function so that it also performs back propagation and returns both the cost and the gradient.

Be careful! It is easy to confuse A * B with np.multiply(A, B). With numpy matrices, the former is matrix multiplication and the latter is element-wise multiplication (unless A or B is a scalar, in which case it doesn't matter).
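
A minimal illustration of the difference (this snippet is not part of the exercise code):

import numpy as np

A = np.matrix([[1, 2], [3, 4]])
B = np.matrix([[10, 20], [30, 40]])

print(A * B)              # matrix product:       [[ 70 100] [150 220]]
print(np.multiply(A, B))  # element-wise product: [[ 10  40] [ 90 160]]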

def backprop(params, input_size, hidden_size, num_labels, X, y, lamda):
    # INPUT: flattened neural network parameters, input layer size, hidden layer size,
    #        number of labels, training data and labels, regularization parameter
    # OUTPUT: cost function value and gradient under the current parameters
    # STEP1: get the number of samples
    m = X.shape[0]
    # STEP2: convert X and y into numpy matrices
    X = np.matrix(X)
    y = np.matrix(y)
    # STEP3: recover the parameter matrices from the flat params array, reshaping them
    #        according to the input and hidden layer sizes
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
    # STEP4: call the forward propagation function written earlier
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
    # STEP5: initialization
    J = 0
    delta1 = np.zeros(theta1.shape)  # (25, 401), accumulates the gradient w.r.t. theta1
    delta2 = np.zeros(theta2.shape)  # (10, 26), accumulates the gradient w.r.t. theta2
    # STEP6: compute the cost (same formula as in the cost function above)
    for i in range(m):
        first_term = np.multiply(-y[i, :], np.log(h[i, :]))
        second_term = np.multiply((1 - y[i, :]), np.log(1 - h[i, :]))
        J += np.sum(first_term - second_term)
    J = J / m
    J += (float(lamda) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))  # cost regularization, so the cost matches the regularized gradient below
    # STEP7: back propagation
    for t in range(m):  # Traverse each sample
        a1t = a1[t, :]  # (1, 401)
        z2t = z2[t, :]  # (1, 25)
        a2t = a2[t, :]  # (1, 26)
        ht = h[t, :]  # (1, 10)
        yt = y[t, :]  # (1, 10)
        d3t = ht - yt  # output layer error
        z2t = np.insert(z2t, 0, values=np.ones(1))  # (1, 26), insert the bias term
        d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t))  # (1, 26), hidden layer error
        # Accumulate the gradients over all samples
        delta1 = delta1 + (d2t[:, 1:]).T * a1t  # d2t[:, 1:] drops the bias error; this accumulates the gradient w.r.t. theta1
        delta2 = delta2 + d3t.T * a2t  # gradient of the cost w.r.t. theta2
    delta1 = delta1 / m  # average the accumulated gradients over the samples
    delta2 = delta2 / m
    # STEP8: add the regularized part of the gradient
    delta1[:, 1:] = delta1[:, 1:] + (theta1[:, 1:] * lamda) / m  # regularization skips the bias column
    delta2[:, 1:] = delta2[:, 1:] + (theta2[:, 1:] * lamda) / m
    # STEP9: flatten the gradient matrices into a single array
    grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))  # np.concatenate joins the arrays
    return J, grad

Main program code

The dataset is the same one we used in Exercise 3, so we load it in the same way.

def ex4_1():
    data = loadmat('ex4data1.mat')
    X = data['X']
    y = data['y']
    print(X.shape, y.shape)  # Check the dimensions
    print(np.unique(data['y']))  # Check the labels

The output is as follows:

(5000, 400) (5000, 1)
[ 1  2  3  4  5  6  7  8  9 10]

You can see that there are 10 classes of labels in y, so we need to one-hot encode our y labels.

One-hot encoding converts a label from k classes into a vector of length k, in which the position of the class is "hot" (1) and the rest are 0. For example, label 2 becomes [0,1,0,0,0,0,0,0,0,0]. After encoding, y_onehot is a 5000 x 10 matrix; each training label is 10-dimensional, corresponding to the ten classification outputs of the neural network.
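
As a minimal sketch (not part of the original code), the same encoding could be built by hand with numpy, assuming the labels run from 1 to 10:

import numpy as np

def one_hot(y, num_labels=10):
    # y: (m, 1) array of integer labels in 1..num_labels
    m = y.shape[0]
    out = np.zeros((m, num_labels))
    out[np.arange(m), y.ravel() - 1] = 1  # label k becomes a 1 at index k-1
    return out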

Scikit-learn has a built-in OneHotEncoder that we can use.

# Code continued from above
    encoder = OneHotEncoder(sparse=False)   # The default is sparse=True, which returns a sparse matrix object (newer scikit-learn versions call this parameter sparse_output)
    y_onehot = encoder.fit_transform(y)
    print(y_onehot.shape)
    print(y[0], y_onehot[0, :])

Here, sparse=False makes the encoder return a dense array. If it were left at the default of True, the result would be a sparse matrix object that only stores the positions of the ones (for label 2, for instance, that the second entry is 1). Note that in this dataset the digit 0 is stored as label 10, which is why it maps to the last position of the one-hot vector. The output is as follows:

(5000, 10)
[10] [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]

The next step is to set up and initialize the network:

# Code continued from above
    # Initialization settings
    input_size = 400    # Input training data dimension
    hidden_size = 25    # Hidden layer dimension
    num_labels = 10     # Classification number
    lamda = 1   # Regularization parameters
    # Randomly initialize the flat parameter array for the whole network (length 25*401 + 10*26); subtracting 0.5 and multiplying by 0.25 keeps the values between -0.125 and +0.125
    params = (np.random.random(size=hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)) - 0.5) * 0.25
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)
    # Unpack the parameter array into the parameter matrix of each layer
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
    print(theta1.shape, theta2.shape)

The output is as follows:

(25, 401) (10, 26)

Then we run forward propagation and compute the cost. Because the random initial values differ from run to run, the cost you compute will also differ.

# Code continued from above
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
    print(a1.shape, z2.shape, a2.shape, z3.shape, h.shape)
    print(cost(params, input_size, hidden_size, num_labels, X, y_onehot, lamda))

The output is as follows:

(5000, 401) (5000, 25) (5000, 26) (5000, 10) (5000, 10)
6.92922699465799

Next we can run back propagation. Again, because of the random initialization, the cost you compute will differ from run to run.

# Code continued from above
    # Back propagation
    J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, lamda)
    print(J, grad.shape)

The output is as follows:

7.0856460879022025 (10285,)

Next, we call minimize to optimize the parameters:

# Code continued from above
    # Minimize the objective function
    fmin = minimize(fun=backprop, x0=params, args=(input_size, hidden_size, num_labels, X, y_onehot, lamda),
                    method='TNC', jac=True, options={'maxiter': 250})
    # options controls the maximum number of iterations; with jac=True, fun must return two values: the cost and the gradient
    print(fmin)

The output is as follows; you can see that the optimized parameters are stored in fmin.x:

     fun: 0.23857388315436762
     jac: array([-4.51400702e+01,  2.42009028e-05, -1.50489283e-05, ...,
       -2.73699447e+00,  7.14064502e-01, -9.48905699e-01])
 message: 'Max. number of function evaluations reached'
    nfev: 251
     nit: 15
  status: 3
 success: False
       x: array([ 1.12513   ,  0.12100451, -0.07524464, ..., -0.90956441,
       -3.88185143, -0.57911355])

Since the objective function is unlikely to converge completely, we cap the number of iterations. The total cost has dropped to about 0.24, a good sign that the algorithm is working correctly.

Finally, we take the parameters it found, propagate forward through the network to obtain predictions, and compute the accuracy.

# Code continued from above
    # Use the found parameters to predict and calculate the accuracy
    X = np.matrix(X)
    theta1 = np.matrix(np.reshape(fmin.x[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(fmin.x[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
    y_pred = np.array(np.argmax(h, axis=1) + 1)
    print(y_pred)
    correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y)]
    accuracy = (sum(map(int, correct)) / float(len(correct)))
    print('accuracy = {0}%'.format(accuracy * 100))

The output is as follows:

[[10]
 [10]
 [10]
 ...
 [ 9]
 [ 9]
 [ 9]]
accuracy = 93.10000000000001%
