# After class test Ex4 of Wu Enda's machine learning: neural network back propagation (detailed Python code annotation)

Based on Wu enda Machine learning course
reference resources Huang Haiguang's notes

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from scipy.optimize import minimize


For this exercise, we will process the handwritten digital data set again, this time using a back-propagation feedforward neural network. We will implement the non regularized and regularized versions of neural network cost function and gradient calculation through back propagation algorithm. We will also implement the method of random weight initialization and prediction using network.

## sigmoid function

h θ ( x ) = 1 1 + e − θ T X h_\theta \left( x \right)=\frac{1}{1+{{e}^{-\theta^{T}X}}} hθ​(x)=1+e−θTX1​

def sigmoid(z):
return 1 / (1 + np.exp(-z))


## Forward propagation function

The neural network input layer we constructed is 400 units (plus bias 401), 25 units of hidden layer (plus bias 26), and 10 units of output layer (corresponding to one of our one hot coding class labels).

def forward_propagate(X, theta1, theta2):
# INPUT: parameter value theta, data X output: previous item propagation result under current parameter value TODO: calculate previous item propagation result according to parameters and INPUT data
# STEP1: get the number of samples
m = X.shape[0]
# STEP2: realize forward propagation of neural network
a1 =np.insert(X, 0, values=np.ones(m), axis=1) # Insert a column of 1 elements into the X matrix, and the axis=1 table column
z2 =a1 * theta1.T
a2 =np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)  # Insert a column of 1 elements
z3 = a2 * theta2.T
h = sigmoid(z3)
return a1, z2, a2, z3, h


## Cost function

J ( Θ ) = − 1 m [ ∑ i = 1 m ∑ k = 1 K y k ( i ) log ⁡ ( h Θ ( x ( i ) ) ) k + ( 1 − y k ( i ) ) log ⁡ ( 1 − ( h Θ ( x ( i ) ) ) k ) ] + λ 2 m ∑ l = 1 L − 1 ∑ i = 1 s l ∑ j = 1 s l + 1 ( Θ j i ( l ) ) 2 J(\Theta) = -\frac{1}{m} \left[ \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{K} {y_k}^{(i)} \log {(h_\Theta(x^{(i)}))}_k + \left( 1 - y_k^{(i)} \right) \log \left( 1- {\left( h_\Theta \left( x^{(i)} \right) \right)_k} \right) \right] + \frac{\lambda}{2m} \sum\limits_{l=1}^{L-1} \sum\limits_{i=1}^{s_l} \sum\limits_{j=1}^{s_{l+1}} \left( \Theta_{ji}^{(l)} \right)^2 J(Θ)=−m1​[i=1∑m​k=1∑K​yk​(i)log(hΘ​(x(i)))k​+(1−yk(i)​)log(1−(hΘ​(x(i)))k​)]+2mλ​l=1∑L−1​i=1∑sl​​j=1∑sl+1​​(Θji(l)​)2

There are only two layers here, so the offset can also be written as:

λ 2 m [ ∑ i = 1 s l ∑ j = 1 s l + 1 ( Θ j i ( 1 ) ) 2 + ∑ i = 1 s l ∑ j = 1 s l + 1 ( Θ j i ( 2 ) ) 2 ] \frac{\lambda}{2m} \left[\sum\limits_{i=1}^{s_l} \sum\limits_{j=1}^{s_{l+1}} \left( \Theta_{ji}^{(1)} \right)^2+\sum\limits_{i=1}^{s_l} \sum\limits_{j=1}^{s_{l+1}} \left( \Theta_{ji}^{(2)} \right)^2\right] 2mλ​[i=1∑sl​​j=1∑sl+1​​(Θji(1)​)2+i=1∑sl​​j=1∑sl+1​​(Θji(2)​)2]

def cost(params, input_size, hidden_size, num_labels, X, y, lamda):
# INPUT: neural network parameters, INPUT layer dimension, hidden layer dimension, training data and labels, regularization parameters
# OUTPUT: cost function under current parameter value TODO: calculate cost function according to formula
# STEP1: obtain the number of samples. Here is 5000
m = X.shape[0]
# STEP2: convert matrix X and Y into numpy matrix
X = np.matrix(X)
y = np.matrix(y)
# STEP3: obtain the neural network parameters from params, redefine the dimension of reshape parameters according to the input layer dimension and hidden layer dimension, and split them into two theta
theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
# STEP4: call the preceding propagation function written earlier
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
# STEP5: initialize cost function
J = 0
# STEP6: calculate the cost function according to the formula
for i in range(m):  # Traverse each sample
first_term = np.multiply(-y[i, :], np.log(h[i, :]))
second_term = np.multiply((1 - y[i, :]), np.log(1 - h[i, :]))
J += np.sum(first_term - second_term)
J = J / m
# STEP7: calculate the regularization part of the cost function, excluding theta0
J += (float(lamda) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))
return J


## Computes the reciprocal of the Sigmoid function

Next is the back propagation algorithm. The back propagation parameter updating calculation will reduce the network error on the training data. The first thing we need is to calculate the gradient function of the Sigmoid function we created earlier.

def sigmoid_gradient(z):
return np.multiply(sigmoid(z), (1 - sigmoid(z)))


## Back propagation function

Now we are ready to implement back propagation to calculate the gradient.

Since the calculation required for back propagation is the calculation process required in the cost function, we will actually extend the cost function to perform back propagation and return the cost and gradient.

be careful! It is easy to confuse the use of A * B and np.multiply(A, B). The former is matrix multiplication and the latter is element multiplication (unless A or B is A scalar value, in which case it doesn't matter).

def backprop(params, input_size, hidden_size, num_labels, X, y, lamda):
# INPUT: neural network parameters, INPUT layer dimension, hidden layer dimension, training data and labels, regularization parameters
# OUTPUT: cost function and gradient under the current parameter value
# STEP1: get the number of samples
m = X.shape[0]
# STEP2: convert matrix X and Y into numpy matrix
X = np.matrix(X)
y = np.matrix(y)
# STEP3: obtain the neural network parameters from params, and redefine the dimension of the parameters according to the input layer dimension and hidden layer dimension
theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
# STEP4: call the preceding propagation function written earlier
a1, z2, a2, z3, h =forward_propagate(X, theta1, theta2)
# STEP5: initialization
J = 0
delta1 = np.zeros(theta1.shape)  # (25, 401) delta is used to store the gradient of theta
delta2 = np.zeros(theta2.shape)  # (10, 26)
# STEP6: calculate cost function (call function)
for i in range(m):
first_term = np.multiply(-y[i,:], np.log(h[i,:]))
second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))
J += np.sum(first_term - second_term)
J = J / m
# STEP7: back propagation
for t in range(m):  #Traverse each sample
a1t = a1[t,:]  # (1, 401)
z2t = z2[t,:]  # (1, 25)
a2t = a2[t,:]  # (1, 26)
ht = h[t,:]  # (1, 10)
yt = y[t,:]  # (1, 10)
d3t = ht - yt
z2t = np.insert(z2t, 0, values=np.ones(1)) # (1, 26) insertion offset
d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t))  # (1, 26)
delta1 = delta1 + (d2t[:,1:]).T * a1t  # d2t[:,1:] remove the first dimensional bias of d2t, and here we get the gradient of the cost function to theta1
delta2 = delta2 + d3t.T * a2t  # d3t.T * a2t is the gradient of the cost function to theta2
delta1[:,1:] = delta1[:,1:] + (theta1[:,1:] * lamda) / m  # Regularized theta removes the first term bias
delta2[:,1:] =  delta2[:,1:] + (theta2[:,1:] * lamda) / m
# STEP9: convert gradient matrix to a single array
grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))  # np.concatenate concatenate array


## Main program code

The dataset we used in exercise 3 is the same, so we load the data in the same way.

def ex4_1():
X = data['X']
y = data['y']
X.shape, y.shape
print(X.shape, y.shape)  # Look at the dimensions
print(np.unique(data['y']))  # Look at the labels



The output is as follows:

(5000, 400) (5000, 1)
[ 1  2  3  4  5  6  7  8  9 10]


You can see that there are 10 types of tags in Y, and we need to encode our y tags once.

One hot coding converts k-class labels into vectors of length k, where index n is "hot" (1) and the rest are 0. For example, note 2 is [0,1,0,0,0,0,0,0,0,0,0]. So y_onehot becomes a 5000 * 10 matrix, and a training data label is 10 dimensions, corresponding to ten classification outputs of the neural network.

Scikitlearn has a built-in utility that we can use.

# Code connected
encoder = OneHotEncoder(sparse=False)   # The default spark parameter is True. After encoding, a sparse matrix object is returned
y_onehot = encoder.fit_transform(y)
print(y_onehot.shape)
print(y[0], y_onehot[0, :])



Here, sparse=False returns the array object. If it is True by default, for example, label 2 will return a sparse matrix object (2,1), indicating that the second bit is 1. The output here is as follows:

(5000, 10)
[10] [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]


The next step is to initialize the data:

# Code connected
# Initialization settings
input_size = 400    # Input training data dimension
hidden_size = 25    # Hidden layer dimension
num_labels = 10     # Classification number
lamda = 1   # Regularization parameters
# Randomly initialize the parameter array of the whole network, dimension 25 * 401 + 10 * 26, - 0.5, and then multiply by 0.25 to make it between plus and minus 0.125
params = (np.random.random(size=hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)) - 0.5) * 0.25
m = X.shape[0]
X = np.matrix(X)
y = np.matrix(y)
# Unpack the parameter array into the parameter matrix of each layer
theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
print(theta1.shape, theta2.shape)


The output is as follows:

(25, 401) (10, 26)


Then the forward propagation function is called to calculate the cost. Due to different random initial values, the cost of each calculation is also different.

# Code connected
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
print(a1.shape, z2.shape, a2.shape, z3.shape, h.shape)
print(cost(params, input_size, hidden_size, num_labels, X, y_onehot, lamda))


The output is as follows:

(5000, 401) (5000, 25) (5000, 26) (5000, 10) (5000, 10)
6.92922699465799


Next, back propagation can be carried out. Due to different random initial values, the cost of each calculation is also different.

# Code connected
# Back propagation
J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, lamda)


The output is as follows:

7.0856460879022025 (10285,)


Next, call the minimize parameter:

# Code connected
# Minimize parameters
fmin = minimize(fun=backprop, x0=params, args=(input_size, hidden_size, num_labels, X, y_onehot, lamda),
method='TNC', jac=True, options={'maxiter': 250})
# options: used to control the maximum number of iterations. If jac=True, there must be two return values in fun: cost function and gradient
print(fmin)


The output is as follows. You can see that the found parameters are in fmin.x:

     fun: 0.23857388315436762
jac: array([-4.51400702e+01,  2.42009028e-05, -1.50489283e-05, ...,
-2.73699447e+00,  7.14064502e-01, -9.48905699e-01])
message: 'Max. number of function evaluations reached'
nfev: 251
nit: 15
status: 3
success: False
x: array([ 1.12513   ,  0.12100451, -0.07524464, ..., -0.90956441,
-3.88185143, -0.57911355])


Since the objective function is unlikely to converge completely, we limit the number of iterations. Our total cost has dropped below 0.5, which is a good indicator of the normal operation of the algorithm.

The parameters found by it are used and propagated forward through the network to obtain the prediction, and then the accuracy is calculated.

# Code connected
# Use the found parameters to predict and calculate the accuracy
X = np.matrix(X)
theta1 = np.matrix(np.reshape(fmin.x[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(fmin.x[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
y_pred = np.array(np.argmax(h, axis=1) + 1)
print(y_pred)
correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y)]
accuracy = (sum(map(int, correct)) / float(len(correct)))
print('accuracy = {0}%'.format(accuracy * 100))


The output is as follows:

[[10]
[10]
[10]
...
[ 9]
[ 9]
[ 9]]
accuracy = 93.10000000000001%


Posted on Mon, 11 Oct 2021 17:18:08 -0400 by dustinkrysak