# Wu enda_ MIT_MachineLearning open class ch04

## Review of the previous chapter

In ch03, we do an interesting task, which is to classify handwritten digits.
Our numbers contain 10 classes from 0 to 9, so our first idea is to train 10 binary linear classifiers.
We use a large matrix to store Theta fitted by each linear classifier. A row of the large matrix is the coefficient specially fitted for a certain number. When our original data is multiplied by the coefficient and passes through the activation function sigmoid, we will get the probability of belonging to this number category.
At the same time, we have a preliminary understanding of the neural network and know that it can learn some nonlinear characteristics. The hidden layers use the data we no longer feed directly, and the final accuracy is very considerable.
We simply implement the forward propagation process, which involves adding a column of offset terms and matrix multiplication.

## Loss function of neural network

Because we are also using neural network for a classification task, the formula of its loss function is actually similar to that of logistic regression. As shown in the figure below: It can be seen that the cross entropy loss function and the regular term are exactly the same.
Among them, we have an additional K summation at the cross entropy loss function, which involves the concept of one hot unique heat coding.
Because in the handwritten numeral classification task, our neural network will finally output 10 values, and each value represents the probability of belonging to a certain numeral. But our label has only one value, that is, the number itself (except 0, whose label we set to 10). At this time, we need to process the label, such as label 9, into a vector:
[0,0,0,0,0,0,0,0,1] is to set the corresponding subscript to 1.
It can also be understood from the perspective of probability, that is, the probability of 9 is 1, and the probability of other values is 0!

So this K summation is to do cross entropy for each predicted probability.
You can think so. For example, some strange people write that 6 and 0 are very similar.
At this time, for the tag, the probability of 6 is 1 and the probability of 0 is 0.
However, the neural network does not think so. It may think that the probability of 0 is as high as 40%, but it thinks that the probability of 6 is higher. In this way, it can still correctly predict 6 at argmax, but there is undoubtedly an error in the label of 0. Therefore, our error needs to sum K labels!

Next is the regular term, which is not difficult to understand. It also punishes all theta! It's just that theta has more subscripts... That's why there is the triple summation operation currently seen. Take the neural network of our operation as an example: Because we have 400 neurons in the input layer, 25 neurons in the hidden layer and 10 neurons in the output layer.
Moreover, our network is fully connected, that is, the neurons between the two layers are connected to each other, and the line connected between the two neurons is a weight theta, so the meaning is very simple, that is, all theta do a regular penalty processing.

In terms of the universality of the first formula, it is briefly analyzed:
L control the number of layers. If there are L-1 mapping relationships, theta needs to be used.
i controls the number of neurons in the current layer. sl is the total number of all neurons in the current layer.
j is the number of neurons controlling the next layer.
To specifically represent a weight, we need to inform these three parameters, that is, the number of neurons in one layer are connected with the number of neurons in the next layer.

Well, with the loss function, we can reduce the loss through gradient descent or higher-order optimization algorithms.

This is also the most difficult part of the neural network, using the "chain derivation" rule. In the future, I will also encounter the concepts of "calculation diagram" and "back propagation" during "in-depth learning". I may not write the next part very well. I need to take it with the video + two recommended derivation blogs to achieve good results.

## Back propagation algorithm

Anyway, I didn't understand the video at first. I read two other blogs before I slowly deduced what reverse communication was doing. Two links are attached:
Back propagation algorithm derivation 1
Back propagation algorithm derivation 2

Back propagation is actually a process of our derivation. Our ultimate goal is to use the loss function J to derive theta, and then continuously reduce the loss value to improve the accuracy.
But it seems that it is not easy to use this troublesome loss function to directly derive theta!
why？ If you think about it, the theta value mapped from the input layer to the hidden layer does not directly affect the loss function. Only the theta value directly from the hidden layer to the output layer will be directly used in the loss function.

That's why we use chain derivation.
OK, here's a brief explanation of why chain derivation is needed. Next, we need to define an intermediate quantity.

be known as δ (delta), which has a superscript l for the number of layers and a subscript j for the number of neurons.
Its mathematical definition is as follows: Here z is the value before sigmoid activation.
Here, let's briefly analyze the output layer:
(take the input of a graph as an example, that is, the influence of m is not considered first)
The "feel redundant" I wrote below is entirely based on the fact that this is the output layer, and other layers must be considered completely! Then analyze the previous layers: After we finish our intermediate variables, we go back to the subject, because we still do the derivation of theta!
theta's derivation is basically the same, and it's not difficult to find what we calculate δ It can be used here. I don't know how I pushed it.. But the first article knows that the boss's formula will be much cleaner!

## Job code

Because the neural network I use this time is class, so there will be a self in it.
1. Data visualization.

```    def displayData(self):
self.X = Data['X']
self.length = len(self.X)
self.labels = Data['y'].reshape(-1)  # it's not very convenient to label image 0 with 10
self.labels -= 1  # it's the way that we use
self.Y = self.labels.copy()
show_indexes = np.random.randint(0, len(self.X), 100)
for i in range(len(show_indexes)):
plt.subplot(10, 10, i + 1)
plt.axis(False)
plt.imshow(np.transpose(self.X[show_indexes[i]].reshape(20, 20)))
plt.show()
```

Visualization is the same as ch03 and will not be explained in detail here.

2. For the representation of the model, the shape s of Theta1 and Theta2 can be fixed by loading weights. At the same time, a comparison can be made with the loss value given by the teacher after writing the loss function to ensure its correctness.
It's simple. I'm not going to explain:
I gave theta's shape directly here.

```    def loadRandomInitializedParameters(self):
self.theta1 = Thetas['Theta1']
self.theta2 = Thetas['Theta2']
# after debugging, you'll find the shape of two matrices
# size of theta1 is (25, 401) while the other is (10, 26)
# well, on the other hand, the size of our training data X is [5000, 400].
# clearly, before we multiply, add a bias column in X first
```

3. Forward propagation and loss function.
In fact, we also wrote about forward propagation in ch03. Several key steps are to add a column of offset terms, do matrix multiplication, and activate the sigmoid function.

```	def sigmoid(X):
#  sigmoid activation function
return 1 / (1 + np.exp(-X))

def forwardPropagation(self):
bias = np.ones((len(self.X), 1))
# first we should implement the feedforward propagation
# here we use a to represent values after sigmoid while z before..
a1 = np.hstack((bias, self.X))  # size is (5000, 401)
z2 = np.dot(a1, self.theta1.T)  # size is (5000, 25)
a2 = sigmoid(z2)  # size is (5000, 25)
bias = np.ones((len(a2), 1))
a2 = np.hstack((bias, a2))  # size is (5000, 26)
z3 = np.dot(a2, self.theta2.T)  # size is (5000, 10)
a3 = sigmoid(z3)  # size is (5000, 10)
return a1, z2, a2, z3, a3
```

Then there is the treatment of loss function.
The loss function first performs heat coding on the original label. The operation is very simple, as shown below:
labels is first set to a full 0 matrix, which should be 5000 rows and 10 columns. Then, the original label is used as the subscript of a line, and the corresponding position is set to 1.
You can also check the Internet. It seems that there is a library function that can directly carry out OneHotEncoding. I forgot.

```    def onehotEncoding(self):
labels = np.zeros((len(self.X), self.num_labels))  # size is (5000, 10)
for i in range(self.num_labels):
for j in range(len(self.labels)):
if self.labels[j] == i:
labels[j, i] = 1
self.labels = labels
```

Our loss function needs a cycle, taking the number of categories as the number of cycles, and the cross entropy function is calculated for one category each time.
You can see that I wrote a line of code after the cross entropy loss function to verify the initial loss.

Then there is our regular term loss. The size of Theta1 is (25, 401). Because our first column is used to give the weight of the offset term, we select the column from the subscript of 1 when squaring.
np.power is element wise, that is, the square is still a matrix, so we need to use a sum to pull it back to a value, and finally multiply it by the learning rate lr.

``` def nnCostfunction(self, params, lr):
self.reshapeparams(params)
a1, z2, a2, z3, a3 = self.forwardPropagation()
# one-hot encoding!  if label is 9, then transform it into [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# self.onehotEncoding(a3)
# labels = self.labels
# then apply cost function to it
cost = 0
for i in range(self.num_labels):
cost += -np.mean(self.labels[:, i] * np.log(a3[:, i]) + (1 - self.labels[:, i]) * np.log(1 - a3[:, i]))
#  print("cost is %f" % cost)   # the initial cost is 0.287629, it's verified!

# then we need to append the regularized cost
reg_cost = (np.sum(np.power(self.theta1[:, 1:], 2)) + np.sum(np.power(self.theta2[:, 1:], 2))) * lr
cost += (reg_cost / (2 * self.length))
# print("the cost is %f" % cost)   # the initial cost(contains reg_cost) is 0.383770, it's verified!
return cost
```

params explanation:
Well, when we use advanced optimization functions, we will pass in the fitting coefficient theta with a vector. Therefore, our params is actually a connection between Theta1 and Theta2 matrices after flattening to become a vector. Then our optimization function optimizes params according to the derivative, but we still need to transform params into Theta1 and Theta2 matrices to perform matrix operation!

Therefore, a reshapeparams function is specially written here to convert the current params into two theta matrices. Because we have imported weights first at the beginning and know the shapes of Theta1 and Theta2 in advance, we can call them directly here.

So the order must be clear!

```    def reshapeparams(self, params):
self.params = params
theta1 = self.params[:self.theta1.shape * self.theta1.shape]
theta2 = self.params[self.theta1.shape * self.theta1.shape:]
self.theta1 = theta1.reshape(self.theta1.shape)
self.theta2 = theta2.reshape(self.theta2.shape)
```

4. Random generation coefficient
After the loss value of the previous loss function is verified, we need to randomly generate the initial fitting coefficient, as shown below:
In neural network, we can't simply set the ownership value to 0 or 1, so it will lead to unsatisfactory training results. Specifically, we can watch the video directly.

```    def randomInitializeweights(self):
epsilon = 0.12
units_num = self.theta1.shape * self.theta1.shape + self.theta2.shape * self.theta2.shape
self.params = np.random.rand(units_num) * 2 * epsilon - epsilon
theta1 = self.params[:self.theta1.shape * self.theta1.shape]
theta2 = self.params[self.theta1.shape * self.theta1.shape:]
self.theta1 = theta1.reshape(self.theta1.shape)
self.theta2 = theta2.reshape(self.theta2.shape)
# this is what we really need to implement, the weights we read in load xxxx function is just to help
# you know whether your cost function has bugs
return self.params
```

5. Back propagation
The first iters are used to show the training progress later.

Delta 3 is easy to explain. It is our output layer. We deduced that delta of the output layer is the difference between the output of the neural network and the actual label. The subtraction here is also element wise.

Our loop is to process one data at a time, that is, one picture. Finally, each derivative is averaged.

Our delta formula has just been derived, but let's take another look: This formula is a vectorization of what we derived above. The first two are matrix multiplication, and the next two are (26,1) element wise multiplication.
So we can observe that the d2t of the following code is what we said above.

The direct derivative of theta itself is our theta1 and theta2, which are also vectorized.
And it is relatively simple, that is, matrix operation can be done with middleware d and a.

anyway

```    def backPropagation(self, params, lr):
self.iters += 1
print("process ---%f%%" % ((self.iters / 250) * 100))
self.reshapeparams(params)
a1, z2, a2, z3, a3 = self.forwardPropagation()
# as we all know, δ is a index which measures the diff of θ
delta1 = np.zeros(self.theta1.shape)
delta2 = np.zeros(self.theta2.shape)
delta3 = a3 - self.labels  # size is (5000, 10)   each time we fetch a row whose size is (1, 10)
# temp = delta3[0, :].reshape(-1, 1)     # shape is (10, 1)
for i in range(self.length):  # here the loop means every time we just take one image(example) into account
a1t = a1[i, :].reshape(1, -1)  # size is (1, 401)
z2t = z2[i, :].reshape(1, -1)  # size is (1, 25)
a2t = a2[i, :].reshape(1, -1)  # size is (1, 26)
# size of delta3[i, :] is (10,)

d2t = np.dot(self.theta2.T, delta3[i, :].reshape(-1, 1))  # size is (26, 1)
z2t = np.insert(z2t, 0, values=np.ones(1))
d2t = np.multiply(d2t.T, sigmoid_gradient(z2t))  # size is (1, 26)

delta2 += np.dot(delta3[i, :].reshape(-1, 1), a2t)  # size is (10, 26)
delta1 += np.dot(d2t[:, 1:].T, a1t)  # size is (25, 401)
# delta3 +=

delta2 /= self.length
delta1 /= self.length

delta1[:, 1:] = delta1[:, 1:] + (self.theta1[:, 1:] * lr) / self.length
delta2[:, 1:] = delta2[:, 1:] + (self.theta2[:, 1:] * lr) / self.length

# unravel the gradient matrices into a single array

```

6. The library function is used to optimize the gradient descent process. As before, the minimize function of scipy optimize is used to input the key loss function, gradient derivative function and coefficient to be optimized.

Because training neural networks, especially large-scale deep learning networks, is a complex and long process. After training, I use pickle library function to pack and store the results, so as to avoid repeated training again and again!

```    def fminc(self, params, lr, fp):
fmin = opt.minimize(fun=self.nnCostfunction, args=lr, x0=params, jac=self.backPropagation, method='TNC',
options={'maxiter': 250})
print(fmin)
with open(fp, 'wb') as fw:
pickle.dump(fmin, fw)
self.showHiddenlayers(fmin.x)
self.predict(fmin.x)
```

7. Forecast results.
It's as like as two peas, ch06, but we trained ourselves this time.

```    def predict(self, params):
self.reshapeparams(params)
bias = np.ones((self.length, 1))
x = np.hstack((bias, self.X))
temp1 = sigmoid(np.dot(x, self.theta1.T))
temp1 = np.hstack((bias, temp1))
print("accuracy is %f" % (np.sum(results == self.Y) / self.length))

# al last, we can visualize the result~
# at last, we can just visualize the prediction
show_index = np.random.randint(0, self.length, 9)
for i in range(9):
plt.subplot(3, 3, i + 1)
plt.axis(False)
plt.imshow(np.transpose(self.X[show_index[i], :].reshape(20, 20)))
# because when we read file, real label subtracts 1, thus, we add 1 to corresponds to the image
# don't forget that the label of 0 is 10!!!
# by the way, we can learn the ternary operator in Python, it's just a little bit different than C++ or C
l1 = self.Y[show_index[i]] + 1 if self.Y[show_index[i]] + 1 < 10 else 0
l2 = results[show_index[i]] + 1 if results[show_index[i]] + 1 < 10 else 0
plt.title("label:%d   pred:%d" % (l1, l2))
plt.show()
```

8. Some small exercises that are not particularly important.
Gradient check is to use the limit method to find the gradient of a point
[J( θ+ epsilon) - J( θ- Epsilon)] / 2epsilon approximates the gradient of a point.
I can't use a small network to test the gradient because the whole network setting is not as perfect as that given by Mr. Wu. Therefore, the whole inspection process is very slow. I have commented it out, but there are still some results! Set the maximum training rounds of options (we have shown how to set them in the code, and set them in the way of dictionary through options).

9. Result display

Amazing accuracy, prodigious! Forecast results: Posted on Wed, 10 Nov 2021 14:51:02 -0500 by rune_sm